The Headache-Inducing IBM Models

Recently my work urgently required an understanding of the principles behind GIZA and fast_align, so I set out to catch up on these headache-inducing algorithms all at once. The original plan was to work through IBM Model 1 to Model 5 plus the HMM alignment model, but I got stuck at Model 3 and will continue that part in a later post. This article therefore focuses on Model 1 and Model 2.

References:
https://wenku.baidu.com/view/4cda374769eae009581becd2.html
https://www.cl.uni-heidelberg.de/courses/ss15/smt/scribe2.pdf

Word Alignment with the IBM Models

For any English sentence e and French sentence f, define the probability of translating f into e as Pr(e|f), with the normalization condition $\sum_{e} \operatorname{Pr}(e | f)=1$. Translating f into e then becomes the problem of solving $\hat{e}=\operatorname{argmax}_{e} \operatorname{Pr}(e | f)$. This can be understood as a noisy channel model: suppose the source-language text F we observe was produced by passing a target-language text E through some strange encoder; the goal of translation is then to recover E from F, i.e. a decoding process, as illustrated below:

[Figure: noisy channel view of translation]

As a formula: $\mathrm{E}=\arg \max _{\mathrm{E}} P(\mathrm{E}) P(\mathrm{F} | \mathrm{E})$. Here P(E) is the language model, which reflects how much "E looks like a sentence", i.e. fluency; P(F|E) is the translation model, which reflects how much "F looks like E", i.e. faithfulness. Using the two models jointly works better than using the translation model alone, since the latter tends to produce poor translations. Statistical machine translation therefore has to solve three problems:

  • Modeling and parameter estimation of the language model P(E)
  • Modeling and parameter estimation of the translation model P(F|E)
  • The decoding (search) algorithm

The language model assigns a probability $\operatorname{Pr}\left(E=e_{1} e_{2} \ldots e_{n}\right)$ to any sentence. The N-gram model is the simplest and most common language model; others include the hidden Markov model (HMM) (adding part-of-speech information), probabilistic context-free grammars (PCFG) (adding phrase-structure information), and probabilistic link grammar (adding link-grammar structure). The N-gram language model is:

$\begin{aligned} P(w) &=\prod_{i=1}^{n} P\left(w_{i} | w_{1} w_{2} \ldots w_{i-1}\right) \\ & \approx \prod_{i=1}^{n} P\left(w_{i} | w_{i-N+1} w_{i-N+2} \ldots w_{i-1}\right) \end{aligned}$

This can be visualized with a probability transition diagram; the figure below shows a bigram (2-gram) language model:

[Figure: bigram transition diagram]
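As a concrete illustration of the bigram case, here is a minimal maximum-likelihood bigram model over a toy corpus (a sketch only; the corpus and the unsmoothed relative-frequency estimates are made up for illustration):

from collections import Counter

corpus = [["<s>", "the", "house", "is", "small", "</s>"],
          ["<s>", "the", "house", "is", "big", "</s>"]]

bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(w for s in corpus for w in s[:-1])

def bigram_prob(sentence):
    """P(w) ~= prod_i P(w_i | w_{i-1}), estimated by relative frequency."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(bigram_prob(["<s>", "the", "house", "is", "small", "</s>"]))  # 0.5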

The translation model P(F|E) reflects the probability of generating a sentence F from a sentence E. Since a specific sentence pair will almost never have occurred in the corpus, this probability cannot be estimated directly from corpus counts; it must be decomposed into word translation probabilities and sentence structure (or ordering) probabilities. The translation model therefore introduces word alignment as a hidden variable: $P(\mathrm{F} | \mathrm{E})=\sum_{A} P(\mathrm{F}, \mathrm{A} | \mathrm{E})$

Computing the translation probability P(F|E) thus reduces to estimating the alignment probability P(F,A|E), and the IBM Models are exactly estimates of P(F,A|E). How do IBM Models 1 through 3 differ? IBM Model 1 only considers word-to-word translation probabilities, IBM Model 2 adds the probability of word position changes, and IBM Model 3 adds the possibility of one word translating into several words.

The IBM models are a classic piece of research: the five models were the foundation of word-based statistical machine translation and remain an important step in today's mainstream SMT techniques. As generative models, the IBM models come with their own "rigorous" derivations. Roughly speaking, Models 1 and 2 are built on one decomposition of the joint probability, while Models 3, 4, and 5 are built on another (the fertility-based models). The IBM models are single-word-based models: they only allow one-to-one and one-to-many alignments, never many-to-one, which differs from phrase-based SMT. In terms of model complexity the relation is 1<2<3<4<5, and in terms of training order it is 1->2->3->4->5.

IBM Model-1

References:
https://www.nltk.org/_modules/nltk/translate/ibm1.html
http://mt-class.org/jhu/slides/lecture-ibm-model1.pdf
https://zhuanlan.zhihu.com/p/72160554

The generative story behind IBM Models 1 & 2:

  • Guess the length of the target-language sentence
  • From left to right, for each target-language word:
    • first guess which source-language word it is translated from,
    • then guess which target-language word it should be translated into.

[Figure: the generative story of Models 1 and 2]

IBM Model-1 makes the following assumptions:

  • The target-language sentence is $\mathrm{F}=f_{1}^{m}=f_{1} f_{2} \cdots f_{m}$
  • The source-language sentence is $\mathrm{E}=e_{1}^{l}=e_{1} e_{2} \cdots e_{l}$
  • The word alignment is $\mathrm{A}=a_{1}^{m}=a_{1} a_{2} \cdots a_{m}, \forall j \in\{1, \cdots, m\}, a_{j} \in\{0, \cdots, l\}$
  • The probability of an aligned sentence pair can then be written as: $\operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E})=\operatorname{Pr}(m | \mathrm{E}) \prod_{j=1}^{m} \operatorname{Pr}\left(a_{j} | a_{1}^{j-1}, f_{1}^{j-1}, m, \mathrm{E}\right) \operatorname{Pr}\left(f_{j} | a_{1}^{j}, f_{1}^{j-1}, m, \mathrm{E}\right)$
  • In Model-1 all translation lengths are assumed equally likely: $\operatorname{Pr}(m | \mathrm{E})=\varepsilon$
  • The alignment of each word is assumed to depend only on the source sentence length: $\operatorname{Pr}\left(a_{j} | a_{1}^{j-1}, f_{1}^{j-1}, m, \mathrm{E}\right)=\frac{1}{l+1}$
  • The choice of each target word is assumed to depend only on the source word it is aligned to: $\operatorname{Pr}\left(f_{j} | a_{1}^{j}, f_{1}^{j-1}, m, \mathrm{E}\right)=t\left(f_{j} | e_{a_{j}}\right)$
  • The alignment probability can then be written as: $\operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E})=\frac{\varepsilon}{(l+1)^{m}} \prod_{j=1}^{m} t\left(f_{j} | e_{a_{j}}\right)$
  • Summing over all possible alignments gives the translation probability: $\operatorname{Pr}(\mathrm{F} | \mathrm{E})=\sum_{\mathrm{A}} \operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E})=\frac{\varepsilon}{(l+1)^{m}} \sum_{a_{1}=0}^{l} \cdots \sum_{a_{m}=0}^{l} \prod_{j=1}^{m} t\left(f_{j} | e_{a_{j}}\right)$

This is the IBM Model 1 translation formula: given the parameters t(f|e), we can compute the probability of translating sentence E into sentence F. The translation probability table t(f|e) satisfies the normalization constraint $\sum_{f} t(f | e)=1$.
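To make the sum over alignments concrete, here is a brute-force sketch that enumerates all $(l+1)^m$ alignment vectors and sums the joint probabilities (only feasible for toy sentences; the dictionary layout t[(f, e)] and the use of None as the NULL word are illustrative choices):

from itertools import product

def prob_f_given_e(f_sent, e_sent, t, eps=1.0):
    """Pr(F|E) for IBM Model 1 by enumerating every alignment vector;
    position 0 of the extended source sentence is the NULL word (None)."""
    e_ext = [None] + e_sent
    l, m = len(e_sent), len(f_sent)
    total = 0.0
    for a in product(range(l + 1), repeat=m):   # all (l+1)^m alignments
        p = 1.0
        for j, i in enumerate(a):
            p *= t[(f_sent[j], e_ext[i])]       # t(f_j | e_{a_j})
        total += p
    return eps / (l + 1) ** m * total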

As an aside, IBM Model 2 adds the following assumption:

  • The alignment of each word depends only on the source length, the target length, and the two positions involved: $\operatorname{Pr}\left(a_{j} | a_{1}^{j-1}, f_{1}^{j-1}, m, \mathrm{E}\right)=a\left(a_{j} | j, m, l\right)$, with the normalization condition $\sum_{i=0}^{l} a(i | j, m, l)=1$

Defining the translation probability

For a foreign sentence $f=(f_1,\ldots,f_{l_f})$ of length $l_f$ and an English sentence $e=(e_1,\ldots,e_{l_e})$ of length $l_e$, with a word alignment $a: j \rightarrow i$ mapping each English word $e_j$ to a foreign word $f_i$, the translation probability is defined as:

$p(\mathbf{e}, a | \mathbf{f})=\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} t\left(e_{j} | f_{a(j)}\right)$, for example:

[Figure: example word alignment]

Parameter estimation

By maximum likelihood estimation, we want the set of probability distributions under which the training corpus is most likely. That is, given the training corpus E and F, we look for a distribution t(f|e) that maximizes the translation probability Pr(F|E). This is a constrained optimization problem, the constraint being the normalization of t(f|e). To solve it we introduce Lagrange multipliers and construct an auxiliary function, turning the constrained problem into an unconstrained one.

Introducing Lagrange multipliers $\lambda_{e}$, the auxiliary function is $h(t, \lambda) \equiv \frac{\varepsilon}{(l+1)^{m}} \sum_{a_{1}=0}^{l} \cdots \sum_{a_{m}=0}^{l} \prod_{j=1}^{m} t\left(f_{j} | e_{a_{j}}\right)-\sum_{e} \lambda_{e}\left(\sum_{f} t(f | e)-1\right)$. Taking the derivative with respect to t(f|e) gives: $\frac{\partial h(t, \lambda)}{\partial t(f | e)}=\frac{\varepsilon}{(l+1)^{m}} \sum_{a_{1}=0}^{l} \cdots \sum_{a_{m}=0}^{l} \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right) \frac{\prod_{k=1}^{m} t\left(f_{k} | e_{a_{k}}\right)}{t(f | e)}-\lambda_{e}$.

Setting this to zero yields: $t(f | e)=\lambda_{e}^{-1} \frac{\varepsilon}{(l+1)^{m}} \sum_{a_{1}=0}^{l} \cdots \sum_{a_{m}=0}^{l} \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right) \prod_{k=1}^{m} t\left(f_{k} | e_{a_{k}}\right)$. Note that t(f|e) appears on both sides, so we cannot compute t(f|e) directly from the corpus (F|E). Instead we can read this as a fixed-point iteration: start from some initial t(f|e) and apply the formula repeatedly until it converges to a stable value, which is exactly the EM algorithm. Here $\sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right)$ counts how many times e is linked to f in the alignment A.

Define the expected number of times e is linked to f over all possible alignments A of E and F as $c(f | e ; \mathrm{F}, \mathrm{E}) \equiv \sum_{\mathrm{A}} \operatorname{Pr}(\mathrm{A} | \mathrm{F}, \mathrm{E}) \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right)$, which can be rewritten as $c(f | e ; \mathrm{F}, \mathrm{E})=\sum_{\mathrm{A}} \frac{\operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E})}{\operatorname{Pr}(\mathrm{F} | \mathrm{E})} \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right)=\frac{\sum_{A} \operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E}) \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right)}{\operatorname{Pr}(\mathrm{F} | \mathrm{E})}$.

Substituting c(f|e;F,E) into the iteration formula and folding Pr(F|E) into the multiplier λe, we obtain the new update: $t(f | e)=\lambda_{e}^{-1} c(f | e ; \mathrm{F}, \mathrm{E})$

This update can be read as follows:

  • Given a set of parameters t(f|e), we can compute the probability Pr(F,A|E) of every word alignment.
  • Given the probability Pr(F,A|E) of every word alignment, we can compute a new value of t(f|e): sum the alignment probabilities over all alignments containing the link (e,f), then normalize over e.

This is the central idea of the EM algorithm.

EM iteration

We can solve for the translation table t by iterating EM. The procedure is as follows (a compact Python sketch follows the list):

  • Initialize $t(e|f)$

  • E-step: probability of alignments

    • Compute $p(a | \mathbf{e}, \mathbf{f})=\frac{p(\mathbf{e}, a | \mathbf{f})}{p(\mathbf{e} | \mathbf{f})}$ (using the formula above for $p(\mathbf{e}, a|\mathbf{f})$)

      • Compute $p(\mathbf{e}|\mathbf{f})$

        $\begin{aligned} p(\mathbf{e} | \mathbf{f}) &=\sum_{a} p(\mathbf{e}, a | \mathbf{f}) \\ &=\sum_{a(1)=0}^{l_{f}} \ldots \sum_{a\left(l_{e}\right)=0}^{l_{f}} p(\mathbf{e}, a | \mathbf{f}) \\ &=\sum_{a(1)=0}^{l_{f}} \ldots \sum_{a\left(l_{e}\right)=0}^{l_{f}} \frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} t\left(e_{j} | f_{a(j)}\right) \\ &=\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \sum_{a(1)=0}^{l_{f}} \ldots \sum_{a\left(l_{e}\right)=0}^{l_{f}} \prod_{j=1}^{l_{e}} t\left(e_{j} | f_{a(j)}\right) \\ &=\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} \sum_{i=0}^{l_{f}} t\left(e_{j} | f_{i}\right) \end{aligned}$

      A worked example of this computation:
      [Figure]

    • Compute $p(a|\mathbf{e},\mathbf{f})$

      $\begin{aligned} p(a | \mathbf{e}, \mathbf{f}) &=p(\mathbf{e}, a | \mathbf{f}) / p(\mathbf{e} | \mathbf{f}) \\ &=\frac{\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} t\left(e_{j} | f_{a(j)}\right)}{\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} \sum_{i=0}^{l_{f}} t\left(e_{j} | f_{i}\right)} \\ &=\prod_{j=1}^{l_{e}} \frac{t\left(e_{j} | f_{a(j)}\right)}{\sum_{i=0}^{l_{f}} t\left(e_{j} | f_{i}\right)} \end{aligned}$

  • M-step: count collection

    • Collect counts: $c(e | f ; \mathbf{e}, \mathbf{f})=\sum_{a} p(a | \mathbf{e}, \mathbf{f}) \sum_{j=1}^{l_{e}} \delta\left(e, e_{j}\right) \delta\left(f, f_{a(j)}\right)$, which simplifies to $c(e | f ; \mathbf{e}, \mathbf{f})=\frac{t(e | f)}{\sum_{i=0}^{l_{f}} t\left(e | f_{i}\right)} \sum_{j=1}^{l_{e}} \delta\left(e, e_{j}\right) \sum_{i=0}^{l_{f}} \delta\left(f, f_{i}\right)$
    • Re-estimate the model: $t(e | f ; \mathbf{e}, \mathbf{f})=\frac{\sum_{(\mathbf{e}, \mathbf{f})} c(e | f ; \mathbf{e}, \mathbf{f})}{\sum_{e} \sum_{(\mathbf{e}, \mathbf{f})} c(e | f ; \mathbf{e}, \mathbf{f})}$
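Putting the E-step and M-step together, here is a minimal sketch of the whole EM loop over plain Python dictionaries (it ignores the NULL word and numerical safeguards; the table direction t[(f, e)] ~ t(f|e) matches the worked example below, and the name train_ibm1 is made up for illustration):

from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """corpus: list of (f_sent, e_sent) pairs; returns t[(f, e)] ~ t(f|e)."""
    f_vocab = {f for f_sent, _ in corpus for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f|e)
        total = defaultdict(float)                # normalizer per source word e
        for f_sent, e_sent in corpus:
            # E step: denominator sum_i t(f_j | e_i) for every target word f_j
            denom = {f: sum(t[(f, e)] for e in e_sent) for f in f_sent}
            for f in f_sent:
                for e in e_sent:
                    c = t[(f, e)] / denom[f]      # posterior link probability
                    count[(f, e)] += c
                    total[e] += c
        # M step: t(f|e) = c(f|e) / sum_f' c(f'|e)
        for f, e in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

On the two-sentence toy corpus of the worked example below, the first iteration of this sketch gives the same table as the hand computation; later iterations differ slightly, because the hand computation only keeps the one-to-one alignments it draws, while this sketch implicitly sums over all alignments.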

Worked example

  • Training sentences:
    • sentence1: the house ||| la maison
    • sentence2: house ||| maison
  • Enumerate the candidate alignments (only one-to-one links are drawn here)
    • sentence1:
      • a1: the->la, house->maison
      • a2: the->maison, house->la
    • sentence2:
      • a3: house->maison
  • Initialization
    • source_side_vocabulary: {the, house}, size=2
    • t(la|the) = 1/size = 1/2
    • t(maison|the) = 1/size = 1/2
    • t(la|house) = 1/size = 1/2
    • t(maison|house) = 1/size = 1/2
  • First iteration
    • Expectation:
      • alignment probabilities
        • p(e,a1|f) = t(la|the) * t(maison|house) = 1/2 * 1/2 = 1/4
        • p(e,a2|f) = t(maison|the) * t(la|house) = 1/2 * 1/2 = 1/4
        • p(e,a3|f) = t(maison|house) = 1/2
      • normalize alignment probabilities
        • p(a1|e,f) = p(e,a1|f)/sum(p(e,a|f)) = p(e,a1|f)/(p(e,a1|f)+p(e,a2|f)) = (1/4)/(1/4+1/4) = 1/2
        • p(a2|e,f) = p(e,a2|f)/(p(e,a1|f)+p(e,a2|f)) = (1/4)/(1/4+1/4) = 1/2
        • p(a3|e,f) = p(e,a3|f)/p(e,a3|f) = 1
    • Maximization:
      • collect counts
        • c(la|the) = p(a1|e,f) * count(la|the) = 1/2 * 1 = 1/2
        • c(maison|the) = p(a2|e,f) * count(maison|the) = 1/2 * 1 = 1/2
        • c(la|house) = p(a2|e,f) * count(la|house) = 1/2 * 1 = 1/2
        • c(maison|house) = p(a1|e,f) * count(maison|house) + p(a3|e,f) * count(maison|house) = 1/2 * 1 + 1 * 1 = 3/2
      • normalize
        • t(la|the) = c(la|the)/sum(c(*|the)) = c(la|the)/(c(la|the) + c(maison|the)) = (1/2)/(1/2 + 1/2) = 1/2
        • t(maison|the) = c(maison|the)/sum(c(*|the)) = c(maison|the)/(c(la|the) + c(maison|the)) = (1/2)/(1/2 + 1/2) = 1/2
        • t(la|house) = (1/2)/(1/2 + 3/2) = 1/4
        • t(maison|house) = (3/2)/(1/2 + 3/2) = 3/4
  • Second iteration
    • Expectation:
      • alignment probabilities
        • p(e,a1|f) = 1/2 * 3/4 = 3/8
        • p(e,a2|f) = 1/2 * 1/4 = 1/8
        • p(e,a3|f) = 3/4
      • normalize alignment probabilities
        • p(a1|e,f) = (3/8)/(3/8 + 1/8) = 3/4
        • p(a2|e,f) = (1/8)/(3/8 + 1/8) = 1/4
        • p(a3|e,f) = (3/4)/(3/4) = 1
    • Maximization:
      • collect counts
        • c(la|the) = 3/4 * 1 = 3/4
        • c(maison|the) = 1/4 * 1 = 1/4
        • c(la|house) = 1/4 * 1 = 1/4
        • c(maison|house) = 3/4 * 1 + 1 * 1 = 7/4
      • normalize (the sketch after this list reproduces these numbers)
        • t(la|the) = (3/4)/(3/4 + 1/4) = 3/4
        • t(maison|the) = (1/4)/(3/4 + 1/4) = 1/4
        • t(la|house) = (1/4)/(1/4 + 7/4) = 1/8
        • t(maison|house) = (7/4)/(1/4 + 7/4) = 7/8
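The hand computation above can be checked mechanically. The sketch below runs EM over exactly the one-to-one candidate alignments drawn above (a1, a2, a3), so it reproduces both iterations; summing over all $(l+1)^m$ alignments, as the full model does, would give slightly different values from the second iteration on:

t = {(f, e): 0.5 for f in ("la", "maison") for e in ("the", "house")}

# candidate alignments as lists of (target word, source word) links
alignments = [
    [[("la", "the"), ("maison", "house")],    # a1
     [("la", "house"), ("maison", "the")]],   # a2
    [[("maison", "house")]],                  # a3 (second sentence pair)
]

def joint(a):
    """p(f, a | e) up to the constant factor epsilon / (l+1)^m."""
    p = 1.0
    for f, e in a:
        p *= t[(f, e)]
    return p

for _ in range(2):
    count = {k: 0.0 for k in t}
    total = {"the": 0.0, "house": 0.0}
    for candidates in alignments:
        z = sum(joint(a) for a in candidates)
        for a in candidates:
            posterior = joint(a) / z              # p(a | e, f)
            for f, e in a:
                count[(f, e)] += posterior
                total[e] += posterior
    for f, e in t:                                # M step: renormalize
        t[(f, e)] = count[(f, e)] / total[e]

print(t)  # t(la|the)=0.75, t(maison|the)=0.25, t(la|house)=0.125, t(maison|house)=0.875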

NLTK source code walkthrough

  • Overall code
counts = Counts()
for aligned_sentence in parallel_corpus:
    trg_sentence = aligned_sentence.words
    src_sentence = [None] + aligned_sentence.mots

    # E step (a): Compute normalization factors to weigh counts
    total_count = self.prob_all_alignments(src_sentence, trg_sentence)

    # E step (b): Collect counts
    for t in trg_sentence:
        for s in src_sentence:
            count = self.translation_table[t][s]
            normalized_count = count / total_count[t]
            counts.t_given_s[t][s] += normalized_count
            counts.any_t_given_s[s] += normalized_count

# M step: Update probabilities with maximum likelihood estimate
self.maximize_lexical_translation_probabilities(counts)
  • Computing the normalization factor $\sum_{a} p(\mathbf{e}, a | \mathbf{f})$
def prob_all_alignments(self, src_sentence, trg_sentence):
    alignment_prob_for_t = defaultdict(lambda: 0.0)
    for t in trg_sentence:
        for s in src_sentence:
            alignment_prob_for_t[t] += self.translation_table[t][s]
    return alignment_prob_for_t
  • The M-step
def maximize_lexical_translation_probabilities(self, counts):
    for t, src_words in counts.t_given_s.items():
        for s in src_words:
            estimate = counts.t_given_s[t][s] / counts.any_t_given_s[s]
            self.translation_table[t][s] = max(estimate, IBMModel.MIN_PROB)
  • Computing the alignment for a given sentence pair (an end-to-end usage example follows the code)
best_alignment = []
for j, trg_word in enumerate(sentence_pair.words):
    best_prob = max(self.translation_table[trg_word][None], IBMModel.MIN_PROB)
    best_alignment_point = None
    for i, src_word in enumerate(sentence_pair.mots):
        align_prob = self.translation_table[trg_word][src_word]
        if align_prob >= best_prob:  # prefer newer word in case of tie
            best_prob = align_prob
            best_alignment_point = i

    best_alignment.append((j, best_alignment_point))
sentence_pair.alignment = Alignment(best_alignment)
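For reference, the classes discussed above can be exercised end to end through NLTK's public API (a small usage sketch, assuming NLTK is installed; the exact probability values depend on the number of iterations):

from nltk.translate import AlignedSent, IBMModel1

bitext = [AlignedSent(["la", "maison"], ["the", "house"]),
          AlignedSent(["maison"], ["house"])]
ibm1 = IBMModel1(bitext, 5)                       # 5 EM iterations

print(ibm1.translation_table["maison"]["house"])  # should be large after training
print(bitext[0].alignment)                        # alignment found for sentence 1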

IBM Model-2

References:
https://www.nltk.org/_modules/nltk/translate/ibm2.html
http://www.ai.mit.edu/courses/6.891-nlp/l11.pdf

One problem with IBM Model 1 is that its reordering ability is very weak, because $p(f,a|s)$ is computed from the lexical translation probabilities $t(t|s)$ alone. If the model has two candidate translations $t_1$ and $t_2$ with the same lexical translations but different word orders, it assigns them the same score. IBM Model 2 introduces an alignment probability model $p(i|j,s,f)$ for positions i and j and uses it to compute $\operatorname{Pr}(t, a | s)=\frac{\epsilon}{(J+1)^{I}} \prod_{j=1}^{J} \operatorname{tr}\left(t_{j} | s_{a(j)}\right) \operatorname{Pr}_{a}(a(j) | j, J, I)$, where $\operatorname{Pr}_{a}(a(j) | j, J, I)$ models the probability that a word at position i in the source sentence is reordered to position j in the target sentence.

The problem is to compute P(f|e), where $e=(e_1,\ldots,e_l)$, $f=(f_1,\ldots,f_m)$, an alignment is $(a_1, \ldots, a_m)$ with $a_j\in \{0,\ldots,l\}$, and there are $(l+1)^m$ alignments in total. IBM Model 1 treats all alignments as equally probable, i.e. $P(\mathbf{a} | \mathbf{e})=C \times \frac{1}{(l+1)^{m}}$, where C is the constant $C=\operatorname{prob}(\operatorname{length}(\mathbf{f})=m)$. The generative process of IBM Model 1 is:

  • Choose the length m of f (with constant probability C)
  • Generate an alignment a with uniform probability $1/(l+1)^m$
  • Generate the target words with probability $P(\mathbf{f} | \mathbf{a}, \mathbf{e})=\prod_{j=1}^{m} \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$
  • Overall: $P(\mathbf{f}, \mathbf{a} | \mathbf{e})=P(\mathbf{a} | \mathbf{e}) P(\mathbf{f} | \mathbf{a}, \mathbf{e})=\frac{C}{(l+1)^{m}} \prod_{j=1}^{m} \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$

Defining the translation probability

IBM Model-2 defines D(i|j,l,m) = the probability that, given source length l and target length m, the j-th target word is aligned to the i-th source word. The alignment probability is then $P\left(\mathbf{a}=\left(a_{1}, \ldots, a_{m}\right) | \mathbf{e}, l, m\right)=\prod_{j=1}^{m} \mathbf{D}\left(a_{j} | j, l, m\right)$, and the translation probability is $P(\mathbf{f}, \mathbf{a} | \mathbf{e}, l, m)=\prod_{j=1}^{m} \mathbf{D}\left(a_{j} | j, l, m\right) \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$. IBM Model 1 is thus a special case of Model 2 with $\mathrm{D}(i | j, l, m)=\frac{1}{l+1}$.

The generative process of IBM Model 2 is therefore (a one-line sketch of the resulting probability follows the list):

  • Choose the length m of f (with constant probability C)
  • Generate an alignment $a=(a_1,\ldots,a_m)$ with probability $\prod_{j=1}^{m} \mathrm{D}\left(a_{j} | j, l, m\right)$
  • Generate the target words with probability $P(\mathbf{f} | \mathbf{a}, \mathbf{e})=\prod_{j=1}^{m} \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$
  • Overall: $P(\mathbf{f}, \mathbf{a} | \mathbf{e})=P(\mathbf{a} | \mathbf{e}) P(\mathbf{f} | \mathbf{a}, \mathbf{e})=C \prod_{j=1}^{m} \mathbf{D}\left(a_{j} | j, l, m\right) \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$
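Instantiated directly, the Model 2 joint probability is a one-liner (a sketch under the same dictionary conventions as the Model 1 sketches above; the alignment here is 0-based and contains no NULL word):

def prob_f_a_given_e(f_sent, a, e_sent, T, D, C=1.0):
    """P(f, a | e) for IBM Model 2: prod_j D(a_j | j, l, m) * T(f_j | e_{a_j})."""
    l, m = len(e_sent), len(f_sent)
    p = C
    for j, i in enumerate(a):
        p *= D[(i, j, l, m)] * T[(f_sent[j], e_sent[i])]
    return p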

Parameter estimation

Writing the Model 2 translation model as $\operatorname{Pr}(\mathrm{F} | \mathrm{E})=\varepsilon \prod_{j=1}^{m} \sum_{i=0}^{l} t\left(f_{j} | e_{i}\right) a\left(i | j, m, l\right)$ and again introducing Lagrange multipliers, the derivation gives:

$t(f | e)=\lambda_{e}^{-1} c(f | e ; \mathrm{F}, \mathrm{E})$

$a(i | j, m, l)=\mu_{j m l}^{-1} c(i | j, m, l ; \mathrm{F}, \mathrm{E})$

$c(f | e ; \mathrm{F}, \mathrm{E})=\sum_{j=1}^{m} \sum_{i=0}^{l} \frac{t(f | e) a(i | j, m, l) \delta\left(f, f_{j}\right) \delta\left(e, e_{i}\right)}{t\left(f | e_{0}\right) a(0 | j, m, l)+\cdots+t\left(f | e_{l}\right) a(l | j, m, l)}$

$c(i | j, m, l ; \mathrm{F}, \mathrm{E})=\frac{t\left(f_{j} | e_{i}\right) a(i | j, m, l)}{t\left(f_{j} | e_{0}\right) a(0 | j, m, l)+\cdots+t\left(f_{j} | e_{l}\right) a(l | j, m, l)}$

Since the training corpus (F|E) consists of a sequence of sentence pairs $\left(\mathrm{F}^{(1)}, \mathrm{E}^{(1)}\right),\left(\mathrm{F}^{(2)}, \mathrm{E}^{(2)}\right), \cdots,\left(\mathrm{F}^{(S)}, \mathrm{E}^{(S)}\right)$,

in practice we use the following formulas:

$t(f | e)=\lambda_{e}^{-1} \sum_{s} c\left(f | e ; \mathrm{F}^{(\mathrm{s})}, \mathrm{E}^{(\mathrm{s})}\right)$

$a(i | j, m, l)=\mu_{j m l}^{-1} \sum_{s} c\left(i | j, m, l ; \mathrm{F}^{(\mathrm{s})}, \mathrm{E}^{(\mathrm{s})}\right)$

Here $\lambda_{e}$ and $\mu_{jml}$ act purely as normalization factors.

EM iteration

  • Initialize $\mathrm{T}(f | e)$ and $\mathrm{D}(i | j, l, m)$
  • Compute the alignment posteriors $a[i, j, k]=\frac{\mathrm{D}\left(a_{j}=i | j, l, m\right) \mathrm{T}\left(f_{j} | e_{i}\right)}{\sum_{i^{\prime}=0}^{l} \mathrm{D}\left(a_{j}=i^{\prime} | j, l, m\right) \mathrm{T}\left(f_{j} | e_{i^{\prime}}\right)}$ (k indexes the sentence pair)
  • E-step:
    • Expected number of times e and f are aligned: $\text{tcount}(e, f)=\sum_{k}\ \sum_{i, j:\ e^{(k)}_{i}=e,\ f^{(k)}_{j}=f} a[i, j, k]$
    • Expected number of times e is aligned to any target word: $\text{scount}(e)=\sum_{k}\ \sum_{i:\ e^{(k)}_{i}=e}\ \sum_{j=1}^{m[k]} a[i, j, k]$
    • Expected number of times, over sentence pairs with source length l and target length m, that $e_i$ and $f_j$ are aligned: $\text{acount}(i, j, l, m)=\sum_{k:\ l[k]=l,\ m[k]=m} a[i, j, k]$
    • Number of sentence pairs with source length l and target length m: $\text{acount}(j, l, m)=|\{k: l[k]=l, m[k]=m\}|$
  • M-step (a Python sketch of the full loop follows this list):
    • Re-estimate the translation probability T(f|e): $\mathrm{T}(f | e)=\frac{\text{tcount}(e, f)}{\text{scount}(e)}$
    • Re-estimate the alignment probability D(i|j,l,m): $\mathrm{D}\left(i | j, l, m\right)=\frac{\text{acount}(i, j, l, m)}{\text{acount}(j, l, m)}$
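As a minimal sketch of this loop (no NULL word, plain dictionaries, the name train_ibm2 made up for illustration), with the lexical table t typically seeded from a Model 1 run such as the train_ibm1 sketch earlier:

from collections import defaultdict

def train_ibm2(corpus, t, iterations=5):
    """corpus: list of (f_sent, e_sent) pairs; t[(f, e)] ~ t(f|e);
    returns (t, d) with d[(i, j, l, m)] ~ D(i|j,l,m)."""
    d = defaultdict(float)
    for f_sent, e_sent in corpus:                  # uniform initialization of D
        l, m = len(e_sent), len(f_sent)
        for j in range(m):
            for i in range(l):
                d[(i, j, l, m)] = 1.0 / l
    for _ in range(iterations):
        tc, tt = defaultdict(float), defaultdict(float)   # lexical counts
        ac, at = defaultdict(float), defaultdict(float)   # alignment counts
        for f_sent, e_sent in corpus:
            l, m = len(e_sent), len(f_sent)
            for j, f in enumerate(f_sent):
                denom = sum(t[(f, e_sent[i])] * d[(i, j, l, m)] for i in range(l))
                for i, e in enumerate(e_sent):
                    c = t[(f, e)] * d[(i, j, l, m)] / denom   # E step posterior
                    tc[(f, e)] += c
                    tt[e] += c
                    ac[(i, j, l, m)] += c
                    at[(j, l, m)] += c
        for f, e in tc:                                       # M step
            t[(f, e)] = tc[(f, e)] / tt[e]
        for i, j, l, m in ac:
            d[(i, j, l, m)] = ac[(i, j, l, m)] / at[(j, l, m)]
    return t, d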

Worked example

  • Training sentences:
    • sentence1: the dog ||| le chien
    • sentence2: the cat ||| le chat
    • sentence3: the bus ||| l' autobus
  • Initialization
    • source_side_vocabulary: {the, dog, cat, bus}, size=4
    • random initialization of
      • T(f|e)

[Figure: initial T(f|e) table]

      • D(i|j,l,m)

[Figure: initial D(i|j,l,m) table]

  • E-step:
    • compute tcount(e,f)

[Figure: tcount table]

For example, tcount(the, le) = a(1, 1, 0) + a(1, 1, 1) = 0.5264 + 0.4665 = 0.9929

  • compute scount(e)
  • compute acount(i,j,l,m)
  • compute acount(j,l,m)
  • M-step:
    • re-estimate T(f|e) (the figure below shows its evolution over several iterations)

[Figure: T(f|e) after several iterations]

  • re-estimate D(i|j,l,m)
  • The IBM paper suggests estimating T(f|e) with Model 1 and using it to initialize Model 2

NLTK source code walkthrough

  • Overall code
    • Initialization
if probability_tables is None:
    ibm1 = IBMModel1(sentence_aligned_corpus, 2 * iterations)
    self.translation_table = ibm1.translation_table
    # a(i | j,l,m) = 1 / (l+1) for all i, j, l, m
    l_m_combinations = set()
    for aligned_sentence in sentence_aligned_corpus:
        l = len(aligned_sentence.mots)
        m = len(aligned_sentence.words)
        if (l, m) not in l_m_combinations:
            l_m_combinations.add((l, m))
            initial_prob = 1 / (l + 1)
            for i in range(0, l + 1):
                for j in range(1, m + 1):
                    self.alignment_table[i][j][l][m] = initial_prob
else:
    self.translation_table = probability_tables["translation_table"]
    # D(i|j,l,m)
    self.alignment_table = probability_tables["alignment_table"]

for n in range(0, iterations):
    self.train(sentence_aligned_corpus)
self.align_all(sentence_aligned_corpus)
  • Training
def train(self, parallel_corpus):
    counts = Model2Counts()
    for aligned_sentence in parallel_corpus:
        src_sentence = [None] + aligned_sentence.mots
        trg_sentence = ["UNUSED"] + aligned_sentence.words  # 1-indexed
        l = len(aligned_sentence.mots)
        m = len(aligned_sentence.words)

        # E step (a): Compute normalization factors to weigh counts
        alignment_prob_for_t = defaultdict(lambda: 0.0)
        for j in range(1, len(trg_sentence)):
            t = trg_sentence[j]
            for i in range(0, len(src_sentence)):
                alignment_prob_for_t[t] += self.prob_alignment_point(
                    i, j, src_sentence, trg_sentence
                )
        total_count = alignment_prob_for_t

        # E step (b): Collect counts
        for j in range(1, m + 1):
            t = trg_sentence[j]
            for i in range(0, l + 1):
                s = src_sentence[i]
                count = self.translation_table[t][s] * self.alignment_table[i][j][l][m]
                normalized_count = count / total_count[t]
                counts.t_given_s[t][s] += normalized_count
                counts.any_t_given_s[s] += normalized_count
                counts.alignment[i][j][l][m] += normalized_count
                counts.alignment_for_any_i[j][l][m] += normalized_count

    # M step: Update probabilities with maximum likelihood estimates
    self.maximize_lexical_translation_probabilities(counts)
    self.maximize_alignment_probabilities(counts)
  • The M-step
def maximize_alignment_probabilities(self, counts):
    MIN_PROB = IBMModel.MIN_PROB
    for i, j_s in counts.alignment.items():
        for j, src_sentence_lengths in j_s.items():
            for l, trg_sentence_lengths in src_sentence_lengths.items():
                for m in trg_sentence_lengths:
                    estimate = (
                        counts.alignment[i][j][l][m]
                        / counts.alignment_for_any_i[j][l][m]
                    )
                    self.alignment_table[i][j][l][m] = max(estimate, MIN_PROB)
  • Computing the alignment for a given sentence pair: $a_{j}^{*}=\operatorname{argmax}_{i}\left(\mathrm{T}\left(f_{j} | e_{i}\right) \mathrm{D}(i | j, l, m)\right)$ (a usage example follows the code)
def align(self, sentence_pair):
    best_alignment = []
    l = len(sentence_pair.mots)
    m = len(sentence_pair.words)
    for j, trg_word in enumerate(sentence_pair.words):
        # Initialize trg_word to align with the NULL token
        best_prob = (
            self.translation_table[trg_word][None]
            * self.alignment_table[0][j + 1][l][m]
        )
        best_prob = max(best_prob, IBMModel.MIN_PROB)
        best_alignment_point = None
        for i, src_word in enumerate(sentence_pair.mots):
            align_prob = (
                self.translation_table[trg_word][src_word]
                * self.alignment_table[i + 1][j + 1][l][m]
            )
            if align_prob >= best_prob:
                best_prob = align_prob
                best_alignment_point = i

        best_alignment.append((j, best_alignment_point))

    sentence_pair.alignment = Alignment(best_alignment)
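As with Model 1, the public API can be exercised directly (a small usage sketch, assuming NLTK is installed; Model 1 initialization happens inside the constructor):

from nltk.translate import AlignedSent, IBMModel2

bitext = [AlignedSent(["le", "chien"], ["the", "dog"]),
          AlignedSent(["le", "chat"], ["the", "cat"]),
          AlignedSent(["l'", "autobus"], ["the", "bus"])]
ibm2 = IBMModel2(bitext, 5)

print(ibm2.translation_table["chien"]["dog"])
print(ibm2.alignment_table[1][1][2][2])     # D(i=1 | j=1, l=2, m=2)
print(bitext[0].alignment)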

IBM Model-3

References:
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1162/handouts/cs224n-lecture3-MT.pdf
http://www.ai.mit.edu/courses/6.891-nlp/l11.pdf
http://www.ai.mit.edu/courses/6.891-nlp/l12.pdf
https://www.nltk.org/_modules/nltk/translate/ibm3.html
http://www1.maths.lth.se/matematiklth/vision/publdb/reports/pdf/schoenemann-ccnll-10.pdf

From Model 3 to Model 5, the translation model is structured as follows:

[Figure: fertility-based translation process]

The steps are:

  • First, according to each source word's fertility probability, decide how many target words it translates into
  • Copy each source word the corresponding number of times
  • Translate each copy into a target word according to the translation probability
  • Reorder the resulting target words according to the distortion probability to obtain the target sentence

For Model 3, the generative story is:

  • For each English word e in the sentence, choose a fertility φ with probability n(φ|e)
  • Sum the fertilities of all words to obtain m-prime
  • Build a new English word string as follows: delete words with fertility 0, copy words with fertility 1 once, words with fertility 2 twice, and so on
  • After each of these m-prime words, decide whether to insert the empty word NULL, with probabilities p1 (insert) and p0 (do not insert)
  • Let φ0 be the number of inserted NULL words
  • Let m be the current total number of words: m-prime + φ0
  • Replace each word e by a foreign word f according to the probability table t(f|e)
  • For each foreign word not produced by NULL, assign a position according to the probability table d(j|i,l,m), where j is the position of the French word in the French string, i is the position in the English sentence of the English word that produced it, l is the length of the English string, and m is the length of the French string
  • If any target position is assigned more than once (holds more than one word), fail
  • Assign positions to the words produced by NULL; these positions must be the empty (unoccupied) ones, and every such assignment is considered equally likely, with probability 1/φ0!
  • Finally, read off the French string; its probability is the product of the probabilities of all the steps above

Defining the translation probability

Model 3 introduces a fertility parameter constraining how many target words a single source word can produce. From Model 3 on, the modeling differs from that of Models 1 and 2. The overall generative process is:

  • For each $e_i$, generate a fertility $\phi_{i}$ (depending only on $e_i$)
  • For each $e_i$, generate $\phi_{i}$ target words (depending only on $e_i$, not on any other context)
  • Choose a position for each target word (depending only on the position of $e_i$ and the sentence lengths)

In more detail:

  • Given an English sentence $e=(e_1,\ldots,e_l)$, we want to model $P(f|e)$
  • Choose $l+1$ fertilities $\{\phi_{0} \ldots \phi_{l}\}$ with probability $P\left(\left\{\phi_{0} \ldots \phi_{l}\right\} | \mathbf{e}\right)$
    • $P\left(\left\{\phi_{0} \ldots \phi_{l}\right\} | \mathbf{e}\right)=P\left(\phi_{0} | \phi_{1} \ldots \phi_{l}\right) \prod_{i=1}^{l} \mathrm{F}\left(\phi_{i} | e_{i}\right)$
      • $F(\phi|e)$ is the probability that e aligns to $\phi$ words, e.g. F(0|the)=0.1, F(1|the)=0.9, F(2|the)=0, ..., F(0|not)=0.01, F(1|not)=0.09, F(2|not)=0.9
      • $P\left(\phi_{0} | \phi_{1} \ldots \phi_{l}\right)$ is the probability of getting $\phi_0$ heads in m flips of a coin whose heads probability is $p_1$: $P\left(\phi_{0} | \phi_{1} \ldots \phi_{l}\right)=\frac{m !}{\left(m-\phi_{0}\right) ! \phi_{0} !} p_{1}^{\phi_{0}}\left(1-p_{1}\right)^{m-\phi_{0}}$
        • Here $m=\sum_{i=1}^{l} \phi_{i}$; intuitively, each of the m words already generated from source words adds one extra NULL alignment with probability $p_1$
  • For each $e_i$, choose positions $\pi_{i,k}\in \{1,\ldots,m\}$ and target words $f_{i,k}$ with probability $\prod_{i=0}^{l} \prod_{k=1}^{\phi_{i}} \mathrm{R}\left(\pi_{i, k} | i, l, m\right) \mathrm{T}\left(f_{i, k} | e_{i}\right)$
    • $R(j|i,l,m)$ is the probability of producing target position j given source length l, target length m and source position i
    • Note that $R(j|i,l,m)$ differs from the earlier $D(i|j,l,m)$, which is the probability of source position i given source length l, target length m and target position j
  • This yields the model:

$\begin{aligned} \phi &=\left\{\phi_{0} \ldots \phi_{l}\right\} \\ \pi &=\left\{\pi_{i, k}: i=0 \ldots l,\ k=1 \ldots \phi_{i}\right\} \\ \mathbf{f} 2 &=\left\{f_{i, k}: i=0 \ldots l,\ k=1 \ldots \phi_{i}\right\} \\ P(\phi, \pi, \mathbf{f} 2 | \mathbf{e}) &= \frac{m !}{\left(m-\phi_{0}\right) ! \phi_{0} !} p_{1}^{\phi_{0}}\left(1-p_{1}\right)^{m-\phi_{0}} \left(\prod_{i=0}^{l} \prod_{k=1}^{\phi_{i}} \mathbf{R}\left(\pi_{i, k} | i, l, m\right) \mathbf{T}\left(f_{i, k} | e_{i}\right)\right) \end{aligned}$

  • Rewritten in terms of the usual alignment variable:

$P(\mathbf{f}, \mathbf{a} | \mathbf{e})= \frac{m !}{\left(m-\phi_{0}\right) ! \phi_{0} !} p_{1}^{\phi_{0}}\left(1-p_{1}\right)^{m-\phi_{0}}\left(\prod_{i=1}^{l} \mathbf{F}\left(\phi_{i} | e_{i}\right) \phi_{i} !\right)\left(\prod_{i=1}^{l} \prod_{k=1}^{\phi_{i}} \mathbf{R}\left(\pi_{i, k} | i, l, m\right)\right)\left(\prod_{i=0}^{l} \prod_{k=1}^{\phi_{i}} \mathbf{T}\left(f_{i, k} | e_{i}\right)\right)$

EM iteration

  • Initialization
    • fertility_table: $F(\phi | e)$
    • translation_table: T(f|e), initialized from the Model 2 result
    • distortion_table: R(j|i,l,m)
    • p1 = 0.5
    • alignment_table: a[i,j,k], initialized from the Model 2 result
  • E-step: for Models 1 and 2 the expected counts can be computed exactly, but for Model 3 this is too expensive, so an approximation is used
    • What is an expected count? Take the sentence pair e = "I do not understand the logic of these people", f = "Je ne comprends pas la logique de ces gens-là" as an example. Under the current model parameters, the expected value of $\phi_{3}$ (the fertility of "not") is $\sum_{\mathbf{a} \in \mathcal{A}} P(\mathbf{a} | \mathbf{f}, \mathbf{e}) \phi_{3}(\mathbf{a})=\sum_{\mathbf{a} \in \mathcal{A}} \frac{P(\mathbf{f}, \mathbf{a} | \mathbf{e})}{\sum_{\mathbf{a}^{\prime} \in \mathcal{A}} P\left(\mathbf{f}, \mathbf{a}^{\prime} | \mathbf{e}\right)} \phi_{3}(\mathbf{a})$, where $\phi_{3}(\mathbf{a})$ denotes the value of $\phi_{3}$ under alignment a
    • Model 3 instead sums over a set of high-probability alignments $\overline{\mathcal{A}}$, computing $\sum_{\mathbf{a} \in \overline{\mathcal{A}}} \frac{P(\mathbf{f}, \mathbf{a} | \mathbf{e})}{\sum_{\mathbf{a}^{\prime} \in \overline{\mathcal{A}}} P\left(\mathbf{f}, \mathbf{a}^{\prime} | \mathbf{e}\right)} \phi_{3}(\mathbf{a})$ to reduce the computational cost
  • M-step: re-estimate the parameter probabilities

Viterbi training

References:
A log-linear model for word alignment: http://nlp.ict.ac.cn/~liuyang/papers/acl05_chn.pdf

The overall idea of Viterbi parameter training:

  • Start from initial parameters
  • Use the current parameters to find the most probable (Viterbi) word alignment
  • Use that most probable alignment to re-estimate the parameters
  • Return to step 2 until convergence

When the parameter update formulas cannot be simplified, Viterbi training is a practical alternative; it usually converges quickly to an acceptable result.

Because IBM Models 1 and 2 have simplified update formulas, the EM iterations do not actually need to enumerate all alignments; the next set of parameters can be computed directly. The translation model formulas of IBM Models 3, 4, and 5 cannot be simplified in this way, so in theory full EM iterations would be required; since enumerating all word alignments is too expensive in practice, Viterbi training is usually used, generating only the best one or few alignments in each E-step.
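A minimal sketch of the idea for a Model-1-style lexical table (hard EM: each E-step keeps only the single best link per target word instead of summing over all alignments; the function name and data layout are made up for illustration):

from collections import defaultdict

def viterbi_train(corpus, t, iterations=3):
    """corpus: list of (f_sent, e_sent) pairs; t[(f, e)] is an initial lexical table."""
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for f_sent, e_sent in corpus:
            for f in f_sent:
                # Viterbi step: keep only the most probable link for f
                best_e = max(e_sent, key=lambda e: t[(f, e)])
                count[(f, best_e)] += 1.0
                total[best_e] += 1.0
        # Re-estimate the table from the hard counts
        for f, e in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t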

Generalized Iterative Scaling (GIS) algorithm.

Worked example

Let us walk through the translation process on a concrete example: translating "I do not understand the logic of these people" into French.

  • pick fertilities: generated with probability $\begin{aligned} P\left(\phi_{1} \ldots \phi_{l} | \mathbf{e}\right) &=\prod_{i=1}^{l} \mathbf{F}\left(\phi_{i} | e_{i}\right) \\ &=\mathrm{F}(1 | I)\, \mathrm{F}(0 | do)\, \mathrm{F}(2 | not)\, \mathrm{F}(1 | \text{understand})\, \mathrm{F}(1 | \text{the})\, \mathrm{F}(1 | \text{logic})\, \mathrm{F}(1 | \text{of})\, \mathrm{F}(1 | \text{these})\, \mathrm{F}(1 | \text{people}) \end{aligned}$

[Figure: after picking fertilities]

  • replace words: the target words are generated with probability $\begin{aligned} \prod_{i=1}^{l} \phi_{i} ! \prod_{k=1}^{\phi_{i}} \mathrm{T}\left(f_{i, k} | e_{i}\right)=& 1 ! \times 0 ! \times 2 ! \times 1 ! \times 1 ! \times 1 ! \times 1 ! \times 1 ! \times 1 ! \times \\ & \mathrm{T}(Je | I)\, \mathrm{T}(ne | not)\, \mathrm{T}(pas | not) \times \\ & \mathrm{T}(\text{comprends} | \text{understand})\, \mathrm{T}(la | \text{the})\, \mathrm{T}(\text{logique} | \text{logic}) \times \\ & \mathrm{T}(\text{de} | \text{of})\, \mathrm{T}(\text{ces} | \text{these})\, \mathrm{T}(\text{gens} | \text{people}) \end{aligned}$

[Figure: after replacing words]

  • reorder: the words are reordered with probability $\begin{aligned} \prod_{i=1}^{l} \prod_{k=1}^{\phi_{i}} \mathrm{R}\left(\pi_{i, k} | i, l, m\right)=& \mathrm{R}(j=1 | i=1, l=9, m=10)\, \mathrm{R}(2 | 3,9,10)\, \mathrm{R}(3 | 4,9,10) \times \\ & \mathrm{R}(4 | 3,9,10)\, \mathrm{R}(5 | 5,9,10)\, \mathrm{R}(6 | 6,9,10) \times \\ & \mathrm{R}(7 | 7,9,10)\, \mathrm{R}(8 | 8,9,10)\, \mathrm{R}(9 | 9,9,10) \end{aligned}$

[Figure: after reordering]

  • spurious words: generated with probability $\begin{aligned} P\left(\phi_{0} | \phi_{1} \ldots \phi_{l}\right) \prod_{k=1}^{\phi_{0}} \mathrm{T}\left(f_{0, k} | NULL\right) &=\frac{n !}{\left(n-\phi_{0}\right) ! \phi_{0} !} p_{1}^{\phi_{0}}\left(1-p_{1}\right)^{n-\phi_{0}} \prod_{k=1}^{\phi_{0}} \mathrm{T}\left(f_{0, k} | NULL\right) \\ &=\frac{9 !}{8 !\, 1 !} p_{1}\left(1-p_{1}\right)^{8}\, \mathrm{T}(\text{-là} | NULL) \\ &=9\, p_{1}\left(1-p_{1}\right)^{8}\, \mathrm{T}(\text{-là} | NULL) \end{aligned}$

[Figure] Here $n=\sum_{i=1}^{l} \phi_{i}=m-\phi_{0}$.

  • p1 is the probability governing $\phi_{0}$ (i.e. the NULL alignments); the spurious words are the ones generated as T(f|NULL)

NLTK source code walkthrough

  • Overall code
    • Initialization
if probability_tables is None:
    ibm2 = IBMModel2(sentence_aligned_corpus, iterations)
    self.translation_table = ibm2.translation_table
    self.alignment_table = ibm2.alignment_table
    # d(j | i,l,m) = 1 / m for all i, j, l, m
    l_m_combinations = set()
    for aligned_sentence in sentence_aligned_corpus:
        l = len(aligned_sentence.mots)
        m = len(aligned_sentence.words)
        if (l, m) not in l_m_combinations:
            l_m_combinations.add((l, m))
            initial_prob = 1 / m
            for j in range(1, m + 1):
                for i in range(0, l + 1):
                    self.distortion_table[j][i][l][m] = initial_prob
    # simple initialization, taken from GIZA++
    self.fertility_table[0] = defaultdict(lambda: 0.2)
    self.fertility_table[1] = defaultdict(lambda: 0.65)
    self.fertility_table[2] = defaultdict(lambda: 0.1)
    self.fertility_table[3] = defaultdict(lambda: 0.04)
    MAX_FERTILITY = 10
    initial_fert_prob = 0.01 / (MAX_FERTILITY - 4)
    for phi in range(4, MAX_FERTILITY):
        self.fertility_table[phi] = defaultdict(lambda: initial_fert_prob)
    self.p1 = 0.5
else:
    self.translation_table = probability_tables["translation_table"]
    self.alignment_table = probability_tables["alignment_table"]
    self.fertility_table = probability_tables["fertility_table"]
    self.p1 = probability_tables["p1"]
    self.distortion_table = probability_tables["distortion_table"]
  • Training
def train(self, parallel_corpus):
    counts = Model3Counts()
    for aligned_sentence in parallel_corpus:
        l = len(aligned_sentence.mots)
        m = len(aligned_sentence.words)

        # Sample the alignment space
        sampled_alignments, best_alignment = self.sample(aligned_sentence)
        # Record the most probable alignment
        aligned_sentence.alignment = Alignment(best_alignment.zero_indexed_alignment())

        # E step (a): Compute normalization factors to weigh counts
        # (detailed computation in https://github.com/nltk/nltk/blob/5023d6b933ef1a5b1f25fba1d5ed11a8a43a47e4/nltk/translate/ibm_model.py)
        total_count = self.prob_of_alignments(sampled_alignments)

        # E step (b): Collect counts
        for alignment_info in sampled_alignments:
            count = self.prob_t_a_given_s(alignment_info)
            normalized_count = count / total_count
            for j in range(1, m + 1):
                counts.update_lexical_translation(normalized_count, alignment_info, j)
                counts.update_distortion(normalized_count, alignment_info, j, l, m)
            counts.update_null_generation(normalized_count, alignment_info)
            counts.update_fertility(normalized_count, alignment_info)

    # M step:
    # If any probability is less than MIN_PROB, clamp it to MIN_PROB
    existing_alignment_table = self.alignment_table
    self.reset_probabilities()
    self.alignment_table = existing_alignment_table  # don't retrain

    self.maximize_lexical_translation_probabilities(counts)
    self.maximize_distortion_probabilities(counts)
    self.maximize_fertility_probabilities(counts)
    self.maximize_null_generation_probabilities(counts)
  • The M-step
    • Maximizing the distortion probabilities
def maximize_distortion_probabilities(self, counts):
    MIN_PROB = IBMModel.MIN_PROB
    for j, i_s in counts.distortion.items():
        for i, src_sentence_lengths in i_s.items():
            for l, trg_sentence_lengths in src_sentence_lengths.items():
                for m in trg_sentence_lengths:
                    estimate = (
                        counts.distortion[j][i][l][m]
                        / counts.distortion_for_any_j[i][l][m]
                    )
                    self.distortion_table[j][i][l][m] = max(estimate, MIN_PROB)
  • Computing the probability of the target sentence and alignment given the source sentence
def prob_t_a_given_s(self, alignment_info):
    src_sentence = alignment_info.src_sentence
    trg_sentence = alignment_info.trg_sentence
    l = len(src_sentence) - 1  # exclude NULL
    m = len(trg_sentence) - 1
    p1 = self.p1
    p0 = 1 - p1

    probability = 1.0
    MIN_PROB = IBMModel.MIN_PROB

    # Combine NULL insertion probability
    null_fertility = alignment_info.fertility_of_i(0)
    probability *= pow(p1, null_fertility) * pow(p0, m - 2 * null_fertility)
    if probability < MIN_PROB:
        return MIN_PROB

    # Compute combination (m - null_fertility) choose null_fertility
    for i in range(1, null_fertility + 1):
        probability *= (m - null_fertility - i + 1) / i
        if probability < MIN_PROB:
            return MIN_PROB

    # Combine fertility probabilities
    for i in range(1, l + 1):
        fertility = alignment_info.fertility_of_i(i)
        probability *= (
            factorial(fertility) * self.fertility_table[fertility][src_sentence[i]]
        )
        if probability < MIN_PROB:
            return MIN_PROB

    # Combine lexical and distortion probabilities
    for j in range(1, m + 1):
        t = trg_sentence[j]
        i = alignment_info.alignment[j]
        s = src_sentence[i]
        probability *= (
            self.translation_table[t][s] * self.distortion_table[j][i][l][m]
        )
        if probability < MIN_PROB:
            return MIN_PROB

    return probability

Computing the alignment

  • Use Model 2 to find the most probable alignment: $\mathbf{a}^{*, 2}=\operatorname{argmax}_{\mathbf{a} \in \mathcal{A}} P_{2}(\mathbf{f}, \mathbf{a} | \mathbf{e})$, i.e. $a_{j}^{*, 2}=\operatorname{argmax}_{i}\left(\mathrm{T}\left(f_{j} | e_{i}\right) \mathrm{D}(i | j, l, m)\right)$
  • Compute the neighbors of $\mathbf{a}^{*,2}$ (a neighbor is a new alignment obtained by changing one link of the current alignment or swapping two of its links)
  • Iterate hill climbing for $\mathbf{a}^{*,3}$ (initialized to $\mathbf{a}^{*,2}$): $\mathbf{a}^{*, 3}=\operatorname{argmax}_{\mathbf{a} \in \mathcal{N}\left(\mathbf{a}^{*, 3}\right)} P_{3}(\mathbf{a}, \mathbf{f} | \mathbf{e})$ (a sketch of the neighborhood and the greedy search follows this list)
    • This is equivalent to minimizing the negative log probability
    • For every source position $j \in\{1, \ldots, J\}$ and target position $i \in\{0, \ldots, \tilde{I}\}$, introduce $x_{i j} \in\{0,1\}$ indicating whether i and j are aligned; since each j aligns to exactly one i, we have the constraint $\sum_{i} x_{i j}=1, \ j=1, \ldots, J$, and the objective coefficients are $c_{i j}^{x}=-\log \left[p\left(f_{j} | e_{i}\right) \cdot p(j | i)\right]$
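A minimal sketch of the move/swap neighborhood and the greedy hill climbing described above; score stands in for $P_3(\mathbf{f}, \mathbf{a} | \mathbf{e})$ under the current Model 3 parameters (for example NLTK's prob_t_a_given_s), and an alignment is a list a with a[j] = i:

def neighbors(a, l):
    """Alignments reachable from `a` by moving one link to another source
    position (including NULL at position 0) or swapping two links."""
    result = []
    m = len(a)
    for j in range(m):                       # moves
        for i in range(l + 1):
            if i != a[j]:
                b = list(a)
                b[j] = i
                result.append(b)
    for j1 in range(m):                      # swaps
        for j2 in range(j1 + 1, m):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = b[j2], b[j1]
                result.append(b)
    return result

def hillclimb(a_init, l, score):
    """Greedy ascent: start from the Model 2 alignment and move to the best
    neighbor until no neighbor improves the Model 3 score."""
    best, best_score = list(a_init), score(a_init)
    improved = True
    while improved:
        improved = False
        for b in neighbors(best, l):
            s = score(b)
            if s > best_score:
                best, best_score, improved = b, s, True
    return best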