The Headache-Inducing IBM Models

Recently my work urgently required an understanding of the principles behind GIZA and fast_align, so I set out to catch up on these headache-inducing algorithms all at once. The original plan was to work through IBM Model 1 to Model 5 plus the HMM alignment model, but I got stuck at Model 3 and will continue that part in a later post. This article therefore focuses on Model 1 and Model 2.

References:
https://wenku.baidu.com/view/4cda374769eae009581becd2.html
https://www.cl.uni-heidelberg.de/courses/ss15/smt/scribe2.pdf

Word Alignment with the IBM Models

For any English sentence e and French sentence f, define the probability of translating f into e as Pr(e|f), with the normalization condition $\sum_{e} \operatorname{Pr}(e | f)=1$. Translating f into e then becomes the problem of solving $\hat{e}=\operatorname{argmax}_{e} \operatorname{Pr}(e | f)$. This can be understood as a noisy channel model: suppose the source-language text F we observe was produced by passing a target-language text E through some strange encoder; the goal of translation is then to recover E from F, i.e. a decoding process, as illustrated below:

[Figure: noisy channel view of translation]

As a formula: $\mathrm{E}=\arg \max _{\mathrm{E}} P(\mathrm{E}) P(\mathrm{F} | \mathrm{E})$. Here P(E) is the language model, which reflects how much "E looks like a sentence", i.e. fluency; P(F|E) is the translation model, which reflects how much "F looks like E", i.e. faithfulness. Using the two models jointly works better than using the translation model alone, since the latter tends to produce poor translations. Statistical machine translation therefore has to solve three problems:

  • Modeling and parameter estimation of the language model P(E)
  • Modeling and parameter estimation of the translation model P(F|E)
  • The decoding (search) algorithm

The language model assigns a probability $\operatorname{Pr}\left(E=e_{1} e_{2} \ldots e_{n}\right)$ to any sentence. The N-gram model is the simplest and most common language model; others include the hidden Markov model (HMM) (adding part-of-speech information), probabilistic context-free grammars (PCFG) (adding phrase-structure information), and probabilistic link grammar (adding link-grammar structure). The N-gram language model is:

$\begin{aligned} P(w) &=\prod_{i=1}^{n} P\left(w_{i} | w_{1} w_{2} \ldots w_{i-1}\right) \\ & \approx \prod_{i=1}^{n} P\left(w_{i} | w_{i-N+1} w_{i-N+2} \ldots w_{i-1}\right) \end{aligned}$

This can be visualized with a probability transition diagram; the figure below shows a bigram (2-gram) language model:

[Figure: bigram transition diagram]
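As a concrete illustration of the bigram case, here is a minimal maximum-likelihood bigram model over a toy corpus (a sketch only; the corpus and the unsmoothed relative-frequency estimates are made up for illustration):

from collections import Counter

corpus = [["<s>", "the", "house", "is", "small", "</s>"],
          ["<s>", "the", "house", "is", "big", "</s>"]]

bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(w for s in corpus for w in s[:-1])

def bigram_prob(sentence):
    """P(w) ~= prod_i P(w_i | w_{i-1}), estimated by relative frequency."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(bigram_prob(["<s>", "the", "house", "is", "small", "</s>"]))  # 0.5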

The translation model P(F|E) reflects the probability of generating a sentence F from a sentence E. Since a specific sentence pair will almost never have occurred in the corpus, this probability cannot be estimated directly from corpus counts; it must be decomposed into word translation probabilities and sentence structure (or ordering) probabilities. The translation model therefore introduces word alignment as a hidden variable: $P(\mathrm{F} | \mathrm{E})=\sum_{A} P(\mathrm{F}, \mathrm{A} | \mathrm{E})$

Computing the translation probability P(F|E) thus reduces to estimating the alignment probability P(F,A|E), and the IBM Models are exactly estimates of P(F,A|E). How do IBM Models 1 through 3 differ? IBM Model 1 only considers word-to-word translation probabilities, IBM Model 2 adds the probability of word position changes, and IBM Model 3 adds the possibility of one word translating into several words.

The IBM models are a classic piece of research: the five models were the foundation of word-based statistical machine translation and remain an important step in today's mainstream SMT techniques. As generative models, the IBM models come with their own "rigorous" derivations. Roughly speaking, Models 1 and 2 are built on one decomposition of the joint probability, while Models 3, 4, and 5 are built on another (the fertility-based models). The IBM models are single-word-based models: they only allow one-to-one and one-to-many alignments, never many-to-one, which differs from phrase-based SMT. In terms of model complexity the relation is 1<2<3<4<5, and in terms of training order it is 1->2->3->4->5.

IBM Model-1

References:
https://www.nltk.org/_modules/nltk/translate/ibm1.html
http://mt-class.org/jhu/slides/lecture-ibm-model1.pdf
https://zhuanlan.zhihu.com/p/72160554

The generative story behind IBM Models 1 & 2:

  • Guess the length of the target-language sentence
  • From left to right, for each target-language word:
    • first guess which source-language word it is translated from,
    • then guess which target-language word it should be translated into.

[Figure: the generative story of Models 1 and 2]

IBM Model-1 makes the following assumptions:

  • The target-language sentence is $\mathrm{F}=f_{1}^{m}=f_{1} f_{2} \cdots f_{m}$
  • The source-language sentence is $\mathrm{E}=e_{1}^{l}=e_{1} e_{2} \cdots e_{l}$
  • The word alignment is $\mathrm{A}=a_{1}^{m}=a_{1} a_{2} \cdots a_{m}, \forall j \in\{1, \cdots, m\}, a_{j} \in\{0, \cdots, l\}$
  • The probability of an aligned sentence pair can then be written as: $\operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E})=\operatorname{Pr}(m | \mathrm{E}) \prod_{j=1}^{m} \operatorname{Pr}\left(a_{j} | a_{1}^{j-1}, f_{1}^{j-1}, m, \mathrm{E}\right) \operatorname{Pr}\left(f_{j} | a_{1}^{j}, f_{1}^{j-1}, m, \mathrm{E}\right)$
  • In Model-1 all translation lengths are assumed equally likely: $\operatorname{Pr}(m | \mathrm{E})=\varepsilon$
  • The alignment of each word is assumed to depend only on the source sentence length: $\operatorname{Pr}\left(a_{j} | a_{1}^{j-1}, f_{1}^{j-1}, m, \mathrm{E}\right)=\frac{1}{l+1}$
  • The choice of each target word is assumed to depend only on the source word it is aligned to: $\operatorname{Pr}\left(f_{j} | a_{1}^{j}, f_{1}^{j-1}, m, \mathrm{E}\right)=t\left(f_{j} | e_{a_{j}}\right)$
  • The alignment probability can then be written as: $\operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E})=\frac{\varepsilon}{(l+1)^{m}} \prod_{j=1}^{m} t\left(f_{j} | e_{a_{j}}\right)$
  • Summing over all possible alignments gives the translation probability: $\operatorname{Pr}(\mathrm{F} | \mathrm{E})=\sum_{\mathrm{A}} \operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E})=\frac{\varepsilon}{(l+1)^{m}} \sum_{a_{1}=0}^{l} \cdots \sum_{a_{m}=0}^{l} \prod_{j=1}^{m} t\left(f_{j} | e_{a_{j}}\right)$

This is the IBM Model 1 translation formula: given the parameters t(f|e), we can compute the probability of translating sentence E into sentence F. The translation probability table t(f|e) satisfies the normalization constraint $\sum_{f} t(f | e)=1$.
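To make the sum over alignments concrete, here is a brute-force sketch that enumerates all $(l+1)^m$ alignment vectors and sums the joint probabilities (only feasible for toy sentences; the dictionary layout t[(f, e)] and the use of None as the NULL word are illustrative choices):

from itertools import product

def prob_f_given_e(f_sent, e_sent, t, eps=1.0):
    """Pr(F|E) for IBM Model 1 by enumerating every alignment vector;
    position 0 of the extended source sentence is the NULL word (None)."""
    e_ext = [None] + e_sent
    l, m = len(e_sent), len(f_sent)
    total = 0.0
    for a in product(range(l + 1), repeat=m):   # all (l+1)^m alignments
        p = 1.0
        for j, i in enumerate(a):
            p *= t[(f_sent[j], e_ext[i])]       # t(f_j | e_{a_j})
        total += p
    return eps / (l + 1) ** m * total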

As an aside, IBM Model 2 adds the following assumption:

  • The alignment of each word depends only on the source length, the target length, and the two positions involved: $\operatorname{Pr}\left(a_{j} | a_{1}^{j-1}, f_{1}^{j-1}, m, \mathrm{E}\right)=a\left(a_{j} | j, m, l\right)$, with the normalization condition $\sum_{i=0}^{l} a(i | j, m, l)=1$

Defining the translation probability

For a foreign sentence $f=(f_1,\ldots,f_{l_f})$ of length $l_f$ and an English sentence $e=(e_1,\ldots,e_{l_e})$ of length $l_e$, with a word alignment $a: j \rightarrow i$ mapping each English word $e_j$ to a foreign word $f_i$, the translation probability is defined as:

$p(\mathbf{e}, a | \mathbf{f})=\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} t\left(e_{j} | f_{a(j)}\right)$, for example:

[Figure: example word alignment]

Parameter estimation

By maximum likelihood estimation, we want the set of probability distributions under which the training corpus is most likely. That is, given the training corpus E and F, we look for a distribution t(f|e) that maximizes the translation probability Pr(F|E). This is a constrained optimization problem, the constraint being the normalization of t(f|e). To solve it we introduce Lagrange multipliers and construct an auxiliary function, turning the constrained problem into an unconstrained one.

Introducing Lagrange multipliers $\lambda_{e}$, the auxiliary function is $h(t, \lambda) \equiv \frac{\varepsilon}{(l+1)^{m}} \sum_{a_{1}=0}^{l} \cdots \sum_{a_{m}=0}^{l} \prod_{j=1}^{m} t\left(f_{j} | e_{a_{j}}\right)-\sum_{e} \lambda_{e}\left(\sum_{f} t(f | e)-1\right)$. Taking the derivative with respect to t(f|e) gives: $\frac{\partial h(t, \lambda)}{\partial t(f | e)}=\frac{\varepsilon}{(l+1)^{m}} \sum_{a_{1}=0}^{l} \cdots \sum_{a_{m}=0}^{l} \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right) \frac{\prod_{k=1}^{m} t\left(f_{k} | e_{a_{k}}\right)}{t(f | e)}-\lambda_{e}$.

Setting this to zero yields: $t(f | e)=\lambda_{e}^{-1} \frac{\varepsilon}{(l+1)^{m}} \sum_{a_{1}=0}^{l} \cdots \sum_{a_{m}=0}^{l} \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right) \prod_{k=1}^{m} t\left(f_{k} | e_{a_{k}}\right)$. Note that t(f|e) appears on both sides, so we cannot compute t(f|e) directly from the corpus (F|E). Instead we can read this as a fixed-point iteration: start from some initial t(f|e) and apply the formula repeatedly until it converges to a stable value, which is exactly the EM algorithm. Here $\sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right)$ counts how many times e is linked to f in the alignment A.

Define the expected number of times e is linked to f over all possible alignments A of E and F as $c(f | e ; \mathrm{F}, \mathrm{E}) \equiv \sum_{\mathrm{A}} \operatorname{Pr}(\mathrm{A} | \mathrm{F}, \mathrm{E}) \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right)$, which can be rewritten as $c(f | e ; \mathrm{F}, \mathrm{E})=\sum_{\mathrm{A}} \frac{\operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E})}{\operatorname{Pr}(\mathrm{F} | \mathrm{E})} \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right)=\frac{\sum_{A} \operatorname{Pr}(\mathrm{F}, \mathrm{A} | \mathrm{E}) \sum_{j=1}^{m} \delta\left(f, f_{j}\right) \delta\left(e, e_{a_{j}}\right)}{\operatorname{Pr}(\mathrm{F} | \mathrm{E})}$.

Substituting c(f|e;F,E) into the iteration formula and folding Pr(F|E) into the multiplier λe, we obtain the new update: $t(f | e)=\lambda_{e}^{-1} c(f | e ; \mathrm{F}, \mathrm{E})$

This update can be read as follows:

  • Given a set of parameters t(f|e), we can compute the probability Pr(F,A|E) of every word alignment.
  • Given the probability Pr(F,A|E) of every word alignment, we can compute a new value of t(f|e): sum the alignment probabilities over all alignments containing the link (e,f), then normalize over e.

This is the central idea of the EM algorithm.

EM iteration

We can solve for the translation table t by iterating EM. The procedure is as follows (a compact Python sketch follows the list):

  • Initialize $t(e|f)$

  • E-step: probability of alignments

    • Compute $p(a | \mathbf{e}, \mathbf{f})=\frac{p(\mathbf{e}, a | \mathbf{f})}{p(\mathbf{e} | \mathbf{f})}$ (using the formula above for $p(\mathbf{e}, a|\mathbf{f})$)

      • Compute $p(\mathbf{e}|\mathbf{f})$

        $\begin{aligned} p(\mathbf{e} | \mathbf{f}) &=\sum_{a} p(\mathbf{e}, a | \mathbf{f}) \\ &=\sum_{a(1)=0}^{l_{f}} \ldots \sum_{a\left(l_{e}\right)=0}^{l_{f}} p(\mathbf{e}, a | \mathbf{f}) \\ &=\sum_{a(1)=0}^{l_{f}} \ldots \sum_{a\left(l_{e}\right)=0}^{l_{f}} \frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} t\left(e_{j} | f_{a(j)}\right) \\ &=\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \sum_{a(1)=0}^{l_{f}} \ldots \sum_{a\left(l_{e}\right)=0}^{l_{f}} \prod_{j=1}^{l_{e}} t\left(e_{j} | f_{a(j)}\right) \\ &=\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} \sum_{i=0}^{l_{f}} t\left(e_{j} | f_{i}\right) \end{aligned}$

      A worked example of this computation:
      [Figure]

    • Compute $p(a|\mathbf{e},\mathbf{f})$

      $\begin{aligned} p(a | \mathbf{e}, \mathbf{f}) &=p(\mathbf{e}, a | \mathbf{f}) / p(\mathbf{e} | \mathbf{f}) \\ &=\frac{\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} t\left(e_{j} | f_{a(j)}\right)}{\frac{\epsilon}{\left(l_{f}+1\right)^{l_{e}}} \prod_{j=1}^{l_{e}} \sum_{i=0}^{l_{f}} t\left(e_{j} | f_{i}\right)} \\ &=\prod_{j=1}^{l_{e}} \frac{t\left(e_{j} | f_{a(j)}\right)}{\sum_{i=0}^{l_{f}} t\left(e_{j} | f_{i}\right)} \end{aligned}$

  • M-step: count collection

    • Collect counts: $c(e | f ; \mathbf{e}, \mathbf{f})=\sum_{a} p(a | \mathbf{e}, \mathbf{f}) \sum_{j=1}^{l_{e}} \delta\left(e, e_{j}\right) \delta\left(f, f_{a(j)}\right)$, which simplifies to $c(e | f ; \mathbf{e}, \mathbf{f})=\frac{t(e | f)}{\sum_{i=0}^{l_{f}} t\left(e | f_{i}\right)} \sum_{j=1}^{l_{e}} \delta\left(e, e_{j}\right) \sum_{i=0}^{l_{f}} \delta\left(f, f_{i}\right)$
    • Re-estimate the model: $t(e | f ; \mathbf{e}, \mathbf{f})=\frac{\sum_{(\mathbf{e}, \mathbf{f})} c(e | f ; \mathbf{e}, \mathbf{f})}{\sum_{e} \sum_{(\mathbf{e}, \mathbf{f})} c(e | f ; \mathbf{e}, \mathbf{f})}$
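Putting the E-step and M-step together, here is a minimal sketch of the whole EM loop over plain Python dictionaries (it ignores the NULL word and numerical safeguards; the table direction t[(f, e)] ~ t(f|e) matches the worked example below, and the name train_ibm1 is made up for illustration):

from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """corpus: list of (f_sent, e_sent) pairs; returns t[(f, e)] ~ t(f|e)."""
    f_vocab = {f for f_sent, _ in corpus for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f|e)
        total = defaultdict(float)                # normalizer per source word e
        for f_sent, e_sent in corpus:
            # E step: denominator sum_i t(f_j | e_i) for every target word f_j
            denom = {f: sum(t[(f, e)] for e in e_sent) for f in f_sent}
            for f in f_sent:
                for e in e_sent:
                    c = t[(f, e)] / denom[f]      # posterior link probability
                    count[(f, e)] += c
                    total[e] += c
        # M step: t(f|e) = c(f|e) / sum_f' c(f'|e)
        for f, e in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

On the two-sentence toy corpus of the worked example below, the first iteration of this sketch gives the same table as the hand computation; later iterations differ slightly, because the hand computation only keeps the one-to-one alignments it draws, while this sketch implicitly sums over all alignments.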

Worked example

  • Training sentences:
    • sentence1: the house ||| la maison
    • sentence2: house ||| maison
  • Enumerate the candidate alignments (only one-to-one links are drawn here)
    • sentence1:
      • a1: the->la, house->maison
      • a2: the->maison, house->la
    • sentence2:
      • a3: house->maison
  • Initialization
    • source_side_vocabulary: {the, house}, size=2
    • t(la|the) = 1/size = 1/2
    • t(maison|the) = 1/size = 1/2
    • t(la|house) = 1/size = 1/2
    • t(maison|house) = 1/size = 1/2
  • First iteration
    • Expectation:
      • alignment probabilities
        • p(e,a1|f) = t(la|the) * t(maison|house) = 1/2 * 1/2 = 1/4
        • p(e,a2|f) = t(maison|the) * t(la|house) = 1/2 * 1/2 = 1/4
        • p(e,a3|f) = t(maison|house) = 1/2
      • normalize alignment probabilities
        • p(a1|e,f) = p(e,a1|f)/sum(p(e,a|f)) = p(e,a1|f)/(p(e,a1|f)+p(e,a2|f)) = (1/4)/(1/4+1/4) = 1/2
        • p(a2|e,f) = p(e,a2|f)/(p(e,a1|f)+p(e,a2|f)) = (1/4)/(1/4+1/4) = 1/2
        • p(a3|e,f) = p(e,a3|f)/p(e,a3|f) = 1
    • Maximization:
      • collect counts
        • c(la|the) = p(a1|e,f) * count(la|the) = 1/2 * 1 = 1/2
        • c(maison|the) = p(a2|e,f) * count(maison|the) = 1/2 * 1 = 1/2
        • c(la|house) = p(a2|e,f) * count(la|house) = 1/2 * 1 = 1/2
        • c(maison|house) = p(a1|e,f) * count(maison|house) + p(a3|e,f) * count(maison|house) = 1/2 * 1 + 1 * 1 = 3/2
      • normalize
        • t(la|the) = c(la|the)/sum(c(*|the)) = c(la|the)/(c(la|the) + c(maison|the)) = (1/2)/(1/2 + 1/2) = 1/2
        • t(maison|the) = c(maison|the)/sum(c(*|the)) = c(maison|the)/(c(la|the) + c(maison|the)) = (1/2)/(1/2 + 1/2) = 1/2
        • t(la|house) = (1/2)/(1/2 + 3/2) = 1/4
        • t(maison|house) = (3/2)/(1/2 + 3/2) = 3/4
  • Second iteration
    • Expectation:
      • alignment probabilities
        • p(e,a1|f) = 1/2 * 3/4 = 3/8
        • p(e,a2|f) = 1/2 * 1/4 = 1/8
        • p(e,a3|f) = 3/4
      • normalize alignment probabilities
        • p(a1|e,f) = (3/8)/(3/8 + 1/8) = 3/4
        • p(a2|e,f) = (1/8)/(3/8 + 1/8) = 1/4
        • p(a3|e,f) = (3/4)/(3/4) = 1
    • Maximization:
      • collect counts
        • c(la|the) = 3/4 * 1 = 3/4
        • c(maison|the) = 1/4 * 1 = 1/4
        • c(la|house) = 1/4 * 1 = 1/4
        • c(maison|house) = 3/4 * 1 + 1 * 1 = 7/4
      • normalize (the sketch after this list reproduces these numbers)
        • t(la|the) = (3/4)/(3/4 + 1/4) = 3/4
        • t(maison|the) = (1/4)/(3/4 + 1/4) = 1/4
        • t(la|house) = (1/4)/(1/4 + 7/4) = 1/8
        • t(maison|house) = (7/4)/(1/4 + 7/4) = 7/8
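The hand computation above can be checked mechanically. The sketch below runs EM over exactly the one-to-one candidate alignments drawn above (a1, a2, a3), so it reproduces both iterations; summing over all $(l+1)^m$ alignments, as the full model does, would give slightly different values from the second iteration on:

t = {(f, e): 0.5 for f in ("la", "maison") for e in ("the", "house")}

# candidate alignments as lists of (target word, source word) links
alignments = [
    [[("la", "the"), ("maison", "house")],    # a1
     [("la", "house"), ("maison", "the")]],   # a2
    [[("maison", "house")]],                  # a3 (second sentence pair)
]

def joint(a):
    """p(f, a | e) up to the constant factor epsilon / (l+1)^m."""
    p = 1.0
    for f, e in a:
        p *= t[(f, e)]
    return p

for _ in range(2):
    count = {k: 0.0 for k in t}
    total = {"the": 0.0, "house": 0.0}
    for candidates in alignments:
        z = sum(joint(a) for a in candidates)
        for a in candidates:
            posterior = joint(a) / z              # p(a | e, f)
            for f, e in a:
                count[(f, e)] += posterior
                total[e] += posterior
    for f, e in t:                                # M step: renormalize
        t[(f, e)] = count[(f, e)] / total[e]

print(t)  # t(la|the)=0.75, t(maison|the)=0.25, t(la|house)=0.125, t(maison|house)=0.875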

NLTK source code walkthrough

  • Overall code
counts = Counts()
for aligned_sentence in parallel_corpus:
    trg_sentence = aligned_sentence.words
    src_sentence = [None] + aligned_sentence.mots

    # E step (a): Compute normalization factors to weigh counts
    total_count = self.prob_all_alignments(src_sentence, trg_sentence)

    # E step (b): Collect counts
    for t in trg_sentence:
        for s in src_sentence:
            count = self.translation_table[t][s]
            normalized_count = count / total_count[t]
            counts.t_given_s[t][s] += normalized_count
            counts.any_t_given_s[s] += normalized_count

# M step: Update probabilities with maximum likelihood estimate
self.maximize_lexical_translation_probabilities(counts)
  • Computing the normalization factor $\sum_{a} p(\mathbf{e}, a | \mathbf{f})$
def prob_all_alignments(self, src_sentence, trg_sentence):
    alignment_prob_for_t = defaultdict(lambda: 0.0)
    for t in trg_sentence:
        for s in src_sentence:
            alignment_prob_for_t[t] += self.translation_table[t][s]
    return alignment_prob_for_t
  • The M-step
def maximize_lexical_translation_probabilities(self, counts):
    for t, src_words in counts.t_given_s.items():
        for s in src_words:
            estimate = counts.t_given_s[t][s] / counts.any_t_given_s[s]
            self.translation_table[t][s] = max(estimate, IBMModel.MIN_PROB)
  • Computing the alignment for a given sentence pair (an end-to-end usage example follows the code)
best_alignment = []
for j, trg_word in enumerate(sentence_pair.words):
    best_prob = max(self.translation_table[trg_word][None], IBMModel.MIN_PROB)
    best_alignment_point = None
    for i, src_word in enumerate(sentence_pair.mots):
        align_prob = self.translation_table[trg_word][src_word]
        if align_prob >= best_prob:  # prefer newer word in case of tie
            best_prob = align_prob
            best_alignment_point = i

    best_alignment.append((j, best_alignment_point))
sentence_pair.alignment = Alignment(best_alignment)
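For reference, the classes discussed above can be exercised end to end through NLTK's public API (a small usage sketch, assuming NLTK is installed; the exact probability values depend on the number of iterations):

from nltk.translate import AlignedSent, IBMModel1

bitext = [AlignedSent(["la", "maison"], ["the", "house"]),
          AlignedSent(["maison"], ["house"])]
ibm1 = IBMModel1(bitext, 5)                       # 5 EM iterations

print(ibm1.translation_table["maison"]["house"])  # should be large after training
print(bitext[0].alignment)                        # alignment found for sentence 1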

IBM Model-2

References:
https://www.nltk.org/_modules/nltk/translate/ibm2.html
http://www.ai.mit.edu/courses/6.891-nlp/l11.pdf

One problem with IBM Model 1 is that its reordering ability is very weak, because $p(f,a|s)$ is computed from the lexical translation probabilities $t(t|s)$ alone. If the model has two candidate translations $t_1$ and $t_2$ with the same lexical translations but different word orders, it assigns them the same score. IBM Model 2 introduces an alignment probability model $p(i|j,s,f)$ for positions i and j and uses it to compute $\operatorname{Pr}(t, a | s)=\frac{\epsilon}{(J+1)^{I}} \prod_{j=1}^{J} \operatorname{tr}\left(t_{j} | s_{a(j)}\right) \operatorname{Pr}_{a}(a(j) | j, J, I)$, where $\operatorname{Pr}_{a}(a(j) | j, J, I)$ models the probability that a word at position i in the source sentence is reordered to position j in the target sentence.

The problem is to compute P(f|e), where $e=(e_1,\ldots,e_l)$, $f=(f_1,\ldots,f_m)$, an alignment is $(a_1, \ldots, a_m)$ with $a_j\in \{0,\ldots,l\}$, and there are $(l+1)^m$ alignments in total. IBM Model 1 treats all alignments as equally probable, i.e. $P(\mathbf{a} | \mathbf{e})=C \times \frac{1}{(l+1)^{m}}$, where C is the constant $C=\operatorname{prob}(\operatorname{length}(\mathbf{f})=m)$. The generative process of IBM Model 1 is:

  • Choose the length m of f (with constant probability C)
  • Generate an alignment a with uniform probability $1/(l+1)^m$
  • Generate the target words with probability $P(\mathbf{f} | \mathbf{a}, \mathbf{e})=\prod_{j=1}^{m} \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$
  • Overall: $P(\mathbf{f}, \mathbf{a} | \mathbf{e})=P(\mathbf{a} | \mathbf{e}) P(\mathbf{f} | \mathbf{a}, \mathbf{e})=\frac{C}{(l+1)^{m}} \prod_{j=1}^{m} \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$

Defining the translation probability

IBM Model-2 defines D(i|j,l,m) = the probability that, given source length l and target length m, the j-th target word is aligned to the i-th source word. The alignment probability is then $P\left(\mathbf{a}=\left(a_{1}, \ldots, a_{m}\right) | \mathbf{e}, l, m\right)=\prod_{j=1}^{m} \mathbf{D}\left(a_{j} | j, l, m\right)$, and the translation probability is $P(\mathbf{f}, \mathbf{a} | \mathbf{e}, l, m)=\prod_{j=1}^{m} \mathbf{D}\left(a_{j} | j, l, m\right) \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$. IBM Model 1 is thus a special case of Model 2 with $\mathrm{D}(i | j, l, m)=\frac{1}{l+1}$.

The generative process of IBM Model 2 is therefore (a one-line sketch of the resulting probability follows the list):

  • Choose the length m of f (with constant probability C)
  • Generate an alignment $a=(a_1,\ldots,a_m)$ with probability $\prod_{j=1}^{m} \mathrm{D}\left(a_{j} | j, l, m\right)$
  • Generate the target words with probability $P(\mathbf{f} | \mathbf{a}, \mathbf{e})=\prod_{j=1}^{m} \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$
  • Overall: $P(\mathbf{f}, \mathbf{a} | \mathbf{e})=P(\mathbf{a} | \mathbf{e}) P(\mathbf{f} | \mathbf{a}, \mathbf{e})=C \prod_{j=1}^{m} \mathbf{D}\left(a_{j} | j, l, m\right) \mathbf{T}\left(f_{j} | e_{a_{j}}\right)$
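Instantiated directly, the Model 2 joint probability is a one-liner (a sketch under the same dictionary conventions as the Model 1 sketches above; the alignment here is 0-based and contains no NULL word):

def prob_f_a_given_e(f_sent, a, e_sent, T, D, C=1.0):
    """P(f, a | e) for IBM Model 2: prod_j D(a_j | j, l, m) * T(f_j | e_{a_j})."""
    l, m = len(e_sent), len(f_sent)
    p = C
    for j, i in enumerate(a):
        p *= D[(i, j, l, m)] * T[(f_sent[j], e_sent[i])]
    return p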

Parameter estimation

Writing the Model 2 translation model as $\operatorname{Pr}(\mathrm{F} | \mathrm{E})=\varepsilon \prod_{j=1}^{m} \sum_{i=0}^{l} t\left(f_{j} | e_{i}\right) a\left(i | j, m, l\right)$ and again introducing Lagrange multipliers, the derivation gives:

$t(f | e)=\lambda_{e}^{-1} c(f | e ; \mathrm{F}, \mathrm{E})$

$a(i | j, m, l)=\mu_{j m l}^{-1} c(i | j, m, l ; \mathrm{F}, \mathrm{E})$

$c(f | e ; \mathrm{F}, \mathrm{E})=\sum_{j=1}^{m} \sum_{i=0}^{l} \frac{t(f | e) a(i | j, m, l) \delta\left(f, f_{j}\right) \delta\left(e, e_{i}\right)}{t\left(f | e_{0}\right) a(0 | j, m, l)+\cdots+t\left(f | e_{l}\right) a(l | j, m, l)}$

$c(i | j, m, l ; \mathrm{F}, \mathrm{E})=\frac{t\left(f_{j} | e_{i}\right) a(i | j, m, l)}{t\left(f_{j} | e_{0}\right) a(0 | j, m, l)+\cdots+t\left(f_{j} | e_{l}\right) a(l | j, m, l)}$

Since the training corpus (F|E) consists of a sequence of sentence pairs $\left(\mathrm{F}^{(1)}, \mathrm{E}^{(1)}\right),\left(\mathrm{F}^{(2)}, \mathrm{E}^{(2)}\right), \cdots,\left(\mathrm{F}^{(S)}, \mathrm{E}^{(S)}\right)$,

in practice we use the following formulas:

$t(f | e)=\lambda_{e}^{-1} \sum_{s} c\left(f | e ; \mathrm{F}^{(\mathrm{s})}, \mathrm{E}^{(\mathrm{s})}\right)$

$a(i | j, m, l)=\mu_{j m l}^{-1} \sum_{s} c\left(i | j, m, l ; \mathrm{F}^{(\mathrm{s})}, \mathrm{E}^{(\mathrm{s})}\right)$

Here $\lambda_{e}$ and $\mu_{jml}$ act purely as normalization factors.

EM iteration

  • Initialize $\mathrm{T}(f | e)$ and $\mathrm{D}(i | j, l, m)$
  • Compute the alignment posteriors $a[i, j, k]=\frac{\mathrm{D}\left(a_{j}=i | j, l, m\right) \mathrm{T}\left(f_{j} | e_{i}\right)}{\sum_{i^{\prime}=0}^{l} \mathrm{D}\left(a_{j}=i^{\prime} | j, l, m\right) \mathrm{T}\left(f_{j} | e_{i^{\prime}}\right)}$ (k indexes the sentence pair)
  • E-step:
    • Expected number of times e and f are aligned: $\text{tcount}(e, f)=\sum_{k}\ \sum_{i, j:\ e^{(k)}_{i}=e,\ f^{(k)}_{j}=f} a[i, j, k]$
    • Expected number of times e is aligned to any target word: $\text{scount}(e)=\sum_{k}\ \sum_{i:\ e^{(k)}_{i}=e}\ \sum_{j=1}^{m[k]} a[i, j, k]$
    • Expected number of times, over sentence pairs with source length l and target length m, that $e_i$ and $f_j$ are aligned: $\text{acount}(i, j, l, m)=\sum_{k:\ l[k]=l,\ m[k]=m} a[i, j, k]$
    • Number of sentence pairs with source length l and target length m: $\text{acount}(j, l, m)=|\{k: l[k]=l, m[k]=m\}|$
  • M-step (a Python sketch of the full loop follows this list):
    • Re-estimate the translation probability T(f|e): $\mathrm{T}(f | e)=\frac{\text{tcount}(e, f)}{\text{scount}(e)}$
    • Re-estimate the alignment probability D(i|j,l,m): $\mathrm{D}\left(i | j, l, m\right)=\frac{\text{acount}(i, j, l, m)}{\text{acount}(j, l, m)}$
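As a minimal sketch of this loop (no NULL word, plain dictionaries, the name train_ibm2 made up for illustration), with the lexical table t typically seeded from a Model 1 run such as the train_ibm1 sketch earlier:

from collections import defaultdict

def train_ibm2(corpus, t, iterations=5):
    """corpus: list of (f_sent, e_sent) pairs; t[(f, e)] ~ t(f|e);
    returns (t, d) with d[(i, j, l, m)] ~ D(i|j,l,m)."""
    d = defaultdict(float)
    for f_sent, e_sent in corpus:                  # uniform initialization of D
        l, m = len(e_sent), len(f_sent)
        for j in range(m):
            for i in range(l):
                d[(i, j, l, m)] = 1.0 / l
    for _ in range(iterations):
        tc, tt = defaultdict(float), defaultdict(float)   # lexical counts
        ac, at = defaultdict(float), defaultdict(float)   # alignment counts
        for f_sent, e_sent in corpus:
            l, m = len(e_sent), len(f_sent)
            for j, f in enumerate(f_sent):
                denom = sum(t[(f, e_sent[i])] * d[(i, j, l, m)] for i in range(l))
                for i, e in enumerate(e_sent):
                    c = t[(f, e)] * d[(i, j, l, m)] / denom   # E step posterior
                    tc[(f, e)] += c
                    tt[e] += c
                    ac[(i, j, l, m)] += c
                    at[(j, l, m)] += c
        for f, e in tc:                                       # M step
            t[(f, e)] = tc[(f, e)] / tt[e]
        for i, j, l, m in ac:
            d[(i, j, l, m)] = ac[(i, j, l, m)] / at[(j, l, m)]
    return t, d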

Worked example

  • Training sentences:
    • sentence1: the dog ||| le chien
    • sentence2: the cat ||| le chat
    • sentence3: the bus ||| l' autobus
  • Initialization
    • source_side_vocabulary: {the, dog, cat, bus}, size=4
    • random initialization of
      • T(f|e)

[Figure: initial T(f|e) table]

      • D(i|j,l,m)

[Figure: initial D(i|j,l,m) table]

  • E-step:
    • compute tcount(e,f)

[Figure: tcount table]

For example, tcount(the, le) = a(1, 1, 0) + a(1, 1, 1) = 0.5264 + 0.4665 = 0.9929

  • compute scount(e)
  • compute acount(i,j,l,m)
  • compute acount(j,l,m)
  • M-step:
    • re-estimate T(f|e) (the figure below shows its evolution over several iterations)

[Figure: T(f|e) after several iterations]

  • re-estimate D(i|j,l,m)
  • The IBM paper suggests estimating T(f|e) with Model 1 and using it to initialize Model 2

NLTK source code walkthrough

  • Overall code
    • Initialization
if probability_tables is None:
    ibm1 = IBMModel1(sentence_aligned_corpus, 2 * iterations)
    self.translation_table = ibm1.translation_table
    # a(i | j,l,m) = 1 / (l+1) for all i, j, l, m
    l_m_combinations = set()
    for aligned_sentence in sentence_aligned_corpus:
        l = len(aligned_sentence.mots)
        m = len(aligned_sentence.words)
        if (l, m) not in l_m_combinations:
            l_m_combinations.add((l, m))
            initial_prob = 1 / (l + 1)
            for i in range(0, l + 1):
                for j in range(1, m + 1):
                    self.alignment_table[i][j][l][m] = initial_prob
else:
    self.translation_table = probability_tables["translation_table"]
    # D(i|j,l,m)
    self.alignment_table = probability_tables["alignment_table"]

for n in range(0, iterations):
    self.train(sentence_aligned_corpus)
self.align_all(sentence_aligned_corpus)
  • Training
def train(self, parallel_corpus):
    counts = Model2Counts()
    for aligned_sentence in parallel_corpus:
        src_sentence = [None] + aligned_sentence.mots
        trg_sentence = ["UNUSED"] + aligned_sentence.words  # 1-indexed
        l = len(aligned_sentence.mots)
        m = len(aligned_sentence.words)

        # E step (a): Compute normalization factors to weigh counts
        alignment_prob_for_t = defaultdict(lambda: 0.0)
        for j in range(1, len(trg_sentence)):
            t = trg_sentence[j]
            for i in range(0, len(src_sentence)):
                alignment_prob_for_t[t] += self.prob_alignment_point(
                    i, j, src_sentence, trg_sentence
                )
        total_count = alignment_prob_for_t

        # E step (b): Collect counts
        for j in range(1, m + 1):
            t = trg_sentence[j]
            for i in range(0, l + 1):
                s = src_sentence[i]
                count = self.translation_table[t][s] * self.alignment_table[i][j][l][m]
                normalized_count = count / total_count[t]
                counts.t_given_s[t][s] += normalized_count
                counts.any_t_given_s[s] += normalized_count
                counts.alignment[i][j][l][m] += normalized_count
                counts.alignment_for_any_i[j][l][m] += normalized_count

    # M step: Update probabilities with maximum likelihood estimates
    self.maximize_lexical_translation_probabilities(counts)
    self.maximize_alignment_probabilities(counts)
  • The M-step
def maximize_alignment_probabilities(self, counts):
    MIN_PROB = IBMModel.MIN_PROB
    for i, j_s in counts.alignment.items():
        for j, src_sentence_lengths in j_s.items():
            for l, trg_sentence_lengths in src_sentence_lengths.items():
                for m in trg_sentence_lengths:
                    estimate = (
                        counts.alignment[i][j][l][m]
                        / counts.alignment_for_any_i[j][l][m]
                    )
                    self.alignment_table[i][j][l][m] = max(estimate, MIN_PROB)
  • Computing the alignment for a given sentence pair: $a_{j}^{*}=\operatorname{argmax}_{i}\left(\mathrm{T}\left(f_{j} | e_{i}\right) \mathrm{D}(i | j, l, m)\right)$ (a usage example follows the code)
def align(self, sentence_pair):
    best_alignment = []
    l = len(sentence_pair.mots)
    m = len(sentence_pair.words)
    for j, trg_word in enumerate(sentence_pair.words):
        # Initialize trg_word to align with the NULL token
        best_prob = (
            self.translation_table[trg_word][None]
            * self.alignment_table[0][j + 1][l][m]
        )
        best_prob = max(best_prob, IBMModel.MIN_PROB)
        best_alignment_point = None
        for i, src_word in enumerate(sentence_pair.mots):
            align_prob = (
                self.translation_table[trg_word][src_word]
                * self.alignment_table[i + 1][j + 1][l][m]
            )
            if align_prob >= best_prob:
                best_prob = align_prob
                best_alignment_point = i

        best_alignment.append((j, best_alignment_point))

    sentence_pair.alignment = Alignment(best_alignment)
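As with Model 1, the public API can be exercised directly (a small usage sketch, assuming NLTK is installed; Model 1 initialization happens inside the constructor):

from nltk.translate import AlignedSent, IBMModel2

bitext = [AlignedSent(["le", "chien"], ["the", "dog"]),
          AlignedSent(["le", "chat"], ["the", "cat"]),
          AlignedSent(["l'", "autobus"], ["the", "bus"])]
ibm2 = IBMModel2(bitext, 5)

print(ibm2.translation_table["chien"]["dog"])
print(ibm2.alignment_table[1][1][2][2])     # D(i=1 | j=1, l=2, m=2)
print(bitext[0].alignment)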

IBM Model-3

References:
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1162/handouts/cs224n-lecture3-MT.pdf
http://www.ai.mit.edu/courses/6.891-nlp/l11.pdf
http://www.ai.mit.edu/courses/6.891-nlp/l12.pdf
https://www.nltk.org/_modules/nltk/translate/ibm3.html
http://www1.maths.lth.se/matematiklth/vision/publdb/reports/pdf/schoenemann-ccnll-10.pdf

From Model 3 to Model 5, the translation model is structured as follows:

[Figure: fertility-based translation process]

The steps are:

  • First, according to each source word's fertility probability, decide how many target words it translates into
  • Copy each source word the corresponding number of times
  • Translate each copy into a target word according to the translation probability
  • Reorder the resulting target words according to the distortion probability to obtain the target sentence

For Model 3, the generative story is:

  • For each English word e in the sentence, choose a fertility φ with probability n(φ|e)
  • Sum the fertilities of all words to obtain m-prime
  • Build a new English word string as follows: delete words with fertility 0, copy words with fertility 1 once, words with fertility 2 twice, and so on
  • After each of these m-prime words, decide whether to insert the empty word NULL, with probabilities p1 (insert) and p0 (do not insert)
  • Let φ0 be the number of inserted NULL words
  • Let m be the current total number of words: m-prime + φ0
  • Replace each word e by a foreign word f according to the probability table t(f|e)
  • For each foreign word not produced by NULL, assign a position according to the probability table d(j|i,l,m), where j is the position of the French word in the French string, i is the position in the English sentence of the English word that produced it, l is the length of the English string, and m is the length of the French string
  • If any target position is assigned more than once (holds more than one word), fail
  • Assign positions to the words produced by NULL; these positions must be the empty (unoccupied) ones, and every such assignment is considered equally likely, with probability 1/φ0!
  • Finally, read off the French string; its probability is the product of the probabilities of all the steps above

Defining the translation probability

Model 3 introduces a fertility parameter constraining how many target words a single source word can produce. From Model 3 on, the modeling differs from that of Models 1 and 2. The overall generative process is:

  • For each $e_i$, generate a fertility $\phi_{i}$ (depending only on $e_i$)
  • For each $e_i$, generate $\phi_{i}$ target words (depending only on $e_i$, not on any other context)
  • Choose a position for each target word (depending only on the position of $e_i$ and the sentence lengths)

In more detail:

  • Given an English sentence $e=(e_1,\ldots,e_l)$, we want to model $P(f|e)$
  • Choose $l+1$ fertilities $\{\phi_{0} \ldots \phi_{l}\}$ with probability $P\left(\left\{\phi_{0} \ldots \phi_{l}\right\} | \mathbf{e}\right)$
    • $P\left(\left\{\phi_{0} \ldots \phi_{l}\right\} | \mathbf{e}\right)=P\left(\phi_{0} | \phi_{1} \ldots \phi_{l}\right) \prod_{i=1}^{l} \mathrm{F}\left(\phi_{i} | e_{i}\right)$
      • $F(\phi|e)$ is the probability that e aligns to $\phi$ words, e.g. F(0|the)=0.1, F(1|the)=0.9, F(2|the)=0, ..., F(0|not)=0.01, F(1|not)=0.09, F(2|not)=0.9
      • $P\left(\phi_{0} | \phi_{1} \ldots \phi_{l}\right)$ is the probability of getting $\phi_0$ heads in m flips of a coin whose heads probability is $p_1$: $P\left(\phi_{0} | \phi_{1} \ldots \phi_{l}\right)=\frac{m !}{\left(m-\phi_{0}\right) ! \phi_{0} !} p_{1}^{\phi_{0}}\left(1-p_{1}\right)^{m-\phi_{0}}$
        • Here $m=\sum_{i=1}^{l} \phi_{i}$; intuitively, each of the m words already generated from source words adds one extra NULL alignment with probability $p_1$
  • For each $e_i$, choose positions $\pi_{i,k}\in \{1,\ldots,m\}$ and target words $f_{i,k}$ with probability $\prod_{i=0}^{l} \prod_{k=1}^{\phi_{i}} \mathrm{R}\left(\pi_{i, k} | i, l, m\right) \mathrm{T}\left(f_{i, k} | e_{i}\right)$
    • $R(j|i,l,m)$ is the probability of producing target position j given source length l, target length m and source position i
    • Note that $R(j|i,l,m)$ differs from the earlier $D(i|j,l,m)$, which is the probability of source position i given source length l, target length m and target position j
  • This yields the model:

$\begin{aligned} \phi &=\left\{\phi_{0} \ldots \phi_{l}\right\} \\ \pi &=\left\{\pi_{i, k}: i=0 \ldots l,\ k=1 \ldots \phi_{i}\right\} \\ \mathbf{f} 2 &=\left\{f_{i, k}: i=0 \ldots l,\ k=1 \ldots \phi_{i}\right\} \\ P(\phi, \pi, \mathbf{f} 2 | \mathbf{e}) &= \frac{m !}{\left(m-\phi_{0}\right) ! \phi_{0} !} p_{1}^{\phi_{0}}\left(1-p_{1}\right)^{m-\phi_{0}} \left(\prod_{i=0}^{l} \prod_{k=1}^{\phi_{i}} \mathbf{R}\left(\pi_{i, k} | i, l, m\right) \mathbf{T}\left(f_{i, k} | e_{i}\right)\right) \end{aligned}$

  • Rewritten in terms of the usual alignment variable:

$P(\mathbf{f}, \mathbf{a} | \mathbf{e})= \frac{m !}{\left(m-\phi_{0}\right) ! \phi_{0} !} p_{1}^{\phi_{0}}\left(1-p_{1}\right)^{m-\phi_{0}}\left(\prod_{i=1}^{l} \mathbf{F}\left(\phi_{i} | e_{i}\right) \phi_{i} !\right)\left(\prod_{i=1}^{l} \prod_{k=1}^{\phi_{i}} \mathbf{R}\left(\pi_{i, k} | i, l, m\right)\right)\left(\prod_{i=0}^{l} \prod_{k=1}^{\phi_{i}} \mathbf{T}\left(f_{i, k} | e_{i}\right)\right)$

EM iteration

  • Initialization
    • fertility_table: $F(\phi | e)$
    • translation_table: T(f|e), initialized from the Model 2 result
    • distortion_table: R(j|i,l,m)
    • p1 = 0.5
    • alignment_table: a[i,j,k], initialized from the Model 2 result
  • E-step: for Models 1 and 2 the expected counts can be computed exactly, but for Model 3 this is too expensive, so an approximation is used
    • What is an expected count? Take the sentence pair e = "I do not understand the logic of these people", f = "Je ne comprends pas la logique de ces gens-là" as an example. Under the current model parameters, the expected value of $\phi_{3}$ (the fertility of "not") is $\sum_{\mathbf{a} \in \mathcal{A}} P(\mathbf{a} | \mathbf{f}, \mathbf{e}) \phi_{3}(\mathbf{a})=\sum_{\mathbf{a} \in \mathcal{A}} \frac{P(\mathbf{f}, \mathbf{a} | \mathbf{e})}{\sum_{\mathbf{a}^{\prime} \in \mathcal{A}} P\left(\mathbf{f}, \mathbf{a}^{\prime} | \mathbf{e}\right)} \phi_{3}(\mathbf{a})$, where $\phi_{3}(\mathbf{a})$ denotes the value of $\phi_{3}$ under alignment a
    • Model 3 instead sums over a set of high-probability alignments $\overline{\mathcal{A}}$, computing $\sum_{\mathbf{a} \in \overline{\mathcal{A}}} \frac{P(\mathbf{f}, \mathbf{a} | \mathbf{e})}{\sum_{\mathbf{a}^{\prime} \in \overline{\mathcal{A}}} P\left(\mathbf{f}, \mathbf{a}^{\prime} | \mathbf{e}\right)} \phi_{3}(\mathbf{a})$ to reduce the computational cost
  • M-step: re-estimate the parameter probabilities

Viterbi training

References:
A log-linear model for word alignment: http://nlp.ict.ac.cn/~liuyang/papers/acl05_chn.pdf

The overall idea of Viterbi parameter training:

  • Start from initial parameters
  • Use the current parameters to find the most probable (Viterbi) word alignment
  • Use that most probable alignment to re-estimate the parameters
  • Return to step 2 until convergence

When the parameter update formulas cannot be simplified, Viterbi training is a practical alternative; it usually converges quickly to an acceptable result.

Because IBM Models 1 and 2 have simplified update formulas, the EM iterations do not actually need to enumerate all alignments; the next set of parameters can be computed directly. The translation model formulas of IBM Models 3, 4, and 5 cannot be simplified in this way, so in theory full EM iterations would be required; since enumerating all word alignments is too expensive in practice, Viterbi training is usually used, generating only the best one or few alignments in each E-step.
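A minimal sketch of the idea for a Model-1-style lexical table (hard EM: each E-step keeps only the single best link per target word instead of summing over all alignments; the function name and data layout are made up for illustration):

from collections import defaultdict

def viterbi_train(corpus, t, iterations=3):
    """corpus: list of (f_sent, e_sent) pairs; t[(f, e)] is an initial lexical table."""
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for f_sent, e_sent in corpus:
            for f in f_sent:
                # Viterbi step: keep only the most probable link for f
                best_e = max(e_sent, key=lambda e: t[(f, e)])
                count[(f, best_e)] += 1.0
                total[best_e] += 1.0
        # Re-estimate the table from the hard counts
        for f, e in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t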

Generalized Iterative Scaling (GIS) algorithm.

Worked example

Let us walk through the translation process on a concrete example: translating "I do not understand the logic of these people" into French.

  • pick fertilities: generated with probability $\begin{aligned} P\left(\phi_{1} \ldots \phi_{l} | \mathbf{e}\right) &=\prod_{i=1}^{l} \mathbf{F}\left(\phi_{i} | e_{i}\right) \\ &=\mathrm{F}(1 | I)\, \mathrm{F}(0 | do)\, \mathrm{F}(2 | not)\, \mathrm{F}(1 | \text{understand})\, \mathrm{F}(1 | \text{the})\, \mathrm{F}(1 | \text{logic})\, \mathrm{F}(1 | \text{of})\, \mathrm{F}(1 | \text{these})\, \mathrm{F}(1 | \text{people}) \end{aligned}$

[Figure: after picking fertilities]

  • replace words: the target words are generated with probability $\begin{aligned} \prod_{i=1}^{l} \phi_{i} ! \prod_{k=1}^{\phi_{i}} \mathrm{T}\left(f_{i, k} | e_{i}\right)=& 1 ! \times 0 ! \times 2 ! \times 1 ! \times 1 ! \times 1 ! \times 1 ! \times 1 ! \times 1 ! \times \\ & \mathrm{T}(Je | I)\, \mathrm{T}(ne | not)\, \mathrm{T}(pas | not) \times \\ & \mathrm{T}(\text{comprends} | \text{understand})\, \mathrm{T}(la | \text{the})\, \mathrm{T}(\text{logique} | \text{logic}) \times \\ & \mathrm{T}(\text{de} | \text{of})\, \mathrm{T}(\text{ces} | \text{these})\, \mathrm{T}(\text{gens} | \text{people}) \end{aligned}$

[Figure: after replacing words]

  • reorder: the words are reordered with probability $\begin{aligned} \prod_{i=1}^{l} \prod_{k=1}^{\phi_{i}} \mathrm{R}\left(\pi_{i, k} | i, l, m\right)=& \mathrm{R}(j=1 | i=1, l=9, m=10)\, \mathrm{R}(2 | 3,9,10)\, \mathrm{R}(3 | 4,9,10) \times \\ & \mathrm{R}(4 | 3,9,10)\, \mathrm{R}(5 | 5,9,10)\, \mathrm{R}(6 | 6,9,10) \times \\ & \mathrm{R}(7 | 7,9,10)\, \mathrm{R}(8 | 8,9,10)\, \mathrm{R}(9 | 9,9,10) \end{aligned}$

[Figure: after reordering]

  • spurious words: generated with probability $\begin{aligned} P\left(\phi_{0} | \phi_{1} \ldots \phi_{l}\right) \prod_{k=1}^{\phi_{0}} \mathrm{T}\left(f_{0, k} | NULL\right) &=\frac{n !}{\left(n-\phi_{0}\right) ! \phi_{0} !} p_{1}^{\phi_{0}}\left(1-p_{1}\right)^{n-\phi_{0}} \prod_{k=1}^{\phi_{0}} \mathrm{T}\left(f_{0, k} | NULL\right) \\ &=\frac{9 !}{8 !\, 1 !} p_{1}\left(1-p_{1}\right)^{8}\, \mathrm{T}(\text{-là} | NULL) \\ &=9\, p_{1}\left(1-p_{1}\right)^{8}\, \mathrm{T}(\text{-là} | NULL) \end{aligned}$

[Figure] Here $n=\sum_{i=1}^{l} \phi_{i}=m-\phi_{0}$.

  • p1 is the probability governing $\phi_{0}$ (i.e. the NULL alignments); the spurious words are the ones generated as T(f|NULL)

NLTK source code walkthrough

  • Overall code
    • Initialization
if probability_tables is None:
    ibm2 = IBMModel2(sentence_aligned_corpus, iterations)
    self.translation_table = ibm2.translation_table
    self.alignment_table = ibm2.alignment_table
    # d(j | i,l,m) = 1 / m for all i, j, l, m
    l_m_combinations = set()
    for aligned_sentence in sentence_aligned_corpus:
        l = len(aligned_sentence.mots)
        m = len(aligned_sentence.words)
        if (l, m) not in l_m_combinations:
            l_m_combinations.add((l, m))
            initial_prob = 1 / m
            for j in range(1, m + 1):
                for i in range(0, l + 1):
                    self.distortion_table[j][i][l][m] = initial_prob
    # simple initialization, taken from GIZA++
    self.fertility_table[0] = defaultdict(lambda: 0.2)
    self.fertility_table[1] = defaultdict(lambda: 0.65)
    self.fertility_table[2] = defaultdict(lambda: 0.1)
    self.fertility_table[3] = defaultdict(lambda: 0.04)
    MAX_FERTILITY = 10
    initial_fert_prob = 0.01 / (MAX_FERTILITY - 4)
    for phi in range(4, MAX_FERTILITY):
        self.fertility_table[phi] = defaultdict(lambda: initial_fert_prob)
    self.p1 = 0.5
else:
    self.translation_table = probability_tables["translation_table"]
    self.alignment_table = probability_tables["alignment_table"]
    self.fertility_table = probability_tables["fertility_table"]
    self.p1 = probability_tables["p1"]
    self.distortion_table = probability_tables["distortion_table"]
  • Training
def train(self, parallel_corpus):
    counts = Model3Counts()
    for aligned_sentence in parallel_corpus:
        l = len(aligned_sentence.mots)
        m = len(aligned_sentence.words)

        # Sample the alignment space
        sampled_alignments, best_alignment = self.sample(aligned_sentence)
        # Record the most probable alignment
        aligned_sentence.alignment = Alignment(best_alignment.zero_indexed_alignment())

        # E step (a): Compute normalization factors to weigh counts
        # (detailed computation in https://github.com/nltk/nltk/blob/5023d6b933ef1a5b1f25fba1d5ed11a8a43a47e4/nltk/translate/ibm_model.py)
        total_count = self.prob_of_alignments(sampled_alignments)

        # E step (b): Collect counts
        for alignment_info in sampled_alignments:
            count = self.prob_t_a_given_s(alignment_info)
            normalized_count = count / total_count
            for j in range(1, m + 1):
                counts.update_lexical_translation(normalized_count, alignment_info, j)
                counts.update_distortion(normalized_count, alignment_info, j, l, m)
            counts.update_null_generation(normalized_count, alignment_info)
            counts.update_fertility(normalized_count, alignment_info)

    # M step:
    # If any probability is less than MIN_PROB, clamp it to MIN_PROB
    existing_alignment_table = self.alignment_table
    self.reset_probabilities()
    self.alignment_table = existing_alignment_table  # don't retrain

    self.maximize_lexical_translation_probabilities(counts)
    self.maximize_distortion_probabilities(counts)
    self.maximize_fertility_probabilities(counts)
    self.maximize_null_generation_probabilities(counts)
  • The M-step
    • Maximizing the distortion probabilities
def maximize_distortion_probabilities(self, counts):
    MIN_PROB = IBMModel.MIN_PROB
    for j, i_s in counts.distortion.items():
        for i, src_sentence_lengths in i_s.items():
            for l, trg_sentence_lengths in src_sentence_lengths.items():
                for m in trg_sentence_lengths:
                    estimate = (
                        counts.distortion[j][i][l][m]
                        / counts.distortion_for_any_j[i][l][m]
                    )
                    self.distortion_table[j][i][l][m] = max(estimate, MIN_PROB)
  • Computing the probability of the target sentence and alignment given the source sentence
def prob_t_a_given_s(self, alignment_info):
    src_sentence = alignment_info.src_sentence
    trg_sentence = alignment_info.trg_sentence
    l = len(src_sentence) - 1  # exclude NULL
    m = len(trg_sentence) - 1
    p1 = self.p1
    p0 = 1 - p1

    probability = 1.0
    MIN_PROB = IBMModel.MIN_PROB

    # Combine NULL insertion probability
    null_fertility = alignment_info.fertility_of_i(0)
    probability *= pow(p1, null_fertility) * pow(p0, m - 2 * null_fertility)
    if probability < MIN_PROB:
        return MIN_PROB

    # Compute combination (m - null_fertility) choose null_fertility
    for i in range(1, null_fertility + 1):
        probability *= (m - null_fertility - i + 1) / i
        if probability < MIN_PROB:
            return MIN_PROB

    # Combine fertility probabilities
    for i in range(1, l + 1):
        fertility = alignment_info.fertility_of_i(i)
        probability *= (
            factorial(fertility) * self.fertility_table[fertility][src_sentence[i]]
        )
        if probability < MIN_PROB:
            return MIN_PROB

    # Combine lexical and distortion probabilities
    for j in range(1, m + 1):
        t = trg_sentence[j]
        i = alignment_info.alignment[j]
        s = src_sentence[i]
        probability *= (
            self.translation_table[t][s] * self.distortion_table[j][i][l][m]
        )
        if probability < MIN_PROB:
            return MIN_PROB

    return probability

Computing the alignment

  • Use Model 2 to find the most probable alignment: $\mathbf{a}^{*, 2}=\operatorname{argmax}_{\mathbf{a} \in \mathcal{A}} P_{2}(\mathbf{f}, \mathbf{a} | \mathbf{e})$, i.e. $a_{j}^{*, 2}=\operatorname{argmax}_{i}\left(\mathrm{T}\left(f_{j} | e_{i}\right) \mathrm{D}(i | j, l, m)\right)$
  • Compute the neighbors of $\mathbf{a}^{*,2}$ (a neighbor is a new alignment obtained by changing one link of the current alignment or swapping two of its links)
  • Iterate hill climbing for $\mathbf{a}^{*,3}$ (initialized to $\mathbf{a}^{*,2}$): $\mathbf{a}^{*, 3}=\operatorname{argmax}_{\mathbf{a} \in \mathcal{N}\left(\mathbf{a}^{*, 3}\right)} P_{3}(\mathbf{a}, \mathbf{f} | \mathbf{e})$ (a sketch of the neighborhood and the greedy search follows this list)
    • This is equivalent to minimizing the negative log probability
    • For every source position $j \in\{1, \ldots, J\}$ and target position $i \in\{0, \ldots, \tilde{I}\}$, introduce $x_{i j} \in\{0,1\}$ indicating whether i and j are aligned; since each j aligns to exactly one i, we have the constraint $\sum_{i} x_{i j}=1, \ j=1, \ldots, J$, and the objective coefficients are $c_{i j}^{x}=-\log \left[p\left(f_{j} | e_{i}\right) \cdot p(j | i)\right]$
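A minimal sketch of the move/swap neighborhood and the greedy hill climbing described above; score stands in for $P_3(\mathbf{f}, \mathbf{a} | \mathbf{e})$ under the current Model 3 parameters (for example NLTK's prob_t_a_given_s), and an alignment is a list a with a[j] = i:

def neighbors(a, l):
    """Alignments reachable from `a` by moving one link to another source
    position (including NULL at position 0) or swapping two links."""
    result = []
    m = len(a)
    for j in range(m):                       # moves
        for i in range(l + 1):
            if i != a[j]:
                b = list(a)
                b[j] = i
                result.append(b)
    for j1 in range(m):                      # swaps
        for j2 in range(j1 + 1, m):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = b[j2], b[j1]
                result.append(b)
    return result

def hillclimb(a_init, l, score):
    """Greedy ascent: start from the Model 2 alignment and move to the best
    neighbor until no neighbor improves the Model 3 score."""
    best, best_score = list(a_init), score(a_init)
    improved = True
    while improved:
        improved = False
        for b in neighbors(best, l):
            s = score(b)
            if s > best_score:
                best, best_score, improved = b, s, True
    return best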