Paper reading: "Jointly Learning to Align and Translate with Transformer Models"

2020-03-29

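I have been reading a lot about word alignment lately. This paper, which I first came across last year, uses multi-task learning to improve alignment quality. Today I am reading it carefully.

Reference: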
https://arxiv.org/pdf/1909.02074.pdf

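Paper overview

Neural machine translation (NMT) now dominates the translation field. Its attention mechanism was borrowed from word alignment, yet attention in NMT differs considerably from word alignment: attention tends to attend to context words rather than to the source word itself, and the multi-layer, multi-head design makes the attention patterns very complex.

Word alignment is useful in many places, for example when translating named entities, or for dictionary-assisted translation of low-resource languages. The paper therefore proposes a multi-task learning approach whose loss combines NMT's negative log likelihood (NLL) loss with an alignment loss. In addition, an auto-regressive NMT model only has access to the past target context when translating, which is not enough for word alignment, so the paper uses different context information for the two tasks.

Problem definition and baseline models

Given a source sentence $f_{1}^{J}=f_{1}, \ldots, f_{j}, \ldots, f_{J}$ and its target translation $e_{1}^{I}=e_{1}, \ldots, e_{i}, \ldots, e_{I}$, an alignment is a subset of the Cartesian product of the word positions: $\mathcal{A} \subseteq\{(j, i): j=1, \ldots, J ; i=1, \ldots, I\}$. The word alignment task is to find this many-to-many correspondence. The Transformer computes attention as follows: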

$\tilde{\mathbf{q}}_{n}^{i}=\mathbf{q}^{i} W_{n}^{Q}, \tilde{K}_{n}=K W_{n}^{K}, \tilde{V}_{n}=V W_{n}^{V}$

$H_{n}^{i}=\text { Attention }\left(\tilde{\mathbf{q}}_{n}^{i}, \tilde{K}_{n}, \tilde{V}_{n}\right)$

$\mathcal{M}\left(\mathbf{q}^{i}, K, V\right)=\operatorname{Concat}\left(H_{1}^{i}, \ldots, H_{N}^{i}\right) W^{O}$

$\text { Attention }\left(\tilde{\mathbf{q}}_{n}^{i}, \tilde{K}_{n}, \tilde{V}_{n}\right)=\mathbf{a}_{n}^{i} \tilde{V}_{n}$

$\mathbf{a}_{n}^{i}=\operatorname{softmax}\left(\frac{\tilde{\mathbf{q}}_{n}^{i} \tilde{K}_{n}^{T}}{\sqrt{d_{k}}}\right)$

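Here $\mathbf{a}_{n}^{i} \in \mathbb{R}^{1 \times J}$ is the attention distribution of the i-th target token over all J source tokens, and the full attention matrix is $A_{I \times J}$. The baselines are alignments extracted from the Transformer's attention matrices; the paper reviews two pieces of prior work here, which I will not cover.

Proposed method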
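To make the shapes in the formulas above concrete, here is a minimal NumPy sketch of one attention head. The function name `head_attention`, the random inputs and the dimensions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(Q, K, V, W_q, W_k, W_v):
    """One head of encoder-decoder attention (hypothetical helper).

    Q: (I, d_model) target-side queries; K, V: (J, d_model) encoder outputs.
    Returns the attention matrix A of shape (I, J) and the head output H.
    """
    q = Q @ W_q                      # (I, d_k)
    k = K @ W_k                      # (J, d_k)
    v = V @ W_v                      # (J, d_k)
    d_k = q.shape[-1]
    A = softmax(q @ k.T / np.sqrt(d_k), axis=-1)   # row i is a_n^i
    H = A @ v                        # (I, d_k)
    return A, H

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
I, J, d_model, d_k = 3, 5, 8, 4
Q = rng.normal(size=(I, d_model))
K = V = rng.normal(size=(J, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
A, H = head_attention(Q, K, V, W_q, W_k, W_v)
```

Each row of `A` sums to 1, so row i can be read as a soft alignment of target token i over the J source tokens.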

Averaging Layer-wise Attention Scores

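The attention heads within a single layer play symmetric roles (no head is special a priori), whereas different layers and heads learn different things. The paper therefore averages the attention matrices of all heads within a layer, which gives a clearer view of the alignment. It also finds that the averaged attention matrix G of the penultimate layer represents the alignment best.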
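The averaging step can be sketched as follows. Extracting hard links by taking, for each target token, the source position with the highest averaged attention is an assumption of this sketch, as are the function name and the toy numbers.

```python
import numpy as np

def layer_average_alignment(head_attn):
    """head_attn: (N, I, J) attention matrices of the N heads in one layer."""
    G_soft = head_attn.mean(axis=0)        # (I, J) averaged attention
    src_for_tgt = G_soft.argmax(axis=1)    # best source position per target token
    # Links are written as (j, i): source position j, target position i.
    pairs = {(int(j), i) for i, j in enumerate(src_for_tgt)}
    return G_soft, pairs

# Two heads, I=2 target tokens, J=3 source tokens (made-up values).
head_attn = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
])
G_soft, pairs = layer_average_alignment(head_attn)
```

With these toy values target token 0 links to source position 0 and target token 1 links to source position 2.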

Multi-task Learning

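Since annotating alignments by hand is laborious, the paper uses G to supervise the attention. $G_{i,j}$ is a 0-1 matrix (it can come from the layer-wise averaged attention or from alignments produced by GIZA++), and $G^{p}$ denotes G with its rows normalized into probability distributions. $A_{i,j}$ is the attention of one particular head. Minimizing the KL divergence between $G^{p}$ and $A$ amounts to minimizing the cross-entropy: $\mathcal{L}_{a}(A)=-\frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{J} G_{i, j}^{p} \log \left(A_{i, j}\right)$. The overall loss of the model is $\mathcal{L}=\mathcal{L}_{t}+\lambda \mathcal{L}_{a}(A)$, where $\mathcal{L}_{t}$ is the NLL translation loss.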
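A minimal sketch of the alignment loss $\mathcal{L}_{a}$, assuming that $G^{p}$ is obtained by normalizing each row of the 0-1 matrix G so that it sums to 1, and that target tokens with no aligned source token contribute nothing. The `eps` term and the toy matrices are additions of this sketch.

```python
import numpy as np

def alignment_loss(G, A, eps=1e-9):
    """Cross-entropy between row-normalized 0-1 labels G and attention A.

    G, A: arrays of shape (I, J); every row of A is a probability distribution.
    """
    G = np.asarray(G, dtype=float)
    A = np.asarray(A, dtype=float)
    row_sums = G.sum(axis=1, keepdims=True)
    # Rows without any link stay all-zero (assumption of this sketch).
    G_p = np.divide(G, row_sums, out=np.zeros_like(G), where=row_sums > 0)
    I = G.shape[0]
    # eps avoids log(0); it is not part of the formula in the notes.
    return float(-(G_p * np.log(A + eps)).sum() / I)

G = [[1, 0], [0, 1]]
A = [[0.5, 0.5], [0.25, 0.75]]
loss = alignment_loss(G, A)   # -(log 0.5 + log 0.75) / 2
```

The total training loss would then be the translation NLL plus `lam * loss`, with `lam` playing the role of $\lambda$.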

Providing Full Target Context

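The translation model is trained with auto-regressive decoding: each target token is generated conditioned only on the source tokens and the previously generated target tokens. The alignment task, however, needs to know all of the target tokens. The paper's solution is to compute each loss with a different context (two forward passes):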

$\mathcal{L}_{t}=-\frac{1}{I} \sum_{i=1}^{I} \log \left(p\left(e_{i} | f_{1}^{J}, e_{1}^{i-1}\right)\right)$

$\mathcal{L}_{a}^{\prime}=\mathcal{L}_{a}\left(A | f_{1}^{J}, e_{1}^{I}\right)$
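One way to picture the two contexts is through the decoder self-attention mask. The sketch below is an assumption about how the two forward passes could differ, not a description of the paper's implementation; `self_attention_mask` is a hypothetical helper.

```python
import numpy as np

def self_attention_mask(I, full_context):
    """Boolean (I, I) mask: entry [i, k] is True if position i may attend to k."""
    if full_context:
        # Alignment pass: every target position sees the whole target sentence.
        return np.ones((I, I), dtype=bool)
    # Translation pass: lower-triangular (causal) mask, so position i only
    # sees positions up to i of the shifted decoder input.
    return np.tril(np.ones((I, I), dtype=bool))

causal = self_attention_mask(4, full_context=False)  # used for L_t
full = self_attention_mask(4, full_context=True)     # used for L_a'
```

In the translation pass the upper triangle of the mask is False, while the alignment pass leaves every position visible.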

Experiments

The evaluation metric is alignment error rate (AER). The Transformer uses the base model configuration:

embed_size=512
6 encoder + 6 decoder
8 attention head
share input and output embedding
relu activation
sinusoidal positional embedding (see: https://www.zhihu.com/question/307293465)
validation translation loss for early stopping
Adam, learning rate=3e-4, beta1=0.9, beta2=0.98
warmup step=4000
learning rate scheduler = inverse square root
dropout=0.1
label smoothing=0.1
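The learning-rate settings above (peak 3e-4, 4000 warmup steps, inverse square root decay) can be sketched as a small function. Linear warmup starting from zero is an assumption of this sketch; the notes do not state the initial warmup value.

```python
def inverse_sqrt_lr(step, peak_lr=3e-4, warmup_steps=4000):
    """Learning rate at a given update step (step >= 1)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr (assumed shape of the warmup).
        return peak_lr * step / warmup_steps
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_steps / step) ** 0.5

lr_mid_warmup = inverse_sqrt_lr(2000)
lr_peak = inverse_sqrt_lr(4000)
lr_later = inverse_sqrt_lr(16000)
```

Under these assumptions the rate reaches its peak at step 4000 and has halved again by step 16000.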

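The statistical baseline is configured as follows:

5 iterations each of IBM1, HMM, IBM3 and IBM4
The alignments from the two directions are merged with the grow-diagonal heuristic

The final experimental results are as follows: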
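For readers unfamiliar with grow-diagonal, here is a sketch of the heuristic as it is commonly described: start from the intersection of the two directional alignments, then repeatedly add neighbouring links from the union as long as they attach a so far unaligned source or target word. This is an illustrative version with made-up inputs, not the exact GIZA++/Moses code.

```python
def grow_diag(src2tgt, tgt2src):
    """Symmetrize two alignments given as sets of (j, i) links
    (source position j, target position i)."""
    union = src2tgt | tgt2src
    alignment = set(src2tgt & tgt2src)      # start from the intersection
    neighbours = [(-1, 0), (0, -1), (1, 0), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:                            # repeat until nothing changes
        added = False
        for (j, i) in sorted(alignment):
            for dj, di in neighbours:
                cand = (j + dj, i + di)
                if cand in alignment or cand not in union:
                    continue
                src_aligned = any(p[0] == cand[0] for p in alignment)
                tgt_aligned = any(p[1] == cand[1] for p in alignment)
                # Only add links that cover a word that is still unaligned.
                if not src_aligned or not tgt_aligned:
                    alignment.add(cand)
                    added = True
    return alignment

merged = grow_diag({(0, 0), (1, 1), (2, 1)}, {(0, 0), (1, 1), (1, 2)})
```

In this toy case both one-directional links neighbour the intersection point (1, 1) and each covers an unaligned word, so both are added.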

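In the experiments we observed that the model tends to align pronouns with nouns, which suggests distinguishing two kinds of alignment links, sure and possible. Possible links rarely appear in the statistical GIZA++ alignments, perhaps because GIZA++ relies on co-occurrence statistics. It seems reasonable to conclude that modeling context better helps alignment.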
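Since AER is defined in terms of sure and possible links, a small sketch of the standard definition may help: AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the predicted alignment, S the sure links and P the possible links (with S treated as a subset of P). The example link sets are made up.

```python
def alignment_error_rate(hyp, sure, possible):
    """AER for sets of (j, i) links; lower is better."""
    possible = possible | sure              # sure links also count as possible
    matched = len(hyp & sure) + len(hyp & possible)
    return 1.0 - matched / (len(hyp) + len(sure))

sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
aer_good = alignment_error_rate({(0, 0), (1, 1), (2, 1)}, sure, possible)
aer_bad = alignment_error_rate({(0, 0), (1, 2)}, sure, set())
```

Predicting a possible link is not penalized, which is why the first hypothesis scores an AER of 0 even though it contains a link outside the sure set.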