Machine Translation Detection

2020-02-13

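I have a machine translation detection task to do this week, so I looked up a few papers to see what approaches exist. I only skimmed most of them; here I briefly introduce three.

References: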

Machine Translation Detection from Monolingual Web-Text

Automatic Detection of Machine Translated Text and Translation Quality

Detecting Machine-Translated Paragraphs by Matching Similar Words

Automatic Detection of Translated Text and its Impact on Machine Translation

BLEU: a Method for Automatic Evaluation of Machine Translation

Building a Web-based parallel corpus and filtering out machine-translated text

Machine Translationness: a Concept for Machine Translation Evaluation and Detection

MT Detection in Web-Scraped Parallel Corpora

On the Features of Translationese

Translationese and Its Dialects

Machine Translation Detection from Monolingual Web-Text

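First, note that this paper detects the output of SMT (statistical machine translation). Reading the abstract reminded me that different machine translation models call for different detection mechanisms, which is worth keeping in mind. The paper focuses on the "phrase salad" phenomenon in SMT systems and reaches 95.8% accuracy using only monolingual data.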

Introduction

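The "phrase salad" phenomenon in SMT output means that each phrase of the translation is correct when taken on its own, but the phrases are wrong once combined.

Take the example: "not only" should be followed by "but also", but the SMT system translated only half of this phrase.
A classifier detects whether a sentence is a "phrase salad". It uses two kinds of features: language models, and phrases that people commonly use but that are hard for SMT to generate.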

Proposed Method

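Given the characteristics of SMT output, feature design considers three aspects: sentence fluency, grammatical correctness, and phrase completeness. Features extracted from human-generated text express how similar a sentence is to human-produced sentences; features extracted from machine translations express how similar it is to machine-translated sentences.

Feature Selection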

Fluency Feature
- Two language models are used, f(w,H) and f(w,MT): the former is trained on human-generated text, the latter on machine-translated text
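
To make the two-model idea concrete, here is a minimal sketch, assuming add-one-smoothed bigram models over whitespace tokens. The toy corpora, the function names, and the smoothing choice are all my own illustration, not the paper's setup. It scores one sentence under a "human" model and an "MT" model, giving two numbers that play the roles of f(w,H) and f(w,MT).

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def avg_log_prob(sentence, lm, vocab_size):
    """Average add-one-smoothed bigram log-probability per predicted token."""
    unigrams, bigrams = lm
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total / (len(tokens) - 1)

# Hypothetical toy corpora, only for illustration.
human_corpus = ["he is not only smart but also kind", "she is kind and smart"]
mt_corpus = ["he is not only smart also kind is", "she kind is and smart is"]

human_lm = train_bigram_lm(human_corpus)
mt_lm = train_bigram_lm(mt_corpus)
# Shared vocabulary size so both models are smoothed the same way.
vocab_size = len(set(human_lm[0]) | set(mt_lm[0]))

sentence = "he is not only smart but also kind"
human_score = avg_log_prob(sentence, human_lm, vocab_size)  # role of f(w,H)
mt_score = avg_log_prob(sentence, mt_lm, vocab_size)        # role of f(w,MT)
```

On this toy data the sentence scores higher under the human model than under the MT model; both scores would then go into the classifier as features.
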
Grammaticality Feature
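- POS language models are used, f(pos,H) and f(pos,MT): the former is a language model over POS sequences of human-generated text, the latter over machine-translated text
- POS language models over the extracted function words, f(fw,H) and f(fw,MT)
  - function words: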
    - Prepositions: of, at, in, without, between
    - Pronouns: he, they, anybody, it, one
    - Determiners: the, a, that, my, more, much, either, neither
    - Conjunctions: and, that, when, while, although, or
    - Auxiliary verbs: be (is, am, are), have, got, do
    - Particles: no, not, nor, as
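
As a rough sketch of the function-word step, the snippet below reduces a sentence to its sequence of function words, which could then be fed to an n-gram language model. It assumes the word set is just the examples listed above (not a complete inventory), and it skips POS tagging entirely; `FUNCTION_WORDS` and `function_word_sequence` are names I made up for illustration.

```python
import re

# Only the example words listed above; a real inventory would be larger.
FUNCTION_WORDS = {
    "of", "at", "in", "without", "between",               # prepositions
    "he", "they", "anybody", "it", "one",                 # pronouns
    "the", "a", "that", "my", "more", "much",
    "either", "neither",                                  # determiners
    "and", "when", "while", "although", "or",             # conjunctions
    "be", "is", "am", "are", "have", "got", "do",         # auxiliary verbs
    "no", "not", "nor", "as",                             # particles
}

def function_word_sequence(sentence):
    """Keep only function words, in their original order."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t in FUNCTION_WORDS]

fw_seq = function_word_sequence(
    "He is not only smart but also kind, and they know it."
)
# fw_seq == ['he', 'is', 'not', 'and', 'they', 'it']
```

Note that "but" is dropped here only because it is not among the examples listed above.
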
Gappy-Phrase Feature
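- Phrases with a gap in the middle, e.g. not only * but also
- Measured with a character-level LM
- Sequential Pattern Mining
  - Sequential pattern mining is used to find all gappy phrases
- Phrases are selected by information gain, but I did not understand how the feature values are computed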
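
Since the feature computation itself was unclear to me, the following only sketches the selection step, under my own assumptions: a gappy phrase is treated as a binary "present or absent" indicator, and information gain is computed against the human/MT label. The pattern representation, the toy sentences, and the function names are hypothetical, and the mining step that proposes candidate phrases is not shown.

```python
import math

def contains_gappy_phrase(tokens, pattern):
    """True if the contiguous parts of `pattern` occur in order,
    with any number of tokens (including zero) between them."""
    pos = 0
    for part in pattern:
        n = len(part)
        found = -1
        for i in range(pos, len(tokens) - n + 1):
            if tokens[i:i + n] == part:
                found = i
                break
        if found < 0:
            return False
        pos = found + n
    return True

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        h -= p * math.log2(p)
    return h

def information_gain(sentences, labels, pattern):
    """Entropy reduction from splitting on presence of the pattern."""
    hit = [contains_gappy_phrase(s.lower().split(), pattern) for s in sentences]
    with_p = [y for y, h in zip(labels, hit) if h]
    without_p = [y for y, h in zip(labels, hit) if not h]
    n = len(labels)
    return (entropy(labels)
            - len(with_p) / n * entropy(with_p)
            - len(without_p) / n * entropy(without_p))

# Hypothetical toy data: two human sentences, two MT-like sentences.
sentences = [
    "he is not only smart but also kind",
    "she is not only fast but also careful",
    "he is not only smart also kind",
    "she is fast and careful is",
]
labels = ["human", "human", "mt", "mt"]
pattern = [["not", "only"], ["but", "also"]]  # not only * but also

ig = information_gain(sentences, labels, pattern)
```

On this toy data the pattern separates the two classes perfectly, so its information gain equals the full label entropy of 1 bit. Candidate phrases would be ranked by this value and the top ones kept.
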
Finally, an SVM does the classification.
Other features worth considering:
- The paper Translationese and Its Dialects
- The paper On the Features of Translationese (fairly academic)
- average token length
- type-token ratio
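
The last two features are simple surface statistics. A minimal sketch, assuming whitespace tokenization and the usual definitions (mean characters per token; distinct tokens divided by total tokens); the function names are mine:

```python
def average_token_length(tokens):
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

def type_token_ratio(tokens):
    """Number of distinct tokens divided by total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

tokens = "the cat sat on the mat".split()
avg_len = average_token_length(tokens)  # 17 characters over 6 tokens
ttr = type_token_ratio(tokens)          # 5 distinct tokens over 6
```
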