TY - JOUR
T1 - Automatic grader of MT outputs in colloquial style by using multiple edit distances
AU - Akiba, Yasuhiro
AU - Imamura, Kenji
AU - Sumita, Eiichiro
AU - Nakaiwa, Hiromi
AU - Yamamoto, Seiichi
AU - Okuno, Hiroshi G.
PY - 2005
Y1 - 2005
N2 - This paper addresses the challenging problem of automating the human ability to evaluate output from machine translation (MT) systems, which are subsystems of Speech-to-Speech MT (SSMT) systems. Conventional automatic MT evaluation methods include BLEU, which MT researchers have used frequently. BLEU is unsuitable for SSMT evaluation for two reasons. First, BLEU penalizes errors lightly at the beginning or end of a translation and heavily in the middle, although the assessment should be independent of position. Second, BLEU is intolerant of colloquial sentences with small errors, even though such errors do not prevent a conversation from continuing. In this paper, the authors report a new evaluation method called RED that automatically grades each MT output by using a decision tree (DT). The DT is learned from training examples that are encoded by using multiple edit distances, together with their grades. The multiple edit distances are the normal edit distance (ED), defined by insertion, deletion, and replacement, as well as extensions of ED. The use of multiple edit distances allows more tolerance than either ED or BLEU. Each evaluated MT output is assigned a grade by using the DT. RED and BLEU were compared on the task of evaluating SSMT systems of varying performance, using a spoken language corpus, ATR's Basic Travel Expression Corpus (BTEC). Experimental results showed that RED significantly outperformed BLEU.
KW - BLEU
KW - Decision tree
KW - Edit distances
KW - Machine translation evaluation
KW - mWER
KW - Reference translations
UR - http://www.scopus.com/inward/record.url?scp=18544386509&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=18544386509&partnerID=8YFLogxK
U2 - 10.1527/tjsai.20.139
DO - 10.1527/tjsai.20.139
M3 - Article
AN - SCOPUS:18544386509
SN - 1346-0714
VL - 20
SP - 139
EP - 148
JO - Transactions of the Japanese Society for Artificial Intelligence
JF - Transactions of the Japanese Society for Artificial Intelligence
IS - 3
ER -
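
Note: the following sketch is not part of the bibliographic record above and does not reproduce the authors' RED grader, its extended edit distances, or the decision-tree learning step. It only illustrates the "normal edit distance (ED)" the abstract refers to, i.e. the standard Levenshtein distance over insertion, deletion, and replacement, computed here at the word level between an MT output and one reference translation. The function name, unit costs, and word-level tokenization are assumptions made for illustration.

def edit_distance(hyp: str, ref: str) -> int:
    """Word-level Levenshtein distance with unit insert/delete/replace costs."""
    h, r = hyp.split(), ref.split()
    # prev[j] holds the distance between the current hypothesis prefix
    # and the first j reference words; start from the empty hypothesis.
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, start=1):
        curr = [i]  # distance from h[:i] to the empty reference prefix
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1           # match or replacement
            curr.append(min(prev[j] + 1,          # delete hw
                            curr[j - 1] + 1,      # insert rw
                            prev[j - 1] + cost))  # replace hw with rw
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    # One replacement ("could" -> "would") and one deletion ("a"): distance 2.
    print(edit_distance("could you give me a hand", "would you give me hand"))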