Date of Release: Mon, Aug 1, 2005
Version 3
The NIST 2005 Machine Translation Evaluation (MT-05) was part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities.
Disclaimer: These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT-05 was an evaluation of research algorithms, the MT-05 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for scoring. The systems themselves were not evaluated. There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At present, no single metric has been deemed completely indicative of all aspects of system performance. The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or in the amount of data used to build a system, can greatly influence system performance, changes in the task protocols could reveal different performance strengths and weaknesses for these same systems. For these reasons, this evaluation should not be considered a product testing exercise.
The MT-05 evaluation consisted of two tasks. Each task required performing translation from a given source language into the target language. The source languages were Arabic and Chinese, and the target language was English.
MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differed solely by the amount of data that was available for use in system training and development. The evaluation conditions were called "Large Data Track" and "Unlimited Data Track".
There were two source sets, one for each language under test. Each source set contained 100 articles (or documents). These articles were drawn from newswire documents published by the Agence France Presse and the Xinhua News Agency from December 1, 2004 to January 24, 2005. The source documents were encoded in UTF-8 for Arabic and GB for Chinese.
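As an illustration of handling the stated encodings, the sketch below shows one way the source documents might be read. The file names and helper function are hypothetical, and the report's unspecified "GB" encoding is assumed here to be readable as GB18030 (a superset of GB2312); this is not part of the official evaluation tooling.

```python
def read_source(path: str, language: str) -> str:
    """Read a source document using the encoding stated in the evaluation plan.

    Assumption: Arabic source files are UTF-8 and Chinese source files are in a
    GB encoding; GB18030 is used here as a permissive superset of GB2312.
    """
    encoding = "utf-8" if language == "arabic" else "gb18030"
    with open(path, encoding=encoding) as f:
        return f.read()

# Usage (hypothetical file names):
# arabic_text = read_source("arabic_source_001.txt", "arabic")
# chinese_text = read_source("chinese_source_001.txt", "chinese")
```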
Each source set had four sets of high-quality, independently generated human translations. Each translation agency was required to have native speakers of both the source and target languages working on its translations.
Machine translation quality was measured automatically using an N-gram co-occurrence metric developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that the system translation shares with one or more high-quality reference translations: the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from 0 to 1, with 1 being the best possible score. A detailed description of BLEU can be found in Papineni, Roukos, Ward, and Zhu (2001), "BLEU: a Method for Automatic Evaluation of Machine Translation" (IBM Research Report RC22176).
The main benefit of BLEU is that it can be computed automatically from a system translation and one or more reference translations, allowing quick, inexpensive, and repeatable evaluations that do not require human assessments. It has been found to generally rank systems in the same order as human assessments. BLEU, however, does not have the power to distinguish subtle differences between high-quality translations.
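To illustrate the idea behind the metric, the sketch below computes a simplified sentence-level BLEU-4: clipped (modified) n-gram precisions for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It is only an approximation for illustration, not the NIST scoring implementation; the official metric is computed at the corpus level, and all names here are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU-4: geometric mean of modified
    n-gram precisions (n = 1..4) times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Floor at a tiny value to avoid log(0) in this toy example; real
        # implementations score at the corpus level or apply smoothing.
        precision = max(clipped / total, 1e-9)
        log_precisions.append(math.log(precision))

    # Brevity penalty: penalize candidates shorter than the closest reference.
    ref_len = min((len(r) for r in refs), key=lambda rl: (abs(rl - len(cand)), rl))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)


# Example: more shared n-grams with the references -> higher score.
print(bleu("the cat sat on the mat",
           ["the cat sat on the mat", "a cat was on the mat"]))
```

In this sketch, a candidate that reproduces a reference exactly scores 1.0, while a candidate sharing few n-grams with any reference scores near 0.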
There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At present, no single metric has been deemed completely indicative of all aspects of system performance. The protocols and metrics employed in this evaluation were chosen to support MT research, not to measure application effectiveness.
A software utility implementing BLEU is provided on the NIST website as a downloadable tool for anyone who wants to use it to support their own research efforts, independent of NIST evaluations.
The table below lists the sites, and the tasks in which they participated, for this year's machine translation evaluation.
NIST ID | Site | Location | Arabic (Large) | Arabic (Unlimited) | Chinese (Large) | Chinese (Unlimited)
---|---|---|---|---|---|---
ARL | U.S. Army Research Laboratory | USA |  | x |  | 
ATR | Advanced Telecommunications Research Institute International, Spoken Language Translation Research Laboratories | Japan |  |  | x | 
EDINBURGH | University of Edinburgh | UK | x |  | x | 
FSC | Fitchburg State College | USA | x |  |  | 
GOOGLE | Google | USA | x | x | x | x
HIT# | Harbin Institute of Technology, Machine Intelligence & Translation Laboratory | China |  |  |  | x
IBM | IBM | USA | x |  | x | 
ICT# | Chinese Academy of Sciences, Institute of Computing Technology | China |  |  |  | x
ISI | University of Southern California, Information Sciences Institute | USA | x |  | x | 
ITCIRST | ITC-IRST | Italy |  |  | x | 
JHU-CU | Johns Hopkins University & University of Cambridge | USA, UK | x |  | x | 
LINEARB | Linear B | UK |  |  |  | 
MITRE | MITRE Corporation | USA | x |  | x | 
NRC | National Research Council of Canada | Canada |  |  | x | 
NTT | NTT Communication Science Laboratories | Japan |  |  | x | 
RWTH | RWTH Aachen University | Germany |  |  | x | 
SAAR | Saarland University | Germany |  |  | x | 
SAKHR | Sakhr Software | USA |  | x |  | 
SYSTRAN | SYSTRAN Language Translation Technologies | USA | x |  | x | 
UMD | University of Maryland | USA | x |  | x | 
# Sites that did not fulfill their obligation of attending the follow-up workshop.
The tables below list the official results of the NIST 2005 Machine Translation Evaluation.
This section reviews the results for each site's primary submission for the Arabic-to-English translation task. There is a separate table for each data track (large and unlimited).
Arabic-to-English Task, Large Data Track
Table 1

Site | BLEU-4 Score
---|---
GOOGLE | 0.5131
ISI | 0.4657
IBM | 0.4646
UMD | 0.4497
JHU-CU | 0.4348
EDINBURGH | 0.3970
SYSTRAN | 0.1079
MITRE | 0.0772
FSC | 0.0037
Arabic-to-English Task, Unlimited Data Track
Table 2

Site | BLEU-4 Score
---|---
GOOGLE | 0.5137
SAKHR | 0.3403
ARL | 0.2257
Arabic-to-English Task, Data Track Undefined
Table 3

Site | BLEU-4 Score
---|---
LINEARB* | 0.4300*
* Linear B was not a submission of a fully automatic MT system. Rather, it was a human-aided statistical MT system that used non-Arabic speakers to improve the fluency of the English output by selecting among alternative English phrases from the system's lattices. Search engines were used to look up the spelling of proper names.
By having humans in the loop and using a search engine at the time of evaluation, the Linear B system had access to data past the training cut-off date (December 1, 2004). Therefore, this system did not belong to either the "Large Data Track" or the "Unlimited Data Track" and is not directly comparable to the other systems in either track.
This section reviews the results for each site's primary submission for the Chinese-to-English translation task. There is a separate table for each data track (large and unlimited).
Chinese-to-English Task, Large Data Track
Table 4

Site | BLEU-4 Score
---|---
GOOGLE | 0.3531
ISI | 0.3073
UMD | 0.3000
RWTH | 0.2931
JHU-CU | 0.2827
IBM | 0.2571
EDINBURGH | 0.2513
ITCIRST | 0.2445
NRC | 0.2323
NTT | 0.2321
ATR | 0.1822
SYSTRAN | 0.1471
SAAR | 0.1310
MITRE | 0.0542
Chinese-to-English Task, Unlimited Data Track
Table 5

Site | BLEU-4 Score
---|---
GOOGLE | 0.3516
ICT | 0.1293
HIT | 0.0797