NIST 2008 Open Machine Translation Evaluation - (MT08)

Official Evaluation Results

Date of release: Fri Jun 06, 2008

Version: mt08_official_release_v0

The NIST 2008 Machine Translation Evaluation (MT-08) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-08 evaluation plan.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT-08 was an evaluation of research algorithms, the MT-08 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the equipment, instruments, software, or materials are necessarily the best available for the purpose.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.

For the reasons above, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.


Evaluation Tasks

The MT-08 evaluation consisted of four tasks. Each task required a system to perform translation from a given source language into the target language. The source and target language pairs that made up the four MT-08 tasks were:

  • Translate Arabic text into English text
  • Translate Chinese text into English text
  • Translate Urdu text into English text
  • Translate English text into Chinese text

Evaluation Conditions

MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differed solely by the amount of data that was available for use in the training and development of the core MT engine. The evaluation conditions were called "Constrained Data Track" and "Un-Constrained Data Track".

  • Constrained Data Track - limited the training data to data in the LDC public catalogue existing before July 1st, 2007.
  • Un-Constrained Data Track - extended the training data to any publicly available data existing before July 1st, 2007.

Submissions that do not fall into the categories described above are not reported in this final release.

Evaluation Data

Source Data

MT-08 evaluation data sets contained documents drawn from newswire text documents and web-based newsgroup documents. The source documents were encoded in UTF-8.

The test data was selected from a pool of data collected by the LDC during July 2007. The selection process sought a variety of sources (see below) and publication dates while meeting the target test set size.

Source Language | Newswire Sources | Newsgroup / Web Sources
Arabic | AAW, AFP, AHR, ASB, HYT, NHR, QDS, XIN (including Assabah and the Xinhua News Agency) | various web forums
Chinese | AFP, CNS, GMW, PDA, PLA, XIN | various web forums
Urdu | BBC, JNG, PTB, VOA | various web forums
English | AFP, APW, LTW, NYT, XIN | n/a

Reference Data

MT-08 reference data consisted of four independently generated, high-quality translations produced by professional translation companies. Each translation agency was required to have native speaker(s) of the source and target languages working on the translations.

Current versus Progress Data Division

For participants willing to abide by strict processing rules, a "PROGRESS" test set was distributed for use as a blind benchmark across several evaluations. Teams that processed this data submitted their translations to NIST and then deleted all related files (source, translations, and any other derivative files). The scores on the progress test set were reported to the participants but are not reported here. Future OpenMT evaluations will report PROGRESS test set scores from year to year.

Performance Measurement

Machine translation quality was measured automatically using an N-gram co-occurrence metric developed by IBM and referred to as BLEU. BLEU measures translation accuracy by counting the N-grams (sequences of N words) that a system translation shares with one or more high-quality reference translations: the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from 0 to 1, with 1 being the best possible score. A detailed description of BLEU can be found in Papineni, Roukos, Ward, and Zhu (2001), "BLEU: a Method for Automatic Evaluation of Machine Translation" (keyword = RC22176).
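
The following is a minimal sketch of the idea behind BLEU, not the official mteval-v11b scorer: clipped N-gram precision over 1- to 4-grams against multiple references, combined with a brevity penalty. Tokenization, smoothing, and the exact brevity penalty used by the official tool differ from this illustration.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        """Count the n-grams occurring in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu_sketch(candidate, references, max_n=4):
        cand = candidate.split()
        refs = [r.split() for r in references]
        log_precision_sum = 0.0
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            # Clip each candidate n-gram count by its maximum count in any reference.
            max_ref_counts = Counter()
            for ref in refs:
                for gram, cnt in ngrams(ref, n).items():
                    max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
            clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
            total = max(sum(cand_counts.values()), 1)
            log_precision_sum += math.log(max(clipped, 1e-9) / total)
        # Brevity penalty based on the reference length closest to the candidate length.
        ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
        return bp * math.exp(log_precision_sum / max_n)

    print(bleu_sketch("the cat sat on the mat",
                      ["the cat is sitting on the mat", "there is a cat on the mat"]))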

Although BLEU was the official metric for MT-08, measuring translation quality is an ongoing research topic in the MT community. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

Automatic metrics reported:

  • BLEU-4 (MTeval-v11b: official metric)
  • IBM BLEU (IBM's BLEU with original brevity penalty)
  • NIST (NIST's refinement of BLEU, commonly referred to as NIST)
  • TER
  • METEOR

Other metrics (to be) reported:

  • Human Assessments of Adequacy (judged by participants and others)
  • Human judgments of Preference (judged by participants and others)
  • MT Comprehension Test (implemented by MIT-LL)

Evaluation Participants

The table below lists the organizations entered as participants in MT-08.

Site ID | Organization | Location
apptek | Applications Technology Inc. | USA
auc | The American University in Cairo | Egypt
basistech | Basis Technology | USA
bbn | BBN Technologies | USA
bjut-mtg | Beijing University of Technology, Machine Translation Group | China
cas-ia | Chinese Academy of Sciences, Institute of Automation | China
cas-ict | Chinese Academy of Sciences, Institute of Computing Technology | China
cas-is | Chinese Academy of Sciences, Institute of Software | China
cmu-ebmt | Carnegie Mellon | USA
cmu-smt | Carnegie Mellon, interACT | USA
cmu-xfer | Carnegie Mellon | USA
columbia | Columbia University | USA
cued | University of Cambridge, Dept. of Engineering | UK
edinburgh | University of Edinburgh | UK
google | Google | USA
hit-ir | Harbin Institute of Technology, Information Retrieval Laboratory | China
hkust | Hong Kong University of Science and Technology | China
ibm | IBM | USA
lium | Universite du Maine (Le Mans), Laboratoire d'Informatique | France
msra | Microsoft Research Asia | China
nrc | National Research Council | Canada
nthu | National Tsing Hua University | Taiwan
ntt | NTT Communication Science Laboratories | Japan
qmul | Queen Mary University of London | UK
sakhr | Sakhr Software Co. | Egypt
sri | SRI International | USA
stanford | Stanford University | USA
uka | Universitaet Karlsruhe | Germany
umd | University of Maryland | USA
upc-lsi | Universitat Politechnica de Catalunya, LSI | Spain
upc-talp | Universitat Politechnica de Catalunya, TALP | Spain
xmu-iai | Xiamen University, Institute of Artificial Intelligence | China

Collaborations

Site ID | Organization | Location
ibm_umd | IBM / University of Maryland | USA
jhu_umd | Johns Hopkins University / University of Maryland | USA
isi_lw | USC-ISI / Language Weaver Inc. | USA
msr_msra | Microsoft Research / Microsoft Research Asia |
msr_nrc_sri | Microsoft Research / Microsoft Research Asia / National Research Council Canada / SRI International |
nict_atr | NICT / ATR | Japan
nrc_systran | National Research Council Canada / SYSTRAN |

Evaluation Systems

Each site/team could submit up to four systems for evaluation, with one system marked as its primary system. The primary system represented the site/team's best effort. This official public version of the results reports only the primary systems. Note that these charts show an absolute ranking according to the primary metric.

Systems that fail to meet the requirements for either track will not be reported here.

"significance groups*" shows areas where the wilcoxon signed rank test was not able to differenciate system performance at the 95% confidence level. That is, if two systems belong to the same significance group (by sharing the same number), then they are determined to be comparble, based n BLEU-4 scoring.


Results Section

Contains Valid On-Time Submissions

Late and corrected submissions will be linked here
 


Overall System Results

Arabic to English (primary system) Results

Entire Current Evaluation Test Set

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | google_arabic_constrained_primary | 0.4557 | 0.4526 | 10.8821 | 48.535 | 0.6857
2 | IBM-UMD_arabic_constrained_primary | 0.4525 | 0.4300 | 10.6183 | 48.436 | 0.6539
3 | IBM_arabic_constrained_primary | 0.4507 | 0.4276 | 10.5904 | 48.547 | 0.6530
3 | bbn_arabic_constrained_primary | 0.4340 | 0.4290 | 10.6590 | 49.599 | 0.6784
4 | LIUM_arabic_constrained_primary | 0.4298 | 0.4105 | 10.2732 | 50.484 | 0.6490
5 | isi-lw_arabic_constrained_primary | 0.4248 | 0.4227 | 10.4077 | 51.820 | 0.6695
6 | CUED_arabic_constrained_primary | 0.4238 | 0.4018 | 9.9486 | 51.557 | 0.6274
6 | SRI_arabic_constrained_primary | 0.4229 | 0.4031 | 10.1935 | 49.780 | 0.6430
7 | Edinburgh_arabic_constrained_primary | 0.4029 | 0.3833 | 9.9641 | 51.165 | 0.6396
8 | UMD_arabic_constrained_primary | 0.3906 | 0.3784 | 10.1176 | 52.158 | 0.6553
9 | UPC_arabic_constrained_primary | 0.3743 | 0.3576 | 9.6553 | 53.260 | 0.6380
10 | columbia_arabic_constrained_primary | 0.3740 | 0.3594 | 9.4806 | 51.973 | 0.6092
9,10 | NTT_arabic_constrained_primary | 0.3671 | 0.3540 | 9.8806 | 56.077 | 0.6312
11 | CMUEBMT_arabic_constrained_primary | 0.3481 | 0.3479 | 9.2165 | 57.376 | 0.6057
12 | qmul_arabic_constrained_primary | 0.3308 | 0.3181 | 8.8124 | 55.145 | 0.5893
13 | SAKHR_arabic_constrained_primary | 0.3133 | 0.3133 | 9.1373 | 57.159 | 0.6659
14 | UPC.lsi_english_constrained_primary | 0.3021 | 0.2876 | 8.6350 | 58.228 | 0.5639
15 | BASISTECH_arabic_constrained_primary | 0.2529 | 0.2423 | 7.8781 | 63.015 | 0.5454
16 | AUC_arabic_constrained_primary | 0.1415 | 0.1359 | 6.3210 | 76.406 | 0.4468

UnConstrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
17 | google_arabic_unconstrained_primary | 0.4772 | 0.4739 | 11.1864 | 46.853 | 0.6996
18 | IBM_arabic_unconstrained_primary | 0.4717 | 0.4527 | 11.0591 | 46.755 | 0.6902
19 | apptek_arabic_unconstrained_primary | 0.4483 | 0.4474 | 10.8420 | 48.263 | 0.7160
20 | cmu-smt_arabic_unconstrained_primary | 0.4312 | 0.4114 | 10.3617 | 50.082 | 0.6672

* designates primary metric

Chinese to English (primary system) Results

Entire Current Evaluation Test Set

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | MSR-NRC-SRI_chinese_constrained_primary | 0.3089 | 0.2947 | 8.5059 | 58.460 | 0.5379
1 | bbn_chinese_constrained_primary | 0.3059 | 0.2959 | 8.2023 | 57.067 | 0.5468
1 | isi-lw_chinese_constrained_primary | 0.3041 | 0.2940 | 8.0950 | 57.734 | 0.5467
1 | google_chinese_constrained_primary | 0.2999 | 0.2887 | 8.5143 | 58.359 | 0.5567
2 | MSR-MSRA_chinese_constrained_primary | 0.2901 | 0.2766 | 8.1480 | 60.073 | 0.5171
3 | SRI_chinese_constrained_primary | 0.2697 | 0.2575 | 7.8942 | 61.622 | 0.5101
3 | Edinburgh_chinese_constrained_primary | 0.2608 | 0.2513 | 7.8117 | 60.654 | 0.5142
4 | SU_chinese_constrained_primary | 0.2547 | 0.2420 | 7.7994 | 63.288 | 0.5122
4,5 | UMD_chinese_constrained_primary | 0.2506 | 0.2387 | 7.8236 | 62.134 | 0.5167
4,5 | NTT_chinese_constrained_primary | 0.2469 | 0.2270 | 7.9511 | 63.415 | 0.5126
5 | NRC_chinese_constrained_primary | 0.2458 | 0.2373 | 7.9964 | 63.835 | 0.5362
5 | CASIA_chinese_constrained_primary | 0.2407 | 0.2310 | 7.5790 | 62.518 | 0.4999
6 | NICT-ATR_chinese_constrained_primary | 0.2269 | 0.2184 | 7.1635 | 64.524 | 0.4962
6 | ICT_chinese_constrained_primary | 0.2258 | 0.2213 | 6.1551 | 61.387 | 0.4878
7 | JHU-UMD_chinese_constrained_primary | 0.2111 | 0.2079 | 6.0509 | 61.834 | 0.4691
8 | XMU_chinese_constrained_primary | 0.1979 | 0.1938 | 6.7514 | 63.139 | 0.4780
9 | HITIRLab_chinese_constrained_primary | 0.1866 | 0.1795 | 6.5942 | 67.376 | 0.4458
10 | hkust_large_primary | 0.1678 | 0.1624 | 6.7124 | 75.803 | 0.4332
10 | ISCAS_chinese_constrained_primary | 0.1569 | 0.1520 | 5.9557 | 68.221 | 0.4354
11 | NTHU_Chinese_constrained_primary | 0.0393 | 0.0390 | 3.5096 | 93.892 | 0.3209

UnConstrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
12 | google_chinese_unconstrained_primary | 0.3195 | 0.3069 | 8.8628 | 57.009 | 0.5707
13 | cmu-smt_chinese_unconstrained_primary | 0.2597 | 0.2474 | 8.0026 | 62.411 | 0.5363
14 | NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2443 | 8.0473 | 63.002 | 0.5490
15 | UKA_chinese_unconstrained_primary | 0.2406 | 0.2323 | 7.4571 | 61.706 | 0.4916
16 | CMUXfer_chinese_unconstrained_primary | 0.1310 | 0.1309 | 6.2452 | 76.722 | 0.4614
17 | BJUT_chinese_unconstrained_primary | 0.0735 | 0.0694 | 4.7239 | 77.685 | 0.3944

* designates primary metric

Urdu to English (primary system) Results

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | google_urdu_constrained_primary | 0.2281 | 0.2280 | 7.8406 | 69.906 | 0.5693
2 | bbn_urdu_constrained_primary | 0.2028 | 0.2026 | 7.6927 | 70.885 | 0.5437
2 | IBM_urdu_constrained_primary | 0.2026 | 0.1999 | 7.7022 | 68.860 | 0.5096
2 | isi-lw_urdu_constrained_primary | 0.1983 | 0.1985 | 7.3030 | 72.749 | 0.5239
3 | UMD_urdu_constrained_primary | 0.1829 | 0.1826 | 7.2905 | 68.748 | 0.5053
4 | MITLLAFRL_urdu_constrained_primary | 0.1666 | 0.1666 | 7.0460 | 72.859 |
5 | UPC_urdu_constrained_primary | 0.1614 | 0.1614 | 7.0958 | 72.839 | 0.4904
6 | columbia_urdu_constrained_primary | 0.1459 | 0.1460 | 6.5474 | 78.686 | 0.4903
6,7 | Edinburgh_urdu_constrained_primary | 0.1456 | 0.1455 | 6.4393 | 75.982 | 0.5215
7,8 | NTT_urdu_constrained_primary | 0.1394 | 0.1383 | 6.9604 | 75.605 | 0.5022
8 | qmul_urdu_constrained_primary | 0.1338 | 0.1338 | 6.2915 | 81.457 | 0.4728
8 | CMU-XFER_urdu_constrained_primary# | 0.1016 | 0.1017 | 4.1885 | 108.167 | 0.3518

* designates primary metric
# designates a system with a known alignment problem; a corrected system was submitted late.
 

English to Chinese (primary system) Results

Here is a description of the scores:

  • BLEU-4*: primary metric, produced using mteval-v12, a language-independent version that tokenizes on every Unicode symbol.
  • BLEU-4 normalized: uses a mapping file to normalize both the reference and system translations to a single variant of certain symbols.
  • NIST: the Doddington improvement to BLEU, as reported by mteval-v12.
  • BLEU-4 word segmented: mteval-v12 with word-level scoring, using a standard word segmenter on both reference and system translations (a short illustration of character-level versus word-segmented tokenization follows this list).
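
As a rough illustration (not the mteval-v12 implementation), the snippet below contrasts the two tokenization schemes: character-level tokenization, where every Chinese character becomes its own token, versus word-level scoring after segmentation. The example sentence and its segmentation are chosen by hand for this illustration; MT-08 used a standard automatic word segmenter.

    # Illustrative only: contrast character-level and word-level tokenization of Chinese.
    sentence = "机器翻译评测"  # "machine translation evaluation"

    # BLEU-4* style: tokenize on every Unicode symbol (one token per character).
    char_tokens = list(sentence)
    print(char_tokens)   # ['机', '器', '翻', '译', '评', '测']

    # BLEU-4 word-segmented style: tokens are segmented words.
    # The segmentation below is hand-specified for this example.
    word_tokens = ["机器", "翻译", "评测"]
    print(word_tokens)   # ['机器', '翻译', '评测']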

We are not identifying significance groups for this task.

Constrained Training Track

System | BLEU-4* | BLEU-4 normalized | NIST | BLEU-4 word segmented
google_english_constrained_primary | 0.4142 | 0.4309 | 9.7727 | 0.1643
MSRA_English_constrained_primary | 0.4099 | 0.4343 | 9.4918 | 0.1769
isi-lw_english_constrained_primary | 0.3857 | 0.4163 | 8.6810 | 0.1687
NICT-ATR_english_constrained_primary | 0.3438 | 0.3718 | 7.9608 | 0.1416
HITIRLab_english_constrained_primary | 0.3225 | 0.3436 | 7.3768 | 0.0946
ICT_english_constrained_primary | 0.3176 | 0.3411 | 7.7030 | 0.0879
CMUEBMT_english_constrained_primary | 0.2738 | 0.2954 | 7.3042 | 0.0760
XMU_english_constrained_primary | 0.2502 | 0.2664 | 6.2083 | 0.0593
UMD_english_constrained_primary | 0.1982 | 0.2391 | 3.6922 | 0.0899

UnConstrained Training Track

System | BLEU-4* | BLEU-4 normalized | NIST | BLEU-4 word segmented
google_english_unconstrained_primary | 0.4710 | 0.4914 | 10.7868 | 0.1963
BJUT_english_unconstrained_primary | 0.2765 | 0.2906 | 7.8185 | 0.1046

* designates primary metric


Results by Genre

All reported scores are limited to the entire "CURRENT" data sets. All primary submissions are shown here.

Site Results (alphabetical order)

All scores are BLEU-4*

Arabic to English

System | All data | NW | WB
AUC_arabic_constrained_primary | 0.1415 | 0.1718 | 0.0983
BASISTECH_arabic_constrained_primary | 0.2529 | 0.2951 | 0.1900
CMUEBMT_arabic_constrained_primary | 0.3481 | 0.4094 | 0.2695
CUED_arabic_constrained_primary | 0.4238 | 0.4819 | 0.3456
Edinburgh_arabic_constrained_primary | 0.4029 | 0.4675 | 0.3008
IBM-UMD_arabic_constrained_primary | 0.4525 | 0.5085 | 0.3489
IBM_arabic_constrained_primary | 0.4507 | 0.5089 | 0.3432
LIUM_arabic_constrained_primary | 0.4298 | 0.4830 | 0.3431
NTT_arabic_constrained_primary | 0.3671 | 0.4186 | 0.2923
SAKHR_arabic_constrained_primary | 0.3133 | 0.3505 | 0.2622
SRI_arabic_constrained_primary | 0.4229 | 0.4886 | 0.3171
UMD_arabic_constrained_primary | 0.3906 | 0.4452 | 0.3117
UPC.lsi_english_constrained_primary | 0.3021 | 0.3475 | 0.2292
UPC_arabic_constrained_primary | 0.3743 | 0.4281 | 0.2840
bbn_arabic_constrained_primary | 0.4340 | 0.4919 | 0.3497
columbia_arabic_constrained_primary | 0.3740 | 0.4431 | 0.2797
google_arabic_constrained_primary | 0.4557 | 0.5164 | 0.3724
isi-lw_arabic_constrained_primary | 0.4248 | 0.4870 | 0.3355
qmul_arabic_constrained_primary | 0.3308 | 0.4005 | 0.2358

UNCONSTRAINED SYSTEMS

System | All data | NW | WB
IBM_arabic_unconstrained_primary | 0.4717 | 0.5264 | 0.3762
apptek_arabic_unconstrained_primary | 0.4483 | 0.4900 | 0.3925
cmu-smt_arabic_unconstrained_primary | 0.4312 | 0.4884 | 0.3392
google_arabic_unconstrained_primary | 0.4772 | 0.5385 | 0.3940

Chinese to English

System | All data | NW | WB
CASIA_chinese_constrained_primary | 0.2407 | 0.2756 | 0.1936
Edinburgh_chinese_constrained_primary | 0.2608 | 0.2976 | 0.2116
HITIRLab_chinese_constrained_primary | 0.1866 | 0.2116 | 0.1529
ICT_chinese_constrained_primary | 0.2258 | 0.2760 | 0.1586
ISCAS_chinese_constrained_primary | 0.1569 | 0.1805 | 0.1257
JHU-UMD_chinese_constrained_primary | 0.2111 | 0.2502 | 0.1586
MSR-MSRA_chinese_constrained_primary | 0.2901 | 0.3435 | 0.2175
MSR-NRC-SRI_chinese_constrained_primary | 0.3089 | 0.3614 | 0.2376
NICT-ATR_chinese_constrained_primary | 0.2269 | 0.2579 | 0.1854
NRC_chinese_constrained_primary | 0.2458 | 0.2679 | 0.2150
NTHU_Chinese_constrained_primary | 0.0393 | 0.0367 | 0.0425
NTT_chinese_constrained_primary | 0.2469 | 0.2828 | 0.1991
SRI_chinese_constrained_primary | 0.2697 | 0.3154 | 0.2075
SU_chinese_constrained_primary | 0.2547 | 0.2924 | 0.2039
UMD_chinese_constrained_primary | 0.2506 | 0.2939 | 0.1871
XMU_chinese_constrained_primary | 0.1979 | 0.2401 | 0.1401
bbn_chinese_constrained_primary | 0.3059 | 0.3639 | 0.2273
google_chinese_constrained_primary | 0.2999 | 0.3489 | 0.2344
hkust_large_primary | 0.1678 | 0.1891 | 0.1377
isi-lw_chinese_constrained_primary | 0.3041 | 0.3676 | 0.2176

UNCONSTRAINED SYSTEMS

System | All data | NW | WB
BJUT_chinese_unconstrained_primary | 0.0735 | 0.0751 | 0.0689
CMUXfer_chinese_unconstrained_primary | 0.1310 | 0.1536 | 0.0994
NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2757 | 0.2192
UKA_chinese_unconstrained_primary | 0.2406 | 0.2846 | 0.1810
cmu-smt_chinese_unconstrained_primary | 0.2597 | 0.2909 | 0.2127
google_chinese_unconstrained_primary | 0.3195 | 0.3701 | 0.2515

Urdu to English

System | All data | NW | WB
CMU-XFER_urdu_constrained_primary# | 0.1016 | 0.1827 | 0.0183
Edinburgh_urdu_constrained_primary | 0.1456 | 0.1609 | 0.1291
IBM_urdu_constrained_primary | 0.2026 | 0.2347 | 0.1668
MITLLAFRL_urdu_constrained_primary | 0.1666 | 0.1939 | 0.1373
NTT_urdu_constrained_primary | 0.1394 | 0.1630 | 0.1155
UMD_urdu_constrained_primary | 0.1829 | 0.2160 | 0.1478
UPC_urdu_constrained_primary | 0.1614 | 0.1878 | 0.1320
bbn_urdu_constrained_primary | 0.2028 | 0.2388 | 0.1632
columbia_urdu_constrained_primary | 0.1459 | 0.1714 | 0.1195
google_urdu_constrained_primary | 0.2281 | 0.2619 | 0.1903
isi-lw_urdu_constrained_primary | 0.1983 | 0.2292 | 0.1645
qmul_urdu_constrained_primary | 0.1338 | 0.1578 | 0.1077

* designates primary metric
# designates a system with a known alignment problem; a corrected system was submitted late.
