Automatic metrics of machine translation (MT) quality are vital for fast-paced research progress. Many automatic metrics of MT quality have been proposed and evaluated in terms of correlation with human judgments, while various manual judgment techniques are being examined as well; see e.g. MetricsMATR08 (Przybocki et al., 2008), WMT08 and WMT09 (Callison-Burch et al., 2008; Callison-Burch et al., 2009).
The contribution of this paper is twofold. Section 2 illustrates and explains severe problems with the widely used BLEU metric (Papineni et al., 2002) when applied to Czech as a representative of languages with rich morphology. We see this as an instance of the sparse data problem well known in MT itself: too much detail in the formal representation leads to low coverage in e.g. a translation dictionary. In MT evaluation, too much detail leads to a lack of comparable parts between the hypothesis and the reference.
Section 3 introduces and evaluates some new variations of SemPOS (Kos and Bojar, 2009), a metric based on the deep syntactic representation of the sentence which performs very well with Czech as the target language. Besides including dependency and n-gram relations in the scoring, we also apply and evaluate SemPOS for English.
BLEU (Papineni et al., 2002) is an established language-independent MT metric. Its correlation with human judgments was originally deemed high (for English), but better-correlating metrics (especially for other languages), usually employing language-specific tools, were found later; see e.g. Przybocki et al. (2008) or Callison-Burch et al. (2009). The unbeaten advantage of BLEU is its simplicity.
Figure 1 illustrates a very low correlation with human judgments when translating into Czech. We plot the official BLEU score against the rank established as the percentage of sentences where a system is ranked no worse than all its competitors (Callison-Burch et al., 2009). The systems developed at Charles University (cu-) are described in Bojar et al. (2009); uedin is a vanilla configuration of Moses (Koehn et al., 2007); and the remaining ones are commercial MT systems.
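The rank computation can be sketched as follows. This is a minimal illustration under simplifying assumptions: we pretend each sentence yields one dict of system → rank, and the function name system_ranks is ours; the WMT09 judgments are actually collected as pairwise and small-set comparisons.

```python
from collections import defaultdict

def system_ranks(judgments):
    """For each system, the percentage of sentences on which it was
    ranked no worse than all competitors it was compared with.
    judgments: list of dicts {system_name: rank} for one sentence;
    lower rank is better, ties allowed."""
    wins = defaultdict(int)
    seen = defaultdict(int)
    for sentence in judgments:
        best = min(sentence.values())
        for system, rank in sentence.items():
            seen[system] += 1
            if rank == best:          # no worse than every competitor
                wins[system] += 1
    return {s: 100.0 * wins[s] / seen[s] for s in seen}
```

A system that ties for the best rank on a sentence is counted as a win for that sentence, which is why the percentages of all systems need not sum to 100.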
In a manual analysis, we identified the reasons for the low correlation: BLEU is overly sensitive to sequences and forms in the hypothesis matching the reference translation. This focus goes directly against the properties of Czech: relatively free word order allows many permutations of words, and rich morphology renders many valid word forms not confirmed by the reference. These problems are to some extent mitigated if several reference translations are available, but this is often not the case.
Figure 2 illustrates the problem of "sparse data" in the reference. Due to the lexical and morphological variance of Czech, only a single word in each hypothesis matches a word in the reference. In the case of pctrans, the match is even a false positive: "do" (to) is a preposition that should be used for the "minus" phrase and not for the "end of the day" phrase. In terms of BLEU, both hypotheses are equally poor, but 90% of their tokens were not evaluated.
Table 1 estimates the overall magnitude of this issue: for 1-grams to 4-grams in 1,640 instances (different MT outputs and different annotators) of 200 sentences with manually flagged errors, we count how often the n-gram is confirmed by the reference and how often it contains an error flag. The suspicious cases are n-grams confirmed by the reference but still containing a flag (false positives) and n-grams not confirmed despite containing no error flag (false negatives).
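The four-way classification behind Table 1 can be sketched as follows. The flat token/flag representation and both function names are our assumptions for illustration, not the setup used to produce the actual counts.

```python
def ngrams(tokens, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def classify_ngrams(hyp_tokens, error_flags, ref_tokens, n):
    """Sort hypothesis n-grams into four cells: confirmed by the
    reference or not, crossed with containing a flagged token or not.
    error_flags[i] is True iff hyp_tokens[i] was manually marked wrong."""
    ref_set = set(ngrams(ref_tokens, n))
    counts = {"conf_flag": 0,    # suspicious: false positives
              "conf_ok": 0,
              "unconf_flag": 0,
              "unconf_ok": 0}    # suspicious: false negatives
    for i, gram in enumerate(ngrams(hyp_tokens, n)):
        flagged = any(error_flags[i:i + n])
        key = ("conf" if gram in ref_set else "unconf") + \
              ("_flag" if flagged else "_ok")
        counts[key] += 1
    return counts
```

An n-gram containing any flagged token counts as flagged, so a single wrong word taints every n-gram spanning it, which is one reason the higher-order cells shrink.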
Fortunately, there are relatively few false positives in n-gram-based metrics: 6.3% of unigrams and far fewer higher n-grams.
The issue of false negatives is more serious and confirms the problem of sparse data if only one reference is available. 30% to 40% of n-grams do not contain any error and yet they are not confirmed by the reference. This amounts to 34% of running unigrams, giving enough space to differ in human judgments and still remain unscored.
Figure 3 documents the issue across languages: the lower the BLEU score itself (i.e. the fewer confirmed n-grams), the lower the correlation with human judgments, regardless of the target language (WMT09 shared task, 2,025 sentences per language).
Figure 4 illustrates the overestimation of scores caused by paying too much attention to sequences of tokens. A phrase-based system like Moses (cubojar) can sometimes produce a long sequence of tokens exactly as required by the reference, leading to a high BLEU score. The framed words in the illustration are not confirmed by the reference, but the actual error in these words is very serious for comprehension: nouns were used twice instead of finite verbs, and a misleading translation of a preposition was chosen. The output by pctrans preserves the meaning much better despite not scoring on either of the finite verbs and producing far shorter confirmed sequences.
SemPOS (Kos and Bojar, 2009) is inspired by metrics based on the overlapping of linguistic features in the reference and in the translation (Gimenez and Marquez, 2007). It operates on the so-called 'tectogrammatical' (deep syntactic) representation of the sentence (Sgall et al., 1986; Hajic et al., 2006), formally a dependency tree that includes only autosemantic (content-bearing) words. SemPOS, as defined in Kos and Bojar (2009), disregards the syntactic structure and uses the semantic part of speech of the words (noun, verb, etc.). There are 19 fine-grained parts of speech. For each semantic part of speech t, the overlapping O(t) is set to zero if the part of speech does not occur in the reference or the candidate set, and otherwise it is computed as given in Equation 1 below.
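Under the assumption that matching is count-clipped per sentence, the overlapping for one class t can be sketched as follows. This is our reading of Equation 1, not the authors' implementation: each sentence is reduced to (lemma, sempos) pairs of its autosemantic words, and a candidate lemma counts as matched at most as many times as it occurs in the reference.

```python
from collections import Counter

def overlap(cand_sents, ref_sents, t):
    """Sketch of the overlapping O(t) for one semantic part of speech t.
    Each sentence is a list of (lemma, sempos) pairs covering the
    autosemantic words only; matched counts are clipped by the
    reference counts, precision-style, and normalized by the total
    number of reference words of class t."""
    matched = total = 0
    for cand, ref in zip(cand_sents, ref_sents):
        c = Counter(lemma for lemma, pos in cand if pos == t)
        r = Counter(lemma for lemma, pos in ref if pos == t)
        matched += sum(min(cnt, r[lemma]) for lemma, cnt in c.items())
        total += sum(r.values())
    return matched / total if total else 0.0
```

The same routine covers the Functor and Void variants discussed below: only the classification labels attached to the words change, not the overlap computation.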
SemPOS uses semantic parts of speech to classify autosemantic words. The tectogrammatical layer also offers a feature called Functor, which describes the relation of a word to its governor, similarly to semantic roles. There are 67 functor types in total.
Using Functor instead of SemPOS increases the number of word classes that independently require a high overlap. As a contrast, we also completely remove the classification and use only one global class (Void).
In SemPOS, an autosemantic word of a class is confirmed if its lemma matches the reference. We utilize the dependency relations at the tectogrammatical layer to validate valence by refining the overlap and also requiring the lemma of 1) the parent (denoted "par"), or 2) all the children, regardless of their order (denoted "sons"), to match.
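The plain, "par", and "sons" variants of confirmation can be sketched as follows; the node representation (lemma, parent lemma, set of child lemmas) and the function name confirmed are our assumptions for illustration.

```python
def confirmed(node, ref_nodes, mode="plain"):
    """Sketch: does a candidate tectogrammatical node count as confirmed?
    node and each ref node: {'lemma': str, 'parent': lemma or None,
    'children': set of child lemmas}.  'par' additionally requires the
    parent lemma to match; 'sons' additionally requires the child
    lemmas to match as an (unordered) set."""
    for ref in ref_nodes:
        if node["lemma"] != ref["lemma"]:
            continue
        if mode == "plain":
            return True
        if mode == "par" and node["parent"] == ref["parent"]:
            return True
        if mode == "sons" and node["children"] == ref["children"]:
            return True
    return False
```

Comparing the children as a set is what makes the "sons" variant insensitive to word order while still penalizing a verb whose arguments were dropped or mistranslated.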