Evaluation in HOO 2012

What Gets Evaluated

In HOO 2012, we are evaluating the performance of systems in detecting and correcting preposition and determiner errors. More precisely, the gold-standard data indicates cases where:

an incorrect preposition or determiner has been used (a substitution error; RT or RD respectively);
a preposition or determiner is present when it should not be (a spurious word error; UT or UD respectively); and
a preposition or determiner is absent when it should be present (a missing word error; MT or MD respectively).

The set of error tags we use in HOO is based on the Cambridge University Press Error Coding System; this is copyright to Cambridge University Press and may only be used with their written permission. The coding is used by CUP to annotate the Cambridge Learner Corpus (CLC), which informs English Language Teaching materials published by CUP. The scheme is discussed in detail in [Nicholls, 2003]. The HOO coding scheme uses a different syntax from the CLC, but has essentially the same semantics.

Here are some examples of the error types we focus on:

Error Code	Description	Example Errored Form	Corresponding Correction
RT	Replace preposition	When I arrived at London	When I arrived in London
MT	Missing preposition	I gave it John	I gave it to John
UT	Unnecessary preposition	I told to John that ...	I told John that ...
RD	Replace determiner	Have the nice day	Have a nice day
MD	Missing determiner	I have car	I have a car
UD	Unnecessary determiner	There was a lot of the traffic	There was a lot of traffic

For each such error, systems are required to:

detect an error of the appropriate type (one of RT, MT, UT, RD, MD or UD); and
in the case of substitution errors and missing word errors, propose the preposition or determiner that should be used.
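
The two requirements above can be sketched as a small helper. This is an illustration only: the names ERROR_CODES and requires_correction are hypothetical and are not part of the HOO tools.

```python
# The six error codes from the table above. The first letter gives the
# operation (R = replace, M = missing, U = unnecessary); the second gives
# the word class (T = preposition, D = determiner).
ERROR_CODES = {"RT", "MT", "UT", "RD", "MD", "UD"}


def requires_correction(code: str) -> bool:
    """Return True if a system must also propose a replacement word.

    Substitution (R*) and missing-word (M*) errors need a proposed
    preposition or determiner; unnecessary-word (U*) errors only need
    to be detected.
    """
    if code not in ERROR_CODES:
        raise ValueError(f"not a HOO 2012 error code: {code!r}")
    return code[0] in ("R", "M")


for code in sorted(ERROR_CODES):
    print(code, requires_correction(code))
```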

Note that, whereas in HOO 2011 we permitted optional and multiple corrections, this flexibility is not present here; since we are using 'found data', we restrict corrections to those specified in the original annotations, where exactly one correction is always specified. As the MD example in the table above illustrates, there are likely to be circumstances where this is questionable: in this example, the missing determiner might alternatively be 'the'.

How Scoring Works

We distinguish scores and measures. We calculate three separate scores:

Detection: does the system determine that an edit of the specified type is required at some point in the text?
Recognition: does the system correctly determine the extent of the source text that requires editing?
Correction: does the system offer a correction that is identical to that provided in the gold standard?

The detection score provides a way of giving credit to a system when it identifies the need to fix something but disagrees with the offsets for the edit provided in the gold standard. This may happen for a number of reasons: in particular, there may be a genuine disagreement as to the scope of the error when multiword elements are concerned (for example, prepositions like 'in between'), or a system's offset calculations may be based on a tokenisation regime that handles punctuation marks adjacent to tokens in a manner other than that used in the gold standard. Teams can use the difference between the detection score and the recognition score to track down and address such disagreements.

The recognition score provides a way of giving credit to a system when it identifies the precise location of something requiring correction, but provides a correction that is not the same as that specified in the gold standard. All recognitions also count as detections.

The correction score gives credit for corrections that match the gold standard in both extent and content. Note that the default behaviour of the evaluation tools is to ignore any casing difference; so, for example, if the system proposes 'in' as a correction when the gold standard specifies 'In', this will be considered correct. See the manual pages for the evaluation tools for how to override this behaviour. All corrections also count as recognitions.
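
The case-insensitive comparison described above might look like the following minimal sketch. The function name and the ignore_case parameter are hypothetical; how the evaluation tools actually compare strings is not specified here, so the use of casefold() is an assumption.

```python
def corrections_match(system: str, gold: str, ignore_case: bool = True) -> bool:
    """Compare a system correction with the gold-standard correction.

    By default casing differences are ignored, mirroring the default
    behaviour described above; pass ignore_case=False for an exact match.
    """
    if ignore_case:
        return system.casefold() == gold.casefold()
    return system == gold


print(corrections_match("in", "In"))                     # True
print(corrections_match("in", "In", ignore_case=False))  # False
```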

For each score, three measures may be computed: Precision, Recall and F-score (yes, it would be more consistent if this was referred to as F-measure, but there's legacy code ...).

Scores are computed in terms of a number of counts across the data being evaluated; these counts are used in the formulae given below.

A gold edit is an edit that exists in the gold-standard annotations.
A system edit is an edit that exists in the set of annotations generated by the system.
A detected edit is a gold edit for which there exists a system edit whose extent, as indicated by its start and end offsets, overlaps that of the gold edit by at least one character. We refer to this as lenient alignment: the system has determined that something needs fixing around here, but has not identified the location strictly correctly.
A gold edit is said to be a recognized edit if there is a system edit of the appropriate type whose start and end offsets are identical to those of the gold edit. We refer to this as the two edits being strictly aligned.
A corrected edit is a recognized edit where the system proposes the same correction as is present in the gold standard.
A spurious edit is a system edit which does not align with any gold edit.
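
The definitions above can be turned into counts roughly as follows. This is a sketch under stated assumptions, not the logic of the HOO tools: the Edit class and all function names are hypothetical, end offsets are assumed to be exclusive, lenient alignment ignores the edit type, and identical offsets are treated as aligned so that a zero-width extent (assumed here for a missing-word edit) can still be detected. The offsets in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    start: int            # character offset where the edit begins
    end: int              # character offset where the edit ends (assumed exclusive)
    type: str             # one of RT, MT, UT, RD, MD, UD
    correction: str = ""  # left empty for unnecessary-word edits


def leniently_aligned(gold: Edit, system: Edit) -> bool:
    # Overlap by at least one character. Identical offsets also count
    # (an assumption), so every recognition is also a detection.
    same_span = (gold.start, gold.end) == (system.start, system.end)
    return same_span or (gold.start < system.end and system.start < gold.end)


def strictly_aligned(gold: Edit, system: Edit) -> bool:
    # Same type and identical start and end offsets.
    return (gold.type == system.type
            and (gold.start, gold.end) == (system.start, system.end))


def count_edits(gold: list[Edit], system: list[Edit]) -> dict[str, int]:
    detected = recognized = corrected = 0
    for g in gold:
        strict = [s for s in system if strictly_aligned(g, s)]
        detected += any(leniently_aligned(g, s) for s in system)
        recognized += bool(strict)
        # Casing is ignored, as in the tools' default behaviour.
        corrected += any(s.correction.lower() == g.correction.lower()
                         for s in strict)
    spurious = sum(not any(leniently_aligned(g, s) for g in gold)
                   for s in system)
    return {"gold": len(gold), "system": len(system), "detected": detected,
            "recognized": recognized, "corrected": corrected,
            "spurious": spurious}


gold = [Edit(15, 17, "RT", "in"), Edit(30, 30, "MD", "a")]
system = [Edit(15, 17, "RT", "In"),    # strictly aligned, same correction
          Edit(29, 31, "MD", "the"),   # overlaps the second gold edit only
          Edit(50, 53, "UD")]          # aligns with nothing: spurious
counts = count_edits(gold, system)
print(counts)
```

With these invented edits, both gold edits are detected, one is recognized and corrected, and one system edit is spurious.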

Note that correction requires strict alignment. Getting character offsets correct is therefore very important: see The HOO Character Counting Principles.

The three scores are computed as follows.

Detection

Precision measures the proportion of system-proposed edits that correspond to (i.e., have some overlap with) gold-standard edits, providing a penalty for spurious edits:

Precision = Detected / (Detected + Spurious)

Note that the denominator here is computed by adding the number of detected and spurious edits, rather than using the number of system edits, since it is possible that there may be multiple system edits that overlap with a given gold edit.

Recall measures the proportion of gold-standard edits that were found by the system:

Recall = Detected / Gold
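
As a worked example of the detection measures, consider one gold edit that is overlapped by two system edits, plus one further system edit that overlaps nothing. The sketch below uses hypothetical function names, and returning 0.0 for an empty denominator is an assumption rather than documented behaviour of the tools.

```python
def detection_precision(detected: int, spurious: int) -> float:
    # Detected / (Detected + Spurious); 0.0 on an empty denominator (assumed).
    total = detected + spurious
    return detected / total if total else 0.0


def detection_recall(detected: int, gold: int) -> float:
    # Detected / Gold; 0.0 when there are no gold edits (assumed).
    return detected / gold if gold else 0.0


# One gold edit, found by two overlapping system edits, and one spurious
# system edit: three system edits in total.
detected, spurious, gold = 1, 1, 1
print(detection_precision(detected, spurious))  # 0.5
print(detection_recall(detected, gold))         # 1.0
```

Dividing by the number of system edits instead would give 1/3 here, which is why the denominator is built from the detected and spurious counts.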

Recognition

For recognition, Precision and Recall are computed in the normal way.

Precision = Recognized / System

Recall = Recognized / Gold

Correction

For correction, Precision and Recall are once again computed in the normal way.

Precision = Corrected / System

Recall = Corrected / Gold
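
The recognition and correction measures share the same shape, so one hypothetical helper can cover both. Note the assumptions: this page does not give the F-score formula, so the balanced harmonic mean of Precision and Recall is assumed, and returning 0.0 for an empty denominator is likewise a choice made for this sketch.

```python
def precision_recall_f(matched: int, system: int, gold: int) -> tuple[float, float, float]:
    """Precision, Recall and F-score for recognition or correction.

    matched is the number of recognized (or corrected) edits.
    """
    precision = matched / system if system else 0.0
    recall = matched / gold if gold else 0.0
    total = precision + recall
    # Assumed: balanced F-score, the harmonic mean of Precision and Recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Invented counts: 2 gold edits, 3 system edits, 1 corrected edit.
p, r, f = precision_recall_f(matched=1, system=3, gold=2)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.333 0.5 0.4
```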

Using the HOO Eval Tools

We provide a number of tools for computing the scores described above. These are described briefly here and in more detail on the linked-to pages. All tools provide a -h option which prints a brief summary of the options available.

evalfrag

The evalfrag tool is used to evaluate the results of applying a system to a single source file or fragment:

evalfrag.py 0002GE.xml 0002MQ0.xml

This compares the gold standard set of edit structures in 0002GE.xml with the system-produced edit structures in 0002MQ0.xml. The two input files must be valid XML files. The output is in the form of an XML structure that is used by the dataset-level evaluation tools; by default, evalfrag sends its output to the standard output, but this can be directed to a file for subsequent processing by using the -o option. Here is an example output file.
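
Because evalfrag writes to the standard output by default, a calling script can capture that output itself. The following is a minimal sketch: run_to_file is a hypothetical helper, the output file name is invented, and the usage line assumes evalfrag.py is executable and on the PATH.

```python
import subprocess


def run_to_file(command: list[str], out_path: str) -> None:
    """Run a command and save whatever it writes to standard output."""
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(command, stdout=out, check=True)


# Hypothetical usage (not executed here):
# run_to_file(["evalfrag.py", "0002GE.xml", "0002MQ0.xml"], "0002.eval.xml")
```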

The evalfrag tool provides a number of command-line options; see the evalfrag manual page for full information. The default behaviour provides the correct settings for HOO 2012.

evalrun

The evalrun tool reports the results for an entire run:

evalrun.py Gold Run0

This compares the gold-standard edit sets in the Gold directory with those in the Run0 directory. The evalrun tool requires that there be the same number of files in each directory, and will generate an error message if this is not the case. Again, the output is an XML structure, sent by default to the standard output; more typically this will be directed to a file using the -o option. These results files are used by a number of tools for generating reports. Here is an example output file.
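
Since evalrun rejects directories with differing numbers of files, it can be convenient to check this before starting a run. The sketch below is an assumption-laden pre-flight check, not part of the HOO tools: the function names are hypothetical, and whether evalrun counts only XML files or also compares file names is not stated here.

```python
from pathlib import Path


def count_files(directory: str) -> int:
    # Count regular files only; subdirectories are ignored (an assumption).
    return sum(1 for entry in Path(directory).iterdir() if entry.is_file())


def same_number_of_files(gold_dir: str, run_dir: str) -> bool:
    """Return True if both directories hold the same number of files."""
    return count_files(gold_dir) == count_files(run_dir)


# Hypothetical usage with the directory names from the example above:
# if not same_number_of_files("Gold", "Run0"):
#     raise SystemExit("Gold and Run0 contain different numbers of files")
```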

The evalrun tool provides a number of command-line options; see the evalrun manual page for full information. The default behaviour provides the correct settings for HOO 2012.

References

Nicholls, D. [2003] The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In D. Archer, P. Rayson, A. Wilson, and T. McEnery, editors, Proceedings of the Corpus Linguistics 2003 Conference, pages 572–581, 28th–31st March 2003.
