Evaluation in HOO 2012
What Gets Evaluated
In HOO 2012, we are evaluating the performance of systems in detecting and correcting preposition and determiner errors. More precisely, the gold-standard data indicates cases where:
- an incorrect preposition or determiner has been used (a substitution error; RT or RD respectively);
- a preposition or determiner is present when it should not be (a spurious word error; UT or UD respectively); and
- a preposition or determiner is absent when it should be present (a missing word error; MT or MD respectively).
The set of error tags we use in HOO is based on the Cambridge University Press Error Coding System; this scheme is the copyright of Cambridge University Press and may only be used with their written permission. The coding is used by CUP to annotate the Cambridge Learner Corpus (CLC), which informs English Language Teaching materials published by CUP. The scheme is discussed in detail in [Nicholls, 2003]. The HOO coding scheme uses a different syntax from the CLC, but has essentially the same semantics.
Here are some examples of the error types we focus on:
| Error Code | Description | Example Errored Form | Corresponding Correction |
|------------|-------------|----------------------|--------------------------|
| RT | Replace preposition | When I arrived at London | When I arrived in London |
| MT | Missing preposition | I gave it John | I gave it to John |
| UT | Unnecessary preposition | I told to John that ... | I told John that ... |
| RD | Replace determiner | Have the nice day | Have a nice day |
| MD | Missing determiner | I have car | I have a car |
| UD | Unnecessary determiner | There was a lot of the traffic | There was a lot of traffic |
For each error, systems are required to:
- detect an error of the appropriate type (one of RT, MT, UT, RD, MD or UD); and
- in the case of substitution errors and missing word errors, propose the preposition or determiner that should be used.
Note that, whereas in HOO 2011 we permitted optional and multiple corrections, this flexibility is not present here; since we are using 'found data', we restrict corrections to those specified in the original annotations, where there is always and only one correction specified. As the MD example in the table above illustrates, there are likely to be circumstances where this is questionable: in this example, the missing determiner might alternatively be the.
How Scoring Works
We distinguish scores and measures. We calculate three separate scores:
- Detection: does the system determine that an edit of the specified type is required at some point in the text?
- Recognition: does the system correctly determine the extent of the source text that requires editing?
- Correction: does the system offer a correction that is identical to that provided in the gold standard?
The detection score provides a way of giving credit to a system when it identifies the need to fix something but disagrees with the offsets for the edit provided in the gold standard. This may happen for a number of reasons: in particular, there may be a genuine disagreement as to the scope of the error when multiword elements are concerned (for example, prepositions like in between), or a system's offset calculations may be based on a tokenisation regime that handles punctuation marks adjacent to tokens in a manner other than that used in the gold standard. Teams can use the difference between the detection score and the recognition score to track down and address such disagreements.
The recognition score provides a way of giving credit to a system when it identifies the precise location of something requiring correction, but provides a correction that is not the same as that specified in the gold standard. All recognitions also count as detections.
The correction score gives credit for corrections that match the gold-standard in both extent and content. Note that the default behaviour of the evaluation tools is to ignore any casing difference; so, for example, if the system proposes in as a correction when the gold-standard specifies In, this will be considered correct. See the manual pages for the evaluation tools for how to override this behaviour. All corrections also count as recognitions.
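The default case-insensitive matching of corrections described above can be sketched as follows (an illustrative fragment only; the function name is hypothetical, and the real behaviour and its override are documented on the evaluation tools' manual pages):

```python
def corrections_match(system_correction, gold_correction, ignore_case=True):
    """Compare a system correction against the gold-standard correction.

    By default, casing differences are ignored, mirroring the default
    behaviour of the HOO evaluation tools.
    """
    if ignore_case:
        return system_correction.lower() == gold_correction.lower()
    return system_correction == gold_correction

# The system proposes "in" where the gold standard specifies "In":
print(corrections_match("in", "In"))                     # True
print(corrections_match("in", "In", ignore_case=False))  # False
```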
For each score, three measures may be computed: Precision, Recall and F-score (yes, it would be more consistent if this was referred to as F-measure, but there's legacy code ...).
Scores are computed in terms of a number of counts across the data being evaluated; these counts are used in the formulae given below.
- A gold edit is an edit that exists in the gold-standard annotations.
- A system edit is an edit that exists in the set of annotations generated by the system.
- A detected edit is a gold edit for which there exists a system edit whose extent, as indicated by the start and end offsets, overlaps that of the gold edit by at least one character. We refer to this as lenient alignment: the system has determined that something needs fixing around here, but has not identified the location strictly correctly.
- A gold edit is said to be a recognized edit if there is a system edit of the appropriate type whose start and end offsets are identical to those of the gold edit. We refer to this as the two edits being strictly aligned.
- A corrected edit is a recognized edit where the system proposes the same correction as is present in the gold standard.
- A spurious edit is a system edit which does not align with any gold edit.
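The two alignment notions defined above can be sketched as follows (an illustrative fragment, not part of the HOO tools; the function names are made up), treating each edit's extent as a (start, end) pair of character offsets:

```python
def strictly_aligned(gold_extent, system_extent):
    """Strict alignment: the start and end offsets are identical."""
    return gold_extent == system_extent


def leniently_aligned(gold_extent, system_extent):
    """Lenient alignment: the extents overlap by at least one character.

    Extents are half-open (start, end) offset pairs. Note that
    zero-length extents (e.g. for missing-word edits) would need
    special handling under this definition.
    """
    g_start, g_end = gold_extent
    s_start, s_end = system_extent
    return max(g_start, s_start) < min(g_end, s_end)


# A system edit at (14, 17) overlaps a gold edit at (15, 18) by two
# characters, so it counts as a detection but not as a recognition:
print(leniently_aligned((15, 18), (14, 17)))  # True
print(strictly_aligned((15, 18), (14, 17)))   # False
```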
The three scores are computed as follows.
Precision measures the proportion of system-proposed edits that correspond to (i.e., have some overlap with) gold-standard edits, providing a penalty for spurious edits:
Precision = Detected / (Detected + Spurious)
Note that the denominator here is computed by adding the number of detected and spurious edits, rather than using the number of system edits, since it is possible that there may be multiple system edits that overlap with a given gold edit.
Recall measures the proportion of gold-standard edits that were found by the system:
Recall = Detected / Gold
For recognition, Precision and Recall are computed in the normal way.
For correction, Precision and Recall are once again computed in the normal way.
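The measures above can be sketched in a few lines of Python (a minimal illustration, not the evaluation tools themselves), using the counts defined earlier; the same formulas apply at the recognition and correction levels by substituting the recognized or corrected count for the detected count:

```python
def measures(matched, spurious, gold):
    """Compute Precision, Recall and F-score from edit counts.

    matched:  number of detected (or recognized, or corrected) gold edits
    spurious: number of system edits that align with no gold edit
    gold:     total number of edits in the gold standard
    """
    precision = matched / (matched + spurious) if matched + spurious else 0.0
    recall = matched / gold if gold else 0.0
    # F-score here is the balanced harmonic mean of precision and recall.
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Example: 8 gold edits detected, 2 spurious system edits, 10 gold edits.
p, r, f = measures(8, 2, 10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.8 0.8
```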
Using the HOO Eval Tools
We provide a number of tools for computing the scores described
above. These are described briefly here and in more detail on the
linked-to pages. All tools provide a -h option which prints a brief
summary of the options available.
The evalfrag tool is used to evaluate the results of applying a system
to a single source file or fragment:
evalfrag.py 0002GE.xml 0002MQ0.xml
This compares the gold standard set of edit structures in 0002GE.xml
with the system-produced edit structures in 0002MQ0.xml.
The two input files must be valid XML files. The output is in
the form of an XML structure that is used by the dataset-level
evaluation tools; by default, evalfrag sends its output to the
standard output, but this can be directed to a file for
subsequent processing by using the -o option.
Here is an example output file.
The evalfrag tool provides a number of command-line options; see the evalfrag manual page for full information. The default behaviour provides the correct settings for HOO 2012.
The evalrun tool reports the results for an entire run:
evalrun.py Gold Run0
This compares the gold-standard edit sets in the Gold directory with those in the Run0 directory: evalrun requires that there be the same number of files in each directory, and will generate an error message if this is not the case. Again, the output is an XML structure, sent by default to the standard output; more typically this will be directed to a file using the -o option. These results files are used by a number of tools used for generating reports. Here is an example output file.
The evalrun tool provides a number of command-line options; see the evalrun manual page for full information. The default behaviour provides the correct settings for HOO 2012.