The HOO 2012 Data
This page provides information on the data to be used in the HOO 2012 Shared Task. The original data format used in HOO 2011 is motivated and described in laborious detail in The HOO Pilot Data Set: Notes on Release 2.0; some minor changes have been made to that specification for HOO 2012. This page summarises the relevant information. Also linked to from this page are a number of pages that explain aspects of the format and the evaluation mechanism in more detail:
- The HOO File Naming Conventions: Describes the conventions used in naming the data files provided by us, and the file naming conventions we require participants to adhere to for evaluation purposes.
- The HOO 2012 Error Type Definitions: Lists and defines the error types being used in HOO 2012.
- The HOO Character Counting Principles: Evaluation is based on locating errors at the correct positions in source files, and so it's important that participating teams calculate these correctly; this page explains the way in which these positions are calculated.
- Evaluation in HOO 2012: This page summarises the approach taken to evaluation.
It should not be necessary to read the earlier more elaborate Pilot Data Set specification to make sense of these pages, but if you find anything here that is unclear, please don't hesitate to alert the organisers.
The HOO 2012 Training Data
The training data for the HOO 2012 Shared Task is derived from exam scripts written by students sitting the Cambridge ESOL First Certificate in English (FCE) examination in 2000 and 2001. This data is a subset of the CLC FCE Dataset, kindly provided by Cambridge ESOL and Cambridge University Press; see [Yannakoudakis et al 2011] for more background on this data.
For HOO 2012, we are using 1000 exam scripts drawn from this corpus as training data. Each script has been reformatted to enable its use with the HOO Evaluation Tools. For each script, we provide two files: one contains the text written by the student, with no indication of the location of errors; the second contains, in standoff form, a set of corrections to perceived errors in that text. We refer to the first as the source data, and to the second as the corresponding edit set; the format of both of these is described further below.
For participating teams, the objective of the exercise is to build systems which can detect and correct the errors in the source data, and report these corrections in the form of edit sets that can be compared against the gold standard.
At the time of writing, we are in negotiation with Cambridge University Press with the aim of obtaining a further release of previously unseen data for use in testing. It is possible that the issues here will not be resolved in time for the scheduled release of the test data. If this is the case, our intention is to use another subset of the already-public Cambridge FCE Dataset (disjoint from the training data) as test data. Accordingly, for training purposes, we ask participants to make use only of that subset of the FCE Dataset that we make available in HOO format, and not to use the data provided on the FCE site linked to above. Teams will be expected to provide in their reports appropriately informative summaries of the datasets they have used in developing their systems.
The HOO 2012 Source Data Format
As noted above, the texts used in the HOO 2012 Shared Task are derived from exam scripts kindly provided by Cambridge ESOL and Cambridge University Press. The original texts have been converted into the HOO format for use with our edit handling and evaluation tools. An example is provided here.
- Each text is provided as an XML file, whose outermost tag is <HOO>, with the attribute VERSION="2.1".
- Each file contains two constituent elements: <HEAD> and <BODY>.
- The <HEAD> contains metadata, currently limited to information about the author of the exam script, encoded in the <CANDIDATE> element. This element contains the elements <LANGUAGE> and <AGE>, both of which are copied from the original data file: the <LANGUAGE> element indicates the native language of the author, and the <AGE> indicates the age of the author by means of an age range. This information is provided in case participants want to make use of it in their data modelling.
- The original exam scripts contain answers to a series of independent questions posed to the subject. To enable participants to make controlled use of textual context, we want to retain this structure, and so the <BODY> contains one or more <PART> elements, each of which corresponds to the contents of a separate <CODED_ANSWER> element in the original data. Each <PART> element has an ID attribute with a value that is unique within the scope of the present file.
- Each <PART> element contains one or more <P> (paragraph) elements. Sentences are not tagged.
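As an illustration of the structure just described, the following sketch parses a minimal source file with Python's standard ElementTree. Only the tag and attribute names come from the specification above; the candidate details and the text content are invented for illustration.

```python
# Sketch: reading a minimal HOO 2012 source file. Tag names (HOO, HEAD,
# CANDIDATE, BODY, PART, P) follow the specification above; the sample
# content itself is invented.
import xml.etree.ElementTree as ET

sample = """<HOO VERSION="2.1">
  <HEAD>
    <CANDIDATE>
      <LANGUAGE>French</LANGUAGE>
      <AGE>21-25</AGE>
    </CANDIDATE>
  </HEAD>
  <BODY>
    <PART ID="1">
      <P>Dear Sir, I am writing to complain.</P>
      <P>Please refund my money.</P>
    </PART>
  </BODY>
</HOO>"""

root = ET.fromstring(sample)
language = root.findtext("HEAD/CANDIDATE/LANGUAGE")
for part in root.findall("BODY/PART"):
    # Each PART corresponds to one answer; sentences are not tagged.
    paragraphs = [p.text for p in part.findall("P")]
    print(part.get("ID"), language, paragraphs)
```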
The above describes the form of the source data that is provided for both training and testing. Errors and their corrections are not marked in these files; for the training phase, this information is provided in the form of edit structures, described below.
The Format of HOO Edit Structures
Systems will be evaluated by comparing the edit structures they produce against the gold-standard edit structures for the corresponding files. At this point we require participants to provide their results in the form of stand-off edit structures rather than as inline annotations or corrected texts, although if there is a demand for an extraction tool that can generate edit structures from corrected texts, we will explore whether we have the resources required to provide such a tool.
Teams should provide a set of edit structures for each HOO source file to be corrected; each set of edit structures should be provided in a separate file, with the name of that file generated in accordance with the HOO File Naming Conventions. Each edit set should consist of an <EDITS> element with a file attribute that specifies the base filename of the corresponding source data file; the <EDITS> element then contains a set of <EDIT> elements, each of which is an edit structure. Here is an example of an edit structure:
<edit type="RT" file="0098" part="1" index="0006" start="771" end="772">
The complete set of edit structures corresponding to the sample source file linked to above is here. To explain each of the attributes of the <edit> element shown above:
- Here, the type attribute value RT means 'incorrect preposition'; see the HOO 2012 Error Type Definitions for the different possible values of the type attribute used in the HOO 2012 Shared Task. Edit structures produced by participating systems should assign a type to each identified edit; absence of a type will be treated in the same way as the incorrect assignment of a type.
- The file attribute specifies the base filename of the source file which is host to the error being corrected. Although this is derivable from the name of the file that contains the edit structure, its presence as an explicit attribute is convenient for debugging. Participating systems should ensure that the correct value is provided for this attribute.
- As noted above, a source file may contain multiple parts, where each part corresponds to a distinct answer within the exam script. The part attribute identifies the part of the file that is host to the error being corrected. Participating systems should ensure that the correct value is provided for this attribute.
- The index attribute is an identifier for the gold-standard edit, unique within the scope of the part of the file that contains the edit. In the gold-standard data, the index specifies the ordinal position of the edit in the sequence of edits in this part of the file. So, the edit structure above corresponds to the sixth edit in the first part of the file named 0098. Since any given participating system may miss errors or spuriously identify errors, system-produced index values cannot be compared with gold-standard index values, and are used solely for debugging purposes; participating systems may choose any form of alphanumeric sequence considered useful.
- The start and end attributes indicate the character positions of the word or words deemed to be in error. These are character counts across the textual content, i.e., only material provided between <P> and </P> tags contributes to offset counting. Offsets are relative to a specified part of the file, which is to say that the offset counts restart at 0 for each part. XML tags are not included in calculating these offsets; note also that, depending on the tools you are using, you may see spaces and linefeeds between paragraphs, but these do not count in the offsets. Edit structures produced by participating systems must provide start and end offsets, since these are central to matching up edits for evaluation; so it's really important to get these right. See the HOO Character Counting Principles for a more thorough explanation.
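The offset conventions above can be sketched as follows. The part content and the target word here are invented, but the counting rules follow the description: only the character content of <P> elements counts, tags and inter-paragraph whitespace are excluded, and counting restarts at 0 for each part.

```python
# Sketch of the HOO offset convention: offsets count only the character
# content of <P> elements within one <PART>, ignoring tags and any
# whitespace between paragraphs. Sample text is invented.
import xml.etree.ElementTree as ET

part_xml = """<PART ID="1">
  <P>He waited in the station.</P>
  <P>Then the train arrived at platform two.</P>
</PART>"""

part = ET.fromstring(part_xml)
# Concatenate paragraph contents only; inter-paragraph whitespace
# and the tags themselves contribute nothing to the counts.
text = "".join(p.text for p in part.findall("P"))

# Derive start/end offsets for a hypothetical edit on the word "train".
start = text.index("train")
end = start + len("train")
print(start, end, text[start:end])
```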
The rest of the edit structure is contained in two elements, <original> and <corrections>. The <original> element specifies the original text string whose correction is the subject of this edit. The <corrections> element allows for the specification of multiple <correction> elements, although for the HOO 2012 Shared Task, only a single correction will be provided in each edit structure. The content of these elements depends on whether the correction required involves replacement (substitution), deletion or insertion:
- If the edit is a replacement, a sequence of one or more words is specified as a substitution for the word(s) identified as incorrect in the original text. The edit structure shown above is an example of this where both the original text and the correction contain only one word; however, in principle, either or both may contain a sequence of contiguous words.
- If the edit is a deletion (i.e., there is no corresponding correction), then the original text to be deleted should include an associated space character; here is an example, with some <edit> attributes omitted for simplicity:
<edit type="UD">
  <original>the </original>
  <corrections>
    <correction><empty/></correction>
  </corrections>
</edit>
Note the space immediately following the string the in the <original> element. The principle here is that the edit, when applied to the text, should leave the text appropriately punctuated; failing to delete a space when a word is deleted would leave two consecutive spaces in the text. By convention, the space following the word is deleted, except in cases where immediately adjacent punctuation means there is no following space (for example, the last word in a sentence); in such cases, the preceding space is deleted instead.
Note that in the case of deletions there is no replacement text, and so the content of the <correction> element should be the tag <empty/>, as in the example above. It is not valid to simply have null content here, since this has a special meaning in the context of optional corrections: although these are not used in this HOO round, the mechanism is retained for other tasks.
- If the edit is an insertion, so that there is no original text, the content of the <original> element should be the tag <empty/>. Here is an example, again with some <edit> attributes omitted for simplicity:
<edit type="MD">
  <original><empty/></original>
  <corrections>
    <correction>the </correction>
  </corrections>
</edit>
Note that the new text should include an associated space. As in the case of deletions, the principle here is that the edit, when applied to the text, should leave the text appropriately punctuated; failing to include a space when a word is inserted would result in the inserted word being abutted to one of the adjacent words in the text. By convention, the new space is provided after the word to be inserted, except in cases where there is an immediately following punctuation character that needs to be abutted to the word, in which case the space goes before the inserted word.
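The spacing conventions for deletions and insertions amount to a simple span replacement over the part text. In the sketch below, the apply_edit helper and the sample sentences are hypothetical illustrations, not part of the HOO tools; offsets are hand-computed for the invented sentences.

```python
# Sketch of applying edits under the spacing conventions described above.
# apply_edit and the sample sentences are illustrative only.

def apply_edit(text, start, end, replacement):
    """Replace text[start:end] with replacement; an empty replacement is a
    deletion, and start == end makes it an insertion."""
    return text[:start] + replacement + text[end:]

# Deletion of an unnecessary word: the original span "the " includes the
# trailing space, so no double space is left behind.
deleted = apply_edit("I like the both cakes.", 7, 11, "")

# Insertion of missing words: the inserted string carries its own trailing
# space so it does not abut the following word.
inserted = apply_edit("She went cinema.", 9, 9, "to the ")

print(deleted)
print(inserted)
```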
See the page on Evaluation in HOO 2012 for information on how the content of edit structures is evaluated.
Some of the complexities present in the definition of edit structures provided in The HOO Pilot Data Set: Notes on Release 2.0 are not required for the HOO 2012 Shared Task. In particular, we will not be making use of optional corrections or multiple corrections, and we will not be making use of consistency sets; see Section 3 in the Pilot Data release notes for an explanation of these features. The brief description of edit structures provided here should therefore be sufficient to understand what systems should generate.