The HOO File Naming Conventions

Source Data

We refer to the files that contain errors to be corrected as the source data. Each file may also be referred to as a fragment, since it is sometimes a fragment of a larger text. Each source data file is named with a four-digit number and has the extension .xml; for example:

0011.xml

A collection of source data files or fragments is called a dataset. The names of the files that make up a dataset do not necessarily form a contiguous numerical sequence.
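
As an illustration, a minimal sketch of how the source files in a dataset directory listing might be recognised is given below. The function names is_source_file and fragment_ids are hypothetical, and the pattern assumes that the four-digit number uses ASCII digits only.

```python
import re

# Assumed pattern: exactly four ASCII digits followed by ".xml".
SOURCE_NAME = re.compile(r"([0-9]{4})\.xml")


def is_source_file(name: str) -> bool:
    """Return True if name looks like a source data file, e.g. 0011.xml."""
    return SOURCE_NAME.fullmatch(name) is not None


def fragment_ids(names):
    """Collect the fragment IDs of all source files among names, sorted.

    The IDs need not form a contiguous sequence, so nothing is inferred
    about missing numbers.
    """
    ids = []
    for name in names:
        match = SOURCE_NAME.fullmatch(name)
        if match:
            ids.append(match.group(1))
    return sorted(ids)
```

Because the numbers in a dataset need not be contiguous, the sketch collects only the IDs that are actually present rather than iterating over a numeric range.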

Gold-Standard Edit Structures

Corresponding to each source data file, there is also a file containing, in stand-off markup format, the set of target edits for that file, represented as edit structures. The names of the gold-standard edit files are produced by appending the characters GE (for Gold Edit) to the base filename. So, for source file 0011.xml, the corresponding gold-standard edits file is 0011GE.xml.
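
The mapping from a source filename to its gold-standard edits filename can be sketched as follows. The function name gold_edits_filename is hypothetical; the sketch simply inserts GE between the base filename and the extension.

```python
from pathlib import PurePath


def gold_edits_filename(source_name: str) -> str:
    """Derive the gold-standard edits filename for a source file.

    The characters GE are appended to the base filename, and the
    original extension is kept.
    """
    path = PurePath(source_name)
    return f"{path.stem}GE{path.suffix}"


print(gold_edits_filename("0011.xml"))  # prints 0011GE.xml
```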

System Outputs

Participating systems should deliver their results in the form of sets of edit structures, with one file of edit structures corresponding to each source file. These files will be compared against the corresponding gold-standard edits files.

Each team will be assigned a two-character identification code, which should be incorporated into system output filenames to allow easy identification and tracking. Each team may also submit up to 10 distinct runs, so that different configurations can be used. A single digit from 0 to 9 identifies a run; this is also incorporated into output filenames. The filename for a given fragment thus has the following form:

⟨FragmentID⟩⟨TeamID⟩⟨Run⟩.xml

So, for example, the output file for our example fragment 0011, for Run 3 of the team whose ID is MQ, will be named

0011MQ3.xml

If only one run is submitted, this should be numbered as Run 0.
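
The output filename scheme can be sketched as a pair of helper functions, one that builds a filename and one that splits it back into its parts. The names output_filename and parse_output_filename are hypothetical, and the sketch assumes that a team ID consists of two ASCII letters or digits, which the conventions above do not state explicitly.

```python
import re

# Assumptions: fragment ID is four ASCII digits, team ID is two ASCII
# letters or digits, run is a single digit from 0 to 9.
OUTPUT_NAME = re.compile(r"([0-9]{4})([A-Za-z0-9]{2})([0-9])\.xml")


def output_filename(fragment_id: str, team_id: str, run: int = 0) -> str:
    """Build a system output filename; run defaults to 0 for a single run."""
    name = f"{fragment_id}{team_id}{run}.xml"
    if OUTPUT_NAME.fullmatch(name) is None:
        raise ValueError(f"invalid output filename components: {name!r}")
    return name


def parse_output_filename(name: str):
    """Split a system output filename into (fragment_id, team_id, run)."""
    match = OUTPUT_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a system output filename: {name!r}")
    fragment_id, team_id, run = match.groups()
    return fragment_id, team_id, int(run)


print(output_filename("0011", "MQ", 3))  # prints 0011MQ3.xml
```

Since every component has a fixed width, a filename can be split unambiguously without separators; a gold-standard edits file such as 0011GE.xml is rejected by the parser because it lacks the run digit.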


