The HOO Character Counting Principles

Evaluation in HOO consists in determining whether a system has detected and corrected the edits that are indicated in the gold standard. That means we need an unambiguous way of identifying edits in text. We do this by making use of character offsets: participating systems are required to indicate the character offsets of the edits they propose, and these locations are matched against the offsets associated with gold-standard edits. We use character offsets rather than a token-based counting method since the HOO scheme is intended to generalise to cases such as automatic punctuation correction where character-level precision is required.

The way in which character offsets are calculated is therefore very important, and there are some arguably odd artefacts that arise from this way of doing things. This page explains the conventions we have adopted. Participating systems should ensure that they generate the character offset attributes in edit structures in accordance with these guidelines.

Character Encoding

  1. We assume texts are encoded in UTF-8. This means that some characters may occupy multiple bytes, and so byte counts are not the same as character counts. Take this into account when calculating offsets!
  2. We assume Unix-style line endings, i.e. each line is terminated by a single linefeed (LF, generally notated as \n) character. If you are developing your system on a Windows machine, note that the standard DOS line ending consists of a carriage return followed by a linefeed (\r\n). If you are developing on a Mac, note that Mac OS 9 and earlier represent line endings using a single carriage return character (\r), while Mac OS X and later use Unix line endings.
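
As a rough illustration of the two points above, the following Python sketch decodes UTF-8 bytes before counting and converts DOS or old Mac line endings to linefeeds. The sample string and the helper name normalise_newlines are made up for this example, and it assumes that a "character" corresponds to a Unicode code point, which is what Python's len() counts on a decoded string.

```python
# Illustrative sketch only: the sample text and helper name are assumptions.

def normalise_newlines(text: str) -> str:
    # Convert DOS (CR LF) first, then any remaining lone CR (old Mac), to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")

# "cafe" and "naive" with accented letters, separated by a DOS line ending.
sample = "caf\u00e9\r\nna\u00efve"
raw = sample.encode("utf-8")

# Decode first, then count: len() on bytes counts bytes, on str code points.
decoded = normalise_newlines(raw.decode("utf-8"))

print(len(raw))      # 13 bytes (each accented letter takes two bytes)
print(len(decoded))  # 10 characters after normalising the line ending
```

Offsets computed on the raw bytes, or before the line endings are normalised, would drift away from the intended positions after the first multi-byte character or the first \r\n.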

Character Offsets

The extent of a piece of text to be edited is defined by its start and end positions in the textual content of a source <PART> element; these are counted in integer values from the beginning of the text content, starting at 0. Note that a source file typically contains multiple <PART> elements: offsets are always relative to the content of a particular <PART> element, with the part in question being identified by means of the part attribute of the edit structure.

Each printable character, including spaces and linefeeds, contributes to the calculation of offsets; however, any XML tags in the text do not contribute to these counts. In the following text, the extent covering the word The has start position 0 and end position 3:

<p>The good life never ever ends. ...
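
One possible way to obtain the text content that offsets refer to is to parse the <PART> element and concatenate its text nodes, so that tags never enter the count. The sketch below uses Python's standard xml.etree.ElementTree; the source fragment, including the id attribute on <PART>, is a hypothetical stand-in for a real source file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the id attribute and closing tags are assumed here.
source = '<PART id="1"><p>The good life never ever ends.</p></PART>'

part = ET.fromstring(source)

# itertext() yields only the text nodes, so tags such as <p> are skipped.
content = "".join(part.itertext())

start, end = 0, 3
print(content[start:end])  # The
```

Note that an XML parser also decodes entity references into single characters; whether that matches the intended counting for a given data set is something to check against the gold standard rather than assume.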

Our use of character offsets to determine edit locations has a significant impact on how replacements, deletions and insertions are specified.

Replacements

If we wanted to specify that the The here should be replaced by an A, the corresponding edit structure would then look like the following:

<edit type="RD" start="0" end="3">
    <original>The</original>
    <corrections>
        <correction>A</correction>
    </corrections>
</edit>

The important point to note here is that the <p> tag is not included in the calculation of the offsets. As a shorthand, we write this extent as [0:3]. Because the replacement sequence of characters may not be the same length as the original sequence, we always indicate extents by means of offsets into the original uncorrected text, not into a version of the text that has been changed to accommodate corrections.
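
Because offsets always refer to the original text, a program that applies several edits has to avoid shifting positions as it goes. A minimal sketch of one way to do this, assuming non-overlapping edits represented as hypothetical (start, end, correction) tuples, is to apply them from right to left:

```python
from typing import List, Tuple

# Hypothetical representation of an edit: (start, end, correction).
Edit = Tuple[int, int, str]

def apply_edits(original: str, edits: List[Edit]) -> str:
    # Apply the rightmost edit first, so that the offsets of edits further
    # left, which refer to the original text, remain valid throughout.
    result = original
    for start, end, correction in sorted(edits, reverse=True):
        result = result[:start] + correction + result[end:]
    return result

text = "The good life never ever ends."

print(apply_edits(text, [(0, 3, "A")]))
# A good life never ever ends.

# A second, purely illustrative edit at [9:13] is still located correctly
# even though the first edit shortens the text by two characters.
print(apply_edits(text, [(0, 3, "A"), (9, 13, "time")]))
# A good time never ever ends.
```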

Deletions

Here's a complicating artefact of our approach to locating edits: suppose, instead of replacing The by A, the required edit is to remove the The. In this case, we also need to remove the following space, and so the corresponding edit structure looks like this:

<edit type="UD" start="0" end="4">
    <original>The </original>
    <corrections>
        <correction><empty/></correction>
    </corrections>
</edit>

Note that there are two differences here between the replacement and deletion cases: the offsets are different, and the text in the <original> element in the case of deletion includes the space to be removed.
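
The effect of including the space can be seen by comparing the two candidate extents directly. This is only a sketch; the helper name delete_extent is invented for the example.

```python
def delete_extent(original: str, start: int, end: int) -> str:
    # Remove the characters in the half-open extent [start:end].
    return original[:start] + original[end:]

text = "The good life never ever ends."

print(repr(delete_extent(text, 0, 3)))  # ' good life never ever ends.'
print(repr(delete_extent(text, 0, 4)))  # 'good life never ever ends.'
```

Deleting only [0:3] leaves a stray leading space, which is why the deletion is specified as [0:4].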

Insertions

In the limiting case of an insertion, a character sequence is inserted at an extent with zero length. Suppose we want to change the text above to read as follows:

<p>The very good life never ever ends. ...

Then the extent of the insertion is [4:4]. Note that the insertion here includes a space, so that the resulting text maintains conventional interword separation. The corresponding edit structure is therefore as follows:

<edit type="MJ" start="4" end="4">
    <original><empty/></original>
    <corrections>
        <correction>very </correction>
    </corrections>
</edit>

Note that the same net change to the text could be achieved by means of the following edit structure, where the additional required space is instead inserted before the new word:

<edit type="MJ" start="3" end="3">
    <original><empty/></original>
    <corrections>
        <correction> very</correction>
    </corrections>
</edit>

However, the current scoring mechanism will not consider this to be correct, since the offsets do not match those of the gold standard. Our default convention in the case of insertions is to add the additional required space after the word rather than before it, except in those cases where an immediately following punctuation mark (such as a full stop) requires the space to be before the word. Consider the following two (made-up) examples of sentences with missing prepositions:

(1) I chose Edinburgh to go school.
(2) I chose Edinburgh to go.

Suppose that the corrected forms are as follows:

(1a) I chose Edinburgh to go to school.
(2a) I chose Edinburgh to go to.

Assuming that the text begins with the sentence in question, in the first case the insertion, consisting of the word to plus a following space, will be made at character offset 24; but in the second case, the insertion consists of the word to with a preceding space, so the insertion will be made at character offset 23. A similar phenomenon will occur when material is deleted at the end of a sentence: the space preceding the word(s) to be deleted must also be deleted, whereas normally we delete the space after the word(s). This is another artefact of our approach to locating edits.
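
The offsets quoted above can be checked with a few lines of Python. The sketch below also shows why the placement of the space matters: inserting " to" at offset 23 in the first sentence gives the same corrected text as inserting "to " at offset 24, yet the two edits have different extents. The tuple comparison at the end is only a stand-in for offset matching, not the actual HOO scorer, and insert_at is a hypothetical helper.

```python
def insert_at(original: str, offset: int, inserted: str) -> str:
    # An insertion is an edit whose extent [offset:offset] has zero length.
    return original[:offset] + inserted + original[offset:]

s1 = "I chose Edinburgh to go school."
s2 = "I chose Edinburgh to go."

offset1 = s1.index("school")  # word following the gap
offset2 = s2.index(".")       # full stop following the gap
print(offset1, offset2)       # 24 23

print(insert_at(s1, offset1, "to "))  # I chose Edinburgh to go to school.
print(insert_at(s2, offset2, " to"))  # I chose Edinburgh to go to.

# Same net change to the first sentence, but a different extent.
alternative = insert_at(s1, 23, " to")
same_text = alternative == insert_at(s1, 24, "to ")
same_extent = (23, 23) == (24, 24)
print(same_text, same_extent)  # True False
```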


