HOO (Helping Our Own) is an ongoing shared task in text correction, whose pilot run took place as part of the 2011 Generation Challenges. With an increasing number of papers in natural language processing being authored by non-native English speakers (NNSs), we think it's time the community provided more support for those authors. As a field that works on computational techniques for processing text, we're in a better position than most to do something useful, so the aim of this shared task is to promote the use of NLP tools and techniques to help improve the textual quality of papers written by NNSs in the field. Along the way, we hope that the technologies developed will also be useful to native speakers writing papers about NLP.
We follow the usual shared-task methodology: we define the task in some detail and prepare datasets of manually corrected texts, one set for participants to use for developing their algorithms, and another set to be used for evaluation. We announce a schedule and encourage participation by making available the development data and some associated software. When the development period ends, participants are given a short period to download the evaluation dataset, process it with their tools, and return the output, which is scored against the manual 'gold standard'. Finally, we hold a workshop to present findings, compare methods, and plan the way forward.
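To give a feel for what scoring system output against a gold standard can look like, here is a minimal sketch in Python. It is an illustration only, not the task's official scorer: the `Edit` type, the `score` function, and the choice of exact-match precision, recall, and F-score over character-offset edits are all assumptions made for this example.

```python
from typing import Dict, NamedTuple, Set


class Edit(NamedTuple):
    """A hypothetical correction: replace text[start:end] with `correction`."""
    start: int
    end: int
    correction: str


def score(system: Set[Edit], gold: Set[Edit]) -> Dict[str, float]:
    """Compare system edits with gold edits using exact matching.

    An edit counts as correct only if its span and its replacement text
    both match a gold edit exactly (an assumption of this sketch).
    """
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall > 0:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Invented example data: three gold corrections, two system proposals,
# only one of which matches a gold correction exactly.
gold = {Edit(0, 3, "The"), Edit(10, 12, "in"), Edit(20, 20, "a ")}
system = {Edit(0, 3, "The"), Edit(10, 12, "on")}
result = score(system, gold)
```

Here the system finds the right span for the second edit but proposes the wrong replacement, so under exact matching it earns no credit for it. A more forgiving scheme could score detecting the error separately from correcting it.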
For the 2011 pilot run of the shared task, the development dataset is now available via the link at the top right. It consists of 1000-word excerpts of text from papers that have been graciously contributed to the project by their authors, each marked up with corrections. A quick look at this data will convince you that this task, which we might think of as 'domain-and-register-specific error correction', contrasts in interesting ways with 'vanilla', general-purpose error correction of the kind carried out by tools like Microsoft Word's grammar checker.
We think that the ACL Anthology Reference Corpus may be a particularly useful language resource for this task, as it embodies the target text type (though not without errors). It is available in a user-friendly corpus interface here.
HOO has been financially supported by the Generation Challenges project and Macquarie University's Centre for Language Technology. The first pilot round of the task was reported on at the 2011 European Workshop on Natural Language Generation; you can download copies of the relevant reports here.