evalfrag [-c casematching] [-h] [-m measures] [-r regime] [-t types] [-o resultsfile] goldedits sysedits


Evalfrag provides evaluation data for the edits corresponding to one text fragment. It takes two files containing edit structures and compares them; the first-specified edit file is assumed to be the standard against which the second file is evaluated. The results of the evaluation are written to the specified results file; if no results file is specified, the results are written to the standard output. This latter behaviour may be useful during development and debugging.

The contents of a typical results file are shown schematically here. The file contains the following elements:

The set of error types used in computing the counts and scores described below.
A collection of counts of various aspects of the two edit sets, used in computing the scores described below. These numbers may also be used when computing dataset-level results. Note that the system and spurious counts may not be meaningful if only a subset of the error types is being scored.
Precision, Recall and F-score values for each of detection, recognition and correction, for the types specified in this invocation of the program. These per-fragment numbers would rarely be reported in an evaluation but may be of use during development and debugging. Precision, Recall and F-score can only be computed if the data required to do so is available; if it is not, then the elements for the corresponding measures are omitted from the results file. So, for example, if a subset of the error types are being evaluated, but the system results do not provide type information, then only Recall can be computed; the Precision and F-score elements will be omitted. Note that this is not the same as having absent or zero values for Precision or F-score.
For each gold-standard edit that is of a type to be included in the results, this shows the type of that edit, whether or not it was an optional edit, and whether the system missed, detected, recognized or corrected the edit.
For each system edit that does not correspond to a gold-standard edit, this indicates the start and end positions of the edit in the text fragment, and optionally its type, if this is provided by the system. If only a subset of types are being reported, note that this element will contain all system edits that are not of the specified types. The idea is that the results file should contain as much information as possible; individual viewer tools can choose what aspects of this information to display.
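
As a rough illustration of how the scores relate to the counts, the sketch below uses the conventional definitions of Precision, Recall and balanced F-score. The function and parameter names are hypothetical, and the exact formulas evalfrag applies (in particular, how optional edits enter the counts) are not stated on this page, so treat this as an assumption rather than a description of the program.

```python
from typing import Optional


def precision(matched: int, spurious: int) -> Optional[float]:
    """Share of system edits that match a gold edit; None if there are no system edits."""
    total = matched + spurious
    return matched / total if total else None


def recall(matched: int, missed: int) -> Optional[float]:
    """Share of gold edits that the system matched; None if there are no gold edits."""
    total = matched + missed
    return matched / total if total else None


def f_score(p: Optional[float], r: Optional[float]) -> Optional[float]:
    """Balanced F-score (harmonic mean of p and r); None if it cannot be computed."""
    if p is None or r is None or p + r == 0:
        return None
    return 2 * p * r / (p + r)
```

A None result here stands for "cannot be computed", which is kept distinct from a score of zero, in the same spirit as omitting an element from the results file.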
-c casematching
Specifies whether system-provided corrections should match the casing of the gold-standard corrections in order to be considered correct; possible values are match and nomatch, with nomatch being the default value.
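
The effect of the two values can be sketched as a string comparison. This is a minimal sketch with a hypothetical function name; it assumes that nomatch means a case-insensitive comparison and does not attempt to reproduce any other normalisation the program may perform.

```python
def corrections_agree(system: str, gold: str, casematching: str = "nomatch") -> bool:
    """Compare a system correction with a gold-standard correction."""
    if casematching == "match":
        # Casing must be identical for the correction to count.
        return system == gold
    if casematching == "nomatch":
        # Casing is ignored (assumed meaning of the default value).
        return system.casefold() == gold.casefold()
    raise ValueError(f"unknown casematching value: {casematching!r}")
```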
-h
Prints out a help message and exits.
-m measures
Specifies the measures (Precision, Recall and/or F-score) that should be computed. This option is provided to cater for cases where only a subset of error types is being evaluated and the system edits do not provide type information; in such a situation, it is best to leave the behaviour of the scoring mechanism under user control, rather than have the program try to work out what measures can be computed. Possible values for -m are as follows:
  • prf: Compute all of Precision, Recall and F-score (this is the default).
  • r (or recall): Compute only Recall.
  • p (or precision): Compute only Precision.
  • pr: Compute only Precision and Recall.
Typically only the first two of these values would be used, but the others are provided for completeness.
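
The mapping from -m values to computed measures can be sketched as a lookup table. The option values are those listed above; the measure keys ("precision", "recall", "fscore") and the function names are hypothetical and are not taken from the results file format.

```python
MEASURE_SETS = {
    "prf": ("precision", "recall", "fscore"),
    "r": ("recall",),
    "recall": ("recall",),
    "p": ("precision",),
    "precision": ("precision",),
    "pr": ("precision", "recall"),
}


def select_measures(option: str = "prf") -> tuple:
    """Return the measures requested by a -m value (prf is the default)."""
    try:
        return MEASURE_SETS[option]
    except KeyError:
        raise ValueError(f"unknown -m value: {option!r}") from None


def filter_scores(scores: dict, option: str = "prf") -> dict:
    """Keep only the requested measures; the others are omitted, not zeroed."""
    wanted = select_measures(option)
    return {name: value for name, value in scores.items() if name in wanted}
```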
-o resultsfile
Specifies the file to be used to contain the results of the evaluation processing, as outlined in the DESCRIPTION above.
-r regime
Specifies the scoring regime to be used. Currently there are two regimes available: bonus and nobonus. The bonus regime treats an optional edit that is not carried out by the system as if it had been carried out; the nobonus regime ignores optional edits that are not carried out by the system. If no regime is specified, the nobonus regime is used by default. If the data contains no optional edits, the behaviour of the two regimes is identical. Other regimes may be added at a later date.
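
The difference between the two regimes can be sketched as a tally over the gold-standard edits. This is a minimal sketch under stated assumptions: each gold edit is reduced to a hypothetical (optional, handled) pair, and only the gold-side counts are shown; how the program folds these into Precision is not described here.

```python
def gold_tallies(gold_edits, regime="nobonus"):
    """Return (hits, total) over gold edits under the given regime.

    gold_edits is an iterable of (optional, handled) boolean pairs,
    where handled means the system carried out the edit.
    """
    if regime not in ("bonus", "nobonus"):
        raise ValueError(f"unknown regime: {regime!r}")
    hits = total = 0
    for optional, handled in gold_edits:
        if optional and not handled:
            if regime == "bonus":
                # Credited as if the system had carried out the edit.
                hits += 1
                total += 1
            # Under nobonus the edit is ignored entirely.
            continue
        total += 1
        if handled:
            hits += 1
    return hits, total
```

With one handled and one missed obligatory edit plus one handled and one unhandled optional edit, this tally gives 2 of 3 under nobonus and 3 of 4 under bonus; with no optional edits the two regimes agree.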
-t types
Specifies the error types to be included in the scoring; all is a special symbol indicating that all types in the gold standard should be included. If no type is specified, all is assumed. Alternatively, the -t argument may be either an abstract category specified in the types.config file, or a quoted list of comma-separated types that should be included in the scoring, as in the following examples:
  1. ... -t prep ...
  2. ... -t "RT,MT,DT,UT" ...

The latter alternative provides a finer degree of control that bypasses the aggregations in the configuration file, which may sometimes be useful for debugging. The value of the -t option will be written into the <types> element of the results file; if an aggregated category is specified, then it is this category rather than its expansion that is written to the file. All programs that access the results files expect to access the types.config file to determine the appropriate expansions.
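
The three forms of the -t argument can be sketched as follows. The format of types.config is not described on this page, so the sketch assumes its aggregations have already been loaded into a dictionary; the function name and the category contents used in the usage note are purely illustrative.

```python
def resolve_types(option, categories, gold_types):
    """Expand a -t value into the set of error types to be scored.

    categories: mapping from abstract category name to its member types
                (assumed to have been read from types.config).
    gold_types: the types that occur in the gold standard.
    """
    if option == "all":
        return set(gold_types)
    if option in categories:
        return set(categories[option])
    # Otherwise treat the value as a comma-separated list of types.
    return {t.strip() for t in option.split(",") if t.strip()}
```

For instance, with a hypothetical category table {"prep": ["RT", "MT", "DT", "UT"]}, both resolve_types("prep", ...) and resolve_types("RT,MT,DT,UT", ...) would yield the same set of four types.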
