Unfortunately, all the cited tools lack the capability of dealing properly with levels of editing for tokens (words and punctuation marks) and an integrated environment for the whole process of editing. Thus, in spite of their amazing features, none of them was sufficiently suitable, especially concerning spelling modernization and normalization of graphematic aspects. In fact, this is expected, because the tools are intended for broader purposes.
These limitations motivated the design and development of a tool, E-Dictor, in which the need for a WYSIWYG interface is combined with a second goal, i.e., integrating the tasks of the whole process, which would then be performed inside the same environment, with any necessary external tools being called transparently by the system.
E-Dictor has been developed in Python and currently has versions for both Linux and Windows (XP/Vista/7) platforms. A version for MacOS is planned for the future. The tool is currently at version 1.0 beta (not stable).
As shown in Figure 1, the main interface has an application menu, a toolbar, a content area (divided into tabs: Transcription, Editing, and Morphology), and buttons to navigate through pages. The tabs follow the flow of the encoding process. Many aspects of the functionality described in the following are determined by the application preferences.
In the 'Transcription' tab, the original text is transcribed "as is" (the user can view the facsimile image while transcribing the text). Through a menu option, E-Dictor will automatically apply an XML structure to the text, "guessing" its internal structure as well as it can. Then, in the 'Editing' tab, the user can edit any token or structural element (e.g., paragraph). Finally, in the 'Morphology' tab, tokens and part-of-speech tags are displayed in token/TAG format, so they can be reviewed.
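The token/TAG format used in the 'Morphology' tab can be illustrated with a minimal sketch. The function names and the tags (`D`, `N`) below are assumptions made for illustration; they are not taken from E-Dictor's code or from the tagset used in CTB.

```python
def parse_tagged(text):
    """Split whitespace-separated token/TAG items into (token, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash so that a token containing "/" is preserved.
        token, sep, tag = item.rpartition("/")
        if not sep:
            # No slash at all: treat the item as an untagged token.
            token, tag = item, ""
        pairs.append((token, tag))
    return pairs


def format_tagged(pairs):
    """Render (token, tag) pairs back into token/TAG format."""
    return " ".join(f"{token}/{tag}" if tag else token for token, tag in pairs)


# Hypothetical tagged sentence; the punctuation mark is tagged with itself.
pairs = parse_tagged("o/D menino/N ./.")
round_trip = format_tagged(pairs)
```

Splitting on the last slash is a design choice of this sketch: it keeps the round trip lossless even when the token itself contains a slash.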
The specified XML structure meets two main goals: (i) be as neutral as possible (in relation to the textual content encoded) and (ii) suit philological and linguistic needs, i.e., the editing must be simple and efficient without losing information relevant to philological studies. In the context of CTB, a structure was initially established to encode the following information:
* Metadata: information about the source text, e.g., author information, state of processing, etc.
* Delimitation of sections, pages, paragraphs, sentences, headers and footers, and tokens.
* Class of tokens (part-of-speech tags) and phonological form for some tokens.
* Types (levels) of editing for each token.
* Editor's comments.
* Subtypes for some text elements, like sections, paragraphs, sentences and tokens (e.g., a section of type "prologue").
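To make the list above more concrete, the following sketch builds a small document with Python's standard `xml.etree.ElementTree` module. All element and attribute names (`document`, `metadata`, `field`, `section`, `token`, `original`, `edit`, `level`, `pos`, `comment`), the level name `expansion`, and the tags `ADJ` and `NPR` are hypothetical; the actual E-Dictor schema is not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_token(original, pos=None, edits=None, comment=None):
    """Create a hypothetical <token> holding the original form and its edits."""
    token = ET.Element("token")
    if pos:
        token.set("pos", pos)  # class of the token (part-of-speech tag)
    ET.SubElement(token, "original").text = original
    for level, form in (edits or {}).items():
        # One <edit> child per level of editing applied to this token.
        ET.SubElement(token, "edit", {"level": level}).text = form
    if comment:
        ET.SubElement(token, "comment").text = comment  # editor's comment
    return token


def reading(token, level=None):
    """Return the form at the given editing level, else the original form."""
    if level is not None:
        for edit in token.findall("edit"):
            if edit.get("level") == level:
                return edit.text
    return token.findtext("original")


# Assumed layout: metadata, then section > paragraph > sentence > token.
document = ET.Element("document")
metadata = ET.SubElement(document, "metadata")
ET.SubElement(metadata, "field", {"name": "author"}).text = "unknown"
section = ET.SubElement(document, "section", {"type": "prologue"})
paragraph = ET.SubElement(section, "paragraph")
sentence = ET.SubElement(paragraph, "sentence")
sentence.append(build_token("Ex.mo", pos="ADJ", edits={"expansion": "Excelentíssimo"}))
sentence.append(build_token("Sr."))
sentence.append(build_token("Duque", pos="NPR"))

diplomatic = " ".join(reading(t) for t in sentence.iter("token"))
expanded = " ".join(reading(t, "expansion") for t in sentence.iter("token"))
```

Because every token keeps its original form alongside its edited forms, the same file can yield both a diplomatic reading and an edited reading, which is the sense in which no philologically relevant information is lost.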
A key goal of E-Dictor is to be flexible enough to be useful in other contexts of corpus building. To achieve this, the user can customize the "preferences" of the application. The most prominent options are the levels of editing for tokens; the subtypes for the elements 'section', 'paragraph', 'sentence', and 'token'; and the list of POS tags to be used in the morphological analysis. Finally, in the 'Metadata' tab, the user can create the metadata fields needed for his/her project.
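As an illustration of how such preferences might be represented, the sketch below groups the customizable options into one data structure. The class name, field names, and sample values are all assumptions; they do not describe how E-Dictor actually stores its preferences.

```python
from dataclasses import dataclass, field


@dataclass
class Preferences:
    """Hypothetical container for the project-specific options."""
    edit_levels: list = field(default_factory=list)      # levels of editing for tokens
    subtypes: dict = field(default_factory=dict)         # element name -> allowed subtypes
    pos_tags: list = field(default_factory=list)         # tags for morphological analysis
    metadata_fields: list = field(default_factory=list)  # user-defined metadata fields

    def allows_level(self, level):
        """True if the editing level is defined for this project."""
        return level in self.edit_levels

    def allows_subtype(self, element, subtype):
        """True if the subtype is defined for the given element."""
        return subtype in self.subtypes.get(element, [])


# Sample values are invented for the example.
prefs = Preferences(
    edit_levels=["expansion", "modernization"],
    subtypes={"section": ["prologue"], "token": ["abbreviation"]},
    pos_tags=["N", "D", "VB"],
    metadata_fields=["author", "date"],
)
```

Keeping these lists in the preferences rather than in the code is what would let a different corpus project reuse the tool with its own levels, subtypes, and tagset.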
Through its menu, E-Dictor provides some common options (e.g., Save As, Search & Replace, Copy & Paste, and many others) as well as particular options intended for the encoding process (XML structure generation, automatic POS tagging, etc.). E-Dictor also provides an option for exporting the encoded text and the lexicon of edits in two different formats (HTML and TXT/CSV).
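A CSV export of the lexicon could look roughly like the sketch below, which relies only on Python's standard `csv`, `io`, and `collections` modules. The column layout, the function name, and the sample edits are assumptions; E-Dictor's actual export format is not documented here.

```python
import csv
import io
from collections import Counter


def lexicon_to_csv(edits):
    """Serialize (original, level, edited) triples as CSV with occurrence counts."""
    counts = Counter(edits)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["original", "level", "edited", "count"])
    # Sort the entries so that the output is stable across runs.
    for (original, level, edited), n in sorted(counts.items()):
        writer.writerow([original, level, edited, n])
    return buffer.getvalue()


# Invented sample of edits collected from a text.
sample_edits = [
    ("hum", "modernization", "um"),
    ("he", "modernization", "é"),
    ("hum", "modernization", "um"),
]
csv_text = lexicon_to_csv(sample_edits)
```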
To conclude this section, we offer a brief comment about token (words and punctuation) editing, which is the main feature of E-Dictor. The relevant interface is shown in Figure 2. When a token is selected, the user can: (i) in the "Properties" panel, specify the type of the token (according to the subtypes defined by the preferences), its foreign language, and format (bold, italic, and underlined); (ii) in the "Editing" panel, specify some other properties (e.g., phonological form) of the token and include editing levels (according to the levels defined by the preferences).
For each token, the user must click on "Apply changes" to apply (all) the edits made to it. The option "Replace all" tells E-Dictor to repeat the operation over all identical tokens in the remainder of the text (a similar functionality is available for POS tags).
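The behavior of "Replace all" can be sketched as follows, under the assumption that the text is held as a list of token records and that "identical" means having the same original form. The record layout and the function name are hypothetical.

```python
def replace_all(tokens, start, level, edited):
    """Apply an edit to tokens[start] and to every identical token after it.

    Each token is a dict of the form {"original": str, "edits": {level: form}}.
    Returns the number of tokens that received the edit.
    """
    target = tokens[start]["original"]
    changed = 0
    for token in tokens[start:]:
        # Only the remainder of the text is visited; earlier tokens are untouched.
        if token["original"] == target:
            token["edits"][level] = edited
            changed += 1
    return changed


# Invented sample text with three occurrences of "hum".
words = ["hum", "dia", "hum", "rei", "hum"]
tokens = [{"original": w, "edits": {}} for w in words]
count = replace_all(tokens, 2, "modernization", "um")
```

Starting the loop at the selected token, rather than at the beginning of the text, mirrors the description above: occurrences that the user has already passed are left as they were.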
The difficulties of encoding ancient texts in XML, using ordinary text editors, showed that a tool was necessary to make the process efficient and user-friendly. This led to the development of E-Dictor, which, since its earliest use, has shown promising results. Now, the user does not even have to know that the underlying encoding is XML. It is only necessary for him/her to know the (philological and linguistic) aspects of text editing.
E-Dictor led to a decrease of about 50% in the time required for encoding and editing texts. The improvement may be even higher if we consider the revision time. One of the factors in this improvement is the increased legibility the tool provides. The XML code is hidden, allowing one to practically read the text without any encoding. To illustrate the opposite, Figure 3 shows the ordinary editing "interface", before E-Dictor. Note that the content being edited is just "Ex.mo Sr. Duque".
Finally, the integration of the whole process into a single environment is a second factor in the overall improvement, because it allows the user to move freely and quickly between "representations" and to access external tools transparently.
E-Dictor is continually under development, as we discuss its characteristics and receive feedback from users. There is already a list of future improvements being developed, such as extending the exporting routines. A bigger goal is to incorporate an editing lexicon, which would be used by the tool for making suggestions during the editing process, or even to develop an "automatic token editing" system for later review by the user.
Besides CTB, E-Dictor is being used by the BBD project (BBD, 2010), and, recently, by various subgroups of the PHPB project (For a History of Portuguese in Brazil). These groups have broad experience in the philological editing of handwritten documents, and we hope their use of E-Dictor will help us improve it. The ultimate goal of E-Dictor is to be capable of handling the whole flow of linguistic and philological tasks: transcription, editing, tagging, and parsing.