Dzongkha word segmentation implements the principle of the maximal matching algorithm followed by a statistical (bigram) method. It first uses a word list/lexicon to segment the raw input sentences. It then uses MLE principles to estimate the bigram probabilities for each segmented word. All possible segmentations of an input sentence produced by maximal matching are then re-ranked, and the most likely segmentation is picked from the set of possible segmentations using a statistical approach (the bigram technique). This decides the best possible segmentation from among all the words (Huor et al., 2007) generated by the maximal matching algorithm. These mechanisms are described in the following.
The basic idea of the maximal matching algorithm is that it first generates all possible segmentations for an input sentence and then selects the segmentation that contains the minimum number of word tokens. It uses dictionary lookup.
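As a toy illustration of this selection criterion (the English strings below are hypothetical stand-ins for Dzongkha words), maximal matching keeps the candidate with the fewest tokens:

```python
# Among candidate segmentations of the same input, maximal matching
# prefers the one with the fewest word tokens.
# The English strings are hypothetical stand-ins for Dzongkha words.
candidates = [["hel", "lo", "world"], ["hello", "world"]]
best = min(candidates, key=len)  # fewest tokens wins
print(best)
```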
We used the following steps to segment a given input sentence.
1. Read the input string of text. If an input line contains more than one sentence, a sentence separator is applied to break the line into individual sentences.
2. Split the input string of text by Tsheg (') into syllables.
3. Taking the next syllable, generate all possible strings.
4. If the number of strings is greater than n for some value n:
* Look up the series of strings in the dictionary to find matches, and assign some weight accordingly.
* Sort the strings by the given weight.
* Delete the (number of strings - n) low-count strings.
5. Repeat from Step 2 until all syllables are processed.
The above-mentioned steps produce all possible segmented words from the given input sentence based on the provided lexicon. Thus, the overall accuracy and performance depend on the coverage of the lexicon (Pong and Robert, 1994).
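The steps above can be sketched as a beam-limited search over syllable positions. The following is a minimal sketch, not the authors' implementation: the toy lexicon, the (') delimiter (the real Dzongkha tsheg is U+0F0B), and the use of token count as the weight are illustrative assumptions, since the paper does not fully specify the weighting scheme.

```python
TSHEG = "'"  # syllable delimiter used in the steps above (real tsheg: U+0F0B)

def segment_candidates(sentence, lexicon, beam=8):
    """Generate up to `beam` candidate segmentations, preferring those
    with fewer tokens (the maximal matching criterion)."""
    syllables = sentence.split(TSHEG)
    n = len(syllables)
    # hyps[i] holds partial segmentations covering the first i syllables
    hyps = {0: [[]]}
    for end in range(1, n + 1):
        new = []
        for start in range(end):
            word = TSHEG.join(syllables[start:end])
            if start in hyps and word in lexicon:
                for prev in hyps[start]:
                    new.append(prev + [word])
        if new:
            # sort by token count and prune to the beam (steps 4a-4c)
            new.sort(key=len)
            hyps[end] = new[:beam]
    return hyps.get(n, [])
```

With a toy lexicon {"a", "b", "a'b", "c"}, the input "a'b'c" yields ["a'b", "c"] first (fewest tokens) and ["a", "b", "c"] second; an input containing a syllable absent from the lexicon yields no candidates, which is exactly the coverage dependency noted above.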
One of the key problems with MLE is insufficient data. That is, because of the unavoidably limited size of the training corpus, the vast majority of the words are uncommon and some of the bigrams may not occur at all in the corpus, leading to zero probabilities.
Therefore, the following smoothing techniques were used to estimate the probabilities of unseen bigrams.
The above problem of data sparseness underestimates the probability of some of the sentences in the test set. The smoothing technique helps to prevent errors by making the probabilities more uniform. Smoothing is the process of flattening a probability distribution implied by a language model so that all reasonable word sequences can occur with some probability. This often involves adjusting zero probabilities upwards and high probabilities downwards. In this way, the smoothing technique not only helps prevent zero probabilities, but the overall accuracy of the model is also significantly improved (Chen and Goodman, 1998).
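As a concrete sketch of the bigram estimation and re-ranking described above: the paper does not name the exact smoothing technique it used, so add-one (Laplace) smoothing is shown here purely for illustration.

```python
import math
from collections import Counter

def bigram_model(corpus_sentences):
    """MLE bigram probabilities with add-one (Laplace) smoothing.
    `corpus_sentences` is a list of pre-segmented sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in corpus_sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])               # history counts
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab = len(set(unigrams) | {"</s>"})          # vocabulary size V
    def prob(w1, w2):
        # add-one: (c(w1 w2) + 1) / (c(w1) + V); unseen bigrams get a
        # small nonzero probability instead of zero
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
    return prob

def score(segmentation, prob):
    """Log-probability of one candidate segmentation; re-ranking picks
    the candidate with the highest score."""
    padded = ["<s>"] + segmentation + ["</s>"]
    return sum(math.log(prob(a, b)) for a, b in zip(padded[:-1], padded[1:]))
```

Re-ranking then reduces to `max(candidates, key=lambda c: score(c, prob))` over the candidates produced by maximal matching; because every bigram receives a nonzero probability, a candidate containing an unseen bigram is penalized but never assigned probability zero.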
Subjective evaluation was performed by comparing the experimental results with the manually segmented tokens. The method was evaluated using different sets of test documents, from various domains, consisting of 714 manually segmented words. Table 3 summarizes the evaluation results.
We took extracts from different test data, hoping they would contain a fair amount of general terms, technical terms and common nouns. The manually segmented corpus, containing 41,739 tokens, is used for the study.
In the sample comparison below, the symbol (') does not mark a segmentation unit, but (|) does, despite its actual use as a mark for a comma or full stop. The whitespace in a sentence is a phrase boundary or a comma, and is a faithful representation of speech, where we pause not between words but either after certain phrases or at the end of sentences.
During the process of word segmentation, it is understood that the maximal matching algorithm is simple and effective, and can produce accurate segmentation only if all the words are present in the lexicon. But since not all word entries can be found in lexicon databases in real applications, the performance of word segmentation deteriorates when it encounters words that are not in the lexicon (Chiang et al., 1992).
The following are the significant problems with the dictionary-based maximal matching method, caused by the coverage of the lexicon (Emerson, 2000):
* incompleteness and inconsistency of the lexicon database
* absence of technical domains in the lexicon
* transliterated foreign names
* some common nouns are not included in the lexicon.
These problems of ambiguous word divisions and unknown proper names are lessened and partially solved when the output is re-ranked using the bigram technique. The solution to the following issues still needs to be discussed in the future. Although the texts were collected from the widest range of domains possible, the lack of available electronic resources for informative text adds to the following issues:
* the small size of the corpus was a limitation for the method
* ambiguity and inconsistency in the manual segmentation of a token in the corpus resulted in incompatibility and sometimes in conflict.
Ambiguity and inconsistency occur because of difficulties in identifying a word. Since the manual segmentation of corpus entries was carried out by humans rather than by computer, those humans have to be well skilled in identifying or understanding what a word is.
There are also cases such as the shortening of words, removal of inflections and abbreviation of words for the convenience of the writer. But this is not well reflected in the dictionaries, thus affecting the accuracy of the segmentation.
This paper describes an initial effort at segmenting the Dzongkha script. In this preliminary analysis of Dzongkha word segmentation, preprocessing and normalization are not dealt with. Numbering, special symbols and characters are also not included. These issues will have to be studied in the future. A lot of discussion is needed, and work also has to be done, to improve the performance of word segmentation. Although the study was a success, there are still some obvious limitations, such as its dependency on dictionaries/lexicons, and the fact that the current Dzongkha lexicon is not comprehensive. Also, there is the absence of a large corpus collection from various domains. Future work may include overall improvement of the method for better efficiency, effectiveness and functionality, by exploring different algorithms. Furthermore, the inclusion of POS-tag sets applied using n-gram techniques, which have been proven helpful in handling the unknown-word problem, might enhance the performance and increase accuracy. Increasing the corpus size might also help to improve the results.