Morphological analysis and disambiguation are crucial stages in a variety of natural language processing applications, especially when languages with complex morphology are concerned.
We present a system which disambiguates the output of a morphological analyzer for Hebrew.
It consists of several simple classifiers and a module which combines them under linguistically motivated constraints.
We investigate a number of techniques for combining the predictions of the classifiers.
Our best result, 91.44% accuracy, reflects a 25% reduction in error rate compared with the previous state of the art.
1 Introduction
Morphological analysis and disambiguation are crucial pre-processing steps for a variety ofnatural language processing applications, from search and information extraction to machine translation.
For languages with complex morphology these are nontrivial processes.
This paper presents a morphological disambiguation module for Hebrew which uses a sophisticated combination of classifiers to rank the analyses produced by a morphological analyzer.
This work has a twofold contribution: first, our system achieves over 91% accuracy on the full disambiguation task, reducing the error rate of the previous state of the art by 25%.
More generally, we explore several ways for combining the predictions of simple classifiers under constraints; the insight gained from these experiments will be useful for other applications of machine learning to complex (morphological and other) problems.
In the remainder of this section we discuss the complexity of Hebrew morphology, the challenge of morphological disambiguation and related work.
We describe our methodology in Section 2: we use basic, naive classifiers (Section 3) to predict some components of the analysis, and then combine them in several ways (Section 4) to predict a consistent result.
We analyze the errors of the system in Section 5 and conclude with suggestions for future work.
1.1 Linguistic background
Hebrew morphology is rich and complex.1 The major word formation machinery is root-and-pattern, and inflectional morphology is highly productive and consists of prefixes, suffixes and circumfixes.
Nouns, adjectives and numerals inflect for number (singular, plural and, in rare cases, also dual) and gender (masculine or feminine).
In addition, all these three types of nominals have two phonologi-cally and morphologically distinct forms, known as the absolute and construct states.
In the standard orthography approximately half of the nominals appear to have identical forms in both states, a fact which substantially increases the ambiguity.
In addition, nominals take possessive pronominal suffixes which inflect for number, gender and person.
Verbs inflect for number, gender and person (first, second and third) and also for a combination of tense and aspect/mood, referred to simply as 'tense' below.
Verbs can also take pronominal suffixes, which are interpreted as direct objects, and in some cases can also take nominative pronominal suffixes.
A peculiarity of Hebrew verbs is that the participle form
1 To facilitate readability we use a straight-forward transliteration of Hebrew using ASCII characters, where the characters (in Hebrew alphabetic order) are: abgdhwzxviklmnsypcqr$t.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 439-447, Prague, June 2007.
©2007 Association for Computational Linguistics
can be used as present tense, but also as a noun or an adjective.
These matters are complicated further due to two sources: first, the standard Hebrew orthography leaves most of the vowels unspecified.
On top of that, the script dictates that many particles, including four of the most frequent prepositions, the definite article, the coordinating conjunction and some subordinating conjunctions, all attach to the words which immediately follow them.
When the definite article h is prefixed by one of the prepositions b, k or l, it is assimilated with the preposition and the resulting form becomes ambiguous as to whether or not it is definite.
For example, bth can be read either as b+th "in tea" or as b+h+th "in the tea".
Thus, the form $bth can be read as an inflected stem (the verb "capture", third person singular feminine past), as $+bth "that+field", $+b+th "that+in+tea",
even as $+bt+h "that her daughter".
An added complexity stems from the fact that there are two main standards for the Hebrew script: one in which vocalization diacritics, known as niqqud "dots", decorate the words, and another in which the dots are missing, and other characters represent some, but not all of the vowels.
Most of the texts in Hebrew are of the latter kind; unfortunately, different authors use different conventions for the undotted script.
Thus, the same word can be written in more than one way, sometimes even within the same document.
This fact adds significantly to the degree of ambiguity.
Our departure point in this work is HAMSAH (Yona and Wintner, 2007), a wide coverage, linguistically motivated morphological analyzer ofHe-brew, which was recently re-implemented in Java and made available from the Knowledge Center for Processing Hebrew (http://mila.cs. technion.ac.il/).
The output that HAMSAH produces for the form $bth is illustrated in Table 1.
In general, it includes the part of speech (POS) as well as sub-category, where applicable, along with several POS-dependent features such as number, gender, tense, nominal state, definitness, etc.
1.2 The challenge of disambiguation
Identifying the correct morphological analysis of a given word in a given context is an important and
non-trivial task.
Unlike POS tagging, the task does not involve assigning an analysis to words which the analyzer does not recognize.
However, selecting an analysis immediately induces a POS tagging for the target word (by projecting the analysis on the POS coordinate).
Our main contribution in this work is a system that solves this problem with high accuracy.
Compared with POS tagging of English, morphological disambiguation of Hebrew is a much more complex endeavor due to the following factors:
Segmentation A single token in Hebrew can actually be a sequence of more than one lexical item.
For example, analysis 4 of Table 1 ($+b+h+th "that+in+the+tea") corresponds to the tag sequence IN+IN+DT+NN.
Large tagset The number of different tags in a language such as Hebrew (where the POS, morphological features and prefix and suffix particles are considered) is huge.
HAMSAH produces 22 different parts of speech, some with subcategories; 6 values for the number feature (including disjunctions of values), 4 for gender, 5 for person, 7 for tense and 3 for nominal state.
Possessive pronominal suffixes can have 15 different values, and prefix particle sequences can theoretically have hundreds of different forms.
While not all the combinations of these values are possible, we estimate the number of possible analyses to be in the thousands.
Ambiguity Hebrew is highly ambiguous: HAM-SAH outputs on average approximately 2.64 analyses per word token.
Oftentimes two or more alternative analyses share the same part of speech, and in some cases two or more analyses are completely identical, except for their lexeme (see analyses 7 and 8 in Table 1).
Morphological disambiguation of Hebrew is hence closer to the problem of word sense disambiguation than to standard POS tagging.
Anchors, which are often function words, are almost always morphologically ambiguous in Hebrew.
These include most of the high-frequency forms.
Many of the function words which help boost the performance of English POS tagging are actually prefix particles which add to the ambiguity in Hebrew.
Lexical ID
Word order in Hebrew is freer than in English.
1.3 Related work
The idea of using short context for morphological disambiguation dates back to Choueka and Lusig-nan (1985).
Levinger et al. (1995) were the first to apply it to Hebrew, but their work was hampered by the lack of annotated corpora for training and evaluation.
The first work which uses stochastic contextual information for morphological disambiguation in Hebrew is Segal (1999): texts are analyzed using the morphological analyzer of Segal (1997); then, each word in a text is assigned its most likely analysis, defined by probabilities computed from a small tagged corpus.
In the next phase the system corrects its own decisions by using short context (one word to the left and one to the right of the target word).
The corrections are also automatically learned from the tagged corpus (using transformation-based learning).
In the last phase, the analysis is corrected by the results of a syntactic analysis of the sentence.
The reported results are excellent: 96.2% accuracy.
More reliable tests, however, reveal accuracy of 85.5% only (Lember-ski, 2003, page 85).
Furthermore, the performance of the program is unacceptable (the reported running time on "two papers" is thirty minutes).
Bar-Haim et al. (2005) use Hidden Markov Models (HMMs) to implement a segmenter and a tagger for Hebrew.
The main innovation of this work is that it models word-segments (morphemes: prefixes, stem and suffixes), rather than full words.
The accuracy of this system is 90.51% for POS tagging (a tagset of 21 POS tags is used) and 96.74% for segmentation (which is defined as identifying all prefixes, including a possibly assimilated definite arti-
cle).
As noted above, POS tagging does not amount to full morphological disambiguation.
Recently, Adler and Elhadad (2006) presented an unsupervised, HMM-based model for Hebrew morphological disambiguation, using a morphological analyzer as the only resource.
A morpheme-based model learns both segmentation and tagging in parallel from a large (6M words) un-annotated corpus.
Reported results are 92.32% for POS tagging and 88.5% for full morphological disambiguation.
We refer to this result as the state of the art and use the same data for evaluation.
A supervised approach to morphological disambiguation of Arabic is given by Habash and Rambow (2005), who use two corpora of 120K words each to train several classifiers.
Each morphological feature is predicted separately and then combined into a full disambiguation result.
The accuracy of the dis-ambiguator is 94.8%-96.2% (depending on the test corpus).
Note, however, the high baseline of each classifier (96.6%-99.9%, depending on the classifier) and the full disambiguation task (87.3%-92.1%, depending on the corpus).
We use a very similar approach below, but we experiment with more sophisticated methods for combining simple classifiers to induce a coherent prediction.
2 Methodology
For training and evaluation, we use a corpus of approximately 90,000 word tokens, consisting of newspaper texts, which was automatically analyzed using HAMSAH and then manually annotated (El-hadad et al., 2005).
Annotation consists simply of selecting the correct analysis produced by the analyzer, or an indication that no such analysis ex-
ists.
When the analyzer does not produce the correct analysis, it is added manually.
This is the exact setup of the experiments reported by Adler and El-hadad (2006).
Table 2 lists some statistics of the corpus, and a histogram of analyses is given in Table 3.
Table 4 lists the distribution of POS in the corpus.
Table 2: Statistics of training corpus
# analyses
Table 3: Histogram ofanalyses
In all the experiments described in this paper we use SNoW (Roth, 1998) as the learning environment, with winnow as the update rule (using perception yielded very similar results).
SNoW is a multi-class classifier that is specifically tailored for learning in domains in which the potential number of information sources (features) taking part in decisions is very large, of which NLP is a principal example.
It works by learning a sparse network of linear functions over the feature space.
SNoW has already been used successfully as the learning vehicle in a large collection of natural language related tasks and compared favorably with other classifiers (Punyakanok and Roth, 2001; Florian, 2002).
Typically, SNoW is used as a classifier, and predicts using a winner-take-all mechanism over the activation values of the target classes.
However, in addition to the prediction, it provides a reliable confidence level in the prediction, which enables its use in an inference algorithm that combines predictors to produce a coherent inference.
Following Daya et al. (2004) and Habash and
Punctuation
Proper Noun
Preposition
Adjective
Participle
Conjunction
Quantifier
Negation
Interrogative
Interjection
Table 4: POS frequencies
Rambow (2005), we approach the problem of morphological disambiguation as a complex classification task.
We train a classifier for each of the attributes that can contribute to the disambiguation of the analyses produced by HAMSAH (e.g., POS, tense, state).
Each classifier predicts a small set of possible values and hence can be highly accurate.
In particular, the basic classifiers do not suffer from problems of data sparseness.
Of course, each simple classifier cannot fully disambiguate the output of HAMSAH, but it does induce a ranking on the analyses (see Table 6 below for the level of ambiguity which remains after each simple classifier is applied).
Then, we combine the outcomes of the simple classifiers to produce a consistent ranking which induces a linear order on the analyses.
For evaluation we consider only the words that have at least one correct analysis in the annotated corpus.
Accuracy is defined as the ratio between the number of words classified correctly and the total number of words in the test corpus that have a correct analysis.
The remaining level of ambiguity is defined as the average number of analyses per word whose score is equal to the score of the top ranked analysis.
This is greater than 1 only for the simple
classifiers, where more than one analysis can have the same tag.
In all the experiments we perform 10fold cross-validation runs and report the average of the 10 runs, both on the entire corpus and on a subset of the corpus in which we only test on words which do not occur in the training corpus.
The baseline tag of the token wi is the most prominent tag of all the occurrences of wi in the corpus.
The baseline for the combination is the most prominent analysis of all the occurrences of wi in the corpus.
If wi does not occur in the corpus, we back off and select the most prominent tag in the corpus independently of the word wi.
For the combination baseline, we select the analysis of the most prominent lexical ID, chosen from the list of all possible lexical IDs of wi.
If there is more than one possible value, one top-ranking value is chosen at random.
3 Basic Classifiers
The simple classifiers are all built in the same way.
They are trained on feature vectors that are generated from the output of the morphological analyzer, and tested on a clean output of the same analyzer.
We defined several classifiers for the attributes of the morphological analyses.
Since some attributes do not apply to all the analyses, we add a value of 'N/A' for the inapplicable attributes.
An annotated corpus was needed in all those classifiers for training.
We list the basic classifiers below.
POS 22 values (only 18 in our corpus), see Table 4.
Gender 'Masculine', 'Feminine', 'Masculine and feminine', 'N/A'.
Number 'Singular', 'Plural', 'Dual', 'N/A'.
Person 'First', 'Second', 'Third', 'N/A'.
Tense 'Past , 'Present , 'Participle , 'Future , 'Imperative', 'Infinitive', 'Bare Infinitive', 'N/A'.
Definite Article 'Def', 'indef', 'N/A'.
Identifies also implicit (assimilated) definiteness.
Status 'Absolute', 'Construct' and 'N/A'.
Segmentation Predicts the number of letters which are prefix particles.
Possible values are [0-6], 6 being the length of longest possible prefix sequence.
Does not identify implicit definiteness.
Has properties A binary classifier which distinguishes between atomic POS categories (e.g., conjunction or negation) and categories whose words have attributes (such as nouns or verbs).
Each word in the training corpus induces features that are generated for itself and its immediate neighbors, using the output of the morphological analyzer.
For each word in the window, we generate the following features: POS, number, gender, person, tense, state, definiteness, prefixes (where each possible prefix is a binary feature), suffix (binary: is there word suffixed?), number/gender/person of suffix, surface form, lemma, conjunction of the surface form and the POS, conjunction of the POS and the POS of prefixes and suffixes, and some disjunctions of POS.
The total number of features for each example is huge (millions), but feature vectors are very sparse.
The simple classifiers can be configured in several ways.
First, the size of the window around the target word had to be determined, and we experimented with several sizes, up to ±3 words.
Another issue is feature generation.
It is straight-forward during training, but during evaluation and testing the feature extractor is presented only with the set of analyses produced by HAMSAH for each word, and has no access to the correct analysis.
We experimented with two methods for tackling this problem: produce the union of all possible values for each feature; or select a single analysis, the baseline one, for each word, and generate only the features induced by this analysis.
While this problem is manifested only during testing, it impacts also the training procedure, and so we experimented with feature generation at training using the correct analysis, the union of the analyses or the baseline analysis.
The results of the experiments for the POS classifier are shown in Table 5.
The best configuration uses a window of two words before and one word after the target word.
For both testing and training we generate features using the baseline analysis.
With this setup, the accuracy of all the classifiers is shown in Table 6.
We report results on two tasks: the entire test corpus; and words in the test corpus which do not occur in the training corpus, a much harder task.
We list the accuracy, remaining level of ambiguity and reduction in error rate ERR, com-
Training
Table 5: Architectural configurations of the POS classifier: columns reflect the window size, rows refer to training and testing feature generation
ambiguity
accuracy
Definite Article
Segmentation
Has properties
Table 6: Accuracy of the simple classifiers: ERR is reduction in error rate, compared with the baseline
pared with the baseline.
baseline
4 Combination of Classifiers
Given a set of simple classifiers, we now investigate various ways for combining their predictions.
These predictions may be contradicting (for example, the POS classifier can predict 'noun ' while the tense classifier predicts 'past ' ), and we use the constraints imposed by the morphological analyzer to enforce a consistent analysis.
First, we define a naive combination along the lines of Habash and Rambow (2005).
The scores assigned by the simple classifiers (except segmentation, for which we use the baseline) to each analysis are accumulated, and the score of the complete analysis is their sum (experiments with different weights to the various classifiers proved futile).
Even after the combination, the remaining level of ambiguity is 1.05; in ambiguous cases back off to the baseline analysis, and then choose at random one of the topranking analyses.
The result of the combination is shown in Table 7.
Table 7: Results of the naïve combination
Next, we define a hierarchical combination in which we try to incorporate more linguistic knowledge pertaining to the dependencies between the classifiers.
As a pre-processing step we classify the target word to one of two groups, using the has properties classifier.
Then, we predict the main POS of the target word, and take this prediction to be true; we then apply only the subset of the other classifiers that are relevant to the main POS.
The results of the hierarchical combination are shown in Table 8.
As can be seen, the hierarchical combination performs worse than the naïve one.
We conjecture that this is because the hierarchical combination does not fully disambiguate, and a random top-ranking analysis is chosen more often than in the case of the naïve combination.
hierarchical
All words Unseen words
Table 8: Results of the hierarchial combination
The combination of independent classifiers under the constraints imposed by the possible morphological analyses is intended to capture context-dependent constraints on possible sequences of analyses.
Such constraints are stochastic in nature, but linguistic theory tells us that several hard (deterministic) constraints also exist which rule out certain sequences of otherwise possible analyses.
We now explore the utility of implementing such constraints to filter out linguistically impossible sequences.
Using several linguistic sources, we defined a set of constraints, each of which is a linguistically impossible sequence of analyses (all sequences are of length 2, although in principle longer ones could have been defined).
We then checked the annotated corpus for violations of these constraints; we used the corpus to either verify the correctness of a constraint or further refine it (or abandon it altogether, in some cases).
We then re-iterated the process with the new set of constraints.
The result was a small set of six constraints which are not violated in our annotated corpus.
We used the constraints to rule out some of the paths defined by the possible outcomes of the morphological analyzer on a sequence of words.
Each of the constraints below contributes a non-zero reduction in the error rate of the disambiguation module.The (slightly simplified) constraints are:
A verb in any tense but present cannot be followed by the genitive preposition '$l ' (of).
A preposition with no attached pronomial suffix must be followed by a nominal phrase.
This rule is relaxed for some prepositions which can be followed by the prefix '$ ' .
The preposition 'at ' must be followed by a definite nominal phrase.
Construct-state words must be followed by a nominal phrase.
A sequence of two verbs is only allowed if: one of them is the verb 'hih (be); one of them has a prefix; the second is infinitival; or the first is imperative and the second is in future tense.
A non-numeral quantifier must be followed by either a nominal phrase or a punctuation.
Imposing the linguistically motivated constraints on the classifier combination improved the results to some extent, as depicted in Table 9.
The best results are obtained when the constraints are applied to the hierarchical combination.
5 Error analysis
We conducted extensive error analysis of both the simple classifiers and the combination module.
The analysis was performed over one fold of the annotated corpus (8933 tokens).
Table 10 depicts, for some classifiers, a subset of the confusion matrix: it lists the correct tag, the chosen, or predicted, tag, the number of occurrences of the specific error and the total number of errors made by the classifier.
classifier
has props
segmentation
definiteness
Table 10: Simple classifiers, confusion matrix
Several patterns can be observed in Table 10.
The 'has properties ' classifier is biased towards predicting 'yes instead of 'no .
The 'segmentation classifier, which predicts the length of the prefix, also displays a clear bias.
In almost 90% of its errors it predicts no prefix instead of a prefix of length one. '
Status ' and 'definiteness ' are among the weakest classifiers, biased towards the default.
Other classifiers make more sporadic types of errors.
Of particular interest is the POS classifier.
Here, when adjectives are mis-predicted, they are predicted as nouns.
This can be explained by the morphological similarity of the two categories, and in particular by the similar syntactic contexts in which they occur.
Similarly, almost 90% of mispredicted verbs are predicted to be either nouns
naive + consts
hier.
All words
Unseen words
Table 9: Accuracy results of various combination architectures.
ERR is reduction in error rate due to the hard constraints.
The best results are obtained using the hierarchical combination with hard constraints.
or adjectives, probably resulting from present-tense verbs in the training corpus which, in Hebrew, have similar distribution to nouns and adjectives.
The analysis of errors in the combination is more interesting.
On the entire corpus, the disambigua-tor makes 7927 errors.
Of those, 1476 (19%) are errors in which the correct analysis differs from the chosen one only in the value of the 'state feature.
Furthermore, in 1341 of the errors (17%) the system picks the correct analysis up to the value of 'definite-ness ; of those, 1275 (16% of the errors) are words in which the definite article is assimilated in a preposition.
In sum, many of the errors seem to be in the real tough cases.
6 Conclusions
Morphological disambiguation of Hebrew is a difficult task which involves, in theory, thousands of possible tags.
We reconfirm the results of Daya
which show that decoupling complex morphological tasks into several simple tasks improves the accuracy of classification.
Our best result, 91.44% accuracy, reflects a reduction of 25% in error rate compared to the previous state of the art (Adler and Elhadad, 2006), and almost 40% compared to the baseline.
We also show that imposing few context-dependent constraints on possible sequences of analyses improves the accuracy of the disambiguation.
The disambiguation module will be made available through the Knowledge Center for Processing Hebrew (http://mila.cs. technion.ac.il/).
We believe that these results can be further improved in various ways.
The basic classifiers can benefit from more detailed feature engineering and careful tuning of the parameters of the learning environment.
There are various ways in which interrelated classifiers can be combined; we only explored three here.
Using other techniques, such as
inference-based training, in which the feature generation for training is done step by step, using information inferred in the previous step, is likely to yield better accuracy.
We also believe that further linguistic exploration, based on deeper error analysis, will result in more hard constraints which can reduce the error rate of the combination module.
Finally, we are puzzled by the differences between Hebrew and Arabic (for which the baseline and the current state of the art are significantly higher) on this task.
We intend to investigate the linguistic sources for this puzzle in the future.
Acknowledgements
We are extremely grateful to Dan Roth for his continuing support and advise; to Meni Adler for providing the annotated corpus; to Dalia Bojan and Alon Itai for the implementation of the morphological analyzer; to Yariv Louck for the implementation of the deterministic constraints; to Nurit Melnik for help with error analysis; and to Yuval Nardi his help with statistical analysis.
Thanks are due to Ido Dagan, Alon Lavie and Michael Elhadad for useful comments and advise.
This research was supported by THE ISRAEL SCIENCE FOUNDATION (grant No. 137/06); by the Israel Internet Association; by the Knowledge Center for Processing Hebrew; and by the Caesarea Rothschild Institute for Interdisciplinary Application of Computer Science at the University of Haifa.
