In morphologically rich languages, should morphological and syntactic disambiguation be treated sequentially or as a single problem?
We describe several efficient, probabilistically-interpretable ways to apply joint inference to morphological and syntactic disambiguation using lattice parsing.
Joint inference is shown to compare favorably to pipeline parsing methods across a variety of component models.
State-of-the-art performance on Hebrew Treebank parsing is demonstrated using the new method.
The benefits of joint inference are modest with the current component models, but appear to increase as components themselves improve.
1 Introduction
As the field of statistical NLP expands to handle more languages and domains, models appropriate for standard benchmark tasks do not always work well in new situations.
Take, for example, parsing the Wall Street Journal Penn Treebank, a longstanding task for which highly accurate context-free models stabilized by the year 2000 (Collins, 1999; Charniak, 2000).
On this task, the Collins model achieves 90% F1-accuracy.
Extended for new languages by Bikel (2004), it achieves only 75% on Arabic and 72% on Hebrew.1
It should come as no surprise that Semitic parsing lags behind English.
The Collins model was carefully designed and tuned for WSJ English.
Many of the features in the model depend on English syntax or Penn Treebank annotation conventions.
Inherent in its crafting is the assumption that a million words of training text are available.
Finally, for English, it need not handle morphological ambiguity.
Indeed, the figures cited above for Arabic and Hebrew are achieved using gold-standard morphological disambiguation and part-of-speech tagging.
* The authors acknowledge helpful feedback from the anonymous reviewers, Sharon Goldwater, Rebecca Hwa, Alon Lavie, and Shuly Wintner.
1Compared to the Penn Treebank, the Arabic Treebank (Maamouri et al., 2004) has 60% as many word tokens, and the Hebrew Treebank (Sima'an et al., 2001) has 6%.
Given only surface words, Arabic performance drops by 1.5 F1 points.
The Hebrew Treebank (unlike Arabic) is built over morphemes, a convention we view as sensible, though it complicates parsing.
This paper considers parsing for morphologically rich languages, with Hebrew as a test case.
Morphology and syntax are two levels of linguistic description that interact.
This interaction, we argue, can affect disambiguation, so we explore here the matter of joint disambiguation.
This involves the comparison of a pipeline (where morphology is inferred first and syntactic parsing follows) with joint inference.
We present a generalization of the two, and show new ways to do joint inference for this task that do not involve a computational blow-up.
The paper is organized as follows.
§2 describes the state of the art in NLP for Hebrew and some phenomena it exhibits that motivate joint inference for morphology and syntax.
§3 describes our approach to joint inference using lattice parsing, and gives three variants of weighted lattice parsing with their probabilistic interpretations.
The different factor models and their stand-alone performance are given in §4.
§5 presents experiments on Hebrew parsing and explores the benefits of joint inference.
2 Background
In this section we discuss prior work on statistical morphological and syntactic processing of Hebrew and motivate the joint approach.
Wintner (2004) reviews work in Hebrew NLP, emphasizing that the challenges stem from the writing system, rich morphology, unique word formation process of roots and patterns, and relative lack of annotated corpora.
We know of no publicly available statistical parser designed specifically for Hebrew.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 208-217, Prague, June 2007.
©2007 Association for Computational Linguistics

Figure 1: (a.) A sentence in Hebrew (to be read right to left), with (b.) one morphological analysis, (c.) English glosses ("is-beautiful there shepherds that distant the and big the green the meadow the in shepherd the"), and (d.) the natural translation ("the shepherd in the big green distant meadow who shepherds there is beautiful"); and (e.) a different morphological analysis, (f.) English glosses ("nicely there is-lying distant the and big the green the meadow the in shepherd the"), and (g.) the less natural translation ("the shepherdess in the big green distant meadow is lying there nicely").
(h.) shows a morphological "sausage" lattice that encodes the morpheme-sequence analyses L(x) possible for a shortened sentence (unmodified "meadow"). Shaded states are word boundaries, white states are intra-word morpheme boundaries; in practice we add POS tags to the arcs, to permit disambiguation. According to both native speakers we polled, both interpretations are grammatical; note the long-distance agreement required for grammaticality.

Sima'an et al. (2001) built a Hebrew Treebank of 88,747 words (4,783 sentences) and parsed it using a probabilistic model.
However, they assumed that the input to the parser was already (perfectly) morphologically disambiguated.
This assumption is very common in multilingual parsing (see, for example, Cowan et al., 2005, and Buchholz et al., 2006).
In NLP, the separation of syntax and morphology is understandable when the latter is impoverished, as in English.
When both involve high levels of ambiguity, this separation becomes harder to justify, as argued by Tsarfaty (2006).
To our knowledge, that is the only study to move toward joint inference of syntax and morphology; it presented joint models and tested approximations of them with two parsers: one a pipeline (segmentation → tagging → parsing), the other performing joint inference of segmentation and tagging, with the result piped to the parser. The latter was slightly more accurate. Tsarfaty discussed but did not carry out full joint inference.
In a morphologically rich language, the different morphemes that make up a word can play a variety of different syntactic roles.
A reasonable linguistic analysis might not make such morphemes immediate sisters in the tree.
Indeed, the convention of the Hebrew Treebank is to place morphemes (rather than words) at the leaves of the parse tree, allowing morphemes of a word to attach to different nonterminal parents.2
Generating parse trees over morphemes requires the availability of morphological information when parsing.
Because this analysis is not in general reducible to sequence labeling (tagging), the problem is different from POS tagging.
Figure 1 gives an example from Hebrew that illustrates the interaction between morphology and syntax.

2The Arabic Treebank, by contrast, annotates words morphologically but keeps the morphemes together as a single node tagged with a POS sequence. In Bikel's Arabic parser, complex POS tags are projected to a small atomic set; it is unclear how much information is lost.
In this example, we show two interpretations of the surface text, with the first being a more common natural analysis for the sentence.
The first and third-to-last words' analyses depend on each other if the resulting analysis is to be the more natural one: for this analysis the first seven words have to be a noun phrase, while for the less common analysis ("lying there nicely") only the first six words compose a noun phrase, with the last two words composing a verb phrase.
Consistency depends on a long-distance dependency that a finite-state morphology model cannot capture, but a model that involves syntactic information can.
Disambiguating the syntax aids in disambiguating the morphology, suggesting that a joint model will perform both tasks more accurately.
In sum, joint inference of morphology and syntax is expected to allow decisions of both kinds to influence each other, enforce adherence to constraints at both levels, and to diminish the propagation of errors inherent in pipelines.
3 Joint Inference of Morphology and Syntax
We now formalize the problem and supply the necessary framework for performing joint morphological disambiguation and syntactic parsing.
3.1 Notation and Morphological Sausages
Let X be the language's word vocabulary and M be its morpheme inventory.
The set of valid analyses for a surface word is defined using a morphological lexicon L, which defines L(x) ⊆ M+ for each word x. L(x) ⊆ (M+)n (a sequence of sequences) is the set of whole-sentence analyses for the sentence x = (x1, x2, ..., xn), produced by concatenating elements of L(xi) in order.
L(x) can be represented as an acyclic lattice with a "sausage" shape familiar from speech recognition (Mangu et al., 1999) and machine translation (Lavie et al., 2004).
Fig. 1h shows a sausage lattice for a sentence in Hebrew.
We use m to denote an element of L(x) and mi to denote an element of L(xi); in general, m = (m1, m2, ..., mn).
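To make the lattice concrete, here is a minimal sketch (ours, not the authors' implementation; the function name and state-numbering scheme are illustrative) that builds a sausage lattice from per-word analysis lists, with word-boundary states numbered 0..n:

```python
def build_sausage_lattice(analyses_per_word):
    """analyses_per_word: one entry per word, each a list of morpheme tuples.
    Returns (arcs, final_state); arcs are (src, dst, morpheme) triples.
    Word-boundary states are 0..n; intra-word states get fresh ids."""
    arcs = []
    next_state = len(analyses_per_word) + 1
    for i, analyses in enumerate(analyses_per_word):
        start, end = i, i + 1
        for morphemes in analyses:
            src = start
            for j, m in enumerate(morphemes):
                if j == len(morphemes) - 1:
                    dst = end  # last morpheme of the analysis rejoins the word boundary
                else:
                    dst = next_state
                    next_state += 1
                arcs.append((src, dst, m))
                src = dst
    return arcs, len(analyses_per_word)

# Toy example from the text: MHKLB analyzed either as M+H+KLB or as one unit.
arcs, final = build_sausage_lattice([[("M", "H", "KLB"), ("MHKLB",)]])
```

Each word contributes a bundle of alternative morpheme paths between two shaded word-boundary states, matching the "sausage" shape of Figure 1h.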
We use DG(m) ⊆ T to denote the set of valid trees under a grammar G (here, a PCFG with terminal alphabet M) for morpheme sequence m. To be precise, f(x) selects a mutually consistent morphological and syntactic analysis from GEN(x) = {(t, m) : m ∈ L(x), t ∈ DG(m)}.
Our mapping f(x) is based on a joint probability model p(t, m | x) which combines two probability models: pG(t, m) (a PCFG built on the grammar G) and pL(m | x) (a morphological disambiguation model built on the lexicon L).
Factoring the joint model into sub-models simplifies training, since we can train each model separately, and inference (parsing), as we will see later in this section.
Factored estimation has been quite popular in NLP of late (Klein and Manning, 2003b; Smith and Smith, 2004; Smith et al., 2005a, inter alia).
The most obvious joint parser uses pG as a conditional model over trees given morphemes and maximizes the joint likelihood:

f(x) = argmax over (t, m) ∈ GEN(x) of pG(t | m) · pL(m | x)

This is not straightforward, because it involves summing over the trees for each m to compute pG(m), which calls for the O(|m|3) Inside algorithm to be run on each m. Instead, we use the joint probability pG(t, m), which, strictly speaking, makes the model deficient ("leaky"), but permits a dynamic programming solution.
Our models will be parametrized using either un-normalized weights (a log-linear model) or multinomial distributions.
Either way, both models define scores over parts of analyses, and it may be advantageous to give one model relatively greater strength, especially since we often ignore constant normalizing factors.
This is known as a product of experts (Hinton, 1999), where a new combined distribution over events is defined by multiplying component distributions together and renormalizing.
In the product-of-experts combination, our model is

p(t, m | x) = pG(t, m) · pL(m | x)^α / Z(x, α)

where Z(x, α) need not be computed (since it is a constant in m and t). α tunes the relative weight of the morphology model with respect to the parsing model. The higher α is, the more we trust the morphology model over the parser to correctly disambiguate the sentence.
We might trust one model more than the other for a variety of reasons: it could be more robustly or discriminatively estimated, or it could be known to come from a more appropriate family.
This formulation also generalizes two more naive parsing methods.
If α = 0, the morphology is modeled only through the PCFG, and pL is ignored except as a constraint on which analyses L(x) are allowed (i.e., on the definition of the set GEN(x)). At the other extreme, as α → +∞, pL becomes more important. Because pL does not predict trees, pG still "gets to choose" the syntax tree, but in the limit it must find a tree for argmax over m ∈ L(x) of pL(m | x). This is effectively the morphology-first pipeline.3
3.3 Parsing Algorithms
To parse, we apply a dynamic programming algorithm in the (max, +) semiring to solve the fPoE,α decoding problem defined above.
If pL is a unigram-factored model, such that for some single-word morphological model u we have

pL(m | x) = ∏i u(mi | xi),

then we can implement morpho-syntactic parsing by weighting the sausage lattice. Let the weight of each arc that starts an analysis mi ∈ L(xi) be equal to α log u(mi | xi), and let all other arcs have weight 0. In the parsing algorithm, the weight on an arc is summed in when the arc is first used to build a constituent.
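Under the unigram assumption, the arc-weighting scheme can be sketched as follows; u is a stand-in for the single-word model and the interface is hypothetical:

```python
import math

def weight_word_analyses(analyses, u, alpha):
    """For one word, return {analysis: [per-arc weights]}.  The arc that
    starts an analysis carries alpha * log u(analysis); the remaining
    arcs of that analysis carry weight 0, so the whole analysis
    contributes exactly alpha * log u once, wherever the parser enters it.
    `u` is a hypothetical unigram model mapping an analysis to its
    probability given the word."""
    return {a: [alpha * math.log(u(a))] + [0.0] * (len(a) - 1)
            for a in analyses}
```

A (max, +) parser then simply adds these arc weights to its rule scores as constituents are built.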
In general, we would like to define a joint model that assigns (unnormalized) probabilities to elements of GEN(x). If pG is a PCFG and pL can be described as a weighted finite-state transducer, then this joint model is their weighted composition, which is a weighted CFG; call the composed grammar I and its (unnormalized) distribution pI. Compared to G, I will have many more nonterminals if pL has a Markov order greater than 0 (i.e., beyond the unigram case above). Because parsing runtime depends heavily on the grammar constant (at best, quadratic in the number of nonterminals), parsing with pI is not computationally attractive.4 fPoE,α is not, then, a scalable solution when we wish to use a morphology model pL that can make interdependent decisions about different words in x in context.

3There is a slight difference. If no parse tree exists for the pL-best morphological analysis, then a less probable m may be chosen. So as α → +∞, we can view fPoE,α as finding the best grammatical m and its best tree, which is not exactly a pipeline.
We propose two new, efficient dynamic programming solutions for joint parsing.
Our first approximation replaces pL by a sentence-specific unigram model p′L whose factor for word i is the posterior p′L(mi | x) = pL(Mi = mi | x); each factor, being a posterior, depends on all of x.
Similar methods were applied by Matsuzaki et al. (2005) and Petrov and Klein (2007) for parsing under a PCFG with nonterminals with latent annotations.
Their approach was variational, approximating the true posterior over coarse parses using a sentence-specific PCFG on the coarse nonterminals, created directly out of the true fine-grained PCFG.
In our case, we approximate the full distribution over morphological analyses for the sentence by a simpler, sentence-specific unigram model that assumes each word's analysis is to be chosen independently of the others.
Note that our model (pL) does not make such an assumption, only the approximate model p'L does, and the approximation is per-sentence.
The idea resembles a mean-field variational approximation for graphical models.
Turning to implementation, we can solve for pL(mi | x) exactly using the forward-backward algorithm. We will call this method fvari,α (see Eq. 5).
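As a sketch of the forward-backward computation (with generic unnormalized arc scores standing in for the CRF's potentials; the actual model is richer), exact arc posteriors over an acyclic lattice can be computed as:

```python
from collections import defaultdict

def arc_posteriors(arcs, start, final):
    """Exact arc posteriors in an acyclic lattice via forward-backward
    (sum-product).  arcs: list of (src, dst, label, score) with
    unnormalized non-negative scores; we assume state ids are
    topologically ordered (src < dst for every arc)."""
    fwd = defaultdict(float)          # total mass of paths reaching each state
    fwd[start] = 1.0
    for src, dst, _, s in sorted(arcs, key=lambda a: a[0]):
        fwd[dst] += fwd[src] * s
    bwd = defaultdict(float)          # total mass of paths from each state to final
    bwd[final] = 1.0
    for src, dst, _, s in sorted(arcs, key=lambda a: -a[0]):
        bwd[src] += bwd[dst] * s
    Z = fwd[final]                    # total unnormalized mass of all paths
    return [fwd[src] * s * bwd[dst] / Z for src, dst, _, s in arcs]
```

The posterior of an arc is the normalized mass of all lattice paths that pass through it, exactly the quantity placed on the lattice arcs in fvari,α.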
A closely related method, applied to parsing by Goodman (1996), is minimum-risk decoding; Goodman called it "maximum expected recall." In the HMM community it is sometimes called "posterior decoding."

4In prior work involving factored syntax models, lexicalized (Klein and Manning, 2003b) and bilingual (Smith and Smith, 2004), fPoE,1 was applied, and the asymptotic runtime grew to O(n5) and O(n7), respectively.
Minimum risk decoding is attributable to Goel and Byrne (2000).
Applied to a single model, it factors the parsing decision by penalizable errors, and chooses the solution that minimizes the risk (expected number of errors under the model).
This factors into a sum of expectations, one per potential mistake.
This method is expensive for parsing models (since it requires the Inside algorithm to compute expected recall mistakes), but entirely reasonable for sequence labeling models.
The idea is to score each word analysis mi in the morphological lattice by the expected value (under pL) that mi is present in the final analysis m. This is, of course, pL(Mi = mi | x), the same quantity computed for fvari,α, except that the score of a path in the lattice is now a sum of posteriors rather than a product.
Our second approximate joint parser tries to maximize the probability of the parse (as before) and at the same time to minimize the risk of the morphological analysis. See frisk,α in Eq. 6; the only difference between frisk,α and fvari,α is whether posteriors are added (frisk,α) or multiplied (fvari,α). To summarize this section, fvari,α and frisk,α are two approximations to the expensive-in-general fPoE,α that boil down to parsing over weighted lattices. The only difference between them is how the lattice is weighted: using α log pL(mi | x) for fvari,α or using α pL(mi | x) for frisk,α.5 In the case of a unigram pL, fPoE,α is equivalent to fvari,α; otherwise fPoE,α is likely to be too expensive.
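The contrast between the two weightings can be summarized in two small functions (illustrative, not from the original system):

```python
import math

def vari_weight(posterior, alpha):
    # fvari: the path score multiplies posteriors, i.e. adds log-posteriors;
    # a near-zero posterior contributes a huge negative weight -- a soft veto.
    return alpha * math.log(posterior)

def risk_weight(posterior, alpha):
    # frisk: the path score adds posteriors; each analysis can earn at most
    # a bonus of alpha, so the morphology model's influence is bounded.
    return alpha * posterior
```

The unbounded-below log weight is what lets fvari,α act as a filter, while the bounded linear weight makes frisk,α a gentler advisor.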
To parse the weighted lattices under fvari,α and frisk,α, we use lattice parsing. Lattice parsing is a straightforward generalization of string parsing that indexes constituents by states in the lattice rather than by word interstices. At parsing time, a (max, +) lattice parser finds the best combined parse tree and path through the lattice. Importantly, the data structures used in chart parsing need not change in order to accommodate lattices. The generalization over classic Earley or CKY parsing is simple: keep in the parsing chart constituents created over a pair of a start state and an end state (instead of a start position and an end position), and (if desired) factor in weights on lattice arcs; see Hall (2005).

5Until now, we have talked about weighting word analyses, which may cover several arcs, rather than weighting arcs. In practice we apply the weight to the first arc of a word analysis, and weight the remaining arcs of that analysis with 0 (no cost or benefit), giving the desired effect.
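A minimal (max, +) lattice CKY over a binarized toy grammar might look like this; it is a sketch under the assumption that state ids are topologically ordered, and the tags and rules are hypothetical:

```python
from collections import defaultdict

def lattice_cky(arcs, start, final, lexical, binary, root="S"):
    """(max, +) CKY over a lattice: constituents are indexed by pairs of
    lattice states instead of string positions.
    arcs: (src, dst, morpheme, weight); lexical: {(tag, morpheme): log-prob};
    binary: {(lhs, rhs1, rhs2): log-prob}.  Assumes src < dst for every arc.
    Returns the best log-score of a `root` spanning start..final, or None."""
    chart = defaultdict(dict)  # chart[(i, j)][label] = best log-score
    # Leaves: a preterminal over a single lattice arc, arc weight factored in.
    for src, dst, m, w in arcs:
        for (tag, morph), lp in lexical.items():
            if morph == m:
                cell = chart[(src, dst)]
                cell[tag] = max(cell.get(tag, float("-inf")), lp + w)
    states = sorted({s for a in arcs for s in (a[0], a[1])})
    # Combine adjacent constituents, exactly as in string CKY.
    for j in states:
        for i in reversed([s for s in states if s < j]):
            for k in [s for s in states if i < s < j]:
                for (lhs, r1, r2), lp in binary.items():
                    if r1 in chart[(i, k)] and r2 in chart[(k, j)]:
                        sc = lp + chart[(i, k)][r1] + chart[(k, j)][r2]
                        if sc > chart[(i, j)].get(lhs, float("-inf")):
                            chart[(i, j)][lhs] = sc
    return chart[(start, final)].get(root)
```

Note that the chart itself is unchanged from string CKY; only the indices now name lattice states, so competing segmentations of a word simply become competing spans.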
4 Factored Models
A fair comparison of joint and pipeline parsing must make some attempt to control for the component models.
We describe here two PCFGs we used for pG(t, m) and two finite-state morphological models we used for pL(m | x).
We show how these models perform in stand-alone evaluations.
For all experiments, we used the Hebrew Treebank (Sima'an et al., 2001).
After removing traces and removing functional information from the nonterminals, we had 3,770 sentences in the training set, 371 sentences in the development set (used primarily to select the value of a) and 370 sentences in the test set.
Our first syntax model is an unbinarized PCFG trained using relative frequencies. Preterminal (POS tag → morpheme) rules are smoothed using backoff to a model that predicts the morpheme's length and letter sequence.
This grammar is remarkably good, given the limited effort that went into it.
The rules in the training set had high coverage with respect to the development set: an oracle experiment in which we maximized the number of recovered gold-standard constituents (on the development set) gave F1 accuracy of 93.7%.
In fact, its accuracy surpasses that of more complex, lexicalized models: given gold-standard morphology, it achieves 81.2% (compared to 72.0% by Bikel's parser, with head rules specified by a native speaker). This is probably attributable to the dataset's size, which makes training highly-parameterized lexicalized models precarious and prone to overfitting.
With first-order vertical Markovization (i.e., annotating each nonterminal with its parent, as in Johnson, 1998), accuracy is also 81.2%. Tuning the horizontal Markovization of the grammar rules (Klein and Manning, 2003a) had a small, adverse effect on this dataset.
Since the PCFG model was relatively successful compared to lexicalized models, and is faster to run, we decided to use a vanilla PCFG, denoted Gvan, and a parent-annotated version of that PCFG (Johnson, 1998), denoted Gv=2.
Both of our morphology models use the same morphological lexicon L, which we describe first.
In this work, a morphological analysis of a word is a sequence of morphemes, possibly with a tag for each morpheme.
There are several available analyzers for Hebrew, including Yona and Wintner (2005) and Segal (2000).
We use instead an empirically-constructed generative lexicon that has the advantage of matching the Treebank data and conventions.
If the Treebank is enriched, this would then directly benefit the lexicon and our models.
Starting with the training data from the Hebrew Treebank, we first create a set of prefixes Mp ⊆ M; this set includes any morpheme seen in a non-final position within any word. We also create a set of stems Ms ⊆ M that includes any morpheme seen in the final position of a word.
This effectively captures the morphological analysis convention in the Hebrew Treebank, where a stem is preceded by a low-entropy sequence of 0-5 prefix morphemes.
For example, MHKLB ("from the dog") is analyzed as M+H+KLB with prefixes M ("from") and H ("the") and KLB ("dog") is the stem.
In practice, | Mp| = 124 (including some conventions for numerals) and |Ms| = 13,588.
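The construction of Mp and Ms can be sketched from hypothetical segmented training tokens; this is a simplification, since the real lexicon also handles numeral conventions and memorized analyses:

```python
def build_prefix_stem_sets(segmented_tokens):
    """segmented_tokens: iterable of morpheme tuples, one per word token
    (e.g. ("M", "H", "KLB") for MHKLB).  Morphemes in non-final position
    become prefixes (Mp); final morphemes become stems (Ms)."""
    Mp, Ms = set(), set()
    for morphemes in segmented_tokens:
        Mp.update(morphemes[:-1])
        Ms.add(morphemes[-1])
    return Mp, Ms

# Toy tokens based on the examples in the text.
Mp, Ms = build_prefix_stem_sets([("M", "H", "KLB"), ("KLB",), ("L", "H", "XDR")])
```

A word's candidate analyses are then the ways of splitting it into zero or more members of Mp followed by a member of Ms.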
The morphological lexicon is then defined to contain every analysis m1:k = (m1, ..., mk) of x such that m1, ..., mk−1 ∈ Mp and mk ∈ Ms, where m1:k denotes (m1, ..., mk) and count(m1:k, x) denotes the number of occurrences of x disambiguated as m1:k in the training set.
Note that L(x) also includes any analysis of x observed in the training data.
This permits the memorization of any observed analysis that is more involved than simple segmentation (4% of word tokens in the training set; e.g., LXDR ("to the room") is analyzed as L+H+XDR).
This will have an effect on evaluation (see §5.1).
On the development data, L has 98.6% coverage.
The baseline morphology model, pLuni, first defines a joint distribution over a word and its analysis as the product of a prefix sequence model, a stem model, and a word model. The word model factors out when we conditionalize to form pLuni((m1, ..., mk) | x). The prefix sequence model is a multinomial estimated by MLE. The stem model (conditioned on the prefix sequence) is smoothed to permit any stem that is a sequence of Hebrew characters. On the development data, pLuni is 88.8% accurate (by word).
The second morphology model, pLrf, which is based on the same morphological lexicon L, uses a second-order conditional random field (Lafferty et al., 2001) to disambiguate the full sentence by modeling local contexts (Kudo et al., 2004; Smith et al., 2005b).
Space does not permit a full description; the model uses all the features of Smith et al. (2005b) except the "lemma" portion of the model, since the Hebrew Treebank does not provide lemmas.
The weights are trained to maximize the probability of the correct path through the morphological lattice, conditioned on the lattice.
This is therefore a discriminative model that defines pL(m | x) directly, though we ignore the normalization factor in parsing.
Until now we have described pL as a model of morphemes, but this CRF is trained to predict POS tags as well. We can either use the tags (i.e., label the morphological lattice with tag/morpheme pairs, so that the lattice parser finds a parse that is consistent under both models), or sum the tags out and let the parser do the tagging. One subtlety is the tagging of words not seen in the training data; for such words an unsegmented hypothesis with the tag UNKNOWN is included in the lattice and may therefore be selected by the CRF.
On the development data, pLrf is 89.8% accurate on morphology, with 74.9% fine-grained POS-tagging F1-accuracy (see §5.1).
Note on generative and discriminative models.
The reader may be skeptical of our choice to combine a generative PCFG with a discriminative CRF.
We point out that both are used to define conditional distributions over desired "output" structures given "input" sequences.
Notwithstanding the fact that the factors can be estimated in very different ways, our combination in an exact or approximate product-of-experts is a reasonable and principled approach.
5 Experiments
In this section we evaluate parsing performance; first, however, an evaluation issue must be resolved.
5.1 Evaluation Measures
The "Parseval" measures (Black et al., 1991) are used to evaluate a parser's phrase-structure trees against a gold standard.
They compute precision and recall of constituents, each indexed by a label and two endpoints.
As pointed out by Tsarfaty (2006), joint parsing of morphology and syntax renders this indexing inappropriate, since it assumes the yields of the trees are identical; that assumption is violated if there are any errors in the hypothesized m. Tsarfaty (2006) instead indexed constituents by non-whitespace character positions, to deal with segmentation mismatches. In general (and in this work) that is still insufficient, since L(x) may include analyses m that are not simply segmentations of x (see §4.2.1).
Roark et al. (2006) propose an evaluation metric for comparing a parse tree over a sentence generated by a speech recognizer to a gold-standard parse.
As in our case, the hypothesized tree could have a different yield than the original gold-standard
parse tree, because of errors made by the speech recognizer.
The metric is based on an alignment between the hypothesized sentence and the goldstandard sentence.
We used a similar evaluation metric, which additionally takes into account parallel word boundaries, information that is not naturally available in the speech recognition setting.
Given the correct m* and the hypothesis m, we use dynamic programming to find an optimal many-to-many monotonic alignment between the atomic morphemes in the two sequences.
The algorithm penalizes each violation (by a morpheme) of a one-to-one correspondence,6 and each character edit required to transform one side of a correspondence into the other (without whitespace).
Word boundaries are (here) known and included as index positions.
In the case where m = m* (or equal up to whitespace) the method is identical to Parseval (and also to Tsarfaty, 2006).
POS tag accuracy is evaluated the same way, for the same reasons; we report F1-accuracy for tagging and parsing.
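A simplified version of this alignment, restricted to one-to-one correspondences and without the many-to-many and word-boundary machinery of the actual metric, can be sketched as:

```python
def levenshtein(a, b):
    """Character edit distance (insert/delete/substitute, unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align_cost(gold, hyp):
    """Monotonic alignment cost between two morpheme sequences: matching
    two morphemes costs their character edit distance, and skipping a
    morpheme on either side costs its length (all characters unmatched).
    A sketch, not the exact metric described above, which also permits
    many-to-many correspondences anchored at word boundaries."""
    n, m = len(gold), len(hyp)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n:  # skip a gold morpheme
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + len(gold[i]))
            if j < m:  # skip a hypothesized morpheme
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + len(hyp[j]))
            if i < n and j < m:  # match a pair of morphemes
                d[i + 1][j + 1] = min(d[i + 1][j + 1],
                                      d[i][j] + levenshtein(gold[i], hyp[j]))
    return d[n][m]
```

When the two sequences are identical the cost is zero, recovering the Parseval setting.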
5.2 Experimental Comparison
In our experiment we vary four settings:
• Decoding method: fvari,α or frisk,α (§3.3).
• Syntax model: Gvan or Gv=2 (§4.1).
• Morphology model: pLuni or pLrf (§4.2).
• The relative weight α given to the morphology model.
6That is, in a correspondence of a morphemes in one string with b in the other, the penalty is a + b − 2, since one morpheme on each side is not in violation.
7One subtlety is that any arc with the unknown POS tag can be relabeled, to any other tag, by the syntax model, whose preterminal rules are smoothed. This was crucial for α → +∞ (pipeline) parsing with pLrf as the morphology model, since the parser does not recognize UNKNOWN as a tag.
Table 1: Results of experiments on Hebrew (test data, max. length 40). This table shows the performance of joint parsing (finite α; left) and a pipeline (α → +∞; right). Joint parsing with a non-unigram morphology model is too expensive (marked *). Morphological analysis accuracy (by word), fine-grained (full tags) and coarse-grained (only parts of speech) POS tagging accuracy (F1), and generalized constituent accuracy (F1) are reported; α was tuned for each of these separately. Boldface denotes figures significantly better than their counterparts in the same row, under a binomial sign test (p < 0.05). † marks the best overall accuracy and figures that are not significantly worse (binomial sign test, p < 0.05).
As α → +∞, a morphology-first pipeline is approached.
We measured four outcome values: segmentation accuracy (fraction of word tokens segmented correctly), fine- and coarse-grained tagging accuracy,8 and parsing accuracy. For tagging and parsing, F1-measures are given, according to the generalized evaluation measure described in §5.1.
Tab. 1 compares parsing with tuned α values to the pipeline.
The best results were achieved using fvari,α with the CRF and joint disambiguation. Without the CRF (using pLuni), the difference between the decoding algorithms is less apparent, suggesting an interaction between the sophistication of the components and the best way to decode with them.
These results suggest that fvari,α, which permits pL to "veto" any structure involving a morphological analysis for any word that is a posteriori unlikely (note that α log pL(mi | x) can be an arbitrarily large negative number), is beneficial as a "filter" on parses.9 frisk,α, on the other hand, is only allowed to give a "bonus" of up to α to each morphological analysis that pL believes in; its influence is therefore weaker.

8Although the Hebrew Treebank is small, its POS tagset is large (four times larger than the Penn Treebank's), because the tags encode morphological features (gender, person, and number). These features have either been ignored in prior work or encoded differently. In order for our POS-tagging figures to be reasonably comparable to previous work, we report accuracy for coarse-grained tags (only the core part of speech) as well as for the detailed Hebrew Treebank tags.
This result is consistent with the findings of Petrov et al. (2007) for another approximate parsing task.
The advantage of the parent-annotated PCFG is also more apparent when the CRF is used for morphology, and when a is tuned.
All other things equal, then, pLrf led to higher accuracy all around.
Letting the CRF help predict the POS tags helped tagging accuracy but not parsing accuracy.
While the gains over the pipeline are modest, the segmentation, fine-grained POS, and parsing accuracy scores achieved by joint disambiguation with fvari,α with the CRF are significantly better than any of the pipeline conditions.
Interestingly, if we had not tested with the CRF, we might have reached a very different conclusion about the usefulness of tuning a as opposed to a pipeline.
With the unigram morphology model, joint parsing frequently underperforms the pipeline, sometimes even significantly. The explanation, we believe, has to do with the ability of the unigram model to estimate a good distribution over analyses.

9Another way to describe this combination is to call it a product of |x| + 1 experts: one for the morphological analysis of each word, plus the grammar. The morphology experts (softly) veto any analysis that is dubious based on surface criteria, and the grammar (softly) vetoes less-grammatical parses.

Table 2: Oracle results of experiments on Hebrew (test data, max. length 40). This table shows the performance of morphological segmentation, part-of-speech tagging, coarse part-of-speech tagging, and parsing when using an oracle to select the best α for each sentence. The notation and interpretation of the numbers are the same as in Tab. 1.
While the unigram model is nearly as good as the CRF at picking the right segmentation for a word, joint parsing demands much more.
In case the best segmentation does not lead to a grammatical morpheme sequence (under the syntax model), the morphology model needs to be able to give relative strengths to the alternatives.
The unigram model is less able to do this, because it ignores the context of the word, and so the benefit of joint parsing is lost.
Most commonly, the tuned value of α is around 10 (values not shown, to preserve clarity). Because of ignored normalization constants, this does not mean that morphology is "10× more important than syntax," but it does mean that, for a particular pL and pG, tuning their relative importance in decoding can improve accuracy.
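Selecting α on development data reduces, in the simplest case, to a grid search; parse_and_score is a hypothetical function that runs the joint decoder with a given morphology weight and returns dev-set F1:

```python
def tune_alpha(parse_and_score, grid=(0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0)):
    """Return the alpha from `grid` that maximizes the dev-set score.
    `parse_and_score(alpha)` is assumed to decode the whole dev set with
    morphology weight alpha and return an accuracy (hypothetical interface)."""
    return max(grid, key=parse_and_score)

# Toy stand-in objective peaking near alpha = 10, as observed in the text.
best = tune_alpha(lambda a: -(a - 10.0) ** 2)
```

The grid values here are illustrative; in practice the grid and its resolution would themselves be chosen on development data.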
In Tab. 2 we show how performance would improve if the oracle value of α were selected for each test-set sentence; this further highlights the potential impact of perfecting the tradeoff between the models. Of course, selecting α automatically at test time, per sentence, is an open problem.
To our knowledge, the parsers we have described represent the state-of-the-art in Modern Hebrew parsing.
The closest result is Tsarfaty (2006), which we have not directly replicated. Tsarfaty's model is essentially a pipeline application of fPoE,∞ with a grammar like Gvan.
Her work focused more on the interplay between the segmentation and POS tagging models and the amount of information passed to the parser.
Some key differences preclude direct comparison: we modeled fine-grained tags (though we report both kinds of tagging accuracy), we employed a richer morphological lexicon (permitting analyses that are not just segmentations), and we used a different training/test split and length filter (we used longer sentences).
Nonetheless, our conclusions support the argument in Tsarfaty (2006) for more integrated parsing methods.
We conclude that tuning the relative importance of the two models—rather than pipelining to give one infinitely more importance—can provide an improvement on segmentation, tagging, and parsing accuracy.
This suggests that future parsing efforts for languages with rich morphology might continue to assume separately-trained (and separately-improved) morphology and syntax components, which would stand to gain from joint decoding.
In our experiments, better morphological disambiguation was crucial to getting any benefit from joint decoding.
Our result also suggests that exploring new, fully-integrated models (and training methods for them) may be advantageous.
6 Conclusion
We showed that joint morpho-syntactic parsing can improve the accuracy of both kinds of disambiguation.
Several efficient parsing methods were presented, using factored state-of-the-art morphology and syntax models for the language under consideration.
We demonstrated state-of-the-art performance on and consistent improvements across many settings for Modern Hebrew, a morphologically-rich language with a relatively small treebank.
