This paper focuses on the evaluation of methods for the automatic acquisition of Multiword Expressions (MWEs) for robust grammar engineering.
First we investigate the hypothesis that MWEs can be detected by the distinct statistical properties of their component words, regardless of their type, comparing 3 statistical measures: mutual information (MI), x2 and permutation entropy (PE).
Our overall conclusion is that at least two measures, MI and PE, seem to differentiate MWEs from non-MWEs.
We then investigate the influence of the size and quality of different corpora, using the BNC and the Web search engines Google and Yahoo.
We conclude that, in terms of language usage, web generated corpora are fairly similar to more carefully built corpora, like the BNC, indicating that the lack of control and balance of these corpora are probably compensated by their size.
Finally, we show a qualitative evaluation of the results of automatically adding extracted MWEs to existing linguistic resources.
We argue that such a process improves qualitatively, if a more compositional approach to grammar/lexicon automated extension is adopted.
1 Introduction
The task of automatically identifying Multiword Expressions (MWEs) like phrasal verbs (break down) and compound nouns (coffee machine) using statistical measures has been the focus of considerable investigative effort, (e.g. Pearce (2002), Evert and Krenn (2005) and Zhang et al. (2006)).
Given the heterogeneousness of the different phenomena that are considered to be MWEs, there is no consensus about which method is best suited for which type of MWE, and if there is a single method that can be successfully used for any kind of MWE.
Another difficulty for work on MWE identification is that of the evaluation of the results obtained (Pearce, 2002; Evert and Krenn, 2005), starting from the lack of consensus about a precise definition for MWEs (Villavicencio et al.,
2005).
In this paper we investigate some of the issues involved in the evaluation of automatically extracted MWEs, from their extraction to their subsequent use in an NLP task.
In order to do that, we present a discussion of different statistical measures, and the influence that the size and quality of different data sources have.
We then perform a comparison of these measures and discuss whether there is a single measure that has good overall performance for MWEs in general, regardless of their type.
Finally, we perform a qualitative evaluation of the results of adding automatically extracted MWEs to a linguistic resource, taking as basis for the evaluation the approach proposed by Zhang et al. (2006).
We argue that such results can improve in quality if a more compositional approach to MWE encoding is adopted for the grammar extension.
Having more accurate means of deciding for an appropriate method for identifying and incorporating MWEs is critical for maintaining the quality of linguistic resources for precise NLP.
This paper starts with a discussion of MWEs (§ 2), of their coverage in linguistic resources (§ 3), and of some methods proposed for automatically identifying them (§ 4).
This is followed by a detailed investigation and comparison of measures for MWE identification (§ 5).
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1034-1043, Prague, June 2007.
©2007 Association for Computational Linguistics
After that we present an approach for predicting appropriate lexico-syntactic categories for their inclusion in a linguistic resource, and an evaluation of the results in a parsing task(§ 7).
We finish with some conclusions and discussion of future work.
2 Multiword Expressions
The term Multiword Expressions has been used to describe expressions for which the syntactic or semantic properties of the whole expression cannot be derived from its parts (Sag et al., 2002), including a large number of related but distinct phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter), and many others.
Jackendoff (1997) estimates the number of MWEs in a speaker's lexicon to be comparable to the number of single words.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002).
For instance, some MWEs are fixed, and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as spill beans (spill several/musical/mountains of beans).
Sag et al. (2002) discuss two main approaches commonly employed in NLP for treating MWEs: the words-with-spaces approach models an MWE as a single lexical entry and it can adequately capture fixed MWEs like by and large.
A compositional approach treats MWEs by general and compositional methods of linguistic analysis, being able to capture more syntactically flexible MWEs, like rock boat, which cannot be satisfactorily captured by a words-with-spaces approach, since it would require lexical entries to be added for all the possible variations of an MWE (e.g. rock/rocks/rocking this/that/his... boat).
Therefore, to provide a unified account for the detection and encoding of these distinct but related phenomena is a real challenge for NLP systems.
3 Grammar and Lexicon Coverage in Deep Processing
Many NLP tasks and applications, like Parsing and Machine Translation, depend on large-scale linguistic resources, such as electronic dictionaries and grammars for precise results.
Several substantial resources exist: e.g., hand-crafted large-scale grammars like the English Resource Grammar (ERG - Flickinger (2000)) and the Dutch Alpino Grammar (Bouma et al., 2001).
Unfortunately, the construction of these resources is the manual result of human efforts and therefore likely to contain errors of omission and commission (Briscoe and Carroll, 1997).
Furthermore, due to the open-ended and dynamic nature of languages, such linguistic resources are likely to be incomplete, and manual encoding of new entries and constructions is labour-intensive and costly.
Take, for instance, the coverage test results for the ERG (a broad-coverage precision HPSG grammar for English) on the British National Corpus (BNC).
Baldwin et al. (2004), among many others, have investigated the main causes of parse failure, parsing a random sample of 20,000 strings from the written component of the BNC using the ERG.
They have found that the large majority of failures is caused by missing lexical entries, with 40% of the cases, and missing constructions, with 39%, where missing MWEs accounted for 8% of total errors.
That is, even by a margin, the lexical coverage is lower than the grammar construction coverage.
This indicates the acute need for robust (semi-)automated ways of acquiring lexical information for MWEs, and this is the one of the goals of this work.
In the next section we discuss some approaches that have been developed in recent years to (semi-)automatically detect and/or repair lexical and grammar errors in linguistic grammars and/or extend their coverage.
4 Acquiring MWEs
instance, Baldwin and Villavicencio (2002) proposed a combination of methods to extract Verb-Particle Constructions (VPCs) from unanno-tated corpora, that in an evaluation on the Wall Street Journal achieved 85.9% precision and 87.1% recall.
Nicholson and Baldwin (2006) investigated the prediction of the inherent semantic relation of a given compound nominaliza-tion using as statistical measure the confidence interval.
On the other hand, Zhang et al. (2006) looked at MWEs in general investigating the semi-automated detection of MWE candidates in texts using error mining techniques and validating them using a combination of the World Wide Web as a corpus and some statistical measures.
6248 sentences were then extracted from the BNC; these contained at least one of the 311 MWE candidates verified with World Wide Web in the way described in Zhang et al. (2006).
For each occurrence of the MWE candidates in this set of sentences, the lexical type predictor proposed in Zhang and Kordoni (2006) predicted a lexical entry candidate.
This resulted in 373 additional MWE lexical entries for the ERG grammar using a words-with-spaces approach.
As reported in Zhang et al. (2006), this addition to the grammar resulted in a significant increase in grammar coverage of 14.4%.
However, no further evaluation was done of the results of the measures used on the identification of MWEs or of the resulting grammar, as not all MWEs can be correctly handled by the simple words-with-spaces approach (Sag et al., 2002).
And these are the starting points of the work we are reporting on here.
5 Evaluation of the Identification of
One way of viewing the MWE identification task is, given a list of sequences of words, to distinguish those that are genuine MWEs (e.g. in the red ), from those that are just sequences of words that do not form any kind of meaningful unit (e.g. of alcohol and).
In order to do that, one commonly used approach is to employ statisti-
cal measures (e.g. Pearce (2002) for collocations and Zhang et al. (2006) for MWEs in general).
When dealing with statistical analysis there are two important statistical questions that should be addressed: How reliable is the corpus used? and How precise is the chosen statistical measure to distinguish the phenomena studied?.
In this section we look at these issues, for the particular case of trigrams, by testing different corpora and different statistical measures.
For that we use 1039 trigrams that are the output of Zhang et al. (2006) error mining system, and frequencies collected from the BNC and from the World Wide Web.
The former were collected from two different portions of the BNC, namely the fragment of the BNC (BNC/) used in the error-mining experiments, and the complete BNC (from the site http://pie.usna.edu/), to test whether a larger sample of a more homogeneous and well balanced corpus improves results significantly.
For the latter we used two different search engines: Google and Yahoo, and the frequencies collected reflect the number of pages that had exact matches of the n-grams searched, using the API tools for each engine.
5.1 Comparing Corpora
A corpus for NLP related work should be a reliable sample of the linguistic output of a given language.
For this work in particular, we expect that the relative ordering in frequency for different n-grams is preserved across corpora, in the same domain (e.g. a corpus of chemistry articles).
For, if this is not the case, different conclusions are certain to be drawn from different corpora.
The first test we performed was a direct comparison of the rank plots of the relative frequency of trigrams for the four corpora.
We ranked 1039 MWE-candidate trigrams according to their occurrence in each corpus and we normalised this value by the total number of times any one of the 1039 trigrams appeared for each corpus.
These normalisation values
account for something like 0.3% of the BNC corpora, while for Google and Yahoo nothing can be said since their sizes are not reliable numbers.
Figure 1 displays the results.
The overall ranking distribution is very similar for these corpora showing the expected Zipf like behaviour in spite of their different sizes.
Figure 1: Relative frequency rank for the 1039 trigrams analysed.
Of course, the information coming from Figure 1 is not sufficient for our purposes.
The order of the trigrams could be very different inside each corpus.
Therefore a second test is needed to compare the rankings of the n-grams in each corpus.
In order to do that we measure the Kendall's t scores between corpora.
Kendall's t is a non-parametric method for estimating correlation between datasets (Press et al., 1992).
For the number of trigrams studied here the Kendall's scores obtained imply a significant correlation between the corpora with p<0.000001.
The significance indicates that the data are correlated and the null hypothesis of statistical independence is certainly disproved.
Unfortunately disproving the null hypothesis does not give much information about the degree of correlation; it only asserts that it exists.
Thus, it could be a very insignificant correlation.
In table 1, we display a more intuitive measure to estimate the correlation, the probability Q that any 2 trigrams chosen from two corpora have the same relative ordering in frequency.
This probability is related to Kendall's t through the expression Q = (1 + t)/2 .
Table 1: The probability Q of 2 trigrams having the same frequency rank order for different corpora.
The results show that the four corpora are certainly correlated, and can probably be used interchangeably to access most of the statistical properties of the trigrams.
Interestingly, a higher correlation was observed between Yahoo and Google than between BNC/ and BNC, even though BNC/ is a fragment of BNC, and therefore would be expected to have a very high correlation.
This suggests that as corpora sizes increase, so do the correlations between them, meaning that they are more likely to agree on the ranking of a given MWE.
5.2 Comparing statistical measures -are they equivalent?
Here we concentrate on a single corpus, BNC/, and compare the three statistical measures for MWE identification: Mutual Information (MI), X2 and Permutation Entropy (PE)(Zhang et al., 2006), to investigate if they order the trigrams in the same fashion.
MI and X2 are typical measures of association that compare the joint probability of occurrence of a certain group of events p(abc) with a prediction derived from the null hypothesis of statistical independence between these events p$(abc) = p(a)p(b)p(c) (Press et al., 1992).
In our case the events are the occurrences of words in a given position in an n-gram.
For a trigram with words w\w2w3, X2 is calculated as:
number of unigrams a, and N the number of words in the corpus.
Mutual Information, in terms of these numbers, is:
The third measure, permutation entropy, is a measure of order association.
Given the words w1,w2, and w3, PE is calculated in this work as:
where the sum runs over all the permutations of the indexes and, therefore, over all possible positions of the selected words in the trigram.
The probabilities are estimated from the number of occurrences of each permutation of a trigram (e.g. by and large, large by and, and large by, and by large, large and by, and by large and) as:
PE was proposed by Zhang et al. (2006) as a possible measure to detect MWEs, under the hypothesis that MWEs are more rigid to permutations and therefore present smaller PEs.
Even though it is quite different from MI and X2, PE can also be thought as an indirect measure of statistical independence, since the more independent the words are the closer PE is from its maximal value (ln6, for trigrams).
One possible advantage of this measure over the others is that it does not rely on single word counts, which are less accurate in Web based corpora.
Given the rankings produced for each one of these three measures we again use Kendall's t test to assess correlation and its significance.
Table 2 displays the Q probability of finding the same ordering in these three measures.
The general conclusion from the table is that even though there is statistical significance in the correlations found (the p values are not displayed, but they are very low as before) the different measures order the trigrams very differently.
There is a 70% chance of getting the same order from MI and X2, but it is safe to say that these measures are very different from the PE, since their Q values are very close to pure chance.
MIxPE x2xPE
Table 2: The probability Q of having 2 trigrams with the same rank order for different statistical measures.
5.3 Comparing Statistical Measures -are they useful?
The use of statistical measures is widespread in NLP but there is no consensus about how good these measures are for describing natural language phenomena.
It is not clear what exactly they capture when analysing the data.
In order to evaluate if they would make good predictors for MWEs, we compare the measures distributions for MWEs and non-MWEs.
For that we selected as gold standard a set of around 400 MWE candidates annotated by a native speaker1 as MWEs or not.
We then calculated the histograms for the values of MI, X2 and PE for the two groups.
MI and X2 were calculated only for BNC/.
Table 3 displays the results of the Kolmogorov-Smirnof test (Press et al., 1992) for these histograms, where the first value is Kolmogorov-Smirnov D value (De[0,1| and large D values indicate large differences between distributions) and the second is the significance probability (p) associated to D given the sizes of the data sets, in this case 90 for MWEs
and 292 for non-MWEs.
The surprising result is that there is no statistical significance, at least using the Kolmogorov-Smirnov test, that indicates that being or not an MWE has some effect in the value of the tri-gram's x2.
The same does not happen for MI or PE.
They do seem to differentiate between MWEs and non-MWEs.
As discussed before the statistical significance implies the existence of an
1The native speaker is a linguist expert in MWEs.
effect but has very little to say about the intensity of the effect.
As in the case of this work our interest is to use the effect to predict MWEs, the intensity is very important.
In the figures that follow we show the normalised histograms for MI, x2(for the BNCf) and PE (for the case of Yahoo) for MWEs and non-MWEs.
The ideal scenario would be to have non overlapping distributions for the two cases, so a simple threshold operation would be enough to distinguish MWEs.
This is not the case in any of the plots.
Starting from Figure 3 it clearly illustrates the negative result for X2 in table 3.
The other two distributions show a visible effect in the form of a slight displacement of the distributions to the left for MWEs.
In particular for the distribution of PE, the large peak on the right, representing the n-grams whose word order is irrelevant with respect to its occurrence, has an important reduction for MWEs.
The statistical measures discussed here are all different forms of measuring correlations between the component words of MWEs.
Therefore, as some types of MWEs may have stronger constraints on word order, we believe that more visible effects can be seen in these measures if we look at their application for individual types of MWEs, which is planned for future work.
This will bring an improvement to the power of MWE prediction of these measures.
Figure 2: Normalised histograms of MI values for MWEs and non-MWEs in BNCf.
inn.
Figure 3: Normalised histograms of X2 values for MWEs and non-MWEs in BNCf.
Figure 4: Normalised histograms of PE values for MWEs and non-MWEs in Yahoo.
6 Evaluation of the Extensions to the Grammar
Our ultimate goal is to maximally automate the process of discovering and handling MWEs.
With good statistical measures, we are able to distinguish genuine MWE from non-MWEs among the n-gram candidates.
However, from the perspective of grammar engineering, even with a good candidate list of MWEs, great effort is still required in order to incorporate such word units into a given grammar automatically and in a precise way.
Zhang et al. (2006) tried a simple "word with spaces" approach.
By acquiring new lexical entries for the MWEs candidates validated by the statistical measures, the grammar coverage was shown to improve significantly.
However, no further investigation on the parser accuracy was reported there.
Taking a closer look at the MWE candidates
proposed, we find that only a small proportion of them can be handled appropriately by the"word with spaces" approach of Zhang et al. (2006).
Simply adding new lexical entries for all MWEs can be a workaround for enhancing the parser coverage, but the quality of the parser output is clearly linguistically less interesting.
On the other hand, we also find that a large proportion of MWEs that cannot be correctly handled by the grammar can be covered properly in a constructional way by adding one lexical entry for the head (governing) word of the MWE.
For example, the expression foot the bill will be correctly handled with a standard head-complement rule, if there is a transitive verb reading for the word foot in the lexicon.
Some other examples are: to put forward, the good of, in combination with, ..., where lexical extension to the words in bold will allow the grammar to cover these MWEs.
In this paper, we employ a constructional approach for the acquisition of new lexical entries for the head words
It is arguable that such an approach may lead to some potential grammar overgeneration, as there is no selectional restriction expressed in the new lexical entry.
However, as far as the parsing task is concerned, such overgeneration is not likely to reduce the accuracy of the grammar significantly as we show later in this paper through a thorough evaluation.
6.1 Experimental Setup
With the complete list of 1039 MWE candidates
discussed in section 5, we rank each n-gram according to each of the three statistical measures.
The average of all the rankings is used as the combined measure of the MWE candidates.
Since we are only interested in acquiring new lexical entries for MWEs which are not covered by the grammar, we used the error mining results (Zhang et al., 2006; van Noord, 2004) to only keep those candidates with parsability < 0.1.
The top 30 MWE candidates are used in
2 The combination of the "word with space" approach of Zhang et al. (2006) with the constructional approach we propose here is an interesting topic that we want to investigate in future research.
this experiment.
We used simple heuristics in order to extract the head words from these MWEs:
• the n-grams are POS-tagged with an automatic tagger;
• finite verbs in the n-grams are extracted as head words;
• nouns are also extracted if there is no verb in the n-gram.
Occasionally, the tagger errors might introduce wrong head words.
However, the lexical type predictor of Zhang and Kordoni (2006) that we used in our experiments did not generate interesting new entries for them in the subsequent steps, and they were thus discarded, as discussed below.
With the 30 MWE candidates, we extracted a sub-corpus from the BNC with 674 sentences which included at least one of these MWEs.
The lexical acquisition technique described in Zhang and Kordoni (2006) was used with this subcorpus in order to acquire new lexical entries for the head words.
The lexical acquisition model was trained with the Redwoods treebank (Oepen et al., 2002), following Zhang et al. (2006).
The lexical prediction model predicted for each occurrence of the head words a most plausible lexical type in that context.
Only those predictions that occurred 5 times or more were taken into consideration for the generation of the new lexical entries.
As a result, we obtained 21 new lexical entries.
These new lexical entries were later merged into the ERG lexicon.
To evaluate the grammar performance with and without these new lexical entries, we
1. parsed the sub-corpus with/without new lexical entries and compared the grammar coverage;
2. inspected the parser output manually and evaluated the grammar accuracy.
uation of the parser output, we used the tree-banking tools of the [incr tsdb()] system (Oepen,
2001).
6.2 Grammar Performance
Table 4 shows that the grammar coverage improved significantly (from 7.1% to 22.7%) with the acquired lexical entries for the head words of the MWEs.
This improvement in coverage is largely comparable to the result reported in (Zhang et al., 2006), where the coverage was reported to raise from 5% to 18% with the "word with spaces" approach (see also section 4).
It is also worth mentioning that Zhang et al. (2006) added 373 new lexical entries for a total of 311 MWE candidates, with an average of 1.2 entries per MWE.
In our experiment, we achieved a similar coverage improvement with only 21 new entries for 30 different MWE candidates, with an average of 0.7 entries per MWE.
This suggests that the lexical entries acquired in our experiment are of much higher linguistic generality.
To evaluate the grammar accuracy, we manually checked the parser outputs for the sentences in the sub-corpus which received at least one analysis from the grammar before and after the lexical extension.
Before the lexical extension, 48 sentences are parsed, among which 32 (66.7%) sentences contain at least one correct reading (table 4).
After adding the 21 new lexical entries, 153 sentences are parsed, out of which 124 (81.0%) sentences contain at least one correct reading.
Baldwin et al. (2004) reported in an earlier study that for BNC data, about 83% of the sentences covered by the ERG have a correct parse.
In our experiment, we observed a much lower accuracy on the sub-corpus of BNC which contains a lot of MWEs.
However, after the lexical extension, the accuracy of the grammar recovers to the normal level.
It is also worth noticing that we did not receive a larger average number of analyses per sentence (table 4), as it was largely balanced by the significant increase of sentences covered by the new lexical entries.
We also found that the disambiguation model as described by
Toutanova et al. (2002) performed reasonably well, and the best analysis is ranked among top-5 for 66% of the cases, and top-10 for 75%.
All of these indicate that our approach of lexical acquisition for head words of MWEs achieves a significant improvement in grammar coverage without damaging the grammar accuracy.
Optionally, the grammar developers can check the validity of the lexical entries before they are added into the lexicon.
Nonetheless, even a semi-automatic procedure like this can largely reduce the manual work of grammar writers.
7 Conclusions
In this paper we looked at some of the issues involved in the evaluation of the identification of MWEs.
In particular we evaluated the use of three statistical measures for automatically identifying MWEs.
The results suggest that at least two of them (MI and PE) can distinguish MWEs.
In terms of the corpora used, a surprisingly higher level of agreement was found between different corpora (Google and Yahoo) than between two fragments of the same one.
This tells us two lessons.
First that even though Google and Yahoo were not carefully built to be language corpora their sizes compensate for that making them fairly good samples of language usage.
Second, a fraction of a smaller well balanced corpus may not necessarily be as balanced as the whole.
Furthermore, we argued that for precise grammar engineering it is important to perform a careful evaluation of the effects of including automatically acquired MWEs to a grammar.
We looked at the evaluation of the effects in coverage, size of the grammar and accuracy of the parses after adding the MWE-candidates.
We adopted a compositional approach to the encoding of MWEs, using some heuristics to detect the head of an MWE, and this resulted in a smaller grammar than that by Zhang et al. (2006), still achieving a similar increase in coverage and maintaining a high level of accuracy of parses, comparable to that reported by Baldwin
et al. (2004).
The statistical measures are currently only
parsed ff
avg. analysis ff
coverage %
Table 4: ERG coverage with/without lexical acquisition for the head words of MWEs
used in a preprocessing step to filter the non-MWEs for the lexical type predictor.
Alternatively, the statistical outcomes can be incorporated more tightly, i.e. to combine with the lexical type predictor and give confidence scores on the resulting lexical entries.
These possibilities will be explored in future work.
