In this paper we explore the use of selectional preferences for detecting non-compositional verb-object combinations.
To characterise the arguments in a given grammatical relationship we experiment with three models of selectional preference.
Two use WordNet and one uses the entries from a distributional thesaurus as classes for representation.
In previous work on selectional preference acquisition, the classes used for representation are selected according to the coverage of argument tokens rather than the coverage of argument types.
In our distributional thesaurus models, and in one of the methods using WordNet, we select the classes that represent the preferences by virtue of the number of argument types that they cover; only the tokens under these classes which are representative of the argument head data are then used to estimate the probability distribution for the selectional preference model.
We demonstrate a highly significant correlation between measures which use these 'type-based' selectional preferences and compositionality judgements from a data set used in previous research.
The type-based models perform better than the models which use tokens for selecting the classes.
Furthermore, the models which use the automatically acquired thesaurus entries produced the best results.
The correlation for the thesaurus models is stronger than any of the individual features used in previous research on the same dataset.
1 Introduction
There has been considerable recent interest in the compositionality of phrases (Baldwin et al., 2003; McCarthy et al., 2003; Bannard, 2005; Venkatapathy and Joshi, 2005).
Typically the phrases are putative multiwords and non-compositionality is viewed as an important feature of many such "words with spaces" (Sag et al., 2002).
For applications such as paraphrasing, information extraction and translation, it is essential to take the words of non-compositional phrases together as a unit because the meaning of a phrase cannot be obtained straightforwardly from the constituent words.
In this work we investigate methods of determining the semantic compositionality of verb-object1 combinations on a continuum, following previous research in this direction (McCarthy et al., 2003; Venkatapathy and Joshi, 2005).
Much previous research has used a combination of statistics and distributional approaches whereby distributional similarity is used to compare the constituents of the multiword with the multiword itself.
In this paper, we will investigate the use of selectional preferences of verbs.
We will use the preferences to find atypical verb-object combinations as we anticipate that such combinations are more likely to be non-compositional.
1We use object to refer to direct objects.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 369-379, Prague, June 2007.
©2007 Association for Computational Linguistics
Selectional preferences of predicates have been modelled using the man-made thesaurus WordNet (Fellbaum, 1998), see for example (Resnik, 1993; Li and Abe, 1998; Abney and Light, 1999; Clark and Weir, 2002).
There are also distributional approaches which use co-occurrence data to cluster distributionally similar words together.
The cluster output can then be used as classes for selectional preferences (Pereira et al., 1993), or frequency information from distributionally similar words can be used directly for smoothing (Grishman and Sterling, 1994).
We used three different types of probabilistic models, which vary in the classes selected for representation over which the probability distribution of the argument heads2 is estimated.
Two use WordNet and the other uses the entries in a thesaurus of distributionally similar words acquired automatically following (Lin, 1998).
The first method is due to Li and Abe (1998).
The classes over which the probability distribution is calculated are selected according to the minimum description length principle (mdl) which uses the argument head tokens for finding the best classes for representation.
This method has previously been tried for modelling compositionality of verb-particle constructions (Bannard, 2002).
The other two methods (we refer to them as 'type-based') also calculate a probability distribution using argument head tokens but, in contrast to previous WordNet models (Resnik, 1993; Li and Abe, 1998; Clark and Weir, 2002), they select the classes over which the distribution is calculated using the number of argument head types (of a verb in a corpus) in a given class, rather than the number of argument head tokens.
For example, if the object slot of the verb park contains the argument heads { car, car, car, car, van, jeep } then the type-based models use the word type "car" only once when determining the classes over which the probability distribution is to be estimated.
Classes are selected which maximise the number of types that they cover, rather than the number of tokens.
This is done to avoid the selectional preferences being heavily influenced by noise from highly frequent arguments which may be polysemous and have meanings that are not
2Argument heads are the nouns occurring in the object slot of the target verb.
semantically related to the 'prototypical' arguments of the verb.
For example car has a gondola sense in WordNet.
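To make the contrast between type-based and token-based class selection concrete, here is a small Python sketch. The class inventory is made up for illustration (the names mimic WordNet senses of car but are not real synset identifiers):

```python
from collections import Counter

# Hypothetical toy inventory: each noun maps to the classes it
# belongs to (illustrative stand-ins for WordNet synsets).
CLASSES = {
    "car":  ["motor_vehicle", "gondola", "elevator_car"],
    "van":  ["motor_vehicle"],
    "jeep": ["motor_vehicle"],
}

def class_scores(argument_heads, by="types"):
    """Score candidate classes by the argument heads they cover.

    by="types"  counts each argument head word once (type-based);
    by="tokens" counts every occurrence (token-based).
    """
    heads = set(argument_heads) if by == "types" else argument_heads
    scores = Counter()
    for noun in heads:
        for cls in CLASSES.get(noun, []):
            scores[cls] += 1
    return scores

objects_of_park = ["car"] * 174 + ["van"] * 11 + ["jeep"] * 8
by_type = class_scores(objects_of_park, by="types")
by_token = class_scores(objects_of_park, by="tokens")
# Type-based scoring credits gondola with only one type (car),
# whereas token-based scoring credits it with all 174 car tokens.
```

Under type counting, motor_vehicle covers three types while gondola covers one, so the frequency of car cannot drag its unrelated senses into the model.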
The third method uses entries in a distributional thesaurus rather than classes from WordNet.
The entries used as classes for representation are selected by virtue of the number of argument types they encompass.
As with the WordNet models, the tokens are used to estimate a probability distribution over these entries.
In the next section, we discuss related work on identifying compositionality.
In section 3, we describe the methods we are using for acquiring our models of selectional preference.
In section 4, we test our models on a dataset used in previous research.
We compare the three types of models individually and also investigate the best performing model when used in combination with other features used in previous research.
We conclude in section 5.
2 Related Work
Most previous work using distributional approaches to compositionality either contrasts distributional information of candidate phrases with that of their constituent words (Schone and Jurafsky, 2001; Bannard et al., 2003) or uses distributionally similar words to detect non-productive phrases (Lin, 1999).
Lin (1999) used his method (Lin, 1998) for automatic thesaurus construction.
He identified candidate phrases involving several open-class words output from his parser and filtered these by the log-likelihood statistic.
Lin proposed that if there is a phrase obtained by substitution of either the head or modifier in the phrase with a 'nearest neighbour' from the thesaurus then the mutual information of this and the original phrase must be significantly different for the original phrase to be considered non-compositional.
He evaluated the output manually.
As well as distributional similarity, researchers have used a variety of statistics as indicators of non-compositionality (Blaheta and Johnson, 2001; Krenn and Evert, 2001).
Fazly and Stevenson (2006) use statistical measures of syntactic behaviour to gauge whether a verb and noun combination is likely to be an idiom.
Although they are not specifically detecting compositionality, there is a strong correlation between syntactic rigidity and semantic idiosyncrasy.
Venkatapathy and Joshi (2005) combine different statistical and distributional methods using support vector machines (svms) for identifying non-compositional verb-object combinations.
They explored seven features as measures of compositionality:

1. frequency of the verb-object pair

2. pointwise mutual information (Church and Hanks, 1990)

3. least mutual information difference with similar collocations, based on (Lin, 1999) and using Lin's thesaurus (Lin, 1998) for obtaining the similar collocations

4. the distributed frequency of an object, which takes an average of the frequency of occurrence with an object over all verbs occurring with the object above a threshold

5. the distributed frequency of an object, using the verb, which considers the similarity between the target verb and the verbs occurring with the target object above the specified threshold

6. an lsa model of the similarity between the verb-object pair and the verb

7. the same lsa approach, but considering the similarity of the verb-object pair with the verbal form of the object (to capture support verb constructions, e.g. give a smile)
We say more about this dataset and Venkatapathy and Joshi's results in section 4 since we use the dataset for our experiments.
In this paper, we investigate the use of selectional preferences to detect compositionality.
Bannard (2002) did some pioneering work trying to establish a link between the compositionality of verb-particle constructions and the selectional preferences of the multiword and its constituent verb.
His results were hampered by the models, based on (Li and Abe, 1998), which produced rather uninformative classes at the roots of WordNet.
There are several reasons for this.
The classes for the model are selected using mdl by compromising between a simple model with few classes and one which explains the data well.
The models are particularly affected by the quantity of data available (Wagner, 2002).
Also noise from frequent but idiosyncratic or polysemous arguments weakens the signal.
There is scope for experimenting with other approaches, such as that of Clark and Weir (2002). However, we feel a type-based approach is worthwhile: it avoids the noise introduced by frequent but polysemous arguments, and the bias from highly frequent arguments which might be part of a multiword rather than a prototypical argument of the predicate in question, for example eat hat.
In contrast to Bannard, our experiments are with verb-object combinations rather than verb-particle constructions.
We compare Li and Abe models with WordNet models which use the number of argument types to obtain the classes for representation of the selectional preferences.
In addition to experiments with these WordNet models, we propose models using entries in distributional thesauruses for representing preferences.
3 Three Methods for Acquiring Selectional Preferences
All models were acquired from verb-object data extracted using the rasp parser (Briscoe and Carroll, 2002) from the 90 million words of written English from the bnc (Leech, 1992).
We extracted verb and common noun tuples where the noun is the argument head of the object relation.
The parser was also used to extract the grammatical relation data used for acquisition of the thesaurus described below in section 3.3.
3.1 WordNet Tree Cut Models (tcms)

This approach is a reimplementation of Li and Abe (1998).
Each selectional preference model (referred to as a tree cut model, or tcm) comprises a set of disjunctive noun classes selected from all the possibilities in the WordNet hyponym hierarchy3 using mdl (Rissanen, 1978).
The tcm covers all the
3We use WordNet version 2.1 for the work in this paper.
noun senses in the WordNet hierarchy and is associated with a probability distribution over these noun senses, reflecting the argument head data occurring in the given grammatical relationship with the specified verb. mdl finds the classes in the tcm by considering the cost, measured in bits, of describing both the model and the argument head data encoded in the model.
A compromise is made by having as simple a model as possible using classes further up the hierarchy whilst also providing a good model for the set of argument head tokens (TK).
The classes are selected by recursing from the top of the WordNet hierarchy comparing the cost (or description length) of using the mother class to the cost of using the hyponym daughter classes.
In any path, the mother is preferred unless using the daughters would reduce the cost.
If using the daughters for the model is less costly than the mother then the recursion continues to compare the cost of the hyponyms beneath.
The cost (or description length) for a set of classes is calculated as the sum of the model description length (mdl) and the data description length (ddl):

cost = mdl + ddl, where mdl = (k/2) x log |TK| and ddl = - Σ_{tk ∈ TK} log p(tk)

where k is the number of WordNet classes currently being considered for the tcm, minus one.
The mdl method uses the size of TK on the assumption that a larger dataset warrants a more detailed model.
The cost of describing the argument head data is calculated using the log of the probability estimate from the classes currently being considered for the model.
The probability estimate for a class being considered for the model is calculated using the cumulative frequency of all the hyponym nouns under that class that occur in TK, divided by the number of noun senses that these nouns have, to account for their polysemy.
This cumulative frequency is also divided by the total number of noun hyponyms under that class in WordNet to obtain a smoothed estimate for all nouns under the class.
The probability of the class is obtained by dividing this frequency estimate by the total frequency of the argument heads.
The algorithm is described fully by Li and Abe (1998).
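As a rough Python sketch (not the full recursion), the two cost terms can be computed as follows, assuming the per-token class probabilities have already been estimated; the function names and interfaces are our own, not Li and Abe's:

```python
import math

def description_length(n_classes, token_probs):
    """Cost in bits of a candidate set of classes: the model
    description length ((k/2) * log of the data size, with k the
    number of classes minus one) plus the data description length
    (negative log probability of the argument head tokens)."""
    k = n_classes - 1
    model_dl = (k / 2.0) * math.log2(len(token_probs))
    data_dl = -sum(math.log2(p) for p in token_probs)
    return model_dl + data_dl

def choose(mother_cost, daughters_cost):
    """The cut recursion keeps the mother class unless replacing it
    with its hyponym daughters reduces the total description length."""
    return "daughters" if daughters_cost < mother_cost else "mother"
```

The trade-off is visible in the two terms: splitting a class into daughters raises the model description length (larger k) but can lower the data description length by fitting the tokens more closely.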
4See (Li and Abe, 1998) for a full explanation.

Figure 1: A portion of the tcm for the objects of park. [Figure not reproduced: it showed classes including location, way and self-propelled_vehicle, with example nouns such as mile, street and lane.]
A small portion of the tcm for the object slot of park is shown in figure 1.
WordNet classes are displayed in boxes with a label which best reflects the meaning of the class.
The probability estimates are shown for the classes on the tcm. Examples of the argument head data are displayed below the WordNet classes with dotted lines indicating membership at a hyponym class beneath these classes.
We cannot show the full tcm due to lack of space, but we show some of the higher probability classes which cover some typical nouns that occur as objects of park.
Note that the probability under the classes abstract-entity, way and location arises because of a systematic parsing error whereby adverbials, such as distance in park illegally some distance from the railway station, are identified by the parser as objects.
Systematic noise from the parser has an impact on all the selectional preference models described in this paper.
3.2 WordNet Prototypical Models (wnprotos)

We propose a method of acquiring selectional preferences which, instead of covering all the noun senses in WordNet, gives a probability distribution over a portion of prototypical classes; we refer to these models as wnprotos.
A wnproto consists of classes within the noun hierarchy which have the highest proportion of word types occurring in the argument head data, rather than using the number of tokens (frequency) as is done for the tcms. This allows less frequent but potentially informative arguments to have some bearing on the models acquired, and reduces the impact of highly frequent but polysemous arguments.
We then used the frequency data to populate these selected classes.
The classes (C) in the wnproto are selected from those which include at least a threshold of 2 argument head types5 occurring in the training data.
Each argument head in the training data is disambiguated according to whichever of the WordNet classes it occurs at or under which has the highest 'type ratio'.
Let TY be the set of argument head types in the object slot of the verb for which we are acquiring the preference model.
The type ratio for a class (c) is the ratio of the noun types (ty ∈ TY) attested in the training data at or beneath that class in WordNet to the total number of noun types listed at or beneath that class in WordNet. Each attested argument type is discounted by the number of WordNet classes (classes(ty)) that the noun belongs to, to account for polysemy in the training data:

type ratio(c) = ( Σ_{ty ∈ TY at or beneath c} 1 / |classes(ty)| ) / |{noun types at or beneath c in WordNet}|
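A minimal Python rendering of this ratio; the class membership set and sense counts below are illustrative stand-ins for the WordNet lookups:

```python
def type_ratio(cls_members, senses, arg_types):
    """Type ratio for a WordNet-style class.

    cls_members: set of all noun types listed at or beneath the class
    senses:      noun -> number of classes the noun belongs to
    arg_types:   the set TY of argument head types for the verb

    Each attested type counts 1/|classes(ty)| to discount polysemy;
    the sum is divided by the total number of nouns under the class.
    """
    attested = arg_types & cls_members
    weighted = sum(1.0 / senses[ty] for ty in attested)
    return weighted / len(cls_members)

# Toy illustration (memberships and sense counts are made up):
motor_vehicle = {"car", "van", "jeep", "truck"}
senses = {"car": 5, "van": 3, "jeep": 1}
TY = {"car", "van", "jeep"}
ratio = type_ratio(motor_vehicle, senses, TY)
```

Note how the monosemous jeep contributes a full point while the five-way polysemous car contributes only a fifth, so a class full of unambiguous attested types scores highest.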
If more than one class has the same type ratio then the argument is not used for calculating the probability of the preference model.
In this way, only arguments that can be disambiguated are used for calculating the probability distribution.
The advantage of using the type ratio to determine the classes used to represent the model and to disambiguate the arguments is that it prevents high frequency verb noun combinations from masking the information from prototypical but low frequency arguments.
We wish to use classes which are as representative of the argument head types as possible to help detect when an argument head is not related to these classes and is therefore more likely to be non-compositional.
For example, the class motor_vehicle is selected for the wnproto model of the object slot of park even though there are 5 meanings of car in WordNet, including elevator_car and gondola.
There are 174 occurrences of car, which overwhelms the frequency of the other objects (e.g. van 11, vehicle 8), but by looking for classes with a high proportion of types (rather than word tokens), car is disambiguated appropriately and the class motor_vehicle is selected for representation.
5We have experimented with a threshold of 3 and obtained similar results.
Figure 2: Part of the wnproto for the object slot of park. [Figure not reproduced: it showed classes including physical_entity, self-propelled_vehicle, transport and wheeled_vehicle.]
The relative frequency of each class is obtained from the set of disambiguated argument head tokens and used to provide the probability distribution over this set of classes.
Note that in wnproto, classes can be subsumed by others in the hyponym hierarchy.
The probability assigned to a class is applicable to any descendants in the hyponym hierarchy, except those within any hyponym classes within the wnproto.
The algorithm for selecting C and calculating the probability distribution is shown as Algorithm 1.
Note that we use brackets for comments.
In figure 2 we show a small portion of the wn-proto for park.
Again, WordNet classes are displayed in boxes with a label which best reflects the meaning of the class.
The probability estimates are shown in the boxes for all the classes included in the wnproto.
The classes in the wnproto model are shown with dashed lines.
Examples of the argument head data are displayed below the WordNet classes with dotted lines indicating membership at a hyponym class beneath these classes.
We cannot show the full wnproto due to lack of space, but we show some of the classes with higher probability which cover some typical nouns that occur as objects of park.
Algorithm 1 WNPROTO algorithm

for all c ∈ C do
  if c has fewer than two disambiguated nouns then
    remove c from C {classes with less than two disambiguated nouns are removed}
  end if
end for
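Assuming the type ratios have been precomputed and each noun's candidate WordNet classes are known, the selection, disambiguation and probability estimation of Algorithm 1 can be sketched in Python (the function and its inputs are our own simplification, not the paper's exact procedure):

```python
from collections import defaultdict

def wnproto(arg_freqs, noun_classes, type_ratio):
    """Sketch of WNPROTO acquisition.

    arg_freqs:    noun -> token frequency in the object slot
    noun_classes: noun -> candidate classes the noun falls under
    type_ratio:   class -> precomputed type ratio

    Each argument head is disambiguated to its highest-type-ratio
    class; ties are discarded; classes left with fewer than two
    disambiguated noun types are removed.
    """
    members = defaultdict(list)
    for noun in arg_freqs:
        cands = noun_classes.get(noun, [])
        if not cands:
            continue
        best = max(type_ratio[c] for c in cands)
        top = [c for c in cands if type_ratio[c] == best]
        if len(top) == 1:  # nouns with tied classes are not used
            members[top[0]].append(noun)
    # classes with fewer than two disambiguated noun types are removed
    members = {c: ns for c, ns in members.items() if len(ns) >= 2}
    total = sum(arg_freqs[n] for ns in members.values() for n in ns)
    return {c: sum(arg_freqs[n] for n in ns) / total
            for c, ns in members.items()}
```

The returned mapping is the relative-frequency distribution over the selected classes, estimated only from tokens that were successfully disambiguated.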
Algorithm 2 DSPROTO algorithm

fD = 0 {frequency of disambiguated items}
TY = argument head types {nouns occurring as objects of verb, with associated frequencies}
order C1 by num-types-in-thesaurus(cty, TY) {classes ordered by coverage of argument head types}
for all cty in ordered C1 do
  for all ty ∈ TY in the thesaurus entry of cty do
    if ty not yet disambiguated then
      add ty to Dcty; fD += freq(ty) {types disambiguated to this class only if not disambiguated by a class used already}
    end if
  end for
end for
p(cty) = Σ_{ty ∈ Dcty} freq(ty) / fD {calculating class probabilities}
3.3 Distributional Thesaurus Prototypical Models (dsprotos)

We use a thesaurus acquired using the method proposed by Lin (1998).
For input we used the grammatical relation data from automatic parses of the BNC.
For each noun we considered the co-occurring verbs in the object and subject relation, the modifying nouns in noun-noun relations and the modifying adjectives in adjective-noun relations.
Each thesaurus entry consists of the target noun and the 50 most similar nouns, according to Lin's measure of distributional similarity, to the target.
The argument head noun types (TY) are used to find the entries in the thesaurus as the 'classes' (C) of the selectional preference for a given verb.
As with WNPROTOs, we only cover argument types which form coherent groups with other argument types since we wish i) to remove noise and ii) to be able to identify argument types which are not related with the other types and therefore may be non-compositional.
As our starting point, we only consider an argument type as a class for C if its entry in the thesaurus covers at least a threshold of 2 types6.
To select C we use a best first search.
This method processes each argument type in TY in order of the number of the other argument types from TY that it has in its thesaurus entry of 50 similar nouns.
An argument head is selected as a class for C (cty ∈ C)7 if it covers at least 2 of the argument heads that are not in the thesaurus entries of any of the classes already selected for C. Each argument head is disambiguated by whichever class in C it is listed under in the thesaurus which has the largest number of the TY in its thesaurus entry.
When the algorithm finishes processing the ordered argument heads to select C, all argument head types are disambiguated by C apart from those which, after disambiguation, occur in isolation in a class without other argument types.
Finally, a probability distribution over C is estimated using the frequency (tokens) of argument types that occur in the thesaurus entries for any cty ∈ C. If an argument type occurs in the entry of more than one cty, then it is assigned to whichever of these has the largest number
6As with the wnprotos, we experimented with a value of 3 for this threshold and obtained similar results.
7We use cty for the classes of the DSproto.
These classes are simply groups of nouns which occur under the entry of a noun (ty) in the thesaurus.
disambiguated objects (freq): car (174), van (11), vehicle (8), ... | street (5), distance (4), mile (1), ... | corner (4), lane (3), door (1) | backside (2), bum (1), butt (1), ...

Figure 3: First four classes of the DSPROTO model for park
of disambiguated argument head types and its token frequency is attributed to that class.
We show the algorithm as Algorithm 2.
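A simplified Python sketch of Algorithm 2 follows. Ties in the ordering are broken alphabetically here for determinism, a detail the text does not specify, and the best-first disambiguation is compressed into the processing order:

```python
def dsproto(arg_freqs, thesaurus, min_types=2):
    """Sketch of DSPROTO acquisition.

    arg_freqs: noun -> token frequency in the object slot of the verb
    thesaurus: noun -> set of its most similar nouns (Lin, 1998)

    Candidate classes (argument types) are processed in order of how
    many argument types their thesaurus entry covers; a class is kept
    if it covers >= min_types argument heads not already covered by
    previously selected classes.
    """
    TY = set(arg_freqs)

    def covers(ty):
        # a type's entry plus the type itself, restricted to TY
        return (thesaurus.get(ty, set()) | {ty}) & TY

    members, covered = {}, set()
    # order by coverage; ties broken alphabetically for determinism
    for ty in sorted(TY, key=lambda t: (len(covers(t)), t), reverse=True):
        new = covers(ty) - covered
        if len(new) >= min_types:
            members[ty] = new
            covered |= new
    total = sum(arg_freqs[n] for ns in members.values() for n in ns)
    return {c: sum(arg_freqs[n] for n in ns) / total
            for c, ns in members.items()}
```

Unlike the WordNet models, the only resources needed here are the frequency counts and the automatically acquired thesaurus entries.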
The algorithms for WNPROTO (algorithm 1) and DSPROTO (algorithm 2) differ because of the nature of the inventories of candidate classes (WordNet and the distributional thesaurus).
There are a great many candidate classes in WordNet.
The WNPROTO algorithm selects the classes from all those that the argument heads belong to, directly or indirectly, by looping over all argument types to find the class that disambiguates each: the class with the largest type ratio, calculated using the undisambiguated argument heads.
The DSPROTO only selects classes from the fixed set of argument types.
The algorithm loops over the argument types with at least two argument heads in the thesaurus entry and ordered by the number of undisambiguated argument heads in the thesaurus entry.
This is a best first search to minimise the number of argument heads used in C but maximise the coverage of argument types.
In figure 3, we show part of a DSPROTO model for the object of park.8 Note again that the class mile arises because of a systematic parsing error whereby adverbials, such as distance in park illegally some distance from the railway station, are identified by the parser as objects.
4 Experiments
Venkatapathy and Joshi (2005) produced a dataset of verb-object pairs with human judgements of com-positionality.
They obtained values of rs between 0.111 and 0.300 by individually applying the 7 features described above in section 2.
The best correlation was given by feature 7 and the second best was feature 3.
They combined all 7 features using svms, splitting their data into test and training data, and achieved an rs of 0.448, which demonstrates
8We cannot show the full model due to lack of space.
significantly better correlation with the human gold standard than any of the features in isolation.
We evaluated our selectional preference models using the verb-object pairs produced by Venkatapathy and Joshi (2005).9 This dataset has 765 verb-object collocations, each given a rating between 1 and 6 by two annotators (both fluent speakers of English).
Kendall's Tau (Siegel and Castellan, 1988) was used to measure agreement, and a score of 0.61 was obtained which was highly significant.
The ranks of the two annotators gave a Spearman's rank-correlation coefficient (rs) of 0.71.
The verb-object pairs included some adjectives (e.g. happy, difficult, popular), pronouns and complements, e.g. become director.
We used the subset of 638 verb-object pairs that involved common nouns in the object relationship since our preference models focused on the object relation for common nouns.
For each verb-object pair we used the preference models acquired from the rasp parses of the BNC to obtain the probability of the class that this object occurs under.
Where the object noun is a member of several classes (classes(noun) ∈ C) in the model, the class with the largest probability is used.
Note though that for wnprotos we have the added constraint that a hyponym class from C is selected in preference to a hypernym in C. Compositionality of an object noun and verb is computed as:

score(verb, object) = max_{c ∈ C : object ∈ c} p(c)
We use the probability of the class, rather than an estimate of the probability of the object, because we want to determine how likely it is that any word belonging to this class occurs with the given verb, rather than the probability of the specific noun, which may be infrequent yet typical of the objects that occur with this verb.
For example, convertible may be an infrequent object of park, but it is quite likely given its membership of the class motor_vehicle.
We do not want to assume anything about the frequency of non-compositional verb-object combinations, just that they are unlikely to be members of classes which represent prototypical objects.
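Putting this together, the scoring step can be sketched as follows; the model probabilities and class memberships below are illustrative values, not figures from the acquired models:

```python
def compositionality(noun, model, noun_classes):
    """Score a verb-object pair by the probability of the most likely
    model class containing the object noun (higher = more typical of
    the verb, hence more likely compositional).

    model:        class -> probability from a preference model
    noun_classes: noun -> classes the noun belongs to

    Nouns covered by no class in the model score 0.
    """
    probs = [model[c] for c in noun_classes.get(noun, []) if c in model]
    return max(probs, default=0.0)

park_model = {"motor_vehicle": 0.6, "location": 0.2}  # illustrative
classes = {"convertible": ["motor_vehicle"], "fat": ["food"]}
score_convertible = compositionality("convertible", park_model, classes)
score_fat = compositionality("fat", park_model, classes)
```

Even though convertible may never have been seen with park, it inherits the probability of its class, whereas fat falls outside every class in the model and scores zero, flagging park fat (hypothetical) as atypical.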
9This verb-object dataset is available from http://www.cis.upenn.edu/~sriramv/mywork.html.
Table 1: Correlation scores for 638 verb-object pairs. [Table not reproduced: its sections covered the three selectional preference models, the features from V&J including frequency (f1), and the combination with an svm.]
We will contrast these models with a baseline frequency feature used by Venkatapathy and Joshi.
We use our selectional preference models to provide the probability that a candidate is representative of the typical objects of the verb.
That is, if the object might typically occur in such a relationship then this should lessen the chance that this verb-object combination is non-compositional.
We used the probability of the classes from our 3 selectional preference models to rank the pairs, and then used Spearman's rank-correlation coefficient (rs) to compare these ranks with the ranks from the gold standard.
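The evaluation itself needs nothing beyond ranking both lists and correlating them. A self-contained Spearman implementation (Pearson correlation of average ranks), shown with toy scores rather than the actual dataset values:

```python
def rankdata(xs):
    """Average ranks (1-based); tied values share the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rs: Pearson correlation computed on the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

model_scores = [0.60, 0.05, 0.30, 0.00, 0.45]  # hypothetical
gold_ratings = [6, 1, 4, 2, 5]                 # hypothetical
rho = spearman(model_scores, gold_ratings)
```

Because only ranks matter, the absolute scale of the preference probabilities is irrelevant; any monotone transformation of the scores yields the same rs.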
Our results for the three types of preference models are shown in the first section of table 1.10 All the correlation values are significant, but we note that the type-based selectional preference models achieve a far greater correlation than the tcms.
The DSPROTO models achieve the best results, which is very encouraging given that they only require raw data and an automatic parser to obtain the grammatical relations.
10We show absolute values of correlation following (Venkatapathy and Joshi, 2005).
11The other 3 features performed less well on this dataset so we do not report the details here.
This seems to be because they worked particularly well with the adjective and pronoun data in the full dataset.
These feature values were obtained using the same bnc dataset used by Venkatapathy and Joshi, which was parsed with Bikel's parser (Bikel, 2004).
We obtained correlation values for these features, shown in table 1 under V&J. These features are: feature 1, frequency; feature 2, pointwise mutual information; feature 3, based on (Lin, 1999); and feature 7, the lsa feature which considers the similarity of the verb-object pair with the verbal form of the object.
Pointwise mutual information did surprisingly well on this 84% subset of the data; however, the DSPROTO preferences still outperformed this feature.
We combined the DSPROTO and V&J features using an svm ranking function with 10-fold cross-validation, as Venkatapathy and Joshi did.
We contrast the result with the V&J features without the preference models.
The results in the bottom section of table 1 demonstrate that the preference models can be combined with other features to produce optimal results.
5 Conclusions and Directions for Future Work
We have demonstrated that the selectional preferences of a verbal predicate can be used to indicate whether a specific combination with an object is non-compositional.
We have shown that selectional preference models which represent prototypical arguments and focus on argument types (rather than tokens) do well at the task.
Models produced from distributional thesauruses are the most promising which is encouraging as the technique could be applied to a language without a man-made thesaurus.
We find that the probability estimates from our models show a highly significant correlation, and are very promising for detecting non-compositional verb-object pairs, in comparison to individual features used previously.
Further comparison of wnprotos and dsprotos to other WordNet models is warranted, to contrast the effect of our proposal for disambiguation using word types with iterative approaches, particularly that of Clark and Weir (2002).
A benefit of the DSprotos is that they do not require a hand-crafted inventory.
It would also be worthwhile comparing the use of raw data directly, both from the bnc and from Google's Web 1T corpus (Brants and Franz, 2006), since web counts have been shown to outperform the Clark and Weir models on a pseudo-disambiguation task (Keller and Lapata, 2003).
We believe that preferences should NOT be used in isolation.
Whilst a low preference for a noun may be indicative of peculiar semantics, this may not always be the case, for example chew the fat.
Certainly it would be worth combining the preferences with other measures, such as syntactic fixedness (Fazly and Stevenson, 2006).
We also believe it is worth targeting features to specific types of constructions; for example, light verb constructions undoubtedly warrant special treatment (Stevenson et al., 2004).
The selectional preference models we have proposed here might also be applied to other tasks.
We hope to use these models in tasks such as diathesis alternation detection (McCarthy, 2000; Tsang and Stevenson, 2004) and contrast with WordNet models previously used for this purpose.
6 Acknowledgements
We acknowledge support from the Royal Society UK for a Dorothy Hodgkin Fellowship to the first author.
We thank the anonymous reviewers for their constructive comments on this work.
