Query segmentation is the process of taking a user's search-engine query and dividing the tokens into individual phrases or semantic units.
Identification of these query segments can potentially improve both document-retrieval precision, by first returning pages which contain the exact query segments, and document-retrieval recall, by allowing query expansion or substitution via the segmented units.
We train and evaluate a machine-learned query segmentation system that achieves 86% segmentation-decision accuracy on a gold standard set of segmented noun phrase queries, well above recently published approaches.
Key enablers of this high performance are features derived from previous natural language processing work in noun compound bracketing.
For example, token association features beyond simple N-gram counts provide powerful indicators of segmentation.
1 Introduction
Billions of times every day, people around the world communicate with Internet search engines via a small text box on a web page.
The user provides a sequence of words to the search engine, and the search engine interprets the query and tries to return web pages that not only contain the query tokens, but that are also somehow about the topic or idea that the query terms describe.
Recent years have seen a widespread recognition that the user is indeed providing natural language text to the search engine; query tokens are not independent, unordered symbols to be matched on a web document but rather ordered words and phrases with syntactic relationships.
For example, Zhai (1997) pointed out that indexing on single-word symbols is not able to distinguish a search for "bank terminology" from one for "terminology bank."
The reader can submit these queries to a current search engine to confirm that modern indexing does recognize the effect of token order on query meaning in some way.
Accurately interpreting query semantics also depends on establishing relationships between the query tokens.
For example, consider the query "two man power saw."
There are a number of possible interpretations of this query, and these can be expressed through a number of different segmentations or bracketings of the query terms:
[two] [manpower] [saw], etc.
One simple way to make use of these interpretations in search would be to put quotation marks around the phrasal segments to require the search engine to only find pages with exact phrase matches.
If, as seems likely, the searcher is seeking pages about the large, mechanically-powered two-man saws used by lumberjacks and sawyers to cut big trees, then the first segmentation is correct.
Indeed, a phrasal search for "two man power saw" on Google does find the device of interest.
So does the second interpretation, but along with other, less-relevant pages discussing competitions involving "two-man handsaw, two-woman handsaw, power saw log bucking, etc." The top document returned for the third interpretation, meanwhile, describes a man on a rampage at a subway station with two cordless power saws, while the fourth interpretation finds pages about topics ranging from hockey's thrilling two-man power play advantage to the man power situation during the Second World War.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 819-826, Prague, June 2007.
©2007 Association for Computational Linguistics
Clearly, choosing the right segmentation means finding the right documents faster.
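The quotation-mark idea described above can be sketched in a few lines. This is only an illustration: the function name `quote_segments` and the example segmentation are our own assumptions, not part of the system evaluated in this paper.

```python
def quote_segments(segments):
    """Render a segmentation as a phrase query.

    Multi-token segments are wrapped in quotation marks so that a search
    engine treats them as exact phrases; single tokens are left bare.
    """
    parts = []
    for segment in segments:
        text = " ".join(segment)
        parts.append(f'"{text}"' if len(segment) > 1 else text)
    return " ".join(parts)


# Hypothetical segmentation chosen for illustration.
print(quote_segments([["two", "man"], ["power", "saw"]]))
```

A single-segment analysis would instead quote the whole query, which is the phrasal search discussed above.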
Query segmentation can also help if insufficient pages are returned for the original query.
A technique such as query substitution or expansion (Jones et al., 2006) can be employed using the segmented units.
For example, we could replace the sexist "two man" modifier with the politically-correct "two person" phrase in order to find additional relevant documents.
Without segmentation, expanding via the individual words "two," "man," "power," or "saw" could produce less sensible results.
In this paper, we propose a data-driven, machine-learned approach to query segmentation.
Similar to previous segmentation approaches described in Section 2, we make a decision to segment or not to segment between each pair of tokens in the query.
Unlike previous work, we view this as a classification task where the decision parameters are learned discriminatively from gold standard data.
In Section 3, we describe our approach and the features we use.
Section 4 describes our labelled data, as well as the specific tools used for our experiments.
Section 5 provides the results of our evaluation, and shows the strong gains in performance possible using a wide set of features within a discriminative framework.
2 Related Work
Query segmentation has previously been approached in an unsupervised manner.
Risvik et al. (2003) combine the frequency count of a segment and the mutual information (MI) between pairs of words in the segment in a heuristic scoring function.
The system chooses the segmentation with the highest score as the output segmentation.
Jones et al. (2006) use MI between pairs of tokens as the sole factor in deciding on segmentation breaks.
If the MI is above a threshold (optimized on a small training set), the pair of tokens is joined in a segment.
Otherwise, a segmentation break is made.
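A threshold-based segmenter of this kind can be sketched as follows. The counts, the normalizer, and the threshold value below are invented for illustration, and the function is our own reading of the description rather than the implementation of Jones et al. (2006).

```python
import math


def mi_segment(tokens, count, total, threshold):
    """Join adjacent tokens whose log-PMI exceeds `threshold`; break otherwise.

    `count` maps single tokens and space-joined pairs to (hypothetical)
    occurrence counts; `total` is the normalizer for the probabilities.
    """
    if not tokens:
        return []
    segments = [[tokens[0]]]
    for left, right in zip(tokens, tokens[1:]):
        pair = count.get(f"{left} {right}", 0)
        if pair > 0:
            mi = (math.log(pair) + math.log(total)
                  - math.log(count[left]) - math.log(count[right]))
        else:
            mi = float("-inf")  # an unseen pair is never joined
        if mi > threshold:
            segments[-1].append(right)
        else:
            segments.append([right])
    return segments


# Toy counts, assumed for this example only.
toy_counts = {
    "star": 10000, "wars": 8000, "weapons": 5000, "guns": 6000,
    "star wars": 7000, "wars weapons": 10, "weapons guns": 20,
}
print(mi_segment(["star", "wars", "weapons", "guns"], toy_counts, 1_000_000, 1.0))
```

With these toy numbers only "star wars" clears the threshold, so the remaining tokens each form their own segment.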
Query segmentation is related to the task of noun compound (NC) bracketing.
NC bracketing determines the syntactic structure of an NC as expressed by a binary tree, or, equivalently, a binary bracketing (Nakov and Hearst, 2005a).
Zhai (1997) first identified the importance of syntactic query/corpus parsing for information retrieval, but did not consider query segmentation itself.
In principle, as $N$ increases, the number of binary trees for an $N$-token compound is much greater than the $2^{N-1}$ possible segmentations.
In practice, empirical NC research has focused on three-word compounds.
The computational problem is thus deciding whether the three-word NC has a left or right-bracketing structure (Lauer, 1995).
For the segmentation task, analysing a three-word NC requires deciding between four different segmentations.
For example, there are two bracketings for "used car parts," the left-bracketing "[[used car] parts]" and the right-bracketing "[used [car parts]]," while there are four segmentations, including the case where there is only one segment, "[used car parts]" and the base case where each token forms its own segment, "[used] [car] [parts]."
Query segmentation thus naturally handles the case where the query consists of multiple, separate noun phrases that should not be analysed with a single binary tree.
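To make the two counts concrete, the following sketch enumerates every segmentation of a token list and computes the number of binary bracketings over the same tokens. The function names are ours; the bracketing count uses the standard Catalan-number formula for binary trees over a fixed sequence of leaves.

```python
from itertools import product
from math import comb


def enumerate_segmentations(tokens):
    """Return all 2^(N-1) segmentations of a token list as lists of segments."""
    if not tokens:
        return []
    results = []
    # One boolean per gap between adjacent tokens: True means "insert a break".
    for breaks in product([False, True], repeat=len(tokens) - 1):
        segments = [[tokens[0]]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                segments.append([token])
            else:
                segments[-1].append(token)
        results.append(segments)
    return results


def count_binary_bracketings(n):
    """Number of binary trees over n ordered leaves (a Catalan number)."""
    return comb(2 * (n - 1), n - 1) // n


for seg in enumerate_segmentations(["used", "car", "parts"]):
    print(" ".join("[" + " ".join(s) + "]" for s in seg))
```

For three tokens this gives four segmentations against two bracketings; under this formula the bracketings only overtake the segmentations from six tokens onward (42 against 32).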
Despite the differences between the tasks, it is worth investigating whether the information that helps disambiguate left and right-bracketings can also be useful for segmentation.
In particular, we explored many of the sources of information used by Nakov and Hearst (2005a), as well as several novel features that aid segmentation performance and should also prove useful for NC analysis researchers.
Unlike all previous approaches that we are aware of, we apply our features in a flexible discriminative framework rather than a classification based on a vote or average of features.
NC analysis has benefited from the recent trend of using web-derived features rather than corpus-based counts (Keller and Lapata, 2003).
Lapata and Keller (2004) first used web-based co-occurrence counts for the bracketing of NCs.
Recent innovations have been to use statistics "beyond the N-gram," such as counting the number of web pages where a pair of words w, x participate in a genitive relationship ("w's x"), occur collapsed as a single phrase ("wx") (Nakov and Hearst, 2005a) or have a definite article as a left-boundary marker ("the w x") (Nicholson and Baldwin, 2006).
We show strong performance gains when such features are employed for query segmentation.
NC bracketing is part of a larger field of research on multiword expressions including general NC interpretation.
NC interpretation explores not just the syntactic dependencies among compound constituents, but the semantics of the nominal relationships (Girju et al., 2005).
Web-based statistics have also had an impact on these wider analysis tasks, including work on interpretation of verb nominalisations (Nicholson and Baldwin, 2006) and NC coordination (Nakov and Hearst, 2005b).
3 Methodology
3.1 Segmentation Classification
Consider a query $x = \{x_1, x_2, \ldots, x_N\}$ consisting of $N$ query tokens.
Segmentation is a mapping $S : x \to y \in Y_N$, where $y$ is a segmentation from the set $Y_N$.
Since we can either have or not have a segmentation break at each of the $N-1$ spaces between the $N$ tokens, $|Y_N| = 2^{N-1}$.
Supervised machine learning can be applied to derive the mapping $S$ automatically, given a set of training examples consisting of pairs of queries and their segmentations $T = \{(x_i, y_i)\}$.
Typically this would be done via a set of features $\Psi(x, y)$ for the structured examples.
A set of weights $w$ can be learned discriminatively such that each training example $(x_i, y_i)$ has a higher score, $\mathrm{Score}_w(x, y) = w \cdot \Psi(x, y)$, than alternative query-segmentation pairs, $(x_i, z_i)$, $z_i \neq y_i$.1 At test time, the classifier chooses the segmentation for $x$ that has the highest score according to the learned parameterization: $\hat{y} = \arg\max_y \mathrm{Score}_w(x, y)$.
Unlike many problems in NLP such as parsing or part-of-speech tagging, the small cardinality of $Y_N$ makes enumerating all the alternative query segmentations computationally feasible.
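Because the candidate set is small, the highest-scoring segmentation can be found by brute force. The sketch below assumes a toy two-dimensional feature map and hand-set weights, both invented for illustration; it is not the feature set or the learned model used in our experiments.

```python
from itertools import product


def candidate_breaks(tokens):
    """All break patterns for a query: one 0/1 flag per gap between tokens."""
    return product([0, 1], repeat=max(len(tokens) - 1, 0))


def feature_map(tokens, breaks, known_phrases):
    """Hypothetical features: [number of breaks, number of joined known pairs]."""
    joined_known = sum(
        1
        for i, flag in enumerate(breaks)
        if flag == 0 and (tokens[i], tokens[i + 1]) in known_phrases
    )
    return [sum(breaks), joined_known]


def best_segmentation(tokens, weights, known_phrases):
    """Return the break pattern with the highest linear score w . features."""
    def score(breaks):
        feats = feature_map(tokens, breaks, known_phrases)
        return sum(w * f for w, f in zip(weights, feats))
    return max(candidate_breaks(tokens), key=score)


query = ["star", "wars", "weapons", "guns"]
print(best_segmentation(query, [0.5, 2.0], {("star", "wars")}))
```

With these assumed weights the best pattern keeps "star wars" together and breaks at the other two gaps.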
In our preliminary experiments, we used a Support Vector Machine (SVM) ranker (Joachims, 2002) to learn the structured classifier.2 We also investigated a Hidden Markov Model SVM (Altun et al., 2003) to label the segmentation breaks using information from past segmentation decisions.
Ultimately, the mappings produced by these approaches were not as accurate as a simple formulation that creates a full query segmentation y as the combination of independent classification decisions made between each pair of tokens in the query.3
1 See e.g. Collins (2002) for a popular training algorithm.
2 A ranking approach was also used previously by Daume III and Marcu (2004) for the CoNLL-99 nested noun phrase identification task.
In the classification framework, the input is a query, x, a position in the query, i, where 0<i<N, and the output is a segmentation decision yes/no.
The training set of segmented queries is converted into examples of decisions between tokens and learning is performed on this set.
At test time, $N-1$ segmentation decisions are made for the $N$-length query and an output segmentation y is produced.
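The conversion between segmented queries and per-gap decisions can be sketched as follows. The function names are ours, and the sketch assumes every segment contains at least one token.

```python
def to_decisions(segments):
    """Flatten a segmented query into (tokens, labels).

    labels[j] is True when there is a break between tokens[j] and tokens[j + 1],
    so a query of N tokens yields N - 1 labels.
    """
    tokens, labels = [], []
    for segment in segments:
        for k, token in enumerate(segment):
            if tokens:
                # A break precedes this token only if it starts a new segment.
                labels.append(k == 0)
            tokens.append(token)
    return tokens, labels


def from_decisions(tokens, labels):
    """Rebuild the segmentation from N tokens and N - 1 break decisions."""
    segments = [[tokens[0]]]
    for token, is_break in zip(tokens[1:], labels):
        if is_break:
            segments.append([token])
        else:
            segments[-1].append(token)
    return segments


gold = [["bank", "loan"], ["amoritization", "schedule"]]
tokens, labels = to_decisions(gold)
print(tokens, labels)
print(from_decisions(tokens, labels) == gold)
```

Training uses the first function to produce one yes/no example per gap; at test time the second assembles the classifier's decisions into an output segmentation.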
Here, features depend only on the input query x and the position in the query i. For a decision at position i, we use features from tokens up to three positions to the left and to the right of the decision location.
That is, for a decision between xL0 and xR0, we extract features from a window of six tokens in the query: {..., xL2, xL1, xL0, xR0, xR1, xR2, ...}.
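A minimal sketch of the window extraction is given below. It assumes 0-based token lists with the decision at position i falling between tokens[i - 1] and tokens[i]; the dictionary layout and the use of None for missing tokens are our own illustrative choices.

```python
def extract_window(tokens, i):
    """Return the six-token window around decision position i (0 < i < N).

    Slots that fall outside the query are set to None so that a separate
    "token not available" feature can fire for them.
    """
    if not 0 < i < len(tokens):
        raise ValueError("decision position must satisfy 0 < i < N")
    names = ["L2", "L1", "L0", "R0", "R1", "R2"]
    offsets = [-3, -2, -1, 0, 1, 2]
    window = {}
    for name, offset in zip(names, offsets):
        j = i + offset
        window[name] = tokens[j] if 0 <= j < len(tokens) else None
    return window


print(extract_window(["bank", "loan", "amoritization", "schedule"], 2))
```

For the four-token query shown, the decision in the middle has both context tokens available but neither of the outermost slots.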
We now detail the features derived from this window.
There are a number of possible indicators of whether a segmentation break occurs between a pair of tokens.
Some of these features fire separately for each token x in our feature window, while others are defined over pairs or sets of tokens in the window.
We first describe the features that are defined for the tokens around the decision boundary, xL0 and xR0, before describing how these same features are extended to longer phrases and other token pairs.
Table 1 lists the binary features that fire if particular aspects of a token or pair of tokens are present.
Table 1: Indicator features.
Name       Description
is-the     token x = "the"
is-free    token x = "free"
POS-tags   Part-of-speech tags of pair xL0 xR0
fwd-pos    position from beginning, i
rev-pos    position from end, N - i
3 The structured learners did show large gains over the classification framework on the dev-set when using only the basic features for the decision-boundary tokens (see Section 3.2.1), but not when the full feature set was deployed. Also, features only available to structured learners, e.g. the number of segments in the query, did improve the performance of the structured approaches, but not above that of the simpler classifier.
For example, one of the POS-tags features will fire if the pair's part-of-speech tags are DT JJ, another feature will fire if the position of the pair in the token is 2, etc. The two lexical features (for when the token is "the" and when the token is "free") fire separately for the left and right tokens around the decision boundary.
They are designed to add discrimination for these common query words, motivated by examples in our training set.
For example, in the training set, "free" often occurs in its own segment when it's on the left-hand-side of a decision boundary (e.g. "free" "online" ...), but may join into a larger segment when it's on the right-hand-side of a collocation (e.g. "sulfite free" or "sugar free").
The classifier can use the feature weights to encourage or discourage segmentation in these specific situations.
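For statistical features, previous work (Section 2) suggests that the mutual information between the decision tokens xL0 and xR0 may be appropriate.
The log of the pointwise mutual information (Church and Hanks, 1989) between the decision-boundary tokens xL0, xR0 is:
$$\mathrm{MI}(x_{L0}, x_{R0}) = \log \frac{\Pr(x_{L0} x_{R0})}{\Pr(x_{L0}) \Pr(x_{R0})}$$
Estimating each probability as a count divided by a normalizer, $\Pr(\cdot) = C(\cdot)/K$, this is equivalent to the sum: $\log C(x_{L0} x_{R0}) + \log K - \log C(x_{L0}) - \log C(x_{R0})$.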
For web-based features, the counts C(.) can be taken as a search engine's count of the number of pages containing the term.
The normalizer K is thus the total number of pages on the Internet.
Represented as a summation, we can see that providing MI as the feature effectively ties the weights on the logarithmic counts C(xL0xR0), C(xL0), and C(xR0).
Another approach would be to provide these logarithmic counts as separate features to our learning algorithm, which can then set the weights optimally for segmentation.
We call this set of counts the "Basic" features.
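The relationship between the tied and untied formulations can be checked numerically. In the sketch below the counts and the normalizer are invented, and the function names are ours; the point is only that MI is the special case in which the three log-count weights are fixed at (1, -1, -1) with a constant offset of log K.

```python
import math


def basic_features(c_pair, c_left, c_right):
    """Untied "Basic" features: the three logarithmic counts."""
    return [math.log(c_pair), math.log(c_left), math.log(c_right)]


def log_pmi(c_pair, c_left, c_right, k):
    """Log pointwise mutual information with probabilities estimated as C / K."""
    return math.log((c_pair / k) / ((c_left / k) * (c_right / k)))


# Hypothetical counts; zero counts would need smoothing before taking logs.
c_pair, c_left, c_right, k = 7000, 10000, 8000, 1_000_000

feats = basic_features(c_pair, c_left, c_right)
tied_weights = [1.0, -1.0, -1.0]
mi_from_feats = sum(w * f for w, f in zip(tied_weights, feats)) + math.log(k)

print(mi_from_feats, log_pmi(c_pair, c_left, c_right, k))
```

A learner given the three features separately is free to choose any other weight vector, which is what allows it to depart from the MI weighting when the data call for it.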
In Section 5, we confirm results on our development set that showed using the basic features untied increased segmentation performance by up to 4% over using MI - an important observation for all researchers using association models as features in their discriminative classifiers.
Table 2: Statistical features.
Name         Description
web-count    Web page count of "x"
pair-count   Web page count of "w x"
definite     Web page count of "the w x"
collapsed    Web page count of "wx"
and-count    Web page count of "w and x"
genitive     Web page count of "w's x"
Qcounts-2    Counts of "x" in query database
Furthermore, with this technique, we do not need to normalize the counts for the other pairwise statistical features given in Table 2.
We can simply rely on our learning algorithm to increase or decrease the weights on the logarithm of the counts as needed.
To illustrate how the statistical features work, consider a query from our development set: "star wars weapons guns."
The phrase "star wars" can easily be interpreted as a phrase; there is a high co-occurrence count (pair-count), and many pages where they occur as a single phrase (collapsed), e.g. "starwars.com."
"Weapons" and "guns," on the other hand, should not be joined together.
Although they may have a high co-occurrence count, the coordination feature (and-count) is high ("weapons and guns") showing these to be related concepts but not phrasal constituents.
Including this novel feature resulted in noticeable gains on the development set.
Since this is a query-based segmentation, features that consider whether sets of tokens occurred elsewhere in the query database may provide domain-specific discrimination.
For each of the Qcount features, we look for two quantities: the number of times the phrase occurs as a query on its own and the number of times the phrase occurs within another query.4 Including both of these counts also resulted in performance gains on the development set.
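The two query-database quantities can be sketched as below. The in-memory list of queries and the whitespace tokenization are assumptions made for illustration; a real query log would need an index rather than a linear scan.

```python
def query_counts(phrase, queries):
    """Count how often `phrase` is a whole query and how often it occurs inside one.

    Matching is done on whitespace tokens, so "star wars" does not match
    inside "superstar wars". Each containing query is counted once.
    """
    target = phrase.split()
    alone = within = 0
    for query in queries:
        tokens = query.split()
        if tokens == target:
            alone += 1
        elif any(
            tokens[i:i + len(target)] == target
            for i in range(len(tokens) - len(target) + 1)
        ):
            within += 1
    return alone, within


# Toy query log, invented for this example.
log = ["star wars", "star wars toys", "lego star wars sets",
       "wars of the stars", "star wars"]
print(query_counts("star wars", log))
```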
4 We exclude counts from the training, development, and testing queries discussed in Section 4.1.
We also extensively investigated other corpus-based features, such as the number of times the phrase occurred hyphenated or capitalized, and the corpus-based distributional similarity (Lin, 1998) between a pair of tokens.
These features are not available from search-engine statistics because search engines disregard punctuation and capitalization, and collecting page-count-based distributional similarity statistics is computationally infeasible.
Unfortunately, none of the corpus-based features improved performance on the development set, and they are thus excluded from further consideration.
This is perhaps not surprising.
For such a task that involves real user queries, with arbitrary spellings and sometimes exotic vocabulary, gathering counts from web search engines is the only way to procure reliable and broad-coverage statistics.
Although the tokens at the decision boundary are of paramount importance, information from the neighbouring tokens is also critical for segmentation decision discrimination.
We thus include features that take into consideration the preceding and following tokens, xL1 and xR1, as context information.
We gather all the token indicator features for each of these tokens, as well as all pairwise features between xL1 and xL0, and then xR0 and xR1.
If context tokens are not available at this position in the query, a feature fires to indicate this.
Also, if the context features are available, we include trigram web and query-database counts of "xL1 xL0 xR0" and "xL0 xR0 xR1", and a fourgram spanning both contexts.
Furthermore, if tokens xL2 and xR2 are available, we collect relevant token-level, pairwise, trigram, and fourgram counts including these tokens as well.
In Section 5, we show that context features are very important.
They allow our system to implicitly leverage surrounding segmentation decisions, which cannot be accessed directly in an independent segmentation-decision classifier.
For example, consider the query "bank loan amoritization schedule."
Although "loan amoritization" has a strong connection, we may nevertheless insert a break between them because "bank loan" and "amoritization schedule" each have even stronger association.
Motivated by work in noun phrase parsing, it might be beneficial to check if, for example, token xL0 is more likely to modify a later token, such as xR1.
For example, in "female bus driver", we might not wish to segment "female bus" because "female" has a much stronger association with "driver" than with "bus".
Thus, as features, we include the pair-wise counts between xL0 and xR1, and then xL1 and xR0.
Features from longer range dependencies did not improve performance on the development set.
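The token pairs that receive pairwise counts in each feature span can be collected as in the sketch below. The window is assumed to be a dictionary keyed by slot name with None for unavailable tokens; this layout and the function name are illustrative assumptions.

```python
def pairwise_feature_pairs(window):
    """Group the token pairs that receive pairwise counts, keyed by feature span."""
    spans = {
        "decision-boundary": [("L0", "R0")],
        "context": [("L1", "L0"), ("R0", "R1")],
        "dependency": [("L0", "R1"), ("L1", "R0")],
    }
    pairs = {}
    for span, slots in spans.items():
        # Skip any pair for which one of the two tokens is unavailable.
        pairs[span] = [
            (window[a], window[b])
            for a, b in slots
            if window.get(a) is not None and window.get(b) is not None
        ]
    return pairs


# Decision between "female" and "bus" in the query "female bus driver".
window = {"L1": None, "L0": "female", "R0": "bus", "R1": "driver"}
print(pairwise_feature_pairs(window))
```

Here the dependency span contributes the pair ("female", "driver"), whose association can be weighed against that of the boundary pair ("female", "bus").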
Our dataset was taken from the AOL search query database (Pass et al., 2006), a collection of 35 million queries submitted to the AOL search engine.
Most punctuation has been removed from the queries.5 Along with the query, each entry in the database contains an anonymous user ID and the domain of the URL the user clicked on, if they selected one of the returned pages.
For our data, we used only those queries with a click-URL.
This subset has a higher proportion of correctly-spelled queries, and facilitates annotation (described below).
We then tagged the search queries using a maximum entropy part-of-speech tagger (Ratnaparkhi, 1996).
As our approach was designed particularly for noun phrase queries, we selected for our final experiments those AOL queries containing only determiners, adjectives, and nouns.
We also only considered phrases of length four or greater, since queries of these lengths are most likely to benefit from a segmentation, but our approach works for queries of any length.
Future experiments will investigate applying the current approach to phrasal verbs, prepositional idioms and segments with other parts of speech.
We randomly selected 500 queries for training, 500 for development, and 500 for final testing.
These were all manually segmented by our annotators.
Manual segmentation was done with improving search precision in mind.
Annotators were asked to analyze each query and form an idea of what the user was searching for, taking into consideration the click-URL or performing their own online searches, if needed.
The annotators were then asked to segment the query to improve search retrieval, by forcing a search engine to find pages with the segments occurring as unbroken units.
5 Including, unfortunately, all quotation marks, precluding our use of users' own segmentations as additional labelled examples or feature data for our system.
One annotator segmented all three data sets, and these were used for all the experiments.
Two additional annotators also segmented the final test set to allow inter-annotator agreement calculation.
The pairwise agreement on segmentation decisions (between each pair of tokens) was between 84.0% and 84.6%.
The agreement on entire queries was between 57.6% and 60.8%.
All three agreed completely on 219 of the 500 queries, and we use this "intersected" set for a separate evaluation in our experiments.6 If we take the proportion of segmentation decisions the annotators would be expected to agree on by chance to be 50%, the Kappa statistic (Jurafsky and Martin, 2000, page 315) is around .69, below the .8 considered to be good reliability.
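The Kappa figure follows directly from the pairwise agreement range reported above and the assumed 50% chance agreement, as this small sketch shows (the function name is ours).

```python
def kappa(observed_agreement, chance_agreement):
    """Kappa statistic: agreement above chance, normalized by its maximum."""
    return (observed_agreement - chance_agreement) / (1.0 - chance_agreement)


# Pairwise decision agreement ranged from 84.0% to 84.6%; chance is taken as 50%.
for observed in (0.840, 0.846):
    print(round(kappa(observed, 0.5), 3))
```

The two ends of the agreement range give values of roughly .68 and .69.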
This observed agreement was lower than we anticipated, and reflects both differences in query interpretation and in the perceived value of different segmentations for retrieval performance.
Annotators agreed that terms like "real estate," "work force," "west palm beach," and "private investigator" should be separate segments.
These are collocations in the linguistics sense (Manning and Schutze, 1999, pages 183-187); we cannot substitute related words for terms in these expressions nor apply syntactic transformations or paraphrases (e.g. we don't say "investigator of privates").
However, for a query such as "bank manager," should we exclude web pages that discuss "manager of the bank" or "branch manager for XYZ bank"?
If a user is searching for a particular webpage, excluding such results could be harmful.
However, for query substitution or expansion, identifying that "bank manager" is a single unit may be useful.
We can resolve the conflicting objectives of our two motivating applications by moving to a multi-layer query bracketing scheme, first segmenting unbreakable collocations and then building them into semantic units with a query segmentation grammar.
This will be the subject of future research.
All of our statistical feature information was collected using the Google SOAP Search API.7 For training and classifying our data, we use the popular Support Vector Machine (SVM) learning package SVMlight (Joachims, 1999).
SVMs are maximum-margin classifiers that achieve good performance on a range of tasks.
In each case, we train with a linear kernel on the training-set segmentation decisions and tune the parameter that trades off training error and margin on the development set.
6 All queries and statistical feature information are available at http://www.cs.ualberta.ca/~bergsma/QuerySegmentation/
7 http://code.google.com/apis/soapsearch/
We use the following two evaluation criteria:
Seg-Acc: Segmentation decision accuracy: the proportion of times our classifier's decision to insert a segment break or not between a pair of tokens agrees with the gold standard decision.
Qry-Acc: Query segmentation accuracy: the proportion of queries for which the complete segmentation derived from our classifications agrees with the gold standard segmentation.
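Both criteria can be computed from per-gap decisions as in the following sketch. It assumes each query is represented as a list of boolean break decisions of equal length in the gold and predicted sets; the toy data are invented.

```python
def seg_acc(gold, predicted):
    """Proportion of individual break/no-break decisions that match the gold standard."""
    total = correct = 0
    for gold_query, pred_query in zip(gold, predicted):
        total += len(gold_query)
        correct += sum(g == p for g, p in zip(gold_query, pred_query))
    return correct / total


def qry_acc(gold, predicted):
    """Proportion of queries whose complete segmentation matches the gold standard."""
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# Two toy four-token queries: three decisions each.
gold = [[False, True, False], [True, True, True]]
pred = [[False, True, False], [True, False, True]]
print(seg_acc(gold, pred), qry_acc(gold, pred))
```

A single wrong decision costs one sixth of Seg-Acc here but half of Qry-Acc, which illustrates why the query-level figure is the stricter of the two.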
5 Results
the SVM set the threshold for MI on the training set.
Note that the Basic, Decision-Boundary system (Section 3.2.1), which uses exactly the same co-occurrence information as the MI system (in the form of the Basic features) but allows the SVM to discriminatively weight the logarithmic counts, immediately increases Seg-Acc performance by 3.7%.
Even more strikingly, adding the Basic count information for the Context tokens (Section 3.2.2) boosts performance by another 8.5%, increasing Qry-Acc by over 22%.
Smaller, further gains arise by adding Dependency token information (Section 3.2.3).
Also, notice that moving from Basic features for the Decision-Boundary tokens to all of our indicator (Table 1) and statistical (Table 2) features (referred to as All features) increases performance from 71.7% to 84.3%.
These gains convincingly justify our use of an expanded feature set for this task.
Including Context with the expanded features adds another 2%, while adding Dependency information actually seems to hinder performance slightly, although gains were seen when adding Dependency information on the development set.
Table 3: Segmentation Performance (%). Columns: Feature Type, Feature Span, Test Set, Intersection Set. Feature spans: Decision-Boundary; Decision-Boundary, Context; Decision-Boundary, Context, Dependency.
8 Statistically significant intra-row differences in Qry-Acc are marked with an asterisk (McNemar's test, p<0.05).
Note, however, that these results must also be considered in light of the low inter-annotator agreement (Section 4.1).
Indeed, results are lower if we evaluate using the test-set labels from another annotator (necessarily training on the original annotator's labels).
On the intersected set of the three annotators, however, results are better still: 88.7% Seg-Acc and 69.4% Qry-Acc on the intersected queries for the full-featured system (Table 3).
Since high performance is dependent on consistent training and test labellings, it seems likely that developing more-explicit annotation instructions may allow further improvements in performance as within-set and between-set annotation agreement increases.
It would also be theoretically interesting, and of significant practical importance, to develop a learning approach that embraces the agreement of the annotations as part of the learning algorithm.
Our initial ranking formulation (Section 3.1), for example, could learn a model that prefers segmentations with higher agreement, but still prefers any annotated segmentation to alternative, unobserved structures.
As there is growing interest in making maximal use of annotation resources within discriminative learning techniques (Zaidan et al., 2007), developing a general empirical approach to learning from ambiguously-labelled examples would be both an important contribution to this trend and a potentially helpful technique in a number of NLP domains.
6 Conclusion
We have developed a novel approach to search query segmentation and evaluated this approach on actual user queries, reducing error by 56% over a recent comparison approach.
Gains in performance were made possible both by leveraging recent progress in feature engineering for noun compound bracketing and by using a flexible, discriminative incorporation of association information, beyond the decision-boundary tokens.
We have created and made available a set of manually-segmented user queries, and thus provided a new testing platform for other researchers in this area.
Our initial formulation of query segmentation as a structured learning problem, and our leveraging of association statistics beyond the decision boundary, also provides powerful tools for noun compound bracketing researchers to both move beyond three-word compounds and to adopt discriminative feature weighting techniques.
The positive results achieved on this important application should encourage further inter-disciplinary collaboration between noun compound interpretation and information retrieval researchers.
For example, analysing the semantics of multiword expressions may allow for more-focused query expansion; knowing to expand "bank manager" to include pages describing a "manager of the bank," but not doing the same for non-compositional phrases like "real estate" or "private investigator," requires exactly the kind of techniques being developed in the noun compound interpretation community.
Thus for query expansion, as for query segmentation, work in natural language processing has the potential to make a real and immediate impact on search-engine technology.
The next step in this research is to directly investigate how query segmentation affects search performance.
For such an evaluation, we would need to know, for each possible segmentation (including no segmentation), the document retrieval performance.
This could be the proportion of returned documents that are deemed to be relevant to the original query.
Exactly such an evaluation was recently used by Kumaran and Allan (2007) for the related task of query contraction.
Of course, a dataset with queries and retrieval scores may serve for more than evaluation; it may provide the examples used by the learning module.
That is, the parameters of the contraction or segmentation scoring function could be discriminatively set to optimize the retrieval of the training set queries.
A unified framework for query contraction, segmentation, and expansion, all based on discriminatively optimizing retrieval performance, is a very appealing future research direction.
In this framework, the size of the training sets would not be limited by human annotation resources, but by the number of queries for which retrieved-document relevance judgments are available.
Generating more training examples would allow the use of more powerful, finer-grained lexical features for classification.
Acknowledgments
We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Alberta Ingenuity Fund, the Alberta Ingenuity Center for Machine Learning, and the Alberta Informatics Circle of Research Excellence.
