Traditional research on spelling correction in natural language processing and information retrieval literature mostly relies on pre-defined lexicons to detect spelling errors.
But this method does not work well for web query spelling correction, because there is no lexicon that can cover the vast amount of terms occurring across the web.
Recent work showed that using search query logs helps to solve this problem to some extent.
However, such approaches cannot deal with rarely-used query terms well due to the data sparseness problem.
In this paper, we propose a novel method that uses web search results to improve existing query spelling correction models based solely on query logs, leveraging the rich information on the web related to the query and its top-ranked candidate.
Experiments are performed on real-world queries randomly sampled from a search engine's daily logs, and the results show that our new method achieves a 16.9% relative F-measure improvement and a 35.4% overall error rate reduction in comparison with the baseline method.
Microsoft Research Asia 5F Sigma Center Zhichun Road, Haidian District Beijing, China, 100080
muli@microsoft.com
1 Introduction
Nowadays more and more people use Internet search engines to locate information on the web.
Search engines take the text queries that users type as input, and present users with ranked web pages related to those queries.
During this process, one of the important factors that lead to poor search results is misspelled query terms.
Misspelled queries are in fact common in query logs: previous investigations of search engine log data found that around 10% to 15% of queries were misspelled (Cucerzan and Brill, 2004).
Sometimes misspellings are due to simple typographic errors such as teh for the.
In many cases the spelling errors are more complicated cognitive errors such as camoflauge for camouflage.
In fact, correct spelling is not always easy: even many Americans cannot correctly spell the last name of California's governor, Schwarzenegger.
A spelling correction tool can help improve users' efficiency in the first case, but it is more useful in the latter since the users cannot figure out the correct spelling by themselves.
There has been a long history of general-purpose spelling correction research in natural language processing and information retrieval literature (Kukich, 1992), but its application to web search queries is still a new challenge.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 181-189, Prague, June 2007.
©2007 Association for Computational Linguistics
Although there are some similarities in correction candidate generation and selection, these two settings are quite different in one fundamental problem: How to determine the validity of a search term.
Traditionally, the measure is mostly based on a pre-defined spelling lexicon - all character strings that cannot be found in the lexicon are judged to be invalid.
However, in the web search context, there is little hope that we can construct such a lexicon with ideal coverage of web search terms.
For example, even manually collecting a full list of car names and company names will be a formidable task.
To obtain a more accurate understanding of this problem, we performed a detailed investigation over one week of MSN daily query logs and found that 16.5% of search terms fall outside the scope of our spelling lexicon, which contains around 200,000 entries.
In order to get more specific numbers, we also manually labeled a query data set that contains 2,323 randomly sampled queries and 6,318 terms.
In this data set, the ratio of out-of-vocabulary (OOV) terms is 17.4%, which is very similar to the overall distribution.
However, only 25.3% of these OOV terms are identified as misspelled, yet they account for 85% of the overall spelling errors.
All these statistics indicate that accurate OOV term classification is of crucial importance to good query spelling correction performance.
Cucerzan and Brill (2004) first investigated this issue and proposed to use query logs to infer correct spellings of misspelled terms.
Their principle can be summarized as follows: given an input query string q, finding a more probable query c than q within a confusion set of q, in which the edit distance between each element and q is less than a given threshold.
They reported good recall for misspelled terms, but without detailed discussions on accurate classification of valid out-of-vocabulary terms and misspellings.
In the work of Li et al. (2006), distributional similarity metrics estimated from query logs were proposed to discriminate high-frequency spelling errors such as massenger from valid out-of-vocabulary terms such as biocycle.
But this method suffers from the data sparseness problem: sufficient occurrences of every possible misspelling and valid term are required to make a good estimation of distributional similarity metrics; thus this method does not work well for rarely-used out-of-vocabulary search terms and uncommon misspellings.
In this paper we propose to use web search results to further improve the performance of query spelling correction models.
The key contribution of our work is to identify that the dynamic online search results can serve as additional evidence to determine users' intended spelling of a given term.
The information in web search results we used includes the number of pages matched for the query, the term distribution in the web page snippets and URLs.
We studied two schemes to make use of the results returned by a web search engine.
The first one only exploits indicators of the input query's returned results, while the other also looks at the search results of other potential correction candidates.
We performed extensive evaluations on a query set randomly sampled from search engines' daily query logs, and experimental results show that we can achieve 35.4% overall error rate reduction and 18.2% relative F-measure improvement on OOV misspelled terms.
The rest of the paper is structured as follows.
Section 2 reviews related work on spelling correction.
In Section 3, we present the intuitive motivation for using web search results in query spelling correction.
After presenting the formal statement of the query spelling correction problem in Section 4, we describe our approaches that use machine learning methods to integrate statistical features from web search results in Section 5.
We present our evaluation methods for the proposed methods and analyze their performance in Section 6.
Section 7 concludes the paper.
2 Related Work
Spelling correction models in most previous work were constructed based on conventional task settings.
Based on the focus of these task settings, two lines of research have been applied to deal with non-word errors and real-word errors respectively.
Non-word error spelling correction is focused on the task of generating and ranking a list of possible spelling corrections for each word not existing in a spelling lexicon.
Traditionally candidate ranking is based on manually tuned scores such as assigning alternative weights to different edit operations or leveraging candidate frequencies (Damerau, 1964; Levenshtein, 1966).
In recent years, statistical models have been widely used for natural language processing tasks, including spelling correction.
Brill and Moore (2000) presented an improved error model over the one proposed by Kernighan et al. (1990) by allowing generic string-to-string edit operations, which helps with modeling major cognitive errors such as the confusion between le and al. Toutanova and Moore (2002) further investigated this issue by explicitly modeling the phonetic information of English words.
Both models require misspelled/correct word pairs for training, and the latter also needs a pronunciation lexicon, but recently Ahmad and Kondrak (2005) demonstrated that it is also possible to learn such models automatically from query logs with the EM algorithm, which is similar to the work of Martin (2004), who learned from a very large corpus of raw text to remove non-word spelling errors in large corpora.
All the work for non-word spelling correction focused on the current word itself without taking into account contextual information.
Real-word spelling correction is also referred to as context-sensitive spelling correction (CSSC), which tries to detect incorrect usage of valid words in certain contexts.
Using a pre-defined confusion set is a common strategy for this task, as in the work of Golding and Roth (1996) and Mangu and Brill (1997).
In contrast to non-word spelling correction, this line of work takes only contextual evidence into account, assuming that all spelling similarities are equal.
The complexity of the query spelling correction task requires the combination of these types of evidence, as done in (Cucerzan and Brill, 2004; Li et al., 2006).
One important contribution of our work is that we use web search results as extended contextual information beyond query strings by taking advantage of application specific knowledge.
Although the information used in our methods can all be accessed in a search engine's web archive, such a strategy involves web-scale data processing which is a big engineering challenge, while our method is a light-weight solution to this issue.
3 Motivation
When a spelling correction model decides whether to suggest a correction c for a query q, it generally needs to leverage two types of evidence: the similarity between c and q, and the plausibility of c and q. All previous work estimated the plausibility of a query based on the query string itself: typically it is represented as the string probability, which is further decomposed into a product of consecutive n-gram probabilities.
For example, both Cucerzan and Brill (2004) and Li et al. (2006) used n-gram statistical language models trained from a search engine's query logs to estimate the query string probability.
In the following, we will show that the search results for a query can serve as a feedback mechanism that provides additional evidence for making better spelling correction decisions.
The usefulness of web search results can be two-fold:
First, search results can be used to validate query terms, especially those not popular enough in query logs.
One case is the validation for navigational queries (Broder, 2004).
Navigational queries usually contain terms that are key parts of destination URLs, which may be out-of-vocabulary terms since there are millions of sites on the web.
Because some of these navigational terms are relatively rare in query logs, a query spelling correction model that has no knowledge of the special navigational property of a term might confuse them with other low-frequency misspellings.
But such information can be effectively obtained from the URLs of retrieved web pages.
Inferring navigational queries through term-URL matching can thus help reduce the chance that the spelling correction model changes an uncommon web site name into a popular search term, such as from innovet to innovate.
Another example is that search results can be used in identifying acronyms or other abbreviations.
We can observe some clear text patterns that relate abbreviations to their full spellings in the search results as shown in Figure 1.
But such mappings cannot easily be obtained from query logs.
CDC - Severe Acute Respiratory Syndrome (SARS)
complete and official information for the public and health care providers, including information for patients and their close contacts. www.cdc.gov/ncidod/sars
CDC | Fact Sheet: Basic Information About SARS
Information on the international outbreak of the illness known as severe acute respiratory syndrome ... SARS.
Severe acute respiratory syndrome (SARS) is a viral respiratory illness caused by ... www.cdc.gov/ncidod/sars/factsheet.htm
Figure 1. Sample search results for SARS
Second, search results can help verify correction candidates.
The terms appearing in search results, both in the web page titles and snippets, provide additional evidence of the user's intention.
For example, if a user searches for the misspelled query vaccum cleaner on a search engine, it is very likely that the search results will contain the correct term vacuum, as shown in Figure 2.
This can be attributed to the collective link text distribution on the web: many links with misspelled text point to sites with correct spellings.
Such evidence can boost the confidence of a spelling correction model to suggest vacuum as a correction.
Vacuum Cleaner Parts & Vacuum Filters - Vacuum Cleaner Shop
Get vacuum cleaner parts at guaranteed low prices. Find the exact vacuum part, ... www.vacuumcleanershop.com
Vaccum Cleaner
vaccum cleaner resources, information, and directory. ... www.vaccumcleaner-foryou.info
Figure 2. Sample search results for vaccum cleaner
The number of matched pages can be used to measure the popularity of a query on the web, which is similar to term frequencies occurring in query logs, but with broader coverage.
Poor correction candidates can usually be verified by a smaller number of matched web pages.
Another observation is that the documents retrieved with a correctly spelled query and with its misspelled variants are similar to some extent in their term distributions.
The web retrieval results of both vacuum and vaccum contain terms such as cleaner, pump, bag or systems.
We can take this similarity as evidence to verify the spelling correction results.
4 Problem Statement
Given a query q, a spelling correction model is to find a query string c that maximizes the posterior probability of c given q within the confusion set of q. Formally we can write this as follows:
$$c^* = \arg\max_{c \in C} \Pr(c \mid q)$$
where C is the confusion set of q. Each query string c in the confusion set is a correction candidate for q, which satisfies the constraint that the spelling similarity between c and q is within a given threshold.
In this formulation, the error detection and correction are performed in a unified way.
The query q itself always belongs to its confusion set C, and when the spelling correction model identifies a more probable query string c in C which is different from q, it claims a spelling error detected and makes a correction suggestion c.
There are two tasks in this framework.
One is how to learn a statistical model to estimate the conditional probability Pr(c | q), and the other is how to generate the confusion set C of a given query q.
4.1 Maximum Entropy Model for Query Spelling Correction
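We take a feature-based approach to model the posterior probability Pr(c | q).
Specifically we use the maximum entropy model (Berger et al., 1996) for this task:
$$\Pr(c \mid q) = \frac{\exp\left(\sum_{i=1}^{N} \lambda_i f_i(c, q)\right)}{\sum_{c' \in C} \exp\left(\sum_{i=1}^{N} \lambda_i f_i(c', q)\right)}$$
where $\sum_{c' \in C} \exp\left(\sum_{i=1}^{N} \lambda_i f_i(c', q)\right)$ is the normalization factor; $f_i(c, q)$ is a feature function defined over query q and correction candidate c, while $\lambda_i$ is the corresponding feature weight.
The weights $\lambda_i$ can be optimized using numerical optimization algorithms such as the Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff, 1972) by maximizing the posterior probability of the training set, which contains a manually labeled set of query-truth pairs:
$$\lambda^* = \arg\max_{\lambda} \sum_{(q, t)} \log \Pr\nolimits_{\lambda}(t \mid q)$$
where the sum runs over the query-truth pairs (q, t) in the training set.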
The advantage of the maximum entropy model is that it provides a natural and unified framework for integrating all available information sources.
This property is well suited to our task, in which we use a wide variety of evidence based on the lexicon, query logs and web search results.
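To make the model concrete, the following is a minimal sketch of how the posterior over a confusion set could be computed from feature values and weights, and how the most probable candidate is then selected. The feature values, the weights and the function names are hypothetical and chosen only for illustration; training the weights (for example with GIS) is not shown.

```python
import math


def maxent_posterior(features_by_candidate, weights):
    """Return Pr(c | q) for every candidate c in the confusion set.

    features_by_candidate maps a candidate string to its feature vector
    [f_1(c, q), ..., f_N(c, q)]; weights holds [lambda_1, ..., lambda_N].
    """
    scores = {
        c: sum(w * f for w, f in zip(weights, feats))
        for c, feats in features_by_candidate.items()
    }
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())  # normalization factor over the confusion set
    return {c: e / z for c, e in exp_scores.items()}


def best_candidate(features_by_candidate, weights):
    """Pick the candidate with the highest posterior probability."""
    posterior = maxent_posterior(features_by_candidate, weights)
    return max(posterior, key=posterior.get)


# Hypothetical two-feature example; the input query is part of its own
# confusion set, so "no correction" is one of the possible outcomes.
confusion_set = {
    "vaccum cleaner": [1.0, 0.2],
    "vacuum cleaner": [0.8, 0.9],
}
weights = [0.5, 2.0]
posterior = maxent_posterior(confusion_set, weights)
suggestion = best_candidate(confusion_set, weights)
```

Because the input query is scored alongside its candidates, error detection and correction happen in the same step: a suggestion is made only when some other candidate receives a higher posterior.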
4.2 Correction Candidate Generation
Correction candidate generation for a query q can be decomposed into two phases.
In the first phase, correction candidates are generated for each term in the query from a term-base extracted from query logs.
This task can leverage conventional spelling correction methods such as generating candidates based on edit distance (Cucerzan and Brill, 2004) or phonetic similarity (Philips, 1990).
Then the correction candidates of the entire query are generated by composing the correction candidates of each individual term.
Let $q = w_1 \cdots w_n$, and let the confusion set of $w_i$ be $C_{w_i}$; then the confusion set of q is $C_{w_1} \otimes C_{w_2} \otimes \cdots \otimes C_{w_n}$ (footnote 1).
For example, for a query $q = w_1 w_2$, where $w_1$ has candidates $c_{11}$ and $c_{12}$ while $w_2$ has candidates $c_{21}$ and $c_{22}$, the confusion set C is $\{c_{11}c_{21}, c_{11}c_{22}, c_{12}c_{21}, c_{12}c_{22}\}$.
Footnote 1: For notational simplicity, we do not cover compound and composition errors here.
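The two-phase generation described above can be sketched as follows. This is an illustration under stated assumptions: the term base, the edit-distance threshold of 1 and the function names are hypothetical, and plain Levenshtein distance stands in for whatever spelling similarity measure is actually used.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def term_candidates(term, term_base, max_distance=1):
    """Phase 1: candidates for one term; the term itself is always kept."""
    close = [t for t in term_base
             if t != term and edit_distance(term, t) <= max_distance]
    return [term] + close


def compose_confusion_set(per_term_candidates):
    """Phase 2: Cartesian product of the per-term candidate lists."""
    return [" ".join(combo) for combo in product(*per_term_candidates)]


# Hypothetical term base, standing in for one extracted from query logs.
term_base = ["vacuum", "vaccine", "cleaner", "cleaver"]
query = "vaccum cleaner"
per_term = [term_candidates(t, term_base) for t in query.split()]
confusion_set = compose_confusion_set(per_term)
```

The size of the composed set is the product of the per-term candidate counts, so a three-term query with 100 candidates per term already yields 1,000,000 query candidates.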
The problem with this method is that the confusion set C may be huge for multi-term queries.
In practice, one term may have hundreds of possible candidates, so a query containing several terms may have millions.
This can make search and training with the maximum entropy model impractical.
Our solution to this problem is to use candidate pruning.
We first roughly rank the candidates based on the statistical n-gram language model estimated from query logs.
Then we only choose a subset of C that contains a specified number of top-ranked (most probable) candidates to present to the maximum entropy model for offline training and online re-ranking, and the number of candidates is used as a parameter to balance top-line performance and run-time efficiency.
This subset can be efficiently generated as shown in (Li et al., 2006).
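As an illustration of the pruning step, the sketch below ranks query candidates with a small bigram language model and keeps the top-ranked ones. The counts, the add-one smoothing and the function names are assumptions made for this example. The sketch also scores an already enumerated candidate list, which is only workable for small sets; it does not reproduce the efficient generation procedure of Li et al. (2006).

```python
import heapq
import math


def bigram_log_prob(query, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed bigram log-probability of a query string."""
    terms = ["<s>"] + query.split()
    logp = 0.0
    for prev, cur in zip(terms, terms[1:]):
        numerator = bigram_counts.get((prev, cur), 0) + 1
        denominator = unigram_counts.get(prev, 0) + vocab_size
        logp += math.log(numerator / denominator)
    return logp


def prune(candidates, score, n_best):
    """Keep the n_best candidates with the highest scores, best first."""
    return heapq.nlargest(n_best, candidates, key=score)


# Hypothetical counts, standing in for statistics estimated from query logs.
unigram_counts = {"<s>": 100, "vacuum": 40, "vaccum": 3}
bigram_counts = {
    ("<s>", "vacuum"): 40,
    ("<s>", "vaccum"): 3,
    ("vacuum", "cleaner"): 30,
    ("vaccum", "cleaner"): 2,
}
candidates = ["vaccum cleaner", "vaccum cleaver",
              "vacuum cleaner", "vacuum cleaver"]


def lm_score(c):
    return bigram_log_prob(c, bigram_counts, unigram_counts, vocab_size=4)


pruned = prune(candidates, lm_score, n_best=2)
```

Here n_best plays the role of the candidate-number parameter that trades off achievable performance against run-time cost.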
5 Web Search Results based Query Spelling Correction
In this section we will describe in detail the methods for use of web search results in the query spelling correction task.
In our work we studied two schemes.
The first one only employs indicators of the input query's search results, while the other also looks at the most probable correction candidates' search results.
For each scheme, we extract additional scheme-specific features from the available search results, combine them with the baseline features and construct a new maximum entropy model to perform candidate ranking.
5.1 Baseline model
We denote the baseline feature set as S0, derived from the state-of-the-art work of Li et al. (2006), and the maximum entropy model built on it as M0. S0 mostly comprises features concerning the statistics of the query terms and the similarities between query terms and their correction candidates.
5.2 Scheme 1: Using search results of input query only
In this scheme we build more features for each correction candidate (including the input query q itself) by distilling more evidence from the search results of the query.
S1 denotes the augmented feature set, and M1 denotes the maximum entropy model based on S1.
The features are listed as follows:
Number of pages returned: the number of web search pages retrieved by a web search engine, which is used to estimate the popularity of query.
This feature is only for q.
URL string: Binary features indicating whether the combination of terms of each candidate is in the URLs of top retrieved documents.
This feature is for all candidates.
Frequency of correction candidate term: the number of occurrences of the modified terms of the correction candidate in the titles and snippets of the top retrieved documents, based on the observation that correction terms may co-occur with their misspelled forms.
This feature does not apply to q.
Frequency of query term: the number of occurrences of each term of q in the titles or snippets of the top retrieved documents, based on the observation that correct terms usually appear frequently in their own search results.
Abbreviation pattern: Binary features indicating whether input query terms might be abbreviations according to text patterns in the search results.
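The features listed above could be extracted along the following lines. This is a rough sketch rather than the exact feature definitions: the SearchResult record, the tokenizer, the URL test on concatenated terms and the "(ABBR)" text pattern are all assumptions introduced for illustration, and the page-count feature is omitted because it comes directly from the search engine.

```python
import re
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical container for one retrieved item."""
    title: str
    snippet: str
    url: str


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def url_match(candidate, results):
    """Binary feature: do the candidate's concatenated terms occur in any URL?"""
    joined = "".join(candidate.lower().split())
    return any(joined in r.url.lower() for r in results)


def term_frequency(terms, results):
    """Total occurrences of the given terms in result titles and snippets."""
    wanted = set(terms)
    return sum(1 for r in results
               for tok in tokenize(r.title + " " + r.snippet)
               if tok in wanted)


def modified_terms(query, candidate):
    """Terms of the candidate that differ from the aligned query terms."""
    return [c for q, c in zip(query.split(), candidate.split()) if q != c]


def looks_like_abbreviation(term, results):
    """Binary feature: term appears in parentheses, as in 'full form (TERM)'."""
    pattern = re.compile(r"\(\s*" + re.escape(term) + r"\s*\)", re.IGNORECASE)
    return any(pattern.search(r.title + " " + r.snippet) is not None
               for r in results)


results = [
    SearchResult("Vacuum Cleaner Parts & Vacuum Filters - Vacuum Cleaner Shop",
                 "Get vacuum cleaner parts at guaranteed low prices.",
                 "www.vacuumcleanershop.com"),
    SearchResult("Vaccum Cleaner",
                 "vaccum cleaner resources, information, and directory.",
                 "www.vaccumcleaner-foryou.info"),
]
query, candidate = "vaccum cleaner", "vacuum cleaner"
candidate_freq = term_frequency(modified_terms(query, candidate), results)
query_freq = term_frequency(query.split(), results)
sars = [SearchResult("CDC | Fact Sheet: Basic Information About SARS",
                     "Severe acute respiratory syndrome (SARS) is a viral "
                     "respiratory illness caused by ...",
                     "www.cdc.gov/ncidod/sars/factsheet.htm")]
```

In this toy example the corrected term vacuum occurs several times in the results retrieved for the misspelled query, which is the kind of evidence the candidate-frequency feature is meant to capture.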
5.3 Scheme 2: Using both search results of input query and top-ranked candidate
In this scheme we use the search results of both the query q and the top-ranked candidate c other than q, as determined by M1.
First we submit the query to a search engine for the initial retrieval to obtain one set of search results Rq, then use M1 to find the best correction candidate c other than q. Next we perform a second retrieval with c to obtain another set of search results Rc. Finally, additional features are generated for each candidate based on Rc, and a new maximum entropy model M2 is built to re-rank the candidates a second time.
The entire process is shown schematically in Figure 3.
Figure 3. Relations of models and features
where Rq is the web search results of query q; Rc is the web search results of c which is the top-ranked correction of q suggested by model M1.
The new feature set, denoted S2, is a set of document similarities between Rq and Rc. It contains several estimates of the similarity between the query and its correction at the document level, all computed with the cosine measure over term frequency vectors of Rq and Rc.
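A minimal sketch of such a similarity feature is shown below. It builds one term frequency vector per result set and takes their cosine. The tokenizer and the toy snippet texts are assumptions for illustration; the individual similarity variants that make up S2 are not spelled out here.

```python
import math
import re
from collections import Counter


def term_vector(texts):
    """Term frequency vector over all titles/snippets of one result set."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse term frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical snippet texts standing in for Rq and Rc.
rq = ["vaccum cleaner bag", "cleaner pump"]
rc = ["vacuum cleaner bag systems", "vacuum pump"]
similarity = cosine(term_vector(rq), term_vector(rc))
```

A high similarity between Rq and Rc suggests that the query and its candidate retrieve documents about the same topic, which supports the candidate as the intended spelling.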
6 Experiments
6.1 Evaluation Metrics
In our work, we consider the following four types of evaluation metrics:
• Accuracy: The number of correct outputs proposed by the spelling correction model divided by the total number of queries in the test set
• Recall: The number of correct suggestions for misspelled queries by the spelling correction model divided by the total number of misspelled queries in the test set
• Precision: The number of correct suggestions for misspelled queries proposed by the spelling correction model divided by the total number of suggestions made by the system
• F-measure: The harmonic mean of precision and recall, calculated as F = 2PR/(P + R)
Any individual metric above might not be sufficient to indicate the overall performance of a query spelling correction model.
For example, as in most retrieval tasks, we can trade recall for precision or vice versa.
Although intuitively F might be in accordance with accuracy, there is no strict theoretical relation between these two numbers - there are conditions under which accuracy improves while F-measure may drop or be unchanged.
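The four metrics can be computed from (query, truth, output) triples as in the sketch below. The triples are made up for illustration, and the function name is hypothetical; a suggestion is counted whenever the model output differs from the input query.

```python
def evaluate(examples):
    """Compute accuracy, precision, recall and F-measure.

    examples is a list of (query, truth, output) triples, where truth equals
    the query for valid queries and output is what the model returns.
    """
    total = len(examples)
    correct = sum(1 for q, t, o in examples if o == t)
    misspelled = sum(1 for q, t, o in examples if q != t)
    suggestions = sum(1 for q, t, o in examples if o != q)
    correct_suggestions = sum(1 for q, t, o in examples if q != t and o == t)

    accuracy = correct / total if total else 0.0
    precision = correct_suggestions / suggestions if suggestions else 0.0
    recall = correct_suggestions / misspelled if misspelled else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical outputs: one correct fix, one wrong change to a valid query,
# one missed misspelling, and one valid query left unchanged.
examples = [
    ("vaccum cleaner", "vacuum cleaner", "vacuum cleaner"),
    ("innovet", "innovet", "innovate"),
    ("teh", "the", "teh"),
    ("guinness", "guinness", "guinness"),
]
metrics = evaluate(examples)
```

Valid queries that are left unchanged raise accuracy but do not enter precision or recall, which is one reason accuracy and F-measure can move in different directions.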
6.2 Experimental Setup
We used a manually constructed data set as gold standard for evaluation.
First we randomly sampled 7,000 queries from a search engine's daily query logs of different time periods, and had them manually labeled by two annotators independently.
Each query is attached to a truth, which is either the query itself for valid queries, or a spelling correction for misspelled ones.
From the annotation results on which both annotators agreed, we extracted 2,323 query-truth pairs as the training set and 991 as the test set.
Table 1 shows the statistics of the data sets, in which Eq denotes the error rate of query and Et denotes the error rate of term.
Table 1. Statistics of training set and test set
              # queries
Training set  2,323
Test set      991
In the following experiments, unless otherwise noted, at most 50 correction candidates were used in the maximum entropy model for each query.
The web search results were fetched from MSN's search engine.
By default, the top 100 retrieved items from the web retrieval results were used to perform feature extraction.
A set of query log data spanning 9 months was used for collecting the statistics required by the baseline.
6.3 Results
Following the method described in the previous sections, we first ran a group of experiments to evaluate the performance of each model we discussed with default settings.
The detailed results are shown in Table 2.
Table 2. Overall Results
From the table we can observe significant performance boosts on all evaluation metrics for M1 and M2 over M0.
Adding the S1 features to the baseline (M1 versus M0) yields a 25.6% error rate reduction and a 23.6% improvement in precision, as well as a 6.6% relative improvement in recall.
A paired t-test gives a p-value of 0.002, which is significant at the 0.01 level.
M2 brings an additional 13.1% error rate reduction and a moderate improvement in precision, as well as a 3.6% improvement in recall over M1, and a paired t-test shows that the improvement is significant at the 0.01 level.
6.4 Impact of Candidate number
Theoretically the number of correction candidates in the confusion set determines the accuracy and recall upper bounds for all models concerned in this paper.
Performance may suffer if the candidate number is too small, because the correct candidates may then be excluded from the confusion sets.
The curves in Figures 4 and 5 include both the theoretical bound (oracle) and the actual performance of our described models.
From the charts we can see that our models perform best when the candidate number Nt is around 50; when Nt > 15 the oracle recall and accuracy stay almost unchanged, so the actual models benefit only a little from larger Nt values.
The missing part of recall is largely due to the fact that we are not able to generate truth candidates for some unusual query terms, rather than to an insufficient confusion set size.
Figure 4. Recall versus candidate number
Figure 5. Accuracy versus candidate number
6.5 Discussions
We also studied the performance difference between in-vocabulary (IV) and out-of-vocabulary (OOV) terms when using different spelling correction models.
The detailed results are shown in Table 3 and Table 4.
Table 3. OOV Term Results
The results show that M1 is much more effective than M0 at identifying and correcting OOV spelling errors.
For example, M1 is able to correct spelling errors such as guiness, whose frequency in the query log is even higher than that of its true spelling guinness.
Since most spelling errors are OOV terms, this explains why the model M1 can significantly outperform the baseline.
But for IV terms things are different.
Although the overall accuracy is better, the F-measure of M1 is far lower than that of M0.
M2 performs best for the IV task in terms of both accuracy and F-measure.
However, IV spelling errors are so small a portion of the total misspellings (only 17.4% of total spelling errors in our test set) that the room for improvement is very small.
This helps to explain why the performance gap between M1 and M0 is much larger than the one between M2 and M1. It also shows that M1 tends to identify and correct OOV misspellings rather than IV ones, which causes the F-measure drop from M0 to M1 on IV terms, whereas M2, by introducing more useful evidence, outperforms both M0 and M1 on OOV and IV terms alike.
Another set of statistics we collected from the experiments is the performance data of low-frequency terms when using the models proposed in this paper, since we believe that our approach would help make better classification of low-frequency search terms.
As a case study, we identified in the test set all terms whose frequencies in our query logs are less than 800, and for these terms we calculated the error rate reduction of model M1 over the baseline model M0 at each interval of 50.
The detailed results are shown in Figure 6.
A clear trend can be observed: M1 achieves a larger error rate reduction over the baseline for terms with lower frequencies.
This is because the performance of the baseline model drops for these terms when no reliable distributional similarity estimates are available due to data sparseness in query logs, while M1 can use web data to alleviate this problem.
Figure 6.
Error rate reduction of M1 over baseline for terms in different frequency ranges
7 Conclusions and Future Work
The task of query spelling correction is very different from conventional spelling checkers, and poses special research challenges.
In this paper, we presented a novel method for use of web search results to improve existing query spelling correction models.
We explored two schemes for taking advantage of the information extracted from web search results.
Experimental results show that our proposed methods achieve statistically significant improvements over the baseline model, which relies only on evidence from the lexicon, spelling similarity and statistics estimated from query logs.
Other potentially useful information remains to be studied in this direction.
For example, we can work on the page ranking information of the returned pages, because trusted or well-known sites with high page rank generally contain few wrong spellings.
In addition, the term co-occurrence statistics on the returned snippet text are also worth deeper investigation.
