Speech recognition transcripts are far from perfect; they are not of sufficient quality to be useful on their own for spoken document retrieval.
This is especially the case for conversational speech.
Recent efforts have tried to overcome this issue by using statistics from speech lattices instead of only the 1-best transcripts; however, these efforts have invariably used the classical vector space retrieval model.
This paper presents a novel approach to lattice-based spoken document retrieval using statistical language models: a statistical model is estimated for each document, and probabilities derived from the document models are directly used to measure relevance.
Experimental results show that the lattice-based language modeling method outperforms both the language modeling retrieval method using only the 1-best transcripts and a recently proposed lattice-based vector space retrieval method.
1 Introduction
Information retrieval (IR) is the task of ranking a collection of documents according to an estimate of their relevance to a query.
With the recent growth in the amount of speech recordings in the form of voice mails, news broadcasts, and so forth, the task of spoken document retrieval (SDR) - information retrieval in which the document collection is in the form of speech recordings - is becoming increasingly important.
SDR on broadcast news corpora has been "deemed to be a solved problem", due to the fact that the performance of retrieval engines working on 1-best automatic speech recognition (ASR) transcripts was found to be "virtually the same as their performance on the human reference transcripts" (NIST, 2000).
However, this is still not the case for SDR on data which are more challenging, such as conversational speech in noisy environments, as the 1-best transcripts of these data contain too many recognition errors to be useful for retrieval.
One way to ameliorate this problem is to work with not just one ASR hypothesis for each utterance, but multiple hypotheses presented in a lattice data structure.
A lattice is a connected directed acyclic graph in which each edge is labeled with a term hypothesis and a likelihood value (James, 1995); each path through a lattice gives a hypothesis of the sequence of terms spoken in the utterance.
Each lattice can be viewed as a statistical model of the possible transcripts of an utterance (given the speech recognizer's state of knowledge); thus, an IR model based on statistical inference would seem to be a more natural and more principled approach to lattice-based SDR.
This paper thus proposes a lattice-based SDR method based on the statistical language modeling approach of Song and Croft (1999).
In this method, the expected word count - the mean number of occurrences of a word given a lattice's statistical model - is computed for each word in each lattice.
Using these expected counts, a statistical language model is estimated for each spoken document, and a document's relevance to a query is computed as a probability under this model.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 810-818, Prague, June 2007.
©2007 Association for Computational Linguistics
The rest of this paper is organized as follows.
In Section 2 we review related work in the areas of speech processing and IR.
Section 3 describes our proposed method as well as the baseline methods.
Details of the experimental setup are given in Section 4, and experimental results are in Section 5.
Finally, Section 6 concludes our discussions and outlines our future work.
2 Related Work
2.1 Lattices for Spoken Document Retrieval
James and Young (1994) first introduced the lattice as a representation for indexing spoken documents, as part of a method for vocabulary-independent keyword spotting.
The lattice representation was later applied to the task of spoken document retrieval by James (1995): James counted how many times each query word occurred in each phone lattice with a sufficiently high normalized log likelihood, and these counts were then used in retrieval under a vector space model with tf • idf weighting.
Jones et al. (1996) combined retrieval from phone lattices using variations of James' method with retrieval from 1-best word transcripts to achieve better results.
Since then, a number of different methods for SDR using lattices have been proposed.
For instance, Siegler (1999) used word lattices instead of phone lattices as the basis of retrieval, and generalized the tf • idf formalism to allow uncertainty in word counts.
Chelba and Acero (2005) preprocessed lattices into more compact Position Specific Posterior Lattices (PSPL), and computed an aggregate score for each document based on the posterior probability of edges and the proximity of search terms in the document.
Mamou et al. (2006) converted each lattice into a word confusion network (Mangu et al., 2000), and estimated the inverse document frequency (idf) of each word t as the ratio of the total number of words in the document collection to the total number of occurrences of t.
Despite the differences in the details, the above lattice-based SDR methods have all been based on the classical vector space retrieval model with tf • idf weighting.
2.2 Expected Counts from Lattices
A speech recognizer generates a 1-best transcript of a spoken document by considering possible transcripts of the document, and then selecting the transcript with the highest probability.
However, unlike a text document, such a 1-best transcript is likely to be inexact due to speech recognition errors.
To represent the uncertainty in speech recognition, and to incorporate information from multiple transcription hypotheses rather than only the 1-best, it is desirable to use expected word counts from lattices output by a speech recognizer.
In the context of spoken document search, Siegler (1999) described expected word counts and formulated a way to estimate expected word counts from lattices based on the relative ranks of word hypothesis probabilities; Chelba and Acero (2005) used a more explicit formula for computing word counts based on summing edge posterior probabilities in lattices; Saraclar and Sproat (2004) performed word-spotting in speech lattices by looking for word occurrences whose expected counts were above a certain threshold; and Yu et al. (2005) searched for phrases in spoken documents using a similar measure, the expected word relevance.
Expected counts have also been used to summarize the phonotactics of a speech recording represented in a lattice: Hatch et al. (2005) performed speaker recognition by computing the expected counts of phone bigrams in a phone lattice, and estimating an unsmoothed probability distribution of phone bigrams.
Although many uses of expected counts have been studied, the use of statistical language models built from expected word counts has not been well explored.
2.3 Retrieval via Statistical Language Modeling
Finally, the statistical language modeling approach to retrieval was used by Ponte and Croft (1998) for IR with text documents, and it was shown to outperform the tf • idf approach for this task; this method was further improved on in Song and Croft (1999).
Chen et al. (2004) applied Song and Croft's method to Mandarin spoken document retrieval using 1-best ASR transcripts.
In this task, it was also shown to outperform tf • idf.
Thus, the statistical language modeling approach to retrieval has been shown to be superior to the vector space approach for both these IR tasks.
2.4 Contributions of Our Work
The main contributions of our work include
• extending the language modeling IR approach from text-based retrieval to lattice-based spoken document retrieval; and
• formulating a method for building a statistical language model based on expected word counts derived from lattices.
Our method is motivated by the success of the statistical retrieval framework over the vector space approach with tf • idf for text-based IR, as well as for spoken document retrieval via 1-best transcripts.
Our use of expected counts differs from Saraclar and Sproat (2004) in that we estimate probability models from the expected counts.
Conceptually, our method is close to that of Hatch et al. (2005), as both methods build a language model to summarize the content of a spoken document represented in a lattice.
In practice, our method differs from Hatch et al. (2005)'s in many ways: first, we derive word statistics for representing semantics, instead of phone bigram statistics for representing phonotactics; second, we introduce a smoothing mechanism (Zhai and Lafferty, 2004) to the language model that is specific for information retrieval.
3 Methods
We now describe the formulation of three different SDR methods: a baseline statistical retrieval method which works on 1-best transcripts, our proposed statistical lattice-based SDR method, as well as a previously published vector space lattice-based SDR method.
3.1 Baseline Statistical Retrieval Method
Our baseline retrieval method is motivated by Song and Croft (1999), and uses the language model smoothing methods of Zhai and Lafferty (2004).
This method is used to perform retrieval on the documents' 1-best ASR transcripts and reference human transcripts.
Let C be the collection of documents to retrieve from.
For each document d contained in C, and each query q, the relevance of d to q can be defined as Pr(d | q).
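This probability cannot be computed directly, but under the assumption that the prior Pr(d) is uniform over all documents in C, we see that

\[ \Pr(d \mid q) = \frac{\Pr(q \mid d)\,\Pr(d)}{\Pr(q)} \propto \Pr(q \mid d) \]

([…] and Lafferty, 1999), so documents can be ranked by Pr(q | d) instead. Writing the query as a sequence of words q = q_1 q_2 ⋯ q_K drawn from a vocabulary V, and given a unigram model which assigns a probability Pr(w | d) to each word w ∈ V, we have

\[ \Pr(q \mid d) = \prod_{i=1}^{K} \Pr(q_i \mid d) = \prod_{w \in V} \Pr(w \mid d)^{C(w \mid q)} \tag{1} \]

where C(w | q) is the word count of w in q.
Before using Equation 1, we must estimate a unigram model from d: that is, an assignment of probabilities Pr(w | d) for all w ∈ V. One way to do this is to use a maximum likelihood estimate (MLE) - an assignment of Pr(w | d) for all w which maximizes the probability of generating d. The MLE is given by the equation

\[ \Pr\nolimits_{\mathrm{MLE}}(w \mid d) = \frac{C(w \mid d)}{|d|} \]

where C(w | d) is the number of occurrences of w in d, and |d| is the total number of words in d. However, using this formula means we will get a value of zero for Pr(q | d) if even a single query word q_i is not found in d. To overcome this problem, we smooth the model by assigning some probability mass to such unseen words.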
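As a minimal sketch of the maximum likelihood estimate and of the zero-probability problem it causes, the following Python fragment scores two queries against a toy document. The function names, the toy document, and the queries are illustrative assumptions and not part of the original system.

```python
from collections import Counter

def mle_unigram(doc_words):
    """Maximum likelihood unigram model: Pr(w | d) = C(w | d) / |d|."""
    counts = Counter(doc_words)
    total = len(doc_words)
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query_words, model):
    """Pr(q | d) as the product of Pr(w | d) ** C(w | q) over the query words."""
    prob = 1.0
    for w, c in Counter(query_words).items():
        # Words absent from the document get probability zero under the MLE.
        prob *= model.get(w, 0.0) ** c
    return prob

# Hypothetical toy document and queries.
doc = "the bank approved the loan".split()
model = mle_unigram(doc)
seen = query_likelihood(["bank", "loan"], model)    # both words occur in doc
unseen = query_likelihood(["bank", "visa"], model)  # "visa" never occurs in doc
```

Here `seen` is positive, while `unseen` collapses to zero because a single query word is missing from the document, which is the situation that smoothing is meant to repair.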
Specifically, we adopt a two-stage smoothing method (Zhai and Lafferty, 2004):

\[ \Pr(w \mid d) = (1 - \lambda)\,\frac{C(w \mid d) + \mu \Pr(w \mid U)}{|d| + \mu} + \lambda \Pr(w \mid U) \tag{2} \]

Here, U denotes a background language model, and μ > 0 and λ ∈ (0, 1) are parameters to the smoothing procedure.
This is a combination of Bayesian smoothing using Dirichlet priors (MacKay and Peto, 1984) and Jelinek-Mercer smoothing (Jelinek and Mercer, 1980).
The parameter λ can be set empirically according to the nature of the queries.
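The smoothing step can be sketched as follows, assuming the two-stage form of Zhai and Lafferty in which a Dirichlet-prior estimate (parameter `mu`) is interpolated with the background model (weight `lam`). The background distribution, the document counts, and the parameter values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def two_stage_prob(word, doc_counts, doc_len, background, mu, lam):
    """Two-stage smoothed Pr(w | d): Dirichlet prior step, then Jelinek-Mercer step."""
    p_bg = background[word]
    dirichlet = (doc_counts.get(word, 0) + mu * p_bg) / (doc_len + mu)
    return (1.0 - lam) * dirichlet + lam * p_bg

# Hypothetical background model U over a four-word vocabulary.
background = {"the": 0.25, "bank": 0.25, "loan": 0.25, "visa": 0.25}
# Hypothetical document counts; "visa" does not occur in the document.
doc_counts = {"the": 2, "bank": 1, "loan": 1}
doc_len = sum(doc_counts.values())

probs = {
    w: two_stage_prob(w, doc_counts, doc_len, background, mu=4.0, lam=0.1)
    for w in background
}
```

The unseen word "visa" now receives non-zero probability, and the smoothed values still sum to one over the vocabulary because the background model itself is a proper distribution.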
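For the parameter μ, we adopt the estimation procedure of Zhai and Lafferty (2004): we maximize the leave-one-out log likelihood of the document collection, namely

\[ \ell_{-1}(\mu \mid \mathcal{C}) = \sum_{d \in \mathcal{C}} \sum_{w \in V} C(w \mid d)\, \ln \frac{C(w \mid d) - 1 + \mu \Pr(w \mid U)}{|d| - 1 + \mu} \tag{3} \]

by using Newton's method to solve the equation

\[ \frac{\partial \ell_{-1}(\mu \mid \mathcal{C})}{\partial \mu} = 0 \]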
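A minimal sketch of this estimation step is given below. It assumes the leave-one-out log likelihood takes the Zhai and Lafferty form, a sum over documents and words of C(w | d) ln((C(w | d) - 1 + mu Pr(w | U)) / (|d| - 1 + mu)), and applies Newton's method to its first derivative. The toy collection, the uniform background model, and the step-halving safeguard are assumptions made for illustration.

```python
def loo_derivatives(mu, docs, background):
    """First and second derivatives of the leave-one-out log likelihood in mu."""
    grad, hess = 0.0, 0.0
    for counts in docs:
        doc_len = sum(counts.values())
        for w, c in counts.items():
            p = background[w]
            num = c - 1 + mu * p    # leave-one-out smoothed count
            den = doc_len - 1 + mu  # leave-one-out smoothed length
            grad += c * (p / num - 1.0 / den)
            hess += c * (-(p / num) ** 2 + (1.0 / den) ** 2)
    return grad, hess

def estimate_mu(docs, background, mu0=1.0, iterations=50, tol=1e-10):
    """Newton's method on the derivative, keeping mu strictly positive."""
    mu = mu0
    for _ in range(iterations):
        grad, hess = loo_derivatives(mu, docs, background)
        if abs(grad) < tol or hess == 0.0:
            break
        step = grad / hess
        new_mu = mu - step
        while new_mu <= 0.0:  # safeguard (an assumption): halve the step
            step /= 2.0
            new_mu = mu - step
        mu = new_mu
    return mu

# Hypothetical collection: two documents in which one word repeats three times.
docs = [{"a": 3, "b": 1}, {"c": 3, "d": 1}]
background = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
mu_hat = estimate_mu(docs, background)
```

For this toy collection the stationary point can be checked by hand: each document contributes 3/(8 + mu) + 1/mu - 4/(3 + mu) to the derivative, which vanishes at mu = 2.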
3.2 Our Proposed Statistical Lattice-Based Retrieval Method
We now propose our lattice-based retrieval method.
In contrast to the above baseline method, our proposed method works on the lattice representation of spoken documents, as generated by a speech recognizer.
First, each spoken document is divided into M short speech segments.
A speech recognizer then generates a lattice for each speech segment.
As previously stated, a lattice is a connected directed acyclic graph with edges labeled with word hypotheses and likelihoods.
Thus, each path through the lattice contains a hypothesis of the series of words spoken in this speech segment, t = t_1 t_2 ⋯ t_N, along with acoustic probabilities Pr(o_1 | t_1), Pr(o_2 | t_2), ⋯, Pr(o_N | t_N), where o_i denotes the acoustic observations for the time interval of the word t_i hypothesized by the speech recognizer.
Let o = o_1 o_2 ⋯ o_N denote the acoustic observations for the entire speech segment; then

\[ \Pr(o \mid t) = \prod_{i=1}^{N} \Pr(o_i \mid t_i) \]

We then rescore each lattice with an n-gram language model.
Effectively, this means multiplying the acoustic probabilities with n-gram probabilities:

\[ \Pr(t, o) = \Pr(o \mid t)\,\Pr(t) = \prod_{i=1}^{N} \Pr(o_i \mid t_i)\,\Pr(t_i \mid t_{i-n+1} \cdots t_{i-1}) \]

This produces an expanded lattice in which paths (hypotheses) are weighted by their posterior probabilities rather than their acoustic likelihoods: specifically, by Pr(t, o) ∝ Pr(t | o) rather than Pr(o | t) (Odell, 1995).
The lattice is then pruned, by removing those paths in the lattice whose log posterior probabilities - to be precise, whose γ ln Pr(t | o) - are not within a threshold θ of the best path's log posterior probability (in our implementation, γ = 10000.5).
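The rescoring and pruning steps can be illustrated on a toy lattice. The sketch below enumerates every path explicitly, which is only feasible for very small lattices; a real toolkit would work on the graph directly. The lattice, the unigram model standing in for the n-gram model, and the names `theta` and `scale` are all assumptions for illustration.

```python
import math

# Hypothetical toy lattice: node -> list of (next_node, word, acoustic likelihood).
LATTICE = {
    0: [(1, "I", 0.6), (1, "eye", 0.4)],
    1: [(2, "see", 0.5), (2, "scream", 0.5)],
    2: [],
}
START, END = 0, 2

# Hypothetical unigram model standing in for the n-gram model used in rescoring.
LM = {"I": 0.5, "eye": 0.1, "see": 0.3, "scream": 0.1}

def scored_paths(node=START, words=(), log_joint=0.0):
    """Yield (t, ln Pr(t, o)) for every path, combining acoustic and LM scores."""
    if node == END:
        yield words, log_joint
        return
    for nxt, word, acoustic in LATTICE[node]:
        step = math.log(acoustic) + math.log(LM[word])
        yield from scored_paths(nxt, words + (word,), log_joint + step)

def prune(paths, theta, scale=1.0):
    """Keep paths whose scaled log score is within theta of the best path.

    Differences of log joint scores equal differences of log posteriors,
    because the normalizing constant Pr(o) cancels.
    """
    best = max(score for _, score in paths)
    return [(t, s) for t, s in paths if scale * (best - s) <= theta]

def posteriors(paths):
    """Normalize joint scores into Pr(t | o) over the surviving paths."""
    total = sum(math.exp(s) for _, s in paths)
    return {t: math.exp(s) / total for t, s in paths}

all_paths = list(scored_paths())
kept = prune(all_paths, theta=2.5)
post = posteriors(kept)
```

With these toy numbers the least likely path ("eye", "scream") falls outside the threshold and is removed, and the posteriors of the three remaining paths are renormalized.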
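Next, we compute the expected count of each word in each document. For a document d whose M speech segments have acoustic observations o^(1), o^(2), ⋯, o^(M), the expected count of a word w is

\[ E[C(w \mid d)] = \sum_{j=1}^{M} \sum_{t} C(w \mid t)\,\Pr(t \mid o^{(j)}) \]

where C(w | t) is the word count of w in the hypothesized transcript t. We can also analogously compute the expected document length:

\[ E[|d|] = \sum_{j=1}^{M} \sum_{t} |t|\,\Pr(t \mid o^{(j)}) \]

where |t| denotes the number of words in t.
Replacing C(w | d) and |d| in Equation 2 with E[C(w | d)] and E[|d|] gives the document model

\[ \Pr(w \mid d) = (1 - \lambda)\,\frac{E[C(w \mid d)] + \mu \Pr(w \mid U)}{E[|d|] + \mu} + \lambda \Pr(w \mid U) \tag{4} \]

In addition, we also modify the procedure for estimating μ, by replacing C(w | d) and |d| in Equation 3 with ⌊E[C(w | d)] + 1/2⌋ and Σ_{w ∈ V} ⌊E[C(w | d)] + 1/2⌋ respectively.
The probability estimates from Equation 4 can then be substituted into Equation 1 to yield relevance scores.

Figure 1: Example of a word confusion network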
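The expected-count computation can be sketched as follows, assuming each speech segment is available as a small set of hypothesized transcripts with posterior probabilities. In practice these quantities would be computed on the lattice itself rather than by listing paths. The segment data, the background model, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def expected_counts(segment_posteriors):
    """Return (E[C(w | d)] for each word, E[|d|]) for one document.

    segment_posteriors is a list with one dict per speech segment, mapping a
    hypothesized transcript (tuple of words) to its posterior Pr(t | o).
    """
    exp_count = defaultdict(float)
    exp_len = 0.0
    for posteriors in segment_posteriors:
        for transcript, prob in posteriors.items():
            exp_len += prob * len(transcript)
            for w, c in Counter(transcript).items():
                exp_count[w] += prob * c
    return dict(exp_count), exp_len

def lattice_lm_prob(word, exp_count, exp_len, background, mu, lam):
    """Two-stage smoothed Pr(w | d) using expected counts in place of counts."""
    p_bg = background.get(word, 0.0)
    dirichlet = (exp_count.get(word, 0.0) + mu * p_bg) / (exp_len + mu)
    return (1.0 - lam) * dirichlet + lam * p_bg

# Hypothetical document made of two speech segments.
segments = [
    {("I", "see"): 0.75, ("I", "scream"): 0.25},
    {("the", "sea"): 0.5, ("the", "see"): 0.5},
]
counts, length = expected_counts(segments)

# Hypothetical uniform background model over the five words involved.
background = {w: 0.2 for w in ("I", "see", "scream", "the", "sea")}
p_see = lattice_lm_prob("see", counts, length, background, mu=4.0, lam=0.1)
```

The word "see" receives an expected count of 1.25 because it is supported by hypotheses in both segments, even though it would appear only once in the 1-best transcripts if ("the", "sea") were preferred in the second segment.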
3.3 Baseline tf • idf Lattice-Based Retrieval Method
As a further comparison, we also implemented Mamou et al. (2006)'s vector space retrieval method (without query refinement via lexical affinities).
In this method, each document d is represented as a word confusion network (WCN) (Mangu et al., 2000) - a simplified lattice which can be viewed as a sequence of confusion sets c_1, c_2, c_3, ⋯.
Each c_i corresponds approximately to a time interval in the spoken document and contains a group of word hypotheses, and each word w in this group of hypotheses is labeled with the probability Pr(w | c_i, d) - the probability that w was spoken in the time interval of c_i.
A confusion set may also give a probability for Pr(ε | c_i, d), the probability that no word was spoken in the time of c_i.
Figure 1 gives an example of a WCN.
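As a rough illustration of the data structure, a WCN can be held as a list of confusion sets, each mapping a word hypothesis to its probability. The sketch below assumes that the probabilities within each set sum to one and uses the empty string for the "no word" hypothesis; the example network and helper names are hypothetical.

```python
EPS = ""  # stands for the "no word was spoken" hypothesis

# Hypothetical WCN with three confusion sets.
wcn = [
    {"I": 0.9, "eye": 0.1},
    {"see": 0.6, "sea": 0.3, EPS: 0.1},
    {"you": 0.4, EPS: 0.6},
]

def is_normalized(network, tol=1e-9):
    """Check that the probabilities in every confusion set sum to one."""
    return all(abs(sum(cs.values()) - 1.0) <= tol for cs in network)

def best_hypothesis(network):
    """Read off the most probable word in each set, skipping 'no word'."""
    words = []
    for cs in network:
        word = max(cs, key=cs.get)
        if word != EPS:
            words.append(word)
    return words

def total_posterior(network, word):
    """Sum of the probabilities assigned to a word across all confusion sets."""
    return sum(cs.get(word, 0.0) for cs in network)
```

For this toy network, `best_hypothesis(wcn)` drops the third set because "no word" is its most probable entry, while `total_posterior(wcn, "you")` still credits the lower-ranked hypothesis.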
Mamou et al.'s retrieval method proceeds as follows.
First, the documents are divided into speech segments, lattices are generated from the speech segments, and the lattices are pruned according to the path probability threshold θ, as described in Section 3.2.
The lattice for each speech segment is then converted into a WCN according to the algorithm […]. The WCNs for the speech segments in each document are then concatenated to form a single WCN per document.
• the "average document length" avdl, computed […]
• the "inverse document frequency" idf(w), computed as […]
4 Experiments
4.1 Document Collection
To evaluate our proposed retrieval method, we performed experiments using the Hub5 Mandarin training corpus released by the Linguistic Data Consortium (LDC98T26).
This is a conversational telephone speech corpus which is 17 hours long, and contains recordings of 42 telephone calls corresponding to approximately 600Kb of transcribed Mandarin text.
Each conversation has been broken up into speech segments of less than 8 seconds each.
As the telephone calls in LDC98T26 have not been divided neatly into "documents", we had to choose a suitable unit of retrieval which could serve as a "document".
An entire conversation would be too long for such a purpose, while a speech segment or speaker turn would be too short.
We decided to use ½-minute time windows with 50% overlap as retrieval units, following Abberley et al. (1999) and Tuerk et al. (2001).
The 42 telephone conversations were thus divided into 4,312 retrieval units ("documents").
Each document comprises multiple consecutive speech segments.
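The windowing scheme can be sketched as below. The 30-second window length, the treatment of the final shortened window, and the function name are assumptions made for illustration; the source only states that overlapping time windows with 50% overlap were used.

```python
def sliding_windows(total_seconds, window=30.0, overlap=0.5):
    """Return (start, end) times of overlapping windows covering a recording."""
    step = window * (1.0 - overlap)
    windows = []
    start = 0.0
    while start < total_seconds:
        # The last windows are clipped to the end of the recording (assumption).
        windows.append((start, min(start + window, total_seconds)))
        start += step
    return windows

# A hypothetical one-minute conversation yields four retrieval units.
units = sliding_windows(60.0)
```

Each speech segment would then be assigned to every window it falls into, so that a document consists of several consecutive segments.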
4.2 Queries and Ground Truth Relevance Judgements
We then formulated 18 queries (14 test queries, 4 development queries) to issue on the document collection.
Each query consisted of one or more written Chinese keywords.
We then obtained ground truth relevance judgements by manually examining each of the 4,312 documents to see if it was relevant to the topic of each query.
The number of retrieval units relevant to each query was found to range from 4 to 990.
The complete list of queries and the number of documents relevant to each query are given in Table 1.
4.3 Preprocessing of Documents and Queries
Next, we processed the document collection with a speech recognizer.
For this task we used the Abacus system (Hon et al., 1994), a large vocabulary continuous speech recognizer which contains a triphone-based acoustic system and a frame-synchronized search algorithm for effective word decoding.
Each Mandarin syllable was modeled by one to four triphone models.
Acoustic models were trained from a corpus of 200 hours of telephony speech from 500 speakers sampled at 8 kHz.
For each speech frame, we extracted a 39-dimensional feature vector consisting of 12 MFCCs and normalized energy, and their first and second order derivatives.
Sentence-based cepstral mean subtraction was applied for acoustic normalization in both training and testing.
Table 1: List of test and development queries (columns: keywords, # relevant documents)
Test queries: Contact information; The weather; Housing matters; Studies, academia; Litigation; Raising children; Christian churches; Clothing; Eating out; Playing sports; Dealings with banks; Computers and software
Development queries: Passport and visa matters; Washington D. C.; Working life

Each triphone was modeled by a left-to-right 3-state hidden Markov model (HMM), each state having 16 Gaussian mixture components.
In total, we built 1,923 untied within-syllable triphone models for 43 Mandarin phonemes, as well as 3 silence models.
The search algorithm was supported by a loop grammar of over 80,000 words.
We processed the speech segments in our collection corpus, to generate lattices incorporating acoustic likelihoods but not n-gram model probabilities.
We then rescored the lattices using a backoff trigram language model interpolated in equal proportions from two trigram models:
• a model built from corpora of transcripts of conversations, comprised of a 320Kb subset of the Callhome Mandarin corpus (LDC96T16) and the CSTSC-Flight corpus from the Chinese Corpus Consortium (950Kb)
The unigram counts from this model were also used as the background language model U in Equations 2 and 4.
The reference transcripts, queries, and trigram model training data were all segmented into words using Low et al. (2005)'s Chinese word segmenter, trained on the Microsoft Research (MSR) corpus, with the speech recognizer's vocabulary used as an external dictionary.
The 1-best ASR transcripts were decoded from the rescored lattices.
Lattice rescoring, trigram model building, WCN generation, and computation of expected word counts were done using the SRILM toolkit (Stolcke, 2002), while lattice pruning was done with the help of the AT&T FSM Library (Mohri et al., 1998).
We also computed the character error rate (CER) and syllable error rate (SER) of the 1-best transcripts, and the lattice oracle CER, for one of the telephone conversations in the speech corpus (ma_4160).
The CER was found to be 69%, the SER 63%, and the oracle CER 29%.
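For readers unfamiliar with these measures, the sketch below computes a character error rate as the Levenshtein distance between reference and hypothesis divided by the reference length, and an oracle error rate as the best rate over a set of alternative hypotheses. This is the conventional definition and is assumed here, since the source does not spell it out; the example strings are hypothetical, and a true lattice oracle would search all lattice paths rather than a short list.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

def oracle_error_rate(ref, hypotheses):
    """Lowest error rate achievable by any of the given hypotheses."""
    return min(error_rate(ref, h) for h in hypotheses)

# Hypothetical reference and hypotheses, compared character by character.
reference = "我想去银行"
one_best = "我去很行"
alternatives = [one_best, "我想去很行"]

cer = error_rate(reference, one_best)
oracle = oracle_error_rate(reference, alternatives)
```

The gap between `cer` and `oracle` in this toy case mirrors the gap reported above: the alternatives contain a markedly better transcript than the 1-best.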
4.4 Retrieval and Evaluation
We then performed retrieval on the document collection using the algorithms in Section 3, using the reference transcripts, the 1-best ASR transcripts, lattices, and WCNs.
We set λ = 0.1, which was suggested by Zhai and Lafferty (2004) to give good retrieval performance for keyword queries.
The results of retrieval were checked against the ground truth relevance judgements, and evaluated in terms of the non-interpolated mean average precision (MAP):

\[ \mathrm{MAP} = \frac{1}{L} \sum_{i=1}^{L} \frac{1}{R_i} \sum_{j=1}^{R_i} \frac{j}{r_{i,j}} \]

where L denotes the total number of queries, R_i the total number of documents relevant to the ith query, and r_{i,j} the position of the jth relevant document in the ranked list output by the retrieval method for query i.

Table 2: Summary of experimental results (columns: retrieval method, retrieval source, MAP for development queries, MAP for test queries)
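The MAP measure can be computed with a few lines of Python. The sketch below assumes each query comes with a full ranked list of document identifiers and a set of relevant identifiers; the identifiers and rankings shown are hypothetical.

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision for one query.

    The j-th relevant document found at rank r contributes j / r, and the
    sum is divided by the total number of relevant documents.
    """
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked, 1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical rankings and relevance judgements for two queries.
runs = [
    (["d3", "d1", "d2", "d4"], {"d1", "d4"}),  # AP = (1/2 + 2/4) / 2 = 0.5
    (["d2", "d3"], {"d2"}),                    # AP = 1.0
]
map_score = mean_average_precision(runs)
```

Relevant documents that never appear in the ranked list contribute nothing to the sum but still count in the denominator, which penalizes systems that fail to retrieve them.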
For the lattice-based retrieval methods, we performed retrieval with the development queries using several values of θ between 0 and 100,000, and then used the value of θ with the best MAP to do retrieval with the test queries.
5 Experimental Results
The results of our experiments are summarized in Table 2; the MAP of the two lattice-based retrieval methods, Mamou et al. (2006)'s vector space method and our proposed statistical retrieval method, are shown in Figure 2 and Figure 3 respectively.
The results show that, for the vector space retrieval method, the MAP of the development queries is highest at θ = 27,500, at which point the MAP for the test queries is 0.1599; and for our proposed method, the MAP for the development queries is highest at θ = 65,000, and at this point the MAP for the test queries reaches 0.2154.
As can be seen, the performance of our statistical lattice-based method shows a marked improvement over the MAP of 0.1364 achieved using only the 1-best ASR transcripts, and indeed a one-tailed Student's t-test shows that this improvement is statistically significant at the 99.5% confidence level.
The statistical method also yields better performance than Mamou et al.'s vector space method - a t-test shows the performance difference to be statistically significant at the 97.5% confidence level.

[Plot: "Retrieval using word probabilities from word confusion networks"; legend: "For 4 development queries"; x-axis: θ (max. log probability difference of paths)]

Figure 3: MAP of our proposed statistical method for lattice-based retrieval, at various pruning thresholds θ
6 Conclusions and Future Work
We have presented a method for performing spoken document retrieval using lattices which is based on a statistical language modeling retrieval framework.
Results show that our new method can significantly improve the retrieval MAP compared to using only the 1-best ASR transcripts.
Also, our proposed retrieval method has been shown to outperform Mamou et al. (2006)'s vector space lattice-based retrieval method.
Besides the better empirical performance, our method also has other advantages over Mamou et al.'s vector space method.
For one, our method computes expected word counts directly from rescored lattices, and does not require an additional step to convert lattices lossily to WCNs.
Furthermore, our method uses all the hypotheses in each lattice, rather than just the top 10 word hypotheses at each time interval.
Most importantly, our method provides a more natural and more principled approach to lattice-based spoken document retrieval based on a sound statistical foundation, by harnessing the fact that lattices are themselves statistical models; the statistical approach also means that our method can be more easily augmented with additional statistical knowledge sources in a principled way.
For future work, we plan to test our proposed method on English speech corpora, and with larger-scale retrieval tasks involving more queries and more documents.
We would also like to extend our method to other speech processing tasks, such as spoken document classification and example-based spoken document retrieval.
