We present a domain-independent unsupervised topic segmentation approach based on hybrid document indexing.
Lexical chains have been successfully employed to evaluate lexical cohesion of text segments and to predict topic boundaries.
Our approach is based on the notion of semantic cohesion.
It uses spectral embedding to estimate semantic association between content nouns over a span of multiple text segments.
Our method significantly outperforms the baseline on the topic segmentation task and achieves performance comparable to state-of-the-art methods that incorporate domain-specific information.
1 Introduction
The goal of topic segmentation is to discover story boundaries in a stream of text or audio recordings.
A story is broadly defined as a segment of text containing topically related sentences.
In particular, the task may require segmenting a stream of broadcast news, addressed by the Topic Detection and Tracking (TDT) evaluation project (Wayne, 2000; Allan, 2002).
In this case topically related sentences belong to the same news story.
While we are considering TDT data sets in this paper, we would like to pose the problem more broadly and consider a domain-independent approach to topic segmentation.
Previous research on topic segmentation has shown that lexical coherence is a reliable indicator of topical relatedness.
Therefore, many approaches have concentrated on different ways of estimating lexical coherence of text segments, such as semantic similarity between words (Kozima, 1993), similarity between blocks of text (Hearst, 1994), and adaptive language models (Beeferman et al., 1999).
These approaches use word repetitions to evaluate coherence.
Since the sentences covering the same story represent a coherent discourse segment, they typically contain the same or related words.
Repeated words build lexical chains that are consequently used to estimate lexical coherence.
This can be done either by analyzing the number of overlapping lexical chains (Hearst, 1994) or by building a short-range and long-range language model (Beeferman et al., 1999).
More recently, topic segmentation with lexical chains has been successfully applied to segmentation of news stories, multi-party conversation and audio recordings (Galley et al., 2003).
When the task is to segment long streams of text containing stories which may continue at a later point in time, for example developing news stories, building of lexical chains becomes intricate.
In addition, the word repetitions do not account for synonymy and semantic relatedness between words and therefore may not be able to discover coherence of segments with little word overlap.
Our approach aims at discovering semantic relatedness beyond word repetition.
It is based on the notion of semantic cohesion rather than lexical cohesion.
We propose to use a similarity metric between segments of text that takes into account semantic associations between words spanning a number of segments.
This method approximates lexical chains by averaging the similarity to a number of previous text segments and accounts for synonymy by using a hybrid document indexing scheme.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 351-359, Prague, June 2007.
©2007 Association for Computational Linguistics
Our text segmentation experiments show a significant performance improvement over the baseline.
The rest of the paper is organized as follows.
Section 2 discusses hybrid indexing.
Section 3 describes our segmentation algorithm.
Section 4 discusses related approaches.
Section 5 reports the experimental results.
We conclude in section 6.
2 Hybrid Document Indexing
For the topic segmentation task we would like to define a similarity measure that accounts for synonymy and semantic association between words.
This similarity measure will be used to evaluate semantic cohesion between text units and the decrease in semantic cohesion will be used as an indicator of a story boundary.
First, we develop a document representation which supports this similarity measure.
Capturing semantic relations between words in a document representation is difficult.
Different approaches tried to overcome the term independence assumption of the bag-of-words representation (Salton and McGill, 1983) by using distributional term clusters (Slonim and Tishby, 2000) and expanding the document vectors with synonyms, see (Levow et al., 2005).
Since content words can be combined into semantic classes there has been a considerable interest in low-dimensional representations.
Latent Semantic Analysis (LSA) (Deerwester et al., 1990) is one of the best known dimensionality reduction algorithms.
In the LSA space documents are indexed with latent semantic concepts.
LSA maps all words to low dimensional vectors.
However, the notion of semantic relatedness is defined differently for subsets of the vocabulary.
In addition, the numerical information, abbreviations and the documents' style may be very good indicators of their topic.
However, this information is no longer available after the dimensionality reduction.
We use a hybrid approach to document indexing to address these issues.
We keep the notion of latent semantic concepts and also try to preserve the specifics of the document collection.
Therefore, we divide the vocabulary into two sets: nouns and the rest of the vocabulary.
The set of nouns does not include proper nouns.
We use a method of spectral embedding, as described below, and compute a low-dimensional representation for documents using only the nouns.
We also compute a tf-idf representation for documents using the other set of words.
Since we can treat each latent semantic concept in the low-dimensional representation as part of the vocabulary, we combine the two vector representations for each document by concatenating them.
2.1 Spectral Embedding
A vector space representation for documents and sentences is convenient and makes the similarity metrics such as cosine and distance readily available.
However, those metrics will not work if they don't have a meaningful linguistic interpretation.
Spectral methods comprise a family of algorithms that embed terms and documents in a low-dimensional vector space.
These methods use pair-wise relations between the data points encoded in a similarity matrix.
The main step is to find an embedding for the data that preserves the original similarities.
GLSA We use Generalized Latent Semantic Analysis (GLSA) (Matveeva et al., 2005) to compute spectral embedding for nouns.
Because GLSA computes term vectors and we would like to use spectral embedding for nouns, it is well suited to our approach.
GLSA extends the ideas of LSA by defining different ways to obtain the similarity matrix and has been shown to outperform LSA on a number of applications (Matveeva and Levow, 2006).
GLSA begins with a matrix of pair-wise term similarities $S$, computes its eigenvectors $U$ and uses the first $k$ of them to represent terms and documents; for details see (Matveeva et al., 2005).
The justification for this approach is the theorem by Eckart and Young (Golub and Reinsch, 1971) stating that inner product similarities between the term vectors based on the eigenvectors of S represent the best element-wise approximation to the entries in S. In other words, the inner product similarity in the GLSA space preserves the semantic similarities in S.
Since our representation will try to preserve semantic similarities in S it is important to have a matrix of similarities which is linguistically motivated.
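The embedding step can be illustrated with a small numerical sketch. It assumes a symmetric similarity matrix and scales each eigenvector by the square root of its eigenvalue so that inner products of term vectors approximate the entries of $S$; the function name `glsa_term_vectors`, the scaling and the clipping of negative eigenvalues are illustrative assumptions, not details confirmed by the text.

```python
import numpy as np

def glsa_term_vectors(S, k):
    """Embed terms with the top-k eigenvectors of a symmetric similarity matrix S."""
    S = np.asarray(S, dtype=float)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(S)
    top = np.argsort(eigvals)[::-1][:k]
    # Assumption: negative eigenvalues are clipped to zero before taking the root.
    lam = np.clip(eigvals[top], 0.0, None)
    return eigvecs[:, top] * np.sqrt(lam)  # one row per term

# Toy similarity matrix with distinct eigenvalues 3, 1 and 0.5.
S = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])
W = glsa_term_vectors(S, k=2)
approx = W @ W.T  # inner products between the 2-dimensional term vectors
```

With $k=2$ the inner products reproduce the similarities among the first two terms and drop the weakest direction, which is the behaviour the Eckart and Young argument describes.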
Nearest Neighbors in GLSA Space
prosecutor, testimony, eyewitness, investment, category, broadcast, television, satellite, surprise, announcement, disappointment, stunning, reaction, astonishment
Table 1: Words' nearest neighbors in the GLSA semantic space.
2.2 Distributional Term Similarity
$$\mathrm{PMI}(w_i, w_j) = \log \frac{P(W_i = 1, W_j = 1)}{P(W_i = 1)\, P(W_j = 1)}$$
Thus, for GLSA, $S(w_i, w_j) = \mathrm{PMI}(w_i, w_j)$.
Co-occurrence Proximity An advantage of PMI is the notion of proximity.
The co-occurrence statistics for PMI are typically computed using a sliding window.
Thus, PMI will be large only for words that co-occur within a small context of fixed size.
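A minimal sketch of window-based PMI estimation follows. It assumes that $P(W_i = 1)$ is estimated as the fraction of sliding windows containing $w_i$ and that the joint probability is the fraction of windows containing both words; the function name `pmi_from_windows` and this particular estimator are illustrative assumptions. The default window of 8 matches the size reported in the experiments.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_from_windows(tokens, window=8):
    """Estimate PMI for word pairs that co-occur inside a sliding window."""
    n_windows = max(len(tokens) - window + 1, 1)
    single, pair = Counter(), Counter()
    for start in range(n_windows):
        # Each word counts once per window, so the counts act as indicator variables.
        seen = sorted(set(tokens[start:start + window]))
        single.update(seen)
        pair.update(combinations(seen, 2))
    pmi = {}
    for (a, b), n_ab in pair.items():
        p_ab = n_ab / n_windows
        p_a, p_b = single[a] / n_windows, single[b] / n_windows
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi

tokens = "a b c d a b e f".split()
pmi = pmi_from_windows(tokens, window=2)
```

Pairs that never share a window, such as ("a", "c") in the toy input, receive no entry at all, which reflects the proximity property described above.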
Semantic Association vs. Synonymy Although GLSA was successfully applied to synonymy induction (Matveeva et al., 2005), we would like to point out that GLSA discovers semantic association in a broad sense.
Table 1 shows a few words from the TDT2 corpus and their nearest neighbors in the GLSA space.
We can see that for "witness", "finance" and "broadcast" words are grouped into corresponding semantic classes.
The nearest neighbors for "hearing" and "stay" represent their different senses.
Interestingly, even for the abstract noun "surprise" the nearest neighbors are meaningful.
2.3 Document Indexing
We have two sets of vocabulary terms: a set of nouns, $N$, and the other words, $T$. We compute tf-idf document vectors indexed with the words in $T$:
where $a_j(w_t) = \mathrm{tf}(w_t, d_j) \cdot \mathrm{idf}(w_t)$.
We also compute a $k$-dimensional representation with latent concepts $c_j$ as a weighted linear combination of GLSA term vectors $\vec{w}_t$:
We concatenate these two representations to generate a hybrid indexing of documents:
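One way to write out the three representations described in this subsection is sketched here. The superscript labels, the indexing of the latent concepts as $c_1, \dots, c_k$, and in particular the use of the tf-idf weights $a_j(w_t)$ as the coefficients of the linear combination are assumptions made for illustration.

```latex
\begin{align}
\vec{d}_j^{\,\text{tfidf}} &= \bigl(a_j(w_1), \dots, a_j(w_{|T|})\bigr), \quad w_t \in T \\
\vec{d}_j^{\,\text{glsa}} &= (c_1, \dots, c_k) = \sum_{w_t \in N} a_j(w_t)\, \vec{w}_t \\
\vec{d}_j^{\,\text{hybrid}} &= \bigl(a_j(w_1), \dots, a_j(w_{|T|}),\; c_1, \dots, c_k\bigr)
\end{align}
```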
In our experiments, we compute document and sentence representations using three indexing schemes: the tf-idf baseline, the GLSA representation and the hybrid indexing.
The GLSA indexing computes term vectors for all vocabulary words; document and sentence vectors are generated as linear combinations of term vectors, as shown above.
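The hybrid scheme can be sketched in a few lines. The sketch assumes that the GLSA noun vectors and idf values are already available as dictionaries, and that the latent part weights each noun vector by tf times idf; `hybrid_vector` and all data in the example are hypothetical.

```python
import numpy as np

def hybrid_vector(tokens, idf, noun_vectors, other_vocab):
    """Concatenate tf-idf weights over non-noun terms with a GLSA noun embedding."""
    k = len(next(iter(noun_vectors.values())))
    tfidf = np.zeros(len(other_vocab))
    latent = np.zeros(k)
    for tok in tokens:
        # Adding idf once per occurrence yields tf * idf in total.
        weight = idf.get(tok, 0.0)
        if tok in noun_vectors:
            latent += weight * np.asarray(noun_vectors[tok], dtype=float)
        elif tok in other_vocab:
            tfidf[other_vocab[tok]] += weight
    return np.concatenate([tfidf, latent])

# Hypothetical toy data: two nouns with 2-dimensional GLSA vectors, two other terms.
noun_vectors = {"company": [1.0, 0.0], "carmaker": [0.8, 0.6]}
other_vocab = {"gm": 0, "said": 1}
idf = {"gm": 2.0, "said": 0.5, "company": 1.0, "carmaker": 1.5}

d1 = hybrid_vector(["gm", "said", "carmaker"], idf, noun_vectors, other_vocab)
d2 = hybrid_vector(["company", "said", "company"], idf, noun_vectors, other_vocab)
similarity = float(d1 @ d2)
```

In this toy case the two token lists share only "said", yet most of the inner product comes from the latent part, because the vectors for "carmaker" and "company" point in similar directions.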
2.4 Document similarity
One can define document similarity at different levels of semantic content.
Documents can be similar because they discuss the same people or events and because they discuss related subjects and contain semantically related words.
Hybrid Indexing allows us to combine both definitions of similarity.
Each representation supports a different similarity measure. tf-idf uses term-matching, the GLSA representation uses semantic association in the latent semantic space computed for all words, and hybrid indexing uses a combination of both: term-matching for named entities and content words other than nouns combined with semantic association for nouns.
In the GLSA space, the inner product between document vectors contains all pair-wise inner products between their words, which allows one to detect semantic similarity beyond term matching:
If documents contain words which are different but semantically related, the inner product between the term vectors will contribute to the document similarity, as illustrated with an example in section 5.
When we compare two documents indexed with the hybrid indexing scheme, we compute a combination of similarity measures:
Document similarity contains semantic association between all pairs of nouns and uses term-matching for the rest of the vocabulary.
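The two similarity measures of this subsection can be written out as follows, under the assumption that document vectors are tf-idf weighted linear combinations of term vectors, with $a_i(w) = 0$ for words absent from $d_i$, and that no length normalization is applied. The subscript labels are introduced here for illustration only.

```latex
\begin{align}
\langle \vec{d}_i, \vec{d}_j \rangle_{\text{GLSA}} &= \sum_{w \in d_i} \sum_{v \in d_j} a_i(w)\, a_j(v)\, \langle \vec{w}, \vec{v} \rangle \\
\langle \vec{d}_i, \vec{d}_j \rangle_{\text{hybrid}} &= \sum_{w_t \in T} a_i(w_t)\, a_j(w_t) + \sum_{w \in N} \sum_{v \in N} a_i(w)\, a_j(v)\, \langle \vec{w}, \vec{v} \rangle
\end{align}
```

The first term of the hybrid measure is plain term matching over $T$; the second is the semantic association between all pairs of nouns.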
3 Topic Segmentation with Semantic Cohesion
Our approach to topic segmentation is based on semantic cohesion supported by the hybrid indexing.
Topic segmentation approaches use either sentences (Galley et al., 2003) or blocks of words as text units (Hearst, 1994).
We used both variants in our experiments.
When using blocks, we computed blocks of a fixed size (typically 20 words) sliding over the documents in a fixed step size (10 or 5 words).
The algorithm predicts a story boundary when the semantic cohesion between two consecutive units drops.
Blocks can cross story boundaries, thus many predicted boundaries will be displaced with respect to the actual boundary.
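The block construction can be sketched as follows, using the block size of 20 and step size of 10 mentioned above as defaults. Dropping trailing tokens that do not fill a whole block is an assumption of this sketch, and `sliding_blocks` is a hypothetical helper name.

```python
def sliding_blocks(tokens, size=20, step=10):
    """Return (start_offset, block) pairs for fixed-size blocks sliding over tokens."""
    # Assumption: a final partial block is not emitted.
    last_start = max(len(tokens) - size, 0)
    return [(s, tokens[s:s + size]) for s in range(0, last_start + 1, step)]

blocks = sliding_blocks(list(range(50)), size=20, step=10)
starts = [start for start, _ in blocks]
```

Consecutive blocks overlap by half their length here, so a true story boundary usually falls inside a block rather than between two blocks, which is why predicted boundaries can be displaced.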
Averaged similarity In our preliminary experiments we used the largest difference in score to predict a story boundary, following the TextTiling approach (Hearst, 1994).
We found, however, that in our document collection the word overlap between sentences was often not large and pair-wise similarity could drop to zero even for sentences within the same story, as will be illustrated below.
We could not obtain satisfactory results with this approach.
Therefore, we averaged the similarity over a history of fixed size $n$. The semantic cohesion score was computed for the position between two text units, $t_i$ and $t_j$, as follows:
Our approach predicts story boundaries at the minima of the semantic cohesion score.
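A minimal sketch of the averaged score and the minima-based prediction is given here. It assumes the score at a gap is the mean inner product between the unit after the gap and up to $n$ units before it; the exact form of the score and the helper names `cohesion_scores` and `local_minima` are assumptions.

```python
import numpy as np

def cohesion_scores(units, n=10):
    """Score each gap between consecutive units by averaging similarity to the history."""
    X = np.asarray(units, dtype=float)
    scores = []
    for j in range(1, len(X)):
        history = X[max(0, j - n):j]  # up to n units preceding unit j
        scores.append(float(np.mean(history @ X[j])))
    return scores

def local_minima(scores):
    """Gaps scoring strictly below the previous gap and no higher than the next one."""
    return [g for g in range(1, len(scores) - 1)
            if scores[g] < scores[g - 1] and scores[g] <= scores[g + 1]]

# Two toy stories whose unit vectors are orthogonal to each other.
units = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
scores = cohesion_scores(units, n=2)
minima = local_minima(scores)
```

Entry `g` of `scores` refers to the gap between unit `g` and unit `g + 1`, so the single minimum at gap 2 coincides with the change of story in the toy input.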
Approximating Lexical Chains One of the motivations for our cohesion score is that it approximates lexical chains, as for example in (Galley et al., 2003).
Galley et al. (2003) define lexical chains $R_1, \dots, R_N$ by considering repetitions of terms $t_1, \dots, t_N$ and assigning larger weights to short and compact chains.
Then the lexical cohesion score between two text units $t_i$ and $t_j$ is based on the number of chains that overlap both of them:
where $w_k(t_i) = \mathrm{score}(R_k)$ if the chain $R_k$ overlaps $t_i$ and zero otherwise.
Our cohesion score takes into account only the chains for words that occur in $t_j$ and have another occurrence within $n$ previous sentences.
Due to this simplification, we compute the score based on inner products.
Once we make the transition to inner products, we can use hybrid indexing and compute semantic cohesion score beyond term repetition.
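The relation between the two scores can be sketched side by side. The cosine normalization in the chain-based score and the exact form of the averaged inner product are assumptions made for illustration; the labels $c_{\text{lex}}$ and $c_{\text{sem}}$ are introduced here and do not come from the text.

```latex
\begin{align}
c_{\text{lex}}(t_i, t_j) &= \frac{\sum_{k} w_k(t_i)\, w_k(t_j)}{\sqrt{\sum_{k} w_k(t_i)^2 \, \sum_{k} w_k(t_j)^2}} \\
c_{\text{sem}}(t_i, t_j) &= \frac{1}{n} \sum_{m=0}^{n-1} \langle \vec{t}_{i-m}, \vec{t}_j \rangle
\end{align}
```

With a tf-idf representation, a word contributes to $c_{\text{sem}}$ only if it occurs in $t_j$ and in at least one of the $n$ preceding units, which mirrors a chain overlapping both sides of the gap. With hybrid indexing, semantically related nouns contribute as well.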
4 Related Approaches
We compare our approach to the LCseg algorithm which uses lexical chains to estimate topic boundaries (Galley et al., 2003).
Hybrid indexing allows us to compute semantic cohesion score rather than the lexical cohesion score based on word repetitions.
Choi et al. (2001) used LSA for segmentation.
LSA (Deerwester et al., 1990) is a special case of spectral embedding, and Choi et al. (2001) used all vocabulary words to compute low-dimensional document vectors.
We use GLSA (Matveeva et al., 2005) because it computes term vectors as opposed to the dual document-term representation with LSA and uses a different matrix of pair-wise similarities.
Furthermore, Choi et al. (2001) used clustering to predict boundaries whereas we used the average similarity scores.
s1: The Cuban news agency Prensa Latina called Clinton's announcement Friday that Cubans picked up at sea will be taken to Guantanamo Bay naval base a "new and dangerous element" in U S immigration policy.
s2: The Cuban government has not yet publicly reacted to Clinton's announcement that Cuban rafters will be turned away from the United States and taken to the U S base on the southeast tip of Cuba.
s5: The arrival of Cuban emigrants could be an "extraordinary aggravation" to the situation, Prensa Latina said.
s6: It noted that Cuba had already denounced the use of the base as a camp for Haitian refugees.
whom it had for many years encouraged to come to the United States.
s8: Cuba considers the land at the naval base, leased to the United States at the turn of the century, to be illegally occupied.
s10: General Motors Corp said Friday it was recalling 5,600 1993-94 model Chevrolet Lumina, Pontiac Trans Sport and Oldsmobile Silhouette minivans equipped with a power sliding door and built-in child seats.
s14: If this occurs, the shoulder belt may not properly retract, the carmaker said.
s15: GM is the only company to offer the power-sliding door.
s16: The Company said it was not aware of any accidents or injuries related to the defect.
s17: To correct the problem, GM said dealers will install a modified interior trim piece that will reroute the seat belt.
Table 2: TDT. The first 17 sentences in the first file.
Existing approaches to hybrid indexing used different weights for proper nouns and noun phrase heads and used WordNet synonyms to expand the documents, for example (Hatzivassiloglou et al., 2000; Hatzivassiloglou et al., 2001).
Our approach requires neither linguistic resources nor learning the weights.
The semantic associations between nouns are estimated using spectral embedding.
The first TDT collection is part of the LCseg toolkit (Galley et al., 2003) and we used it to compare our approach to LCseg.
We used the part of this collection with 50 files with 22 documents each.
We also used the TDT2 collection of news articles from six news agencies in 1998.
We used only the 9,738 documents that are assigned to one topic and are longer than 50 words.
We used the Lemur toolkit (http://www.lemurproject.org/) with stemming and a stop word list for the tf-idf indexing; we used Bikel's parser (http://www.cis.upenn.edu/~dbikel/software.html) to obtain the POS tags and select nouns; we used the PLAPACK package (Bientinesi et al., 2003) to compute the eigenvalue decomposition.
Evaluation For the TDT data we use the error metric $P_k$ (Beeferman et al., 1999) and WindowDiff (Pevzner and Hearst, 2002), which are implemented in the LCseg toolkit.
We also used the TDT cost metric $C_{seg}$, with the default parameters $P(seg)=0.3$, $C_{miss}=1$, $C_{fa}=0.3$ and a distance of 50 words.
All these measures look at two units (words or sentences) N units apart and evaluate how well the algorithm can predict whether there is a boundary between them or not.
Lower values mean better performance for all measures.
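The window-based metrics can be illustrated with a short sketch. It is a simplified reading of $P_k$ and WindowDiff, not the LCseg toolkit implementation: a segmentation is given as a list of gap indicators (1 means a boundary after that unit), and the window size `k` is passed in explicitly. The function names are hypothetical.

```python
def segment_ids(boundaries):
    """Convert gap indicators (1 = boundary after unit i) into a segment id per unit."""
    ids, current = [0], 0
    for b in boundaries:
        current += b
        ids.append(current)
    return ids

def pk(ref, hyp, k):
    """Fraction of unit pairs k apart on whose same-segment status ref and hyp disagree."""
    r, h = segment_ids(ref), segment_ids(hyp)
    n = len(r) - k
    errors = sum((r[i] == r[i + k]) != (h[i] == h[i + k]) for i in range(n))
    return errors / n

def window_diff(ref, hyp, k):
    """Fraction of windows in which ref and hyp contain a different number of boundaries."""
    r, h = segment_ids(ref), segment_ids(hyp)
    n = len(r) - k
    errors = sum((r[i + k] - r[i]) != (h[i + k] - h[i]) for i in range(n))
    return errors / n

ref = [0, 0, 1, 0, 0]  # six units, true boundary after unit 2
hyp = [0, 0, 0, 1, 0]  # predicted boundary one unit late
pk_score = pk(ref, hyp, k=2)
wd_score = window_diff(ref, hyp, k=2)
```

A boundary that is one unit off is penalized in only some of the windows, so a near miss costs less than a missing or spurious boundary far from the truth.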
Global vs. Local GLSA Similarity To obtain the PMI values we used the TDT2 collection, denoted as $\mathrm{GLSA}_{local}$.
Since co-occurrence statistics based on larger collections give a better approximation to linguistic similarities, we also used 700,000 documents from the English GigaWord collection, denoted as GLSA.
We used a window of size 8.
5.2 Topic Segmentation
The first set of experiments was designed to evaluate the advantage of the GLSA representation over the baseline.
We compare our approach to the LCseg algorithm (Galley et al., 2003) and use sentences as segmentation unit.
To avoid the issue of parameter setting when the number of boundaries is not known, we provide each algorithm with the actual number of boundaries.
Figure 1: TDT. Pair-wise sentence similarities for tf-idf (left) and GLSA (middle); the x-axis shows story boundaries. Details for the first 20 sentences, Table 2 (right).
Figure 2: TDT. Pair-wise sentence similarities for tf-idf (left) and GLSA (middle) averaged over 10 preceding sentences; LCseg lexical cohesion scores (right). The x-axis shows story boundaries.
TDT We use the LCseg approach and our approach with the baseline tf-idf representation and the GLSA representation to segment this corpus.
Table 2 shows a few sentences.
Many content words are repeated, so using lexical chains is definitely a sound approach.
As shown in Table 2, in the first story the word "Cuba" or "Cuban" is repeated in every sentence thus generating a lexical chain.
On the topic boundary, the word overlap between sentences is very small.
At the same time, the repetition of words may also be interrupted within a story: sentences 5 and 6 and sentences 14, 15 and 16 have little word overlap.
LCseg deals with this by defining several parameters to control chain length and gaps.
This simple example illustrates the potential benefit of semantic cohesion.
Table 2 shows that "General Motors" or "GM" are not repeated in every sentence of the second story.
However, "GM", "carmaker" and "company" are semantically related.
Making this information available to the segmentation algorithm allows it to establish a connection between each sentence of the second story.
We computed pair-wise sentence similarities between pairs of consecutive sentences in the tf-idf and GLSA representations.
Figure 1 shows the similarity values plotted for each sentence break.
The pair-wise similarities based on term-matching are very spiky and there are many zeros within the story.
The GLSA-based similarity makes the dips in the similarities at the boundaries more prominent.
The last plot gives the details for the sentences in Table 2.
In the tf-idf representation sentences without word overlap receive zero similarity, but the GLSA representation is able to use the semantic association between "emigrants" and "refugees" for sentences 5 and 6, and also the semantic association between "carmaker" and "company" for sentences 14 and 15.
Table 3: TDT segmentation results.
This effect increases as we use the semantic cohesion score as in equation 7.
Figure 2 shows the similarity values for tf-idf and GLSA and also the lexical cohesion scores computed by LCseg.
The GLSA-based similarities are not quite as smooth as the LCseg scores, but they correctly discover the boundaries.
LCseg parameters are fine-tuned for this document collection.
We used a general TDT2 GLSA representation for this collection, and the only segmentation parameter we used prevents placing the next boundary within $n=3$ sentences of the previous one.
For this reason the predicted boundary may be one sentence off the actual boundary.
These results are summarized in Table 3.
The GLSA representation performs significantly better than the tf-idf baseline.
Its $P_k$ and WindowDiff scores are worse than those of LCseg run with its default parameters.
We attribute this to the fact that we did not fine-tune our method to this collection and that boundaries are often placed one position off the actual boundary.
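The selection of a known number of boundaries under the minimum-distance constraint can be sketched as a greedy procedure. Both the greedy strategy and the reading of "within $n=3$ sentences" as a required separation of at least three gaps are assumptions of this sketch; `pick_boundaries` is a hypothetical helper.

```python
def pick_boundaries(scores, n_boundaries, min_gap=3):
    """Greedily pick the lowest-scoring gaps, keeping picks at least min_gap apart."""
    chosen = []
    for g in sorted(range(len(scores)), key=lambda g: scores[g]):
        if len(chosen) == n_boundaries:
            break
        # Skip candidates that fall too close to an already chosen boundary.
        if all(abs(g - c) >= min_gap for c in chosen):
            chosen.append(g)
    return sorted(chosen)

scores = [0.9, 0.1, 0.15, 0.8, 0.7, 0.2, 0.9]
boundaries = pick_boundaries(scores, n_boundaries=2, min_gap=3)
```

In the toy input the second-lowest score sits next to the lowest one and is skipped, so the second boundary goes to the next dip further along.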
TDT2 For this collection we used three different indexing schemes: the tf-idf baseline, the GLSA representation and the hybrid indexing.
Each representation supports a different similarity measure.
Our TDT experiments showed that the semantic cohesion score based on the GLSA representation improves the segmentation results.
The variant of the TDT corpus we used is rather small and well-balanced, see (Galley et al., 2003) for details.
In the second phase of experiments we evaluate our approach on the larger TDT2 corpus.
The experiments were designed to address the following issues:
• performance comparison between GLSA and Hybrid indexing representations.
As mentioned before, GLSA embeds all words in a low-dimensional space.
Whereas semantic classes for nouns have theoretical linguistic justification, it is harder to motivate a latent space representation for other words, for example proper nouns.
Therefore, we want to evaluate the advantage of using spectral embedding only for nouns.
• collection dependence of similarities.
The similarity matrix $S$ is computed using the TDT2 corpus ($\mathrm{GLSA}_{local}$) and using the larger GigaWord corpus.
The larger corpus provides more reliable co-occurrence statistics.
On the other hand, its word distribution is different from that in the TDT2 corpus.
We wanted to evaluate whether semantic similarities are collection independent.
#b unknown
$\mathrm{GLSA}_{local}$
$\mathrm{Hybrid}_{local}$
Table 4: TDT2 segmentation results.
Sliding blocks with size 20 and step size 10; similarity averaged over 10 preceding blocks.
Table 4 shows the performance evaluation.
We show the results computed using blocks containing 20 words (after preprocessing) with step size 10.
We tried other parameter values but did not achieve better performance, which is consistent with other research (Hearst, 1994; Galley et al., 2003).
We show the results for two settings: predicting a known number of boundaries, and predicting boundaries using a threshold.
In our experiments we used the average of the smallest N scores as threshold, N = 4000 showing best results.
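The threshold setting can be sketched as follows. The threshold is the average of the N smallest cohesion scores, as stated above; applying it only to local minima of the score is an assumption of this sketch, since the text does not spell out the decision rule. Both helper names are hypothetical.

```python
def threshold_from_smallest(scores, n_smallest):
    """Average of the n_smallest lowest cohesion scores."""
    lowest = sorted(scores)[:n_smallest]
    return sum(lowest) / len(lowest)

def boundaries_below(scores, threshold):
    """Interior gaps that are strict local minima and do not exceed the threshold."""
    return [g for g in range(1, len(scores) - 1)
            if scores[g] <= threshold
            and scores[g] < scores[g - 1] and scores[g] < scores[g + 1]]

scores = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6, 0.5, 0.4, 0.9]
threshold = threshold_from_smallest(scores, n_smallest=3)
predicted = boundaries_below(scores, threshold)
```

Here the dip at gap 7 is a local minimum but lies above the threshold, so only gaps 1 and 4 are reported as boundaries.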
The spectral embedding based representations (GLSA, Hybrid) significantly outperform the baseline.
This confirms the advantage of the semantic cohesion score vs. term-matching.
Hybrid indexing outperforms the GLSA representation, supporting our intuition that semantic association is best defined for nouns.
We used the GigaWord corpus to obtain the pair-wise word associations for the GLSA and Hybrid representations.
We also computed $\mathrm{GLSA}_{local}$ and $\mathrm{Hybrid}_{local}$ using the TDT2 corpus to obtain the pair-wise word associations.
The co-occurrence statistics based on the GigaWord corpus provide more reliable estimations of semantic association despite the difference in term distribution.
The difference is larger in the GLSA case, where we compute the embedding for all words: GLSA performs better than $\mathrm{GLSA}_{local}$.
$\mathrm{Hybrid}_{local}$ performs only slightly worse than Hybrid.
This seems to support the claim that semantic associations between nouns are largely collection independent.
On the other hand, semantic associations for proper names are collection dependent at least because the collections are static but the semantic relations of proper names may change over time.
The semantic space for a name of a president, for example, is different for the period of time of his presidency and for the time before and after that.
Disappointingly, we could not achieve good results with LCseg.
It tends to split stories into short paragraphs.
Hybrid indexing could achieve results comparable to state-of-the-art approaches, see (Fiscus et al., 1998) for an overview.
6 Conclusion and Future Work
We presented a topic segmentation approach based on semantic cohesion scores.
Our approach is domain independent and requires neither training nor lexical resources.
The scores are computed based on the hybrid document indexing, which uses spectral embedding in the space of latent concepts for nouns and keeps proper nouns and other specifics of the document collection unchanged.
We approximate the lexical chains approach by simplifying the definition of a chain, which allows us to use inner products as the basis for the similarity score.
The similarity score takes into account semantic relations between nouns beyond term matching.
This semantic cohesion approach showed good results on the topic segmentation task.
We intend to extend the hybrid indexing approach by considering more vocabulary subsets.
Syntactic similarity is more appropriate for verbs, for example, than co-occurrence.
As a next step, we intend to embed verbs using syntactic similarity.
It would also be interesting to use lexical chains for proper names and learn the weights for different similarity scores.
