In this paper, we study the problem of automatically segmenting written text into paragraphs.
This is inherently a sequence labeling problem; however, previous approaches ignore this dependency.
We propose a novel approach for automatic paragraph segmentation, namely training Semi-Markov models discriminatively using a Max-Margin method.
This method allows us to model the sequential nature of the problem and to incorporate features of a whole paragraph, such as paragraph coherence, which cannot be used in previous models.
Experimental evaluation on four text corpora shows improvement over the previous state-of-the-art method on this task.
1 Introduction
In this paper, we study automatic paragraph segmentation (APS).
This task is closely related to some well-known problems such as text segmentation, discourse parsing and topic shift detection, and is relevant for various important applications in speech-to-text and text-to-text tasks.
In speech-to-text applications, the output of a speech recognition system, such as the output of systems creating memos and documents for the Parliament House, is usually raw text without any punctuation or paragraph breaks.
Clearly, such text requires paragraph segmentations.
In text-to-text processing, such as summarization, the output text does not necessarily retain the correct paragraph structure and may require post-processing.
There is psycholinguistic evidence, as cited by Sporleder & Lapata (2004), showing that insertion of paragraph breaks could improve readability.
Moreover, it has been shown that different languages may have cross-linguistic variations in paragraph boundary placement (Zhu, 1999), which indicates that machine translation can also benefit from APS.
APS can also recover the paragraph breaks that are often lost in OCR applications.
There has been growing interest within the NLP community in APS in recent years.
Previous methods such as Sporleder & Lapata (2004); Genzel (2005); Filippova & Strube (2006) treat the problem as a binary classification task, where each sentence is labeled as the beginning of a paragraph or not.
They focus on the use of features, such as surface features, language modeling features and syntactic features.
The effectiveness of features is investigated across languages and/or domains.
However, these approaches ignore the inherent sequential nature of APS.
Clearly, consecutive sentences within the same paragraph depend on each other.
Moreover, paragraphs should exhibit certain properties such as coherence, which should be explored within an APS system.
One cannot incorporate such properties/features when APS is treated as a binary classification problem.
To overcome this limitation, we cast APS as a sequence prediction problem, where the performance can be significantly improved by optimizing the choice of labeling over whole sequences of sentences, rather than individual sentences.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 640-648, Prague, June 2007.
©2007 Association for Computational Linguistics
Figure 1: Top: sequence (horizontal line) with segment boundaries (vertical lines).
This corresponds to a model where we estimate each segment boundary independently of all other boundaries.
Middle: simple semi-Markov structure.
The position of each segment boundary depends only on the positions of its neighbors, as denoted by the (red) dashed arcs.
Bottom: a more sophisticated semi-Markov structure, where each boundary depends on the positions of two of its neighbors.
This may occur, e.g., when the decision of where to place a boundary depends on the content of two adjacent segments.
The longer-range interaction is represented by the additional (blue) arcs.
Sequence prediction is one of the most prominent examples of structured prediction.
This problem is generally formalized such that there exists one variable for each observation in the sequence and the variables form a Markov chain (HMM).
Segmentation of a sequence has been studied as a class of sequence prediction problems with common applications such as protein secondary structure prediction, Named Entity Recognition and segmentation of FAQs.
The exceptions to this approach are Sarawagi & Cohen (2004) and Raetsch & Sonnenburg (2006), who show that Semi-Markov models (SMMs) (Janssen & Limnios, 1999), which are a variation of Markov models, are a natural formulation for sequence segmentation.
The advantage of these models, depicted in Figure 1, is their ability to encode features that capture properties of a segment as a whole, which is not possible in an HMM model.
In particular, these features can encode similarities between two sequence segments of arbitrary lengths, which can be very useful in tasks such as APS.
In this paper, we present a Semi-Markov model for APS and propose a max-margin training method for these models.
This training method is a generalization of the Max-Margin methods for HMMs (Altun et al., 2003b) to SMMs.
It follows the recent literature on discriminative learning of structured prediction (Lafferty et al., 2001; Collins, 2002; Altun et al., 2003a; Taskar et al., 2003).
Our method inherits the advantages of discriminative techniques, namely the ability to encode arbitrary (overlapping) features and not making implausible conditional independence assumptions.
It also has the advantages of SMMs, namely the ability to encode features at the segment level.
We present a linear-time inference algorithm for SMMs and outline the learning method.
Experimental evaluation on datasets used previously on this task (Sporleder & Lapata, 2004) shows improvement over the state-of-the-art methods on APS.
2 Modeling Sequence Segmentation
In sequence segmentation, our goal is to solve the estimation problem of finding a segmentation y ∈ Y, given an observation sequence x ∈ X. For example, in APS x can be a book, which is a sequence of sentences.
In a Semi-Markov model, there exists one variable for each subsequence of observations (i.e. multiple observations) and these variables form a Markov chain.
This is in contrast to an HMM, where there exists one variable for each observation.
More formally, in SMMs, y ∈ Y is a sequence of segment labelings s_i = (b_i, l_i), where b_i is a non-negative integer denoting the beginning of the i-th segment, which ends at position b_{i+1} - 1 and whose label is given by l_i (Sarawagi & Cohen, 2004).
Since in APS the label of the segments is irrelevant, we represent each segment simply by its beginning position, y := {b_i}_{i=0}^{L-1}, with the convention that b_0 = 0 and b_L = N, where N is the number of observations in x. Here, L denotes the number of segments in y. So the first segment is [0, b_1), and the last segment is [b_{L-1}, N), where [a, b) denotes all the sentences from a to b, including a but excluding b.
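To make the boundary convention concrete, the following minimal Python sketch converts a list of segment start positions into half-open segments [b_i, b_{i+1}). The function name `segments_from_boundaries` and the example values are illustrative assumptions and do not come from the paper.

```python
def segments_from_boundaries(boundaries, n):
    """Turn start positions b_0 < b_1 < ... into half-open segments [b_i, b_{i+1})."""
    if not boundaries or boundaries[0] != 0:
        raise ValueError("the first segment must start at position 0")
    increasing = all(a < b for a, b in zip(boundaries, boundaries[1:]))
    if not increasing or boundaries[-1] >= n:
        raise ValueError("boundaries must be strictly increasing and smaller than n")
    # The convention b_L = N closes the last segment at the end of the sequence.
    ends = list(boundaries[1:]) + [n]
    return list(zip(boundaries, ends))


# Hypothetical example: 8 sentences split into paragraphs starting at 0, 3 and 5.
segs = segments_from_boundaries([0, 3, 5], n=8)
```

Each returned pair (a, b) covers sentences a, ..., b - 1, so the pairs tile the whole sequence without overlap.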
We cast this estimation problem as finding a discriminant function F(x, y) such that for an observation sequence x we assign the segmentation that receives the best score with respect to F,
$$f(x) = \arg\max_{y \in Y} F(x, y). \qquad (1)$$
As in many learning methods, we consider functions that are linear in some feature representation Φ,
$$F(x, y) = \langle w, \Phi(x, y) \rangle. \qquad (2)$$
Here, Φ(x, y) is a feature map defined over the joint input/output space as detailed in Section 2.3.
We now present a maximum margin training for predicting structured output variables, of which sequence segmentation is an instance.
One of the advantages of this method is its ability to incorporate the cost function that the classifier is evaluated with.
Let Δ(y, ȳ) be the cost of predicting ȳ instead of y. For instance, Δ is usually the 0-1 loss for binary and multiclass classification.
However, in segmentation, this may be a more sophisticated function such as the symmetric difference of y and ȳ, as discussed in Section 2.2.
Then, one can argue that optimizing a loss function that incorporates this cost can lead to better generalization properties.
One can find a theoretical analysis of this approach in Tsochantaridis et al. (2004).
We follow the general framework of Tsochantaridis et al. (2004) and look for a hyperplane that separates the correct labeling y_i of each observation sequence x_i in our training set from all the incorrect labelings Y \ y_i with some margin that depends on Δ additively.¹
In order to allow some outliers, we use slack variables ξ_i and maximize the minimum margin, F(x_i, y_i) - max_{y ∈ Y \ y_i} F(x_i, y), across training instances i. Equivalently,
$$\min_{w, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (3a)$$
$$\text{s.t.} \ \ \forall i, y: \ \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \ \ge \ \Delta(y_i, y) - \xi_i. \qquad (3b)$$
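¹ There is an alternative formulation that is multiplicative in Δ. We prefer (3) due to computational efficiency reasons.
To solve this optimization problem efficiently, one can investigate its dual given by
$$\max_{\alpha} \ -\frac{1}{2} \sum_{i, j, y, y'} \alpha_{iy} \alpha_{jy'} \langle \delta\Phi_i(y), \delta\Phi_j(y') \rangle + \sum_{i, y} \Delta(y_i, y)\, \alpha_{iy} \qquad (4a)$$
$$\text{s.t.} \ \ \forall i: \ \sum_{y} \alpha_{iy} \le C, \qquad \forall i, y: \ \alpha_{iy} \ge 0, \qquad (4b)$$
where δΦ_i(y) := Φ(x_i, y_i) - Φ(x_i, y).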
Here, there exists one parameter α_iy for each training instance x_i and its possible labeling y ∈ Y. Solving this optimization problem presents a formidable challenge, since the size of Y generally scales exponentially with the number of variables within each variable y. This essentially makes it impossible to find an optimal solution via enumeration.
Instead, one may use a column generation algorithm (Tsochantaridis et al., 2005) to find an approximate solution in polynomial time.
The key idea is to find the most violated constraints (3b) for the current set of parameters and satisfy them up to some precision.
In order to do this, one needs to find
$$\arg\max_{y \in Y} \ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle, \qquad (5)$$
which can usually be done via dynamic programming.
As we shall see, this is an extension of the Viterbi algorithm for Semi-Markov models.
Note that one can express the optimization and estimation problem in terms of kernels k((x, y), (x', y')) := ⟨Φ(x, y), Φ(x', y')⟩.
We refer the reader to Tsochantaridis et al. (2005) for details.
To adapt the above framework to the segmentation setting, we need to address three issues: a) we need to specify a loss function Δ for segmentation, b) we need a suitable feature map Φ as defined in Section 2.3, and c) we need to find an algorithm to solve (5) efficiently.
The max-margin training of SMMs was also presented in Raetsch & Sonnenburg (2006).
Algorithm 1 Max-Margin Training Algorithm
  Input: data x_i, labels y_i, sample size m, tolerance ε
  repeat
    for i = 1 to m do
      find y* by solving (5) for (x_i, y_i)
      if ⟨w, Φ(x_i, y*)⟩ + Δ(y_i, y*) > ξ + ε then
        Increase constraint set S_i ← S_i ∪ {y*}
        Optimize (4) wrt α_iy, ∀y ∈ S_i
      end if
    end for
  until S has not changed in this iteration

2.2 Cost Function
To measure the discrepancy between y and some alternative sequence segmentation y', we simply count the number of segment boundaries that have a) been missed and b) been wrongly added.
Note that this definition allows for errors exceeding 100%, for instance, if we were to place considerably more boundaries than can actually be found in a sequence.
The number of errors is given by the symmetric difference between y and y', when segmentations are viewed as sets.
This can be written as
$$\Delta(y, y') = |y| + |y'| - 2\, |y \cap y'| = |y| + \sum_{b'_i \in y'} \bigl[ 1 - 2\, \{ b'_i \in y \} \bigr]. \qquad (6)$$
Here | · | denotes the cardinality of the set, and {·} equals 1 if its argument is true and 0 otherwise.
Eq. (6) plays a vital role in solving (5), since it allows us to decompose the loss in y' into a constant and functions depending on the segment boundaries b'_i only.
Note that in the case where we want to segment and label, we simply would need to check that the positions are accurate and that the labels of the segments match.
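As a small check of the boundary-counting cost described in this section, the Python sketch below computes the symmetric-difference loss in two ways: directly from set sizes, and as a constant |y| plus one term per predicted boundary. The function names and the example segmentations are illustrative assumptions, not part of the paper.

```python
def segmentation_loss(y, y_pred):
    """Number of missed boundaries plus number of wrongly added boundaries."""
    y, y_pred = set(y), set(y_pred)
    return len(y) + len(y_pred) - 2 * len(y & y_pred)


def decomposed_loss(y, y_pred):
    """Same loss written as |y| plus a per-boundary term over the prediction."""
    y = set(y)
    # Each predicted boundary contributes -1 if it is correct and +1 otherwise.
    return len(y) + sum(1 - 2 * (b in y) for b in set(y_pred))


# Hypothetical example: boundary 3 is missed, boundaries 2 and 7 are wrongly added.
truth, guess = [0, 3, 5], [0, 2, 5, 7]
loss = segmentation_loss(truth, guess)
```

Because the per-boundary term depends on a single predicted boundary at a time, it can be added to the score of each segment inside a dynamic program. The example also illustrates that the loss is not bounded by |y|: predicting [0, 1, 2, 3] against the truth [0] costs 3.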
2.3 Feature Representation
SMMs can extract three kinds of features from the input/output pairs: a) node features, i.e. features that encode interactions between attributes of the observation sequence and the (label of a) segment (rather than the label of each observation as in HMMs), b) features that encode interactions between neighboring labels along the sequence, and c) edge features, i.e. features that encode properties of segments.
The first two types of features are commonly used in other sequence models, such as HMMs and Conditional Random Fields (CRFs).
The third feature type is specific to Semi-Markov models.
In particular, these features can encode properties of a whole segment or similarities between two sequence segments of arbitrary lengths.
The cost of this expressibility is simply a constant factor of the complexity of Markov models, if the maximum length of a segment is bounded.
Features of this type are particularly useful in the face of sparse data.
As in HMMs, we assume stationarity in our model and sum over the features of each segment to get Φ(x, y).
Then, Φ corresponding to models of the middle structure given in Figure 1 is given by
$$\Phi(x, y) := \Bigl( \Phi_0, \ \sum_{i} \Phi_1(b_i, x), \ \sum_{i} \Phi_2(b_{i-1}, b_i, x) \Bigr).$$
We let Φ_0 = l - 1, the number of segments.
The node features Φ_1 capture the dependency of the current segment boundary on the observations, whereas the edge features Φ_2 represent the dependency of the current segment on the observations.
To model the bottom structure in Figure 1, one can design features that represent the dependency of the current segment on its adjacent segments as well as the observations, Φ_3(x, b_{i-2}, b_{i-1}, b_i).
The specific choices of the feature map Φ are presented in Section 3.
2.4 Column Generation on SMMs
Tractability of Algorithm 1 depends on the existence of an efficient algorithm that finds the most violated constraint (3b) via (5).
Both the cost function of Section 2.2 and the feature representation of Section 2.3 are defined over a short sequence of segment boundaries.
Therefore, using the Markovian property, one can perform the above maximization step efficiently via a dynamic programming algorithm.
This is a simple extension of the Viterbi algorithm.
The inference given by (1) can be performed using the same algorithm, setting Δ to a constant function.
We first state the dynamic programming recursion for F + Δ in its generality.
We then give the pseudocode for Φ_3 = 0.
Algorithm 2 Column Generation
  Input: sequence x, segmentation y, max-length of a segment M

Let M be an upper bound on the length of a segment.
The recursive step of the dynamic program is given by
The recursion is initialized by T(0, 0, x) = |y|.
See Algorithm 2 for pseudocode, when Φ_3 = 0.
The segmentation corresponding to (5) is found by constructing the path traversed by the argument of the max operation generating T.
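The dynamic program for the case Φ_3 = 0 can be illustrated with the following Python sketch of a loss-augmented semi-Markov Viterbi search. It is a simplified reading of the text rather than the paper's Algorithm 2: it uses a one-index table `best[t]` (best score of any segmentation of the first t sentences) instead of the table T, and the names `semi_markov_viterbi`, `node_score` and `edge_score` are hypothetical stand-ins for the weighted features ⟨w, Φ_1⟩ and ⟨w, Φ_2⟩. The constant offset feature is assumed to be folded into `edge_score`.

```python
def semi_markov_viterbi(n, max_len, node_score, edge_score, y_true=None):
    """Best segmentation of n sentences into segments of length <= max_len.

    node_score(start) and edge_score(start, end) score the segment [start, end).
    If y_true is given, the boundary-counting loss is added to the score.
    """
    truth = set(y_true) if y_true is not None else None
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    # With loss augmentation the table starts at the constant |y|.
    best[0] = len(truth) if truth is not None else 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            gain = node_score(start) + edge_score(start, end)
            if truth is not None:
                gain += 1 - 2 * (start in truth)  # per-boundary loss term
            if best[start] + gain > best[end]:
                best[end], back[end] = best[start] + gain, start
    # Follow the back-pointers to recover the segment start positions.
    boundaries, pos = [], n
    while pos > 0:
        pos = back[pos]
        boundaries.append(pos)
    return best[n], boundaries[::-1]


# Toy scores (assumptions): no node evidence, segments of length 3 are preferred.
def toy_node(start):
    return 0.0


def toy_edge(start, end):
    return -abs((end - start) - 3)


score, pred = semi_markov_viterbi(6, 4, toy_node, toy_edge)
aug_score, aug_pred = semi_markov_viterbi(6, 4, toy_node, toy_edge, y_true=[0, 2, 4])
```

The two nested loops visit at most n · max_len (start, end) pairs, which is linear in the sequence length for a fixed maximum segment length. Calling the function without `y_true` corresponds to plain inference, and calling it with `y_true` corresponds to the search for the most violated constraint.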
3 Features
We now specify the features described in Section 2.3 for APS.
Note that features of the second type do not exist for APS, since we ignore the labelings of segments.
Node features Φ_1(b_j, x) represent the information of the current segment boundary and some attributes of the observations around it (which we define as the current, preceding and succeeding sentences).
These are sentence-level features, which we adapt from Genzel (2005) and Sporleder & Lapata (2004).²
For the b_j-th sentence, x(b_j), we use the following features:
• Length of x(b_j).
• Relative Position of x(b_j).
• Final punctuation of x(b_j).
• Number of capitalized words in x(b_j).
• Word overlap of x(b_j) with the next sentence,
$$\frac{2\, |x(b_j) \cap x(b_j + 1)|}{|x(b_j)| + |x(b_j + 1)|}.$$
• First word of x(b_j).
• Bag of words (BOW) features: let the bag of words of a set of sentences S be the vector (c_1, ..., c_N), where N is the size of the dictionary and c_i is the frequency of word i in S.
  - The inner product of the two items above
• Cosine Similarity of x(b_j) and the previous sentence.
• Shannon's Entropy of x(b_j), computed by using a language model as described in Genzel & Charniak (2003).
• Quotes (Q_p, Q_c): Q_p and Q_c are the numbers of pairs of quotes in the previous and current sentence; with Num_p and Num_c denoting the numbers of quotes in the previous and current sentence, Q_p = 0.5 × Num_p and Q_c = 0.5 × Num_c.
² Due to space limitations, we omit the motivations for these features and refer the reader to the literature cited above.
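Two of the similarity-based node features above can be sketched in a few lines of Python. The sketch assumes that sentences are already tokenized into lists of words, that word overlap is computed over word types (sets), and that cosine similarity is computed on raw word-frequency vectors; these choices, like the function names, are assumptions made for illustration and may differ from the paper's setup.

```python
import math
from collections import Counter


def word_overlap(sent_a, sent_b):
    """Overlap 2|A ∩ B| / (|A| + |B|) over the word types of two sentences."""
    a, b = set(sent_a), set(sent_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def bag_of_words(sentences):
    """Word-frequency vector of a set of sentences, stored sparsely."""
    return Counter(word for sent in sentences for word in sent)


def cosine_similarity(sents_a, sents_b):
    """Cosine of the angle between the bag-of-words vectors of two sentence sets."""
    ca, cb = bag_of_words(sents_a), bag_of_words(sents_b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical example sentences.
prev_sent = "the cat sat".split()
cur_sent = "the dog sat".split()
overlap = word_overlap(prev_sent, cur_sent)
similarity = cosine_similarity([prev_sent], [cur_sent])
```

Because `cosine_similarity` accepts lists of sentences, the same function can compare a whole paragraph with a single neighboring sentence, which is how a segment-level similarity feature could reuse it.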
Below is the set of features Φ_2(b_j, b_{j+1}, x) encoding information about the current segment.
These features represent the power of the Semi-Markov models.
Note that Φ_3 features also belong to the edge feature category.
In this paper, we did not use Φ_3 features due to computational issues.
• Length of the Paragraph: This feature expresses the assumption that one would want to have a balance across the lengths of the paragraphs assigned to a text.
Very long and very short paragraphs should be uncommon.
• Cosine Similarity of the current paragraph and neighboring sentences: Ideally, one would like to measure the similarity of two consecutive paragraphs and search for a segmentation that assigns low similarity scores (in order to facilitate changes in the content).
This can be encoded using Φ_3(x, b_{j-1}, b_j, b_{j+1}) features.
When such features are computationally expensive, one can measure the similarity of the current paragraph with the preceding sentence as the cosine similarity CS(P, x(b_j - 1)), where P denotes the set of sentences in the current paragraph [b_j, b_{j+1}), and similarly for CS(P, x(b_{j+1})).
• Shannon's Entropy of the Paragraph: The motivation for including features encoding the entropy of the sentences is the observation that the entropy of paragraph-initial sentences is lower than that of the others (Genzel & Charniak, 2003).
The motivation for including features encoding the entropy of the paragraphs, on the other hand, is that the entropy rate should remain more or less constant across paragraphs, especially for long texts like books.
We ignore the sentence boundaries and use the same technique that we use to compute the entropy of a sentence.
3.2 Feature Rescaling
Most of the features described above are binary.
There are also some features such as the entropy whose value could be very large.
We rescale all the non-binary valued features so that they do not override the effect of the binary features.
The scaling is performed as follows:
$$u_{new} = \frac{u - \min(u)}{\max(u) - \min(u)},$$
where u_new is the new feature and u is the old feature; min(u) is the minimum of u, and max(u) is the maximum of u. An exception to this is the rescaling of BOW features which is given by
We collected four sets of data for our experiments.
The first corpus, which we call SB, consists of manually annotated text from the book The Adventure of the Bruce-Partington Plans by Arthur Conan Doyle.
The second corpus, which we call SA, again consists of manually annotated text, but from 10 different books by Conan Doyle.
The third and fourth corpora consist of German (GER) and English (ENG) texts, respectively.
The German data, consisting of 12 German novels, was used by Sporleder & Lapata (2006).
This data uses automatically assigned paragraph boundaries, with the labeling error expected to be around 10%.
The English data contains 12 well-known English books from Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page).
For this dataset the paragraph boundaries were marked manually.
All corpora were approximately split into training (72%), development (21%), and test set (7%) (see Table 1).
The table also reports the accuracy of the baseline classifier, denoted as BASE, which either labels all sentences as paragraph boundaries or non-boundaries, choosing whichever scheme yields a better accuracy.
Table 1: Number of sentences and % accuracy of the baseline classifier (BASE) on various datasets used in our experiments.
We evaluate our system using accuracy, precision, recall, and the F1-score given by (2 × Precision × Recall)/(Precision + Recall), and compare our results to Sporleder & Lapata (2006), who used BoosTexter (Schapire & Singer, 2000) as a learning algorithm.
To the best of our knowledge, BoosTexter (henceforth called BT) is the leading method published for this task so far.
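The evaluation measures and the BASE classifier described above can be sketched in Python as follows. The sketch assumes that gold and predicted segmentations are given as per-sentence 0/1 labels (1 marks a paragraph-initial sentence); the function names and the toy labels are assumptions for illustration only.

```python
def precision_recall_f1(gold, pred):
    """Precision, recall and F1 of predicted paragraph starts against gold labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def baseline_accuracy(gold):
    """Accuracy of labeling every sentence with the more frequent class."""
    positives = sum(1 for g in gold if g)
    return max(positives, len(gold) - positives) / len(gold)


# Toy labels for eight sentences (hypothetical).
gold = [1, 0, 0, 1, 0, 0, 0, 1]
pred = [1, 0, 1, 0, 0, 0, 0, 1]
p, r, f1 = precision_recall_f1(gold, pred)
base = baseline_accuracy(gold)
```

On imbalanced data the all-non-boundary baseline can reach a high accuracy while its recall is zero, which is one reason the F1-score is the more informative measure for this task.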
In order to evaluate the importance of the edge features and the resultant large-margin constraint, we also compare against a standard binary Support Vector Machine (SVM) which uses node features alone to predict whether each sentence is the beginning of a paragraph or not.
For a fair comparison, all classifiers used the linear kernel and the same set of node features.
We perform model selection for all three algorithms by choosing the parameter values that achieve the best F1-score on the development set.
For both the SVM as well as our algorithm, SMM, we tune the parameter C (see (3a)) which measures the trade-off between training error and margin.
For BT, we tune the number of Boosting iterations, denoted by N.
In our first experiment, we compare the performance of our algorithm, SMM, on the English and German corpora to a standard SVM and BoosTexter.
We report these results in Table 2.
Our algorithm achieves the best F1-score on the ENG corpus.
SMM performs very competitively on the GER corpus, achieving accuracies close to those of BT.
Table 2: Test results on ENG and GER data after model selection.
We observed a large discrepancy between the performance of our algorithm on the development and the test datasets.
The situation is similar for both SVM and BT.
For instance, BT, when trained on the ENG corpus, achieves an optimal F1-score of 18.67% on the development set after N = 100 iterations.
For the same N value, the test performance is 41.67%.
We conjecture that this discrepancy arises because the books that we use for training and test are written by different authors.
While there is some generic information about when to insert a paragraph break, it is often subjective and part of the author's style.
To test this hypothesis, we performed experiments on the SA and SB corpora, and present results in Table 3.
Indeed, the F1-scores obtained on the development and test corpora closely match for text drawn from the same book (whilst exhibiting better overall performance), differ slightly for text drawn from different books by the same author, and deviate substantially for the GER and ENG corpora.
Table 3: Comparison on various ENG datasets.
There is one extra degree of freedom that we can optimize in our model, namely the offset, i.e. the weight assigned to the constant feature Φ_0.
After fixing all the parameters as described above, we vary the value of the offset parameter and pick the value that gives the best F1-score on the development data.
We choose to use the F1-score, since it is the error measure that we care about.
Although this extra optimization leads to a better F1-score in German (69.35% as opposed to 54.66% where there is no extra tuning of the offset), it results in a decrease of the F1-score in English (52.28% as opposed to 58.33%).
These results are reported in Table 4.
We found that the difference between the F1-scores obtained with and without tuning the threshold on the development set was not a good indicator of the usefulness of this extra parameter.
We are now investigating other properties, such as variance on the development data, to see if the tuning of the threshold can be used for better APS systems.
Figure 2: Precision-recall curves
Figure 2 plots the precision-recall curve obtained on various datasets.
As can be seen, the performance of our algorithm on the SB dataset is close to optimum, whilst it degrades slightly on the SA dataset, and substantially on the ENG and GER datasets.
This further confirms our hypothesis that our algorithm excels in capturing stylistic elements from a single author, but suffers slightly when trained to identify generic stylistic elements.
We note that this is not a weakness of our approach alone.
In fact, all the other learning algorithms also suffer from this shortcoming.
Table 4: Performance on ENG test set tuning the offset for best F1-score on ENG development set.
data set   Acc.   Rec.   Prec.   F1-score
5 Conclusion
We presented a competitive algorithm for paragraph segmentation which uses the ideas from large margin classifiers and graphical models to extend the semi-Markov formalism to the large margin case.
We obtain an efficient dynamic programming formulation for segmentation which works in linear time in the length of the sequence.
Experimental evaluation shows that our algorithm is competitive when compared to the state-of-the-art methods.
As future work, we plan on implementing Φ_3 features in order to perform an accuracy/time analysis.
By defining appropriate features, we can use our method immediately for text and discourse segmentation.
It would be interesting to compare this method to Latent Semantic Analysis approaches for text segmentation as studied, for example, in Bestgen (2006) and the references therein.
