In Sequential Viterbi Models, such as HMMs, MEMMs, and Linear Chain CRFs, the types of patterns over output sequences that can be learned by the model depend directly on the model's structure: any pattern that spans more output tags than are covered by the model's order will be very difficult to learn.
However, increasing a model's order can lead to an increase in the number of model parameters, making the model more susceptible to sparse data problems.
This paper shows how the notion of output transformation can be used to explore a variety of alternative model structures.
Using output transformations, we can selectively increase the amount of contextual information available for some conditions, but not for others, thus allowing us to capture longer-distance consistencies while avoiding unnecessary increases to the model's parameter space.
The appropriate output transformation for a given task can be selected by applying a hill-climbing approach to held-out data.
On the NP Chunking task, our hill-climbing system finds a model structure that outperforms both first-order and second-order models with the same input feature set.
1 Sequence Prediction
A sequence prediction task is a task whose input is a sequence and whose output is a corresponding sequence.
Examples of sequence prediction tasks include part-of-speech tagging, where a sequence of words is mapped to a sequence of part-of-speech tags; and IOB noun phrase chunking, where a sequence of words is mapped to a sequence of labels, I, O, and B, indicating whether each word is inside a chunk, outside a chunk, or at the boundary between two chunks, respectively.
In sequence prediction tasks, we are interested in finding the most likely output sequence for a given input.
In order to be considered likely, an output value must be consistent with the input value, but it must also be internally consistent.
For example, in part-of-speech tagging, the sequence "preposition-verb" is highly unlikely; so we should reject an output value that contains that sequence, even if the individual tags are good candidates for describing their respective words.
2 Sequential Viterbi Models
This intuition is captured in many sequence learning models, including Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), and Linear Chain Conditional Random Fields (LC-CRFs), by including terms corresponding to pieces of output structure in their scoring functions.
(Sha and Pereira, 2003; Sutton and McCallum, 2006; McCallum et al., 2000; Alpaydin, 2004)
Each of these Sequential Viterbi Models defines a set of scoring functions that evaluate fixed-size pieces of the output sequence based on fixed-size pieces of the input sequence.1 The overall score for an output value is then computed by summing the scores for all its fixed-size pieces.
1For HMMs and MEMMs, the local scores are negative log probabilities. For LC-CRFs, the local scores do not have any direct probabilistic interpretation.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 801-809, Prague, June 2007.
©2007 Association for Computational Linguistics
Figure 1: Common Model Structures. (a) Simple first order. (b) Extended first order. (c) Simple second order. (d) Extended second order.
Sequence prediction models can differ from one another along two dimensions:
Model Structure: The set of output pieces and input pieces for which local scoring functions are defined.
Model Type: The set of parametrized equations used to define those local scoring functions, and the procedures used to determine their parameters.
In this paper, we focus on model structure.
In particular, we are interested in finding a suitable model structure for a given task and training corpus.
2.1 Common Model Structures
The model structure used by classical HMMs is the "simple first order" structure.
This model structure defines two local scoring functions.
The first scoring function evaluates an output value in the context of the corresponding input value; and the second scoring function evaluates adjacent pairs of output values.
Simple LC-CRFs often extend this structure by adding a third local scoring function, which evaluates adjacent pairs of output values in the context of the input value corresponding to one of those outputs.
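The way local scores combine in the simple first order structure can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the function name `sequence_score` and the callables `emit` and `trans` are hypothetical stand-ins for the two local scoring functions, whose form depends on the model type.

```python
from typing import Callable, Sequence


def sequence_score(
    inputs: Sequence[str],
    outputs: Sequence[str],
    emit: Callable[[str, str], float],
    trans: Callable[[str, str], float],
) -> float:
    """Sum the local scores of a simple first order model structure.

    emit(x, y) scores one output value in the context of its input value;
    trans(a, b) scores one adjacent pair of output values.
    Both are hypothetical placeholders for a trained model's local scores.
    """
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    # One term per position: output value in the context of its input value.
    total = sum(emit(x, y) for x, y in zip(inputs, outputs))
    # One term per adjacent pair of output values.
    total += sum(trans(a, b) for a, b in zip(outputs, outputs[1:]))
    return total
```

The extended first order structure would add a third sum whose terms look at an adjacent output pair together with one input value.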
Because these first order structures include scoring functions for adjacent pairs of output items, they can identify and reject output values that contain improbable subsequences of length two.
For example, in part-of-speech tagging, the sequence "preposition-verb" is highly unlikely; and such models will easily learn to reject outputs containing that sequence.
However, it is much more difficult for first order models to identify improbable subsequences of length three or more.
For example, in English texts, the sequence "verb-noun-verb" is much less likely than one would predict based just on the subsequences "verb-noun" and "noun-verb."
But first order models are incapable of learning that fact.
Thus, in order to improve performance, it is often necessary to include scoring functions that span over larger sequences.
In the "simple second order" model structure, the local scoring function for adjacent pairs of output values is replaced with a scoring function for each triple of consecutive output values.
In extended versions of this structure typically used by LC-CRFs, scoring functions are also added that combine output value triples with an input value.
These model structures are illustrated in Figure 1.
Similarly, third order and fourth order models can be used to further increase the span over which scoring functions are defined.
Moving to higher order model structures increases the distance over which the model can check consistency.
However, it also increases the number of parameters the model must learn, making the model more susceptible to sparse data problems.
Thus, the usefulness of a model structure for a given task will depend on the types of constraints that are important for the task itself, and on the size and diversity of the training corpus.
3 Searching for Good Model Structures
We can therefore use simple search methods to look for a suitable model structure for a given task and training corpus.
In particular, we have performed several experiments using hill-climbing methods to search for an appropriate model structure for a given task.
In order to apply hill-climbing methods, we need to define:
The search space.
I.e., concrete representations for the set of model structures we will consider.
A set of operations for moving through that search space.
An evaluation metric.
In Section 4, we will define the search space using transformations on output values.
This will allow us to consider a wide variety of model structures without needing to make any direct modifications to the underlying sequence modelling systems.
Output value transformations will be concretely represented using Finite State Transducers (FSTs).
In Section 5, we will define the set of operations for moving through the search space as modification operations on FSTs.
For the evaluation metric, we simply train a model using the given model structure, and test it on held-out data.
4 Representing Model Structure with Reversible Output Transformations
The common model structures described in Section 2.1 differ from one another in that they examine varying sizes of "windows" on the output structure.
Rather than varying the size of the window, we can achieve the same effect by fixing the window size, but transforming the output values.
For example, consider the effects of transforming the output values by replacing individual output tags with pairs of adjacent output tags:
Training a first order model based on these transformed values is equivalent to training a second order model based on the original values, since in each case the local scoring functions will be based on pairs of adjacent output tags.
Similarly, transforming the output values by replacing individual output tags with triples of adjacent output tags is equivalent to training a third order model based on the original output values.
Of course, when we apply a model trained on this type of transformed output to new inputs, it will generate transformed output values.
Thus, the transformation must be reversible, so that we can map the output of the model back to an un-transformed output value.
This transformational approach has the advantage that we can explore different model structures using off-the-shelf learners, without modifying them.
In particular, we can apply the transformation corresponding to a given model structure to the training corpus, and then train the off-the-shelf learner based on that transformed corpus.
To predict the value for a new input, we simply apply the learned model to generate a corresponding transformed output value, and then use the inverse transformation to map that value back to an un-transformed value.
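A minimal sketch of the pair transformation and its inverse is given below. The function names, the padding symbol `START`, and the choice to pair each tag with its predecessor (rather than its successor) are assumptions made for illustration; the paper itself represents transformations as FSTs rather than as Python functions.

```python
from typing import List, Tuple

START = "<s>"  # hypothetical padding symbol for the first position


def to_pairs(tags: List[str]) -> List[Tuple[str, str]]:
    """Replace each tag with the pair (previous tag, current tag)."""
    previous = [START] + tags[:-1]
    return list(zip(previous, tags))


def from_pairs(pairs: List[Tuple[str, str]]) -> List[str]:
    """Invert to_pairs by keeping the second element of each pair."""
    return [current for _, current in pairs]


original = ["O", "I", "I", "B", "I"]
transformed = to_pairs(original)
# A first order learner trained on `transformed` scores adjacent pairs of
# pair-tags, so each local score covers a wider window of original tags.
restored = from_pairs(transformed)
```

Because the transformation keeps the sequence length unchanged and can be undone exactly, the transformed corpus can be handed to an unmodified learner and its predictions mapped back afterwards.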
Output encoding transformations can be used to represent a large class of model structures, including commonly used structures (first order, second order, etc.) as well as a number of "hybrid" structures that use different window sizes depending on the content of the output tags.
They can also be used to represent alternative encodings of the output values themselves.
For example, there has been some debate about the relative merits of different output encodings for the chunking task (Tjong Kim Sang and Veenstra, 1999; Tjong Kim Sang, 2000; Shen and Sarkar, 2005).
These encodings differ in whether they define special tags for the beginning of chunks, for the ends of chunks, and for boundaries between chunks.
The output transformation procedure described here is capable of capturing all of the output encodings used for chunking.
Thus, this transformational method provides a unified framework for considering both the type of information that should be encoded by individual tags (i.e., the encoding) and the distance over which that information should be evaluated (i.e., the order of the model).
Under this framework, we can use simple search procedures to find an appropriate transformation for a given task.
4.1 Representing Transformations as FSTs
Finite State Transducers (FSTs) provide a natural formalism for representing output transformations.
FSTs are powerful enough to capture different orders of model structure, including hybrid orders; and to capture different output encodings, such as the ones considered in (Shen and Sarkar, 2005).
FSTs are efficient, so they add very little overhead.
Finally, there exist standard algorithms for inverting and determinizing FSTs.
Figure 2: FSTs for Five Common Chunk Encodings. Each transducer takes an IOB1-encoded string for a given output value, and generates the corresponding string for the same output value, using a new encoding. Note that the IOB1 FST is an identity transducer; and note that the transducers that make use of the E tag must use ε-output edges to delay the decision of which tag should be used until enough information is available.
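As a concrete instance of an encoding change, the sketch below converts between IOB1 and IOB2, a widely used alternative in which every chunk-initial word is tagged B. It is written as plain Python functions with a one-tag memory (`prev`) rather than as an explicit transducer, and the function names are hypothetical.

```python
from typing import List


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Convert IOB1 tags to IOB2: every chunk-initial tag becomes B."""
    out, prev = [], "O"  # the sentence start behaves like a preceding O
    for tag in tags:
        # In IOB1, an I that follows O (or the start) begins a chunk.
        out.append("B" if tag == "I" and prev == "O" else tag)
        prev = tag
    return out


def iob2_to_iob1(tags: List[str]) -> List[str]:
    """Invert iob1_to_iob2: keep B only where one chunk directly follows another."""
    out, prev = [], "O"
    for tag in tags:
        out.append("I" if tag == "B" and prev == "O" else tag)
        prev = tag
    return out


iob1 = ["I", "I", "O", "I", "B", "I", "O"]
iob2 = iob1_to_iob2(iob1)
```

Both directions need only the previous tag, preserve sequence length, and are mutually inverse on valid inputs, which is the kind of behaviour a small deterministic transducer can capture.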
4.1.1 Output-Transformation FSTs
In order for an FST to be used to transform output values, it must have the following three properties:
1. The FST's inverse should be deterministic.3 Otherwise, we will be unable to convert the model's (transformed) output into an un-transformed output value.
2. The FST should recognize exactly the set of valid output values. If it does not recognize some valid output value, then it won't be able to transform that value. If it recognizes some invalid output value, then there exists a transformed output value that would map back to an invalid output value.
3. The FST should not modify the length of the output sequence. Otherwise, it will not be possible to align the output values with input values when running the model.
2Note that we are not attempting to learn a transducer that generates the output values from input values, as is done in e.g. (Oncina et al., 1993) and (Stolcke and Omohundro, 1993). Rather, we are interested in finding a transducer from one output encoding to another output encoding that will be more amenable to learning by the underlying Sequential Viterbi Model.
3Or at least determinizable.
In addition, it seems desirable for the FST to have the following two properties:
4. The FST should be deterministic. Otherwise, a single training example's output could be encoded in multiple ways, which would make training the individual base decision classifiers difficult.
5. The FST should generate every output string. Otherwise, there would be some possible system output that we are unable to map back to an un-transformed output.
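However, these two properties cannot in general both be satisfied together with the first three: taken together, the five properties require a one-to-one, length-preserving correspondence between the valid output values and all strings over the transformed tag set, and the number of valid output values of a given length (for chunking, see footnote 4) need not equal the number of such strings.
We must therefore relax at least one of these two properties.
Relaxing property 4 (deterministic FSTs) will make training harder; and relaxing property 5 (complete FSTs) will make testing harder.
In the experiments presented here, we chose to relax the second of these, property 5.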
4.1.2 Inverting the Transformation
Recall that the motivation behind property 5 is that we need a way to map any output generated by the machine learning system back to an un-transformed output value.
As an alternative to requiring that the FST generate every output string, we can define an extended inversion function, that includes the inverted FST, but also generates output values for transformed values that are not generated by the FST.
In particular, in cases where the transformed value is not generated by the FST, we can assume that one or more of the transformed tags was chosen incorrectly; and make the minimal set of changes to those tags that results in a string that is generated by the FST.
4To see why the number of possible chunkings is $3^n - 3^{n-1} - 3^{n-2}$, consider the IOB1 encoding: it generates all chunkings, and is valid for any of the $3^n$ strings except those that start with B (of which there are $3^{n-1}$) and those that include the sequence OB (of which there are $3^{n-2}$).
Thus, we can compute the optimal un-transformed output value corresponding to each transformed output using the following procedure:
Invert the original FST.
I.e., replace each arc $\langle S \rightarrow Q[\alpha : \beta] \rangle$ with an arc $\langle S \rightarrow Q[\beta : \alpha] \rangle$.
Normalize the FST such that each arc has exactly one input symbol.
Convert the FST to a weighted FST by assigning a weight of zero to all arcs.
This weighted FST uses non-negative real-valued weights, and the weight of a path is the sum of the weights of all edges in that path.
For each arc $\langle S \rightarrow Q[x : \alpha] \rangle$, and each $y \neq x$, add a new arc $\langle S \rightarrow Q[y : \alpha] \rangle$ with a weight of one.
Determinize the resulting FST, using a variant of the algorithm presented in (Mohri, 1997).
This determinization algorithm will prune paths that have non-optimal weights.
In cases where the determinization algorithm has not completed by the time it creates 10,000 states, the candidate FST is assumed to be non-determinizable, and the original FST is rejected as a candidate.
The resulting FST will accept all sequences of transformed tags, and will generate for each sequence of transformed tags the un-transformed output value that is reached with the fewest "repairs" made to the transformed tags.
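The effect of this procedure can be illustrated without weighted determinization. The sketch below substitutes a simple dynamic program over the inverted FST: following an arc costs zero when its input symbol matches the observed transformed tag and one otherwise, and the cheapest path is kept per state. The arc representation, the function name `repair_and_invert`, and the two-state example transducer (an assumed inverse of an IOB1-to-IOB2 encoder) are all hypothetical; ties between equally cheap repairs are broken arbitrarily.

```python
from typing import Dict, List, Sequence, Set, Tuple

# (transformed tag read, un-transformed tags emitted, destination state)
Arc = Tuple[str, Tuple[str, ...], str]


def repair_and_invert(
    observed: Sequence[str],
    arcs: Dict[str, List[Arc]],
    start: str,
    finals: Set[str],
) -> Tuple[List[str], int]:
    """Return (un-transformed output, number of repaired tags).

    Assumes the inverted FST is normalized so that every arc reads
    exactly one transformed tag.
    """
    best = {start: (0, ())}  # state -> (cost so far, output so far)
    for tag in observed:
        nxt: Dict[str, Tuple[int, Tuple[str, ...]]] = {}
        for state, (cost, out) in best.items():
            for in_sym, out_syms, dest in arcs.get(state, []):
                new_cost = cost + (0 if in_sym == tag else 1)
                if dest not in nxt or new_cost < nxt[dest][0]:
                    nxt[dest] = (new_cost, out + out_syms)
        best = nxt
    reachable = [best[s] for s in finals if s in best]
    if not reachable:
        raise ValueError("no accepting path of this length")
    cost, out = min(reachable, key=lambda item: item[0])
    return list(out), cost


# Hypothetical inverted FST: reads IOB2 tags, emits IOB1 tags.
INVERSE_ARCS: Dict[str, List[Arc]] = {
    "out": [("O", ("O",), "out"), ("B", ("I",), "in")],
    "in": [("O", ("O",), "out"), ("I", ("I",), "in"), ("B", ("B",), "in")],
}
```

For a well-formed transformed sequence the zero-cost path is found and no repairs are made; for an ill-formed one (for example, an I directly after O in IOB2), the returned cost counts how many tags had to be changed.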
5 FST Modification Operations
In order to search the space of output-transforming FSTs, we must define a set of modification operations, that generate a new FST from a previous FST.
In order to support a hill-climbing search strategy, these modification operations should make small incremental changes to the FSTs.
The selection of appropriate modification operations is important, since it will significantly impact the efficiency of the search process.
In this section, we describe the set of FST modification operations that are used for the experiments described in this paper.
These operations were chosen based on our intuitions about what modifications would support efficient hill-climbing search.
In future experiments, we plan to examine alternative modification operations.
The new output tag operation replaces an arc $\langle S \rightarrow Q[\alpha : \beta x \gamma] \rangle$ with an arc $\langle S \rightarrow Q[\alpha : \beta y \gamma] \rangle$, where y is a new output tag that is not used anywhere else in the transducer.
When a single output tag appears on multiple arcs, this operation effectively splits that tag in two.
For example, when applied to the identity transducer for the IOB1 encoding shown in Figure 2, this operation can be used to distinguish O tags that follow other O tags from O tags that follow I or B tags - effectively increasing the order of the model structure for just O tags.
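The tag-level effect of this example can be sketched directly on tag sequences. The code below is only an illustration of the resulting transformation, not of the FST operation itself; the new tag name `O_after_O`, the function names, and the decision to treat a sentence-initial O as not following another O are assumptions.

```python
from typing import List

NEW_TAG = "O_after_O"  # hypothetical name for the newly introduced tag


def split_o_tags(tags: List[str]) -> List[str]:
    """Give O tags that follow another O their own tag; leave the rest unchanged."""
    out, prev = [], None
    for tag in tags:
        out.append(NEW_TAG if tag == "O" and prev == "O" else tag)
        prev = tag
    return out


def merge_o_tags(tags: List[str]) -> List[str]:
    """Inverse transformation: map the new tag back to O."""
    return ["O" if tag == NEW_TAG else tag for tag in tags]


example = ["O", "O", "I", "O", "O", "O"]
encoded = split_o_tags(example)
```

Only the O tag gains extra context here; I and B are untouched, so the transformed tag set grows by a single tag rather than by every pair of tags.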
The specialize output tag operation is similar to the new output tag operation, but rather than replacing the output tag with a new tag, we "subdivide" the tag.
When the model is trained, features will be included for both the subdivided tag and the original (undivided) tag.
The loop unrolling operation acts on a single self-loop arc e at a state S, and makes the following changes to the FST:
Create a new state S'.
For each outgoing arc e1 = $\langle S \rightarrow Q[\alpha : \beta] \rangle \neq e$, add an arc e2 = $\langle S' \rightarrow Q[\alpha : \beta] \rangle$.
Note that if e1 was a self-loop arc (i.e., S = Q), then e2 will point from S' to S.
Change the destination of loop arc e from S to S'.
By itself, the loop unrolling operation just modifies the structure of the FST, but does not change the actual transduction performed by the FST.
5This operation requires the use of a model where features are defined over (input,output) pairs, such as MEMMs or LC-CRFs.
It is therefore always immediately followed by applying the new output tag operation or the specialize output tag operation to the loop arc e.
The copy tag forward operation splits an existing state in two, directing all incoming edges that generate a designated output tag to one copy, and all remaining incoming edges to the other copy.
The outgoing edges of these two states are then distinguished from one another, using either the specialize output tag operation (if available) or the new output tag operation.
This modification operation creates separate edges for different output histories, effectively increasing the "window size" of tags that pass through the state.
The copy state forward operation is similar to the copy tag forward operation; but rather than redirecting incoming edges based on what output tags they generate, it redirects incoming edges based on what state they originate from.
This modification operation allows the FST to encode information about the history of states in the transformational FST as part of the model structure.
The copy feature forward operation is similar to the copy tag forward operation; but rather than redirecting incoming edges based on what output tags they generate, it redirects incoming edges based on a feature of the current input value.
This modification operation allows the transformation to subdivide output tags based on features of the input value.
6 Hill Climbing System
Having defined a search space, a set of transformations to explore that space, and an evaluation metric, we can use a hill-climbing system to search for a good model structure.
This approach starts with a simple initial FST, and makes incremental local changes to that FST until a locally optimal FST is found.
In order to help avoid sub-optimal local maxima, we use a fixed-size beam search.
To increase the search speed, we used a 12-machine cluster to evaluate candidate FSTs in parallel.
The hill climbing system iteratively performs the following procedure:
Initialize candidates to be the singleton set containing the identity transducer.
Repeat ...
(a) Generate a new FST, by applying a random modification operation to a randomly selected member of the candidates set.
(b) Evaluate the new FST, by training a model with it and testing its performance on the held-out data set.
(This is done in parallel.)
(c) Once the FST has been evaluated, add it to the candidates set.
(d) Sort the candidates set by their score on the held-out data, and discard all but the 10 highest-scoring candidates.
... until no improvement is made for twenty consecutive iterations.
Return the candidate FST with the highest score.
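The loop above can be sketched serially as follows. This is a simplified illustration under several assumptions: candidates are evaluated one at a time rather than in parallel, "no improvement" is interpreted as the best held-out score not increasing, and `operations` and `evaluate` are hypothetical callables standing in for the FST modification operations and the train-and-test step.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def hill_climb(
    initial: T,
    operations: Sequence[Callable[[T], T]],
    evaluate: Callable[[T], float],
    beam_size: int = 10,
    patience: int = 20,
    rng: Optional[random.Random] = None,
) -> T:
    """Beam-limited hill climbing over candidate structures."""
    rng = rng or random.Random(0)
    candidates: List[Tuple[float, T]] = [(evaluate(initial), initial)]
    best_score = candidates[0][0]
    stale = 0  # consecutive iterations without improvement
    while stale < patience:
        # (a) Apply a random operation to a random member of the beam.
        _, parent = rng.choice(candidates)
        child = rng.choice(operations)(parent)
        # (b) Score the new candidate on held-out data.
        score = evaluate(child)
        # (c) + (d) Add it to the beam and keep only the top scorers.
        candidates.append((score, child))
        candidates.sort(key=lambda item: item[0], reverse=True)
        del candidates[beam_size:]
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
    return candidates[0][1]
```

In the setting of this paper, `initial` would be the identity transducer and each call to `evaluate` would train the base learner on the transformed training corpus, which is why evaluating candidates in parallel matters in practice.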
7 Noun Phrase Chunking Experiment
In order to test this approach to finding a good model structure, we applied our hill-climbing system to the task of noun phrase chunking.
The base system was a Linear Chain CRF, implemented using Mallet (McCallum, 2002).
The set of features used is listed in Table 1.
Training and testing were performed using the noun phrase chunking corpus described in (Ramshaw and Marcus, 1995).
A randomly selected 10% of the original training corpus was used as held-out data, to provide feedback to the hill-climbing system.
7.1 NP Chunking Experiment: Results
Description
The current output tag.
A tuple of the current output tag and the (i+n)th word, $-2 \le n \le 2$.
A tuple of the current output tag, the current word, and the previous word.
A tuple of the current output tag, the current word, and the next word.
A tuple of the current output tag and the part of speech tag of the (i+n)th word, $-2 \le n \le 2$.
A tuple of the current output tag and the two consecutive part of speech tags starting at word i+n, $-2 \le n \le 1$.
A tuple of the current output tag, and three consecutive part of speech tags centered on word i+n, $-1 \le n \le 1$.
is the ith output tag; wi is the ith word; and ti is the part-of-speech tag for the ith word.
Table 2: Results for NP Chunking Experiment. Rows: Baseline (first order); Second order; Learned structure.
of 92.80.6 As a point of comparison, a simple second order model achieves an intermediate F-score of 92.63 on the test data.
Thus, the model learned by the hill-climbing system outperforms both the simple first-order model and the simple second-order model.
Figure 3 shows how the scores of FSTs on held-out data changed as the hill-climbing system ran.
Figure 4 shows the search tree explored by the hill-climbing system.
6The reason that held-out scores are significantly higher than test scores is that held-out data was taken from the same sections of the original corpus as the training data; but test data was taken from new sections.
Thus, there was more lexical overlap between the training data and the held-out data than between the training data and the testing data.
Figure 3: Performance on Heldout Data for NP Chunking Experiment.
In this graph, each point corresponds to a single transducer generated by the hill-climbing system.
The height of each transducer's point indicates its score on held-out data.
The line indicates the highest score that has been achieved on the held-out data by any transducer.
Figure 4: Hill Climbing Search Tree for NP Chunking Experiment. This tree shows the "ancestry" of each transducer tried by the hill climbing system.
Lighter colors indicate higher scores on the held-out data.
After one hundred iterations, the five highest scoring transducers were fst047, fst058, fst083, fst102, and fst089.
Figure 5: Final FST.
The highest-scoring FST generated by the hill-climbing algorithm, after a run of 100 iterations.
For a discussion of this transducer, see Section 7.1.1.
7.1.1 NP Chunking Experiment: The Selected Transformation
Figure 5 shows the FST for the best output transformation found after 100 iterations of the hill-climbing algorithm.
Inspection of this FST reveals that it transforms the original set of three tags (I, O, and B) to six new tags: I1, I2, I3, O, B1, and B2.
Three of these tags (I1, B1, and B2) are used at the beginning of a chunk: I1 is used if the preceding tag was O; B1 is used if the preceding tag was B; and B2 is used if the preceding tag was I. This is similar to a second order model, in that it records information about both the current tag and the previous tag.
The tag O is used for all words outside of chunks.
Thus, the hill-climbing system found that increasing the window size used for O tags does not help to learn any useful constraints with neighboring tags.
Finally, two tags are used for words that are inside a chunk, but not at the beginning of the chunk: I2 and I3.
The choice of which tag should be used depends on the input feature that tests whether the current word is a comma, and the previous word was a proper noun (NNP).
At first, this might seem like an odd feature to distinguish.
But note that in the Wall Street Journal, it is quite common for proper nouns to include internal commas; but for other nouns, it is fairly uncommon.
By dividing the I tag in two based on this feature, the model can use separate distributions for these two cases.
Thus, the model avoids conflating two contexts that are significantly different from one another for the task at hand.
8 Discussion
Sequential Viterbi Models are capable of learning to model the probability of local patterns on the output structure.
But the distance that these patterns can span is limited by the model's structure.
This distance can be lengthened by moving to higher order model structures, but only at the expense of an increase in the number of model parameters, along with the data sparsity issues that can arise from that increase.
Therefore, it makes sense to be more selective about how we extend the model structure.
Using reversible output transformations, it is possible to define model structures that extend the reach of the model only where necessary.
And as we have shown here, it is possible to find a suitable output transformation for a given task by using simple search procedures.
9 Acknowledgements
We gratefully acknowledge the support of the National Science Foundation Grant NSF-0415923, Word Sense Disambiguation, the DTO-AQUAINT NBCHC040036 grant under the University of Illinois subcontract to University of Pennsylvania 2003-07911-01, NSF-ITR-0325646: Domain-Independent Semantic Interpretation, and Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022.
