We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters.
The scheme can represent any standard n-gram model and is easily combined with existing model reduction techniques such as entropy-pruning.
We demonstrate the space-savings of the scheme via machine translation experiments within a distributed language modeling framework.
1 Introduction
Language models (LMs) are a core component in statistical machine translation, speech recognition, optical character recognition and many other areas.
They distinguish plausible word sequences from a set of candidates.
LMs are usually implemented as n-gram models parameterized for each distinct sequence of up to n words observed in the training corpus.
Using higher-order models and larger amounts of training data can significantly improve performance in applications; however, the size of the resulting LM can become prohibitive.
With large monolingual corpora available in major languages, making use of all the available data is now a fundamental challenge in language modeling.
Efficiency is paramount in applications such as machine translation which make huge numbers of LM requests per sentence.
To scale LMs to larger corpora with higher-order dependencies, researchers have considered alternative parameterizations such as class-based models (Brown et al., 1992), model reduction techniques such as entropy-based pruning (Stolcke, 1998), novel representation schemes such as suffix arrays (Emami et al., 2007) and Golomb coding (Church et al., 2007), and distributed language models that scale more readily (Brants et al., 2007).
* Work completed while this author was at Google Inc.
In this paper we propose a novel randomized language model.
Recent work (Talbot and Osborne, 2007b) has demonstrated that randomized encodings can be used to represent n-gram counts for LMs with significant space-savings, circumventing information-theoretic constraints on lossless data structures by allowing errors with some small probability.
In contrast, the representation scheme used by our model encodes parameters directly.
It can be combined with any n-gram parameter estimation method and existing model reduction techniques such as entropy-based pruning.
Parameters that are stored in the model are retrieved without error; however, false positives may occur whereby n-grams not in the model are incorrectly 'found' when requested.
The false positive rate is determined by the space usage of the model.
Our randomized language model is based on the Bloomier filter (Chazelle et al., 2004).
We encode fingerprints (random hashes) of n-grams together with their associated probabilities using a perfect hash function generated at random (Majewski et al., 1996).
This paper focuses on machine translation.
However, many of our findings should transfer to other applications of language modeling.
2 Scaling Language Models
In statistical machine translation (SMT), LMs are used to score candidate translations in the target language.
These are typically n-gram models that approximate the probability of a word sequence by assuming each token to be independent of all but the $n - 1$ preceding tokens.
Parameters are estimated from monolingual corpora with parameters for each distinct word sequence of length $l \in [n]$ observed in the corpus.
Since the number of parameters grows somewhat exponentially with n and linearly with the size of the training corpus, the resulting models can be unwieldy even for relatively small corpora.
2.1 Scaling Strategies
Various strategies have been proposed to scale LMs to larger corpora and higher-order dependencies.
Model-based techniques seek to parameterize the model more efficiently (e.g. latent variable models, neural networks) or to reduce the model size directly by pruning uninformative parameters, e.g. (Stolcke, 1998), (Goodman and Gao, 2000).
Representation-based techniques attempt to reduce space requirements by representing the model more efficiently or in a form that scales more readily, e.g. (Emami et al., 2007), (Brants et al., 2007), (Church et al., 2007).
2.2 Lossy Randomized Encodings
A fundamental result in information theory (Carter et al., 1978) states that a random set of objects cannot be stored using constant space per object as the universe from which the objects are drawn grows in size: the space required to uniquely identify an object increases as the set of possible objects from which it must be distinguished grows.
In language modeling the universe under consideration is the set of all possible n-grams of length n for a given vocabulary.
Although n-grams observed in natural language corpora are not randomly distributed within this universe, no lossless data structure that we are aware of can circumvent this space-dependency on both the n-gram order and the vocabulary size.
Hence as the training corpus and vocabulary grow, a model will require more space per parameter.
However, if we are willing to accept that occasionally our model will be unable to distinguish between distinct n-grams, then it is possible to store each parameter in constant space independent of both n and the vocabulary size (Carter et al., 1978), (Talbot and Osborne, 2007a).
The space required in such a lossy encoding depends only on the range of values associated with the n-grams and the desired error rate, i.e. the probability with which two distinct n-grams are assigned the same fingerprint.
2.3 Previous Randomized LMs
Recent work (Talbot and Osborne, 2007b) has used lossy encodings based on Bloom filters (Bloom, 1970) to represent logarithmically quantized corpus statistics for language modeling.
While the approach results in significant space savings, working with corpus statistics, rather than n-gram probabilities directly, is computationally less efficient (particularly in a distributed setting) and introduces a dependency on the smoothing scheme used.
It also makes it difficult to leverage existing model reduction strategies such as entropy-based pruning that are applied to final parameter estimates.
In the next section we describe our randomized LM scheme based on perfect hash functions.
This scheme can be used to encode any standard n-gram model which may first be processed using any conventional model reduction technique.
3 Perfect Hash-based Language Models
Our randomized LM is based on the Bloomier filter (Chazelle et al., 2004).
We assume the n-grams and their associated parameter values have been precomputed and stored on disk.
We then encode the model in an array such that each n-gram's value can be retrieved.
Storage for this array is the model's only significant space requirement once constructed.¹
The model uses randomization to map n-grams to fingerprints and to generate a perfect hash function that associates n-grams with their values.
The model can erroneously return a value for an n-gram that was never actually stored, but will always return the correct value for an n-gram that is in the model.
We will describe the randomized algorithm used to encode n-gram parameters in the model, analyze the probability of a false positive, and explain how we construct and query the model in practice.
¹ Note that we do not store the n-grams explicitly and that the model's parameter set therefore cannot easily be enumerated.
We wish to encode a set of n-gram/value pairs
$$S = \{(x_1, v(x_1)), (x_2, v(x_2)), \ldots, (x_N, v(x_N))\}$$
using an array A of size M and a perfect hash function.
Each n-gram $x_j$ is drawn from some set of possible n-grams U and its associated value $v(x_j)$ from a corresponding set of possible values V.
We do not store the n-grams and their probabilities directly but rather encode a fingerprint $f(x_j)$ of each n-gram together with its associated value in such a way that the value can be retrieved when the model is queried with the n-gram $x_j$.
A fingerprint hash function $f : U \to [0, B - 1]$ maps n-grams to integers between 0 and $B - 1$.²
The array A in which we encode n-gram/value pairs has addresses of size $\lceil \log_2 B \rceil$ bits; hence B determines the amount of space used per n-gram.
There is a trade-off between space and error rate since the larger B is, the lower the probability of a false positive.
This is analyzed in detail below.
For now we assume only that B is at least as large as the range of values stored in the model, i.e. $B \geq |V|$.
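As an illustration of the fingerprint function, the following minimal sketch maps an n-gram to an integer in $[0, B - 1]$ and reports the number of bits one array cell would need. The use of BLAKE2 as the underlying hash, the names `fingerprint` and `bits_per_cell`, and the choice $B = 2^{20}$ are assumptions made for this example only.

```python
import hashlib

def fingerprint(ngram, B):
    """Map an n-gram (tuple of words) to an integer in [0, B - 1]."""
    data = " ".join(ngram).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % B

def bits_per_cell(B):
    """Bits needed to store any integer in [0, B - 1], i.e. ceil(log2 B)."""
    return (B - 1).bit_length()

B = 2 ** 20  # hypothetical: room for 8-bit values plus 12 extra bits
fp = fingerprint(("the", "cat", "sat"), B)
cell_bits = bits_per_cell(B)
```

Any hash that behaves approximately uniformly over $[0, B - 1]$ could be substituted here; a larger B costs more bits per cell but lowers the chance that an unseen n-gram is mistaken for a stored one.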
3.2 Composite Perfect Hash Functions
Figure 1: Encoding an n-gram's value in the array.
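The lookup function g combines the fingerprint of an n-gram x with the values stored in A at the k locations $h_1(x), h_2(x), \ldots, h_k(x)$ given by k hash functions:
$$g(x) = f(x) \otimes \left( \bigotimes_{i=1}^{k} A[h_i(x)] \right) \qquad (1)$$
where $\otimes$ denotes bitwise exclusive OR.
Generating an array A that encodes this function for a given set of n-grams is a significant challenge described in the following sections.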
All addresses in A are initialized to zero.
The procedure we use to ensure $g(x_j) = v(x_j)$ for all $x_j \in S$ updates a single, unique location in A for each n-gram $x_j$.
This location is chosen from among the k locations given by $h_i(x_j)$, $i \in [k]$.
Since the composite function $g(x_j)$ depends on the values stored at all k locations $A[h_1(x_j)], A[h_2(x_j)], \ldots, A[h_k(x_j)]$ in A, we must also ensure that once an n-gram $x_j$ has been encoded in the model, these k locations are not subsequently changed, since this would invalidate the encoding.
However, n-grams encoded later may reference earlier entries, and therefore locations in A can effectively be 'shared' among parameters.
In the following section we describe a randomized algorithm to find a suitable order in which to enter n-grams in the model and, for each n-gram x, determine which of the k hash functions, say $h_j$, can be used to update A without invalidating previous entries.
Given this ordering of the n-grams and the choice of hash function $h_j$ for each $x \in S$, it is clear that the following update rule will encode x in the array A so that $g(x)$ will return $v(x)$ (cf. Eq. (1)):
$$A[h_j(x)] = v(x) \otimes f(x) \otimes \bigotimes_{i=1,\, i \neq j}^{k} A[h_i(x)] \qquad (2)$$
² The analysis assumes that all hash functions are random.
³ We use $\otimes$ to denote the exclusive bitwise OR operator.
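The update rule and the lookup it is designed to satisfy can be sketched as follows. This is a minimal illustration assuming k = 3, distinct locations per n-gram, and hand-picked fingerprints, values and locations; the function names `lookup` and `update` and all numbers are hypothetical. Because exclusive OR is its own inverse, writing the value combined with the fingerprint and the other k - 1 cells into the chosen cell makes the lookup return the value.

```python
from functools import reduce
from operator import xor

def lookup(A, fp_x, locations):
    """XOR the fingerprint with the k cells of A (the lookup of Eq. (1))."""
    return reduce(xor, (A[h] for h in locations), fp_x)

def update(A, fp_x, value, locations, j):
    """Set cell locations[j] so that lookup returns value (the rule of Eq. (2))."""
    others = reduce(xor, (A[h] for i, h in enumerate(locations) if i != j), 0)
    A[locations[j]] = value ^ fp_x ^ others

A = [0] * 8  # all addresses initialized to zero

# First n-gram: fingerprint 0b1011, value 5, writes cell 0.
update(A, fp_x=0b1011, value=5, locations=[0, 3, 6], j=0)
# Second n-gram shares cell 0 but writes only cell 2,
# so the first encoding is left intact.
update(A, fp_x=0b0110, value=9, locations=[0, 2, 4], j=1)

first = lookup(A, 0b1011, [0, 3, 6])
second = lookup(A, 0b0110, [0, 2, 4])
```

The second update reads cell 0 without modifying it, which is the sense in which locations are shared among parameters.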
3.4 Finding an Ordered Matching
We must now select one of the k hash functions $h_i$, $i \in [k]$, for each n-gram $x_j \in S$ and an order in which to apply the update rule Eq. (2) so that $g(x_j)$ maps $x_j$ to $v(x_j)$ for all n-grams in S.
This problem is equivalent to finding an ordered matching in a bipartite graph whose LHS nodes correspond to n-grams in S and RHS nodes correspond to locations in A.
The graph initially contains edges from each n-gram to each of the k locations in A given by $h_1(x_j), h_2(x_j), \ldots, h_k(x_j)$ (see Fig. 2).
The algorithm uses the fact that any RHS node that has degree one (i.e. a single edge) can be safely matched with its associated LHS node since no remaining LHS nodes can be dependent on it.
We first create the graph using the k hash functions $h_i$, $i \in [k]$, and store a list (degree_one) of those RHS nodes (locations) with degree one.
The algorithm proceeds by removing nodes from degree_one in turn, pairing each RHS node with the unique LHS node x to which it is connected.
We then remove both nodes from the graph and push the pair $(x, h_j(x))$ onto a stack (matched), where $h_j(x)$ is the matched location.
We also remove any other edges from the matched LHS node and add any RHS nodes that now have degree one to degree_one.
The algorithm succeeds if, while there are still n-grams left to match, degree_one is never empty.
We then encode n-grams in the order given by the stack (i.e., first-in-last-out).
Since we remove each location in A (RHS node) from the graph as it is matched to an n-gram (LHS node), each location will be associated with at most one n-gram for updating.
Moreover, since we match an n-gram to a location only once the location has degree one, we are guaranteed that any other n-grams that depend on this location are already on the stack and will therefore only be encoded once we have updated this location.
Hence dependencies in g are respected and $g(x_j) = v(x_j)$ will remain true following the update in Eq. (2) for each $x_j \in S$.
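The ordered matching procedure described in this section can be sketched as follows. This is an illustrative implementation, not a transcription of Algorithm 1: the graph is given as a dictionary `edges` from each n-gram to its k candidate locations, and the function name, the `users` map, and the toy graph are assumptions of the example.

```python
from collections import defaultdict

def ordered_matching(edges, num_locations):
    """edges[x] lists the k candidate locations of n-gram x.
    Returns (x, location) pairs in encoding order, or None on failure."""
    users = defaultdict(set)            # location -> n-grams still touching it
    for x, locs in edges.items():
        for loc in locs:
            users[loc].add(x)
    degree_one = [loc for loc in range(num_locations) if len(users[loc]) == 1]
    matched = []                        # used as a stack
    unmatched = set(edges)
    while unmatched:
        if not degree_one:
            return None                 # fail: no degree-one location left
        loc = degree_one.pop()
        if len(users[loc]) != 1:        # stale entry whose n-gram is gone
            continue
        (x,) = users[loc]
        matched.append((x, loc))
        unmatched.discard(x)
        for other in set(edges[x]):     # remove every edge of the matched n-gram
            users[other].discard(x)
            if other != loc and len(users[other]) == 1:
                degree_one.append(other)
    matched.reverse()                   # pop order: last matched is encoded first
    return matched

edges = {"a": [0, 1, 2], "b": [1, 2, 3], "c": [2, 3, 4]}
order = ordered_matching(edges, num_locations=5)
```

Applying the update rule to the pairs in `order`, front to back, never overwrites a cell that an already encoded n-gram depends on, because an n-gram matched earlier (and so encoded later) was the only user of its location at the time it was matched.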
3.5 Choosing Random Hash Functions
Figure 2: The ordered matching algorithm (bipartite graph with n-grams on the left and locations on the right).
Each hash value is taken modulo M.
We generate a set of k hash functions by sampling k pairs of random numbers $(a_i, b_i)$, $i \in [k]$.
If the algorithm does not find a matching with the current set of hash functions, we re-sample these parameters and re-start the algorithm.
Since the probability of failure on a single attempt is low when $M \geq 1.23|S|$, the probability of failing multiple times is very small.
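The re-sampling loop might look as follows. This is a sketch under explicit assumptions: n-grams are represented by integer keys, each hash function has the form ((a x + b) mod P) mod M with P a large prime, and `try_build` is a hypothetical callback (for example an ordered matching routine) that returns `None` when no matching is found. None of these details should be read as the exact construction used in our system.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any key

def sample_hash_functions(k, M, rng):
    """Sample k pairs (a, b); return k functions from int keys to [0, M - 1]."""
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    return [lambda x, a=a, b=b: ((a * x + b) % P) % M for a, b in params]

def build_with_retries(keys, k, M, try_build, max_attempts=20, seed=0):
    """Re-sample the hash functions until try_build returns a non-None result."""
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        hashes = sample_hash_functions(k, M, rng)
        result = try_build(keys, hashes)
        if result is not None:
            return hashes, result, attempt
    raise RuntimeError("no matching found; consider increasing M")
```

The cap `max_attempts` is only a safeguard for the example; with M chosen as described above, a small number of attempts is expected to suffice.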
3.6 Querying the Model and False Positives
The construction we have described above ensures that for any n-gram $x_j \in S$ we have $g(x_j) = v(x_j)$, i.e., we retrieve the correct value.
To retrieve a value given an n-gram $x_j$ we simply compute the fingerprint $f(x_j)$ and the hash functions $h_i(x_j)$, $i \in [k]$, and then return $g(x_j)$ using Eq. (1).
Note that, unlike the constructions in (Talbot and Osborne, 2007b) and (Church et al., 2007), no errors are possible for n-grams stored in the model.
Hence we will not make errors for common n-grams that are typically in S.
Algorithm 1: Ordered Matching (returns matched if every n-gram has been matched, otherwise fail).
On the other hand, when querying the model with an n-gram that was not stored, i.e. with $x_j \in U \setminus S$, we may erroneously return a value $v \in V$. This occurs with probability
$$\Pr\{g(x_j) \in V \mid x_j \in U \setminus S\} = |V|/B.$$
We refer to this event as a false positive.
If V is fixed, we can obtain a false positive rate $\epsilon$ by setting B as
$$B = |V|/\epsilon.$$
We refer to the additional bits allocated to each location (i.e. $\lceil \log_2 B \rceil - \log_2 |V|$, or 3 in our example) as error bits in our experiments below.
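The relation between B, the number of values |V| and the false positive rate can be checked numerically. The sketch below assumes 8-bit values (|V| = 256) and 12 error bits purely as an example; the function names are hypothetical.

```python
import math

def false_positive_rate(num_values, B):
    """Probability |V| / B that an unseen n-gram yields a valid value."""
    return num_values / B

def fingerprint_range(num_values, epsilon):
    """Smallest B giving a false positive rate of at most epsilon."""
    return math.ceil(num_values / epsilon)

def error_bits(num_values, B):
    """Bits per location beyond those needed to store a value."""
    return (B - 1).bit_length() - math.log2(num_values)

V = 256              # assumed: 8-bit quantized values
B = V * 2 ** 12      # assumed: 12 error bits
rate = false_positive_rate(V, B)
extra = error_bits(V, B)
```

With these numbers the rate equals $2^{-12}$, so each additional error bit halves the false positive rate.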
3.7 Constructing the Full Model
When encoding a large set of n-gram/value pairs S, Algorithm 1 will only be practical if the raw data and graph can be held in memory as the perfect hash function is generated.
This makes it difficult to encode an extremely large set S into a single array A.
The solution we adopt is to split S into t smaller sets $S_i$, $i \in [t]$, that are arranged in lexicographic order.⁴
We can then encode each subset in a separate array $A_i$, $i \in [t]$, in turn in memory.
Querying each of these arrays for each n-gram requested would be inefficient and inflate the error rate since a false positive could occur on each individual array.
Instead we store an index of the final n-gram encoded in each array and, given a request for an n-gram's value, perform a binary search for the appropriate array.
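The array selection step can be sketched with a standard binary search. The example assumes that n-grams are tuples of words compared lexicographically and that `last_ngrams[i]` holds the final n-gram encoded in array i; both the representation and the function name `find_shard` are illustrative.

```python
from bisect import bisect_left

def find_shard(last_ngrams, ngram):
    """Index of the array whose range covers ngram, or None if past the end."""
    i = bisect_left(last_ngrams, ngram)
    return i if i < len(last_ngrams) else None

# Hypothetical index: the final n-gram stored in each of three arrays.
last_ngrams = [("cat", "sat"), ("mat", "on"), ("the", "dog")]

shard = find_shard(last_ngrams, ("dog", "ran"))
```

Only the selected array is then queried, so a request incurs the false positive risk of a single array rather than of all t arrays.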
Our models are consistent in the following sense:
$$(w_1, w_2, \ldots, w_n) \in S \implies (w_2, \ldots, w_n) \in S.$$
Hence we can infer that an n-gram cannot be present in the model if the $(n-1)$-gram consisting of its final $n - 1$ words has already tested false.
Following (Talbot and Osborne, 2007a) we can avoid unnecessary false positives by not querying for the longer n-gram in such cases.
Backoff smoothing algorithms typically request the longest n-gram supported by the model first, requesting shorter n-grams only if this is not found.
In our case, however, if a query is issued for the 5-gram $(w_1, w_2, w_3, w_4, w_5)$ when only the unigram $(w_5)$ is present in the model, the probability of a false positive using such a backoff procedure would not be $\epsilon$ as stated above, but rather the probability that we fail to avoid an error on any of the four queries performed prior to requesting the unigram, i.e. $1 - (1 - \epsilon)^4 \approx 4\epsilon$.
We therefore query the model first with the unigram working up to the full n-gram requested by the decoder only if the preceding queries test positive.
The probability of returning a false positive for any n-gram requested by the decoder (but not in the model) will then be at most $\epsilon$.
⁴ In our system we use subsets of 5 million n-grams which can easily be encoded using less than 2GB of working space.
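The bottom-up querying order can be sketched as follows, assuming a hypothetical `lookup` callable that returns a value for n-grams that test positive and `None` otherwise. A plain dictionary stands in for the randomized model in the example.

```python
def query_bottom_up(ngram, lookup):
    """Query suffixes of increasing length and stop at the first miss.
    Returns (suffix, value) for the longest suffix found, or None."""
    best = None
    for length in range(1, len(ngram) + 1):
        suffix = ngram[-length:]
        value = lookup(suffix)
        if value is None:
            break          # in a consistent model no longer n-gram can be present
        best = (suffix, value)
    return best

# Stand-in model: the 4-gram entry plays the role of a would-be false positive.
model = {("e",): 1, ("d", "e"): 2, ("b", "c", "d", "e"): 7}
found = query_bottom_up(("a", "b", "c", "d", "e"), model.get)
```

In this toy run the trigram misses, so the 4-gram is never requested and cannot produce an error.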
4 Experimental Set-up
4.1 Distributed LM Framework
We deploy the randomized LM in a distributed framework which allows it to scale more easily by distributing it across multiple language model servers.
We encode the model stored on each language model server using the randomized scheme.
The proposed randomized LM can encode parameters estimated using any smoothing scheme (e.g. Kneser-Ney, Katz etc.).
Here we choose to work with stupid backoff smoothing (Brants et al., 2007) since this is significantly more efficient to train and deploy in a distributed framework than a context-dependent smoothing scheme such as Kneser-Ney.
Previous work (Brants et al., 2007) has shown it to be appropriate to large-scale language modeling.
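For orientation, stupid backoff scoring can be sketched as below. The sketch assumes precomputed relative frequencies held in a dictionary `rel_freq` (hypothetical) and the backoff factor 0.4 reported by Brants et al. (2007); it illustrates the general scheme rather than the exact scoring code of our system.

```python
ALPHA = 0.4  # backoff factor reported by Brants et al. (2007)

def stupid_backoff_score(ngram, rel_freq, alpha=ALPHA):
    """Score the last word of ngram given the preceding words.
    Uses the longest suffix present in rel_freq, multiplied by alpha
    once for every word of context that had to be dropped."""
    penalty = 1.0
    for start in range(len(ngram)):
        suffix = ngram[start:]
        if suffix in rel_freq:
            return penalty * rel_freq[suffix]
        penalty *= alpha
    return 0.0

rel_freq = {("sat",): 0.01, ("cat", "sat"): 0.2}   # toy relative frequencies
score = stupid_backoff_score(("the", "cat", "sat"), rel_freq)
```

The scores are not normalized probabilities, and no context-dependent backoff weights need to be stored.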
The language model is trained on four data sets:
target: The English side of Arabic-English parallel data provided by LDC (132 million tokens).
gigaword: The English Gigaword dataset provided by LDC (3.7 billion tokens).
An initial experiment will use the Web 1T 5-gram corpus only; all other experiments will use a loglinear combination of models trained on each corpus.
The combined model is pre-compiled with weights trained on development data by our system.
4.3 Machine Translation
The SMT system used is based on the framework proposed in (Och and Ney, 2004) where translation is treated as the following optimization problem:
$$\arg\max_{e} \sum_{j} \lambda_j \Phi_j(e, f).$$
⁵ N-grams with count < 40 are not included in this data set.
This section describes three sets of experiments: first, we encode the Web 1T 5-gram corpus as a randomized language model and compare the resulting size with other representations; then we measure false positive rates when requesting n-grams for a held-out data set; finally we compare translation quality when using conventional (lossless) language models and our randomized language model.
Note that the standard practice of measuring perplexity is not meaningful here since (1) for efficient computation, the language model is not normalized; and (2) even if this were not the case, quantization and false positives would render it unnormalized.
We build a language model from the Web 1T 5-gram corpus.
Parameters, corresponding to negative logarithms of relative frequencies, are quantized to 8 bits using a uniform quantizer.
More sophisticated quantizers (e.g. Lloyd, 1982) may yield better results but are beyond the scope of this paper.
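A uniform quantizer of this kind can be sketched as follows. The value range [0, 32] and the decision to decode each code to the midpoint of its bin are assumptions of the example, as is the function name.

```python
def make_uniform_quantizer(lo, hi, bits):
    """Return (encode, decode) for a uniform quantizer on [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / levels

    def encode(x):
        idx = int((x - lo) / step)
        return min(max(idx, 0), levels - 1)   # clamp to a valid code

    def decode(code):
        return lo + (code + 0.5) * step       # midpoint of the bin

    return encode, decode

# Assumed range for negative log relative frequencies; 8 bits give 256 codes.
encode, decode = make_uniform_quantizer(0.0, 32.0, bits=8)
code = encode(3.3)
approx = decode(code)
```

For values inside the range, the reconstruction error is at most half a step, here 0.0625.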
Table 1 provides some statistics about the corpus.
We first encode the full set of n-grams, and then a version that is reduced to approximately 1/3 of its original size using entropy pruning (Stolcke, 1998).
[Table 2 entries: bytes/n-gram for the Full Set and Entropy Pruned models under the LDC gzip'd, Block Encoding, and Randomized encodings.]
Table 2: Web 1T 5-gram language model sizes with different encodings.
"Randomized" uses 12 error bits.
... commonly used trie encoding.
Our method is the only one to use the same amount of space per parameter for both full and entropy-pruned models.
All n-grams explicitly inserted into our randomized language model are retrieved without error; however, n-grams not stored may be incorrectly assigned a value resulting in a false positive.
Section 3 analyzed the theoretical error rate; here, we measure error rates in practice when retrieving n-grams for approximately 11 million tokens of previously unseen text (news articles published after the training data had been collected).
We measure this separately for all n-grams of order 2 to 5 from the same text.
2.46/3.08/3.69 bytes/n-gram).
Using such a large language model results in a large fraction of known n-grams in new text.
Table 3 shows, e.g., that almost half of all 5-grams from the new text were seen in the training data.
Column (1) in Table 4 shows the number of false positives that occurred for this test data.
Column (2) shows this as a fraction of the number of unseen n-grams in the data.
This number should be close to $2^{-b}$ where b is the number of error bits (i.e. 0.003906 for 8 bits and 0.000244 for 12 bits).
The error rates for bigrams are close to their expected values.
The numbers are much lower for higher n-gram orders due to the use of sanity checks (see Section 3.8).
Table 3: Number of n-grams in test set and percentages of n-grams that were seen/unseen in the training data.
[Table 4 columns: (1) false pos., (2) false pos. / unseen, (3) false pos. / total; reported for 8 and 12 error bits.]
Table 4: False positive rates with 8 and 12 error bits.
The overall fraction of n-grams requested for which an error occurs is of most interest in applications.
This is shown in Column (3) and is around a factor of 4 smaller than the values in Column (2).
On average, we expect to see 1 error in around 2,500 requests when using 8 error bits, and 1 error in 40,000 requests with 12 error bits (see "total" row).
5.3 Machine Translation
We run an improved version of our 2006 NIST MT Evaluation entry for the Arabic-English "Unlimited" data track.⁶
The language model is the same one as in the previous section.
Table 5 shows baseline translation BLEU scores for a lossless (non-randomized) language model with parameter values quantized into 5 to 8 bits.
We use MT04 data for system development, with MT05 data and MT06 ("NIST" subset) data for blind testing.
As expected, results improve when using more bits.
There seems to be little benefit in going beyond 8 bits.
Table 5: Baseline BLEU scores with lossless n-gram model and different quantization levels (bits).
Figure 3: BLEU scores on the MT05 data set.
Overall, our baseline results compare favorably to those reported on the NIST MT06 web site.
We now replace the language model with a randomized version.
Fig. 3 shows BLEU scores for the MT05 evaluation set with parameter values quantized into 5 to 8 bits and 8 to 16 additional 'error' bits.
Figure 4 shows a similar graph for MT06 data.
We again see improvements as quantization uses more bits.
There is a large drop in performance when reducing the number of error bits from 10 to 8, while increasing it beyond 12 bits offers almost no further gains with scores that are almost identical to the lossless model.
Using 8-bit quantization and 12 error bits results in an overall requirement of $(8 + 12) \times 1.23 = 24.6$ bits $\approx 3.08$ bytes per n-gram.
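The same arithmetic can be applied to other settings. The short sketch below assumes 8-bit values and the array overhead factor 1.23 used above; the function name is hypothetical.

```python
def bytes_per_ngram(value_bits, error_bits, overhead=1.23):
    """Space per n-gram: (value bits + error bits) * overhead, in bytes."""
    return (value_bits + error_bits) * overhead / 8

# 8-bit values with 8, 12 and 16 error bits.
sizes = [round(bytes_per_ngram(8, b), 3) for b in (8, 12, 16)]
```

The three settings come to about 2.46, 3.075 and 3.69 bytes per n-gram; the middle figure is the one quoted as 3.08 above.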
All runs use the sanity checks described in Section 3.8.
Without sanity checks, scores drop, e.g. by 0.002 for 8-bit quantization and 12 error bits.
Randomization and entropy pruning can be combined to achieve further space savings with minimal loss in quality as shown in Table 6.
The BLEU score drops by between 0.0007 and 0.0018 while the
Figure 4: BLEU scores on MT06 data ("NIST" subset).
[Table 6 rows: unpruned block, unpruned rand, pruned block, pruned rand.]
Table 6: Combining randomization and entropy pruning.
All models use 8-bit values; "rand" uses 12 error bits.
We have presented a novel randomized language model based on perfect hashing.
It can associate arbitrary parameter types with n-grams.
Values explicitly inserted into the model are retrieved without error; false positives may occur but are controlled by the number of bits used per n-gram.
The amount of storage needed is independent of the size of the vocabulary and the n-gram order.
Lookup is very efficient: the values of 3 cells in a large array are combined with the fingerprint of an n-gram.
Experiments have shown that this randomized language model can be combined with entropy pruning to achieve further memory reductions; that error rates occurring in practice are much lower than those predicted by theoretical analysis due to the use of runtime sanity checks; and that the same translation quality as a lossless language model representation can be achieved when using 12 'error' bits, resulting in approximately 3 bytes per n-gram (this includes one byte to store parameter values).
