Using the constraints of chunk correspondence between source language (SL)1 and target language (TL), our algorithm can dramatically reduce the search space, support a time-synchronous DP algorithm, and lead to highly consistent chunking.
Introduction
We address here the problem of structure alignment, which accepts as input a sentence pair,
† This work was done while the author was visiting Microsoft Research Asia.
1 In this paper, we take English-Chinese parallel text as an example; the method, however, can easily be extended to other language pairs.
Ming Zhou, Microsoft Research Asia, Beijing, mingzhou@microsoft.com
Chang-Ning Huang, Microsoft Research Asia, Beijing, 100080, P.R.C., cnhuang@microsoft.com
and produces as output the parsed structures of both sides with correspondences between them.
The structure alignment can be used to support machine translation and cross language information retrieval by providing extended phrase translation lexicon and translation templates.
The popular methods for structure alignment try to align hierarchical structures like sub-trees with parsing technology.
However, the alignment accuracy cannot be guaranteed since no parser can handle all authentic sentences very well.
Furthermore, the strategies which were usually used for structure alignment suffer from serious shortcomings.
For instance, parse-to-parse matching, which regards parsing and alignment as separate and successive procedures, suffers from the inconsistency between the grammars of different languages.
Bilingual parsing, which looks upon parsing and alignment as a simultaneous procedure, needs an extra 'bilingual grammar'.
It is, however, difficult to write a complex 'bilingual grammar'.
In this paper, a new statistical method called "bilingual chunking" for structure alignment is proposed.
Using the constraints of chunk correspondence between source language (SL) and target language (TL), our algorithm can dramatically reduce the search space, support a time-synchronous DP algorithm, and lead to highly consistent chunking.
The experimental results on English-Chinese structure alignment show that our model achieves 90% precision for chunking and 87% precision for chunk alignment.
1 Related Work
Most of the previous work conducts structure alignment with complex, hierarchical structures, such as phrase structures (e.g., Kaji, Kida & Morimoto, 1992) or dependency structures (e.g., Matsumoto et al. 1993; Grishman, 1994; Meyers, Yangarber & Grishman 1996; Watanabe, Kurohashi & Aramaki 2000).
However, the mismatch between complex structures across languages and the poor accuracy of available parsers hinder structure alignment algorithms from producing highly accurate results.
A straightforward strategy for structure alignment is parse-to-parse matching, which regards the parsing and alignment as two separate and successive procedures.
First, parsing is conducted on each language, respectively.
Then the corresponding structures in the different languages are aligned (e.g., Kaji, Kida & Morimoto 1992; Matsumoto et al. 1993; Grishman 1994; Meyers, Yangarber & Grishman 1996; Watanabe, Kurohashi & Aramaki 2000).
Unfortunately, automatic parse-to-parse matching has some weaknesses as described in Wu (2000).
For example, grammar inconsistency exists across languages; and it is hard to handle multiple alignment choices.
To deal with the difficulties in parse-to-parse matching, Wu (1997) utilizes inversion transduction grammar (ITG) for bilingual parsing.
The bilingual parsing approach looks upon parsing and alignment as a single procedure, which simultaneously encodes both the parsing and the transferring information.
It is, however, difficult to write a broad coverage 'bilingual grammar' for bilingual parsing.
2 Structure Alignment Using Bilingual Chunking
The chunks we use are extracted from the Treebank.
When converting a tree to the chunk sequence, the chunk types are based on the syntactic category part of the bracket label.
Roughly, a chunk contains everything to the left of and including the syntactic head of the constituent of the same name.
Besides the head, a chunk also contains pre-modifiers, but no post-modifiers or arguments (Erik, 2000).
Using the chunk as the alignment structure, we can sidestep problems such as PP attachment and structure mismatch across languages, and therefore achieve high chunking accuracy.
Using bilingual chunking, we can get both high chunking accuracy and high chunk alignment accuracy by making the SL chunking process and the TL chunking process constrain and improve each other.
Our 'bilingual chunking' model for structure alignment uses the chunk as the alignment structure and comprises three integrated components: the chunking models of both languages and the crossing constraint.
(See Fig. 1.)

Fig. 1: Three components of our model: the SL chunking model (integrated with POS tagging), the crossing constraint, and the TL chunking model (integrated with POS tagging)
The crossing constraint requires that a chunk in one language correspond to at most one chunk in the other language.
For instance, in Fig. 2 (the dashed lines represent the word alignments; the brackets indicate the chunk boundaries), the phrase "the first man" is a monolingual chunk; it should, however, be divided into "the first" and "man" to satisfy the crossing constraint.
Fig. 2: The crossing constraint
Using the crossing constraint, illegal chunk candidates can be removed during the chunking process.
The chunking models for both languages work successively under the crossing constraint.
Usually, chunking involves two steps: (1) POS tagging, and (2) chunking.
To effectively alleviate the influence of POS tagging errors on the chunking result, we integrate the two steps into a unified model for an optimal solution.
This integration strategy has been proven to be effective for base NP identification (Xun, Huang & Zhou, 2001).
Consequently, our model works in three successive steps: (1) word alignment between SL and TL sentences; (2) source language chunking; (3) target language chunking.
Both (2) and (3) should work under the supervision of crossing constraints.
2.2 The Crossing Constraint
According to Wu (1997), the crossing constraint can be defined as follows for non-recursive phrases: suppose two words w1 and w2 in language-1 correspond to two words v1 and v2 in language-2, respectively, and w1 and w2 belong to the same phrase of language-1.
Then v1 and v2 must also belong to the same phrase of language-2.
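To make the constraint concrete, here is a small sketch (our own illustration, not the authors' code) that rejects a candidate source chunk when a word outside it aligns into the chunk's target span; the function name and the alignment layout (a dict from source word index to a list of target indices) are assumptions:

```python
def violates_crossing(chunk, alignment, n_src):
    """Return True if a candidate source chunk breaks the crossing
    constraint: some word outside the chunk aligns into the target
    span covered by the chunk's own words."""
    i, j = chunk  # inclusive word indices of the candidate chunk
    inside = [t for s in range(i, j + 1) for t in alignment.get(s, [])]
    if not inside:
        return False  # an unaligned chunk imposes no constraint
    lo, hi = min(inside), max(inside)
    for s in range(n_src):
        if i <= s <= j:
            continue
        if any(lo <= t <= hi for t in alignment.get(s, [])):
            return True  # an outside word aligns into the chunk's span
    return False
```

In the spirit of the Fig. 2 example, a chunk like "the first man" would be rejected when a word outside it aligns between the target positions of "the first" and "man".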
We can benefit from applying the crossing constraint in the following three aspects:
• Consistent chunking from the viewpoint of alignment.
For example, in Fig. 2, "the first man" should be divided into "the first" and "man" for consistency with the corresponding Chinese chunks.
• Search space reduction.
The chunking space is reduced by ruling out illegal fragments like "the first man", and the alignment space is reduced by confining legal fragments like "the first" to correspond only to the Chinese fragments anchored by the word alignments.
• Time-synchronous algorithms for structure alignment.
Previously, time-synchronous algorithms could not be used because of the word permutation problem.
Under the crossing constraint, however, such algorithms (for example, dynamic programming) can be used for both chunking and alignment.
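As an illustration of this point, the following sketch runs a time-synchronous DP over chunk end positions, skipping candidates rejected by a legality test; the `chunk_score` and `is_legal` callbacks are hypothetical stand-ins for the paper's chunking model and crossing-constraint check:

```python
def best_chunking(n_words, chunk_score, is_legal, max_len=4):
    """Time-synchronous DP: best[j] holds the score of the best
    segmentation of words 0..j-1 into legal chunks."""
    NEG = float("-inf")
    best = [NEG] * (n_words + 1)
    back = [None] * (n_words + 1)
    best[0] = 0.0
    for j in range(1, n_words + 1):
        for i in range(max(0, j - max_len), j):
            chunk = (i, j - 1)  # inclusive word span of the last chunk
            if best[i] == NEG or not is_legal(chunk):
                continue  # the crossing constraint prunes this candidate
            s = best[i] + chunk_score(i, j - 1)
            if s > best[j]:
                best[j], back[j] = s, i
    chunks, j = [], n_words
    while j > 0:  # follow back-pointers to recover the segmentation
        i = back[j]
        chunks.append((i, j - 1))
        j = i
    return list(reversed(chunks)), best[n_words]
```

With a score that favors longer chunks and a legality test forbidding any chunk spanning a crossing point, the DP settles on the finer segmentation, mirroring how "the first man" is split in Fig. 2.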
2.3 Mathematical Formulation
An English sentence is represented as a word sequence e = w^e_1 w^e_2 ... w^e_l, where l is the sentence length. A sequence of chunks can be represented as

  B_e = n^e_1 n^e_2 ... n^e_{l'}

where n^e_i denotes the i-th chunk type of e, and l' is the number of chunks in e. Similarly, a Chinese sentence is represented as c = w^c_1 w^c_2 ... w^c_m with chunk sequence B_c = n^c_1 n^c_2 ... n^c_{m'}, where m denotes the word number of c, and m' is the number of Chinese chunks in c.

Let b_i denote the i-th positional tag; b_i can be the beginning of a chunk, inside a chunk, or outside any chunk. With T_e and T_c denoting the POS tag sequences of e and c, the most probable result is expressed as

  (B_e, T_e, B_c, T_c)* = argmax p(B_e, T_e, B_c, T_c, A | e, c, α)    (1)

where A is the alignment between B_e and B_c, and α refers to the crossing constraint. Equation (1) can be further derived into

  p(B_e, T_e, B_c, T_c, A | e, c, α) ≈ p(T_e | e, α) · p(B_e | T_e, e, α) · p(T_c | c, B_e, T_e, e, α) · p(B_c | c, T_c, B_e, T_e, e, α)    (2)

In this formula, p(T_e | e, α) aims to determine the best POS tag sequence for e; p(B_e | T_e, e, α) aims to determine the best chunk sequence from it; p(T_c | c, B_e, T_e, e, α) aims to decide the best POS tag sequence for c based on the English POS sequence; and p(B_c | c, T_c, B_e, T_e, e, α) aims to decide the best Chinese chunking result based on the Chinese POS sequence and the English chunk sequence.
In practice, in order to reduce the search space, only N-best results of each step are retained.
Determining the N-Best English POS Sequences
The HMM based POS tagging model (Kupiec 1992) with the trigram assumption is used to provide possible POS candidates for each word in terms of the N-best lattice.
Determining the N-best English Chunking Result
This step finds the best chunk sequence based on the N-best POS lattice by decomposing the chunking model into two sub-models: (1) an inter-chunk model; (2) an intra-chunk model.
Based on the trigram assumption, the first part (the inter-chunk model) can be written as

  p(B_e | T_e, e, α) ≈ ∏_{i=1..l'} p(n^e_i | n^e_{i-2}, n^e_{i-1}, α)

Here, the crossing constraint α removes the illegal candidates. The second part (the intra-chunk model) can be further derived based on two assumptions: (1) bigram for the English POS transition inside a chunk; (2) the first POS tag of a chunk only depends on the previous two tags. Thus

  p(T_e | B_e, e, α) ≈ ∏_{i=1..l'} [ p(t^e_{i,1} | t_{i,-2}, t_{i,-1}) · ∏_{j=2..x_i} p(t^e_{i,j} | t^e_{i,j-1}) ]

where x_i is the number of words that the i-th English chunk contains, and t_{i,-2}, t_{i,-1} refer to the two tags before t^e_{i,1}. The third part is derived based on the assumption that an English word w^e_i depends only on its POS tag. The inter-chunk probability and the intra-chunk probability are combined with a normalization coefficient β, whose value is 0.5 in our experiment.
Deciding the Chinese N-best POS Sequences
The N-best Chinese POS sequences are obtained by considering four factors: (1) tag transition probability; (2) tag translation probability; (3) lexical generation probability; (4) lexicon translation probability.
The tag model combines the POS tag transition probability with the POS tag translation probability, where conn is the word alignment result. The lexical model combines the lexical generation probability with the lexical translation probability.
We assume the word translation probability is 1 since we are using the word alignment result.
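A toy sketch of how these factors might be combined (our own illustration, not the authors' implementation; the probability tables, backoff constant, and the `conn` layout are hypothetical):

```python
import math

def chinese_tag_logprob(tags, words, eng_tags, conn,
                        p_trans, p_tagtrans, p_lex):
    """Log-probability of a candidate Chinese POS sequence combining
    three of the four factors: trigram tag transition, POS tag
    translation from the aligned English tag, and lexical generation.
    (The word translation probability is taken as 1, as in the text.)"""
    logp = 0.0
    for i, t in enumerate(tags):
        prev2 = tags[i - 2] if i >= 2 else "<s>"
        prev1 = tags[i - 1] if i >= 1 else "<s>"
        logp += math.log(p_trans.get((prev2, prev1, t), 1e-9))
        logp += math.log(p_lex.get((words[i], t), 1e-9))
        j = conn.get(i)  # index of the aligned English word, if any
        if j is not None:
            logp += math.log(p_tagtrans.get((eng_tags[j], t), 1e-9))
    return logp
```

With such a scorer, a candidate tag sequence that agrees with the aligned English tags outranks one that does not, which is the point of borrowing the English POS information.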
Compared with a typical HMM-based tagger, our model also utilizes the POS tag information of the other language.
Obtaining the Best Chinese Chunking Result
Similar to the English chunking model, the Chinese chunking model also includes (1) inter-chunk model; (2) intra-chunk model.
They are simplified, however, because of limited training data.
Under simplifying assumptions, including a bigram model for the POS tag transition inside a chunk and a trigram model for the POS tag transition between chunks, we obtain the Chinese chunking probability, where l'_i is the word number of the i-th Chinese phrase.
2.4 Training and Testing Resources

We use three kinds of resources for training and testing: a) the WSJ part of the Penn Treebank II corpus (Marcus, Santorini & Marcinkiewicz 1993), where sections 00-19 are used as the training data and sections 20-24 as the test data; b) the HIT Treebank2, containing 2,000 sentences; c) the HIT bilingual corpus3, containing 20,000 sentence-pairs (in a general domain) annotated with POS and word alignment information.
We used 19,000 sentence-pairs for training and 1,000 for testing.
These 1000 sentence-pairs are manually chunked and aligned.
From the Penn Treebank, English chunks were extracted with the conversion tool (http://lcg-www.uia.ac.be/conll2000/chunking).
From the HIT Treebank, Chinese chunks were extracted with a conversion tool implemented by ourselves.
We can obtain an English chunk bank and a Chinese chunk bank.
With the chunk dataset obtained above, the parameters were estimated with Maximum Likelihood Estimation.
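As a minimal illustration of the MLE step, the following sketch estimates the chunk-type trigram by relative frequency from a toy chunk bank; the function name and data layout are our own assumptions:

```python
from collections import Counter

def mle_trigram(chunk_sequences):
    """Maximum Likelihood Estimation of the chunk-type trigram
    p(n_i | n_{i-2}, n_{i-1}): count each trigram and divide by the
    count of its two-type history."""
    tri, bi = Counter(), Counter()
    for seq in chunk_sequences:
        padded = ["<s>", "<s>"] + list(seq)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    return {k: v / bi[k[:2]] for k, v in tri.items()}
```

The same counting scheme applies to the other conditional probabilities in the model, each normalized by the count of its conditioning context.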
The POS tag translation probability in equation (9) was estimated from c).
The English part-of-speech tag set is the same as that of the Penn Treebank, and the Chinese tag set is the same as that of the HIT Treebank.
The Chinese chunk types include, among others, BVP (verb phrase), BMP (quantifier phrase), BPP (prepositional phrase) and O (words outside any other chunk).
3 Experimental Results
We conducted experiments to evaluate (1) the overall accuracy; (2) the comparison with isolated strategy; (3) the comparison with a score-function approach.
The word aligner developed by Wang et al. (2001) was used to provide word alignment anchors.
The 1000 sentence-pairs described in section 2.4 were used as evaluation standard set.
The result is evaluated in terms of chunking precision and recall, as well as alignment precision and recall, defined as follows:

  Chunking Pre. = (# chunks correctly identified) / (# chunks identified)
  Chunking Rec. = (# chunks correctly identified) / (# chunks that should be identified)
  Alignment Pre. = (# Eng. chunks correctly aligned) / (# Eng. chunks aligned)
  Alignment Rec. = (# Eng. chunks correctly aligned) / (# Eng. chunks that should be aligned)
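The chunking metrics can be sketched as follows (our own illustration; an exact boundary-and-type match is assumed as the correctness criterion):

```python
def chunk_prf(gold, predicted):
    """Chunking precision and recall over chunks represented as
    (start, end, type) triples; a chunk counts as correctly
    identified only on an exact boundary-and-type match."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

The alignment metrics follow the same pattern, with aligned English chunk pairs in place of chunks.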
As described in section 2.3, N-best candidates were selected in each step.
In our experiment, N was varied up to 7.
Table 1 shows the English chunking, Chinese chunking, and alignment results with different N. When N=4, we got the best results: 93.48% for English chunking, 89.93% for Chinese chunking, and 87.05% for alignment.
2 http://mtlab.hit.edu.cn/download/4.TXT, created by Harbin Institute of Technology.
Table 2 shows the results of individual Chinese chunk types.
The second column is the percentage that each type occupies among all the Chinese chunks.

Table 2: accuracy of Chinese chunk types

Table 3 shows the results of individual English chunk types.
The last column shows the alignment precision of each English chunk type.

Table 3: accuracy of English chunk types

The accuracies of several chunk types, including O, are around 90% for both Chinese and English.
This reflects that the compositional rules of these chunk types are very regular.

3.2 Chunking Evaluation: Comparison with Isolated Strategy
We now compare with the isolated strategy, which conducts chunking for English and Chinese separately; we call this experiment M.
We next add the crossing constraint to M: each language is chunked under the crossing constraint, without considering the chunking procedure of the other language.
We call this experiment M+C.
Both M and M+C are compared with our integrated model, which we call I.
Table 4: chunking accuracies (English and Chinese) of different approaches
Table 4 indicates the contribution of the crossing constraint and our integrated strategy.
Comparing M+C with M, we see that the accuracies (precision and recall) of both languages rise.
Comparing I with M+C, the accuracies rise again.
Table 5: search space (number of chunk candidates) of different approaches
In Table 5, note that the search spaces of M+C and I are the same, because both adopt the crossing constraint.
Comparing both I and M+C with M, we see that the search space is reduced by 21% ((59790-46937)/59790) for English, by 74% ((57043-14746)/57043) for Chinese, and by 47% ((59790+57043-46937-14746)/(59790+57043)) overall.
3.3 Alignment Evaluation: Comparing with Score Function Approach
The score-function approach is usually used to select the best target language correspondence for a source language fragment.
Here, we call it SF.
First, we parse the English side under the crossing constraint (as in the M+C case in section 3.2).
Then a score function is used to find the target correspondence for each English chunk.
The score function is:
  SF = p(m | l) · p(Dk | m, l) · p(Dj | m, l)

where m and l are the lengths of the English chunk and its corresponding Chinese chunk, respectively; Dk is the difference in the number of content words between the two chunks, and Dj is the difference in the number of functional words.
This function achieves the best performance among several lexicalized score functions in Wang et al. (2001).
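A minimal sketch of this selection procedure (our own illustration; chunks are reduced to content/functional word counts, and the probability tables are toy dictionaries, not the estimates of Wang et al.):

```python
def best_target_chunk(eng_chunk, candidates, p_len, p_dk, p_dj):
    """Pick the Chinese chunk maximizing
    SF = p(m | l) * p(Dk | m, l) * p(Dj | m, l).
    A chunk is a (content words, functional words) count pair."""
    l = sum(eng_chunk)
    best, best_score = None, -1.0
    for cand in candidates:
        m = sum(cand)
        dk = abs(eng_chunk[0] - cand[0])  # content-word difference
        dj = abs(eng_chunk[1] - cand[1])  # functional-word difference
        score = (p_len.get((m, l), 1e-9)
                 * p_dk.get((dk, m, l), 1e-9)
                 * p_dj.get((dj, m, l), 1e-9))
        if score > best_score:
            best, best_score = cand, score
    return best
```

Unlike the integrated model, this procedure scores each candidate in isolation, which is why it cannot exploit the mutual constraints between the two chunking processes.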
The alignment result is shown in table 6.
Table 6: finding the target correspondence
The comparison between SF and I indicates that our integrated model obviously outperforms the score function approach in the aspect of finding the target alignment for source language chunks.
Conclusion
A new statistical method called "bilingual chunking" for structure alignment has been proposed.
Unlike the existing approaches, which align hierarchical structures like sub-trees, our method conducts alignment on chunks.
The alignment is accomplished through a simultaneous bilingual chunking algorithm.
Using the constraints of chunk correspondence between source language (SL) and target language (TL), our algorithm can dramatically reduce the search space, support a time-synchronous DP algorithm, and lead to highly consistent chunking.
Furthermore, by unifying POS tagging and chunking in the search process, our algorithm effectively alleviates the influence of POS tagging errors on the chunking result.
The experimental results on English-Chinese structure alignment show that our model achieves 90% precision for chunking and 87% precision for chunk alignment.
Compared with the isolated strategy, our method achieves much higher precision and recall for bilingual chunking.
Compared with the score function approach, our method achieves much higher precision and recall for chunk alignment.
In the future, we will conduct further research, such as inner-phrase translation modeling, transfer grammar induction, and bilingual pattern learning, based on the results of our method.
