This paper proposes a new method for automatic acquisition of Chinese bracketing knowledge from English-Chinese sentence-aligned bilingual corpora.
Bilingual sentence pairs are first aligned in syntactic structure by combining English parse trees with a statistical bilingual language model.
Chinese bracketing knowledge is then extracted automatically.
Preliminary experiments show that the automatically learned knowledge accords well with manually annotated brackets.
The proposed method is particularly useful for acquiring bracketing knowledge for a less-studied language that lacks the tools and resources available for a better-studied second language.
Although this paper discusses experiments with Chinese and English, the method is also applicable to other language pairs.
Introduction
The past few years have seen great success in the automatic acquisition of monolingual parsing knowledge and grammars.
The availability of large tagged and syntactically bracketed corpora, such as the Penn Treebank, makes it possible to extract syntactic structures and grammar rules automatically (Marcus 1993).
Substantial improvements have been made in parsing Western languages such as English, and many powerful models have been proposed (Brill 1993, Collins 1997).
However, very limited progress has been achieved for Chinese.
Knowledge acquisition is a bottleneck for real applications of Chinese parsing.
While some methods have been proposed to learn syntactic knowledge from annotated Chinese corpora, most of them depend on annotated or partially annotated data (Zhou 1997, Streiter 2000).
Due to the limited availability of annotated Chinese corpora, tests of these methods are still small in scale.
Although some institutions and universities are currently engaged in building Chinese treebanks, no large-scale annotated corpus has been published so far, because of the complexity of Chinese syntactic structure and the difficulty of corpus annotation (Chen 1996).
This paper proposes a novel method to facilitate Chinese treebank construction.
Based on English-Chinese bilingual corpora and better English parsing, this method obtains Chinese bracketing information automatically via a bilingual model and word alignment results.
The main idea of the method is that we may acquire knowledge for a language lacking a rich collection of resources and tools from a second language that is rich in them.
The rest of this paper is organized as follows.
Section 1 introduces a bilingual language model.
Section 2 proposes a bilingual parsing method supervised by English parsing.
Section 3 extracts Chinese bracketing knowledge based on the bilingual parsing.
Section 4 gives the evaluation and discussion.
We conclude with a discussion of future work.
1 A bilingual language model - ITG
Wu (1997) has proposed a bilingual language model called Inversion Transduction Grammar (ITG), which can be used to parse bilingual sentence pairs simultaneously.
We will give a brief description here.
For details, please refer to Wu (1995, 1997).
The Inversion Transduction Grammar is a bilingual context-free grammar that generates two matched output languages (referred to as L1 and L2).
It also differs from standard context-free grammars in that the right-hand side of an ITG production can be oriented in two ways: straight or inverted.
The following are two examples of ITG productions:
C -> [A B]
C -> <A B>
Each nonterminal symbol stands for a pair of matched strings.
For example, the nonterminal A stands for the string-pair (A1, A2), where A1 is a sub-string in L1 and A2 is A1's corresponding translation in L2.
Similarly, (B1, B2) denotes the string-pair generated by B.
The operator [ ] performs the usual concatenation, so that C -> [A B] yields the string-pair (C1, C2), where C1=A1B1 and C2=A2B2.
On the other hand, the operator <> performs straight concatenation for language 1 but reversed concatenation for language 2, so that C -> <A B> yields C1=A1B1, but C2=B2A2.
The inverted concatenation operator permits the extra flexibility needed to accommodate many kinds of word-order variation between source and target languages (Wu 1995).
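As a minimal sketch of how the two operators combine string-pairs, the following Python fragment models a string-pair as a tuple of two token lists. The function names `straight` and `inverted` and the placeholder tokens are illustrative assumptions and are not part of the ITG formalism itself.

```python
from typing import List, Tuple

# A string-pair: (tokens in language 1, tokens in language 2).
StringPair = Tuple[List[str], List[str]]


def straight(a: StringPair, b: StringPair) -> StringPair:
    """C -> [A B]: concatenate in the same order in both languages."""
    return (a[0] + b[0], a[1] + b[1])


def inverted(a: StringPair, b: StringPair) -> StringPair:
    """C -> <A B>: same order in language 1, reversed order in language 2."""
    return (a[0] + b[0], b[1] + a[1])


# Placeholder tokens (hypothetical): A1 and B1 are language-1 strings,
# A2 and B2 are their translations in language 2.
A: StringPair = (["A1"], ["A2"])
B: StringPair = (["B1"], ["B2"])

print(straight(A, B))  # (['A1', 'B1'], ['A2', 'B2'])
print(inverted(A, B))  # (['A1', 'B1'], ['B2', 'A2'])
```

Only the language-2 side differs between the two results, which is the extra word-order flexibility the inverted operator provides.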
There are also lexical productions of the following form in ITG:
A -> x/y
This means that a symbol x in language L1 is translated by the symbol y in language L2.
Either x or y may be the null symbol ε, which means that a word may have no counterpart string on the other side of the bitext.
ITG-based parsing matches constituents for an input sentence-pair.
For example, Figure 1 shows an ITG parse tree for an English-Chinese sentence-pair.
The inverted production is indicated by a horizontal line in the parse tree.
The English text is read in the usual depth-first, left-to-right order, but for the Chinese text, a horizontal line means the right sub-tree is traversed before the left.
The generated parsing results are:
We can also represent the common structure of the two sentences more clearly and compactly with the aid of <> notation:
where the horizontal line from Figure 1 corresponds to the <> level of bracketing.
Figure 1 Inversion Transduction Grammar parsing
Any ITG can be converted to a normal form, where all productions are either lexical productions or binary-fanout nonterminal productions (Wu 1997).
If a probability is associated with each production, the ITG is called a Stochastic Inversion Transduction Grammar (SITG).
2 English parsing supervised bilingual bracketing
Because of the difficulty in finding a suitable bilingual syntactic grammar for Chinese and English, a practical ITG is the generic Bracketing Inversion Transduction Grammar (BTG) (Wu 1995).
BTG is a simplified ITG that has only one nonterminal and does not use any syntactic grammar.
A Statistical BTG (SBTG) grammar is as follows:
\[
A \xrightarrow{a} [A\,A], \qquad
A \xrightarrow{a} \langle A\,A \rangle, \qquad
A \xrightarrow{b_{ij}} u_i / v_j, \qquad
A \xrightarrow{b_{i\epsilon}} u_i / \epsilon, \qquad
A \xrightarrow{b_{\epsilon j}} \epsilon / v_j
\]
SBTG employs only one nonterminal symbol A that can be used recursively.
Here, "a" denotes the probability of the syntactic rules.
However, since constituent categories are not differentiated in BTG, it has no practical effect here and can be set to an arbitrary constant.
The remaining productions are all lexical.
b_ij is the translation probability that source word u_i translates into target word v_j.
It can be obtained using a statistical word-translation model (Melamed 2000) or word alignment (Lu 2001a).
The last two productions denote that a word in one language has no counterpart on the other side of the bitext.
A small constant can be chosen for the probabilities b_iε and b_εj.
In BTG, no language-specific syntactic grammar is used.
The maximum-likelihood parser selects the parse tree that best satisfies the combined lexical translation preferences, as expressed by the bij probabilities.
Because the expressiveness characteristics of ITG naturally constrain the space of possible matching in a highly appropriate fashion, BTG achieves encouraging results for bilingual bracketing using a word-translation lexicon alone (Wu 1997).
Since no syntactic knowledge is used in SBTG, the grammaticality of the output cannot be guaranteed.
In particular, if the corresponding constituents appear in the same order in both languages, whether straight or inverted, then lexical matching does not provide the discriminative leverage needed to identify the sub-constituent boundaries.
For example, consider an English-Chinese sentence pair:
(4) English: That old teacher is our adviser.
Chinese:
Using SBTG, the bilingual bracketing result is:
The result is not consistent with the expected syntactic structure.
In this case, grammatical information about one or both of the languages can be very helpful.
For example, if we know the English parsing result shown in (6), then the bilingual bracketing can be determined easily; the result should be (7).
From the example, we can see that if a parser is available for one of the languages, the induced bilingual bracketing result is more accurate.
English parsing methods have been well studied and many powerful models have been proposed.
It will be helpful to make use of English parsing results.
In the following, we will propose a method of bilingual bracketing supervised by English parsing.
Here, English parsing supervised BTG means using an English parser's bracketing information as a boundary restriction in the BTG language model.
However, this does not require the Chinese to be parsed entirely according to the English parsing boundaries.
If the English parsing structure is totally fixed, it is possible that the structure is not linguistically valid for Chinese under the formalism of Inversion Transduction Grammar.
To illustrate this, see the example shown in Figure 2.
If you want to lose weight, you had better eat less bread.
Figure 2 An example of a mismatched subtree
The sub-trees for the boldfaced, underlined part of the English sentence and the corresponding Chinese are shown in Figure 2(a).
We can see that the Chinese constituents do not match the English counterparts in the English structure.
In this case, our solution is to align the whole English "VP" constituent with its whole Chinese counterpart; i.e., "eat less bread" is matched as a unit with the corresponding Chinese string, as shown in Figure 2(b).
At the same time, we match the inner structure according to ITG, regardless of the English parsing constraint.
An "X" tag is introduced to indicate that the sub-bilingual-parsing-tree is not consistent with the given English sub-tree.
Our result can also be understood as a flattened bilingual parsing tree as shown in Figure 2(c).
This means that when the bilingual constituents cannot be matched within a small syntactic structure, we match them within a larger one.
The main idea is that the given English parser is only used as a boundary constraint for bilingual parsing.
When the constraint is incompatible with the bilingual model ITG, we use ITG as the default result.
This process enables parsing to go on regardless of some failures in matching.
We heuristically define a constraint function Fe(s, t) to denote the English boundary constraint, where s is the beginning position and t is the end.
There are three cases of structure matching: violate match, exact match and inside match.
Violate match means the bilingual parsing conflicts with the given English bracketing boundary.
For example, given the following English bracketing result (8), (1,2), (1,3), (2,3), (3,5) are examples.
(3,4), (4,5) are examples of inside match, and the value 1 is assigned to these Fe(s, t) functions.
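The three cases can be sketched in Python as a span classifier plus a constraint function built on top of it. This is only an illustration under stated assumptions: spans use fence-post positions, so that (s, t) covers words s+1 to t; the crossing-brackets test is one plausible reading of "conflicts with the given English bracketing boundary"; the names `classify` and `make_fe` are hypothetical; and only the inside-match value 1 is taken from the text, while `exact_value` and `violate_value` are placeholder defaults. The bracketing used in the demonstration is also hypothetical and is not example (8).

```python
from typing import Callable, Iterable, Tuple

Span = Tuple[int, int]  # fence-post positions: (s, t) covers words s+1 .. t


def classify(span: Span, brackets: Iterable[Span]) -> str:
    """Classify a candidate span against the given English brackets."""
    s, t = span
    bracket_set = set(brackets)
    if (s, t) in bracket_set:
        return "exact"
    for bs, bt in bracket_set:
        # Crossing brackets: the spans overlap, but neither contains the other.
        if s < bs < t < bt or bs < s < bt < t:
            return "violate"
    return "inside"


def make_fe(brackets: Iterable[Span],
            exact_value: float = 1.0,
            violate_value: float = 1e-4) -> Callable[[int, int], float]:
    """Build F_e(s, t). Only the inside-match value 1 comes from the text;
    exact_value and violate_value are assumed placeholders."""
    bracket_set = set(brackets)
    values = {"exact": exact_value, "inside": 1.0, "violate": violate_value}
    return lambda s, t: values[classify((s, t), bracket_set)]


# Hypothetical bracketing of a six-word sentence: [[1 2 3] [4 [5 6]]]
english_brackets = [(0, 6), (0, 3), (3, 6), (4, 6)]
fe = make_fe(english_brackets)

print(classify((0, 3), english_brackets))  # exact
print(classify((2, 4), english_brackets))  # violate
print(classify((0, 2), english_brackets))  # inside
```

Keeping the classifier separate from the numeric values makes it possible to reuse the same three-way distinction when comparing acquired brackets against a reference annotation.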
S(s, t, u, v) = max P[e_s..t / c_u..v] denotes the maximum probability of a sub-parsing-tree rooted at node q such that both the sub-strings e_s..t and c_u..v derive from node q.
Thus, the best parse has the probability S(0, T, 0, V).
S(s, t, u, v) is calculated as the maximum probability over all possible sub-tree combinations (Wu 1995).
To insert English parsing constraints into bilingual parsing, we integrate the constraint function Fe(s, t) into the local optimization function.
The computation of the local optimization function is then modified as given below:
\[
S^{\langle\rangle}(s,t,u,v) = \max_{s \le s' \le t,\; u \le u' \le v} F_e(s,t)\, S(s,s',u',v)\, S(s',t,u,u')
\]
where s' and u' are the split points in the English and Chinese strings.
Initialization is as follows:
where T and V are the lengths of the English and Chinese sentences, respectively, and b(e_t / c_v) is the probability of translating the English word e_t into the Chinese word c_v.
A minimal probability can be assigned to the empty word alignments b(e_t / ε) and b(ε / c_v).
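For reference, the constrained recurrences and the initialization can be written out in full as the sketch below. Only the inverted case is stated explicitly in the text. The straight case, the combination of the two cases, and the condition on the split points are assumed here to follow Wu (1997), with the constraint function inserted in the same way; the initialization is assumed from the description of the lexical probabilities b. The split points in the English and Chinese strings are written s' and u'.

```latex
\[
\begin{aligned}
S(s,t,u,v) &= \max\bigl[\, S^{[\,]}(s,t,u,v),\; S^{\langle\rangle}(s,t,u,v) \,\bigr] \\
S^{[\,]}(s,t,u,v) &=
  \max_{\substack{s \le s' \le t,\; u \le u' \le v \\ (s'-s)(t-s') + (u'-u)(v-u') \neq 0}}
  F_e(s,t)\, S(s,s',u,u')\, S(s',t,u',v) \\
S^{\langle\rangle}(s,t,u,v) &=
  \max_{\substack{s \le s' \le t,\; u \le u' \le v \\ (s'-s)(t-s') + (u'-u)(v-u') \neq 0}}
  F_e(s,t)\, S(s,s',u',v)\, S(s',t,u,u') \\[4pt]
S(t-1,t,v-1,v) &= b(e_t / c_v), \qquad 1 \le t \le T,\; 1 \le v \le V \\
S(t-1,t,v,v) &= b(e_t / \epsilon), \qquad 1 \le t \le T,\; 0 \le v \le V \\
S(t,t,v-1,v) &= b(\epsilon / c_v), \qquad 0 \le t \le T,\; 1 \le v \le V
\end{aligned}
\]
```

The side condition on the split points requires at least one of the two strings to be divided at an interior position, so each sub-span is strictly smaller than its parent.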
The optimal bilingual parsing tree for a given sentence-pair can be computed using a dynamic programming (DP) algorithm (Wu 1997).
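A compact, memoized version of such a dynamic program might look like the following Python sketch. It is not the authors' implementation: the function name `sbtg_parse`, the tree encoding, the default value of `eps`, and the fallback probability `eps * eps` for unseen word pairs are all assumptions made for illustration. The constraint function is applied only to English spans longer than one word, and the split-point condition is the one given by Wu (1997).

```python
from functools import lru_cache


def sbtg_parse(e, c, b, fe=lambda s, t: 1.0, eps=1e-4):
    """Return (probability, tree) of the best constrained SBTG parse.

    e, c : lists of English / Chinese words
    b    : dict mapping (english_word, chinese_word) -> translation probability
    fe   : English boundary constraint F_e(s, t) on fence-post positions
    eps  : small constant for null alignments (assumed value)
    """

    @lru_cache(maxsize=None)
    def best(s, t, u, v):
        le, lc = t - s, v - u
        # Lexical productions (initialization).
        if (le, lc) == (1, 1):
            # Unseen pairs fall back to eps * eps (an assumption).
            return b.get((e[s], c[u]), eps * eps), (e[s], c[u])
        if (le, lc) == (1, 0):
            return eps, (e[s], None)
        if (le, lc) == (0, 1):
            return eps, (None, c[u])
        # Binary productions: try every pair of split points.
        weight = fe(s, t) if le > 1 else 1.0
        top = (0.0, None)
        for S in range(s, t + 1):
            for U in range(u, v + 1):
                # Wu's (1997) condition: at least one of the two strings
                # must be split at an interior position.
                if (S - s) * (t - S) + (U - u) * (v - U) == 0:
                    continue
                # Straight production [A A].
                p1, t1 = best(s, S, u, U)
                p2, t2 = best(S, t, U, v)
                top = max(top, (weight * p1 * p2, ("[]", t1, t2)),
                          key=lambda x: x[0])
                # Inverted production <A A>.
                p1, t1 = best(s, S, U, v)
                p2, t2 = best(S, t, u, U)
                top = max(top, (weight * p1 * p2, ("<>", t1, t2)),
                          key=lambda x: x[0])
        return top

    return best(0, len(e), 0, len(c))


# Toy example with placeholder tokens (hypothetical data).
lexicon = {("a", "A"): 0.9, ("b", "B"): 0.9}
prob, tree = sbtg_parse(["a", "b"], ["B", "A"], lexicon)
print(tree)  # ('<>', ('a', 'A'), ('b', 'B'))
```

In the toy example the Chinese side lists the two words in the opposite order, so the inverted production is the only analysis that uses both high-probability word pairs. The sketch enumerates all split points directly and is intended for short sentences only.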
Using the standard SBTG local optimization function, the bilingual parsing result obtained for the given sentence-pair (4) is shown as example (5); using the modified local optimization function above, the parsing result is that shown as example (7).
Comparing the two results, we can see that by integrating English parsing constraints into BTG, the bilingual parsing becomes more grammatical.
Our experiments showed that this English parsing supervised BTG improves the accuracy of bilingual bracketing by nearly 20% (Lu 2001b).
The obtained bilingual parsing tree is in the normal form of ITG; that is, each node in the tree is either a lexical node or a binary-fanout nonterminal node.
We can combine the subtrees to restore the fanout flexibility using the production properties [[A A] A] = [A [A A]] = [A A A] and <<A A> A> = <A <A A>> = <A A A>.
The combining operation must not cross the given English parsing boundary.
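The combining step can be illustrated with a small recursive function. This is a sketch under assumed conventions: leaves are plain strings standing for aligned word pairs, internal nodes are tuples `(orientation, children, fixed)`, and the hypothetical flag `fixed` marks a node that coincides with an English constituent and therefore must not be dissolved into its parent.

```python
def flatten(node):
    """Merge nested nodes of the same orientation, e.g. [[A A] A] -> [A A A]."""
    if isinstance(node, str):  # leaf: an aligned word pair
        return node
    orientation, children, fixed = node
    merged = []
    for child in map(flatten, children):
        # A child is absorbed only if it has the same orientation as its
        # parent and is not an English constituent that must be preserved.
        if (not isinstance(child, str)
                and child[0] == orientation
                and not child[2]):
            merged.extend(child[1])
        else:
            merged.append(child)
    return (orientation, merged, fixed)


# Hypothetical binary tree: the first child can be absorbed, the second has
# a different orientation, and the third is protected by an English bracket.
tree = ("[]", [
    ("[]", ["w1", "w2"], False),
    ("<>", ["w3", "w4"], False),
    ("[]", ["w5", "w6"], True),
], True)

print(flatten(tree))
```

Absorbing a child only when the orientations agree is what keeps the Chinese word order unchanged, since both operators are associative but cannot be mixed.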
3 Chinese bracketing knowledge extraction
Table 1 shows some bilingual bracketing examples obtained using the above method.
For ease of understanding, the first example is shown in tree form in Figure 3(a).
The leaf nodes are the aligned words of the two languages and their POS tag categories.
These POS tags are generated by an English and a Chinese POS tagger, respectively.
The English POS tag and phrase tag sets are the same as those of the Penn Treebank (Marcus 1993); for the Chinese POS tag set, please refer to the web site http://mtlab.hit.edu.cn.
The nonterminal nodes are labeled using English sub-tree tags.
Based on the bilingual parsing result, it is easy to extract the Chinese bracketing structure according to the Inversion Transduction Grammar.
For a normal node, the Chinese text is traversed in depth-first, left-to-right order, but for an inverted node (indicated by a horizontal line in the parse tree or by the <> notation in the bracketing expression), the right sub-tree is traversed before the left.
Thus, the Chinese parsing tree corresponding to Figure 3(a) is shown in Figure 3(b).
The nonterminal labels are derived from the English sub-tree.
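The traversal just described can be sketched as a short recursive function. The tree encoding is an assumption made for illustration: leaves are `(english_word, chinese_word)` pairs with `None` marking a null alignment, and internal nodes are `(label, orientation, children)` triples. The Chinese tokens in the example are placeholders, not the words of Figure 3.

```python
def chinese_bracketing(node):
    """Return the Chinese side of a bilingual parse tree.

    Leaves are (english_word, chinese_word) pairs, with None for a null
    alignment; internal nodes are (label, orientation, children).
    """
    if len(node) == 2:  # leaf
        return node[1]
    label, orientation, children = node
    # Inverted nodes are read right-to-left on the Chinese side.
    ordered = children if orientation == "[]" else list(reversed(children))
    kids = [k for k in map(chinese_bracketing, ordered) if k is not None]
    if not kids:
        return None  # no Chinese material under this node
    if len(kids) == 1:
        return kids[0]  # a single child needs no bracket of its own
    return (label, kids)


def to_brackets(tree):
    """Render an extracted tree as an unlabeled bracket string."""
    if isinstance(tree, str):
        return tree
    return "[" + " ".join(to_brackets(k) for k in tree[1]) + "]"


# Hypothetical bilingual tree; the ZH_* tokens are placeholders.
bilingual = ("VP", "<>", [
    ("VP", "[]", [("plays", "ZH_plays"), ("basketball", "ZH_basketball")]),
    ("PP", "[]", [("on", None), ("Sunday", "ZH_Sunday")]),
])

print(to_brackets(chinese_bracketing(bilingual)))
# [ZH_Sunday [ZH_plays ZH_basketball]]
```

English words aligned to null simply disappear from the Chinese side, and a constituent left with a single Chinese child is collapsed, which is one possible design choice rather than a rule stated in the text.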
The Chinese bracketing results extracted from Table 1 are listed in Table 2.
_Table 1 Bilingual bracketing examples_
_Table 2 The extracted Chinese bracketing results corresponding to Table 1_
(a) Bilingual parsing result supervised by English parsing
(b) The Chinese parsing result extracted from (a)
Figure 3 Extracting Chinese bracketing structure from bilingual parsing
It can be seen from Table 2 that the automatically acquired bracketing results reflect the Chinese structure well, though some English phrase tags are not suitable for labeling the corresponding Chinese phrases directly.
For example, in Table 2, the English tags "PP" (prepositional phrase) in sentence 1 and "SBAR" (clause) in sentence 4 incorrectly label the corresponding Chinese structures.
We are not concerned with the phrase tags here.
Our main concern is the bracketing boundary of the syntactic structure.
Bracketing boundary knowledge has been shown to be valuable for Chinese grammar induction (Zhou 1997).
The advantage of our method is that the bracketing knowledge is acquired from a bilingual corpus automatically.
It reduces the manual labour of corpus tagging, which is time-consuming and error-prone.
4 Evaluation and discussion
To evaluate the quality of the acquired Chinese bracketing boundaries, we compared them with parsing annotations based on an existing Chinese syntax annotation scheme.
Details of the Chinese syntax annotation scheme and an annotated corpus can be downloaded from the website http://mtlab.hit.edu.cn.
The test set consisted of 3,000 English-Chinese bilingual sentence pairs from the machine translation evaluation corpus (Duan 1996).
The average length is 9.1 words for English sentences and 12.6 Chinese characters for Chinese sentences.
The test sentence pairs were first aligned at the word level, based on statistics and a lexicon, with an accuracy of nearly 90% (Lu 2001a).
The English and Chinese sentences were parsed based on the Penn Treebank tag set and the Chinese syntax annotation scheme, respectively.
Both the English and the Chinese parsing results were manually corrected.
The corrected Chinese parsing results are used as the standard test set.
We acquired Chinese bracketing results using the proposed method.
The previously defined exact match, violate match, and inside match are used to evaluate the agreement between the acquired bracketing results and the standard parsing results.
Here, exact match means the acquired structure is the same as the standard structure; violate match means the acquired structure conflicts with the standard structure.
Otherwise, the acquired structure is called an inside match.
In example (9), A is the standard bracketing result, B is the acquired bracketing result and C demonstrates the classification of the acquired structures.
The structure of the whole sentence is not included in the evaluation.
Exact match rate (EMR), violate match rate (VMR), and inside match rate (IMR) denote the proportions of the three types of brackets among all acquired brackets, respectively.
Table 3 gives the evaluation results.
The evaluation results for the acquired Chinese structures corresponding to six main English phrase types (BNP, NP, VP, ADJP, ADVP and PP) are also given in detail.
Table 3 Evaluation on acquired Chinese bracketing results
From the results we can see that only a small fraction of the learned structures are violate matches (14.03%), while the majority are exact matches (55.46%).
In addition, there are also many inside matches.
These inside matches occurred because the Penn Treebank and the standard Chinese annotation scheme use different standards for phrase merging.
The English phrase structures are labeled in more detail.
For Chinese, in contrast, the main phrases at the sentence level are not merged further.
For example, the verb and object at the sentence level are not combined.
That is why most of the verb phrases (VP) are inside matches (53.28%).
The bracketing boundary of an inside match can be either right or wrong.
We checked the correctness of the inside matches manually and obtained an average accuracy of 79.37%.
The accuracy of all acquired bracketing structures is therefore 79.68% (EMR + IMR × accuracy of IM).
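The overall figure can be checked with a few lines of arithmetic. The only assumption is that the three match rates sum to 100%, which gives the inside match rate as the remainder after the reported exact and violate rates; the variable names are illustrative.

```python
emr = 55.46          # exact match rate (%), as reported
vmr = 14.03          # violate match rate (%), as reported
im_accuracy = 79.37  # manually checked accuracy of inside matches (%)

# Assumption: every bracket is exactly one of the three match types.
imr = 100.0 - emr - vmr

# Exact matches count as correct; inside matches count in proportion
# to their manually checked accuracy.
overall = emr + imr * im_accuracy / 100.0

print(round(imr, 2))      # 30.51
print(round(overall, 2))  # 79.68
```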
The violate matches acquired in bilingual parsing are mainly due to empty word alignments.
This happens, for example, in some special Chinese structures in which certain words have no counterpart word in English.
These words are usually merged with the neighboring noun, as shown in example (10), which leads to a violate match.
It is necessary to build special patterns to handle these structures.
Word alignment errors also produce violate matches in bilingual bracketing.
One advantage of our method is that it does not need a Chinese annotated training corpus, which is difficult to accumulate.
Another advantage of our method is that the Chinese bracketing result is derived from English parsing and a parallel corpus, which makes it particularly beneficial for research on the correspondence between Chinese and English phrases.
In (Lu 2001b), we used bilingual bracketing results for the automatic acquisition of translation templates, which turned out to be very useful for structure transfer in machine translation.
In addition, the acquired bracketing corpus can be applied to many Chinese NLP tasks.
It can be used as the foundation for further Chinese treebank annotation, which will save a great deal of human labour.
It can also be used to improve the efficiency and accuracy of Chinese grammar induction (Zhou 1997).
Grammar rules can also be extracted from the bracketing corpus.
For example, we can obtain the following BNP rules from the acquired bracketing results in Table 2:
Conclusion
In this paper, we have presented a method to learn Chinese syntactic structure from English parsing based on a bilingual language model.
The method creates structurally bracketed Chinese corpora automatically by taking full advantage of English parsing and bilingual corpora.
The created corpora are very useful for further Chinese corpus annotation and parsing knowledge acquisition.
Preliminary experiments demonstrated the feasibility and validity of the method.
Although this paper is concerned with Chinese and English, the method is also applicable to other language pairs.
Presumably, if the languages concerned come from the same language family, such as English and French, the method would be more effective.
Acknowledgements
This research was funded by the High Technology Research and Development Program of China (2001AA114101).
We would also like to thank the Institute of Computational Linguistics at Peking University for providing the bilingual corpora used for testing.
