András Rung


NP Alignment in Bilingual Corpora
Gábor Recski | András Rung | Attila Zséder | András Kornai
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Aligning the NPs of parallel corpora is logically halfway between the sentence- and word-alignment tasks that occupy much of the MT literature, but has received far less attention. NP alignment is a challenging problem, capable of rapidly exposing flaws both in the word-alignment and in the NP chunking algorithms one may bring to bear. It is also a very rewarding problem in that NPs are semantically natural translation units, which means that (i) word alignments will cross NP boundaries only exceptionally, and (ii) within sentences already aligned, the proportion of 1-1 alignments will be higher for NPs than for words. We created a simple English-Hungarian gold standard from Orwell’s 1984 (which already exists in manually verified POS-tagged format in many languages thanks to the Multext and Multext-East projects) by manually verifying the automatically generated NP chunking (we used the yamcha, mallet and hunchunk taggers) and manually aligning the maximal NPs and PPs. The maximal NP chunking problem is much harder than base NP chunking, with F-measure in the .7 range (as opposed to over .94 for base NPs). Since the results depend heavily on the quality of the NP chunking, we tested our alignment algorithms in two settings. With real-world (machine-obtained) chunkings, results are in the .35 range for the baseline algorithm, which propagates GIZA++ word alignments to the NP level. With idealized (manually obtained) chunkings, the baseline reaches .4 and our current system reaches .64.


Hungarian lexical database and morphological grammar
Viktor Trón | Péter Halácsy | Péter Rebrus | András Rung | Péter Vajda | Eszter Simon
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes a Hungarian lexical database and morphological grammar. The resource is the outcome of a several-year collaborative effort and has the widest coverage and broadest range of applicability presently available for Hungarian. The grammar resource is the formalization of well-founded theoretical decisions handling inflection and productive derivation. The lexical database was created by merging three independent lexical databases, and the resulting resource was further extended.


Creating Open Language Resources for Hungarian
Péter Halácsy | András Kornai | László Németh | András Rung | István Szakadát | Viktor Trón
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)