Sariya Karimova
2020
LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition
Benjamin Beilharz | Xin Sun | Sariya Karimova | Stefan Riezler
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies. The quality of audio and sentence alignments has been checked by a manual evaluation, showing that speech alignment quality is in general very high. The sentence alignment quality is comparable to well-used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for German speech recognition and for end-to-end German-to-English speech translation.
2016
A Post-editing Interface for Immediate Adaptation in Statistical Machine Translation
Patrick Simianer | Sariya Karimova | Stefan Riezler
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations
Adaptive machine translation (MT) systems are a promising approach for improving the effectiveness of computer-aided translation (CAT) environments. There is, however, virtually only theoretical work that examines how such a system could be implemented. We present an open source post-editing interface for adaptive statistical MT, which has in-depth monitoring capabilities and excellent expandability, and can facilitate practical studies. To this end, we designed text-based and graphical post-editing interfaces. The graphical interface offers means for displaying and editing a rich view of the MT output. Our translation systems may learn from post-edits using several weight, language model and novel translation model adaptation techniques, in part by exploiting the output of the graphical interface. In a user study we show that using the proposed interface and adaptation methods, reductions in technical effort and time can be achieved.
2014
Offline extraction of overlapping phrases for hierarchical phrase-based translation
Sariya Karimova | Patrick Simianer | Stefan Riezler
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers
Standard SMT decoders operate by translating disjoint spans of input words, thus discarding information in the form of overlapping phrases that is present at phrase extraction time. The use of overlapping phrases in translation may enhance fluency in positions that would otherwise be phrase boundaries, provide additional statistical support for long and rare phrases, and generate new phrases that have never been seen in the training data. We show how to extract overlapping phrases offline for hierarchical phrase-based SMT, and how to extract features and tune weights for the new phrases. We find gains of 0.3 to 0.6 BLEU points over discriminatively trained hierarchical phrase-based SMT systems on two datasets for German-to-English translation.