Jörg Steffen


TransIns: Document Translation with Markup Reinsertion
Jörg Steffen | Josef van Genabith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

For many use cases, MT is required to translate not just raw text but complex formatted documents (e.g. websites, slides, spreadsheets), and the result of the translation should reflect the formatting. This is challenging, as markup can be nested or can apply to spans that are contiguous in the source but non-contiguous in the target. Here we present TransIns, a system for non-plain text document translation that builds on the Okapi framework and MT models trained with Marian NMT. We develop, implement and evaluate different strategies for reinserting markup into translated sentences using token alignments between source and target sentences. We propose a simple and effective strategy that compiles down all markup to single source tokens and transfers them to aligned target tokens. A first evaluation shows that this strategy yields highly accurate markup in the translated documents that outperforms the markup quality found in documents translated with popular translation services. We release TransIns under the MIT License as open-source software on https://github.com/DFKI-MLT/TransIns. An online demonstrator is available at https://transins.dfki.de.


Common Round: Application of Language Technologies to Large-Scale Web Debates
Hans Uszkoreit | Aleksandra Gabryszak | Leonhard Hennig | Jörg Steffen | Renlong Ai | Stephan Busemann | Jon Dehdari | Josef van Genabith | Georg Heigold | Nils Rethmeier | Raphael Rubino | Sven Schmeier | Philippe Thomas | He Wang | Feiyu Xu
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

Web debates play an important role in enabling broad participation of constituencies in social, political and economic decision-making. However, it is challenging to organize, structure, and navigate a vast number of diverse arguments and comments collected from many participants over a long time period. In this paper we demonstrate Common Round, a next-generation platform for large-scale web debates, which provides functions for eliciting the semantic content and structures from the contributions of participants. In particular, Common Round applies language technologies for the extraction of semantic essence from textual input and for the aggregation of the formulated opinions and arguments. The platform also provides cross-lingual access to debates using machine translation.


A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology
Ulrich Schäfer | Christian Spurk | Jörg Steffen
Proceedings of COLING 2012: Posters


The ACL Anthology Searchbench
Ulrich Schäfer | Bernd Kiefer | Christian Spurk | Jörg Steffen | Rui Wang
Proceedings of the ACL-HLT 2011 System Demonstrations


The pragmatic combination of different crosslingual resources
Hans Uszkoreit | Feiyu Xu | Jörg Steffen | Ilhan Aslan
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We will describe new cross-lingual strategies for the development of multilingual information services on mobile devices. The novelty of our approach is the intelligent modeling of cross-lingual application domains and the combination of textual translation with speech generation. The final system helps users to speak foreign languages and communicate with the local people in relevant situations, such as restaurants, taxis and emergencies. The advantage of our information services is that they are robust enough for use in real-world situations. They are developed for the Beijing Olympic Games 2008, where most foreigners will have to rely on translation assistance. Their deployment is foreseen as part of the planned ubiquitous mobile information system of the Olympic Games.


Integrated Language Technologies for Multilingual Information Services in the MEMPHIS Project
Walter Kasper | Jörg Steffen | Jakub Piskorski | Paul Buitelaar
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

N-Gram Language Modeling for Robust Multi-Lingual Document Classification
Jörg Steffen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)