Martin Volk


Nunc profana tractemus. Detecting Code-Switching in a Large Corpus of 16th Century Letters
Martin Volk | Lukas Fischer | Patricia Scheurer | Bernard Silvan Schroffenegger | Raphael Schwitter | Phillip Ströbel | Benjamin Suter
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper is based on a collection of 16th century letters from and to the Zurich reformer Heinrich Bullinger. Around 12,000 letters of this exchange have been preserved, out of which 3100 have been professionally edited, and another 5500 are available as provisional transcriptions. We have investigated code-switching in these 8600 letters, first on the sentence-level and then on the word-level. In this paper we give an overview of the corpus and its language mix (mostly Early New High German and Latin, but also French, Greek, Italian and Hebrew). We report on our experiences with a popular language identifier and present our results when training an alternative identifier on a very small training corpus of only 150 sentences per language. We use the automatically labeled sentences in order to bootstrap a word-based language classifier which works with high accuracy. Our research around the corpus building and annotation involves automatic handwritten text recognition, text normalisation for ENH German, and machine translation from medieval Latin into modern German.
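
As an illustration of how a language identifier can be trained on very few sentences per language, the following is a minimal sketch of a character-trigram classifier with add-one smoothing. It is not the identifier described in the abstract. The function names (`char_ngrams`, `train`, `identify`) and the toy training sentences are assumptions made for this example only.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return the character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples):
    """Build one n-gram count table per language.

    samples: dict mapping a language label to a list of sentences.
    """
    return {lang: Counter(g for s in sents for g in char_ngrams(s))
            for lang, sents in samples.items()}

def identify(text, models):
    """Pick the language whose smoothed n-gram model scores the text highest."""
    vocab = len({g for counts in models.values() for g in counts})

    def score(counts):
        total = sum(counts.values())
        # Add-one smoothing so unseen n-grams do not zero out the score.
        return sum(math.log((counts[g] + 1) / (total + vocab))
                   for g in char_ngrams(text))

    return max(models, key=lambda lang: score(models[lang]))

# Hypothetical toy data; a real setup would use far more sentences per language.
toy = {
    "la": ["gratia et pax a domino", "nunc profana tractemus"],
    "de": ["gnad und frid von gott", "ich hab üwer schriben empfangen"],
}
models = train(toy)
label = identify("pax et gratia", models)
```

Applying the same `identify` call to single words instead of whole sentences would give a crude word-level labeller, though very short inputs offer little evidence for a trigram model.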

Evaluation of HTR models without Ground Truth Material
Phillip Benjamin Ströbel | Martin Volk | Simon Clematide | Raphael Schwitter | Tobias Hodel | David Schoch
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward: because HTR is a supervised problem, the usual data split into training, validation, and test data sets allows the evaluation of models in terms of accuracy or error rates. However, the evaluation process becomes tricky as soon as we switch from development to application. Compiling a new (and necessarily smaller) ground truth (GT) from a sample of the data that we want to apply the model to, and then evaluating models on it, only provides hints about the quality of the recognised text, as do the confidence scores (if available) that the models return. Moreover, if we have several models at hand, we face a model selection problem since we want to obtain the best possible result during the application phase. This calls for GT-free metrics to select the best model, which is why we (re-)introduce and compare different metrics, from simple, lexicon-based ones to more elaborate ones using standard language models and masked language models (MLM). We show that MLM-based evaluation can compete with lexicon-based methods, with the advantage that large and multilingual transformers are readily available, thus making compiling lexical resources for other metrics superfluous.
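
A lexicon-based, GT-free metric of the kind mentioned in the abstract can be sketched as the share of recognised tokens that appear in a word list. The code below is an illustrative assumption, not the metric definition from the paper. The function name `lexicon_coverage`, the toy lexicon, and the two model outputs are hypothetical.

```python
import re

def lexicon_coverage(text, lexicon):
    """Fraction of alphabetic tokens in `text` that are found in `lexicon`."""
    # [^\W\d_]+ matches runs of Unicode letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)

# Hypothetical lexicon and recognition outputs from two candidate models.
lexicon = {"gratia", "et", "pax", "a", "domino"}
outputs = {
    "model_a": "Gratia et pax a domino",
    "model_b": "Gratla et pax a dornino",
}

scores = {name: lexicon_coverage(text, lexicon) for name, text in outputs.items()}
best = max(scores, key=scores.get)  # model with the highest coverage
```

Such a score needs no transcribed reference, but it depends on a lexicon that covers the language and period of the documents, which is the resource the MLM-based alternative avoids.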

Improving Specificity in Review Response Generation with Data-Driven Data Filtering
Tannon Kew | Martin Volk
Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)

Responding to online customer reviews has become an essential part of successfully managing and growing a business both in e-commerce and the hospitality and tourism sectors. Recently, neural text generation methods intended to assist authors in composing responses have been shown to deliver highly fluent and natural looking texts. However, they also tend to learn a strong, undesirable bias towards generating overly generic, one-size-fits-all outputs to a wide range of inputs. While this often results in ‘safe’, high-probability responses, there are many practical settings in which greater specificity is preferable. In this work we examine the task of generating more specific responses for online reviews in the hospitality domain by identifying generic responses in the training data, filtering them and fine-tuning the generation model. We experiment with a range of data-driven filtering methods and show through automatic and human evaluation that, despite a 60% reduction in the amount of training data, filtering helps to derive models that are capable of generating more specific, useful responses.

A Multilingual Simplified Language News Corpus
Renate Hauser | Jannis Vamvas | Sarah Ebling | Martin Volk
Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference

Simplified language news articles are being offered by specialized web portals in several countries. The thousands of articles that have been published over the years are a valuable resource for natural language processing, especially for efforts towards automatic text simplification. In this paper, we present SNIML, a large multilingual corpus of news in simplified language. The corpus contains 13k simplified news articles written in one of six languages: Finnish, French, Italian, Swedish, English, and German. All articles are shared under open licenses that permit academic use. The level of text simplification varies depending on the news portal. We believe that even though SNIML is not a parallel corpus, it can be useful as a complement to the more homogeneous but often smaller corpora of news in the simplified variety of one language that are currently in use.

Machine Translation of 16th Century Letters from Latin to German
Lukas Fischer | Patricia Scheurer | Raphael Schwitter | Martin Volk
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

This paper outlines our work in collecting training data for and developing a Latin–German Neural Machine Translation (NMT) system for translating 16th century letters. While Latin–German is a low-resource language pair in terms of NMT, the domain of 16th century epistolary Latin is even more limited in this regard. Through our efforts in data collection and data generation, we are able to train an NMT model that provides good translations for short to medium sentences, and outperforms GoogleTranslate overall. We focus on the correspondence of the Swiss reformer Heinrich Bullinger, but our parallel corpus and our NMT system will be of use for many other texts of the time.


Benchmarking Data-driven Automatic Text Simplification for German
Andreas Säuberli | Sarah Ebling | Martin Volk
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Automatic text simplification is an active research area, and there are first systems for English, Spanish, Portuguese, and Italian. For German, no data-driven approach exists to this date, due to a lack of training data. In this paper, we present a parallel corpus of news items in German with corresponding simplifications on two complexity levels. The simplifications have been produced according to a well-documented set of guidelines. We then report on experiments in automatically simplifying the German news items using state-of-the-art neural machine translation techniques. We demonstrate that despite our small parallel corpus, our neural models were able to learn essential features of simplified language, such as lexical substitutions, deletion of less relevant words and phrases, and sentence shortening.

How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR
Phillip Benjamin Ströbel | Simon Clematide | Martin Volk
Proceedings of the Twelfth Language Resources and Evaluation Conference

Recent advances in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) have led to more accurate text recognition of historical documents. The Digital Humanities heavily profit from these developments, but they still struggle when choosing from the plethora of OCR systems available on the one hand and when defining workflows for their projects on the other hand. In this work, we present our approach to build a ground truth for a historical German-language newspaper published in black letter. We also report how we used it to systematically evaluate the performance of different OCR engines. Additionally, we used this ground truth to make an informed estimate as to how much data is necessary to achieve high-quality OCR results. The outcomes of our experiments show that HTR architectures can successfully recognise black letter text and that a ground truth size of 50 newspaper pages suffices to achieve good OCR accuracy. Moreover, our models perform equally well on data they have not seen during training, which means that additional manual correction for diverging data is superfluous.


Post-editing Productivity with Neural Machine Translation: An Empirical Assessment of Speed and Quality in the Banking and Finance Domain
Samuel Läubli | Chantal Amrhein | Patrick Düggelin | Beatriz Gonzalez | Alena Zwahlen | Martin Volk
Proceedings of Machine Translation Summit XVII: Research Track

Geotagging a Diachronic Corpus of Alpine Texts: Comparing Distinct Approaches to Toponym Recognition
Tannon Kew | Anastassia Shaitarova | Isabel Meraner | Janis Goldzycher | Simon Clematide | Martin Volk
Proceedings of the Workshop on Language Technology for Digital Historical Archives

Geotagging historic and cultural texts provides valuable access to heritage data, enabling location-based searching and new geographically related discoveries. In this paper, we describe two distinct approaches to geotagging a variety of fine-grained toponyms in a diachronic corpus of alpine texts. By applying a traditional gazetteer-based approach, aided by a few simple heuristics, we attain strong high-precision annotations. Using the output of this earlier system, we adopt a state-of-the-art neural approach in order to facilitate the detection of new toponyms on the basis of context. Additionally, we present the results of preliminary experiments on integrating a small amount of crowdsourced annotations to improve overall performance of toponym recognition in our heritage corpus.


Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation
Samuel Läubli | Rico Sennrich | Martin Volk
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recent research suggests that neural machine translation achieves parity with professional human translation on the WMT Chinese–English news translation task. We empirically test this claim with alternative evaluation protocols, contrasting the evaluation of single sentences and entire documents. In a pairwise ranking experiment, human raters assessing adequacy and fluency show a stronger preference for human over machine translation when evaluating documents as compared to isolated sentences. Our findings emphasise the need to shift towards document-level evaluation as machine translation improves to the degree that errors which are hard or impossible to spot at the sentence-level become decisive in discriminating quality of different translation outputs.


Multilingwis² – Explore Your Parallel Corpus
Johannes Graën | Dominique Sandoz | Martin Volk
Proceedings of the 21st Nordic Conference on Computational Linguistics


Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus
Simon Clematide | Lenz Furrer | Martin Volk
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Crowdsourcing approaches for post-correction of Optical Character Recognition (OCR) output have been successfully applied to several historic text collections. We report on our crowd-correction platform Kokos, which we built to improve the OCR quality of the digitized yearbooks of the Swiss Alpine Club (SAC) from the 19th century. This multilingual heritage corpus consists of Alpine texts mainly written in German and French, all typeset in Antiqua font. Finding and engaging volunteers for correcting large amounts of pages into high quality text requires a carefully designed user interface, an easy-to-use workflow, and continuous efforts for keeping the participants motivated. More than 180,000 characters on about 21,000 pages were corrected by volunteers in about 7 months, achieving an OCR gold standard with a systematically evaluated accuracy of 99.7% on the word level. The crowdsourced OCR gold standard and the corresponding original OCR recognition results from ABBYY FineReader 7 for each page are available as a resource. Additionally, the scanned images (300dpi) of all pages are included in order to facilitate tests with other OCR software.


Pre-reordering for Statistical Machine Translation of Non-fictional Subtitles
Magdalena Plamadă | Gion Linder | Phillip Ströbel | Martin Volk
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Leveraging Compounds to Improve Noun Phrase Translation from Chinese and German
Xiao Pu | Laura Mascarell | Andrei Popescu-Belis | Mark Fishel | Ngoc-Quang Luong | Martin Volk
Proceedings of the ACL-IJCNLP 2015 Student Research Workshop

Detecting Document-level Context Triggers to Resolve Translation Ambiguity
Laura Mascarell | Mark Fishel | Martin Volk
Proceedings of the Second Workshop on Discourse in Machine Translation


Detecting Code-Switching in a Multilingual Alpine Heritage Corpus
Martin Volk | Simon Clematide
Proceedings of the First Workshop on Computational Approaches to Code Switching

Machine Translation for Subtitling: A Large-Scale Evaluation
Thierry Etchegoyhen | Lindsay Bywood | Mark Fishel | Panayota Georgakopoulou | Jie Jiang | Gerard van Loenhout | Arantza del Pozo | Mirjam Sepesy Maučec | Anja Turner | Martin Volk
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article describes a large-scale evaluation of the use of Statistical Machine Translation for professional subtitling. The work was carried out within the FP7 EU-funded project SUMAT and involved two rounds of evaluation: a quality evaluation and a measure of productivity gain/loss. We present the SMT systems built for the project and the corpora they were trained on, which combine professionally created and crowd-sourced data. Evaluation goals, methodology and results are presented for the eleven translation pairs that were evaluated by professional subtitlers. Overall, a majority of the machine translated subtitles received good quality ratings. The results were also positive in terms of productivity, with a global gain approaching 40%. We also evaluated the impact of applying quality estimation and filtering of poor MT output, which resulted in higher productivity gains for filtered files as opposed to fully machine-translated files. Finally, we present and discuss feedback from the subtitlers who participated in the evaluation, a key aspect for any eventual adoption of machine translation technology in professional subtitling.

Innovations in Parallel Corpus Search Tools
Martin Volk | Johannes Graën | Elena Callegaro
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Recent years have seen an increased interest in and availability of parallel corpora. Large corpora from international organizations (e.g. European Union, United Nations, European Patent Office), or from multilingual Internet sites (e.g. OpenSubtitles) are now easily available and are used for statistical machine translation but also for online search by different user groups. This paper gives an overview of different usages and different types of search systems. In the past, parallel corpus search systems were based on sentence-aligned corpora. We argue that automatic word alignment allows for major innovations in searching parallel corpora. Some online query systems already employ word alignment for sorting translation variants, but none supports the full query functionality that has been developed for parallel treebanks. We propose to develop such a system for efficiently searching large parallel corpora with a powerful query language.


Mining for Domain-specific Parallel Text from Wikipedia
Magdalena Plamadă | Martin Volk
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

Building a German/Simple German Parallel Corpus for Automatic Text Simplification
David Klaper | Sarah Ebling | Martin Volk
Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations

Combining Statistical Machine Translation and Translation Memories with Domain Adaptation
Samuel Läubli | Mark Fishel | Martin Volk | Manuela Weibel
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Statistical Machine Translation for Automobile Marketing Texts
Samuel Läubli | Mark Fishel | Manuela Weibel | Martin Volk
Proceedings of Machine Translation Summit XIV: Posters

Assessing post-editing efficiency in a realistic translation environment
Samuel Läubli | Mark Fishel | Gary Massey | Maureen Ehrensberger-Dow | Martin Volk
Proceedings of the 2nd Workshop on Post-editing Technology and Practice

Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis
Rico Sennrich | Martin Volk | Gerold Schneider
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013


SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles
Volha Petukhova | Rodrigo Agerri | Mark Fishel | Sergio Penkale | Arantza del Pozo | Mirjam Sepesy Maučec | Andy Way | Panayota Georgakopoulou | Martin Volk
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Subtitling and audiovisual translation have been recognized as areas that could greatly benefit from the introduction of Statistical Machine Translation (SMT) followed by post-editing, in order to increase the efficiency of the subtitle production process. The FP7 European project SUMAT (An Online Service for SUbtitling by MAchine Translation) aims to develop an online subtitle translation service for nine European languages, combined into 14 different language pairs, in order to semi-automate the subtitle translation processes of both freelance translators and subtitling companies on a large scale. In this paper we discuss the data collection and parallel corpus compilation for training SMT systems, which includes several procedures such as data partition, conversion, formatting, normalization and alignment. We discuss in detail each data pre-processing step using various approaches. Apart from the quantity (around 1 million subtitles per language pair), the SUMAT corpus has a number of very important characteristics. First of all, high quality both in terms of translation and in terms of high-precision alignment of parallel documents and their contents has been achieved. Secondly, the contents are provided in one consistent format and encoding. Finally, additional information such as type of content in terms of genres and domain is available.

From Subtitles to Parallel Corpora
Mark Fishel | Yota Georgakopoulou | Sergio Penkale | Volha Petukhova | Matej Rojc | Martin Volk | Andy Way
Proceedings of the 16th Annual conference of the European Association for Machine Translation


Combining Semantic and Syntactic Generalization in Example-Based Machine Translation
Sarah Ebling | Andy Way | Martin Volk | Sudip Kumar Naskar
Proceedings of the 15th Annual conference of the European Association for Machine Translation

Le corpus Text+Berg Une ressource parallèle alpin français-allemand (The Text+Berg Corpus An Alpine French-German Parallel Resource)
Anne Göhring | Martin Volk
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cet article présente un corpus parallèle français-allemand de plus de 4 millions de mots issu de la numérisation d’un corpus alpin multilingue. Ce corpus est une précieuse ressource pour de nombreuses études de linguistique comparée et du patrimoine culturel ainsi que pour le développement d’un système statistique de traduction automatique dans un domaine spécifique. Nous avons annoté un échantillon de ce corpus parallèle et aligné les structures arborées au niveau des mots, des constituants et des phrases. Cet “alpine treebank” est le premier corpus arboré parallèle français-allemand de haute qualité (manuellement contrôlé), de libre accès et dans un domaine et un genre nouveau : le récit d’alpinisme.

Reducing OCR Errors in Gothic-Script Documents
Lenz Furrer | Martin Volk
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

Iterative, MT-based Sentence Alignment of Parallel Texts
Rico Sennrich | Martin Volk
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

Disambiguation of English Contractions for Machine Translation of TV Subtitles
Martin Volk | Rico Sennrich
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)


Machine Translation of TV Subtitles for Large Scale Production
Martin Volk | Rico Sennrich | Christian Hardmeier | Frida Tidström
Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry

This paper describes our work on building and employing Statistical Machine Translation systems for TV subtitles in Scandinavia. We have built translation systems for Danish, English, Norwegian and Swedish. They are used in daily subtitle production and translate large volumes. As an example we report on our evaluation results for three TV genres. We discuss our lessons learned in the system development process which shed interesting light on the practical use of Machine Translation technology.

Combining Parallel Treebanks and Geo-Tagging
Martin Volk | Anne Goehring | Torsten Marek
Proceedings of the Fourth Linguistic Annotation Workshop

MT-based Sentence Alignment for OCR-generated Parallel Texts
Rico Sennrich | Martin Volk
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

The performance of current sentence alignment tools varies according to the to-be-aligned texts. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text and BLEU as a similarity score to find reliable alignments which are used as anchor points. The gaps between these anchor points are then filled using BLEU-based and length-based heuristics. We show that this approach outperforms state-of-the-art algorithms in our alignment task, and that this improvement in alignment quality translates into better SMT performance. Furthermore, we show that even length-based alignment algorithms profit from having a machine translation as a point of comparison.
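
The anchor-point idea in the abstract can be sketched as follows, assuming the source sentences have already been machine-translated into the target language. This is a simplified illustration, not the authors' algorithm: a Dice coefficient over token sets stands in for BLEU, anchors are chosen greedily as mutual best matches in monotonic order, and the gap-filling step is left out. The names `overlap` and `find_anchors`, the threshold value, and the example sentences are hypothetical.

```python
def overlap(a, b):
    """Dice coefficient over token sets; a crude stand-in for BLEU."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def find_anchors(translated_src, tgt, threshold=0.6):
    """Return monotonic (src_index, tgt_index) pairs that are mutual best matches."""
    if not translated_src or not tgt:
        return []
    sim = [[overlap(s, t) for t in tgt] for s in translated_src]
    anchors = []
    last_j = -1
    for i, row in enumerate(sim):
        # Best target sentence for this source sentence.
        j = max(range(len(tgt)), key=row.__getitem__)
        # Best source sentence for that target sentence.
        best_i = max(range(len(translated_src)), key=lambda k: sim[k][j])
        if best_i == i and row[j] >= threshold and j > last_j:
            anchors.append((i, j))
            last_j = j
    return anchors

# Hypothetical machine-translated source sentences and target sentences.
translated_src = [
    "we reached the summit at noon",
    "the weather was bad",
    "then we went down to the hut",
]
tgt = [
    "we reached the summit at noon",
    "clouds gathered quickly",
    "the weather turned bad",
    "we went down to the hut",
]
anchors = find_anchors(translated_src, tgt)
```

In this toy example the second target sentence receives no anchor. It falls into a gap between anchors, which is where the abstract's BLEU-based and length-based heuristics would apply.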

Challenges in Building a Multilingual Alpine Heritage Corpus
Martin Volk | Noah Bubenhofer | Adrian Althaus | Maya Bangerter | Lenz Furrer | Beni Ruef
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes our efforts to build a multilingual heritage corpus of alpine texts. Currently we digitize the yearbooks of the Swiss Alpine Club which contain articles in French, German, Italian and Romansch. Articles comprise mountaineering reports from all corners of the earth, but also scientific topics such as topography, geology or glaciology as well as occasional poetry and lyrics. We have already scanned close to 70,000 pages, which has resulted in a corpus of 25 million words, 10% of which is a parallel French-German corpus. We have solved a number of challenges in automatic language identification and text structure recognition. Our next goal is to identify the great variety of toponyms (e.g. names of mountains and valleys, glaciers and rivers, trails and cabins) in this corpus, and we sketch how a large gazetteer of Swiss topographical names can be exploited for this purpose. Despite the size of the resource, exact matching leads to a low recall because of spelling variations, language mixtures and partial repetitions.


Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles
Christian Hardmeier | Martin Volk
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)


Human Judgements in Parallel Treebank Alignment
Martin Volk | Torsten Marek | Yvonne Samuelsson
Coling 2008: Proceedings of the workshop on Human Judgements in Computational Linguistics


Evaluating MT with translations or translators: what is the difference?
Martin Volk | Søren Harder
Proceedings of Machine Translation Summit XI: Papers

A Search Tool for Parallel Treebanks
Martin Volk | Joakim Lundborg | Maël Mettler
Proceedings of the Linguistic Annotation Workshop

Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions
Fintan Costello | John Kelleher | Martin Volk
Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions

Comparing French PP-attachment to English, German and Swedish
Martin Volk | Frida Tidström
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)


How Bad is the Problem of PP-Attachment? A Comparison of English, German and Swedish
Martin Volk
Proceedings of the Third ACL-SIGSEM Workshop on Prepositions

XML-based Phrase Alignment in Parallel Treebanks
Martin Volk | Sofia Gustafson-Capková | Joakim Lundborg | Torsten Marek | Yvonne Samuelsson | Frida Tidström
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing


Evaluation Resources for Concept-based Cross-Lingual Information Retrieval in the Medical Domain
Paul Buitelaar | Diana Steffen | Martin Volk | Dominic Widdows | Bogdan Sacaleanu | Špela Vintar | Stanley Peters | Hans Uszkoreit
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Bootstrapping Parallel Treebanks
Martin Volk | Yvonne Samuelsson
Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora


A Cross Language Document Retrieval System Based on Semantic Annotation
Bogdan Sacaleanu | Paul Buitelaar | Martin Volk


Combining Unsupervised and Supervised Methods for PP Attachment Disambiguation
Martin Volk
COLING 2002: The 19th International Conference on Computational Linguistics


Evaluating Translation Quality as Input to Product Development
Niamh Bohan | Elisabeth Breidt | Martin Volk
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)


Experiences with the GTU grammar development environment
Martin Volk
Computational Environments for Grammar Development and Linguistic Engineering

Probing the Lexicon in Evaluating Commercial MT Systems
Martin Volk
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics


The Role of Testing in Grammar Engineering
Martin Volk
Third Conference on Applied Natural Language Processing


The Logical Structure of English: Computing Semantic Content
Martin Volk
Computational Linguistics, Volume 17, Number 3, September 1991