Helmut Schmid

2020

The LMU Munich System for the WMT20 Very Low Resource Supervised MT Task
Jindřich Libovický | Viktor Hangya | Helmut Schmid | Alexander Fraser
Proceedings of the Fifth Conference on Machine Translation

We present our systems for the WMT20 Very Low Resource MT Task for translation between German and Upper Sorbian. For training our systems, we generate synthetic data by both back- and forward-translation. Additionally, we enrich the training data with German-Czech translated from Czech to Upper Sorbian by an unsupervised statistical MT system incorporating orthographically similar word pairs and transliterations of OOV words. Our best translation system between German and Sorbian is based on transfer learning from a Czech-German system and scores 12 to 13 BLEU higher than a baseline system built using the available parallel data only.

Automatically Identifying Words That Can Serve as Labels for Few-Shot Text Classification
Timo Schick | Helmut Schmid | Hinrich Schütze
Proceedings of the 28th International Conference on Computational Linguistics

A recent approach for few-shot text classification is to convert textual inputs to cloze questions that contain some form of task description, process them with a pretrained language model and map the predicted words to labels. Manually defining this mapping between words and labels requires both domain expertise and an understanding of the language model’s abilities. To mitigate this issue, we devise an approach that automatically finds such a mapping given small amounts of training data. For a number of tasks, the mapping found by our approach performs almost as well as hand-crafted label-to-word mappings.

2017

Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining
Hassan Sajjad | Helmut Schmid | Alexander Fraser | Hinrich Schütze
Computational Linguistics, Volume 43, Issue 2 - June 2017

We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration sub-model learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
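The interpolation and posterior-based disambiguation described in this abstract can be illustrated with a toy two-component mixture. This is only a minimal sketch under strong assumptions, not the paper's model: the per-pair probabilities `p_translit` and `p_noise` are hypothetical inputs that are held fixed, and only the interpolation weight is re-estimated by EM, whereas the abstract states that the transliteration sub-model itself is learned during training. The function names are illustrative as well.

```python
from typing import List


def em_noise_weight(p_translit: List[float], p_noise: List[float],
                    iterations: int = 100, lam: float = 0.5) -> float:
    """Estimate the prior weight `lam` of the noise component with EM.

    Assumes every pair has a positive probability under at least one
    component, so the mixture probability is never zero.
    """
    for _ in range(iterations):
        # E-step: posterior probability that each pair was generated by noise.
        post_noise = [lam * pn / ((1.0 - lam) * pt + lam * pn)
                      for pt, pn in zip(p_translit, p_noise)]
        # M-step: the new weight is the average noise posterior.
        lam = sum(post_noise) / len(post_noise)
    return lam


def translit_posteriors(p_translit: List[float], p_noise: List[float],
                        lam: float) -> List[float]:
    """Posterior probability that each pair is a transliteration pair."""
    return [(1.0 - lam) * pt / ((1.0 - lam) * pt + lam * pn)
            for pt, pn in zip(p_translit, p_noise)]


# Hypothetical scores for four word pairs (not taken from the paper).
toy_translit = [0.9, 0.8, 0.001, 0.002]
toy_noise = [0.05, 0.05, 0.05, 0.05]

toy_lam = em_noise_weight(toy_translit, toy_noise)
toy_post = translit_posteriors(toy_translit, toy_noise, toy_lam)
# Pairs whose posterior exceeds 0.5 are labelled as transliterations.
toy_labels = [p > 0.5 for p in toy_post]
```

In this sketch, a pair is kept as a transliteration when its posterior under the transliteration component exceeds 0.5; the threshold is an assumption made for illustration.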

This paper describes general requirements for evaluating and documenting NLP tools with a focus on morphological analysers and the design of a Gold Standard. It is argued that any evaluation must be measurable and documentation thereof must be made accessible for any user of the tool. The documentation must be of a kind that it enables the user to compare different tools offering the same service, hence the descriptions must contain measurable values. A Gold Standard presents a vital part of any measurable evaluation process, therefore, the corpus-based design of a Gold Standard, its creation and problems that occur are reported upon here. Our project concentrates on SMOR, a morphological analyser for German that is to be offered as a web-service. We not only utilize this analyser for designing the Gold Standard, but also evaluate the tool itself at the same time. Note that the project is ongoing, therefore, we cannot present final results.

A Corpus Representation Format for Linguistic Web Services: The D-SPIN Text Corpus Format and its Relationship with ISO Standards
Ulrich Heid | Helmut Schmid | Kerstin Eckart | Erhard Hinrichs
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In the framework of the preparation of linguistic web services for corpus processing, the need for a representation format was felt, which supports interoperability between different web services in a corpus processing pipeline, but also provides a well-defined interface to both, legacy tools and their data formats and upcoming international standards. We present the D-SPIN text corpus format, TCF, which was designed for this purpose. It is a stand-off XML format, inspired by the philosophy of the emerging standards LAF (Linguistic Annotation Framework) and its “instances” MAF for morpho-syntactic annotation and SynAF for syntactic annotation. Tools for the exchange with existing (best practice) formats are available, and a converter from MAF to TCF is being tested in spring 2010. We describe the usage scenario where TCF is embedded and the properties and architecture of TCF. We also give examples of TCF encoded data and describe the aspects of syntactic and semantic interoperability already addressed.

Hindi-to-Urdu Machine Translation through Transliteration
Nadir Durrani | Hassan Sajjad | Alexander Fraser | Helmut Schmid
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

Tagging Urdu Text with Parts of Speech: A Tagger Comparison
Hassan Sajjad | Helmut Schmid
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Estimation of Conditional Probabilities With Decision Trees and an Application to Fine-Grained POS Tagging
Helmut Schmid | Florian Laws
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Combining EM Training and the MDL Principle for an Automatic Verb Classification Incorporating Selectional Preferences
Sabine Schulte im Walde | Christian Hying | Christian Scheible | Helmut Schmid
Proceedings of ACL-08: HLT

2007

Phonological Constraints and Morphological Preprocessing for Grapheme-to-Phoneme Conversion
Vera Demberg | Helmut Schmid | Gregor Möhler
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Trace Prediction and Recovery with Unlexicalized PCFGs and Slash Features
Helmut Schmid
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

Disambiguation of Morphological Structure using a PCFG
Helmut Schmid
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors
Helmut Schmid
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

New Statistical Methods for Phrase Break Prediction
Helmut Schmid | Michaela Atterer
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

SMOR: A German Computational Morphology Covering Derivation, Composition and Inflection
Helmut Schmid | Arne Fitschen | Ulrich Heid
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

A Generative Probability Model for Unification-Based Grammars
Helmut Schmid
COLING 2002: The 19th International Conference on Computational Linguistics

Lexicalization of Probabilistic Grammars
Helmut Schmid
COLING 2002: The 19th International Conference on Computational Linguistics

2001

Parse Forest Computation of Expected Governors
Helmut Schmid | Mats Rooth
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

2000

Robust German Noun Chunking With a Probabilistic Context-Free Grammar
Helmut Schmid | Sabine Schulte im Walde
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

1997

Parsing by Successive Approximation
Helmut Schmid
Proceedings of the Fifth International Workshop on Parsing Technologies

It is proposed to parse feature structure-based grammars in several steps. Each step is aimed to eliminate as many invalid analyses as possible as efficiently as possible. To this end the set of feature constraints is divided into three subsets, a set of context-free constraints, a set of filtering constraints and a set of structure-building constraints, which are solved in that order. The best processing strategy differs: Context-free constraints are solved efficiently with one of the well-known algorithms for context-free parsing. Filtering constraints can be solved using unification algorithms for non-disjunctive feature structures whereas structure-building constraints require special techniques to represent feature structures with embedded disjunctions efficiently. A compilation method and an efficient processing strategy for filtering constraints are presented.