Matthias Hertel


2023

Fast Whitespace Correction with Encoder-Only Transformers
Hannah Bast | Matthias Hertel | Sebastian Walter
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

The goal of whitespace correction is to fix space errors in arbitrary given text. For example, given the text “whi te space correctio nwithTransf or mers”, produce “whitespace correction with Transformers”. We compare two Transformer-based models, a character-level encoder-decoder model and a byte-level encoder-only model. We find that the encoder-only model is both faster and achieves higher quality. We provide an easy-to-use tool that is over 900 times faster than the previous best tool, with the same high quality. Our tool repairs text at a rate of over 200 kB/s on GPU, with a sequence-averaged F1-score ranging from 87.5% for hard-to-correct text up to 99% for text without any spaces.
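The encoder-only framing lends itself to a compact illustration. The sketch below is our own illustration, not the authors' code: it reduces whitespace correction to per-character classification by removing all spaces from the input and assigning each remaining character a label saying whether a space belongs before it. A trained encoder would predict these labels; the sketch only shows the label encoding and decoding, and it ignores that the real system can also use the (possibly wrong) spaces of the input as a signal.

def to_labels(correct_text: str) -> tuple[str, list[int]]:
    """Turn ground-truth text into (space-free input, per-character labels).

    Label 1 means "insert a space before this character", 0 means no space.
    Runs of spaces collapse to a single space on reconstruction.
    """
    chars: list[str] = []
    labels: list[int] = []
    pending_space = False
    for ch in correct_text:
        if ch == " ":
            pending_space = True
            continue
        chars.append(ch)
        labels.append(1 if pending_space else 0)
        pending_space = False
    return "".join(chars), labels

def apply_labels(text: str, labels: list[int]) -> str:
    """Reconstruct spaced text from a space-free string and predicted labels."""
    out: list[str] = []
    for ch, label in zip(text, labels):
        if label == 1:
            out.append(" ")
        out.append(ch)
    return "".join(out)

stripped, labels = to_labels("whitespace correction with Transformers")
print(stripped)                        # whitespacecorrectionwithTransformers
print(apply_labels(stripped, labels))  # whitespace correction with Transformers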

2022

ELEVANT: A Fully Automatic Fine-Grained Entity Linking Evaluation and Analysis Tool
Hannah Bast | Matthias Hertel | Natalie Prange
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present Elevant, a tool for the fully automatic fine-grained evaluation of a set of entity linkers on a set of benchmarks. Elevant provides an automatic breakdown of the performance by various error categories and by entity type. Elevant also provides a rich and compact, yet intuitive and self-explanatory, visualization of a linker's results on a benchmark in comparison to the ground truth. A live demo, a link to the complete code base on GitHub, and a link to a demo video are provided at https://elevant.cs.uni-freiburg.de.
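To make that kind of breakdown concrete, here is a minimal sketch of per-type evaluation under an assumed data model (mentions as character-offset triples, a hypothetical entity-to-type mapping with made-up Wikidata-style ids); it is not Elevant's actual code.

from collections import defaultdict

# Hypothetical example data: mentions are (start, end, entity_id) triples;
# all spans and ids below are made up for illustration.
gold = {(0, 6, "Q64"), (25, 31, "Q183")}
pred = {(0, 6, "Q64"), (25, 31, "Q1206")}
entity_type = {"Q64": "LOC", "Q183": "LOC", "Q1206": "ORG"}

def per_type_scores(gold, pred, entity_type):
    """Precision/recall/F1 per entity type, comparing predictions to gold."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for mention in pred:
        t = entity_type[mention[2]]
        if mention in gold:
            tp[t] += 1
        else:
            fp[t] += 1
    for mention in gold - pred:
        fn[entity_type[mention[2]]] += 1
    scores = {}
    for t in set(tp) | set(fp) | set(fn):
        p = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        r = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[t] = {"precision": p, "recall": r, "f1": f1}
    return scores

for t, s in sorted(per_type_scores(gold, pred, entity_type).items()):
    print(t, s)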

2021

Tokenization Repair in the Presence of Spelling Errors
Hannah Bast | Matthias Hertel | Mostafa M. Mohamed
Proceedings of the 25th Conference on Computational Natural Language Learning

We consider the following tokenization repair problem: given a natural language text with any combination of missing or spurious spaces, correct these. Spelling errors can be present, but correcting them is not part of the problem. For example, given "Tispa per isabout token izaionrep air", compute "Tis paper is about tokenizaion repair" (the spelling errors are deliberately left in place). We identify three key ingredients of high-quality tokenization repair, all missing from previous work: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present in the input. Our methods also improve existing spell checkers, enabling them to fix not only more tokenization errors but also more spelling errors: once it is clear which characters form a word, it is much easier for a spell checker to figure out the correct word. We provide six benchmarks that cover three use cases (OCR errors, text extraction from PDF, human errors) and the cases of partially correct space information and of all spaces missing. We evaluate our methods against the best existing methods and a non-trivial baseline. We provide full reproducibility at https://ad.informatik.uni-freiburg.de/publications.
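For intuition on the all-spaces-missing case, the following is a minimal sketch of a classic dictionary-based dynamic-programming segmenter; it is purely illustrative, and is neither the paper's deep-language-model methods nor necessarily its baseline. The toy word counts are made up.

import math

# Hypothetical toy unigram counts; a real system would estimate these from a corpus.
COUNTS = {"this": 50, "paper": 30, "is": 80, "about": 40, "tokenization": 5, "repair": 10}
TOTAL = sum(COUNTS.values())

def log_prob(word: str) -> float:
    # Heavily penalize out-of-vocabulary words instead of forbidding them,
    # so that every input remains segmentable.
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1 / TOTAL) - len(word)  # crude OOV penalty

def segment(text: str, max_word_len: int = 20) -> str:
    """Insert spaces so that the sum of word log-probabilities is maximal."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: score of the best split of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word in that split
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j] + log_prob(text[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    # Recover the words by walking the backpointers from the end.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return " ".join(reversed(words))

print(segment("thispaperisabouttokenizationrepair"))
# -> this paper is about tokenization repair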