2020
pdf
abs
Recognizing Sentence-level Logical Document Structures with the Help of Context-free Grammars
Jonathan Hildebrand
|
Wahed Hemati
|
Alexander Mehler
Proceedings of the Twelfth Language Resources and Evaluation Conference
Current sentence boundary detectors split documents into sequentially ordered sentences by detecting their beginnings and ends. Sentences, however, are more deeply structured even on this side of constituent and dependency structure: they can consist of a main sentence and several subordinate clauses as well as further segments (e.g. inserts in parentheses); they can even recursively embed whole sentences and then contain multiple sentence beginnings and ends. In this paper, we introduce a tool that segments sentences into tree structures to detect this type of recursive structure. To this end, we retrain different constituency parsers with the help of modified training data to transform them into sentence segmenters. With these segmenters, documents are mapped to sequences of sentence-related “logical document structures”. The resulting segmenters aim to improve downstream tasks by providing additional structural information. In this context, we experiment with German dependency parsing. We show that for certain sentence categories, which can be determined automatically, improvements in German dependency parsing can be achieved using our segmenter for preprocessing. The assumption suggests that improvements in other languages and tasks can be achieved.
pdf
abs
Voting for POS tagging of Latin texts: Using the flair of FLAIR to better Ensemble Classifiers by Example of Latin
Manuel Stoeckel
|
Alexander Henlein
|
Wahed Hemati
|
Alexander Mehler
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
Despite the great importance of the Latin language in the past, there are relatively few resources available today to develop modern NLP tools for this language. Therefore, the EvaLatin Shared Task for Lemmatization and Part-of-Speech (POS) tagging was published in the LT4HALA workshop. In our work, we dealt with the second EvaLatin task, that is, POS tagging. Since most of the available Latin word embeddings were trained on either few or inaccurate data, we trained several embeddings on better data in the first step. Based on these embeddings, we trained several state-of-the-art taggers and used them as input for an ensemble classifier called LSTMVoter. We were able to achieve the best results for both the cross-genre and the cross-time task (90.64% and 87.00%) without using additional annotated data (closed modality). In the meantime, we further improved the system and achieved even better results (96.91% on classical, 90.87% on cross-genre and 87.35% on cross-time).
2019
pdf
bib
abs
When Specialization Helps: Using Pooled Contextualized Embeddings to Detect Chemical and Biomedical Entities in Spanish
Manuel Stoeckel
|
Wahed Hemati
|
Alexander Mehler
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks
The recognition of pharmacological substances, compounds and proteins is an essential preliminary work for the recognition of relations between chemicals and other biomedically relevant units. In this paper, we describe an approach to Task 1 of the PharmaCoNER Challenge, which involves the recognition of mentions of chemicals and drugs in Spanish medical texts. We train a state-of-the-art BiLSTM-CRF sequence tagger with stacked Pooled Contextualized Embeddings, word and sub-word embeddings using the open-source framework FLAIR. We present a new corpus composed of articles and papers from Spanish health science journals, termed the Spanish Health Corpus, and use it to train domain-specific embeddings which we incorporate in our model training. We achieve a result of 89.76% F1-score using pre-trained embeddings and are able to improve these results to 90.52% F1-score using specialized embeddings.
2018
pdf
FastSense: An Efficient Word Sense Disambiguation Classifier
Tolga Uslu
|
Alexander Mehler
|
Daniel Baumartz
|
Wahed Hemati
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
pdf
abs
TextImager as a Generic Interface to R
Tolga Uslu
|
Wahed Hemati
|
Alexander Mehler
|
Daniel Baumartz
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics
R is a very powerful framework for statistical modeling. Thus, it is of high importance to integrate R with state-of-the-art tools in NLP. In this paper, we present the functionality and architecture of such an integration by means of TextImager. We use the OpenCPU API to integrate R based on our own R-Server. This allows for communicating with R-packages and combining them with TextImager’s NLP-components.
2016
pdf
Text2voronoi: An Image-driven Approach to Differential Diagnosis
Alexander Mehler
|
Tolga Uslu
|
Wahed Hemati
Proceedings of the 5th Workshop on Vision and Language
pdf
abs
TextImager: a Distributed UIMA-based System for NLP
Wahed Hemati
|
Tolga Uslu
|
Alexander Mehler
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations
More and more disciplines require NLP tools for performing automatic text analyses on various levels of linguistic resolution. However, the usage of established NLP frameworks is often hampered for several reasons: in most cases, they require basic to sophisticated programming skills, interfere with interoperability due to using non-standard I/O-formats and often lack tools for visualizing computational results. This makes it difficult especially for humanities scholars to use such frameworks. In order to cope with these challenges, we present TextImager, a UIMA-based framework that offers a range of NLP and visualization tools by means of a user-friendly GUI. Using TextImager requires no programming skills.