Christoph Friedrich
2023
On the Impact of Cross-Domain Data on German Language Models
Amin Dada | Aokun Chen | Cheng Peng | Kaleb Smith | Ahmad Idrissi-Yaghir | Constantin Seibold | Jianning Li | Lars Heiliger | Christoph Friedrich | Daniel Truhn | Jan Egger | Jiang Bian | Jens Kleesiek | Yonghui Wu
Findings of the Association for Computational Linguistics: EMNLP 2023
Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state-of-the-art.
2022
Cross-Language Transfer of High-Quality Annotations: Combining Neural Machine Translation with Cross-Linguistic Span Alignment to Apply NER to Clinical Texts in a Low-Resource Language
Henning Schäfer | Ahmad Idrissi-Yaghir | Peter Horn | Christoph Friedrich
Proceedings of the 4th Clinical Natural Language Processing Workshop
In this work, cross-linguistic span prediction based on contextualized word embedding models is used together with neural machine translation (NMT) to transfer and apply state-of-the-art models in natural language processing (NLP) to a low-resource language clinical corpus. Two directions are evaluated: (a) English models can be applied to translated texts to subsequently transfer the predicted annotations to the source language and (b) existing high-quality annotations can be transferred beyond translation and then used to train NLP models in the target language. Effectiveness and loss of transmission are evaluated using the German Berlin-Tübingen-Oncology Corpus (BRONCO) dataset with transferred external data from NCBI disease, SemEval-2013 drug-drug interaction (DDI) and i2b2/VA 2010 data. The use of English models for translated clinical texts has always involved attempts to take full advantage of the benefits associated with them (large pre-trained biomedical word embeddings). To advance work in this area, we provide a general-purpose pipeline to transfer any annotated data in BRAT or CoNLL format to various target languages. For the entity class medication, good results were obtained with an F1-score of 0.806 after re-alignment. Limited success occurred in the diagnosis and treatment class, with results just below an F1-score of 0.5 due to differences in annotation guidelines.