2024
Common European Language Data Space
Georg Rehm | Stelios Piperidis | Khalid Choukri | Andrejs Vasiļjevs | Katrin Marheinecke | Victoria Arranz | Aivars Bērziņš | Miltos Deligiannis | Dimitris Galanis | Maria Giagkou | Katerina Gkirtzou | Dimitris Gkoumas | Annika Grützner-Zahn | Athanasia Kolovou | Penny Labropoulou | Andis Lagzdiņš | Elena Leitner | Valérie Mapelli | Hélène Mazo | Simon Ostermann | Stefania Racioppa | Mickaël Rigault | Leon Voukoutis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The Common European Language Data Space (LDS) is an integral part of the EU data strategy, which aims to develop a single market for data. Its decentralised technical infrastructure and governance scheme are currently being developed by the LDS project, which also includes dedicated tasks for building proof-of-concept prototypes, handling legal aspects, raising awareness, and promoting the LDS through events and social media channels. The LDS is part of a broader vision of establishing all the components needed to develop European large language models.
HybridBERT - Making BERT Pretraining More Efficient Through Hybrid Mixture of Attention Mechanisms
Gokul Srinivasagan | Simon Ostermann
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Pretrained transformer-based language models have produced state-of-the-art performance on most natural language understanding tasks. These models undergo two stages of training: pretraining on a huge corpus of data and fine-tuning on a specific downstream task. The pretraining phase is extremely compute-intensive, requiring high-performance hardware such as GPUs and days or even months of training, yet it is crucial for the model to capture global knowledge and also has a significant impact on downstream fine-tuning. This is a major roadblock for researchers without access to sophisticated computing resources. To overcome this challenge, we propose two novel hybrid architectures called HybridBERT (HBERT), which combine self-attention and additive attention mechanisms with sub-layer normalization. We introduce a computing budget for the pretraining phase, limiting training time and restricting usage to a single GPU. We show that HBERT attains twice the pretraining accuracy of a vanilla-BERT baseline. We also evaluate our proposed models on two downstream tasks, where we outperform BERT-base while accelerating inference. Moreover, we study the effect of weight initialization under a limited pretraining budget. The code and models are publicly available at: www.github.com/gokulsg/HBERT/.
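
The abstract does not spell out how the two attention mechanisms are combined; the sketch below is a hypothetical illustration only (the authors' actual implementation is at the GitHub link above). It pairs standard multi-head self-attention with a linear-time additive attention that pools the sequence into one global vector, using pre-sub-layer normalization; the even 50/50 blend of the two branches is our assumption.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Linear-time attention: a softmax over per-token scalar scores yields a
    # global summary vector, which is mixed back into every token position.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (batch, seq, dim)
        alpha = torch.softmax(self.score(x), dim=1)        # (batch, seq, 1)
        global_ctx = (alpha * x).sum(dim=1, keepdim=True)  # (batch, 1, dim)
        return self.proj(x * global_ctx)                   # broadcast over seq

class HybridBlock(nn.Module):
    # One encoder block mixing self-attention and additive attention, with
    # pre-sub-layer normalization. The 50/50 mixture is an assumption.
    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.add_attn = AdditiveAttention(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        sa, _ = self.self_attn(h, h, h, need_weights=False)
        x = x + 0.5 * (sa + self.add_attn(h))              # hybrid mixture
        return x + self.ffn(self.norm2(x))

# Example: one block over a batch of 2 sequences of length 16, width 256.
block = HybridBlock(dim=256, heads=4)
out = block(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 16, 256])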
2023
Investigating the Encoding of Words in BERT’s Neurons Using Feature Textualization
Tanja Baeumel | Soniya Vijayakumar | Josef van Genabith | Guenter Neumann | Simon Ostermann
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Pretrained language models (PLMs) form the basis of most state-of-the-art NLP technologies. Nevertheless, they are essentially black boxes: humans do not have a clear understanding of what knowledge is encoded in different parts of the models, especially in individual neurons. This contrasts with computer vision, where feature visualization provides a decompositional interpretability technique for neurons of vision models: activation maximization is used to synthesize inherently interpretable visual representations of the information encoded in individual neurons. Our work is inspired by this, but presents a cautionary tale on the interpretability of single neurons, based on the first large-scale attempt to adapt activation maximization to NLP, and more specifically to large PLMs. We propose feature textualization, a technique to produce dense representations of neurons in the PLM word embedding space. We apply feature textualization to the BERT model to investigate whether the knowledge encoded in individual neurons can be interpreted and symbolized. We find that the produced representations can provide insights about the knowledge encoded in individual neurons, but that individual neurons do not represent clear-cut symbolic units of language such as words. Additionally, we use feature textualization to investigate how many neurons are needed to encode words in BERT.
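
As a rough, hypothetical sketch of what activation maximization looks like when moved into a PLM's input embedding space (this is not the paper's released code, and the layer and neuron indices are arbitrary): a free input-embedding matrix is optimized to maximize one hidden neuron's activation, then read out as the nearest vocabulary embeddings.

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased").eval()
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for p in model.parameters():          # optimize only the input, not BERT
    p.requires_grad_(False)

emb_table = model.get_input_embeddings().weight   # (vocab_size, hidden_dim)
layer, neuron, seq_len = 8, 300, 8                # arbitrary target neuron
x = torch.randn(1, seq_len, emb_table.size(1), requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    out = model(inputs_embeds=x, output_hidden_states=True)
    act = out.hidden_states[layer][0, :, neuron].mean()   # target activation
    (-act).backward()                                     # gradient ascent
    opt.step()

# "Textualize": snap each optimized position to its nearest vocabulary item.
nearest = torch.cdist(x.detach()[0], emb_table).argmin(dim=-1)
print(tok.convert_ids_to_tokens(nearest.tolist()))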
Find-2-Find: Multitask Learning for Anaphora Resolution and Object Localization
Cennet Oguz | Pascal Denis | Emmanuel Vincent | Simon Ostermann | Josef van Genabith
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
In multimodal understanding tasks, visual and linguistic ambiguities can arise. Visual ambiguity occurs when a model must ground a referring expression to a visual object in a video without strong supervision, while linguistic ambiguity arises from changes to entities in action flows. As an example from the cooking domain, “oil” mixed with “salt” and “pepper” could later be referred to as a “mixture”. Without a clear visual-linguistic alignment, we cannot know which of several objects shown is referred to by the language expression “mixture”, and without resolved antecedents, we cannot pinpoint what the mixture is. We define this chicken-and-egg problem as visual-linguistic ambiguity. In this paper, we present Find2Find, a joint anaphora resolution and object localization dataset targeting the problem of visual-linguistic ambiguity, consisting of 500 anaphora-annotated recipes with corresponding videos. We present experimental results of a novel end-to-end joint multitask learning framework for Find2Find that fuses visual and textual information, and we show improvements for both anaphora resolution and object localization with a single joint multitask model, as compared to a strong single-task baseline.
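
To make the multitask setup concrete, here is a minimal hypothetical sketch (not the Find2Find code): a shared fused text-video representation feeds two heads, one scoring candidate antecedents and one scoring candidate video regions, and the two losses are summed so that each task regularizes the other. All dimensions, the fusion scheme, and the loss weight are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFindModel(nn.Module):
    # Hypothetical sketch: shared fusion layer + two task-specific heads.
    def __init__(self, text_dim=768, video_dim=512, hidden=256, n_regions=36):
        super().__init__()
        self.fuse = nn.Linear(text_dim + video_dim, hidden)
        self.anaphora_head = nn.Linear(hidden, 1)          # antecedent score
        self.localize_head = nn.Linear(hidden, n_regions)  # region scores

    def forward(self, text_feat, video_feat):
        h = torch.relu(self.fuse(torch.cat([text_feat, video_feat], dim=-1)))
        return self.anaphora_head(h).squeeze(-1), self.localize_head(h)

def joint_loss(ana_logits, loc_logits, ana_gold, loc_gold, w=0.5):
    # Weighted sum of the two task losses; w is a tunable assumption.
    ana = F.binary_cross_entropy_with_logits(ana_logits, ana_gold)
    loc = F.cross_entropy(loc_logits, loc_gold)
    return w * ana + (1 - w) * loc

# Example training step on random features for a batch of 4 mentions.
model = JointFindModel()
ana_logits, loc_logits = model(torch.randn(4, 768), torch.randn(4, 512))
loss = joint_loss(ana_logits, loc_logits,
                  torch.randint(0, 2, (4,)).float(), torch.randint(0, 36, (4,)))
loss.backward()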
2019
MCScript2.0: A Machine Comprehension Corpus Focused on Script Events and Participants
Simon Ostermann | Michael Roth | Manfred Pinkal
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)
We introduce MCScript2.0, a machine comprehension corpus for the end-to-end evaluation of script knowledge. MCScript2.0 contains approx. 20,000 questions on approx. 3,500 texts, crowdsourced based on a new collection process that results in challenging questions. Half of the questions cannot be answered from the reading texts, but require the use of commonsense and, in particular, script knowledge. We give a thorough analysis of our corpus and show that while the task is not challenging to humans, existing machine comprehension models fail to perform well on the data, even if they make use of a commonsense knowledge base. The dataset is available at http://www.sfb1102.uni-saarland.de/?page_id=2582
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing
Simon Ostermann | Sheng Zhang | Michael Roth | Peter Clark
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing
Commonsense Inference in Natural Language Processing (COIN) - Shared Task Report
Simon Ostermann | Sheng Zhang | Michael Roth | Peter Clark
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing
This paper reports on the results of the shared tasks of the COIN workshop at EMNLP-IJCNLP 2019. The tasks consisted of two machine comprehension evaluations, each of which tested a system’s ability to answer questions/queries about a text. Both evaluations were designed such that systems need to exploit commonsense knowledge, for example, in the form of inferences over information that is available in the common ground but not necessarily mentioned in the text. A total of five participating teams submitted systems for the shared tasks, with the best submitted system achieving 90.6% accuracy and 83.7% F1-score on task 1 and task 2, respectively.
2018
Mapping Texts to Scripts: An Entailment Study
Simon Ostermann | Hannah Seitz | Stefan Thater | Manfred Pinkal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge
Simon Ostermann | Ashutosh Modi | Michael Roth | Stefan Thater | Manfred Pinkal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
SemEval-2018 Task 11: Machine Comprehension Using Commonsense Knowledge
Simon Ostermann | Michael Roth | Ashutosh Modi | Stefan Thater | Manfred Pinkal
Proceedings of the 12th International Workshop on Semantic Evaluation
This report summarizes the results of the SemEval 2018 task on machine comprehension using commonsense knowledge. For this machine comprehension task, we created a new corpus, MCScript, which contains a large number of questions that require commonsense knowledge to find the correct answer. 11 teams from 4 different countries participated in this shared task, most of them using neural approaches. The best performing system achieves an accuracy of 83.95%, outperforming the baselines by a large margin but still falling far short of the human upper bound, which was found to be 98%.
2017
Aligning Script Events with Narrative Texts
Simon Ostermann | Michael Roth | Stefan Thater | Manfred Pinkal
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)
Script knowledge plays a central role in text understanding and is relevant for a variety of downstream tasks. In this paper, we consider two recent datasets which provide a rich and general representation of script events in terms of paraphrase sets. We introduce the task of mapping event mentions in narrative texts to such script event types, and present a model for this task that exploits rich linguistic representations as well as information on temporal ordering. The results of our experiments demonstrate that this complex task is indeed feasible.
2016
InScript: Narrative texts annotated with script information
Ashutosh Modi | Tatjana Anikina | Simon Ostermann | Manfred Pinkal
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents the InScript corpus (Narrative Texts Instantiating Script structure). InScript is a corpus of 1,000 stories centered around 10 different scenarios. Verbs and noun phrases are annotated with event and participant types, respectively. Additionally, the text is annotated with coreference information. The corpus shows rich lexical variation and will serve as a unique resource for the study of the role of script knowledge in natural language processing.
2015
Annotating Entailment Relations for Shortanswer Questions
Simon Ostermann | Andrea Horbach | Manfred Pinkal
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications
2014
Paraphrase Detection for Short Answer Scoring
Nikolina Koleva | Andrea Horbach | Alexis Palmer | Simon Ostermann | Manfred Pinkal
Proceedings of the third workshop on NLP for computer-assisted language learning