2024
pdf
abs
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
|
Michael Fromm
|
Klaudia Thellmann
|
Richard Rutmann
|
Max Lübbering
|
Johannes Leveling
|
Katrin Klug
|
Jan Ebert
|
Niclas Doll
|
Jasper Buschhoff
|
Charvi Jain
|
Alexander Weber
|
Lena Jurkschat
|
Hammam Abdelwahab
|
Chelsea John
|
Pedro Ortiz Suarez
|
Malte Ostendorff
|
Samuel Weinbach
|
Rafet Sifa
|
Stefan Kesselheim
|
Nicolas Flores-Herr
Findings of the Association for Computational Linguistics: NAACL 2024
The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot.Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model’s downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English. While English-centric tokenizers have been applied to the training of multi-lingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
pdf
bib
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024
Tiansi Dong
|
Erhard Hinrichs
|
Zhen Han
|
Kang Liu
|
Yangqiu Song
|
Yixin Cao
|
Christian F. Hempelmann
|
Rafet Sifa
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024
pdf
abs
Word Sense Disambiguation as a Game of Neurosymbolic Darts
Tiansi Dong
|
Rafet Sifa
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024
Word Sense Disambiguation (WSD) is one of the hardest tasks in natural language understanding and knowledge engineering. The glass ceiling of the 80% F1 score is recently achieved through supervised learning, enriched by knowledge graphs. Here, we propose a novel neurosymbolic methodology that may push the F1 score above 90%. The core of our methodology is a neurosymbolic sense embedding, in terms of a configuration of nested n-dimensional balls. The central point of a ball well preserves pre-trained word embeddings learned from data, which partially fixes the locations of balls. Inclusion relations among balls precisely encode symbolic hypernym relations among senses, and enable simple logic deduction among sense embeddings. We trained a Transformer to learn the mapping from a contextualized word embedding to its sense ball embedding, just like playing the game of darts (a game of shooting darts into a dartboard). A series of experiments are carried out using pre-training n ball embeddings, which cover around 70% training data and 75% testing data in the benchmark WSD corpus. Euclidean distance and cosine similarity functions are used as objective functions, separately, and each reaches >95.0% F1 score in the ALL-n-ball dataset. This substantially breaks the glass ceiling of deep learning methods. Future work is discussed to develop a full-fledged neurosymbolic WSD system that substantially outperforms deep learning approaches.
2020
pdf
abs
Fraunhofer IAIS at FinCausal 2020, Tasks 1 & 2: Using Ensemble Methods and Sequence Tagging to Detect Causality in Financial Documents
Maren Pielka
|
Rajkumar Ramamurthy
|
Anna Ladi
|
Eduardo Brito
|
Clayton Chapman
|
Paul Mayer
|
Rafet Sifa
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation
The FinCausal 2020 shared task aims to detect causality on financial news and identify those parts of the causal sentences related to the underlying cause and effect. We apply ensemble-based and sequence tagging methods for identifying causality, and extracting causal subsequences. Our models yield promising results on both sub-tasks, with the prospect of further improvement given more time and computing resources. With respect to task 1, we achieved an F1 score of 0.9429 on the evaluation data, and a corresponding ranking of 12/14. For task 2, we were ranked 6/10, with an F1 score of 0.76 and an ExactMatch score of 0.1912.