Benjamin Säfken


2024

pdf
Human in the Loop: How to Effectively Create Coherent Topics by Manually Labeling Only a Few Documents per Class
Anton F. Thielmann | Christoph Weisser | Benjamin Säfken
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Few-shot methods for accurate modeling under sparse label-settings have improved significantly. However, the applications of few-shot modeling in natural language processing remain solely in the field of document classification. With recent performance improvements, supervised few-shot methods, combined with a simple topic extraction method pose a significant challenge to unsupervised topic modeling methods. Our research shows that supervised few-shot learning, combined with a simple topic extraction method, can outperform unsupervised topic modeling techniques in terms of generating coherent topics, even when only a few labeled documents per class are used. The code is available at the following link: https://github.com/AnFreTh/STREAM

pdf
Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion
Anton Thielmann | Arik Reuter | Quentin Seifert | Elisabeth Bergherr | Benjamin Säfken
Computational Linguistics, Volume 50, Issue 2 - June 2023

Extracting and identifying latent topics in large text corpora have gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach of topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. Through simple corpus expansion, our model can detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task. We demonstrate the competitive performance of our method with a large benchmark study, and achieve superior results compared with state-of-the-art topic modeling and document clustering models. The code is available at the following link: https://github.com/AnFreTh/STREAM.