Martin Theobald


2026

Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.
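As a rough illustration of what claim-level faithfulness can mean in practice, the sketch below splits an answer into sentence-level claims and marks each claim as supported if enough of its content words appear in the retrieved context. The function names, stopword list, and overlap threshold are assumptions made for this example; this is a simplified stand-in, not RAGVUE's actual metric or API.

```python
# Illustrative claim-level faithfulness check (NOT the RAGVUE API):
# split an answer into sentence-level "claims" and mark each claim as
# supported if enough of its content words appear in the retrieved context.

def content_words(text):
    stop = {"the", "a", "an", "is", "are", "was", "were", "of", "in", "to", "and"}
    return {w.strip(".,;:!?").lower() for w in text.split()} - stop - {""}

def claim_faithfulness(answer, context, threshold=0.6):
    """Return (score, per_claim): score is the fraction of claims whose
    content-word overlap with the context meets the threshold."""
    ctx_words = content_words(context)
    claims = [c.strip() for c in answer.replace("!", ".").replace("?", ".").split(".") if c.strip()]
    per_claim = []
    for claim in claims:
        words = content_words(claim)
        overlap = len(words & ctx_words) / len(words) if words else 0.0
        per_claim.append((claim, overlap >= threshold))
    score = sum(ok for _, ok in per_claim) / len(per_claim) if per_claim else 0.0
    return score, per_claim

context = "Paris is the capital of France. It hosted the 2024 Olympics."
answer = "Paris is the capital of France. Paris has ten million residents."
score, details = claim_faithfulness(answer, context)
# The first claim is fully grounded in the context; the second is not,
# so the overall faithfulness score is 0.5.
```

A production system would replace the lexical-overlap test with an LLM judge or entailment model, but the decomposition into per-claim verdicts, each with its own evidence, is what makes the evaluation diagnosable rather than a single opaque score.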

2025

Analyzing historical discourse in large-scale newspaper archives requires scalable and interpretable methods to uncover hidden themes. This study systematically evaluates topic modeling approaches for newspaper articles from 1955 to 2018, comparing probabilistic LDA, matrix-factorization-based NMF, and neural models such as Top2Vec and BERTopic across various preprocessing strategies. We benchmark these methods on topic coherence, diversity, scalability, and interpretability. While LDA is commonly used in historical text analysis, our findings demonstrate that BERTopic, leveraging contextual embeddings, consistently outperforms classical models in all tested aspects, making it a more robust choice for large-scale textual corpora. Additionally, we highlight the trade-offs between preprocessing strategies and model performance, emphasizing the importance of tailored pipeline design. These insights advance the field of historical NLP, offering concrete guidance for historians and computational social scientists in selecting the most effective topic-modeling approach for analyzing digitized archives. Our code will be publicly available on GitHub.
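Topic coherence, one of the benchmark criteria above, can be sketched with a simplified NPMI (normalized pointwise mutual information) score over document-level co-occurrences. The function and toy corpus below are illustrative assumptions, not the evaluation code used in the study, which would rely on full-scale corpora and established coherence implementations.

```python
# Illustrative NPMI-style topic coherence: a topic's words should
# co-occur in the same documents more often than chance predicts.
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of a topic, using document-level
    co-occurrence counts with add-one smoothing."""
    docs = [set(doc.lower().split()) for doc in documents]
    n = len(docs)
    def p(*words):
        return (sum(all(w in d for w in words) for d in docs) + 1) / (n + 1)
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        pmi = math.log(p(w1, w2) / (p(w1) * p(w2)))
        scores.append(pmi / -math.log(p(w1, w2)))  # normalize to [-1, 1]
    return sum(scores) / len(scores)

docs = [
    "inflation prices economy market",
    "economy market trade inflation",
    "football match goal team",
    "team match season goal",
]
coherent = npmi_coherence(["inflation", "economy", "market"], docs)
mixed = npmi_coherence(["inflation", "football", "market"], docs)
# A topic drawn from one theme scores higher than one mixing themes.
```

Scoring each model's top-N topic words with a measure like this, averaged over topics, is what allows LDA, NMF, Top2Vec, and BERTopic to be compared on a common coherence scale despite their very different internals.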

2016

Methods for Named Entity Recognition and Disambiguation (NERD) perform NER and NED in two separate stages. Therefore, NED may be penalized in precision by NER false positives, and may suffer in recall from NER false negatives. Conversely, NED does not fully exploit information computed by NER, such as the types of mentions. This paper presents J-NERD, a new approach that performs NER and NED jointly, by means of a probabilistic graphical model that captures mention spans, mention types, and the mapping of mentions to entities in a knowledge base. We present experiments with different kinds of texts from the CoNLL’03, ACE’05, and ClueWeb’09-FACC1 corpora. J-NERD consistently outperforms state-of-the-art competitors in end-to-end NERD precision, recall, and F1.
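The pipeline-vs-joint distinction can be illustrated with a toy decoder. All span scores, entity candidates, and the product-of-scores objective below are invented for illustration and do not reproduce the J-NERD graphical model; they only show how committing to the best NER span first can lock NED out of the globally best analysis.

```python
# Toy illustration of pipeline vs. joint decoding (not the J-NERD model):
# each candidate pairs a mention span with a KB entity. The pipeline
# commits to the best span before linking; joint decoding maximizes the
# combined score over (span, entity) pairs directly.

# (span_score, {entity: link_score}) for two overlapping span candidates,
# e.g. "New York" vs. "New York Times" in "New York Times reported ...".
candidates = {
    "New York": (0.55, {"New_York_City": 0.30}),
    "New York Times": (0.45, {"The_New_York_Times": 0.90}),
}

def pipeline_decode(cands):
    span = max(cands, key=lambda s: cands[s][0])          # NER decides alone
    entity = max(cands[span][1], key=cands[span][1].get)  # NED gets no say
    return span, entity

def joint_decode(cands):
    span, entity, _ = max(
        ((s, e, sp * link) for s, (sp, ents) in cands.items()
         for e, link in ents.items()),
        key=lambda t: t[2],
    )
    return span, entity
```

Here the pipeline picks the higher-scoring but wrong span "New York" and is forced into a poor entity link, while the joint objective recovers the full mention "New York Times" because its strong entity evidence outweighs the slightly weaker span score, which is precisely the failure mode the abstract attributes to two-stage NERD.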