Laura Dietz
2026
UNH @ Rag4Reports: A Broad Exploration of LLM-Judges for RAG
Minna Tran | Ryan McCarthy | Aiden Parsons | Jaren Unzen | Laura Dietz
Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026)
We submitted a breadth of LLM-as-a-Judge approaches to Rag4Reports Task A; our top method ranked first among all submitted systems. We find that citation faithfulness is the most essential signal, and that content is best verified by checking whether cited documents cover nuggets generated from the LLM’s internal knowledge.
Crucible @ Rag4Reports: Generating Nuggets for Report Generation and Evaluation
Laura Dietz | Eugene Yang
Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026)
We submit to both tracks of the RAG4Reports challenge with two complementary components: PREFNUGGET, which derives concise nugget banks from pairwise preference judgments between system responses, and CRUCIBLE, a nugget-first pipeline that uses such banks to assemble reports on a given topic. The shared nugget-level representation unifies our approach to report evaluation (Task A) and report generation (Task B).
Sycophancy Negatively Affects LLM-as-a-Judge in Conflict Evaluation
Naghmeh Farzi | Laura Dietz | Samuel Carton
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
LLM-as-Judge systems are increasingly used to generate labels and evaluate conversational data, yet their susceptibility to narrative framing remains underexplored. We study whether replacing one speaker’s username with the first-person identifier ’Me’ systematically biases model judgments independent of the underlying evidence. Using the Conversations Gone Awry corpus, we evaluate four LLMs across three judgment tasks (attack detection, attacker identification, and blame attribution), three perspective conditions, and two evidence visibility settings. Our results show that narrative perspective induces strong, task-dependent distortions, particularly in more subjective judgment tasks. We find that models systematically favor the narrator when a speaker is presented as ’Me’, reducing blame and responsibility attribution toward that speaker even when the underlying evidence is unchanged. These findings raise concerns about using LLMs to judge or moderate first-person conversational data.
2021
Learn The Big Picture: Representation Learning for Clustering
Sumanta Kashyapi | Laura Dietz
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
Existing supervised models for text clustering find it difficult to directly optimize for clustering results. This is because clustering is a discrete process, and it is difficult to estimate a meaningful gradient of a discrete function that can drive gradient-based optimization algorithms. Existing supervised clustering algorithms therefore indirectly optimize a continuous function that approximates the clustering process. We propose a scalable training strategy that directly optimizes for a discrete clustering metric. We train a BERT-based embedding model using our method and evaluate it on two publicly available datasets. We show that our method outperforms another BERT-based embedding model employing triplet loss and other unsupervised baselines. This suggests that optimizing directly for the clustering outcome yields representations better suited to clustering.
2019
An Analysis of Deep Contextual Word Embeddings and Neural Architectures for Toponym Mention Detection in Scientific Publications
Matthew Magnusson | Laura Dietz
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications
Toponym detection in scientific papers is an open task and a key first step in place entity enrichment of documents. We examine three common neural architectures in NLP: 1) a convolutional neural network, 2) a multi-layer perceptron (both applied in a sliding window context), and 3) a bidirectional LSTM, and we apply contextual and non-contextual word embedding layers to these models. We find that deep contextual word embeddings improve the performance of the bi-LSTM with CRF neural architecture, which achieves its best performance when multiple layers of deep contextual embeddings are concatenated. Our best performing model achieves an average F1 of 0.910 when evaluated on overlap macro, exceeding previous state-of-the-art models in the toponym detection task.
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications
Vivi Nastase | Benjamin Roth | Laura Dietz | Andrew McCallum
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications
UNH at SemEval-2019 Task 12: Toponym Resolution in Scientific Papers
Matthew Magnusson | Laura Dietz
Proceedings of the 13th International Workshop on Semantic Evaluation
SemEval-2019 Task 12 is toponym resolution in scientific papers. We focus on Subtask 1: Toponym Detection, which is the identification of spans of text for place names mentioned in a document. We propose two methods: 1) a sliding window convolutional neural network using ELMo embeddings (cnn-elmo), and 2) a sliding window multi-layer perceptron using ELMo embeddings (mlp-elmo). We also submit a bidirectional LSTM with Conditional Random Fields (bi-LSTM) as a strong baseline, given its state-of-the-art performance in the named entity recognition (NER) task. Our best performing model is cnn-elmo with an F1 of 0.844, which was below the bi-LSTM F1 of 0.862 when evaluated on overlap macro detection. Eight teams participated in this subtask with a total of 21 submissions.