2025
Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation
Xujun Peng | Anoop Kumar | Jingyu Wu | Parker Glenn | Daben Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem exacerbated by the scarcity of consistency-focused data and the limitations of existing fine-tuning methods for improving consistency. We propose a new approach combining systematic synthetic data generation, a triplet loss for better embeddings, and a novel layer-wise model merging strategy. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results show that our merged model significantly enhances output consistency, achieving approximately a 47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.
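The layer-wise merging described above lends itself to a compact sketch. The following is a minimal, hypothetical illustration of interpolating two same-architecture checkpoints with per-layer weights; the `layer_weights` mapping stands in for the paper's consistency-aware weights derived from intermediate layer activations, and all names here are assumptions for illustration.

```python
import torch

def merge_layerwise(state_a, state_b, layer_weights, default_w=0.5):
    """Interpolate two same-architecture state dicts layer by layer.

    layer_weights maps a parameter-name prefix (e.g. "model.layers.3")
    to the mixing weight given to model A; unmatched tensors fall back
    to default_w. This is an illustrative stand-in for the paper's
    consistency-aware weights.
    """
    merged = {}
    for name, tensor_a in state_a.items():
        w = default_w
        for prefix, weight in layer_weights.items():
            if name.startswith(prefix):
                w = weight
                break
        merged[name] = w * tensor_a + (1.0 - w) * state_b[name]
    return merged

# Hypothetical usage with two specialized checkpoints:
# merged = merge_layerwise(torch.load("consistency.pt"),
#                          torch.load("base.pt"),
#                          layer_weights={"model.layers.3": 0.8})
```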
Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics
Catarina Belem | Parker Glenn | Alfy Samuel | Anoop Kumar | Daben Liu
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)
Automatic readability assessment plays a key role in ensuring effective communication between humans and language models. Despite significant progress, the field is hindered by inconsistent definitions of readability and by measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 1.2k judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across 5 datasets, contrasting them with 5 more nuanced model-based metrics. Our results show that four model-based metrics consistently place among the top 4 in rank correlations with human judgments, while the best-performing traditional metric achieves an average rank of 7.8. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
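The evaluation protocol above, ranking metrics by their rank correlation with human judgments, reduces to a short computation. A minimal sketch, assuming invented scores and metric names rather than the paper's actual metrics or datasets:

```python
from scipy.stats import spearmanr

# Invented per-text scores: human readability judgments and two
# candidate metrics (one surface-level formula, one model-based).
human = [4.2, 3.1, 4.8, 2.5, 3.9]
metrics = {
    "grade_level_formula": [10.3, 12.1, 9.8, 14.0, 11.2],
    "model_based_rating": [4.0, 3.3, 4.6, 2.8, 3.7],
}

# Rank metrics by absolute Spearman rank correlation with the humans.
ranked = []
for name, scores in metrics.items():
    rho, _ = spearmanr(scores, human)
    ranked.append((name, rho))
ranked.sort(key=lambda pair: abs(pair[1]), reverse=True)
for rank, (name, rho) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: Spearman rho = {rho:.2f}")
```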
2024
BlendSQL: A Scalable Dialect for Unifying Hybrid Question Answering in Relational Algebra
Parker Glenn | Parag Dakle | Liang Wang | Preethi Raghavan
Findings of the Association for Computational Linguistics: ACL 2024
Many existing end-to-end systems for hybrid question answering tasks can often be boiled down to a “prompt-and-pray” paradigm, where the user has limited control and insight into the intermediate reasoning steps used to achieve the final result. Additionally, due to the context size limitation of many transformer-based LLMs, it is often not reasonable to expect that the full structured and unstructured context will fit into a given prompt in a zero-shot setting, let alone a few-shot setting. We introduce BlendSQL, a superset of SQLite that acts as a unified dialect for orchestrating reasoning across both unstructured and structured data. For hybrid question answering tasks involving multi-hop reasoning, we encode the full decomposed reasoning roadmap into a single interpretable BlendSQL query. Notably, we show that BlendSQL can scale to massive datasets and improve the performance of end-to-end systems while using 35% fewer tokens. Our code is available and installable as a package at https://github.com/parkervg/blendsql.
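To make the "single interpretable BlendSQL query" idea concrete, here is an illustrative sketch of the kind of query such a dialect enables, with an LLM-backed function embedded in ordinary SQL. The table, columns, and exact function syntax below are assumptions for illustration; see the linked repository for the package's actual API.

```python
# A hypothetical BlendSQL-style query: plain SQL handles the structured
# data, while the double-brace ingredient delegates a subquestion over
# an unstructured text column to an LLM. All names are illustrative.
query = """
SELECT name, revenue
FROM companies
WHERE {{LLMMap('Is this a renewable-energy producer?', 'companies::description')}} = TRUE
ORDER BY revenue DESC
LIMIT 5
"""
```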
2023
Jetsons at the FinNLP-2023: Using Synthetic Data and Transfer Learning for Multilingual ESG Issue Classification
Parker Glenn | Alolika Gon | Nikhil Kohli | Sihan Zha | Parag Pravin Dakle | Preethi Raghavan
Proceedings of the Fifth Workshop on Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting
Correcting Semantic Parses with Natural Language through Dynamic Schema Encoding
Parker Glenn | Parag Pravin Dakle | Preethi Raghavan
Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023)
Converting natural language to SQL queries poses several semantic and syntactic challenges. As the performance of semantic parsing systems improves, it becomes increasingly important to understand and remedy the points of failure. We explore semantic parse correction with natural language feedback, proposing a new solution built on the success of autoregressive decoders in text-to-SQL tasks. By separating the semantic and syntactic difficulties of the task, we show that the accuracy of text-to-SQL parsers can be boosted by up to 26% with only one turn of natural language correction. Additionally, we show that a T5-base model is capable of correcting the errors of a T5-large model in a zero-shot, cross-parser setting.
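A minimal sketch of the one-turn correction setup described above: the question, the erroneous parse, the natural-language feedback, and a serialized schema are packed into a single seq2seq input. The prompt format and the untuned `t5-base` checkpoint are assumptions for illustration; the paper's system fine-tunes the corrector and encodes the schema dynamically.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Hypothetical serialization of one correction turn.
source = (
    "question: How many singers are from France? "
    "parse: SELECT COUNT(*) FROM concerts WHERE country = 'France' "
    "feedback: count singers, not concerts "
    "schema: singers(id, name, country) | concerts(id, venue)"
)
inputs = tokenizer(source, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```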
2022
The Viability of Best-worst Scaling and Categorical Data Label Annotation Tasks in Detecting Implicit Bias
Parker Glenn | Cassandra L. Jacobs | Marvin Thielk | Yi Chu
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022
Annotating workplace bias in text is a noisy and subjective task. In encoding the inherently continuous nature of bias, aggregated binary classifications do not suffice. Best-worst scaling (BWS) offers a framework to obtain real-valued scores through a series of comparative evaluations, but it is often impractical to deploy in traditional annotation pipelines within industry. We present analyses of a small-scale bias dataset, jointly annotated with categorical annotations and BWS annotations. We show that there is a strong correlation between observed agreement and BWS score (Spearman’s r=0.72). We identify several shortcomings of BWS relative to traditional categorical annotation: (1) When compared to categorical annotation, we estimate BWS takes approximately 4.5x longer to complete; (2) BWS does not scale well to large annotation tasks with sparse target phenomena; (3) The high correlation between BWS and the traditional task shows that the benefits of BWS can be recovered from a simple categorically annotated, non-aggregated dataset.
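For reference, the standard counting procedure for converting best-worst annotations into real-valued scores (score = proportion best minus proportion worst) is easy to state in code. A minimal sketch with invented annotations; this is the generic BWS counting method, not necessarily the exact scoring used in the paper:

```python
from collections import Counter

def bws_scores(annotations):
    """Score items from best-worst scaling annotations.

    Each annotation is (items_shown, best_item, worst_item).
    Counting method: score = (#best - #worst) / #appearances.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Invented example: three 4-tuples over five text snippets.
annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("b", "c", "d", "e"), "b", "e"),
    (("a", "c", "d", "e"), "a", "e"),
]
print(bws_scores(annotations))  # e.g. {'a': 1.0, 'd': -0.33, ...}
```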