Ryan Barron
2026
Limited Linguistic Diversity in Embodied AI Datasets
Selma Liliane Wanna | Agnes Luhtaru | Jonathan Salfity | Ryan Barron | Juston Moore | Cynthia Matuszek | Mitch Pryor
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions—including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
2025
HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning
Manish Bhattarai | Ryan Barron | Maksim Eren | Minh Vu | Vesselin Grantcharov | Ismael Boureima | Valentin Stanev | Cynthia Matuszek | Vladimir Valtchinov | Kim Rasmussen | Boian Alexandrov
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain’s specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.