Selma Liliane Wanna
2026
Limited Linguistic Diversity in Embodied AI Datasets
Selma Liliane Wanna | Agnes Luhtaru | Jonathan Salfity | Ryan Barron | Juston Moore | Cynthia Matuszek | Mitch Pryor
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions—including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
2025
LLM-Assisted Translation of Legacy FORTRAN Codes to C++: A Cross-Platform Study
Nishath Rajiv Ranasinghe | Shawn M. Jones | Michal Kucer | Ayan Biswas | Daniel O’Malley | Alexander Most | Selma Liliane Wanna | Ajay Sreekumar
Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities
Large Language Models (LLMs) are increasingly being leveraged for generating and translating scientific computer codes by both domain experts and non-domain experts. Fortran has served as one of the go-to programming languages in legacy high-performance computing (HPC) for scientific discoveries. Despite growing adoption, LLM-based code translation of legacy code-bases has not been thoroughly assessed or quantified for its usability. Here, we studied the applicability of LLM-based translation of Fortran to C++ as a step towards building an agentic workflow using open-weight LLMs on two different computational platforms. We statistically quantified the compilation accuracy of the translated C++ codes, measured the similarity of the LLM-translated code to the human-translated C++ code, and statistically quantified the output similarity of the Fortran to C++ translation.