Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

Nouha Dziri, Sean (Xiang) Ren, Shizhe Diao (Editors)


Anthology ID: 2025.naacl-demo
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Venues: NAACL | WS
Publisher: Association for Computational Linguistics
URL: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-demo/
ISBN: 979-8-89176-191-9

Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
Hyunbyung Park | Sukyung Lee | Gyoungjin Gim | Yungi Kim | Dahyun Kim | Chanjun Park

To address the challenges associated with data processing at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for large language models (LLMs) with a user-friendly design at its core. The easy addition of custom processors through a block-based interface allows users to readily and efficiently build their own ETL pipelines with Dataverse. We hope that Dataverse will serve as a vital tool for LLM development, and we open-source the entire library to welcome community contributions. Additionally, we provide a concise, two-minute video demonstration of our system, illustrating its capabilities and implementation.
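
A minimal sketch of what block-based processor composition can look like in an ETL pipeline of this kind; the `processor` registry and `run_pipeline` helper below are hypothetical illustrations, not Dataverse's actual API.

```python
# Hypothetical sketch of block-based ETL composition in the spirit of
# Dataverse; `processor` and `run_pipeline` are illustrative, not the
# library's real interface.
from typing import Callable, Iterable, Iterator

REGISTRY: dict[str, Callable] = {}

def processor(name: str):
    """Register a custom processing block under a string key."""
    def wrap(fn: Callable) -> Callable:
        REGISTRY[name] = fn
        return fn
    return wrap

@processor("dedup_exact")
def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    seen = set()
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            yield doc

@processor("min_length")
def min_length(docs: Iterable[str], n: int = 20) -> Iterator[str]:
    return (doc for doc in docs if len(doc) >= n)

def run_pipeline(docs: Iterable[str], steps: list) -> list[str]:
    # Each step is a (block_name, kwargs) pair resolved via the registry,
    # so new blocks slot in without touching the pipeline driver.
    for name, kwargs in steps:
        docs = REGISTRY[name](docs, **kwargs)
    return list(docs)

print(run_pipeline(
    ["short", "a sufficiently long document", "a sufficiently long document"],
    [("dedup_exact", {}), ("min_length", {"n": 10})],
))
```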

ATAIGI: An AI-Powered Multimodal Learning App Leveraging Generative Models for Low-Resource Taiwanese Hokkien
Yun-Hsin Chu | Shuai Zhu | Shou-Yi Hung | Bo-Ting Lin | En-Shiun Annie Lee | Richard Tzong-Han Tsai

Many endangered languages are at risk of extinction due to communication barriers and generational gaps that hinder their preservation. One cause of language endangerment is the lack of language educational tools and artificial intelligence (AI) models for these low-resource languages. To address this, we propose ATAIGI, a learning app built on AI-powered models leveraging multimodal generative techniques. Our app offers users a comprehensive learning experience by providing translated phrases and definitions, example sentences, illustrative images, romanized pronunciation, and audio speech to accelerate language learning. ATAIGI is built on five AI models that are rigorously benchmarked individually, with our Transliteration Model achieving state-of-the-art results for Taiwanese Hokkien transliteration. ATAIGI is available for all to learn Taiwanese Hokkien, an endangered language spoken in Taiwan. A human evaluation demonstrates the effectiveness of ATAIGI in improving language proficiency and cultural understanding, supporting its potential for the preservation and education of endangered languages like Taiwanese Hokkien.

CLEAR-Command: Coordinated Listening, Extraction, and Allocation for Emergency Response with Large Language Models
Achref Doula | Bela Bohlender | Max Mühlhäuser | Alejandro Sanchez Guinea

Effective communication is vital in emergency response scenarios where clarity and speed can save lives. Traditional systems often struggle under the chaotic conditions of real-world emergencies, leading to breakdowns in communication and task management. This paper introduces CLEAR-Command, a system that leverages Large Language Models (LLMs) to enhance emergency communications. CLEAR stands for Coordinated Listening, Extraction, and Allocation in Response. CLEAR-Command automates the transcription, summarization, and task extraction from live radio communications of emergency first responders, using the OpenAI Whisper API for transcription and gpt-4o for summarization and task extraction. Our system provides a dynamic overview of task allocations and their execution status, significantly improving the accuracy of task identification and the clarity of communication. We evaluated our system through an expert pre-study with 4 experts and a user study with 13 participants. The expert pre-study identified gpt-4o as providing the most accurate task extraction, while the user study showed that CLEAR-Command significantly outperforms traditional radio communication in terms of clarity, trust, and correctness of task extraction. Our demo is hosted under this link, and all project details are presented on our GitLab page.
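
A minimal sketch of the transcribe-then-extract chain described above, using the OpenAI Python SDK; the prompt wording and output format are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a CLEAR-Command style pipeline: Whisper for transcription,
# gpt-4o for summarization and task extraction. Prompt and output format
# are illustrative, not the paper's actual setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def transcribe(path: str) -> str:
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text

def extract_tasks(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Summarize this emergency radio transcript and list "
                        "each task as 'assignee: task - status'."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(extract_tasks(transcribe("radio_clip.wav")))
```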

LM-Pub-Quiz: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models
Max Ploner | Jacek Wiland | Sebastian Pohl | Alan Akbik

Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase. It provides a cost-effective means of comparing LMs of different sizes and training setups and is useful for monitoring knowledge gained or lost during continual learning (CL). In prior work, we presented an improved knowledge probe called BEAR (Wiland et al., 2024), which enables the comparison of LMs trained with different pre-training objectives (causal and masked LMs) and addresses issues of skewed distributions in previous probes to deliver a more unbiased reading of LM knowledge. With this paper, we present LM-Pub-Quiz, a Python framework and leaderboard built around the BEAR probing mechanism that enables researchers and practitioners to apply it in their work. It provides options for standalone evaluation and direct integration into the widely-used training pipeline of the Hugging Face transformers library. Further, it provides a fine-grained analysis of different knowledge types to assist users in better understanding the knowledge in each evaluated LM. We publicly release LM-Pub-Quiz as an open-source project: https://lm-pub-quiz.github.io/
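
A short sketch of what running the BEAR probe with the library can look like; the class and method names below follow our reading of the project documentation and should be treated as assumptions rather than a definitive API reference.

```python
# Sketch of evaluating a causal LM on the BEAR probe with LM-Pub-Quiz.
# Names follow the project docs as we read them; verify against
# https://lm-pub-quiz.github.io/ before use.
from lm_pub_quiz import Dataset, Evaluator

dataset = Dataset.from_name("BEAR")     # load the BEAR probing data
evaluator = Evaluator.from_model(
    "gpt2",                             # any Hugging Face model id
    model_type="CLM",                   # causal LM (use "MLM" for masked)
)
results = evaluator.evaluate_dataset(dataset, batch_size=32)
print(results)                          # per-relation scores
```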

TRACE: Real-Time Multimodal Common Ground Tracking in Situated Collaborative Dialogues
Hannah VanderHoeven | Brady Bhalla | Ibrahim Khebour | Austin C. Youngren | Videep Venkatesha | Mariah Bradford | Jack Fitzgerald | Carlos Mabrey | Jingxuan Tu | Yifan Zhu | Kenneth Lai | Changsoo Jung | James Pustejovsky | Nikhil Krishnaswamy

We present TRACE, a novel system for live *common ground* tracking in situated collaborative tasks. With a focus on fast, real-time performance, TRACE tracks the speech, actions, gestures, and visual attention of participants, uses these multimodal inputs to determine the set of task-relevant propositions that have been raised as the dialogue progresses, and tracks the group’s epistemic position and beliefs toward them as the task unfolds. Amid increased interest in AI systems that can mediate collaborations, TRACE represents an important step forward for agents that can engage with multiparty, multimodal discourse.

MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation
Javier García Gilabert | Carlos Escolano | Audrey Mash | Xixian Liao | Maite Melero

We introduce MT-Lens, a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks, including translation quality, gender bias detection, added toxicity, and robustness to misspellings. While several toolkits have become very popular for benchmarking the capabilities of Large Language Models (LLMs), existing evaluation tools often lack the ability to thoroughly assess the diverse aspects of MT performance. MT-Lens addresses these limitations by extending the capabilities of LM-eval-harness for MT, supporting state-of-the-art datasets and a wide range of evaluation metrics. It also offers a user-friendly platform to compare systems and analyze translations with interactive visualizations. MT-Lens aims to broaden access to evaluation strategies that go beyond traditional translation quality evaluation, enabling researchers and engineers to better understand the performance of an NMT model and to easily measure a system's biases.

A Learning-based Multi-Frame Visual Feature Framework for Real-Time Driver Fatigue Detection
Liang Xie | Songlin Fan

Driver fatigue is a significant factor contributing to road accidents, highlighting the need for reliable and accurate detection methods. In this study, we introduce a novel learning-based multi-frame visual feature framework (LMVFF) designed for precise fatigue detection. Our methodology comprises several clear and interpretable steps. Initially, facial landmarks are detected, enabling the calculation of eye and lip distances and the assessment of head rotation angles based on 68 identified landmarks. Subsequently, visual features from the eye region are extracted, and an effective visual model is developed to accurately classify eye openness. Additionally, features characterizing lip movements are analyzed to detect yawning, thereby enriching fatigue detection through continuous monitoring of eye-blink frequency, yawning occurrences, and head movements. Compared to conventional single-feature detection approaches, LMVFF significantly reduces instances of fatigue misidentification. Moreover, we employ various quantization and compression techniques at multiple computation stages, substantially reducing the latency of our system and achieving a real-time frame rate of 25-30 FPS for practical applications.
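
The eye-openness cue this family of methods relies on can be illustrated with the standard eye aspect ratio (EAR) over six eye landmarks from a 68-point face model; this is the textbook formulation of such a distance-based feature, not the paper's exact feature set.

```python
# Standard eye aspect ratio (EAR) over six landmarks of one eye from a
# 68-point face model; a textbook eye-openness feature, not the paper's
# exact formulation.
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """eye: (6, 2) array of landmark coordinates p1..p6 around one eye."""
    v1 = np.linalg.norm(eye[1] - eye[5])  # vertical distance p2-p6
    v2 = np.linalg.norm(eye[2] - eye[4])  # vertical distance p3-p5
    h = np.linalg.norm(eye[0] - eye[3])   # horizontal distance p1-p4
    return (v1 + v2) / (2.0 * h)

# EAR staying below ~0.2 for several consecutive frames is a common
# closed-eye (blink / drowsiness) heuristic.
open_eye = np.array([[0, 0], [2, 2], [4, 2], [6, 0], [4, -2], [2, -2]], float)
print(round(eye_aspect_ratio(open_eye), 3))  # ~0.667 for a wide-open eye
```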

TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation Models
Yanbo Wang | Jiayi Ye | Siyuan Wu | Chujie Gao | Yue Huang | Xiuying Chen | Yue Zhao | Xiangliang Zhang

Ensuring the trustworthiness of Generative Foundation Models (GenFMs) is a pressing challenge as they gain widespread use. Existing evaluation toolkits are often limited in scope, dynamism, and flexibility. This paper introduces TRUSTEVAL, a dynamic and comprehensive toolkit designed for evaluating GenFMs across various dimensions. TRUSTEVAL supports both dynamic dataset generation and evaluation, offering advanced features including comprehensiveness, usability, and flexibility. TRUSTEVAL integrates diverse generative models, datasets, evaluation methods, metrics, inference efficiency enhancement, and evaluation report generation. Through case studies, we demonstrate TRUSTEVAL’s potential to advance the trustworthiness evaluation of GenFMs.

AutoClean: LLMs Can Prepare Their Training Corpus
Xingyu Shen | Shengding Hu | Xinrong Zhang | Xu Han | Xiaojun Meng | Jiansheng Wei | Zhiyuan Liu | Maosong Sun

Recent studies highlight the reliance of Large Language Models (LLMs) on high-quality, diverse data for optimal performance. Data sourced from the Internet, often aggregated into datasets like the Common Crawl corpus, presents significant quality variability and necessitates extensive cleaning. Moreover, domain-specific knowledge is usually presented in HTML, but there is a lack of effective methods for cleaning it into the training corpus automatically. Traditional cleaning methods involve either labor-intensive human teams that lack scalability or static heuristics that lead to suboptimal outcomes and cannot be applied to specific target domains. In this paper, inspired by recent progress in employing LLMs as versatile agents for diverse tasks, we explore the potential of these agents in automating data-cleaning methodologies. By configuring LLMs as an agent team that imitates a human data-cleaning team, we can automatically generate cleaning rules that traditionally require the involvement of data-cleaning experts. These rules are developed using a limited number of data samples and can then be applied broadly to substantial portions of raw data from the same domain. We demonstrate the efficiency and effectiveness of our approach on both pre-training-scale corpora such as Common Crawl and specific target websites. Both automatic and human evaluations of the quality of the cleaned content highlight the feasibility of using LLMs to prepare their own training corpus.
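
The generate-rules-once, apply-broadly pattern can be sketched as follows; the prompt, the JSON rule format, and the use of the OpenAI SDK are all hypothetical illustrations, not the paper's actual agent-team setup.

```python
# Hypothetical sketch: an LLM inspects a few noisy samples and emits
# regex cleaning rules, which are then applied cheaply to the whole
# corpus. Prompt and rule format are illustrative only.
import json
import re
from openai import OpenAI

client = OpenAI()

def propose_rules(samples: list[str]) -> list[dict]:
    prompt = (
        "You are a data-cleaning expert. Given these noisy web samples, "
        'return only a JSON list of {"pattern": <regex>, "replacement": <str>} '
        "rules that strip boilerplate:\n\n" + "\n---\n".join(samples)
    )
    out = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(out.choices[0].message.content)  # assumes bare JSON back

def clean_corpus(corpus: list[str], rules: list[dict]) -> list[str]:
    # Rules derived from a handful of samples are reused across the domain.
    for rule in rules:
        corpus = [re.sub(rule["pattern"], rule["replacement"], doc)
                  for doc in corpus]
    return corpus
```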

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang | Hou Pong Chan | Yiran Zhao | Mahani Aljunied | Jianyu Wang | Chaoqun Liu | Yue Deng | Zhiqiang Hu | Weiwen Xu | Yew Ken Chia | Xin Li | Lidong Bing

Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.

Prompto: An open source library for asynchronous querying of LLM endpoints
Ryan Sze-Yin Chan | Federico Nanni | Angus Redlarski Williams | Edwin Brown | Liam Burke-Moore | Ed Chapman | Kate Onslow | Tvesha Sippy | Jonathan Bright | Evelina Gabasova

The recent surge in Large Language Model (LLM) availability has opened exciting avenues for research. However, efficiently interacting with these models presents a significant hurdle, since LLMs often reside on proprietary or self-hosted API endpoints, each requiring custom code for interaction. Conducting comparative studies between different models can therefore be time-consuming and necessitate significant engineering effort, hindering research efficiency and reproducibility. To address these challenges, we present prompto, an open-source Python library that facilitates asynchronous querying of LLM endpoints, enabling researchers to interact with multiple LLMs concurrently while maximising efficiency and utilising individual rate limits. Our library empowers researchers and developers to interact with LLMs more effectively, allowing faster experimentation, data generation, and evaluation. prompto is released with an introductory video (https://youtu.be/lWN9hXBOLyQ) under the MIT License and is available via GitHub (https://github.com/alan-turing-institute/prompto).
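
The core idea, concurrent fan-out to several endpoints while respecting each endpoint's own rate limit, can be sketched with plain asyncio; this illustrates the concept rather than prompto's actual interface.

```python
# Concept sketch: asynchronous fan-out to multiple LLM endpoints with a
# per-endpoint concurrency cap. Illustrates the idea behind prompto; see
# its GitHub repository for the real API.
import asyncio

async def query(endpoint: str, prompt: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a real HTTP call to the endpoint
    return f"{endpoint}: reply to {prompt!r}"

async def run(endpoints: dict[str, int], prompts: list[str]) -> list[str]:
    sems = {name: asyncio.Semaphore(limit) for name, limit in endpoints.items()}

    async def limited(name: str, prompt: str) -> str:
        async with sems[name]:  # cap in-flight requests per endpoint
            return await query(name, prompt)

    tasks = [limited(name, p) for name in endpoints for p in prompts]
    return await asyncio.gather(*tasks)

print(asyncio.run(run({"model-a": 2, "model-b": 5}, ["Hello", "Goodbye"])))
```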

ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Jinchuan Tian | Jiatong Shi | William Chen | Siddhant Arora | Yoshiki Masuyama | Takashi Maekaku | Yihan Wu | Junyi Peng | Shikhar Bharadwaj | Yiwen Zhao | Samuele Cornell | Yifan Peng | Xiang Yue | Chao-Han Huck Yang | Graham Neubig | Shinji Watanabe

We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.

InspectorRAGet: An Introspection Platform for RAG Evaluation
Kshitij P Fadnis | Siva Sankalp Patel | Odellia Boni | Yannis Katsis | Sara Rosenthal | Benjamin Sznajder | Marina Danilevsky

Large Language Models (LLMs) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and a significant amount of effort has been spent on building good models and metrics. In spite of increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond producing model output and automatically computing metrics. We present InspectorRAGet, an introspection platform for performing a comprehensive analysis of the quality of RAG system output. InspectorRAGet allows the user to analyze aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics, as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is available publicly to the community. A live instance of the platform is available at https://ibm.biz/InspectorRAGet

Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery
Balaji Rama | Kai Mei | Yongfeng Zhang

Autonomous LLM-based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, and distribution. We present Cerebrum, an open-source platform that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four-layer architecture for agent development, encompassing LLM, memory, storage, and tool management; (2) a community-driven Agent Hub for sharing and discovering agents, complete with version control and dependency management; and (3) an interactive web interface for testing and evaluating agents. The platform's effectiveness is demonstrated through implementations of various agent architectures, including Chain of Thought (CoT), ReAct, and tool-augmented agents. Cerebrum advances the field by providing a unified framework that standardizes agent development while maintaining flexibility for researchers and developers to innovate and distribute their work. A live demo can be found at https://app.aios.foundation, the code at https://github.com/agiresearch/Cerebrum, and a video demo at https://app.aios.foundation/video-demo.

GenSim: A General Social Simulation Platform with Large Language Model based Agents
Jiakai Tang | Heyang Gao | Xuchen Pan | Lei Wang | Haoran Tan | Dawei Gao | Yushuo Chen | Xu Chen | Yankai Lin | Yaliang Li | Bolin Ding | Jingren Zhou | Jun Wang | Ji-Rong Wen

With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during simulation. To overcome these limitations, we propose a novel LLM-agent-based simulation platform called GenSim, which: (1) Abstracts a set of general functions to simplify the simulation of customized social scenarios; (2) Supports one hundred thousand agents to better simulate large-scale populations in real-world contexts; (3) Incorporates error-correction mechanisms to ensure more reliable and long-term simulations. To evaluate our platform, we assess both the efficiency of large-scale agent simulations and the effectiveness of the error-correction mechanisms. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform based on LLM agents, promising to further advance the field of social science.

Semi-automatic Sequential Sentence Classification in the Discourse Analysis Tool Suite
Tim Fischer | Chris Biemann

This paper explores an AI-assisted approach to sequential sentence annotation designed to enhance qualitative data analysis (QDA) workflows within the open-source Discourse Analysis Tool Suite (DATS) developed at our university. We introduce a three-phase Annotation Assistant that leverages the capabilities of large language models (LLMs) to assist researchers during annotation. Based on the number of annotations, the assistant employs zero-shot prompting, few-shot prompting, or fine-tuned models to provide the best suggestions. To evaluate this approach, we construct a benchmark with five diverse datasets. We assess the performance of three prominent open-source LLMs — Llama 3.1, Gemma 2, and Mistral NeMo — and a sequence tagging model based on SentenceTransformers. Our findings demonstrate the effectiveness of our approach, with performance improving as the number of annotated examples increases. Consequently, we implemented the Annotation Assistant within DATS and report the implementation details. With this, we hope to contribute to a novel AI-assisted workflow and further democratize access to AI for qualitative data analysis.
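
The three-phase behaviour can be read as a simple policy over how many annotations already exist; the thresholds below are illustrative stand-ins, not the values DATS uses.

```python
# Illustrative policy behind a three-phase annotation assistant: pick a
# suggestion strategy from the number of existing annotations. Thresholds
# are stand-ins, not the values used in DATS.
def choose_strategy(n_annotations: int) -> str:
    if n_annotations == 0:
        return "zero-shot prompting"   # no labelled examples yet
    if n_annotations < 100:
        return "few-shot prompting"    # pack available examples into the prompt
    return "fine-tuned model"          # enough data to train a dedicated model

for n in (0, 25, 500):
    print(n, "->", choose_strategy(n))
```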

CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq | Zora Zhiruo Wang | Frank F. Xu | Tianyue Ou | Shuyan Zhou | Jeffrey P. Bigham | Graham Neubig

While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and in modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, with evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent's by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research into how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html

eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback
Zhexiong Liu | Diane Litman | Elaine L Wang | Tianwen Li | Mason Gobat | Lindsay Clare Matsumura | Richard Correnti

The ability to revise essays in response to feedback is important for students’ writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
Jiatong Shi | Hye-jin Shim | Jinchuan Tian | Siddhant Arora | Haibin Wu | Darius Petermann | Jia Qi Yip | You Zhang | Yuxun Tang | Wangyou Zhang | Dareen Safar Alharthi | Yichen Huang | Koichi Saito | Jionghao Han | Yiwen Zhao | Chris Donahue | Shinji Watanabe

In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/shinjiwlab/versa.

Persona-SQ: A Personalized Suggested Question Generation Framework For Real-world Documents
Zihao Lin | Zichao Wang | Yuanting Pan | Varun Manjunatha | Ryan A. Rossi | Angela Lau | Lifu Huang | Tong Sun

Suggested questions (SQs) provide an effective initial interface for users to engage with their documents in AI-powered reading applications. In practical reading sessions, users have diverse backgrounds and reading goals, yet current SQ features typically ignore such user information, resulting in homogeneous or ineffective questions. We introduce a pipeline that generates personalized SQs by incorporating reader profiles (professions and reading goals) and demonstrate its utility in two ways: 1) as an improved SQ generation pipeline that produces higher-quality and more diverse questions compared to current baselines, and 2) as a data generator to fine-tune extremely small models that perform competitively with much larger models on SQ generation. Our approach can not only serve as a drop-in replacement in current SQ systems to immediately improve their performance but also help develop on-device SQ models that can run locally to deliver a fast and private SQ experience.

ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
Siddhant Arora | Yifan Peng | Jiatong Shi | Jinchuan Tian | William Chen | Shikhar Bharadwaj | Hayato Futami | Yosuke Kashiwagi | Emiru Tsunoo | Shuichiro Shimizu | Vaibhav Srivastav | Shinji Watanabe

Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but the differing web interfaces of these systems make it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo further provides users with the option to get on-the-fly automated evaluation metrics such as (1) latency, (2) ability to understand user input, (3) coherence, diversity, and relevance of system response, and (4) intelligibility and audio quality of system output. Using these evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies, providing valuable insights such as current E2E systems having poorer audio quality and less diverse responses. An example demo produced using our toolkit is publicly available here: https://huggingface.co/spaces/Siddhant/Voice_Assistant_Demo.

SURF: A System to Unveil Explainable Risk Relations between Firms
Yu-Hsiang Wang | Wei-Ning Chiu | Yi-Tai Hsiao | Yu-Shiang Huang | Yi-Shyuan Chiang | Shuo-En Wu | Chuan-Ju Wang

Firm risk relations are crucial in financial applications, including hedging and portfolio construction. However, the complexity of extracting relevant information from financial reports poses significant challenges in quantifying these relations. To this end, we introduce SURF, a System to Unveil Explainable Risk Relations between Firms. SURF employs a domain-specific encoder and an innovative scoring mechanism to uncover latent risk connections from financial reports. It constructs a network graph to visualize these firm-level risk interactions and incorporates a rationale explainer to elucidate the underlying links. Our evaluation using stock data shows that SURF outperforms baseline methods in effectively capturing firm risk relations. The demo video of the system is publicly available.

Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
Haonan Li | Xudong Han | Zenan Zhai | Honglin Mu | Hao Wang | Zhenxuan Zhang | Yilin Geng | Shom Lin | Renxi Wang | Artem Shelmanov | Xiangyu Qi | Yuxia Wang | Donghai Hong | Youliang Yuan | Meng Chen | Haoqin Tu | Fajri Koto | Cong Zeng | Tatsuki Kuribayashi | Rishabh Bhardwaj | Bingchen Zhao | Yawen Duan | Yi Liu | Emad A. Alghamdi | Yaodong Yang | Yinpeng Dong | Soujanya Poria | Pengfei Liu | Zhengzhong Liu | Hector Xuguang Ren | Eduard Hovy | Iryna Gurevych | Preslav Nakov | Monojit Choudhury | Timothy Baldwin

As large language models (LLMs) continue to evolve, leaderboards play a significant role in steering their development. Existing leaderboards often prioritize model capabilities while overlooking safety concerns, leaving a significant gap in responsible AI development. To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of others. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
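
One plausible instantiation of the distance-to-optimal-score idea, to make the contrast with plain averaging concrete: score each model by its closeness to the ideal point (perfect capability, perfect safety), normalized to [0, 1]. The exact formula Libra-Leaderboard uses is given in the paper.

```python
# One plausible reading of distance-to-optimal-score ranking: closeness
# to the ideal point (capability = 1, safety = 1), rescaled to [0, 1].
# Consult the paper for Libra-Leaderboard's exact formula.
import math

def libra_score(capability: float, safety: float) -> float:
    dist = math.hypot(1.0 - capability, 1.0 - safety)
    return 1.0 - dist / math.sqrt(2.0)  # 1 = optimal, 0 = worst corner

# Unlike a plain average, a balanced model beats an unbalanced one with
# the same arithmetic mean:
print(round(libra_score(0.7, 0.7), 3))  # 0.7
print(round(libra_score(1.0, 0.4), 3))  # ~0.576, despite the same mean
```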

Unlocking Korean Verbs: A User-Friendly Exploration into the Verb Lexicon
Seohyun Song | Eunkyul Leah Jo | Yige Chen | Jeen-Pyo Hong | Kyuwon Kim | Jin Wee | Kang Miyoung | KyungTae Lim | Jungyeul Park | Chulwoo Park

The Sejong dictionary dataset offers a valuable resource, providing extensive coverage of morphology, syntax, and semantic representation. This dataset can be utilized to explore linguistic information in greater depth. The labeled linguistic structures within this dataset form the basis for uncovering relationships between words and phrases and their associations with target verbs. This paper introduces a user-friendly web interface designed for the collection and consolidation of verb-related information, with a particular focus on subcategorization frames. Additionally, it outlines our efforts in mapping this information by aligning subcategorization frames with corresponding illustrative sentence examples. Furthermore, we provide a Python library that simplifies syntactic parsing and semantic role labeling. These tools are intended to assist individuals interested in harnessing the Sejong dictionary dataset to develop applications for Korean language processing.

TransformerRanker: A Tool for Efficiently Finding the Best-Suited Language Models for Downstream Classification Tasks
Lukas Garbas | Max Ploner | Alan Akbik

Classification tasks in NLP are typically addressed by selecting a pre-trained language model (PLM) from a model hub, and fine-tuning it for the task at hand. However, given the very large number of PLMs that are currently available, a practical challenge is to determine which of them will perform best for a specific downstream task. With this paper, we introduce TransformerRanker, a lightweight library that efficiently ranks PLMs for classification tasks without the need for computationally costly fine-tuning. Our library implements current approaches for transferability estimation (LogME, H-Score, kNN), in combination with layer aggregation options, which we empirically showed to yield state-of-the-art rankings of PLMs (Garbas et al., 2024). We designed the interface to be lightweight and easy to use, allowing users to directly connect to the Hugging Face Transformers and Datasets libraries. Users need only select a downstream classification task and a list of PLMs to create a ranking of likely best-suited PLMs for their task. We make TransformerRanker available as a pip-installable open-source library.
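
A short usage sketch, following our reading of the library's README; treat the exact class and argument names as assumptions and check the repository for the current interface.

```python
# Sketch of ranking candidate PLMs for a classification dataset with
# TransformerRanker. Names follow our reading of the README; verify
# against the repository before use.
from datasets import load_dataset
from transformer_ranker import TransformerRanker

dataset = load_dataset("conll2003")     # any labelled classification task
candidates = ["bert-base-cased", "roberta-base", "distilbert-base-uncased"]

ranker = TransformerRanker(dataset, dataset_downsample=0.2)  # subsample for speed
results = ranker.run(candidates, batch_size=64)  # no fine-tuning involved
print(results)                          # transferability-based ranking
```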

Learning Low-Resource Languages Through NLP-Driven Flashcards: A Case Study of Hokkien in Language Learning Applications
Tai Zhang | Lucie Yang | Erin Chen | Karen Riani | Jessica Zipf | Mariana Shimabukuro | En-Shiun Annie Lee

LangLearn is an open-source framework designed to facilitate autonomous learning of low-resource languages (LRL). By combining a language-agnostic approach with AI-enhanced flashcards, LangLearn empowers users to generate custom flashcards for their vocabulary, while offering structured learning through both pre-curated and self-curated decks. The framework integrates six key components: the word definition, corresponding Hanji characters, romanization with numeric tones, audio pronunciation, a sample sentence, as well as a contextual AI-generated image. LangLearn currently supports English and Taiwanese Hokkien (a variety of Southern Min), with plans to extend support for other dialects. Our preliminary study demonstrates that LangLearn positively empowers users to engage with LRLs using their vocabulary preferences, with a comprehensive user study currently underway. LangLearn’s modular structure enables future expansion, including ASR-based pronunciation practice. The code is available at https://github.com/HokkienTranslation/HokkienTranslation.

A Sentence-Level Visualization of Attention in Large Language Models
Seongbum Seo | Sangbong Yoo | Hyelim Lee | Yun Jang | Ji Hwan Park | Jeong-Nam Kim

We introduce SAVIS, a sentence-level attention visualization tool that enhances the interpretability of long documents processed by Large Language Models (LLMs). By computing inter-sentence attention (ISA) through token-level attention aggregation, SAVIS reduces the complexity of attention analysis, enabling users to identify meaningful document-level patterns. The tool offers an interactive interface for exploring how sentences relate to each other in model processing. Our comparative analysis with existing visualization tools demonstrates that SAVIS improves task accuracy and reduces error identification time. We demonstrate its effectiveness for text analysis applications through case studies on various analysis tasks. Our open-source tool is available at https://pypi.org/project/savis with a screencast video at https://youtu.be/fTZZPHA55So.
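
The core reduction, turning a token-level attention matrix into inter-sentence attention, can be sketched as block aggregation over sentence spans; SAVIS's exact aggregation scheme may differ, and `attn` and `spans` below are assumed inputs.

```python
# Sketch of inter-sentence attention (ISA): block-aggregate a token-level
# attention matrix over sentence spans. SAVIS's exact scheme may differ.
import numpy as np

def inter_sentence_attention(attn: np.ndarray,
                             spans: list[tuple[int, int]]) -> np.ndarray:
    """attn: (T, T) token-to-token attention; spans: [start, end) per sentence."""
    n = len(spans)
    isa = np.zeros((n, n))
    for i, (si, ei) in enumerate(spans):
        for j, (sj, ej) in enumerate(spans):
            isa[i, j] = attn[si:ei, sj:ej].mean()  # average the token block
    return isa

attn = np.random.rand(6, 6)  # e.g. one attention head over six tokens
print(inter_sentence_attention(attn, [(0, 3), (3, 6)]))  # two 3-token sentences
```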

NeMo-Inspector: A Visualization Tool for LLM Generation Analysis
Daria Gitman | Igor Gitman | Evelina Bakhturina

Adapting Large Language Models (LLMs) to novel tasks and enhancing their overall capabilities often requires large, high-quality training datasets. Synthetic data, generated at scale, serves as a valuable alternative when real-world data is scarce or difficult to obtain. However, ensuring the quality of synthetic datasets is challenging, as developers must manually inspect and refine numerous samples to identify errors and areas for improvement. This process is time-consuming and requires specialized tools. We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities. We demonstrate its effectiveness through two real-world cases. Analysis and cleaning of the synthetically generated GSM-Plus dataset with NeMo-Inspector led to a significant decrease in low-quality samples, from 46.99% to 19.51%. The tool also helped identify and correct generation errors in OpenMath models, improving accuracy by 1.92% on the MATH dataset and by 4.17% on the GSM8K dataset for a Meta-Llama-3-8B model fine-tuned on synthetic data generated from Nemotron-4-340B.

Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots
Hongming Zhang | Xiaoman Pan | Hongwei Wang | Kaixin Ma | Wenhao Yu | Dong Yu

We introduce Cognitive Kernel, an open-source agent system towards the goal of generalist autopilots. Unlike copilot systems, which primarily rely on users to provide essential state information, autopilot systems complete tasks from start to finish independently. This requires the system to actively acquire the missing state information. Cognitive Kernel adopts a dynamic programming design where the central policy model (a fine-tuned LLM) can initiate an environment state perception task, essentially another agent task, as needed. The results demonstrate that Cognitive Kernel achieves better or comparable performance to other closed-source systems on core autopilot capabilities. Cognitive Kernel is fully dockerized, ensuring everyone can deploy it privately and securely. We open-source the system to encourage further research on LLM-driven autopilot systems.

SOTOPIA-S4: a user-friendly system for flexible, customizable, and large-scale social simulation
Xuhui Zhou | Zhe Su | Sophie Feng | Jiaxu Zhou | Jen-tse Huang | Hsien-Te Kao | Spencer Lynch | Svitlana Volkova | Tongshuang Wu | Anita Woolley | Hao Zhu | Maarten Sap

Social simulation through large language model (LLM) agents is a promising approach to explore and validate social science hypotheses. We present SOTOPIA-S4, a fast, flexible, and scalable social simulation system that addresses the technical barriers of current frameworks while enabling practitioners to generate realistic, multi-turn and multi-party interactions with customizable evaluation metrics for hypothesis testing. SOTOPIA-S4 comes as a pip package that contains a simulation engine, an API server with flexible RESTful APIs for simulation management, and a web interface that enables both technical and non-technical users to design, run, and analyze simulations without programming. We demonstrate the usefulness of SOTOPIA-S4 with two use cases involving dyadic hiring negotiation scenarios and multi-party planning scenarios.

SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations
Xingwei Tan | Chen Lyu | Hafiz Muhammad Umer | Sahrish Khan | Mahathi Parvatham | Lois Arthurs | Simon Cullen | Shelley Wilson | Arshad Jhumka | Gabriele Pergola

Detecting toxic language, including sexism, harassment, and abusive behaviour, remains a critical challenge, particularly in its subtle and context-dependent forms. Existing approaches largely focus on isolated message-level classification, overlooking toxicity that emerges across conversational contexts. To promote and enable future research in this direction, we introduce *SafeSpeech*, a comprehensive platform for toxic content detection and analysis that bridges message-level and conversation-level insights. The platform integrates fine-tuned classifiers and large language models (LLMs) to enable multi-granularity detection, toxic-aware conversation summarization, and persona profiling. *SafeSpeech* also incorporates explainability mechanisms, such as perplexity gain analysis, to highlight the linguistic elements driving predictions. Evaluations on benchmark datasets, including EDOS, OffensEval, and HatEval, demonstrate the reproduction of state-of-the-art performance across multiple tasks, including fine-grained sexism detection.
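
The perplexity-gain idea can be illustrated by comparing a message's perplexity with and without a candidate span under a small language model; the paper's exact definition may differ, and the model choice and span here are arbitrary examples.

```python
# Illustration of a perplexity-gain style explanation: measure how much a
# candidate span changes a message's perplexity under a small LM. The
# paper's exact definition may differ.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return math.exp(loss.item())

message = "You are completely useless and everyone here knows it."
span = "completely useless and "
without_span = message.replace(span, "")
# A large gap suggests the span drives the message's character.
print(perplexity(without_span) - perplexity(message))
```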

ALOHA: Empowering Multilingual Agent for University Orientation with Hierarchical Retrieval
Mingxu Tao | Bowen Tang | Mingxuan Ma | Yining Zhang | Hourun Li | Feifan Wen | Ma Hao | Jia Yang

The rise of Large Language Models (LLMs) has revolutionized information retrieval, allowing users to obtain required answers through complex instructions within conversations. However, publicly available services remain inadequate in addressing the needs of faculty and students to search campus-specific information. This is primarily due to LLMs' lack of domain-specific knowledge and the limitations of search engines in supporting multilingual and timely scenarios. To tackle these challenges, we introduce ALOHA, a multilingual agent enhanced by hierarchical retrieval for university orientation. We also integrate external APIs into the front-end interface to provide interactive service. A human evaluation and case study show that our proposed system has strong capabilities to yield correct, timely, and user-friendly responses to queries in multiple languages, surpassing commercial chatbots and search engines. The system has been deployed and has provided service for more than 12,000 people.

MeKB-Sim: Personal Knowledge Base-Powered Multi-Agent Simulation
Zhenran Xu | Jifang Wang | Baotian Hu | Longyue Wang | Min Zhang

Language agents have demonstrated remarkable emergent social behaviors within simulated sandbox environments. However, the characterization of these agents has been constrained by static prompts that outline their profiles, highlighting a gap in achieving simulations that closely mimic real-life interactions. To close this gap, we introduce MeKB-Sim, a multi-agent simulation platform based on a dynamic personal knowledge base, termed MeKB. Each agent's MeKB contains both fixed and variable attributes, such as linguistic style, personality, and memory, which are crucial for theory-of-mind modeling. These attributes are updated when necessary, in response to events that the agent experiences. Comparisons with human annotators show that the LLM-based attribute updates are reliable. Based on the dynamic nature of MeKB, experiments and a case study show that MeKB-Sim enables agents to adapt their planned activities and interactions with other agents effectively. Our platform includes a Unity WebGL game interface for visualization and an interactive monitoring panel that presents the agents' planning, actions, and evolving MeKBs over time. For more information, including open-source code, a live demo website, and videos, please visit our project page at https://mekb-sim.github.io/.

MetaScientist: A Human-AI Synergistic Framework for Automated Mechanical Metamaterial Design
Jingyuan Qi | Zian Jia | Minqian Liu | Wangzhi Zhan | Junkai Zhang | Xiaofei Wen | Jingru Gan | Jianpeng Chen | Qin Liu | Mingyu Derek Ma | Bangzheng Li | Haohui Wang | Adithya Kulkarni | Muhao Chen | Dawei Zhou | Ling Li | Wei Wang | Lifu Huang

The discovery of novel mechanical metamaterials, whose properties are dominated by their engineered structures rather than their chemical composition, is a knowledge-intensive and resource-demanding process. To accelerate the design of novel metamaterials, we present MetaScientist, a human-in-the-loop system that integrates advanced AI capabilities with expert oversight across two primary phases: (1) hypothesis generation, where the system performs complex reasoning to generate novel and scientifically sound hypotheses, supported by domain-specific foundation models and inductive biases retrieved from existing literature; (2) 3D structure synthesis, where a 3D structure is synthesized with a novel 3D diffusion model based on the textual hypothesis and refined with an LLM-based refinement model to achieve better structural properties. At each phase, domain experts iteratively validate the system outputs and provide feedback and supplementary materials to ensure that the outputs align with scientific principles and human preferences. Through extensive evaluation by human scientists, MetaScientist is able to deliver novel and valid mechanical metamaterial designs that have the potential to be highly impactful in the metamaterial field.

FACTS&EVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text
Varich Boonsanong | Vidhisha Balachandran | Xiaochuang Han | Shangbin Feng | Lucy Lu Wang | Yulia Tsvetkov

With the widespread consumption of AI-generated content, there has been an increased focus on developing automated tools to verify the factual accuracy of such content. However, prior research and tools developed for fact verification treat it as a binary classification or a linear regression problem. Although this is a useful mechanism as part of automatic guardrails in systems, we argue that such tools lack transparency in their prediction reasoning and diversity in source evidence to provide a trustworthy user experience. We develop FACTS&EVIDENCE, an interactive and transparent tool for user-driven verification of complex text. The tool facilitates the intricate decision-making involved in fact verification, presenting its users with a breakdown of complex input texts to visualize the credibility of individual claims, along with explanations of model decisions and attribution to multiple, diverse evidence sources. FACTS&EVIDENCE aims to empower consumers of machine-generated text and give them agency to understand, verify, selectively trust, and use such text.

LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications
Danqing Zhang | Balaji Rama | Jingyi Ni | Shiying He | Fu Zhao | Kunyu Chen | Arnold Chen | Junyu Cao

We introduce LiteWebAgent, an open-source suite for VLM-based web-agent applications. Our framework addresses a critical gap in the web-agent ecosystem with a production-ready solution that combines minimal serverless backend configuration, intuitive user and browser interfaces, and extensible research capabilities in agent planning, memory, and tree search. For the core LiteWebAgent agent framework, we implemented a simple yet effective baseline using recursive function calling, providing decoupled action generation and action grounding. In addition, we integrate advanced research components such as agent planning, agent workflow memory, and tree search in a modular and extensible manner. We then integrate the LiteWebAgent agent framework with the frontend and backend as deployed systems in two formats: (1) a production Vercel-based web application, which provides users with an agent-controlled remote browser, and (2) a Chrome extension leveraging LiteWebAgent's API to control an existing Chrome browser via CDP (Chrome DevTools Protocol). The LiteWebAgent framework is available at https://github.com/PathOnAI/LiteWebAgent, with the deployed frontend at https://lite-web-agent.vercel.app/.

L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects
Yutaro Yamada | Khyathi Chandu | Bill Yuchen Lin | Jack Hessel | Ilker Yildirim | Yejin Choi

Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet, these models are not robust in precisely reasoning about physical and spatial configurations of objects, especially when instructed with unconventional, and thereby out-of-distribution, descriptions such as "a chair with five legs". In this paper, we propose a language agent with chain-of-3D-thoughts (L3GO), an inference-time approach that can reason about part-based 3D construction of unconventional objects that current data-driven diffusion models struggle with. More concretely, we use large language models as agents to compose a desired object via trial-and-error within a 3D simulation environment. To facilitate our investigation, we develop a new benchmark, Unconventionally Feasible Objects (UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender where language agents can build and compose atomic building blocks via API calls. Human and automatic GPT-4V evaluations show that our approach surpasses the standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our approach outperforms other state-of-the-art text-to-2D image and text-to-3D models based on human evaluation.

Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model
Keito Sasagawa | Koki Maeda | Issa Sugiura | Shuhei Kurita | Naoaki Okazaki | Daisuke Kawahara

To develop high-performing Visual Language Models (VLMs), it is essential to prepare multimodal resources, such as image-text pairs, interleaved data, and instruction data. While multimodal resources for English are abundant, there is a significant lack of corresponding resources for non-English languages, such as Japanese. To address this problem, we take Japanese as a non-English language and propose Japanese multimodal datasets for rapidly developing a Japanese multimodal model. We collect Japanese image-text pairs and interleaved data from web archives and generate Japanese instruction data using an existing large language model and a VLM. Our experimental results show that a VLM trained on these native datasets outperforms those relying on machine-translated content. The resulting VLM, datasets, and code used for training are publicly available.

Storybranch - generating multimedia content from novels
Rushikesh Hiray | Venelin Kovatchev

We present Storybranch, an automated system for generating multimedia content from long texts such as novels and fanfiction. The Storybranch pipeline includes structured information extraction, text parsing and processing, content generation using generative AI models, and synchronization of different streams (audio, video, background). Our system is highly modular and can efficiently generate three different types of multimodal content: audiobooks, simple animated videos, and visual-novel-style text-and-image video games. Storybranch successfully addresses challenges such as generating a unique and consistent image and voice for each character and narrator, identifying and generating background images and sound effects, and synchronizing character expressions and lip movements with text. As part of Storybranch, we develop and release BookNLP2, a new open-source library for parsing and extracting information from books, based on the legacy BookNLP library.

EventFull: Complete and Consistent Event Relation Annotation
Alon Eirew | Eviatar Nachshoni | Aviv Slobodkin | Ido Dagan

Event relation detection is a fundamental NLP task, leveraged in many downstream applications, whose modeling requires datasets annotated with event relations of various types. However, systematic and complete annotation of these relations is costly and challenging, due to the quadratic number of event pairs that need to be considered. Consequently, many current event relation datasets lack systematicity and completeness. In response, we introduce EventFull, the first tool that supports consistent, complete and efficient annotation of temporal, causal and coreference relations via a unified and synergetic process. A pilot study demonstrates that EventFull accelerates and simplifies the annotation process while yielding high inter-annotator agreement.

METAPHORSHARE: A Dynamic Collaborative Repository of Open Metaphor Datasets
Joanne Boisson | Arif Mehmood | Jose Camacho-Collados

The metaphor studies community has developed numerous valuable labelled corpora in various languages over the years. Many of these resources are not only unknown to the NLP community, but are also often not easily shared among researchers. Both in the human sciences and in NLP, researchers could benefit from a centralised database of labelled resources, easily accessible and unified under an identical format. To facilitate this, we present MetaphorShare, a website that integrates metaphor datasets to make them open and accessible. With this effort, our aim is to encourage researchers to share and upload more datasets in any language in order to facilitate metaphor studies and the development of future metaphor processing NLP systems. The website has four main functionalities: uploading, downloading, searching and labelling metaphor datasets. It is accessible at www.metaphorshare.com.

Towards Unified, Dynamic and Annotation-based Visualisations and Exploration of Annotated Big Data Corpora with the Help of Unified Corpus Explorer
Kevin Bönisch | Giuseppe Abrami | Alexander Mehler

The annotation and exploration of large text corpora, both automatic and manual, presents significant challenges across multiple disciplines, including linguistics, digital humanities, biology, and legal science. These challenges are exacerbated by the heterogeneity of processing methods, which complicates corpus visualization, interaction, and integration. To address these issues, we introduce the Unified Corpus Explorer (UCE), a standardized, dockerized, open-source and dynamic Natural Language Processing (NLP) application designed for flexible and scalable corpus navigation. Herein, UCE utilizes the UIMA format for NLP annotations as a standardized input, constructing interfaces and features around those annotations while dynamically adapting to the corpora and their extracted annotations. We evaluate UCE based on a user study and demonstrate its versatility as a corpus explorer based on generative AI.
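
Reading UIMA-annotated input of the kind UCE consumes can be sketched in Python with the dkpro-cassis library; UCE itself may use a different stack, and the file names and annotation type below are examples.

```python
# Sketch of loading UIMA XMI annotations (UCE's standardized input format)
# with dkpro-cassis. File names and the annotation type are examples; UCE
# itself may use a different stack.
from cassis import load_cas_from_xmi, load_typesystem

with open("TypeSystem.xml", "rb") as f:
    typesystem = load_typesystem(f)
with open("document.xmi", "rb") as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

# Iterate over one annotation type produced upstream, e.g. named entities.
for ne in cas.select("de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"):
    print(ne.get_covered_text(), ne.value)
```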

MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation
Zichen Zhu | Hao Tang | Yansi Li | Dingye Liu | Hongshen Xu | Kunyao Lan | Danyang Zhang | Yixuan Jiang | Hao Zhou | Chenrun Wang | Situo Zhang | Liangtai Sun | Yixiao Wang | Yuheng Sun | Lu Chen | Kai Yu

Existing Multimodal Large Language Model (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI environments, which integrate text, images, and spatial relationships, as well as the variability in action spaces across different pages and tasks. To address these limitations, we propose MobA, a novel MLLM-based mobile assistant system. MobA introduces an adaptive planning module that incorporates a reflection mechanism for error recovery and dynamically adjusts plans to align with the real environment contexts and action module’s execution capacity. Additionally, a multifaceted memory module provides comprehensive memory support to enhance adaptability and efficiency. We also present MobBench, a dataset designed for complex mobile interactions. Experimental results on MobBench and AndroidArena demonstrate MobA’s ability to handle dynamic GUI environments and perform complex mobile tasks.

OpenReviewer: A Specialized Large Language Model for Generating Critical Scientific Paper Reviews
Maximilian Idahl | Zahra Ahmadi

We present OpenReviewer, an open-source system for generating high-quality peer reviews of machine learning and AI conference papers. At its core is Llama-OpenReviewer-8B, an 8B parameter language model specifically fine-tuned on 79,000 expert reviews from top conferences. Given a PDF paper submission and review template as input, OpenReviewer extracts the full text, including technical content like equations and tables, and generates a structured review following conference-specific guidelines. Our evaluation on 400 test papers shows that OpenReviewer produces considerably more critical and realistic reviews compared to general-purpose LLMs like GPT-4 and Claude-3.5. While other LLMs tend toward overly positive assessments, OpenReviewer’s recommendations closely match the distribution of human reviewer ratings. The system provides authors with rapid, constructive feedback to improve their manuscripts before submission, though it is not intended to replace human peer review. OpenReviewer is available as an online demo and open-source tool.