Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Anthology ID: 2026.acl-demo
Month: July
Year: 2026
Address: San Diego, California, United States
Venue: ACL
Event: Annual Meeting of the Association for Computational Linguistics (2026)
Publisher: Association for Computational Linguistics
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-demo/
ISBN: 979-8-89176-392-0

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Greg Durrett | Ping Jian

Interpreto is an open-source Python library for interpreting HuggingFace language models, from early BERT variants to LLMs. It provides two complementary families of methods: attribution methods and concept-based explanations. The library bridges recent research and practical tooling by exposing explanation workflows through a unified API for both classification and text generation. A key differentiator is its end-to-end concept-based pipeline (from activation extraction to concept learning, interpretation, and scoring), which goes beyond feature-level attributions and is uncommon in existing libraries.

We present **Copyright Detective**, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. Because copyright law is complex, the system treats the question of infringement versus compliance as an **evidence discovery** process rather than a static classification task. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access. In our experiments with GPT-4o-mini, we demonstrate that the specific persuasive strategy "Pathos" shifts the leakage distribution from about 0.1 to about 0.7 (ROUGE-L). Our live system is hosted on a [Streamlit server](https://copyright-detective.streamlit.app), with a [demonstration video](https://youtu.be/z9Lh4kNDHiM) included as supplementary material.
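
The leakage figures above are reported in ROUGE-L, which scores a model output against a reference text by the length of their longest common subsequence (LCS). The following is a minimal sketch of the standard LCS-based F-measure, assuming whitespace tokenization and lowercasing; the function names are illustrative, and this is not the system's own scoring code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure between two strings (whitespace tokens, an assumption)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


verbatim = rouge_l("the cat sat on the mat", "the cat sat on the mat")
partial = rouge_l("a b c d", "a x c y")
```

Under this sketch, a verbatim reproduction scores 1.0 and an output sharing half of its tokens in order scores 0.5, so a shift from about 0.1 to about 0.7 corresponds to much longer in-order overlaps with the protected text.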

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows
Kazuki Kawamura | Satoshi Waki | Kei Tateno

Multi-agent LLM workflows, which are AI systems composed of multiple role-specialized LLM calls, often outperform single prompts, but they are notoriously difficult to debug and refine. Failures can originate from subtle mistakes in intermediate artifacts that silently propagate downstream, forcing developers to read long traces and guess which agent to edit. We present PROTEA, a unified UI that closes the loop for offline, test-case–driven improvement of multi-agent workflows, enabling developers to efficiently diagnose and fix errors without manual inspection of long traces. PROTEA executes a workflow, scores intermediate artifacts with configurable evaluators, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To address the difficulty of preparing intermediate references in complex systems, PROTEA performs backward node evaluation by inferring each node’s ideal expected output from terminal supervision and graph context, and comparing it with the observed node output. For a selected node, it proposes a targeted prompt patch as an editable diff, then automatically re-runs and re-evaluates the workflow to show before/after output diffs and score trajectories within the same interface. Using PROTEA, users can visually pinpoint system-wide bottlenecks at a glance, streamline remediation via semi-automated prompt patching, and immediately verify pre- and post-correction outcomes within a unified loop.

Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation. The source code and dataset are publicly available on GitHub (https://github.com/BLPXSPG/SymbolicThought).

LiTS: A Modular Framework for LLM Tree Search
Xinzhe Li | Yaguang Tao

LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components—Policy, Transition, and RewardModel—that plug into algorithms like MCTS and BFS. A decorator-based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to implement custom search algorithms. We demonstrate composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing that components and algorithms are orthogonal: components are reusable across algorithms within each task type, and algorithms work across all components and domains. We also report a mode-collapse finding: in infinite action spaces, LLM policy diversity—not reward quality—is the bottleneck for effective tree search. A demonstration video is available at https://youtu.be/nRGX43YrR3I. The package is released under the Apache 2.0 license at https://github.com/xinzhel/lits-llm, including installation instructions and runnable examples that enable users to reproduce the demonstrated workflows.
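
To illustrate the decomposition described above, here is a minimal sketch of a decorator-based registry whose components plug into a breadth-first search. Every name in it (`REGISTRY`, `register`, the toy components, `bfs`) is hypothetical and chosen for illustration; it does not reproduce the LiTS API, for which the linked repository is the reference.

```python
from collections import deque

# Hypothetical registry: one namespace per component type.
REGISTRY = {"policy": {}, "transition": {}, "reward": {}}


def register(kind, name):
    """Decorator that stores a component under REGISTRY[kind][name]."""
    def wrap(obj):
        REGISTRY[kind][name] = obj
        return obj
    return wrap


@register("policy", "digits")
def propose_actions(state):
    # Toy policy: always propose appending 1, 2, or 3.
    return [1, 2, 3]


@register("transition", "append")
def apply_action(state, action):
    # Toy transition: states are tuples of chosen digits.
    return state + (action,)


@register("reward", "target_sum")
def score(state, target=6):
    # Toy reward: 1.0 when the digits sum to the target, else 0.0.
    return 1.0 if sum(state) == target else 0.0


def bfs(policy, transition, reward, max_depth=3):
    """Breadth-first search returning the first state with reward 1.0."""
    frontier = deque([()])
    while frontier:
        state = frontier.popleft()
        if reward(state) == 1.0:
            return state
        if len(state) < max_depth:
            for action in policy(state):
                frontier.append(transition(state, action))
    return None


solution = bfs(
    REGISTRY["policy"]["digits"],
    REGISTRY["transition"]["append"],
    REGISTRY["reward"]["target_sum"],
)
```

Because the search function only sees three callables, swapping `bfs` for another algorithm, or swapping any one component for another registered under the same kind, leaves the rest of the code untouched. That is the sense in which components and algorithms are orthogonal.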

High-quality scientific illustrations are essential for communicating complex scientific and technical concepts, yet existing automated systems remain limited in editability, stylistic controllability, and efficiency. We present AutoFigure-Edit, an end-to-end system that generates fully editable scientific illustrations from long-form scientific text while enabling flexible style adaptation through user-provided reference images. By combining long-context understanding, reference-guided styling, and native SVG editing, it enables efficient creation and refinement of high-quality scientific illustrations. To facilitate further progress in this field, we release the video at https://youtu.be/10IH8SyJjAQ and the full codebase at https://github.com/ResearAI/AutoFigure-Edit, and provide a live demo for easy access and interactive use at https://autofigure.cc/.

Large language model (LLM) agents are increasingly applied to financial decision-making tasks that require interaction with external tools, including market data retrieval, news analysis, and trade execution. However, existing trading systems rely on fragmented and task-specific APIs, resulting in inconsistent schemas, complex integration, and limited reproducibility. We present QFinZero, a unified trading environment for LLM-based financial agents. QFinZero standardizes three core capabilities: (i) multi-frequency market and derivatives data access, (ii) structured news and event retrieval, and (iii) stateful brokerage simulation with explicit order lifecycle management. All tools adopt consistent JSON schemas and time-aligned interfaces, enabling agents to acquire information and execute trades within a coherent framework. By abstracting financial interaction into composable, agent-invokable primitives, QFinZero reduces engineering overhead and supports reproducible evaluation through comprehensive logging and deterministic replay. We argue that such a standardized trading environment is essential for scalable research on LLM-based financial agents.

dLLM: Simple Diffusion Language Modeling
Zhanhui Zhou | Lingjie Chen | Hanghang Tong | Dawn Song

Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling (training, inference, and evaluation) and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.

Fast-MIA: Efficient and Scalable Membership Inference for LLMs
Hiromu Takahashi | Shotaro Ishihara

We propose Fast-MIA (https://github.com/Nikkei/fast-mia), a Python library for efficiently evaluating membership inference attacks (MIA) against large language models (LLMs). MIA has emerged as a crucial technique for auditing privacy risks and copyright infringement in LLMs. However, computational demands have grown substantially: recent methods rely on repeated inference, while practical auditing requires large-scale evaluation. Progress is further hindered by existing implementations that execute methods independently, redundantly computing shared intermediate results such as log-probabilities. To address these challenges, Fast-MIA combines two strategies: (1) high-throughput batch inference via vLLM, achieving approximately 5× speedup, and (2) a cross-method caching architecture that computes intermediate results once and shares them across methods. The library includes representative MIA methods under a unified framework, integrates with established benchmarks, and supports flexible YAML configuration. We release Fast-MIA under the Apache License 2.0 to support scalable and reproducible MIA research.
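
The cross-method caching idea can be sketched in a few lines: per-token log-probabilities are computed once per text and reused by every scoring method. The sketch below is an illustration under stated assumptions, not Fast-MIA's implementation. `FAKE_LOGPROBS` stands in for real model inference, and the two scorers (mean log-likelihood and a Min-K%-style average of the lowest-probability tokens) are simplified versions of common MIA scores.

```python
from functools import lru_cache

# Counts how many times the (stand-in) model is actually queried.
CALLS = {"count": 0}

# Hypothetical stand-in for model inference: fixed per-token log-probabilities.
FAKE_LOGPROBS = {
    "seen text": (-0.1, -0.2, -0.3, -0.4),
    "unseen text": (-1.0, -2.0, -3.0, -4.0),
}


@lru_cache(maxsize=None)
def token_logprobs(text):
    """Shared intermediate result: computed once per text, then cached."""
    CALLS["count"] += 1
    return FAKE_LOGPROBS[text]


def loss_score(text):
    """Mean token log-likelihood (higher suggests membership)."""
    lp = token_logprobs(text)
    return sum(lp) / len(lp)


def min_k_score(text, k=0.5):
    """Mean of the lowest k fraction of token log-probabilities."""
    lp = sorted(token_logprobs(text))
    n = max(1, int(len(lp) * k))
    return sum(lp[:n]) / n


scores = {
    text: {"loss": loss_score(text), "min_k": min_k_score(text)}
    for text in FAKE_LOGPROBS
}
```

Two methods are evaluated on two texts, yet `CALLS["count"]` ends at 2 rather than 4. The saving grows with the number of methods that share the same intermediate result.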

CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
Peter Jansen | Samiah Hassan | Pragnya Narasimha

Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually-crafted experiment examples, or operate solely from parametric knowledge, limiting quality and reach. We introduce CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find the best model is capable of producing functional examples for 74% of repositories, while our downstream evaluation shows an ASD agent augmented with a CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples. We also evaluate LLM-as-a-judge ratings against domain-expert ratings in an A/B testing paradigm, finding moderate agreement and suggesting that inexpensive proxy metrics may be feasible for evaluating scientific discovery systems at scale.

Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) configurable Extraction Module leveraging multimodal LLMs to transform unstructured content into structured data; (3) Agentic Analytics Module, compliant with the Model Context Protocol (MCP) providing data access through secure, sandboxed code execution; and (4) Rule Validation Module replacing deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider achieving 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs over legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.
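
As an illustration of how BIO tagging can segment a document packet, the sketch below groups pages into documents from one tag per page: "B" begins a document, "I" continues it, and "O" marks a page outside any document. The page-level granularity and the function name `bio_to_segments` are assumptions made for this example; the abstract does not specify how DocSplit applies its tags.

```python
def bio_to_segments(tags):
    """Group page indices into documents from per-page B/I/O tags."""
    segments, current = [], []
    for page, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and not current):
            # "B" opens a new document; a stray "I" is treated as an opener too.
            if current:
                segments.append(current)
            current = [page]
        elif tag == "I":
            current.append(page)
        else:
            # "O" marks a page outside any document and closes the open one.
            if current:
                segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


packet = ["B", "I", "I", "B", "O", "B", "I"]
segments = bio_to_segments(packet)
```

For this seven-page packet the result is three documents covering pages 0 to 2, page 3, and pages 5 to 6, with page 4 left unassigned.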

AInterviewer: A Platform for Designing and Conducting AI-led Qualitative Interviews
Tobias Gårdhus | Nikolas Vitsakis | Fie Lejre Frederiksen | Anna Rogers | Hjalmar Bang Carlsen

There are now multiple proposals for systems based on Large Language Models (LLMs) to conduct automated qualitative interviews. This approach scales up qualitative interview techniques that have traditionally been constrained by the high costs of data collection. However, most of the current solutions rely on proprietary LLMs, which compromise reproducibility and data security. They also rely on LLMs for all interview tasks, which limits standardisation of question wording as well as control over question order. To address these issues, we introduce the AInterviewer platform, based on a multi-agent framework that combines controlled question administration of survey software with the flexibility of LLMs. AInterviewer can run with locally hosted models to ensure security and transparency. Our platform provides a web-based GUI supporting each phase of data collection: from interview guide design and pilot testing to interview distribution and data collection monitoring.

Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragmented and labour-intensive process. Researchers must interpret complex clinical texts, manually extract outcome data from trials, define appropriate interventions and comparators, harmonise inconsistent study designs, and carry out meta-analytic computations, typically using specialised software that demands structured inputs and domain expertise. While recent work has demonstrated that large language models can extract study-level data from unstructured text, no existing system automates the complete pipeline from raw documents to synthesised forest plots. To address this gap, we introduce AutoForest, the first end-to-end system that generates publication-ready forest plots directly from biomedical papers. Given one or more study papers, AutoForest automatically suggests ICO (Intervention, Comparator, Outcome) elements, extracts outcome data, performs statistical synthesis, and renders the final forest plot. We describe the system architecture and user interface, and demonstrate its effectiveness on real-world examples through a user study involving clinicians, showing how AutoForest can accelerate evidence synthesis and substantially lower the barrier to conducting meta-analyses.

We present **Dash-M5H**, an interactive dashboard for *multi-modal, multi-model mental health* assessment that helps clinicians and researchers jointly inspect multimodal behavioral data with multi-model signal outputs of recorded clinical interviews. Guided by signal detection and integrated sensemaking theories, Dash-M5H synchronizes transcript text, audio, and facial behavior (action units and gaze) to support overview-to-detail evidence tracing, and it integrates extracted signals (e.g., sentiment and facial activity) with a clinically grounded VLM prediction pipeline that produces DSM-5-aligned depression predictions. Dash-M5H (https://dash-m5h.io) is implemented in a lightweight, browser-based stack (Quarto + Observable JS + D3) and supports local data import and time-synced clinical annotation with export. We demonstrate Dash-M5H through a depression screening scenario, evaluate its note-taking and screening capabilities through a user experiment, and release a live demo (https://youtu.be/w3qCJ02k6bw) and code (https://github.com/nd-hal/M5H-Dashboard-VLM) to facilitate reproducible evaluation.

MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models
Ahmad Chamma | Omar El Herraoui | Guokan Shang

We introduce MixtureKit, a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned checkpoints. MixtureKit supports three complementary strategies: (i) Traditional MoE, using a single router per transformer block to select experts; (ii) BTX (Branch-Train-Mix), adding routers at user-specified sub-layers for fine-grained token routing; and (iii) BTS (Branch-Train-Stitch), preserving experts intact and introducing lightweight stitch layers for controlled hub–expert information exchange. Given a single configuration dictionary, MixtureKit automatically modifies model configuration, patches decoder and causal LM classes, and exports a unified transformers-compatible checkpoint ready for inference or further fine-tuning. We also provide a visualization interface to inspect token routing, expert weight distributions, and layer-wise contributions. Experiments on multilingual code-switched (Arabic–Latin) data show that BTX models built with MixtureKit can outperform dense baselines across multiple benchmarks. The library is accessible at: https://github.com/MBZUAI-Paris/MixtureKit.

Squrve: A Unified and Modular Framework for Complex Real-World Text-to-SQL Tasks
Yihan Wang | Peiyu Liu | Runyu Chen | Jiaxing Pu | Wei Xu

Text-to-SQL technology has evolved rapidly, with diverse academic methods achieving impressive results. However, deploying these techniques in real-world systems remains challenging due to limited integration tools. To bridge this gap, we introduce Squrve, a unified, modular, and extensive Text-to-SQL framework designed to bring together research advances and real-world applications. Squrve first establishes a universal execution paradigm that standardizes invocation interfaces, then proposes a multi-actor collaboration mechanism based on seven abstracted, effective atomic actor components. Experiments on widely adopted benchmarks demonstrate that the collaborative workflows consistently outperform the original individual methods, thereby opening up a new, effective avenue for tackling complex real-world queries. The code is available at https://github.com/LLM-Cube/Squrve.

Large language model (LLM) agents increasingly operate in multi-agent settings where failures emerge from interaction dynamics rather than isolated model errors. We introduce RiskLab, an open-source toolkit for instantiating, probing, and measuring emergent risks in LLM-based multi-agent systems under controlled conditions. Each experiment is defined as a structured topology–environment–protocol–agent–task quintuple, enabling reproducible studies of how communication structure, coordination mechanisms, and incentives shape system-level risks. RiskLab provides flexible communication topologies, swappable interaction protocols, trajectory-grounded evaluation, and extensible registries for risk detectors and agent backends. We demonstrate the toolkit across representative risks, including collusion, resource overreach, semantic drift, and strategic misreporting, and support one-file reproducibility via configuration.

ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
Yuzhuang Xu | Xu Han | Yuxuan Li | Wanxiang Che

Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computational potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes that group cores and memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, limiting inference scalability and hindering the deployment of intelligent applications on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight maintains compatibility with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.

DialogGuard: Multi-Agent Psychosocial Safety Evaluation Interface of Sensitive LLM Responses
Han Luo | Guy Laban

LLM-based agents are increasingly deployed for mental-health support and crisis counselling, yet recent evaluations reveal that commercial therapy chatbots respond appropriately only about half the time in clinical scenarios. Clinicians and safety engineers are called upon to audit these systems, but existing tools do not surface the subtler psychosocial harms (manipulation, discrimination, psychological distress) nor produce the explainable rationales that practitioners need. We present DialogGuard, an open-source system that lets practitioners inspect, stress-test, and create audit trails for prompted LLM agents across five psychosocial safety dimensions. The system wraps around arbitrary generative models through four LLM-as-a-judge pipelines (single-agent scoring, dual-agent correction, multi-agent debate, and majority voting), each grounded in shared three-level rubrics. Through its web interface, practitioners evaluate agents in two modes (Live Chat and Manual Input) and review per-dimension risk scores with natural-language rationales. Experiments on PKU-SafeRLHF show that dual-agent correction provides the best accuracy-robustness trade-off, and a formative study with 12 practitioners confirms that the system supports prompt auditing, safety inspection, and supervisory decision-making. Code and demo: https://github.com/lhannnn/dialogguard-web.
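
Of the four judge pipelines, majority voting is the simplest to sketch. The example below aggregates independent judge scores on a three-level rubric and is only an illustration. The 0/1/2 encoding of the levels and the rule of breaking ties toward the higher risk level are assumptions made here, not details taken from DialogGuard.

```python
from collections import Counter


def majority_vote(scores):
    """Return the most common rubric level among judge scores.

    Levels are assumed to be 0 (safe), 1 (borderline), 2 (harmful).
    Ties are broken toward the higher level (a conservative assumption).
    """
    counts = Counter(scores)
    top = max(counts.values())
    return max(level for level, count in counts.items() if count == top)


clear_majority = majority_vote([0, 2, 2])
three_way_tie = majority_vote([0, 1, 2])
```

With three judges, two agreeing votes decide the level; when all three disagree, this sketch reports the highest level so that a possible harm is surfaced for review rather than averaged away.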

ClinQueryAgent: A Conversational Agent for Population Health Management
Joseph Spartacus Boyle | Anthony Michael Dranfield | Mike O’Neil | Maria Liakata | Alison Q. Smithard

In this paper we introduce ClinQueryAgent, a system for translating natural language population health questions into executable database queries using agents with access to both local and external knowledge bases. Our novel architecture enables the use of powerful cloud-based language models whilst ensuring that no patient data leaves the secure environment. To combat inaccuracies over the course of longer dialogues due to context rot, information retrieval is delegated to a sub-agent. We deploy the system via a chat window embedded within an existing population health management platform where it has been used by 128 staff from 15 healthcare practices covering a total of 148,319 patients in the UK’s National Health Service (NHS). We evaluate the system’s capacity to autonomously handle a range of health informatics tasks on three datasets and via a beta-testing phase. Our results show that both analysts and clinicians are able to easily generate actionable information from patient health records using natural language requests requiring no programming expertise to verify. A public demo of the system is available to try: https://clinqueryagent.josephsboyle.com/

Building retrieval-augmented generation (RAG) systems often requires combining separate tools for retrieval, re-ranking, and generation, with incompatible data formats, evaluation pipelines, and deployment workflows. We present Rankify, an open-source Python toolkit that unifies these stages in a single modular framework (PyPI: https://pypi.org/project/rankify/; GitHub: https://github.com/DataScienceUIBK/Rankify; Docs: https://rankify.readthedocs.io; Video: https://youtu.be/kkLzomrM2ec). Rankify provides 42 benchmark datasets with pre-retrieved documents and pre-built indices, 15 retrievers (sparse, dense, and reasoning-augmented), and 24 re-ranking models spanning 41 pointwise, pairwise, and listwise variants. It also supports 6 RAG strategies across four inference backends (Hugging Face, vLLM, LiteLLM, and OpenAI), enabling consistent experimentation from local models to hosted APIs. A unified pipeline interface allows users to compose retrieve–rerank–generate workflows in a few lines of code, while an agentic assistant (RankifyAgent), a REST server (RankifyServer), and an interactive web playground support deployment and non-programmatic exploration. Across 200+ configurations on QA and BEIR/TREC benchmarks with six generator LLMs, re-ranking consistently improves downstream performance, yielding gains of 5–15 points in Exact Match and up to 8.5 points in RAGAS context precision across diverse retriever–generator combinations.

Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which uses calls to a backbone LLM to turn a question and a corpus into a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com.

Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) – to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as “redacted documents” or “pie charts.” We detail GovScape’s search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape’s pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. We evaluate GovScape by (1) analyzing 1,679 search queries and (2) benchmarking vector and keyword index efficiency using these queries. GovScape can be found at https://www.govscape.net.
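
The cost-efficiency figure can be checked from the numbers in the abstract. The short calculation below assumes the reported compute cost of approximately 1,500 is in US dollars, which the "per dollar" phrasing implies; the variable names are illustrative.

```python
total_pages = 70_958_487      # PDF pages reported in the abstract
total_pdfs = 10_015_993       # PDFs reported in the abstract
compute_cost_usd = 1_500      # approximate pre-processing cost (assumed USD)

# Pages processed per dollar of compute.
pages_per_dollar = total_pages / compute_cost_usd

# Average document length, for context on the 50-page cap.
avg_pages_per_pdf = total_pages / total_pdfs
```

The ratio comes to roughly 47,300 pages per dollar, consistent with the rounded figure of 47,000 in the abstract, and the average document is about seven pages long.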

Linux kernel device drivers are tightly coupled with hardware, making them difficult to execute and test without physical devices. This heavily limits automated code analysis and vulnerability discovery. While manual modeling is unscalable, Large Language Models (LLMs) offer a new approach to scale virtual device construction across the Linux driver ecosystem. In this paper, we present DevGen, an LLM-powered tool that generates QEMU-based virtual devices directly from Linux driver source code. DevGen combines static analysis to gather necessary context, guides the LLM through step-by-step prompting, and uses an automated self-correction loop driven by compilation and execution feedback. To further reduce errors, similar fixes are retrieved from a library of common modeling failures and incorporated into the repair prompt, which supports more targeted corrections in later iterations. The generated devices finally integrate with QEMU and Syzkaller, enabling driver fuzzing without physical hardware. DevGen is evaluated on 50 PCI/PCIe drivers from Linux 6.18 using three mainstream LLMs, and successfully generates usable models for 44 drivers. Of these drivers, 24% achieve significant improvements in fuzzing coverage, and 7 previously unknown crashes are triggered, with 1 CVE assigned. These results demonstrate the practical capability of LLMs to automate complex, system-level code generation tasks.

BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation
Yang Qi | Xiangyao Ma | Xiao Wang | Hao Wang | Rui Wang

As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering. The translated content is then re-anchored to the original layout through an adaptive typesetting engine. Experiments on a curated 200-page benchmark, together with human evaluation and multimodal LLM-as-a-judge evaluation, show that BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines, while maintaining competitive translation precision. The open-source toolkit and its interactive downstream applications have garnered over 7.8k stars on GitHub (https://github.com/funstory-ai/BabelDOC). A demonstration video is available at https://youtu.be/chwrlApH7a4.

Automated ICD coding is a critical task for standardizing clinical information from electronic health records (EHRs) and supporting downstream healthcare administration. However, existing automated ICD coding systems face several fundamental challenges. First, the majority of existing research focuses on English ICD tasks, with limited attention to Chinese-language clinical contexts due to the scarcity of publicly available Chinese ICD datasets. Second, most approaches primarily target disease coding, overlooking procedure coding as well as the multi-stage workflows followed in real-world clinical practice. Moreover, many recent methods rely heavily on closed-source large language models or substantial computational resources, which limits their scalability and deployability in clinical environments. To address these gaps, this paper proposes JointCoder, which includes a real-world Chinese ICD coding dataset and a multi-agent framework that reformulates automated ICD coding as a joint disease-procedure coding task. JointCoder explicitly models real-world clinical coding workflows through stage-wise agent collaboration. All agents are instantiated using locally deployed 1.7B-parameter models, enabling scalable and privacy-preserving deployment. Extensive experiments on real-world Chinese ICD coding datasets demonstrate JointCoder’s superiority over state-of-the-art baselines across all evaluation metrics.

pdf bib abs

We demonstrate Hindsight, a working memory system for AI agents that organizes long-term memory into four logical networks and exposes three core operations. The world, experience, observation, and opinion networks separate objective facts from subjective beliefs, giving developers visibility into what an agent knows versus what it believes. The retain, recall, and reflect operations handle ingestion, retrieval, and reasoning respectively, with a parallel pipeline that combines vector search, keyword matching, graph traversal, and temporal filtering, backed by PostgreSQL with pgvector. Unlike existing systems such as MemGPT, Zep, and Mem0, Hindsight is the only one that jointly provides fact-belief separation, temporal entity graphs, evolving opinions with confidence scores, and configurable behavioral profiles. On LongMemEval and LoCoMo, Hindsight with a 20B open-source model reaches 83.6% and 83.2% accuracy, outperforming full-context GPT-4o and all prior memory systems; with Gemini-3 Pro, LongMemEval accuracy reaches 91.4%. Our interactive demo lets users build memory graphs through multi-session conversations, inspect how memories are classified, and watch opinions form and change. The system is **open-source under the MIT license**, available as a Python package (pip install hindsight-all) and Docker image, with **13.3K GitHub stars** and 763 forks to date, and in production use at Fortune 500 enterprises. Video demo: https://youtu.be/4M2wS-yEmVA.

pdf bib abs

Conversational AI (ConvAI) agents increasingly maintain structured memory to support long-term, task-oriented interactions. In-context memory approaches append the growing history to the model input, which scales poorly under context-window limits. RAG-based methods retrieve request-relevant information, but most assume flat memory collections and ignore structure. We propose **Semantic XPath**, a **tree-structured memory module** to access and update structured conversational memory. **Semantic XPath** improves performance over flat-RAG baselines by **176.7%** while using only **9.1%** of the tokens required by in-context memory. We also introduce **SemanticXPath Chat**, an end-to-end ConvAI demo system that visualizes the structured memory and query execution details. Overall, this paper demonstrates a candidate for the next generation of long-term, task-oriented ConvAI systems built on structured memory.

pdf bib abs

Scientific AI agents can autonomously carry out complex research workflows, yet these unfolding workflows often remain difficult for humans to inspect and review, limiting interpretable, controllable, and effective human–AI collaboration. To address this challenge, we present a monitoring and visualization framework that records fine-grained execution events and organizes them into a directed graph that makes agent workflows explicit as they proceed. The system records intermediate steps (e.g., tool calls and code executions) and renders them as visual traces, updated in real time, that expose workflow structure. This allows users to examine how results are produced, identify where failures emerge, and better understand agent behavior across different stages of the research process. We conduct an evaluation on complex research tasks with domain experts from interdisciplinary backgrounds in AI, neuroscience, and biology. Experts report that structured trace visualization improves understanding of agent workflows, perceived interpretability, and usability for analysis and further interaction.

pdf bib abs

Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However, existing systems fail to provide accurate verification and efficiency under high-concurrency workloads. We present ScaleBox, a high-fidelity and scalable system designed to address these limitations in large-scale code training. ScaleBox introduces automated special-judge generation and management, fine-grained parallel execution across test cases with seamless multi-node coordination, and a configuration-driven evaluation suite for reproducible benchmarking. A series of experiments demonstrates that ScaleBox significantly enhances code verification accuracy and efficiency. Our further RLVR experiments show that ScaleBox substantially improves both performance on LiveCodeBench and training stability, significantly outperforming heuristic-matching baselines. By providing a reliable and high-throughput infrastructure, ScaleBox facilitates more effective research and development in large-scale code training.

pdf bib abs

Synthetic data has emerged as a crucial solution to the data scarcity bottleneck in large language models (LLMs), particularly for specialized domains and low-resource languages. However, the broader adoption of existing synthetic data tools is severely hindered by convoluted workflows, fragmented data standards, and limited scalability across modalities. To address these limitations, we develop DataArc-SynData-Toolkit, an open-source framework featuring: (1) a configuration-driven, end-to-end pipeline equipped with an intuitive visual interface and simplified CLI for exceptional usability; (2) a unified, quality-controllable synthesis paradigm that standardizes multi-source data generation to ensure high reusability; and (3) a highly modular architecture designed for seamless multimodal, multilingual, and multi-task adaptation. We apply the toolkit in multiple application scenarios. Experimental results demonstrate that our toolkit achieves an optimal balance between generation efficiency and data quality. By offering an end-to-end and visually interactive pipeline, DataArc-SynData-Toolkit significantly lowers the technical barrier to synthetic data generation and subsequent model training, accelerating its practical deployment in real-world applications.

pdf bib abs

HuoziIME: An On-Device LLM-Enhanced Input Method for Deep Personalization
Baocai Shan | Yuzhuang Xu | Wanxiang Che

Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on-device auxiliary generation feasible, enabling deeply personalized, privacy-preserving, and real-time generative IMEs poses fundamental challenges. To this end, we present HUOZIIME, a personalized on-device IME powered by an LLM. We endow HUOZIIME with initial human-like prediction ability by post-training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user-specific input history. Furthermore, we perform system-level optimizations tailored to on-device LLM-based IME deployment, ensuring efficient and responsive operation under mobile constraints. Experiments demonstrate efficient on-device execution and high-fidelity memory-driven personalization. Code and package are available at https://github.com/Shan-HIT/HuoziIME.

pdf bib abs

WIGVO: Real-Time Bidirectional Speech Translation over Legacy PSTN Calls via Dual-Session Echo Gating
Hyeong-seob Kim | Sang-Woo Son | Hyun-woo Cho | Hyeonsang Kim | Jinmo Kim

Real-time speech translation with large language models (LLMs) has become feasible in controlled wideband settings—mobile apps, web browsers, and end-to-end full-duplex systems pushing latency below 200 ms—where developers can assume client-side echo cancellation. However, deploying such systems over the Public Switched Telephone Network (PSTN) remains challenging due to narrowband G.711 audio, unpredictable round-trip delays, and absence of client-side signal processing. We present **WIGVO** (WIGTN Voice-Only), a server-side relay system that enables bidirectional LLM-based speech translation over ordinary telephone calls without requiring app installation or carrier integration. A central contribution is addressing what we term *echo-induced self-reinforcing translation loops*: synthesized speech echoing back through the PSTN gets re-ingested and repeatedly translated. WIGVO solves this through a dual-session architecture with deterministic silence injection and energy-based voice activity detection (VAD) gating. We evaluate WIGVO on 155 Korean–English PSTN calls (148 instrumented, 147 completed) across three communication modes—voice-to-voice (V2V), text-to-voice (T2V), and full-agent—observing 555 ms median caller-to-callee latency and 2,684 ms median callee-to-caller latency, zero echo-induced translation loops, COMET semantic adequacy of 0.71 (en→ko) and 0.62 (ko→en) against offline LLM references, and USD 0.28 per minute cost. The system is deployed at https://wigvo.wigtn.com, with a video walkthrough at https://youtu.be/4Uf6zMPOInY. Evaluation scripts and anonymized call logs are available in the open-source repository.
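The gating idea in this abstract can be illustrated with a small sketch. It is not WIGVO's implementation: the class name `EchoGate`, the energy threshold, and the echo-tail length are hypothetical values chosen for illustration. The sketch assumes that the relay knows when it has played synthesized speech and replaces incoming frames with silence during that window and a short tail after it, and that frames below an RMS energy threshold are also treated as silence.

```python
import math
from dataclasses import dataclass
from typing import List


def rms(frame: List[int]) -> float:
    """Root-mean-square energy of a frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


@dataclass
class EchoGate:
    # Hypothetical defaults; real values would need tuning per line.
    energy_threshold: float = 500.0  # minimum RMS treated as speech
    echo_tail_s: float = 0.5         # extra silence after playback ends
    _muted_until: float = 0.0        # time until which input is gated

    def on_tts_played(self, now: float, duration_s: float) -> None:
        """Record that synthesized speech is being played into the call."""
        self._muted_until = max(
            self._muted_until, now + duration_s + self.echo_tail_s
        )

    def filter(self, frame: List[int], now: float) -> List[int]:
        """Return the frame, or silence if it is likely echo or noise."""
        if now < self._muted_until or rms(frame) < self.energy_threshold:
            return [0] * len(frame)  # deterministic silence injection
        return frame


gate = EchoGate()
loud = [1000] * 160   # one 20 ms frame at 8 kHz, clearly above threshold
quiet = [10] * 160    # line noise, below threshold
```

Because the gate substitutes silence rather than dropping frames, the downstream session still receives a continuous stream, which is one plausible reason to prefer injection over discarding audio.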

pdf bib abs

The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet many existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a Python library that treats the entire agentic system as the unit of analysis. Important design decisions such as harness and context engineering are first-class citizens. MASEval helps practitioners identify the best implementation for their use case and researchers systematically study agentic systems, opening new avenues for principled system design. Through the first systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that, across models of comparable cost and capability, framework choice matters as much as model choice. MASEval is available under the MIT licence at https://github.com/maseval/MASEval.

pdf bib abs

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory, licensed under Apache-2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.

pdf bib abs

FactSearch: An Interactive Agentic Fact Search System for Verifying Large Language Model Outputs
Meng Fang | Harry Mackenzie

Large language models (LLMs) frequently generate factually incorrect or unverifiable statements, motivating tool-augmented verification systems that combine model reasoning with external evidence retrieval. For factuality evaluation to be scientifically reliable, verification pipelines must be controllable and reproducible: retrieval configuration and reasoning behaviour should be explicitly configurable and stable across runs. In practice, many existing systems depend on commercial search APIs whose ranking policies and retrieval behaviours are opaque and externally controlled, introducing uncontrolled variability into evaluation. This makes it difficult to disentangle reasoning errors from retrieval effects. We present FactSearch, a reproducibility-oriented agentic fact search system for claim-level factuality verification, built on a locally aggregated open-source search infrastructure. FactSearch follows an agentic verification workflow: it decomposes model outputs into atomic factual claims, generates targeted search queries, retrieves supporting evidence via a self-hosted meta-search engine, and performs modular verification within a fully configurable pipeline. By treating retrieval infrastructure as a first-class component, the system enables systematic analysis of retrieval–reasoning interactions. An interactive web interface supports transparent inspection and practical deployment. The project is available at https://factsearch.github.io.

pdf bib abs

Potato 2.0: A Comprehensive Annotation Platform with AI-in-the-Loop Support
David Jurgens | Michael Chen | Lina Iyer

Annotated data remains essential for training and evaluating NLP systems. Large language models have broadened the kinds of data researchers need, including multimodal and agentic system data. Here, we introduce Potato 2.0, a major update to our open source annotation platform designed for easy deployment, customization, and fully reproducible and shareable annotation designs. Potato offers broad support for many types of annotations in NLP, including 39 different types of annotation tasks, support for text, audio, image, and video modalities, or mixtures thereof. Potato 2.0 includes robust support for labeling agentic system outputs through reading common trace formats, or live interaction and annotation with agents in multiple settings, such as chatting, web-browsing, and coding. Potato also includes multiple AI-assistance features to help annotators more easily label data. Finally, Potato introduces a new agentic AI-in-the-loop workflow where a single human annotator collaborates with an LLM through iterative prompt refinement, uncertainty-driven instance selection, and progressive autonomy—enabling efficient dataset creation without a large annotation team.

pdf bib abs

mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models
Jakub Muszyński | Paweł Pozorski | Maria Ganzha

We present mllm-shap, an open-source Python platform for researchers and ML practitioners that extends Shapley value (SV) explainability from text-only large language models to multimodal LLMs (MLLMs) that jointly process text and audio. Building on the token-level SV framework introduced by TokenSHAP, mllm-shap addresses three challenges absent in the text-only setting: (1) modality-aware coalition masking that handles the coexistence of text tokens and dense audio encoder frames within a single input, (2) multi-turn conversation tracking with per-token role and modality metadata, and (3) audio token grouping via phonetic alignment that reduces the coalition space by 10–50 times. The platform ships as a pip-installable package implementing five SV estimation strategies – including a Complementary Contributions estimator with Neyman-optimal allocation that outperforms Monte Carlo baselines – together with an interactive web GUI for real-time attribution visualization. To our knowledge, mllm-shap is the first publicly available framework for complete, reproducible SV-based explainability of text-audio MLLMs. The package is MIT-licensed with full source code on GitHub and a demonstration video included as supplementary material.
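For orientation, the Monte Carlo baseline mentioned in this abstract can be sketched generically as permutation sampling. The code below is not the mllm-shap API: `shapley_monte_carlo`, the player names, and the toy additive value function are assumptions made for illustration. In a real setting the value function would score a model's output when only the tokens or audio groups in the coalition are left unmasked.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_monte_carlo(
    players: Sequence[str],
    value_fn: Callable[[FrozenSet[str]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the players."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            totals[p] += cur - prev  # marginal contribution of p
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}


# Toy additive game: each player contributes a fixed weight, so the
# exact Shapley value of a player equals its weight.
weights = {"text:hello": 1.0, "audio:seg0": 2.0, "audio:seg1": 0.5}


def toy_value(coalition: FrozenSet[str]) -> float:
    return sum(weights[p] for p in coalition)


estimates = shapley_monte_carlo(list(weights), toy_value)
```

Each sampled permutation costs one value-function call per player, so grouping many audio frames into a few segments reduces the cost of every sample as well as the size of the coalition space.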

Expert writing feedback from experienced researchers is critical for early-career scholars to improve their manuscripts, yet high-quality feedback often remains scarce because reviewing research papers is labor-intensive. Emerging AI-powered writing assistants largely focus on grammar fixes or simulating peer review with final scores, yet they fall short of providing concrete, actionable suggestions that help students improve their papers during drafting. We present PaperMentor, a human-centered writing assistant system that delivers actionable suggestions as Overleaf-native inline comments while leaving the actual writing entirely to human authors. PaperMentor integrates an expert skill library carefully curated from established researchers’ writing advice with 12 specialized agents covering different aspects of paper writing, such as formatting compliance, phrasing accuracy, and terminology consistency. In a user study (n=14), 90.6% of the generated comments were rated actionable and 67.5% were rated valid, significantly outperforming a GPT-5.2 baseline without the skill library. We release PaperMentor as open source for public use.

pdf bib abs

Gardener is an interactive agentic system for single-cell RNA-seq (scRNA-seq) analysis that enables expert-steered, iterative workflows under strict data-residency requirements. Existing large language model (LLM)-based analysis agents commonly encode workflow progress as implicit conversational state and rely on cloud-centric execution, which hinders traceability and auditability and complicates keeping sensitive expression data on-device. Gardener grounds cloud-side reasoning in a local, on-device scientific engine and an Experiment Management Kernel (EMK) that externalizes analysis progress as persistent, immutable snapshots linked by lineage. This explicit state representation supports rollback, branching, and comparison of alternative analysis paths while reusing prior computation. Gardener enforces data isolation by design: cloud-hosted LLMs operate only on snapshot identifiers and sanitized summaries, while raw expression matrices and local artifacts remain on the user’s device. A local graphical user interface (GUI) provides human-in-the-loop steering and inspection of workflow state and outputs. Gardener is released as an open-source desktop application for macOS and Windows under the Apache License 2.0.

pdf bib abs

TokCollate: A Comprehensive Tool for Tokenizer Evaluation and Visualization across Languages
Dušan Variš | Abishek Stephen | Jindřich Libovický

Tokenization quality varies significantly across languages, contributing to disparities in LLM performance and cost for speakers of less-resourced languages – a phenomenon known as the "token premium" problem. Despite growing research interest, no existing tool provides a comprehensive intrinsic evaluation of tokenizers paired with interactive visualization. We present TokCollate (pronounced similarly to chocolate), a Python-based evaluation framework combined with a JavaScript visualization interface that addresses this gap. TokCollate implements a wide range of intrinsic metrics, including monolingual measures such as average token length and Rényi/Shannon efficiency, and cross-lingual measures such as vocabulary overlap, Jensen-Shannon divergence, alignment-based Eflomal scores, and length ratios. It further enables analysis across language groups defined by genealogical families, scripts, geographic regions, speaker populations, and estimated data availability. TokCollate is open-source under the MIT license and available on GitHub.

pdf bib abs

A System for Dynamically Tracking Content Moderation on Reddit
George Arthur Baker | Bharadwaj Kadiyala

Recent work in natural language processing, human-computer interaction, and computational social science takes interest in the study of decentralized content moderation, in which individual communities largely determine their own norms, rules, and enforcement thereof. A key challenge to this body of work is that, once moderated, content and related variables become difficult or impossible to recover; previous work often relied on 3rd-party historical data sources, but recent world events, legal disputes, and policy shifts have significantly disrupted these services, practically disabling their research use-cases. As a result, in order to conduct new research and reproduce previous results, researchers must record content as it’s created, and monitor variables of interest over time. In this paper we present and publicly release a software system for the dynamic monitoring of Reddit posts, communities, and moderation actions, to enable scalable and reproducible research on decentralized platform governance and content moderation. To the authors’ knowledge, at the time of publication this system is the only available solution for general-purpose, real-time, policy-compliant longitudinal data collection on Reddit. Furthermore, the system’s integration with the official Reddit API enables the collection of authentication-gated data such as community engagement metrics and moderation team information, which was unavailable in previous historical data sources.

pdf bib abs

The AI Steerability 360 toolkit is an extensible, open-source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model’s weights or architecture), state (modification of the model’s activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at https://github.com/IBM/AISteer360.

pdf bib abs

Continuous monitoring of high-volume media streams requires systems that go beyond keyword alerts to deliver structured, actionable intelligence. We present a multi-agent media monitoring system that processes streaming articles through three stages: (1) a Matching Agent that uses a hybrid keyword-then-semantic matching approach, reducing agent invocations by ~20%; (2) batched multi-agent feature extraction, reducing core feature-extraction calls from 7 to 2 per article (a 71% reduction) with bounded quality tradeoffs; and (3) a Report Generation Agent that uses deterministic deduplication and density-based clustering. Four autonomous life-cycle agents manage the evolution of watchers.

pdf bib abs

LinkNav: Surfacing Interconnected Information in Scientific Articles
Sebastian Antony Joseph | Jennifer Healey | Junyi Jessy Li | Ani Nenkova

We present LinkNav, an enhanced experience for reading academic papers which makes explicit connections between related but non-adjacent passages. To create the experience, we instruct a language model to generate questions that may arise while reading a passage and then search for answer-bearing passages elsewhere in the document, forming intra-document connections when answers are found. We confirm that these building blocks work well to power the experience, with an answer detection pipeline that works with high precision, resulting in a reasonable number of such connections being made for a document. On a dataset of academic papers, we find that connected segments are on average ten segments away from each other, making explicit connections that a reader may have otherwise missed.

pdf bib abs

Legal practitioners in Thailand must navigate fragmented government websites to research over 3,800 active laws and 87,000 Supreme Court decisions, with no unified tool for cross-referencing, version tracking, or structural navigation. We present FourCorners, a deployed platform that addresses five practitioner pain points through three modules built on a temporal legal knowledge graph covering 552K nodes and 6.3M edges: (1) an AI legal assistant that performs grounded generation via structured graph retrieval, streaming verified source content inline with responses; (2) an interactive law reader that translates graph structure into navigation and comparison features; and (3) a court decision explorer with version-aware citations produced by temporal entity resolution across 87,394 rulings. The system discovers implicit cross-corpus relationships through co-citation analysis of Supreme Court decisions. Interviews with 20 legal professionals reveal that inline source verification fundamentally changes how practitioners interact with AI-generated legal content, and that cross-corpus enrichment surfaces legal relationships that existing tools leave invisible.

pdf bib abs

rosaOS: Agentic Operating System for Embodied LLMs
Yijun Ge | Kushaldeep Mujral | Karthik Nambiar | Jimmy Lin

We present rosaOS, an open-source agentic operating system for embodied LLMs: interactive, LLM-driven agents coordinate various software tools and physical devices through a desktop companion, the Reachy Mini robot. Existing LLM–robotic systems are generally built as a tight, intertwined stack, making it difficult to switch hardware, add extra capabilities, or expand to multiple devices without bespoke integration. Our system aims to provide a classic OS-inspired architecture where an agentic kernel manages all task execution and mediates device access, while process agents invoke tools to perform actions. We adopt industry-standard interfaces with MCP for agentic tool-calling and ROS for robot interactions, and demonstrate rosaOS on a multi-device setup including a quadruped robot, a wheeled mobile robot, and a smart lamp, all controlled through interactions with the Reachy Mini. By incorporating MCP extensibility with ROS hardware interoperability, rosaOS enables a plug-and-play ecosystem for building embodied agentic systems. Our OS is available at rosaos.ai.

pdf bib abs

Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics *before* they encounter them in the wild. We present *CritiSense*, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the *first* multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that *CritiSense* helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 6 months, we have reached 500+ active users. It is freely available on the Apple [App Store](https://apps.apple.com/us/app/critisense/id6749675792) and Google [Play Store](https://play.google.com/store/apps/details?id=com.critisense&hl=en).

pdf bib abs

Clinical decision-making requires synthesizing heterogeneous evidence, including patient histories, clinical guidelines, and trajectories of comparable cases. While large language models (LLMs) offer strong reasoning capabilities, they remain prone to hallucinations and struggle to integrate long, structured medical documents. We present MED-COPILOT, an interactive research prototype for evidence-aware clinical reasoning, designed to help clinicians and medical trainees inspect guideline-level and patient-level evidence. MED-COPILOT combines guideline-grounded GraphRAG retrieval with hybrid semantic-keyword similar-patient retrieval to support transparent and evidence-aware clinical reasoning. The system builds a structured knowledge graph from WHO and NICE guidelines, applies community-level summarization for efficient retrieval, and maintains a 36,000-case similar-patient database derived from SOAP-normalized MIMIC-IV notes and Synthea-generated records. We evaluate our framework on clinical note completion and medical question answering, and demonstrate that it consistently outperforms parametric LLM baselines and standard RAG, improving generation fidelity and benchmark QA accuracy. The full system is available at https://huggingface.co/spaces/shuhengc/MED-COPILOT, enabling users to inspect retrieved evidence, visualize token-level similarity contributions, and conduct guided follow-up analysis. Our results suggest a practical and interpretable approach to integrating structured guideline knowledge with patient-level analogical evidence for clinical LLMs.

pdf bib abs

Large language models (LLMs) enable scalable content generation for personalized learning, but reliability and pedagogical alignment remain open challenges. We present PathBuilder, a web-based system that integrates expert-validated assessment, retrieval-augmented generation (RAG), and an LLM-as-a-Judge validation loop within a closed instructional pipeline. The system uses a 17,758-item curriculum-aligned question bank, including 1,018 expert-approved LLM-generated items, to construct diagnostic and post-tests for fine-grained learner profiling. In a real-world deployment with 179 registered users (75 matched learners), PathBuilder achieved a mean absolute gain of 37.9 percentage points, Hake’s normalized gain of 0.760, and a large effect size (Cohen’s d = 0.98). A controlled study of the judge mechanism showed consistent high-quality instructional outputs with a 100% threshold pass rate. These results demonstrate that structured curriculum alignment combined with retrieval grounding and automated validation can support reliable LLM-based personalization in deployed learning systems. A live demonstration of PathBuilder is available at https://demo.pathbuilderedu.com.
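As background on the reported metric, Hake's normalized gain is conventionally defined as the share of the available headroom that a learner actually gains: $g = (\text{post} - \text{pre}) / (100 - \text{pre})$, with scores in percent. The sketch below only illustrates that definition. The function name and the example scores are made up for illustration and are not taken from the PathBuilder study, and the abstract does not say whether its figure is the gain of the mean scores or the mean of per-learner gains.

```python
def normalized_gain(pre_pct: float, post_pct: float) -> float:
    """Hake's normalized gain for scores expressed in percent (0-100)."""
    if not 0.0 <= pre_pct < 100.0:
        raise ValueError("pre-test score must be in [0, 100)")
    return (post_pct - pre_pct) / (100.0 - pre_pct)


# Illustrative scores only (not study data): a learner moving from
# 40% to 85% gains 45 points out of a possible 60.
example_gain = normalized_gain(40.0, 85.0)
```

An absolute gain in percentage points and a normalized gain answer different questions: the first measures raw improvement, while the second adjusts for how much room the learner had left to improve.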

pdf bib abs

AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
Karen Zhou | Chenhao Tan

Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use cases, we present AutoChecklist, an open-source library that unifies checklist-based evaluation into composable pipelines. At its core is a taxonomy of five checklist generation abstractions, each encoding a distinct strategy for deriving evaluation criteria. A modular Generator → Refiner → Scorer pipeline connects any generator with a unified scorer, and new configurations can be registered via prompt templates alone. The library ships with ten built-in pipelines implementing published approaches and supports multiple LLM providers (OpenAI, OpenRouter, vLLM). Beyond the Python API, the library includes a CLI for off-the-shelf evaluation and a web interface for interactive exploration. Validation experiments confirm that these checklist methods significantly align with human preferences and quality ratings, and a case study on ICLR peer review rebuttals demonstrates flexible domain adaptation. AutoChecklist is publicly available at https://github.com/ChicagoHAI/AutoChecklist.
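The Generator → Refiner → Scorer structure described in this abstract can be pictured with a toy sketch. It does not use the AutoChecklist API: `ChecklistPipeline` and the three stand-in functions are hypothetical, and simple string rules replace the LLM calls a real pipeline would make.

```python
from dataclasses import dataclass
from typing import Callable, List

Checklist = List[str]


@dataclass
class ChecklistPipeline:
    """Hypothetical three-stage pipeline; each stage is swappable."""
    generator: Callable[[str], Checklist]      # instruction -> criteria
    refiner: Callable[[Checklist], Checklist]  # clean up the criteria
    scorer: Callable[[Checklist, str], float]  # criteria + response -> score

    def run(self, instruction: str, response: str) -> float:
        checklist = self.refiner(self.generator(instruction))
        return self.scorer(checklist, response)


def keyword_generator(instruction: str) -> Checklist:
    # Toy stand-in for an LLM: treat longer words as required topics.
    return [w.strip(".,").lower() for w in instruction.split() if len(w) > 4]


def dedupe_refiner(items: Checklist) -> Checklist:
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(items))


def coverage_scorer(items: Checklist, response: str) -> float:
    # Fraction of checklist items mentioned in the response.
    if not items:
        return 0.0
    text = response.lower()
    return sum(item in text for item in items) / len(items)


pipeline = ChecklistPipeline(keyword_generator, dedupe_refiner, coverage_scorer)
score = pipeline.run(
    "Explain gradient descent and gradient clipping",
    "Gradient descent updates weights; clipping bounds the step.",
)
```

Keeping the stages as plain callables means any generator can be paired with the same scorer, which mirrors the composability the abstract emphasizes.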

pdf bib abs

JobMatchAI - An Intelligent Job Matching Platform Using Knowledge Graphs, Semantic Search and Explainable AI
Mayank Vyas | Abhijit Chakraborty | Vivek Gupta

Recruiters and job seekers rely on search systems to navigate labor markets, making candidate matching engines critical for hiring outcomes. Most systems act as keyword filters, failing to handle skill synonyms and nonlinear careers, resulting in missed candidates and opaque match scores. We introduce JobMatchAI, a production-ready system integrating Transformer embeddings, skill knowledge graphs, and interpretable reranking. Our system optimizes utility across skill fit, experience, location, salary, and company preferences, providing factor-wise explanations through resume-driven search workflows. We release the JobSearch-XS benchmark and a hybrid retrieval stack combining BM25, knowledge-graph, and semantic components to evaluate skill generalization. We assess system performance on JobSearch-XS across retrieval tasks and provide a demo video, a hosted website, and an installable package.

pdf bib abs

Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays
Hoor Tamer Elbahnasawi | Marwan Sayed | Sohaila Eltanbouly | Fatima Zahra Brahamia | Tamer Elsayed

Over the past years, Automated Essay Scoring (AES) systems have gained increasing attention as scalable and consistent solutions for assessing proficiency of student writing. Despite recent progress, support for Arabic AES remains limited due to linguistic complexity and scarcity of large publicly-available annotated datasets. In this work, we present Qayyem, a Web-based platform designed to support Arabic AES by providing an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation. Qayyem abstracts the technical complexity of interacting with scoring server APIs, allowing instructors to access advanced scoring services through a user-friendly interface. The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.

pdf bib abs

Material claims in papers, patents, etc., often involve physical feasibility (e.g., stability under conditions, property consistency), not just textual feasibility. Yet most claim verifiers operate over language, therefore producing ungrounded judgments. On the other hand, direct first-principles verification (e.g., density functional theory, DFT) is inflexible and hard to invoke from underspecified free-form claims. Therefore, we introduce **PhyVer**, a **phy**sics-grounded material claim **ver**ification system that bridges this gap by translating claims into multi-fidelity physical evidence and interpretable verdicts. To support human-in-the-loop inspection, we present an interactive web interface that visualizes the instantiated structure, optimization trajectories, DFT summaries, and the final decision. On expert-labeled claims, **PhyVer** improves agreement with experts over text-only GPT-5.1, reducing MAE from 1.54 to 1.20 and Signed MAE from 0.95 to 0.82, and increasing Accuracy@±1 from 50% to 70%.

pdf bib abs

pAtChWoRK: Patching the Pieces of Public Procurement Documents
Lorena Calvo-Bartolomé | Saúl Blanco Fortes | Erick Cedeño | Jerónimo Arenas-García

Public procurement data is legally open, yet practically locked inside thousands of unstructured PDFs and inconsistent portal metadata. pAtChWoRK starts with these fragmented, unstructured sources, then leverages a hybrid pipeline (traditional NLP with LLM-based technologies) to restructure this information into a navigable knowledge base. Specifically, pAtChWoRK corrects manual classification errors, extracts complex unstructured fields such as award and solvency criteria and tenders’ objectives, and helps users navigate the tender landscape. This unified process enables more effective handling of the transparency bottlenecks that hinder competition and oversight in public administration. A user study with practitioners across different procurement

pdf bib abs

The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources; (2) audio codec, as a key component of audio foundation models, lacks a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess models’ performance on Chinese. We introduce UltraEval-Audio, a unified framework addressing these challenges through a modular architecture supporting 10 languages, 14 task categories, 24 models, and 36 benchmarks with one-command evaluation and real-time leaderboards. For audio codec, we propose a three-dimensional evaluation scheme covering semantic accuracy, timbre fidelity, and acoustic quality. For Chinese evaluation, we introduce two new benchmarks: SpeechCMMLU and SpeechHSK. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval-Audio.

pdf bib abs

Paper2Web: Let’s Make Your Paper Alive!
Yuhang Chen | Tianpeng Lv | Yao Wan | Philip S. Yu | Dongping Chen

Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.

pdf bib abs

OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction
Junxing Hu | Tianlong Li | Lei Yu | Ai Han

Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open-source framework driven by two core novelties: a unified Oxy abstraction and the OxyBank evolution engine. The unified abstraction encapsulates agents, tools, LLMs, and reasoning flows as pluggable atomic components, enabling Lego-like scalable system composition and non-intrusive monitoring. To enhance observability, OxyGent introduces permission-driven dynamic planning that replaces rigid workflows with execution graphs generated at runtime, providing adaptive visualizations. Furthermore, to support continuous evolution, OxyBank serves as an AI asset management platform that drives automated data backflow, annotation, and joint evolution. Empirical evaluations and real-world case studies show that OxyGent provides a robust and scalable foundation for MAS. OxyGent is fully open-sourced under the Apache License 2.0 at https://github.com/jd-opensource/OxyGent.

pdf bib abs

We present EduCoder, an open-source web platform designed for annotating classroom conversation transcripts. Existing annotation tools do not support the team-based workflows or access to instructional context that education discourse research requires. EduCoder addresses these gaps by combining transcript text, synchronized video, and instructional materials within a single workspace. The platform supports scoping annotation to specific portions of a lesson, coordinating work across annotation teams, and optionally integrating LLM-generated annotations with structured human–LLM comparison. EduCoder is freely accessible at https://edu-coder.com.

pdf bib abs

We present **CPTCoder**, a human-in-the-loop system that predicts standardized medical procedure codes from clinical text. Clinical procedure coding is an extreme multi-label classification problem over a long-tailed space of short numeric identifiers, where a single-digit difference denotes an entirely different procedure. CPTCoder adapts an instruction-tuned LLM with a code-aware vocabulary and constrained decoding that guarantees all outputs are valid codes. To support human review, we derive per-code posterior inclusion probabilities from n-best reweighting, producing interpretable confidence scores that rank predictions and flag uncertain cases. A post-decoding constraint repair step enforces mutual-exclusion rules between conflicting codes. To enable reproducible research in this underexplored setting, we release **MIMIC-CPT**, a PhysioNet-accessible benchmark of 37,885 expert-cleaned report–code pairs with a deliberately hardened test split: 88% of test examples contain label combinations unseen during training, and over a third include codes with five or fewer training occurrences. We additionally provide 413,085 weakly aligned pairs and evaluate on a separate live dataset from a hospital, which includes out-of-domain radiology reports with billing-expert-verified labels. CPTCoder achieves 0.61 and 0.51 micro-F1 on the hardened MIMIC split and Hospital-298 respectively, outperforming the strongest baseline by 12 and 5 absolute points while reducing digit-level near-miss errors.
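The per-code confidence scores described above can be illustrated with a minimal sketch. It assumes each n-best hypothesis is a set of codes paired with a log-score, normalizes those scores with a softmax over the list, and sums the probability mass of every hypothesis that contains a given code. The function name, the code strings, and the scores are hypothetical; the abstract does not specify CPTCoder's exact reweighting scheme.

```python
import math
from collections import defaultdict


def inclusion_probabilities(nbest):
    """nbest: list of (set_of_codes, log_score) pairs from an n-best list."""
    # Softmax over hypothesis log-scores (subtract the max for stability).
    max_score = max(score for _, score in nbest)
    weights = [math.exp(score - max_score) for _, score in nbest]
    total = sum(weights)

    # A code's inclusion probability is the normalized mass of all
    # hypotheses in which it appears.
    probs = defaultdict(float)
    for (codes, _), weight in zip(nbest, weights):
        for code in codes:
            probs[code] += weight / total
    return dict(probs)


# Hypothetical 3-best list; the code strings are illustrative only.
nbest = [
    ({"71045", "71250"}, -0.4),
    ({"71045"}, -1.2),
    ({"71046", "71250"}, -2.5),
]
scores = inclusion_probabilities(nbest)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under these assumptions, a code that appears in several high-scoring hypotheses ranks first, while a code found only in a low-scoring hypothesis receives a small probability and could be flagged for human review.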

pdf bib abs

ParseJargon: Personalized Real-time Jargon Support in Online Meetings
Yifan Song | Wing Yee Au | Hon Yung Wong | Brian Bailey | Tal August

Effective interdisciplinary communication is frequently hindered by domain-specific terms. These terms, or jargon, are dependent on a listener’s background, and rarely do listeners seek explanations due to distraction and social concerns. To address these concerns, we built ParseJargon, an interactive LLM-powered system providing real-time personalized jargon support tailored to users’ individual backgrounds in online meetings. We first evaluated the effectiveness of personalization in a controlled setting with human participants. By comparing ParseJargon against baseline (no support) and general-purpose (non-personalized) conditions, we found that ParseJargon provided more precise jargon identification, and enhanced participants’ comprehension, engagement, and appreciation of colleagues’ work. We then evaluated the potential for using ParseJargon in real-time meetings through a latency test.

pdf bib abs

The olmOCR Project: Building Fully Open OCR using VLMs
Jake Poznanski | Kyle Lo | Luca Soldaini

We present olmOCR, a fully open OCR system developed through iterative public releases and community feedback. The system combines a 7B vision-language model trained in two stages: supervised finetuning on 260K diverse PDF pages, followed by reinforcement learning with visual unit tests over synthetic documents. Visual unit tests are binary checks of structural fidelity, including tables and equations, and serve both as an interpretable evaluation framework and as direct optimization targets. We also introduce olmOCR-Bench, a benchmark of 1.4K challenging PDFs evaluated via visual unit tests, on which olmOCR achieves state-of-the-art performance among open systems and proprietary APIs at a fraction of the cost. We have deployed olmOCR at scale to 100M+ PDFs to curate pretraining data for Olmo 3. We share lessons from our open development process and release all models, data, and code across two major releases.

pdf bib abs

Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at https://airaxiv.com.

pdf bib abs

TruthSplit: Revealing Conditional Validity in Arguments Through Multi-Worldview Comparative Reasoning
Benjamin Stieger | Maximilian Terberger | Thomas Huber | Christina Niklaus

We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving perspective-specific background knowledge implicit. TruthSplit addresses this gap by supporting an exploratory analysis of how the same claim can lead to different conclusions when interpreted through worldview-specific values, assumptions, and conceptual definitions. We refer to this perspective-dependent analysis as conditional validity. Given an input argumentative text, TruthSplit extracts claims and premises, applies a three-layer natural language inference (NLI) approach to assess both logical and worldview-specific normative consistency, and conditions large language model (LLM) reasoning on structured worldview profiles that encode core values and decision principles. The system then generates perspective-specific interpretations, identifies value conflicts and assumption gaps, and visualizes divergence through interactive analytical interfaces.

pdf bib abs

The growing demand for Mental Health (MH) services highlights the need for scalable computational tools, yet progress in computational psychology is hindered by scarce sensitive data, complex assessment procedures, and high technical barriers. While language is a well-established marker of different MH conditions, existing NLP solutions are often fragmented, closed-source, or difficult to use, limiting their adoption in interdisciplinary research. We present TONY, an open-source Python TOolkit for NLP in clinical psYchology. TONY bridges traditional psycholinguistic analysis and modern NLP by combining interpretable lexical features with state-of-the-art lightweight transformer models within a unified and easy-to-use framework. This hybrid approach enables robust and transparent text analysis without relying on large-scale models or closed-source software. TONY is designed for researchers and practitioners working at the intersection of NLP and MH, facilitating collaboration across disciplines. Compared to the few existing systems, TONY offers a more comprehensive solution, reducing the barrier to entry through a unified, modular, and reproducible pipeline that integrates classical and neural approaches in a single open framework. The toolkit is released under an open-source license and is evaluated on multiple MH-related datasets, demonstrating its flexibility and effectiveness in low-resource settings.

pdf bib abs

Formal verification can provide strong mathematical guarantees about software correctness, but it typically requires developers to write detailed formal specifications (e.g., contracts and loop invariants), which is costly and error-prone. We introduce AutoSpec+, an LLM-driven neuro-symbolic demonstration system that reframes specification writing as constrained structured synthesis: large language models generate candidate specifications at the granularity of proof-relevant program components, while a symbolic verifier acts as a deterministic critic that checks legality, satisfiability, and proof adequacy, rejecting or repairing candidates in an iterative loop. This design replaces unconstrained text generation, substantially reducing hallucinations and producing proof-ready annotations. We evaluate AutoSpec+ on seven benchmark suites, showing strong effectiveness. We release an open-source, Dockerized system with ensemble LLM backends and inter-modular verification support for reproducible demonstration and deployment.

pdf bib abs

This paper introduces AnnoHID, a semi-automated annotation framework designed for medical texts in low-resource languages. The system integrates large language models (LLMs) for pre-annotation and human validation to support efficient and consistent annotation. We demonstrate its application to Bahasa Indonesia medical social media texts from Alodokter, a medical Q&A platform, for Named Entity Recognition (NER) and Medical Concept Normalization (MCN). We conducted a user study with 21 participants to demonstrate the effectiveness of AnnoHID. The results show that LLM-assisted annotation yields higher inter-annotator agreement for both NER (𝜅 = 0.76) and MCN (𝜅 = 0.63) and that human review improves raw LLM NER output, raising the F1 score from 0.39 to 0.45. However, LLM assistance did not reduce annotation time and may introduce normalization bias in MCN. The framework is multilingual, human-in-the-loop, and interoperable with standard medical terminologies, such as SNOMED-CT. Future work focuses on mitigating pre-annotation bias, reducing annotation overhead, and scaling evaluations to larger datasets and additional low-resource languages.

pdf bib abs

RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions
Keyu He | Qianou Ma | Valerie Chen | Wayne Chi | Tongshuang Wu

Understanding how developers interact with AI coding assistants requires more than chat logs or git histories in isolation; it requires reconstructing the full context: which prompt led to which edit, what the developer tried and discarded, and how their strategy evolved over time. We present RECAP (Replay and Examine Captured AI Programming), an open-source platform that (1) passively records AI chat sessions and fine-grained code edits inside VS Code without disrupting the developer’s workflow, (2) merges them into a unified timeline for interactive session replay, and (3) exposes an extensible analysis layer, with example modules for behavioral classification and AI reliance measurement. Deployed in a university software engineering course, RECAP captured 2,034 prompts and 8,239 code edits from 41 students across a multi-week project. We demonstrate how the platform’s linked data and replay capabilities enable analyses of developer-AI interaction patterns that no single data source could support.

pdf bib abs

A Dynamic Self-Evolving Extraction System
Moin Aminnaseri | Hannah Kim | Estevam Hruschka

The extraction of structured information from raw text is a fundamental component of many NLP applications, including document retrieval, ranking, and relevance estimation. High-quality extractions often require domain-specific accuracy, up-to-date understanding of specialized taxonomies, and the ability to incorporate emerging jargon and rare outliers. In many domains, such as medical, legal, and HR, the extraction model must also adapt to shifting terminology and benefit from explicit reasoning over structured knowledge. We propose DySECT, a Dynamic Self-Evolving Extraction and Curation Toolkit, which continually improves as it is used. The system incrementally populates a versatile, self-expanding knowledge base (KB) with triples extracted by the LLM. The KB further enriches itself through the integration of probabilistic knowledge and graph-based reasoning, gradually accumulating domain concepts and relationships. The enriched KB then feeds back into the LLM extractor via prompt tuning, sampling of relevant few-shot examples, or fine-tuning using KB-derived synthetic data. As a result, the system forms a closed loop in which extraction continuously improves knowledge, and knowledge continuously improves extraction.

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
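The multi-sample generation and verifier-based reranking mentioned above can be sketched as a best-of-N loop. This is a minimal illustration, not ThinkBooster's API: `best_of_n`, `generate`, and `score` are hypothetical names, and the generator and scorer here are deterministic stand-ins for an LLM sampler and a reasoning scorer.

```python
def best_of_n(generate, score, prompt, n):
    """Sample n candidates, score each one, and return the top-scoring one.

    Also returns the number of samples drawn as a crude compute-cost proxy.
    """
    candidates = [generate(prompt, i) for i in range(n)]
    scored = [(score(prompt, candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score, len(candidates)


# Stand-in generator: pretends each sample proposes a different answer.
def toy_generate(prompt, sample_index):
    return str(40 + sample_index)


# Stand-in verifier: rewards answers close to the (known) target 42.
def toy_score(prompt, candidate):
    return -abs(int(candidate) - 42)


answer, answer_score, cost = best_of_n(toy_generate, toy_score, "What is 6 * 7?", n=4)
```

Raising `n` increases the chance that a high-scoring candidate is sampled but also increases cost linearly, which is the kind of quality-cost trade-off the abstract refers to.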

Financial reporting systems increasingly leverage Large Language Models (LLMs) to extract and summarize corporate disclosures. However, most existing approaches assume a single-market setting and overlook structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions introduce substantial challenges for semantic alignment and reliable verification. Here, we aim to bridge this gap. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system constructs a unified canonical ontology spanning the income statement, balance sheet, and cash flow statement, and decomposes reporting into auditable stages, including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than treating LLMs as free-form generators, FinReporting employs them as constrained verifiers operating under explicit decision rules with evidence grounding. Evaluated on annual filings from the USA, Japan, and China, FinReporting improves consistency and reliability under heterogeneous reporting regimes. We further release an interactive demo that enables cross-market inspection and supports structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. A video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk.

pdf bib abs

Expert Calibration Lens for Pruning Mixture of Experts
Luis Frentzen Salim | Chia-Chun Wu | Tran Van Nhiem | Lun-Wei Ku | Yung-Hui Li

Expert pruning is a practical deployment technique for Mixture-of-Experts (MoE) models. It reduces resource usage and mitigates expert redundancy, but its success depends strongly on the calibration set used for pruning. In domain-general settings, it is unclear which properties of the calibration data drive good pruning outcomes, and the effects of calibration perturbations are often unintuitive. We observe, for example, that calibration sets in different languages can lead to very similar pruning results despite appearing dissimilar on the surface. To address this, we propose Expert Calibration Lens, a lightweight analysis tool that compares expert activation patterns across datasets to predict the impact of calibration perturbations without repeatedly running expensive pruning procedures. We use activations that are quick to compute and evaluate the resulting analysis for downstream task performance.

pdf bib abs

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers
Yeqi Huang | Yue Chen | Yanwei Ye | Guanhao Su | Luo Mai

General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that drops this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring paragraphs), reduces layout and OCR errors via chart/table-aware extraction and LLM-based cleansing, and applies a two-stage post-training strategy combining supervised fine-tuning with reinforcement learning. Starting from Qwen3-VL-8B, Ryze produces BioVLM-8B at under $200, achieving 48.0% weighted accuracy on LAB-Bench, outperforming the base model by +12.6% and surpassing GPT-5.2 by +3.8%. We release Ryze as open source together with the trained BioVLM-8B model.

pdf bib abs

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Asaf Yehudai | Lilach Eden | Michal Shmueli-Scheuer

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation or creating a static taxonomy of agent errors. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

pdf bib abs

Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
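The logic-based augmentation step (ii) can be sketched with Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions: the `trials` table, its columns, the seed query, and the helper `keep_nonempty_variants` are all hypothetical, and each variant applies one atomic mutation through a plain string replacement rather than a real SQL parser.

```python
import sqlite3


def keep_nonempty_variants(conn, variants):
    """Return the SQL variants that execute and yield at least one row."""
    kept = []
    for sql in variants:
        try:
            first_row = conn.execute(sql).fetchone()
        except sqlite3.Error:
            continue  # discard variants that fail to execute
        if first_row is not None:
            kept.append(sql)
    return kept


# Hypothetical toy schema standing in for an oncology trial database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (id INTEGER, phase INTEGER, biomarker TEXT)")
conn.executemany(
    "INSERT INTO trials VALUES (?, ?, ?)",
    [(1, 2, "EGFR"), (2, 3, "KRAS")],
)

seed = "SELECT id FROM trials WHERE phase >= 2"
variants = [
    seed.replace(">=", ">"),      # operator change, still matches trial 2
    seed.replace(">=", "<"),      # operator change, matches nothing
    seed.replace("phase", "id"),  # column change, matches trial 2
]
kept = keep_nonempty_variants(conn, variants)
```

Only the variants that return rows survive; in the described system those would then be passed to a second LLM to produce the matching question and decomposition.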

pdf bib abs

Diagram question answering (Diagram QA) requires reasoning-level attribution that links each question-answer pair to all visual regions needed to derive the answer, rather than only the region containing the final response. Creating such structured evidence across diagrams, charts, maps, circuits, and infographics is time-consuming, and existing annotation tools tightly couple their interfaces to dataset-specific formats. We present **DIAGRAMS**, a lightweight, schema-driven review framework that decouples interface logic from dataset-specific JSON structures through an internal meta-schema and dataset adapters. Given an image and QA pair with optional candidate regions, the system performs QA-conditioned evidence selection and proposes the regions required for reasoning. When QA pairs or candidate regions are missing, it generates them and supports human verification and refinement. Across six Diagram QA datasets, model-suggested evidence achieves 85.39% precision and 75.30% recall against reviewer-final selections (micro-averaged). These results indicate that the review-first framework reduces the number of regions that annotators must create from scratch. Human reviewers accept, edit, or reject each proposed region before export, which structurally limits over-reliance on AI proposals. We release a public demo and installable package to support dataset auditing, grounded supervision creation, and grounded evaluation.
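The micro-averaged precision and recall reported above pool region counts over all examples before dividing. A minimal sketch, assuming each example is a pair of region-ID sets (model-suggested and reviewer-final) and that a suggestion counts as correct only on an exact ID match; how edited regions are counted in the actual evaluation is not stated in the abstract, and the function name and data are hypothetical.

```python
def micro_precision_recall(examples):
    """examples: list of (suggested_ids, final_ids) pairs of sets."""
    # Pool counts across all examples, then compute the ratios once.
    true_positives = sum(len(suggested & final) for suggested, final in examples)
    n_suggested = sum(len(suggested) for suggested, _ in examples)
    n_final = sum(len(final) for _, final in examples)
    precision = true_positives / n_suggested if n_suggested else 0.0
    recall = true_positives / n_final if n_final else 0.0
    return precision, recall


# Two hypothetical QA pairs with suggested vs. reviewer-final regions.
examples = [
    ({"r1", "r2", "r3"}, {"r1", "r2", "r4"}),
    ({"r5"}, {"r5", "r6"}),
]
precision, recall = micro_precision_recall(examples)
```

Micro-averaging weights every region equally, so diagrams with many regions influence the result more than sparse ones.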

pdf bib abs

Preparing graduate students for effective professional communication remains a central goal of higher education, yet consistently assessing the quality of presentation slide decks, particularly in fast-growing AI/ML programs, poses significant challenges. We introduce SlideGuard, an evaluation agent that assesses slide decks against a comprehensive framework of expert-defined criteria using a visual language model. The criteria, developed in collaboration with domain experts, span visual design, narrative coherence, and argumentative structure. SlideGuard delivers explicit, interpretable justifications for its scoring decisions, and its content-hash-based caching enables efficient re-evaluation after incremental edits, reducing the time educators spend on slide deck evaluation and accelerating feedback delivery to students. We evaluate the approach on a dataset of 150 annotated slide decks and show that it detects the majority of expert-identified issues, with stronger results on structural and visual criteria and known limitations on subjective dimensions such as research quality. SlideGuard is released under the Apache 2.0 license and is available on GitHub (https://github.com/Industrial-AI-Research-Lab/SlideGuard), including all criterion prompts, configuration files, and evaluation scripts to facilitate replication.
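Content-hash-based caching of the kind mentioned above can be sketched as follows. This is an illustration of the general technique, not SlideGuard's implementation: the class `SlideCache`, the per-slide granularity, and the stand-in evaluator are assumptions.

```python
import hashlib


class SlideCache:
    """Caches evaluation results keyed by a hash of the slide content."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # expensive scorer, e.g. a model call
        self._results = {}
        self.misses = 0

    def score(self, slide_bytes):
        key = hashlib.sha256(slide_bytes).hexdigest()
        if key not in self._results:
            # Only unseen content triggers a fresh evaluation.
            self.misses += 1
            self._results[key] = self._evaluate(slide_bytes)
        return self._results[key]


# Stand-in evaluator; a real one would call a visual language model.
def fake_evaluate(slide_bytes):
    return {"length": len(slide_bytes)}


cache = SlideCache(fake_evaluate)
deck_v1 = [b"title", b"method", b"results"]
deck_v2 = [b"title", b"method (revised)", b"results"]  # one slide edited

for slide in deck_v1:
    cache.score(slide)
for slide in deck_v2:
    cache.score(slide)
```

After the second pass only the edited slide is re-evaluated, so the cost of re-scoring a deck grows with the number of changed slides rather than the deck size.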

pdf bib abs

Spectra: A Mechanistic Interpretability Library for Vision-Language Models
Clement Neo | Yongsen Zheng | Kwok-Yan Lam | Luke Ong

Vision-Language Models (VLMs) have become increasingly important in AI applications, yet interpretability tools for these models lag behind those available for text-only language models. While libraries like TransformerLens have enabled significant progress in understanding language models, existing tools for VLMs are limited to basic activation probing and saving. We introduce Spectra, a library specifically designed for mechanistic interpretability of VLMs that provides unified abstractions for activation patching, attention pattern analysis, and meta-functions across diverse VLM architectures. Built on HuggingFace’s Transformers, our library handles architecture-specific complexities through per-checkpoint configurations while maintaining a simple, high-level interface. We demonstrate the library’s capabilities by performing interpretability experiments on a counting task, showing how researchers can easily perform experiments that were previously cumbersome to do. The library currently supports Qwen2.5-VL, Qwen3-VL, LLaVA 1.5 and SmolVLM, with a design that facilitates extension to additional architectures. The library can be found at github.com/clemneo/vlm-spectra.

pdf bib abs

Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Yinghao Tang | Yupeng Xie | Yingchaojie Feng | Tingfeng Lan | Wei Chen

Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi-agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human-readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. We collect a dataset of 101 real-world interactive documents across 11 domains and conduct a user study showing that ViviDoc produces documents comparable in quality to human-authored ones. Our demo is available at https://vividoc.vercel.app/ and a video demonstration at https://www.youtube.com/watch?v=rJrnPJLyHUI.

pdf bib abs

Praat++: Multimedia Annotation System for Speech and Vocalization
Weiran Zhang | Kenny Q. Zhu

High-quality time-aligned annotation is fundamental to speech processing and animal vocalization research, yet precise boundary localization and consistent labeling remain challenging in collaborative settings. We present Praat++, a web-based multimedia annotation system designed for collaborative, video-informed, and AI-assisted timeline labeling of audio and video data. The system tightly synchronizes waveform, spectrogram, pitch, intensity, and time-aligned video playback with fine-grained region-based editing, enabling precise boundary refinement and improved label accuracy within a unified interface. Praat++ further incorporates role-aware workflow management and human-in-the-loop AI-assisted pre-annotation to enhance inter-annotator consistency and reduce labeling time. Through real-world multimodal speech and animal vocalization annotation scenarios, we demonstrate that Praat++ provides an integrated infrastructure for improving annotation quality and efficiency in dataset construction workflows. The demo video (https://www.youtube.com/watch?v=YboCoBRF5lg), website (https://redgiant.uta.edu/praat) and source code (https://github.com/UTA-ACL2/PraatPlusPlus) are now publicly available.

pdf bib abs

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse
Zhang Zhang | Shuqi Lu | Hongjin Qian | Di He | Zheng Liu

Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self-evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python-capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open-sourced at https://github.com/zzatpku/AgentFactory, and our demonstration video is available at https://youtu.be/iKSsuAXJHW0.

pdf bib abs

OpenGlass: A Sensing-Computing Split Architecture for Local MLLM-Driven Real-Time Visual Assistance
Mengzhang Li | Yuan Yao

We present OpenGlass, an open-source, privacy-oriented, local-first system for low-latency multimodal visual assistance, with a primary focus on blind and low-vision users. Cloud MLLM assistants offer strong visual understanding, but often require uploading first-person visual data and can suffer multi-second network delays; wearable glasses are ideal for sensing, but cannot host large models under tight compute and power budgets. OpenGlass addresses this gap with a sensing-computing split: an ESP32-based glasses-side unit captures visual context, while a nearby consumer-grade device performs local MLLM inference and local speech output, reducing cloud reliance and keeping raw egocentric visual data on user-controlled devices by default. We evaluate response quality, query-ready-to-audio latency, safety-aware abstention, and auditable logs. Under real ESP32 Wi-Fi capture, OpenGlass reaches 993 ms median user-to-audio latency with resized payloads and 1625 ms with raw 1280×720 payloads; 97.5% and 93.3% of trials fall below 2 s, respectively. OpenGlass is a user-initiated visual-assistance reference platform for obstacle/hazard awareness, sign/object queries, and image-quality self-checking, rather than a certified navigation aid. We release source code, hardware instructions, prompts, evaluation data, and logs.

pdf bib abs

TartanMaroon: Multi-Agent Academic Advising with Iterative Negotiation and Transparent Collaboration
Peidi Dong | Houda Bouamor | Yunze Xiao | Devi G Kurup

We present TartanMaroon, a deployable multi-agent academic advising system that handles the full complexity spectrum of student queries, from factual lookups to constrained multi-semester planning. We make three contributions: (1) a proposal–critique negotiation protocol in which a Planning Agent generates degree plans evaluated in parallel by domain-specialized agents, enabling detection of cross-domain constraint violations that single-pass outputs miss; (2) a real-time transparency interface streaming agent reasoning and negotiation rounds to users, supported by pilot feedback showing increased trust over standard LLM chatbots; and (3) TartanBench, a difficulty-stratified benchmark of 220 advising queries across five complexity tiers, released open-source without exposing individual student records. A five-configuration ablation study establishes a complexity–necessity curve: single-agent systems perform competitively on simple queries, while multi-agent coordination yields gains of up to +31 points on planning tasks.

pdf bib abs

GAMED.AI: A Hierarchical Multi-Agent Framework for Automated Educational Game Generation
Shiven Agarwal | Yash Shah | Ashish Raj Shekhar | Priyanuj Bordoloi | Vivek Gupta

We introduce GameDAI, a hierarchical multi-agent framework that transforms instructor-provided questions into fully playable, pedagogically grounded educational games validated through formal mechanic contracts. Built on phase-based LangGraph sub-graphs, deterministic Quality Gates, and structured Pydantic schemas, GameDAI supports two template families encompassing 15 interaction mechanics across spatial reasoning, procedural execution, and higher-order Bloom’s Taxonomy objectives. Evaluated on 200 questions spanning five subject domains, the system achieves a 90% validation pass rate, 98.3% schema compliance, and 73% token reduction over ReAct agents (73,500 → 19,900 tokens/game) at 0.46 per game. Within this model configuration, these results suggest that phase-bounded architectural structure correlates more strongly with alignment quality than prompting strategy alone. Our demonstration lets attendees generate Bloom’s-aligned games from natural language in under 60 seconds, inspect Quality Gate outputs at each pipeline phase, and browse a curated library of 50 games spanning all 15 mechanic types.

pdf bib abs

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery
Hongran An | Zonglin Yang

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory search and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human–AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and intra-stage feedback. Using an oracle-simulated evaluation in which an LLM provides idealized expert signals, we show that injecting these structured signals significantly outperforms purely autonomous baselines, characterizing the gains achievable under high-quality guidance. Furthermore, we build a web-based interface that turns the framework into a no-code workflow: researchers pose a question, watch the hypothesis search unfold as an interactive tree, and steer it by selecting hypotheses, routing between stages, and injecting feedback—no command-line agents required. This makes end-to-end hypothesis discovery directly accessible to interdisciplinary researchers.