Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Danilo Croce, Jochen Leidner, Nafise Sadat Moosavi (Editors)


Anthology ID:
2026.eacl-demo
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venue:
EACL
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-demo/
ISBN:
979-8-89176-382-1
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-demo.pdf

Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction, and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.
This paper introduces DeepPavlov 1.1, a new version of an open-source library for natural language processing (NLP). DeepPavlov 1.1 supports both traditional NLP tasks (such as named entity recognition and sentiment classification) and new tasks needed to enhance the truthfulness and reliability of LLMs. These tools include: a hallucination detection model, an evergreen question classifier, and a toxicity classifier. The library is easy to use, flexible, and works with many languages. It is designed to help researchers and developers build better, safer AI systems that use language. It is publicly available under the Apache 2.0 license and includes access to an interactive online demo.
In this paper, we present PropGenie, a novel multi-agent framework based on large language models (LLMs) to deliver comprehensive real estate assistance in real-world scenarios. PropGenie coordinates eight specialized sub-agents, each tailored for distinct tasks, including search and recommendation, question answering, financial calculations, and task execution. To enhance response accuracy and reliability, the system integrates diverse knowledge sources and advanced computational tools, leveraging structured, unstructured, and multimodal retrieval-augmented generation techniques. Experiments on real user queries show that PropGenie outperforms both a general-purpose LLM (OpenAI’s o3-mini-high) and a domain-specific chatbot (Realty AI’s Madison) in real estate scenarios. We hope that PropGenie serves as a valuable reference for future research in broader AI-driven applications.
In today’s rapidly evolving large language model (LLM) landscape, technology companies such as Cisco face the difficult challenge of selecting the most suitable model for downstream tasks that demand deep, domain-specific product knowledge. Specialized benchmarks not only inform this decision making but also can be leveraged to rapidly create quizzes that can effectively train engineering and marketing personnel on novel product offerings in a continually growing Cisco product space. We present Pro-QuEST, our Prompt-chain based Quiz Engine using state-of-the-art LLMs for generating multiple-choice questions on Specialized Technical products. In Pro-QuEST, we first identify key terms and topics from a given professional certification textbook or product guide, and generate a series of multiple-choice questions using domain-knowledge-guided prompts. We show LLM benchmarking results with the question benchmarks generated by Pro-QuEST using a range of the latest open-source and proprietary LLMs, and compare them with expert-created exams and review questions to derive insights on their composition and difficulty. Our experiments indicate that although there is room for improvement in Pro-QuEST's ability to generate questions of the complexity levels seen in expert-designed certification exams, question-type-based prompts provide a promising direction to address this limitation. In sample user studies with Cisco personnel, Pro-QuEST was received with high optimism for its practical usefulness in quickly compiling quizzes for self-assessment on knowledge of novel products in the rapidly changing tech sector.
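The two-step chain described in the abstract (key-term identification, then term-grounded question generation) can be sketched as follows. The prompt wording and function names are illustrative placeholders, not Pro-QuEST's actual prompts or API:

```python
def key_term_prompt(passage: str) -> str:
    """Step 1: ask a model to list key technical terms in a passage."""
    return (
        "List the 3 most important technical terms in the text below, "
        "one per line.\n\nText:\n" + passage
    )

def mcq_prompt(term: str, passage: str) -> str:
    """Step 2: ask for a multiple-choice question grounded in the passage."""
    return (
        f"Using only the text below, write one multiple-choice question "
        f"about '{term}' with options A-D, and mark the correct answer.\n\n"
        "Text:\n" + passage
    )

def build_quiz_prompts(passage: str, terms: list[str]) -> list[str]:
    """Chain the steps: one question prompt per extracted key term."""
    return [mcq_prompt(t, passage) for t in terms]

# Hypothetical usage: terms would come from an LLM call on key_term_prompt.
prompts = build_quiz_prompts("BGP distributes routing information between "
                             "autonomous systems ...", ["BGP", "routing table"])
```

Domain knowledge enters through the prompt text itself; question-type-specific variants of the step-2 prompt would follow the same pattern.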
A detailed understanding of the basic properties of text collections produced by humans or generated synthetically is vital for all steps of the natural language processing system life cycle, from training to evaluating model performance and synthetic texts. To facilitate the analysis of these properties, we introduce elfen, a Python library for efficient linguistic feature extraction for text datasets. It includes the largest set of item-level linguistic features in eleven feature areas: surface-level, POS, lexical richness, readability, named entity, semantic, information-theoretic, emotion, psycholinguistic, dependency, and morphological features. Building on top of popular NLP and modern dataframe libraries, elfen enables feature extraction in various languages (80 at the moment) on thousands of items, even given limited computing resources. We show how using elfen enables linguistically informed data selection, outlier detection, and text collection comparison. We release elfen as an open-source PyPI package, accompanied by extensive documentation, including tutorials. We host the code at https://github.com/mmmaurer/elfen/, make it available through the GESIS Methods Hub at https://methodshub.gesis.org/library/methods/elfen/, and provide documentation and tutorials at https://elfen.readthedocs.io/en/latest/.
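To illustrate what "item-level surface features" of the kind listed above look like, here is a minimal, self-contained sketch; it does not use elfen's actual API and the feature names are illustrative:

```python
import math
from collections import Counter

def surface_features(text: str) -> dict:
    """Toy item-level surface features for a single text item:
    token count, type-token ratio, mean word length, and
    character-level entropy. Illustrative only -- not elfen's API."""
    tokens = text.lower().split()
    types = set(tokens)
    char_counts = Counter("".join(tokens))
    total = sum(char_counts.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in char_counts.values())
    return {
        "n_tokens": len(tokens),
        "type_token_ratio": len(types) / len(tokens),
        "mean_word_length": sum(map(len, tokens)) / len(tokens),
        "char_entropy": entropy,
    }

feats = surface_features("the cat sat on the mat")
```

A library like elfen computes hundreds of such per-item features in bulk over a dataframe; the value of the toolkit lies in doing this efficiently and consistently across languages.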
Despite growing interest in measuring linguistic diversity on the one hand and the increasing availability of cross-linguistically comparable parsed corpora on the other, tools for systematically measuring the diversity of specific linguistic phenomena on such data remain limited. To address this gap, we present DELTA, an open-source framework that integrates dependency tree querying with diversity computation, enabling systematic measurement across multiple linguistic levels (e.g., lexis, morphology, syntax) and multiple diversity dimensions (variety, balance, disparity). The pipeline processes CoNLL-U formatted corpora through configurable workflows, treating the format as a general-purpose tabular structure independent of specific annotation conventions. We validate DELTA on the Parallel Universal Dependencies multilingual dataset, demonstrating its capacity for corpus profiling and cross-corpus diversity comparison.
Sustainability reports contain rich Environmental, Social and Governance (ESG) information, but their heterogeneous layouts and complex multi-table structures pose major challenges for LLMs, especially for unit normalization, cross-document reasoning, and precise numerical computation. We present CLARIESG, an end-to-end system that couples robust table extraction with a structured prompting framework for multi-table filtering, normalization, and program-of-thought reasoning. On ESG-focused multi-table benchmarks, CLARIESG consistently outperforms standard prompting and provides transparent, auditable reasoning, supporting more reliable ESG analysis and greenwashing detection in real-world settings.
Recent advancements in Large Language Models (LLMs) have showcased their proficiency in answering natural language queries. However, their effectiveness is hindered by limited domain-specific knowledge, raising concerns about the reliability of their responses. We introduce a hybrid system that augments LLMs with domain-specific knowledge graphs (KGs), thereby aiming to enhance factual correctness using a KG-based retrieval approach. We focus on a medical KG to demonstrate our methodology, which includes (1) pre-processing, (2) Cypher query generation, (3) Cypher query processing, (4) KG retrieval, and (5) LLM-enhanced response generation. We evaluate our system on a curated dataset of 69 samples, achieving a precision of 78% in retrieving correct KG nodes. Our findings indicate that the hybrid system surpasses a standalone LLM in accuracy and completeness, as verified by an LLM-as-a-Judge evaluation method. This positions the system as a promising tool for applications that demand factual correctness and completeness, such as target identification — a critical process in pinpointing biological entities for disease treatment or crop enhancement. Moreover, its intuitive search interface and ability to provide accurate responses within seconds make it well-suited for time-sensitive, precision-focused research contexts. We publish the source code together with the dataset and the prompt templates used.
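The Cypher query generation step (2) of such a pipeline can be sketched as templating a graph query from a recognized entity. The node labels, relationship type, and function name below are hypothetical examples, not the schema of the system's medical KG:

```python
def entity_to_cypher(entity: str, relation: str, limit: int = 5) -> str:
    """Template a Cypher query for one recognized entity.
    Labels (Disease, Gene) and the relation name are illustrative
    placeholders, not the actual KG schema."""
    safe = entity.replace("'", "\\'")  # naive quoting for the sketch
    return (
        f"MATCH (d:Disease {{name: '{safe}'}})-[:{relation}]->(g:Gene) "
        f"RETURN g.name LIMIT {limit}"
    )

query = entity_to_cypher("asthma", "ASSOCIATED_WITH")
```

In a production system the query would be produced by an LLM or constrained generator and executed against the KG (step 3), with the retrieved nodes (step 4) passed to the LLM for grounded response generation (step 5). Real deployments should use parameterized queries rather than string interpolation.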
While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques – Attribution Analysis, Function Vector Analysis, and Circuit Tracing – and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations successfully bridged the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user’s prior LLM experience and their comprehension scores, indicating that the system effectively leveled the playing field. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.
LLM-based tutors are typically single-turn assistants that lack persistent representations of learner knowledge, making it difficult to provide principled, transparent, and long-term pedagogical support. We introduce IntelliCode, a multi-agent LLM tutoring system built around a centralized, versioned learner state that integrates mastery estimates, misconceptions, review schedules, and engagement signals. A StateGraph Orchestrator coordinates six specialized agents (skill assessment, learner profiling, graduated hinting, curriculum selection, spaced repetition, and engagement monitoring), each operating as a pure transformation over the shared state under a single-writer policy. This architecture enables auditable mastery updates, proficiency-aware hints, dependency-aware curriculum adaptation, and safety-aligned prompting. The demo showcases an end-to-end tutoring workflow: a learner attempts a DSA problem, receives a conceptual hint when stuck, submits a corrected solution, and immediately sees mastery updates and a personalized review interval. We report validation results with simulated learners, showing stable state updates, improved task success with graduated hints, and diverse curriculum coverage. IntelliCode demonstrates how persistent learner modeling, orchestrated multi-agent reasoning, and principled instructional design can be combined to produce transparent and reliable LLM-driven tutoring.
Membership Inference Attacks (MIAs) aim to determine whether a specific data point was included in the training set of a target model. Although numerous methods have been developed for detecting data contamination in large language models (LLMs), their performance on multimodal LLMs (MLLMs) falls short due to the instabilities introduced through multimodal component adaptation and possible distribution shifts across multiple inputs. In this work, we investigate multimodal membership inference and address two issues: first, by identifying distribution shifts in the existing datasets, and second, by releasing an extended baseline pipeline to detect them. We also generalize perturbation-based membership inference methods to MLLMs and release FiMMIA — a modular Framework for Multimodal MIA. We propose training a neural network to analyze the target model’s behavior on perturbed inputs, capturing interactions between semantic domains and loss values on members and non-members in the local neighborhood of each sample. Comprehensive evaluations on various fine-tuned multimodal models demonstrate the effectiveness of our perturbation-based membership inference attacks in multimodal settings.
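The core idea of perturbation-based membership inference (summarize the target model's losses on perturbed copies of a sample, then classify that feature vector) can be sketched as follows. This is a minimal stand-in, assuming synthetic loss data and a plain logistic-regression classifier in place of the paper's neural network:

```python
import numpy as np

def perturbation_features(losses):
    """Summarize a target model's losses on one original sample
    followed by its perturbed copies: original loss, mean loss shift
    under perturbation, and loss spread."""
    losses = np.asarray(losses, dtype=float)
    orig, rest = losses[0], losses[1:]
    return np.array([orig, rest.mean() - orig, rest.std()])

def train_mia_classifier(X, y, lr=0.5, steps=500):
    """Logistic regression by gradient descent as the membership
    classifier (a stand-in for the neural network in the paper)."""
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = np.clip(X @ w, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(w, X):
    X = np.hstack([X, np.ones((len(X), 1))])
    z = np.clip(X @ w, -30, 30)
    return (1.0 / (1.0 + np.exp(-z)) > 0.5).astype(int)

# Synthetic demo: members keep low, stable loss under perturbation;
# non-members show higher, noisier loss.
rng = np.random.default_rng(0)
members = [perturbation_features(rng.normal(0.5, 0.05, 8)) for _ in range(50)]
nonmembers = [perturbation_features(rng.normal(2.0, 0.3, 8)) for _ in range(50)]
X = np.vstack(members + nonmembers)
y = np.array([1] * 50 + [0] * 50)
acc = (predict(train_mia_classifier(X, y), X) == y).mean()
```

The real framework additionally conditions on semantic domains and uses losses from an actual fine-tuned multimodal model; the sketch only shows the feature-then-classify structure.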
Disinformation and advanced generative AI content pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media. While many NLP models exist for detecting signals like persuasion techniques, subjectivity, and AI-generated text, they often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the Verification Assistant, a browser-based tool designed to bridge this gap. The Verification Assistant, a core component of the widely adopted Verification Plugin (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, presenting actionable credibility signals, AI-generation likelihood, and other verification advice in an easy-to-digest format. This paper will showcase the tool’s architecture, its integration of multiple NLP services, and its real-world application for detecting disinformation.
For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources address narrowly defined tasks and rely on modular approaches or LLMs with instruction tuning and constrained decoding. As those solutions frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions. Gemini 2.5-Pro excels in our evaluation on synthetic and clinical datasets, yet ambiguity and feasibility of collecting ground-truth data remain open problems.
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it into Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline (all released code and models are licensed under the MIT License).
Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning of LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory.
Explaining text similarity and developing interpretable models are emerging research challenges (Opitz et al., 2025). We release XPLAINSIM, a Python package that unifies three complementary approaches for explaining textual similarity in an easily accessible way: (1) a token attribution method that explains how individual word interactions contribute to the predicted similarity of any embedding model; (2) a method for inferring structured neural embedding spaces that capture explainable aspects of text; and (3) a symbolic approach that explains textual similarity transparently through parsed meaning representations. We demonstrate the value of our package through intuitive examples and three focused empirical research studies. The first study evaluates interpretability methods for constructing cross-lingual token alignments. The second investigates how modern information retrieval methods handle stop words. The third sheds more light on a long-standing question in computational linguistics: the distinction between relatedness and similarity. XPLAINSIM is available at https://github.com/flipz357/XPLAINSIM.
High-quality datasets are crucial for training effective state-of-the-art machine translation systems. However, due to the data-intensive nature of these systems, they have to be trained on large amounts of text that can easily go beyond the scope of full human inspection. This makes the presence of noise that can degrade overall system performance a frequent and significant issue. While various approaches have been developed to identify and select only the highest-quality training examples, discarding data in this way is undesirable in scenarios where resources are limited. For this reason, we introduce AlignFix, an open-source tool for augmenting data and identifying and correcting errors in parallel corpora. Leveraging word alignments, AlignFix extracts consistent phrase pairs, enabling targeted replacements that can improve the dataset quality. Besides targeted replacements, the tool enables contextual augmentation by duplicating sentences and allowing users to substitute words with alternatives of their choice. The tool maintains and updates the underlying word alignments, thereby avoiding costly recomputation. AlignFix runs locally in the browser, requires no installation, and ensures that all data remains entirely on the client side. It is released under the Apache 2.0 license, encouraging broad adoption, reuse, and further development. A live demo is available at https://ifi-alignfix.uibk.ac.at.
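Extracting "consistent phrase pairs" from word alignments is the classic criterion from statistical machine translation: a source span and the target span it links to form a pair only if no alignment link crosses the pair's boundary. A minimal sketch (not AlignFix's implementation):

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Extract phrase pairs consistent with a word alignment.
    `alignment` is a list of (source_index, target_index) links.
    Consistency: no link may connect a word inside the pair
    to a word outside it (the standard SMT extraction criterion)."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to the source span [i1, i2]
            tps = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # every link touching the target span must stay inside the source span
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]),
                           " ".join(tgt[j1:j2 + 1])))
    return pairs

pairs = extract_phrase_pairs(
    ["das", "haus"], ["the", "house"], [(0, 0), (1, 1)])
```

Once such pairs are available, a noisy target phrase can be replaced by a trusted alternative while the stored links tell the tool exactly which alignment entries to update, avoiding realignment of the whole corpus.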
PromptLab is a web-based platform for collaborative prompt engineering across diverse natural language processing tasks and datasets. The platform addresses primary challenges in prompt development, including template creation, collaborative review, and quality assurance, through a comprehensive workflow that supports both individual researchers and team-based projects. PromptLab integrates with HuggingFace, provides AI-assisted prompt generation via OpenRouter (https://openrouter.ai/), and supports real-time validation with multiple Large Language Models (LLMs). The platform features a flexible templating system using Jinja2, role-based project management, peer review processes, and supports programmatic access through RESTful APIs. To ensure data privacy and support sensitive research environments, PromptLab includes a straightforward CI/CD pipeline for self-hosted deployments and institutional control. We demonstrate the platform’s effectiveness through two evaluations: a controlled comparison study with six researchers across five benchmark datasets and 13 models with 90 prompts; and a comprehensive case study in instruction tuning research, where over 350 prompts across 80+ datasets have been developed and validated by multiple team members. The platform is available at https://promptlab.up.railway.app and the source code is available on GitHub at https://github.com/KFUPM-JRCAI/PromptLab.
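Jinja2-based prompt templating of the kind mentioned above looks roughly like this; the field names (`instruction`, `examples`, `input`) are illustrative, not PromptLab's actual template schema:

```python
from jinja2 import Template

# A few-shot prompt template in the spirit of PromptLab's Jinja2 templating.
prompt_template = Template(
    "{{ instruction }}\n\n"
    "{% for ex in examples %}Q: {{ ex.q }}\nA: {{ ex.a }}\n\n{% endfor %}"
    "Q: {{ input }}\nA:"
)

prompt = prompt_template.render(
    instruction="Answer the question in one word.",
    examples=[{"q": "Capital of France?", "a": "Paris"}],
    input="Capital of Italy?",
)
```

Because the template is plain Jinja2, the same file can be filled with rows from a HuggingFace dataset, reviewed by teammates, and validated against several LLMs without changing the template itself.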
As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The system supports multiple providers (Google Gemini, DeepSeek, MiniMax, Mistral, Meituan, Meta Llama) and enables researchers and practitioners to compare models on the same prompts while analyzing bias patterns. LLM BiasScope uses a two-stage bias detection pipeline: sentence-level bias detection followed by bias type classification for biased sentences. The analysis runs automatically on both user prompts and model responses, providing statistics, visualizations, and detailed breakdowns of bias types. The interface displays two models side-by-side with synchronized streaming responses, per-model bias summaries, and a comparison view highlighting differences in bias distributions. The system is built on Next.js with React, integrates Hugging Face inference endpoints for bias detection, and uses the Vercel AI SDK for multi-provider LLM access. Features include real-time streaming, export to JSON/PDF, and interactive visualizations (bar charts, radar charts) for bias analysis. LLM BiasScope is available as an open-source web application, providing a practical tool for bias evaluation and comparative analysis of LLM behaviour.
Large-scale scientific research on historical documents — particularly medieval Arabic manuscripts — remains challenging due to the need for advanced paleographic and linguistic training, the large volume of hand-written materials, and the absence of assisting software. In this paper, we propose InkSight, the first end-to-end Arabic manuscript analysis tool for manuscript-based analytics and research hypothesis testing. InkSight integrates three key components: (i) an Optical Character Recognition (OCR) module utilizing a Large Visual Language Model (LVLM); (ii) a lightweight document indexing and information retrieval module that enables query-based evidence retrieval from book-length manuscripts; and (iii) a flexible Large Language Model (LLM) prompting interface factually grounded to the given manuscript via Retrieval-Augmented Generation (RAG). Empirical evaluation on the existing KITAB OCR benchmark and our in-house dataset of ancient Arabic manuscripts has revealed that historical research can be effectively supported using smaller fine-tuned LVLMs without relying on larger proprietary models. The live web demo for InkSight is available freely at: https://inksight.ru and the source code for InkSight is publicly available at Github: https://github.com/ds-hub-sochi/InkSight-tool.
Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers demonstrate its effectiveness, practical adoption is hindered because existing implementations are often tied to unmaintained, isolated research codebases or require invasive integration into application frameworks. To address this, we introduce promptolution, a unified, modular open-source framework that provides all components required for prompt optimization within a single extensible system for both practitioners and researchers. It integrates multiple contemporary discrete prompt optimizers, supports systematic and reproducible benchmarking, and returns framework-agnostic prompt strings, enabling seamless integration into existing LLM pipelines while remaining agnostic to the underlying model implementation.
We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on HuggingFace. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications. Demo: https://t-pro2eagle.streamlit.app/ Models: https://huggingface.co/collections/t-tech/t-pro-20
We present SDialog, an MIT-licensed open-source Python toolkit for end-to-end development, simulation, evaluation, and analysis of LLM-based conversational agents. Built around a standardized Dialog representation, SDialog unifies persona-driven multi-agent simulation with composable orchestration for controlled synthetic dialog generation; multi-layer evaluation combining linguistic metrics, LLM-as-a-judge assessments, and functional correctness validators; mechanistic interpretability tools for activation inspection and causal behavior steering via feature ablation and induction; and audio rendering with full acoustic simulation, including 3D room modeling and microphone effects. The toolkit integrates with major LLM backends under a consistent API, enabling mixed-backend and reproducible experiments. By bridging agent construction, user simulation, dialog generation, evaluation, and interpretability within a single coherent workflow, SDialog enables more controlled, transparent, and systematic research on conversational systems.
In this work, we present a modular and interpretable framework that uses Large Language Models (LLMs) to automate candidate assessment in recruitment. The system integrates diverse sources—including job descriptions, CVs, interview transcripts, and HR feedback—to generate structured evaluation reports that mirror expert judgment. Unlike traditional ATS tools that rely on keyword matching or shallow scoring, our approach employs role-specific, LLM-generated rubrics and a multi-agent architecture to perform fine-grained, criteria-driven evaluations. The framework outputs detailed assessment reports, candidate comparisons, and ranked recommendations that are transparent, auditable, and suitable for real-world hiring workflows. Beyond rubric-based analysis, we introduce an LLM-Driven Active Listwise Tournament mechanism for candidate ranking. Instead of noisy pairwise comparisons or inconsistent independent scoring, the LLM ranks small candidate subsets (“mini-tournaments”), and these listwise permutations are aggregated using a Plackett–Luce model. An active-learning loop selects the most informative subsets, producing globally coherent and sample-efficient rankings. This adaptation of listwise LLM preference modeling—previously explored in financial asset ranking—provides a principled and highly interpretable methodology for large-scale candidate ranking in talent acquisition.
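The Plackett–Luce aggregation step described above can be sketched as maximum-likelihood estimation of per-candidate scores from a set of best-first rankings. This is a minimal gradient-ascent version under synthetic rankings, not the paper's actual estimator:

```python
import numpy as np

def fit_plackett_luce(rankings, n_items, lr=0.1, steps=300):
    """Estimate Plackett-Luce scores theta from listwise rankings
    (each ranking lists item indices best-first) via gradient ascent
    on the log-likelihood sum_t [theta[r_t] - logsumexp(theta[rest])]."""
    theta = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for r in rankings:
            for t in range(len(r) - 1):
                rest = r[t:]  # items still in contention at stage t
                p = np.exp(theta[rest] - theta[rest].max())
                p /= p.sum()          # softmax over remaining items
                grad[r[t]] += 1.0     # chosen item
                grad[rest] -= p       # expected-choice term
        theta += lr * grad / len(rankings)
        theta -= theta.mean()  # PL scores are shift-invariant; fix the gauge
    return theta

# Three mini-tournament rankings over four candidates, all preferring 0 first.
rankings = [np.array([0, 1, 2, 3]),
            np.array([0, 1, 3, 2]),
            np.array([0, 2, 1, 3])]
theta = fit_plackett_luce(rankings, 4)
```

The fitted scores induce a single global ranking from the mini-tournament observations; an active-learning loop would then pick the next subset whose ranking is expected to reduce uncertainty in `theta` the most.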
We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove’s data management features reduce memory consumption by a factor of 2.6. Moreover, Trove’s easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
Large language models (LLMs) have expanded the potential for AI-assisted scientific claim verification, yet existing systems often exhibit unverifiable attributions, shallow evidence mapping, and hallucinated citations. We present SciTrue, a claim verification system providing source-level accountability and evidence traceability. SciTrue links each claim component to explicit, verifiable scientific sources, enabling users to inspect and challenge model inferences, addressing limitations of both general-purpose and search-augmented LLMs. In a human evaluation of 300 attributions, SciTrue achieves high fidelity in summary traceability, attribution accuracy, and context alignment, substantially outperforming RAG-based baselines such as GPT-4o-search-preview and Perplexity Sonar Pro. These results underscore the importance of principled attribution and context-aware reasoning in AI-assisted scientific verification. A system demo is available at .
Linguistic features remain essential for interpretability and tasks that involve style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source and comprehensive toolkit for linguistic analysis of Spanish texts. PUCP-Metrix includes 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. It enables fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive and extensible resource for Spanish, supporting diverse NLP applications.
Multi-Modal Large Language Models (MLLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model’s decoding process, making them ineffective when students query proprietary black-box systems using instructor-provided documents. We present INTEGRITYSHIELD, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 question papers spanning STEM, humanities, and medical reasoning, INTEGRITYSHIELD achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance and authorship evidence.
The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.
In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM Python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion package) that can be used by the research community.
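To illustrate the non-autoregressive decoding family the abstract refers to, here is a toy confidence-based iterative refinement loop (a minimal sketch under assumed names; the scorer is a deterministic stand-in for a trained model, and this is not the XLM package's API):

```python
MASK = "<mask>"

def parallel_decode(length, scorer, steps=None):
    """Start from an all-masked sequence and, each round, commit the
    single most confident prediction among the remaining masks."""
    seq = [MASK] * length
    steps = steps if steps is not None else length
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # scorer returns a (token, confidence) pair for a masked index
        proposals = {i: scorer(seq, i) for i in masked}
        best = max(masked, key=lambda i: proposals[i][1])
        seq[best] = proposals[best][0]
    return seq

def toy_scorer(seq, i):
    # Hypothetical stand-in for a model: most confident near the edges.
    vocab = ["a", "b", "c", "d"]
    confidence = 1.0 - min(i, len(seq) - 1 - i) * 0.1
    return vocab[i % len(vocab)], confidence
```

The point of a shared library is that the loop, the collation of masked positions, and the prediction logic above are exactly the pieces each bespoke implementation currently rewrites.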
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors. It provides software for demonstration and evaluation, as well as for model inspection and data visualization. This tool is aimed at education stakeholders as well as the *ACL community at large, as it supports learning and can also be used to collect user feedback and annotations.
Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open-ended generation tasks, leading to growing reliance on LLM-based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain-specific evaluation suites for LLMs. EvalSense provides out-of-the-box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods for their specific use-cases. This is achieved through two unique components: (1) an interactive guide aiding users in evaluation method selection and (2) automated meta-evaluation tools that assess the reliability of different evaluation approaches using perturbed data. We demonstrate the effectiveness of EvalSense in a case study involving the generation of clinical notes from unstructured doctor-patient dialogues, using a popular open dataset. All code, documentation, and assets associated with EvalSense are open-source and publicly available at https://github.com/nhsengland/evalsense.
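The meta-evaluation idea described above, testing a judge's reliability with perturbed data, can be sketched in miniature (a hypothetical illustration, not EvalSense's API; the overlap judge is a toy stand-in for an LLM judge):

```python
def overlap_judge(answer, reference):
    """Toy stand-in for an LLM judge: Jaccard token overlap."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    if not a or not r:
        return 0.0
    return len(a & r) / len(a | r)

def meta_evaluate(judge, pairs, perturb):
    """Probe a judge's sensitivity: a reliable judge should score a
    clean answer above a deliberately degraded version of it."""
    wins = 0
    for answer, reference in pairs:
        clean = judge(answer, reference)
        noisy = judge(perturb(answer), reference)
        wins += clean > noisy
    return wins / len(pairs)

def drop_half(answer):
    # Perturbation: delete the second half of the answer's tokens.
    toks = answer.split()
    return " ".join(toks[: len(toks) // 2])
```

A judge whose scores do not drop under such perturbations is a misconfiguration signal, which is the kind of check automated meta-evaluation tooling can surface before deployment.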
Tracking financial investments in climate adaptation is complex and expertise-intensive, particularly for Early Warning Systems (EWS), where multilateral development bank (MDB) and fund reports lack standardized financial reporting and appear as heterogeneous PDFs with complex tables and inconsistent layouts. We introduce an agent-based Retrieval-Augmented Generation (RAG) system that uses hybrid retrieval and internal chain-of-thought (CoT) reasoning to extract relevant financial data, classify EWS investments, and allocate budgets with grounding evidence spans. While these components are individually established, our contribution is their integration into a domain-specific workflow tailored to heterogeneous MDB reports and numerically grounded EWS budget allocation. On a manually annotated CREWS Fund corpus, our system outperforms four alternatives (zero-shot classifier, few-shot “zero rule” classifier, fine-tuned transformer-based classifier, and few-shot CoT+ICL classifier) on multi-label classification and budget allocation, achieving 87% accuracy, 89% precision, and 83% recall. We further benchmark against the Gemini 2.5 Flash AI Assistant on an expert-annotated MDB evidence set co-curated with the World Meteorological Organization (WMO), enabling a comparative analysis of glass-box agents versus black-box assistants in transparency and performance. The system is publicly deployed and accessible at https://ews-front.vercel.app/ (see Appendix A for demonstration details and Appendix B for dataset statistics and splits). We will open-source all code, LLM generations, and human annotations to support further work on AI-assisted climate finance.
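Hybrid retrieval of the kind mentioned above typically fuses a sparse lexical signal with a denser similarity signal. A minimal sketch, with toy scorers standing in for BM25 and embedding similarity (all names are illustrative assumptions, not the deployed system's code):

```python
from collections import Counter

def lexical_score(query, doc):
    """Sparse signal: term-frequency overlap (a BM25 stand-in)."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def char_ngram_score(query, doc, n=3):
    """Toy 'dense' signal: character trigram Jaccard similarity."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    g1, g2 = grams(query.lower()), grams(doc.lower())
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

def hybrid_retrieve(query, docs, alpha=0.5, k=2):
    """Fuse both signals with a mixing weight and return the top-k."""
    scored = [(alpha * lexical_score(query, d)
               + (1 - alpha) * char_ngram_score(query, d), d) for d in docs]
    return [d for _, d in sorted(scored, key=lambda x: -x[0])[:k]]
```

The fusion weight `alpha` is exactly the kind of domain-specific knob that heterogeneous financial PDFs force one to tune.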
Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.
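Claim-level faithfulness, one of the diagnostic axes mentioned above, can be illustrated with a toy grounding check (a hedged sketch under simplifying assumptions: sentences as claims, token overlap as support; not RAGVUE's actual metric):

```python
def split_claims(answer):
    """Treat each sentence as one atomic claim (a simplification)."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def supported(claim, passages, threshold=0.6):
    """A claim counts as grounded if enough of its tokens appear
    in a single retrieved passage."""
    c = set(claim.lower().split())
    return any(len(c & set(p.lower().split())) / len(c) >= threshold
               for p in passages)

def faithfulness(answer, passages):
    """Fraction of the answer's claims grounded in retrieved context."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    return sum(supported(c, passages) for c in claims) / len(claims)
```

Scoring per claim rather than per answer is what lets a diagnostic tool say which sentence is unsupported, instead of collapsing everything into one number.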
We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation (>40 million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers. We also find that answers can be obtained for a fraction of the compute cost by changing the presentation method. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.
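One concrete form of the prompt perturbations mentioned above is shuffling answer-option order to probe position bias. A minimal sketch (illustrative only; function names and the rendering format are assumptions, not QSTN's API):

```python
import random

def render_question(question, options, order=None):
    """Render a multiple-choice item as a lettered prompt."""
    order = order if order is not None else list(range(len(options)))
    lines = [question]
    for letter, idx in zip("ABCDE", order):
        lines.append(f"{letter}) {options[idx]}")
    return "\n".join(lines)

def perturbed_orders(n_options, n_variants, seed=0):
    """Sample seeded option-order shuffles for reproducible variants."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_variants):
        order = list(range(n_options))
        rng.shuffle(order)
        orders.append(order)
    return orders
```

Seeding the shuffles keeps each perturbed questionnaire variant reproducible across runs, which matters when comparing response generation methods at scale.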
Architectural design relies on 3D modeling procedures, generally carried out in Building Information Modeling (BIM) formats. In this setting, architects and designers collaborate on building designs, iterating over many possible versions until a final design is agreed upon. However, this iteration is complicated by the fact that any changes need to be made by manually editing the complex BIM files, which lengthens the design process and makes it difficult to quickly prototype changes. To speed up prototyping, we propose VR-Arch, a virtual assistant that allows users to interact with the BIM file in a virtual reality (VR) environment. This framework enables users to 1) make changes directly in the VR environment, 2) make complex queries about the BIM, and 3) combine these to perform more complex actions. All of this is done via voice commands and processed through a ReAct-based agentic system that selects appropriate tools depending on the query context. This multi-tool approach enables real-time, contextualized interaction through natural language, allowing for a faster and more natural prototyping experience.
We present a new deliberation interface that enables users to engage with multiple large language models (LLMs), coordinated by a moderator agent that assigns roles, manages turn-taking, and ensures structured interaction. Grounded in argumentation theory, the system fosters critical thinking through user–LLM dialogues, real-time summaries of agreements and open questions, and argument maps. Rather than treating LLMs as mere answer providers, our tool positions them as reasoning partners, supporting epistemically responsible human–AI collaboration. It exemplifies hybrid argumentation and aligns with recent calls for “reasonable parrots,” where LLM agents interact with users guided by argumentative principles such as relevance, responsibility, and freedom. A user study shows that participants found the tool easy to use, perspective-enhancing, and promising for research, while suggesting areas for improvement. We make the deliberation interface accessible for testing and provide a recorded demonstration.
This paper presents a new open-source web application for simultaneous speech-to-text translation. The system translates live Estonian speech into English, Russian, and Ukrainian text, and also supports English-to-Estonian translation. Our solution uses a cascaded architecture that combines streaming speech recognition with a recently proposed LLM-based simultaneous translation model. The LLM treats translation as a conversation, processing input in small five-word chunks. The system achieves a word error rate of 10.2% for streaming speech recognition and a BLEU score of 26.1 for Estonian-to-English translation, significantly outperforming existing streaming solutions. The application is designed for real-world use, featuring a latency of only 3-6 seconds. It is available at https://est2eng.vercel.app.
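The five-word chunking strategy described above can be sketched as a simple streaming loop (a hedged illustration with a stub translator; this is not the demo's actual code, and real systems must also handle chunk-boundary context):

```python
def stream_chunks(words, chunk_size=5):
    """Group an incoming word stream into fixed-size chunks, flushing
    whatever remains when the stream ends."""
    buffer, chunks = [], []
    for w in words:
        buffer.append(w)
        if len(buffer) == chunk_size:
            chunks.append(" ".join(buffer))
            buffer = []
    if buffer:
        chunks.append(" ".join(buffer))
    return chunks

def translate_stream(words, translate, chunk_size=5):
    """Feed each chunk to the translator as one conversational turn."""
    return [translate(c) for c in stream_chunks(words, chunk_size)]
```

The chunk size is the main latency knob: smaller chunks lower delay but give the translation model less context per turn, which is the trade-off behind the reported 3-6 second latency.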
Many research areas rely on data from the web to gain insights and test their methods. However, collecting comprehensive research datasets often demands manually reviewing many web pages to identify and record relevant data points, which is labor-intensive and susceptible to error. While the emergence of large language model (LLM)-powered web agents has begun to automate parts of this process, they often struggle to ensure the validity of the data they collect. Indeed, these agents exhibit several recurring failure modes, including hallucinating or omitting values, misinterpreting page semantics, and failing to detect invalid information, which are subtle and difficult to detect and correct manually. To address this, we introduce the AI Committee, a novel model-agnostic multi-agent system that automates the process of validating and remediating web-sourced datasets. Each agent is specialized in a distinct task in the data quality assurance pipeline, from source scrutiny and fact-checking to data remediation and integrity validation. The AI Committee leverages various LLM capabilities, including in-context learning for dataset adaptation, chain-of-thought reasoning for complex semantic validation, and a self-correction loop for data remediation, all without task-specific training. We demonstrate the effectiveness of our system by applying it to three real-world datasets, showing that it generalizes across LLMs and significantly outperforms baseline approaches, achieving data completeness up to 73.3% and precision up to 97.3%. We additionally conduct an ablation study demonstrating the contribution of each agent to the Committee's performance. This work is released as an open-source tool for the research community.
Entity linking (EL) aims to disambiguate named entities in text by mapping them to the appropriate entities in a knowledge base. However, some EL methods are difficult to use, as they suffer from reproducibility issues due to limited maintenance or the lack of official resources. To address this, we introduce , a unified library for using and developing entity linking systems through a single interface. Our library flexibly integrates various candidate retrievers and re-ranking models, making it easy to compare and use entity linking methods within a unified framework. In addition, it is designed with a strong emphasis on API usability, making it highly extensible, and it supports both command-line tools and APIs. Our code is available on GitHub and is also distributed via PyPI under the MIT license. The video is available on YouTube.
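The retrieve-then-rerank pipeline such a library abstracts over can be sketched in miniature (a toy illustration with an alias-lookup retriever and overlap reranker; all names and the KB schema are hypothetical, not the library's API):

```python
def retrieve_candidates(mention, kb):
    """Candidate retrieval: alias match against the knowledge base."""
    m = mention.lower()
    return [e for e in kb if m in (a.lower() for a in e["aliases"])]

def rerank(candidates, context):
    """Re-rank candidates by token overlap between the mention's
    surrounding context and each entity's description."""
    ctx = set(context.lower().split())
    return sorted(candidates,
                  key=lambda e: -len(ctx & set(e["description"].lower().split())))

def link(mention, context, kb):
    """Full pipeline: retrieve candidates, rerank, return the best id."""
    ranked = rerank(retrieve_candidates(mention, kb), context)
    return ranked[0]["id"] if ranked else None
```

Keeping the retriever and reranker as separately swappable components is exactly what makes systematic comparison of EL methods possible behind one interface.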
Our system is built upon a multi-modal information extraction pipeline designed to process and interpret corporate sustainability reports. This integrated framework systematically handles diverse data formats, including text, tables, figures, and infographics, to extract, structure, and evaluate ESG-related content. The extracted multi-modal data is subsequently formalized into a structured knowledge graph (KG), which serves both as a semantic framework for representing entities, relationships, and metrics relevant to ESG domains, and as the foundational infrastructure for the automated compliance system. This KG enables high-precision retrieval of information across multiple source formats and reporting modalities. The trustworthy, context-rich representations provided by the knowledge graph establish a verifiable evidence base, creating a critical foundation for reliable retrieval-augmented generation (RAG) and for subsequent LLM-based scoring and analysis in the automated ESG compliance system.
Bangla is one of the world’s most widely spoken languages, yet it remains significantly under-resourced in natural language processing (NLP). Existing efforts have focused on isolated tasks such as Part-of-Speech (POS) tagging and Named Entity Recognition (NER), but comprehensive, integrated systems for core NLP tasks including Shallow Parsing and Dependency Parsing are largely absent. To address this gap, we present BanSuite, a unified Bangla NLP ecosystem developed under the EBLICT project. BanSuite combines a large-scale, manually annotated Bangla Treebank with high-quality pretrained models for POS tagging, NER, shallow parsing, and dependency parsing, achieving strong in-domain baseline performance (POS: 90.16 F1, NER: 90.11 F1, SP: 86.92 F1, DP: 90.27 UAS). The system is accessible through a Python toolkit (Bkit) and a Web Application, providing both researchers and non-technical users with robust NLP functionalities, including tokenization, normalization, lemmatization, and syntactic parsing. In benchmarking against existing Bangla NLP tools and multilingual Large Language Models (LLMs), BanSuite demonstrates superior task performance while maintaining high efficiency in resource usage. By offering the first comprehensive, open, and integrated NLP platform for Bangla, BanSuite lays a scalable foundation for research, application development, and further advancement of low-resource language technologies. A demonstration video illustrating the system’s functionality is available at https://youtu.be/3pcfiUQfCoA.