Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

Hao Fei, Kewei Tu, Yuhui Zhang, Xiang Hu, Wenjuan Han, Zixia Jia, Zilong Zheng, Yixin Cao, Meishan Zhang, Wei Lu, N. Siddharth, Lilja Øvrelid, Nianwen Xue, Yue Zhang (Editors)


Anthology ID:
2025.xllm-1
Month:
August
Year:
2025
Address:
Vienna, Austria
Venues:
XLLM | WS
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.xllm-1/
ISBN:
979-8-89176-286-2
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.xllm-1.pdf

pdf bib
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)
Hao Fei | Kewei Tu | Yuhui Zhang | Xiang Hu | Wenjuan Han | Zixia Jia | Zilong Zheng | Yixin Cao | Meishan Zhang | Wei Lu | N. Siddharth | Lilja Øvrelid | Nianwen Xue | Yue Zhang

pdf bib
Fine-Tuning Large Language Models for Relation Extraction within a Retrieval-Augmented Generation Framework
Sefika Efeoglu | Adrian Paschke

Information Extraction (IE) plays a pivotal role in transforming unstructured data into structured formats, such as Knowledge Graphs. One of the main tasks within IE is Relation Extraction (RE), which identifies relations between entities in text data. This process enriches the semantic understanding of documents, enabling more precise information retrieval and query answering. Recent works leveraging pre-trained language models have demonstrated significant performance improvements in RE. In the current era of Large Language Models (LLMs), fine-tuning these LLMs can mitigate the limitations of zero-shot RE methods, particularly in overcoming the domain adaptation challenges inherent in RE. This work explores not only the effectiveness of fine-tuned LLMs but also their integration into a Retrieval-Augmented Generation (RAG)-based RE approach to address domain adaptation challenges when general-purpose LLMs serve as generators within the RAG framework. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), and Re-TACRED datasets reveal substantial performance improvements with fine-tuned LLMs such as Llama2-7B, Mistral-7B, and Flan-T5 Large, surpassing previous methods on these datasets.
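
A minimal sketch of how a fine-tuned LLM might serve as the generator inside a RAG-style RE pipeline of the kind the abstract describes. The retriever choice, checkpoint name, demonstration store, and prompt template are all illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: a fine-tuned LLM as the generator in a RAG-based RE pipeline.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

retriever = SentenceTransformer("all-MiniLM-L6-v2")
tok = AutoTokenizer.from_pretrained("my-finetuned-llama2-7b-re")   # hypothetical checkpoint
model = AutoModelForCausalLM.from_pretrained("my-finetuned-llama2-7b-re")

# Labeled demonstrations retrieved at inference time (toy examples).
demo_store = [
    "Text: Bill Gates founded Microsoft. Head: Bill Gates Tail: Microsoft Relation: org:founded_by",
    "Text: Ada Lovelace was born in London. Head: Ada Lovelace Tail: London Relation: per:city_of_birth",
]

def extract_relation(sentence, head, tail, k=1):
    # Retrieve the k demonstrations most similar to the input sentence.
    sims = util.cos_sim(retriever.encode(sentence), retriever.encode(demo_store))[0]
    demos = [demo_store[i] for i in sims.topk(k).indices]
    prompt = ("\n".join(demos)
              + f"\nText: {sentence} Head: {head} Tail: {tail} Relation:")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=8)
    # Decode only the newly generated tokens, i.e. the predicted relation label.
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()
```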

pdf bib
Benchmarking Table Extraction: Multimodal LLMs vs Traditional OCR
Guilherme Nunes | Vitor Rolla | Duarte Pereira | Vasco Alves | Andre Carreiro | Márcia Baptista

This paper compares two approaches for table extraction from images: deep learning computer vision and Multimodal Large Language Models (MLLMs). Computer vision models for table extraction, such as the Table Transformer model (TATR), have enhanced the extraction of complex table structural layouts by leveraging deep learning for precise structural recognition combined with traditional Optical Character Recognition (OCR). Conversely, MLLMs, which process both text and image inputs, present a novel approach by potentially bypassing the limitations of TATR-plus-OCR methods altogether. Models such as GPT-4o, Phi-3 Vision, and Granite Vision 3.2 demonstrate the potential of MLLMs to analyze and interpret table images directly, offering enhanced accuracy and robust extraction capabilities. These methodologies were evaluated with the state-of-the-art Grid Table Similarity (GriTS) metric, providing nuanced insight into both structural and text-content effectiveness. Utilizing the PubTables-1M dataset, a comprehensive and widely used benchmark in the field, this study highlights the strengths and limitations of each approach, setting the stage for future innovations in table extraction technologies. Deep learning computer vision techniques still hold a slight edge in extracting table structural layout, but MLLMs are far better at recovering text cell content.
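
To make the flavor of grid-based scoring concrete, here is a deliberately simplified cell-level F1 between a predicted and a gold table. This is not the actual GriTS computation (GriTS finds an optimal alignment of two-dimensional substructures); it is only a toy illustration of scoring tables as grids of cells.

```python
# Toy grid similarity: exact-match F1 over aligned cells of two tables,
# each given as a 2D list of cell strings. NOT the real GriTS metric.
def cell_f1(pred, gold):
    matches = sum(
        p == g
        for pred_row, gold_row in zip(pred, gold)
        for p, g in zip(pred_row, gold_row)
    )
    n_pred = sum(len(row) for row in pred)
    n_gold = sum(len(row) for row in gold)
    precision = matches / n_pred if n_pred else 0.0
    recall = matches / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(cell_f1([["Name", "Age"], ["Ada", "36"]],
              [["Name", "Age"], ["Ada", "37"]]))  # 0.75: one wrong cell out of four
```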

pdf bib
Injecting Structured Knowledge into LLMs via Graph Neural Networks
Zichao Li | Zong Ke | Puning Zhao

Large language models (LLMs) have achieved remarkable success in natural language processing (NLP), but they often struggle to capture explicit linguistic structures and world knowledge. To address this limitation, we propose a hybrid model that integrates LLMs with graph neural networks (GNNs) to inject structured knowledge into NLP tasks. Our approach leverages the strengths of both components: LLMs provide rich contextual representations, while GNNs encode explicit structural priors from sources such as dependency trees, Abstract Meaning Representations (AMRs), and knowledge graphs. We evaluate the hybrid model on a diverse set of tasks, including semantic parsing, multi-hop question answering, text summarization, commonsense reasoning, and dependency parsing. Experimental results demonstrate consistent improvements over both standalone baselines and state-of-the-art methods, with relative gains of up to 2.3% in Exact Match scores for multi-hop QA and 1.7% in accuracy for commonsense reasoning. Ablation studies and sensitivity analyses further highlight the importance of balancing contextual and structural information. By bridging the gap between unstructured textual data and structured knowledge, our work advances the state of the art in NLP and paves the way for more interpretable and robust language models.
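
A minimal sketch of one way such a hybrid could be wired: LLM token states are refined by a single round of message passing over dependency edges and then gated back into the contextual representation. The dimensions, single GNN layer, and gating fusion are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch: fuse LLM token embeddings with structure from dependency edges.
import torch
import torch.nn as nn

class GraphFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.msg = nn.Linear(dim, dim)        # transforms neighbor messages
        self.gate = nn.Linear(2 * dim, dim)   # decides the context/structure mix

    def forward(self, h, edges):
        # h: (n_tokens, dim) LLM hidden states; edges: (head, dependent) pairs.
        agg = torch.zeros_like(h)
        deg = torch.zeros(h.size(0), 1)
        for head, dep in edges:               # one round of message passing
            agg[dep] += self.msg(h[head])
            deg[dep] += 1
        agg = agg / deg.clamp(min=1)          # mean over incoming messages
        g = torch.sigmoid(self.gate(torch.cat([h, agg], dim=-1)))
        return g * h + (1 - g) * agg          # gated mix of context and structure

h = torch.randn(5, 768)                       # hidden states for a 5-token sentence
fused = GraphFusion()(h, [(1, 0), (1, 2), (1, 4), (4, 3)])
```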

pdf bib
Regular-pattern-sensitive CRFs for Distant Label Interactions
Sean Papay | Roman Klinger | Sebastian Padó

While LLMs have grown popular in sequence labeling, linear-chain conditional random fields (CRFs) remain a popular alternative with the ability to directly model interactions between labels. However, the Markov assumption limits them to interactions between adjacent labels. Weighted finite-state transducers (FSTs), in contrast, can model distant label–label interactions, but exact label inference is intractable in general. In this work, we present regular-pattern-sensitive CRFs (RPCRFs), a method of enriching standard linear-chain CRFs with the ability to learn long-distance label interactions through user-specified patterns. This approach allows users to write regular-expression label patterns concisely specifying which types of interactions the model should take into account, allowing the model to learn from data whether and in which contexts these patterns occur. The result can be interpreted alternatively as a CRF augmented with additional, non-local potentials, or as a finite-state transducer whose structure is defined by a set of easily-interpretable patterns. Critically, exact training and inference are tractable for many pattern sets. We detail how an RPCRF can be automatically constructed from a set of user-specified patterns, and demonstrate the model's effectiveness on a sequence of three synthetic sequence modeling datasets.

pdf bib
From Syntax to Semantics: Evaluating the Impact of Linguistic Structures on LLM-Based Information Extraction
Anushka Swarup | Avanti Bhandarkar | Ronald Wilson | Tianyu Pan | Damon Woodard

Large Language Models (LLMs) have brought significant breakthroughs across all areas of Natural Language Processing (NLP), including Information Extraction (IE). However, knowledge gaps remain regarding their effectiveness in extracting entity-relation triplets, i.e. Joint Relation Extraction (JRE). JRE has been a key operation in creating knowledge bases that can be used to enhance Retrieval Augmented Generation (RAG) systems. Prior work highlights low-quality triplets generated by LLMs. Thus, this work investigates the impact of incorporating linguistic structures, such as constituency and dependency trees and semantic role labeling, to enhance the quality of the extracted triplets. The findings suggest that incorporating specific structural information enhances the uniqueness and topical relevance of the triplets, particularly in scenarios where multiple relationships are present.

pdf bib
Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
Bram Willemsen | Gabriel Skantze

In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.

pdf bib
Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis
Daoyang Li | Haiyan Zhao | Qingcheng Zeng | Mengnan Du

Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's other languages. In this paper, we extend these probing methods to a multilingual context, investigating how LLMs encode linguistic structures across diverse languages. We conduct experiments on several open-source LLMs, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results provide insights into how linguistic structures are represented differently across languages in LLMs and emphasize the need for improved structure modeling for low-resource languages.
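
A compact sketch of the standard probing recipe this line of work builds on: extract hidden states from a chosen layer and fit a linear classifier on them, then compare accuracy across layers (and, here, across languages). The model name, sentence-level setup, and mean pooling are illustrative assumptions.

```python
# Hedged sketch of layer-wise probing on a multilingual encoder.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

def layer_accuracy(sentences, labels, layer):
    feats = []
    for s in sentences:
        out = model(**tok(s, return_tensors="pt"))
        # Mean-pool token vectors of the chosen layer as the sentence feature.
        feats.append(out.hidden_states[layer][0].mean(0).detach().numpy())
    X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, random_state=0)
    # A linear probe: if it separates the classes, the layer encodes the property.
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
```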

pdf bib
Self-Contrastive Loop of Thought Method for Text-to-SQL Based on Large Language Model
Fengrui Kang

Text-to-SQL is a promising yet challenging task that aims to convert natural language (NL) queries into corresponding structured query language (SQL) statements. The main challenge of this task is how to efficiently map between unstructured and structured data. In recent years, the emergence of large language models (LLMs) has further promoted the development of this field. However, current LLM-based text-to-SQL methods rely on specific few-shot example construction, resulting in poor performance across domains. To solve this problem, we propose a text-to-SQL method with a self-contrastive loop-of-thought structure. This method designs the LLM inference process as a loop structure based on the comparison of positive and negative examples. The model optimizes the generated results through continuous verification and error correction, greatly improving accuracy and reducing dependence on few-shot example construction. Experimental results on the SPIDER and BIRD datasets show that this method can generate SQL with higher precision without relying on few-shot example construction.
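
A schematic sketch of such a verify-and-correct loop, assuming a generic `llm` callable and SQLite executability as the verification signal; the prompt wording, negative-example feedback, and stopping rule are illustrative, not the paper's exact design.

```python
# Hedged sketch: generate SQL, verify it, and feed failures back as negatives.
import sqlite3

def loop_of_thought_sql(llm, question, schema, db_path, max_rounds=3):
    negatives = []
    for _ in range(max_rounds):
        prompt = (f"Schema: {schema}\nQuestion: {question}\n"
                  + "".join(f"Incorrect attempt: {n}\n" for n in negatives)
                  + "Write a corrected SQL query:")
        sql = llm(prompt)
        try:
            sqlite3.connect(db_path).execute(sql)   # executability check
            return sql                              # accept the verified query
        except sqlite3.Error:
            negatives.append(sql)                   # keep as a negative example
    return sql                                      # best effort after max_rounds
```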

pdf bib
Combining Automated and Manual Data for Effective Downstream Fine-Tuning of Transformers for Low-Resource Language Applications
Ulyana Isaeva | Danil Astafurov | Nikita Martynov

This paper addresses the constraints of downstream applications of pre-trained language models (PLMs) for low-resource languages. These constraints are a deficiency of pre-training data, which prevents a low-resource language from being well represented in a PLM, and the inaccessibility of high-quality task-specific data annotation, which limits task learning. We propose to use automatically labeled texts combined with manually annotated data in a two-stage task fine-tuning approach. The experiments revealed that this methodology, combined with vocabulary adaptation, may compensate for the absence of a targeted PLM or the deficiency of manually annotated data. The methodology is validated on the morphological tagging task for the Udmurt language. We publish our best model, which achieved 93.25% token accuracy, on HuggingFace Hub along with the training code.

pdf bib
Seamlessly Integrating Tree-Based Positional Embeddings into Transformer Models for Source Code Representation
Patryk Bartkowiak | Filip Graliński

Transformer-based models have demonstrated significant success in various source code representation tasks. Nonetheless, traditional positional embeddings employed by these models inadequately capture the hierarchical structure intrinsic to source code, typically represented as Abstract Syntax Trees (ASTs). To address this, we propose a novel tree-based positional embedding approach that explicitly encodes hierarchical relationships derived from ASTs, including node depth and sibling indices. These hierarchical embeddings are integrated into the transformer architecture, specifically enhancing the CodeBERTa model. We thoroughly evaluate our proposed model through masked language modeling (MLM) pretraining and clone detection fine-tuning tasks. Experimental results indicate that our Tree-Enhanced CodeBERTa consistently surpasses the baseline model in terms of loss, accuracy, F1 score, precision, and recall, emphasizing the importance of incorporating explicit structural information into transformer-based representations of source code.
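
A minimal sketch of the idea: alongside the token embedding, each position receives embeddings for its AST node depth and sibling index, and the three are summed. Vocabulary size, caps, and dimensions are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of tree-based positional embeddings for source code tokens.
import torch
import torch.nn as nn

class TreePositionalEmbedding(nn.Module):
    def __init__(self, vocab=50_000, dim=768, max_depth=64, max_siblings=128):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.depth = nn.Embedding(max_depth, dim)      # AST node depth
        self.sibling = nn.Embedding(max_siblings, dim) # index among siblings

    def forward(self, token_ids, depths, sibling_idx):
        # All inputs: (batch, seq) integer tensors derived from the AST.
        return self.tok(token_ids) + self.depth(depths) + self.sibling(sibling_idx)

emb = TreePositionalEmbedding()
# Two tokens: root's child at depth 0/1, both first among their siblings.
x = emb(torch.tensor([[5, 9]]), torch.tensor([[0, 1]]), torch.tensor([[0, 0]]))
```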

pdf bib
Enhancing AMR Parsing with Group Relative Policy Optimization
Botond Barta | Endre Hamerlik | Milán Nyist | Masato Ito | Judit Acs

We investigate the capabilities of the openly available Llama 3.2 1B language model for Abstract Meaning Representation (AMR) parsing through supervised fine-tuning, further enhanced by reinforcement learning via Group Relative Policy Optimization (GRPO). Existing supervised methods for AMR parsing face limitations due to static loss functions and challenges in capturing complex semantic phenomena. To address this, our GRPO-based approach explicitly optimizes fine-grained semantic rewards, including Smatch scores, frame-argument correctness, and structural validity of logical operations. Experimental results show that supervised fine-tuning alone establishes Llama as a capable English AMR parser, and subsequent GRPO fine-tuning further improves its performance. Our final model achieves higher Smatch scores, consistently respects critical low-level semantic constraints, and outperforms existing parsers on high-level semantic evaluation metrics across diverse linguistic phenomena.
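
The core of GRPO is a group-relative advantage: several candidate parses are sampled per sentence, each is scored with a semantic reward such as Smatch, and rewards are normalized within the group. A minimal sketch with stubbed rewards follows; the group size and reward values are illustrative.

```python
# Hedged sketch of GRPO's group-relative advantage computation.
import torch

def group_relative_advantages(rewards):
    # rewards: (group_size,) per-sample rewards for one prompt's candidates.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# E.g. Smatch-style rewards for 4 sampled AMR parses of the same sentence.
adv = group_relative_advantages(torch.tensor([0.81, 0.64, 0.90, 0.64]))
# Positive advantages up-weight the log-probabilities of better parses in the
# policy-gradient update; negative advantages down-weight worse ones.
```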

pdf bib
Structure Modeling Approach for UD Parsing of Historical Modern Japanese
Hiroaki Ozaki | Mai Omura | Kanako Komiya | Masayuki Asahara | Toshinobu Ogiso

This study shows the effectiveness of structure modeling for transfer ability in diachronic syntactic parsing. Syntactic parsing for historical languages is significant from a humanities and quantitative linguistics perspective, enabling annotation support and analysis of unannotated documents. We compared the zero-shot transfer ability of Transformer-based biaffine UD parsers with our structure modeling approach. The structure modeling approach is a pipeline consisting of dictionary-based morphological analysis (MeCab), deep-learning-based phrase (bunsetsu) analysis (Monaka), SVM-based phrase dependency parsing (CaboCha), and a rule-based conversion from phrase dependencies to UD. This pipeline closely follows the methodology used in constructing Japanese UD corpora. Experimental results showed that the structure modeling approach outperformed zero-shot transfer from contemporary to modern Japanese. Moreover, the structure modeling approach outperformed several existing UD parsers on contemporary Japanese. Overall, the structure modeling approach outperformed by a wide margin in the diachronic transfer of Japanese and is useful for applications in digital humanities and quantitative linguistics.

pdf bib
BARTABSA++: Revisiting BARTABSA with Decoder LLMs
Jan Pfister | Tom Völker | Anton Vlasjuk | Andreas Hotho

We revisit the BARTABSA framework for aspect-based sentiment analysis with modern decoder LLMs to assess the importance of explicit structure modeling today. Our updated implementation, BARTABSA++, features architectural enhancements that boost performance and training stability. Systematic testing with various encoder-decoder architectures shows that BARTABSA++ with BART-Large achieves state-of-the-art results, even surpassing a finetuned GPT-4o model. Our analysis indicates the encoder's representational quality is vital, while the decoder's role is minimal, explaining the limited benefits of scaling decoder-only LLMs for this task. These findings underscore the complementary roles of explicit structured modeling and large language models, indicating structured approaches remain competitive for tasks requiring precise relational information extraction.

pdf bib
Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation
DongGeon Lee | Ahjeong Park | Hyeri Lee | Hyeonseo Nam | Yunho Maeng

Non-factoid question answering (NFQA) poses a significant challenge due to its open-ended nature, diverse intents, and the necessity for multi-aspect reasoning, rendering conventional retrieval-augmented generation (RAG) approaches insufficient. To address this, we introduce Typed-RAG, a type-aware framework utilizing multi-aspect query decomposition tailored specifically for NFQA. Typed-RAG categorizes NFQs into distinct types—such as debate, experience, and comparison—and decomposes them into single-aspect sub-queries for targeted retrieval and generation. By synthesizing the retrieved results of these sub-queries, Typed-RAG generates more informative and contextually relevant responses. Additionally, we present Wiki-NFQA, a novel benchmark dataset encompassing diverse NFQ types. Experimental evaluation demonstrates that Typed-RAG consistently outperforms baseline approaches, confirming the effectiveness of type-aware decomposition in improving both retrieval quality and answer generation for NFQA tasks.
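
A schematic sketch of the classify-decompose-retrieve-synthesize flow the abstract describes, assuming generic `llm` and `retrieve` callables; the type inventory and prompt wording are illustrative, not the authors' templates.

```python
# Hedged sketch of a Typed-RAG-style pipeline for non-factoid questions.
def typed_rag(llm, retrieve, question):
    # 1. Classify the non-factoid question type.
    qtype = llm(f"Classify this question as debate/experience/comparison: {question}")
    # 2. Decompose into single-aspect sub-queries, one per line.
    subqueries = llm(
        f"Decompose this {qtype} question into single-aspect sub-queries, "
        f"one per line: {question}"
    ).splitlines()
    # 3. Retrieve evidence for each aspect independently.
    evidence = [f"{q}\n{retrieve(q)}" for q in subqueries if q.strip()]
    # 4. Synthesize the per-aspect results into one answer.
    return llm("Answer the question using the evidence gathered per aspect.\n"
               f"Question: {question}\nEvidence:\n" + "\n\n".join(evidence))
```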

pdf bib
Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction
Nils Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff

Aspect sentiment quad prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores almost up to par with those obtained with state-of-the-art fine-tuned models and exceeding previously reported zero- and few-shot performance. In the 20-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 51.54, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were close to fine-tuned models, achieving 68.93 on Rest16 in the 30-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.

pdf bib
Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs
Ankush Raut | Xiaofeng Zhu | Maria Pacheco

This paper evaluates the ability of Large Language Models (LLMs) to leverage contextual information in the form of structured linguistic representations. Specifically, we examine the impact of encoding both short and long contexts using Abstract Meaning Representation (AMR) structures across a diverse set of language tasks. We perform our analysis using 8-bit quantized and instruction-tuned versions of Llama 3.1 (8B), Phi-3, and Mistral 7B. Our results indicate that, for tasks involving short contexts, augmenting the prompt with the AMR of the original language context often degrades the performance of the underlying LLM. However, for tasks that involve long contexts, such as dialogue summarization in the SAMSum dataset, this enhancement improves LLM performance, for example, by increasing the zero-shot cosine similarity score of Llama 3.1 from 66% to 76%. This improvement is more evident in the newer and larger LLMs, but does not extend to the older or smaller ones. In addition, we observe that LLMs can effectively reconstruct the original text from a linearized AMR, achieving a cosine similarity of 81% in the best-case scenario.

pdf bib
LLM Dependency Parsing with In-Context Rules
Michael Ginn | Alexis Palmer

We study whether incorporating rules (in various formats) can aid large language models to perform dependency parsing. We consider a paradigm in which LLMs first produce symbolic rules given fully labeled examples, and the rules are then provided in a subsequent call that performs the actual parsing. In addition, we experiment with providing human-created annotation guidelines in-context to the LLMs. We test on eight low-resource languages from Universal Dependencies, finding that while both methods for rule incorporation improve zero-shot performance, the benefit disappears with a few labeled in-context examples.
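
A minimal sketch of the two-call paradigm just described: one call induces symbolic rules from fully labeled examples, and a second call parses with those rules in context. The `llm` callable and prompt wording are illustrative assumptions.

```python
# Hedged sketch: rule induction followed by rule-guided dependency parsing.
def parse_with_induced_rules(llm, labeled_examples, sentence):
    # Call 1: induce symbolic rules from fully labeled parses.
    rules = llm(
        "From these dependency-annotated sentences, write general "
        "head-attachment rules for this language:\n" + "\n".join(labeled_examples)
    )
    # Call 2: parse a new sentence with the induced rules in context.
    return llm(
        "Using these rules, produce a dependency parse "
        f"(head index and relation per token).\nRules:\n{rules}\n"
        f"Sentence: {sentence}"
    )
```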

pdf bib
Cognitive Mirroring for DocRE: A Self-Supervised Iterative Reflection Framework with Triplet-Centric Explicit and Implicit Feedback
Xu Han | Bo Wang | Yueheng Sun | Dongming Zhao | Zongfeng Qu | Ruifang He | Yuexian Hou | Qinghua Hu

Large language models (LLMs) have advanced document-level relation extraction (DocRE), but DocRE is more complex than sentence-level relation extraction (SentRE), facing challenges like diverse relation types, coreference resolution, and long-distance dependencies. Traditional pipeline methods, which detect relations before generating triplets, often propagate errors and harm performance. Meanwhile, fine-tuning methods require extensive human-annotated data, and in-context learning (ICL) underperforms compared to supervised approaches. We propose an iterative reflection framework for DocRE, inspired by human non-linear reading cognition. The framework leverages explicit and implicit relations between triplets to provide feedback for LLM refinement. Explicit feedback uses reasoning based on logical rules, while implicit feedback reconstructs triplets into documents for comparison. This dual-process iteration mimics human semantic cognition, enabling dynamic optimization through self-generated supervision. For the first time, this achieves zero-shot performance comparable to fully supervised models. Experiments show our method surpasses existing LLM-based approaches and matches state-of-the-art BERT-based methods.

pdf bib
Cross-Document Event-Keyed Summarization
William Walden | Pavlo Kuchmiichuk | Alexander Martin | Chihsheng Jin | Angela Cao | Claire Sun | Curisia Allen | Aaron White

Event-keyed summarization (EKS) requires summarizing a specific event described in a document given the document text and an event representation extracted from it. In this work, we extend EKS to the cross-document setting (CDEKS), in which summaries must synthesize information from accounts of the same event as given by multiple sources. We introduce **SEAMuS** (**S**ummaries of **E**vents **A**cross **Mu**ltiple **S**ources), a high-quality dataset for CDEKS based on an expert reannotation of the FAMuS dataset for cross-document argument extraction. We present a suite of baselines on SEAMuS—covering both smaller, fine-tuned models, as well as zero- and few-shot prompted LLMs—along with detailed ablations and a human evaluation study, showing SEAMuS to be a valuable benchmark for this new task.

pdf bib
Transfer of Structural Knowledge from Synthetic Languages
Mikhail Budnikov | Ivan Yamshchikov

This work explores transfer learning from several synthetic languages to English. We investigate the structure of the embeddings in the fine-tuned models, the information they contain, and the capabilities of the fine-tuned models on simple linguistic tasks. We also introduce a new synthetic language that leads to better transfer to English than the languages used in previous research. Finally, we introduce Tiny-Cloze Benchmark — a new synthetic benchmark for natural language understanding that is more informative for less powerful models. We use Tiny-Cloze Benchmark to evaluate fine-tuned models in several domains demonstrating that fine-tuning on a new synthetic language allows for better performance on a variety of tasks.

pdf bib
Language Models are Universal Embedders
Xin Zhang | Zehan Li | Yanzhao Zhang | Dingkun Long | Pengjun Xie | Meishan Zhang | Min Zhang

In the large language model (LLM) revolution, embedding is a key component of various systems, such as retrieving knowledge or memories for LLMs or building content moderation filters. As such cases span from English to other natural or programming languages, from retrieval to classification and beyond, it is advantageous to build a unified embedding model rather than dedicated ones for each scenario. In this context, the pre-trained multilingual decoder-only large language models, e.g., BLOOM, emerge as a viable backbone option. To assess their potential, we propose straightforward strategies for constructing embedders and introduce a universal evaluation benchmark. Experimental results show that our trained model is proficient at generating good embeddings across languages and tasks, even extending to languages and tasks for which no finetuning/pretraining data is available. We also present detailed analyses and additional evaluations. We hope that this work could encourage the development of more robust open-source universal embedders.
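
A minimal sketch of turning a decoder-only LM into an embedder by pooling its final hidden states; the BLOOM checkpoint and mean pooling are illustrative choices (last-token pooling is a common alternative), not the paper's trained model.

```python
# Hedged sketch: a decoder-only LM as a text embedder via hidden-state pooling.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModel.from_pretrained("bigscience/bloom-560m")

@torch.no_grad()
def embed(text):
    batch = tok(text, return_tensors="pt")
    h = model(**batch).last_hidden_state[0]  # (seq_len, hidden)
    v = h.mean(0)                            # mean-pool over tokens
    return (v / v.norm()).numpy()            # L2-normalized embedding

# Cosine similarity across languages/modalities (here: code vs. English).
sim = embed("def add(a, b): return a + b") @ embed("function that sums two numbers")
```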

pdf bib
DiaDP@XLLM25: Advancing Chinese Dialogue Parsing via Unified Pretrained Language Models and Biaffine Dependency Scoring
Shuoqiu Duan | Xiaoliang Chen | Duoqian Miao | Xu Gu | Xianyong Li | Yajun Du

Dialogue-level dependency parsing is crucial for understanding complex linguistic structures in conversational data, yet progress has been hindered by limited annotated resources and inadequate modeling of dialogue dynamics. Existing methods often fail to capture both intra- and inter-utterance dependencies effectively, particularly in languages like Chinese with rich contextual interactions. To address these challenges, we propose InterParser, a novel framework that integrates a pretrained language model (PLM), bidirectional GRU (BiGRU), and biaffine scoring for comprehensive dependency parsing. Our model encodes token sequences using a PLM, refines representations via deep BiGRU layers, and employs separate projections for “head” and “dependent” roles to optimize arc and relation prediction. For cross-utterance dependencies, speaker-specific feature projections are introduced to enhance dialogue-aware scoring. Joint training minimizes cross-entropy losses for both intra- and inter-utterance dependencies, ensuring unified optimization. Experiments on a standard Chinese benchmark demonstrate that InterParser significantly outperforms prior methods, achieving state-of-the-art labeled attachment scores (LAS) for both intra- and inter-utterance parsing.
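
A minimal sketch of the biaffine arc scorer at the core of this design: token states are projected into separate "head" and "dependent" spaces, and every head-dependent pair is scored. Dimensions are illustrative; the full model additionally uses PLM encoding, BiGRU refinement, speaker-specific projections, and relation scoring.

```python
# Hedged sketch of a biaffine arc scorer for dependency parsing.
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, dim=768, arc_dim=256):
        super().__init__()
        self.head = nn.Linear(dim, arc_dim)   # "head" role projection
        self.dep = nn.Linear(dim, arc_dim)    # "dependent" role projection
        self.W = nn.Parameter(torch.randn(arc_dim, arc_dim) / arc_dim**0.5)
        self.bias = nn.Linear(arc_dim, 1)     # head-only bias term

    def forward(self, h):
        # h: (seq, dim) encoder states; returns (seq, seq) arc scores,
        # where scores[i, j] rates token j as the head of token i.
        head, dep = self.head(h), self.dep(h)
        return dep @ self.W @ head.T + self.bias(head).T

scores = BiaffineArcScorer()(torch.randn(6, 768))  # arc scores for a 6-token turn
```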

pdf bib
LLMSR@XLLM25: Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation
Jiahao Yuan | Xingzhe Sun | Xing Yu | Jingwen Wang | Dehui Du | Zhiqing Cui | Zixiang Di

The LLMSR@XLLM25 shared task formulates a low-resource structural reasoning problem that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in LLMSR@XLLM25, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage reward-guided filtering to distill high-quality supervision across three subtasks: question parsing, CoT parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few-shot and zero-shot prompts, our pipeline consistently improves structure reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low-resource constraints. Our code is available at https://github.com/JhCircle/Less-is-More.

pdf bib
SpeechEE@XLLM25: End-to-End Structured Event Extraction from Speech
Soham Chaudhuri | Diganta Biswas | Dipanjan Saha | Dipankar Das | Sivaji Bandyopadhyay

Event extraction from text is a complex task that involves the identification of event triggers and their supporting arguments. When applied to speech, this task becomes even more challenging due to the continuous nature of audio signals and the need for robust Automatic Speech Recognition (ASR). This paper proposes an approach that integrates ASR with event extraction by utilizing the Whisper model for speech recognition and a Text2Event2 Transformer for extracting events from English audio samples. The Whisper model is used to generate transcripts from audio, which are then fed into the Text2Event2 Transformer to identify event triggers and their arguments. This approach combines two difficult tasks into one, streamlining the process of extracting structured event information directly from audio. Our approach leverages a robust ASR system (Whisper) followed by a parameter-efficient transformer (Text2Event2 fine-tuned via LoRA) to extract structured events from raw speech. Unlike prior work trained on gold textual input, our pipeline is trained end-to-end on noisy ASR outputs. Despite significant resource constraints and data noise, our system ranked first in the ACL 2025 XLLM Shared Task II.
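
A schematic sketch of such a two-stage pipeline: Whisper produces a transcript, and a seq2seq model generates a linearized event structure from it. The event-model checkpoint name is hypothetical; the Whisper calls follow the openai-whisper package.

```python
# Hedged sketch: ASR (Whisper) followed by seq2seq event extraction.
import whisper
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

asr = whisper.load_model("base")
tok = AutoTokenizer.from_pretrained("my-text2event2-lora")          # hypothetical
event_model = AutoModelForSeq2SeqLM.from_pretrained("my-text2event2-lora")

def speech_to_events(wav_path):
    transcript = asr.transcribe(wav_path)["text"]        # noisy ASR output
    ids = tok(transcript, return_tensors="pt").input_ids
    out = event_model.generate(ids, max_new_tokens=128)
    # Returns a linearized event structure, e.g. "(Attack trigger: ... agent: ...)".
    return tok.decode(out[0], skip_special_tokens=True)
```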

pdf bib
DocIE@XLLM25: ZeroSemble - Robust and Efficient Zero-Shot Document Information Extraction with Heterogeneous Large Language Model Ensembles
Nguyen Le | An Thien | Son Luu | Kiet Nguyen

The schematization of knowledge, including the extraction of entities and relations from documents, poses significant challenges to traditional approaches because of document ambiguity, heterogeneity, and the high cost of domain-specific training. Although Large Language Models (LLMs) allow for extraction without prior training on the dataset, the requirement of fine-tuning, along with low precision, especially in relation extraction, serves as an obstacle. In the absence of domain-specific training, we present a new zero-shot ensemble approach using DeepSeek-R1-Distill-Llama-70B, Llama-3.3-70B, and Qwen-2.5-32B. Our key innovation is a two-stage pipeline that first consolidates high-confidence entities through ensemble techniques, then leverages Qwen-2.5-32B with engineered prompts to generate precise semantic triples. This approach effectively resolves the low precision problem typically encountered in relation extraction. Experiments demonstrate significant gains in both accuracy and efficiency across diverse domains, with our method ranking in the top 2 on the official leaderboard in Shared Task-IV of The 1st Joint Workshop on Large Language Models and Structure Modeling. This competitive performance validates our approach as a compelling solution for practitioners seeking robust document-level information extraction without the burden of task-specific fine-tuning. Our code can be found at https://github.com/dinhthienan33/ZeroSemble.
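
A toy sketch of the first stage, consolidating entities by cross-model agreement: keep an entity when at least two of the three models propose it. Representing each model's output as a set of (mention, type) pairs is an assumption for illustration.

```python
# Hedged sketch: majority-vote consolidation of entities across LLM outputs.
from collections import Counter

def consolidate(entity_sets, min_votes=2):
    votes = Counter(e for s in entity_sets for e in set(s))
    return {e for e, v in votes.items() if v >= min_votes}

high_conf = consolidate([
    {("Marie Curie", "PER"), ("Warsaw", "LOC")},    # e.g. DeepSeek-R1-Distill output
    {("Marie Curie", "PER"), ("Sorbonne", "ORG")},  # e.g. Llama-3.3-70B output
    {("Marie Curie", "PER"), ("Warsaw", "LOC")},    # e.g. Qwen-2.5-32B output
])  # -> {("Marie Curie", "PER"), ("Warsaw", "LOC")}
```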

pdf bib
DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations
Nicholas Popovic | Ashish Kangen | Tim Schopf | Michael Färber

Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach, we produce a synthetic dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. The code and synthetic dataset are made available for future research.

pdf bib
LLMSR@XLLM25: Integrating Reasoning Prompt Strategies with Structural Prompt Formats for Enhanced Logical Inference
Le Tai | Thin Van

This paper illustrates our NBTailee team system approach in XLLM-ACL 2025 Task-III: LLM for Structural Reasoning (LLM-SR), aiming to solve both tasks: question parsing and CoT parsing. The process of extracting statements and evidence is similar to discourse parsing. Correct extraction of statements or evidence from the CoT is crucial at the outset. Next, the pairwise relationship between a specific statement and its corresponding evidence is assessed (a statement should be followed by its related evidence from the CoT). Both semantic and lexical similarity are used to evaluate the accuracy of statement and evidence predictions. Finally, once a statement-evidence pair is correctly extracted, it is evaluated to determine whether the evidence can logically deduce the statement. To tackle question parsing and CoT parsing, we implement and investigate various solutions, including (1) applying different structural prompt formats such as JSON, Markdown, or XML; (2) utilising various prompt techniques: few-shot, chain of thought, and multi-hop prompting; and (3) taking advantage of a Natural Language Inference (NLI) model for the statement verification step. Our best official result is a 243.047 mean score for test phases A and B, and we rank 7th on the final leaderboard.

pdf bib
DocIE@XLLM25: UIEPrompter: A Unified Training-Free Framework for universal document-level information extraction via Structured Prompt
Chengfeng Qiu | Lifeng Zhou | Kaifeng Wei | Yuke Li

We introduce UIEPrompter, a unified, training-free framework that secures 1st place in the ACL 2025 shared competition on universal document-level information extraction. UIEPrompter effectively addresses both named entity recognition and relation extraction without the need for annotated data. Leveraging large language models, UIEPrompter establishes a zero-shot baseline through role-specific prompts, which are then refined via few-shot guidance and constrained output-generation prompts to align with competition schemas. Additionally, by integrating outputs from several large language models, we reduce individual model biases, thereby improving overall performance. Evaluated on the competition evaluation dataset, UIEPrompter showcases outstanding performance in document-level information extraction, ultimately securing first place. The implementation code is available on GitHub.

pdf bib
LLMSR@XLLM25: SWRV: Empowering Self-Verification of Small Language Models through Step-wise Reasoning and Verification
Danchun Chen

Large language models (LLMs) have shown impressive reasoning capabilities through Chain-of-Thought (CoT). However, the reasoning processes remain inexplicable and uncontrollable. In this paper, we tackle the task hosted by (CITATION) by introducing a Step-Wise Reasoning and Verification (SWRV) framework, a two-stage Parser–Verifier pipeline that decomposes the generated reasoning process into discrete inference steps and rigorously validates each one. First, our Parser extracts problem constraints and the sequence of reasoning steps from the LLM's reasoning process. Then, our Verifier prompts itself or leverages a deterministic symbolic solver to formally check the logical correctness of every step. To ensure robust parsing, we also fine-tune a compact LM on a small, high-quality annotation set produced by a more powerful LLM. Experiments on the dataset (CITATION) demonstrate significant gains over baseline approaches, illustrating the effectiveness of our method for step-wise analysis of LLM chain-of-thought reasoning. The code is publicly available at https://github.com/Teganone/XLLM_LLMSR.

pdf bib
LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning
Xinye Li | Mingqi Wan | Dianbo Sui

We present Team asdfo123's submission to the XLLM@ACL 2025 LLM-SR shared task, which evaluates large language models on producing fine-grained, controllable, and interpretable reasoning processes. Systems must extract all problem conditions, decompose a chain of thought into statement–evidence pairs, and verify the logical validity of each pair. Leveraging only the off-the-shelf Meta-Llama-3-8B-Instruct, we craft a concise few-shot, multi-turn prompt that first enumerates all conditions and then guides the model to label, cite, and adjudicate every reasoning step. A lightweight post-processor based on regular expressions normalises spans and enforces the official JSON schema. Without fine-tuning, external retrieval, or ensembling, our method ranks 5th overall, achieving macro-F1 scores on par with substantially more complex and resource-consuming pipelines. We conclude by analysing the strengths and limitations of our approach and outlining directions for future research in structural reasoning with LLMs. Our code is available at https://github.com/asdfo123/LLMSR-asdfo123.
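
A toy sketch of what such a regex post-processor might look like: it pulls statement-evidence pairs out of free-form model output and emits JSON. The "Statement:/Evidence:" tag format and the output keys are illustrative assumptions, not the shared task's official schema.

```python
# Hedged sketch: regex normalization of model output into schema-shaped JSON.
import json
import re

# Capture alternating "Statement: ... Evidence: ..." spans (format assumed).
PAIR = re.compile(r"Statement:\s*(.+?)\s*Evidence:\s*(.+?)(?=Statement:|$)", re.S)

def to_schema(raw_output):
    pairs = [{"statement": s.strip(), "evidence": e.strip()}
             for s, e in PAIR.findall(raw_output)]
    return json.dumps({"cot_parsing": pairs}, ensure_ascii=False)

print(to_schema("Statement: A is north of B Evidence: clue 2 says A is north"))
```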

pdf bib
LLMSR@XLLM25: A Language Model-Based Pipeline for Structured Reasoning Data Construction
Hongrui Xing | Xinzhang Liu | Zhuo Jiang | Zhihao Yang | Yitong Yao | Zihan Wang | Wenmin Deng | Chao Wang | Shuangyong Song | Wang Yang | Zhongjiang He | Yongxiang Li

In this paper, we present a novel pipeline for the XLLM Shared Task-III: Large Language Model for Structural Reasoning (LLM-SR). Our pipeline addresses key challenges in automatic process-reward training data construction, such as high manual annotation costs, limited accuracy of large models in structured data processing, and dependency on auxiliary information for validation. To overcome these limitations, we first decompose the construction process into extraction and validation phases. Leveraging model-generated annotations, we produce pseudo-labeled data and iteratively refine model performance. Second, by analyzing structured data patterns, we encode structural constraints into a rule-based module and fine-tune the model with Group Relative Policy Optimization (GRPO), significantly improving structured data extraction success rates. Finally, we train the model to generate critical responses that assess evidence-conclusion relationships, thus enhancing validation reliability. Experimental results demonstrate that our pipeline outperforms models with an order of magnitude more parameters and achieves first place on the task.

pdf bib
SpeechEE@XLLM25: Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction
Máté Gedeon

Speech Event Extraction (SpeechEE) is a challenging task that lies at the intersection of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP), requiring the identification of structured event information from spoken language. In this work, we present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). Our system first classifies speech segments likely to contain events using a hybrid filtering mechanism including rule-based, BERT-based, and LLM-based models. It then employs few-shot LLM prompting, dynamically enriched via semantic similarity retrieval, to identify event triggers and extract corresponding arguments. We evaluate the pipeline using multiple LLMs (Llama3-8B, GPT-4o-mini, and o1-mini), highlighting significant performance gains with o1-mini, which achieves 63.3% F1 on trigger classification and 27.8% F1 on argument classification, outperforming prior benchmarks. Our results demonstrate that pipeline approaches, when empowered by retrieval-augmented LLMs, can rival or exceed end-to-end systems while maintaining interpretability and modularity. This work provides practical insights into LLM-driven event extraction and opens pathways for future hybrid models combining textual and acoustic features.