This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
WeiyiSun
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables the chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss or errors in presenting tabular results in NL remains largely unexplored.This paper introduces a novel evaluation method - Combo-Eval - for judgment of LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity and achieving a significant reduction in LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.
Large Language Models (LLMs) have shown promise in summarizing complex documents, but their limitations in handling lengthy documents and capturing global information hinder their performance in tasks like Query-Focused Summarization (QFS). We explore GraphRAG, a retrieval-augmented generation approach that utilizes a globally summarized knowledge graph derived from an LLM. We apply GraphRAG to the Financial Narrative Summarization (FNS) dataset, which consists of lengthy financial reports. Our results show that a naive RAG approach outperforms GraphRAG in terms of comprehensiveness, directness, conciseness and completeness. However, we demonstrate that optimizing entity and relation extraction using an LLM as an optimizer can enhance GraphRAG’s performance. Our study highlights the need for domain-specific optimization to improve GraphRAG’s capabilities for summarization tasks in facts-heavy domains like finance. We propose an optimization framework that extends GraphRAG’s original domain adaptation strategy by incorporating entity and relations optimization, leading to improved performance in capturing relevant entities and relationships. Our findings contribute to the development of more effective summarization models for complex documents in finance and other domains.
This paper presents our contribution to the Financial Document Causality Detection (FinCausal) task 2025. The FinCausal challenge centers on the extraction of cause-and-effect relationships from financial texts written in both English and Spanish. We introduce KULFi, a novel Knowledge Utilization framework designed to augment the capabilities of Large Language Models (LLMs) by leveraging the expertise of more advanced reasoning models. Through the utilization of Teacher LLMs to generate task-specific instructions, KULFi optimizes the performance of Student LLMs via automated prompt optimization. We evaluate the efficacy of KULFi on the Financial Document Causality Detection Task, where Student LLM achieves a similarity score comparable to human-guided prompt optimization for the same LLM, demonstrating significant improvements in causal reasoning performance. Our results demonstrate that KULFi enables effective knowledge transfer from more robust models to less capable ones, as well as efficient learning from training data, minimizing the need for human input in prompt design and enabling more precise causal analysis in financial contexts. Our system attained SAS and Exact Match scores of 0.92 and 0.35 on the English dataset, and 0.92 and 0.09 on the Spanish dataset, respectively. This framework has far-reaching implications, with potential applications in enhancing decision-making across complex financial environments.
This paper presents a model architecture and training pipeline for attribute value extraction from search queries. The model uses weak labels generated from customer interactions to train a transformer-based NER model. A two-stage normalization process is then applied to deal with the problem of a large label space: first, the model output is normalized onto common generic attribute values, then it is mapped onto a larger range of actual product attribute values. This approach lets us successfully apply a transformer-based NER model to the extraction of a broad range of attribute values in a real-time production environment for e-commerce applications, contrary to previous research. In an online test, we demonstrate business value by integrating the model into a system for semantic product retrieval and ranking.
Recently, semantic search has been successfully applied to E-commerce product search and the learned semantic space for query and product encoding are expected to generalize well to unseen queries or products. Yet, whether generalization can conveniently emerge has not been thoroughly studied in the domain thus far. In this paper, we examine several general-domain and domain-specific pre-trained Roberta variants and discover that general-domain fine-tuning does not really help generalization which aligns with the discovery of prior art, yet proper domain-specific fine-tuning with clickstream data can lead to better model generalization, based on a bucketed analysis of a manually annotated query-product relevance data.
We discuss automatic creation of medical reports from ASR-generated patient-doctor conversational transcripts using an end-to-end neural summarization approach. We explore both recurrent neural network (RNN) and Transformer-based sequence-to-sequence architectures for summarizing medical conversations. We have incorporated enhancements to these architectures, such as the pointer-generator network that facilitates copying parts of the conversations to the reports, and a hierarchical RNN encoder that makes RNN training three times faster with long inputs. A comparison of the relative improvements from the different model architectures over an oracle extractive baseline is provided on a dataset of 800k orthopedic encounters. Consistent with observations in literature for machine translation and related tasks, we find the Transformer models outperform RNN in accuracy, while taking less than half the time to train. Significantly large wins over a strong oracle baseline indicate that sequence-to-sequence modeling is a promising approach for automatic generation of medical reports, in the presence of data at scale.