Jiatao Li
2026
Fin-STAR: Structure-as-Semantics to Resolve Implicitness in Financial Retrieval
Yu Zou | Yan Chen | Lida He | Qi Zhou | Xiaorui Zhou | Aixi Zhong | Yi Wang | Wei Li | Qingyu Wang | Jiatao Li | Wei Gong | Jialei Zeng | Jingmei Zhao | Ke Jiang | Qing Li
Findings of the Association for Computational Linguistics: ACL 2026
Understanding financial documents is critical for high-stakes decision-making yet hindered by systemic semantic implicitness: key facts are rarely explicit in surface text and often determined by global structural cues. Missing these cues invites semantic misinterpretations, such as misreading what a number refers to, an outcome unacceptable in high-stakes environments. However, existing Retrieval-Augmented Generation (RAG) systems typically treat structure as a physical navigational skeleton rather than intrinsic semantic knowledge. To address this, we introduce Fin-STAR (Financial STructure-As-Semantics Retrieval), a framework redefining hierarchy as intrinsic semantics. Fin-STAR incorporates a novel Structure-Enriched Semantic Indexing mechanism that augments the hierarchical lineage with snippet-derived virtual nodes, and injects this enriched context via a semantic cross-attention paradigm, rendering implicit cues explicit. By grounding evidence within its structural scope, we preserve factual invariance and ensure contextual integrity. Addressing the lack of granular public datasets, we conduct experiments on FinTierQA Gold, a curated expert benchmark. Results show that Fin-STAR outperforms state-of-the-art hierarchical and graph-based baselines across diverse query complexities, document types, and markets. Notably, ablations confirm that our semantic injection consistently outperforms alternative strategies. Finally, we release FinTierQA, comprising 3.9M pairs automatically constructed from 78k documents via our framework.
2025
Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection
Jiatao Li | Xiaojun Wan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rise of Large Language Models (LLMs) necessitates accurate AI-generated text detection. However, current approaches largely overlook the influence of author characteristics. We investigate how sociolinguistic attributes—gender, CEFR proficiency, academic field, and language environment—impact state-of-the-art AI text detectors. Using the ICNALE corpus of human-authored texts and parallel AI-generated texts from diverse LLMs, we conduct a rigorous evaluation employing multi-factor ANOVA and weighted least squares (WLS). Our results reveal significant biases: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects. These findings highlight the crucial need for socially aware AI text detection to avoid unfairly penalizing specific demographic groups. We offer novel empirical evidence, a robust statistical framework, and actionable insights for developing more equitable and reliable detection systems in real-world, out-of-domain contexts. This work paves the way for future research on bias mitigation, inclusive evaluation benchmarks, and socially responsible LLM detectors.
Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models
Jiatao Li | Xinyu Hu | Xunjian Yin | Xiaojun Wan
Findings of the Association for Computational Linguistics: NAACL 2025
The integration of documents generated by LLMs themselves (Self-Docs) alongside retrieved documents has emerged as a promising strategy for retrieval-augmented generation systems. However, previous research primarily focuses on optimizing the use of Self-Docs, with their inherent properties remaining underexplored. To bridge this gap, we first investigate the overall effectiveness of Self-Docs, identifying key factors that shape their contribution to RAG performance (RQ1). Building on these insights, we develop a taxonomy grounded in Systemic Functional Linguistics to compare the influence of various Self-Docs categories (RQ2) and explore strategies for combining them with external sources (RQ3). Our findings reveal which types of Self-Docs are most beneficial and offer practical guidelines for leveraging them to achieve significant improvements in knowledge-intensive question answering tasks.