Georgi Nenkov Georgiev
2026
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
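As a rough illustration of what a parameterized symbolic template with executable code might look like, the sketch below generates a compound-interest question together with its gold reasoning steps from a random seed. It is not taken from the FinChain repository: the names `Instance`, `compound_interest_template`, and `check_chain`, the parameter ranges, and the tolerance-based check are all assumptions made for this example, and the check is a much simpler stand-in for ChainEval rather than an implementation of it.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """One generated problem (hypothetical structure, not FinChain's own)."""
    question: str
    steps: list   # list of (description, numeric value) pairs, in order
    answer: float


def compound_interest_template(seed: int) -> Instance:
    """Sample parameters from a seed and compute every step by executing code."""
    rng = random.Random(seed)
    principal = rng.randrange(1000, 10001, 500)  # illustrative range
    rate_pct = rng.randrange(2, 11)              # annual rate in percent
    years = rng.randrange(1, 6)

    growth = (1 + rate_pct / 100) ** years
    final_value = principal * growth
    interest = final_value - principal

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly, for {years} years. How much interest is earned?"
    )
    steps = [
        ("growth factor", growth),
        ("final value", final_value),
        ("interest earned", interest),
    ]
    return Instance(question, steps, round(interest, 2))


def check_chain(instance, predicted_steps, predicted_answer, tol=0.01) -> bool:
    """Naive verification: every step value and the final answer must match
    the gold values within `tol`. This is an assumption-level stand-in, not
    the alignment procedure used by ChainEval."""
    if len(predicted_steps) != len(instance.steps):
        return False
    for (_, gold), pred in zip(instance.steps, predicted_steps):
        if abs(gold - pred) > tol:
            return False
    return abs(instance.answer - predicted_answer) <= tol
```

Because each instance is a deterministic function of its seed, new questions can be sampled on demand, and the gold steps come from executing the template rather than from manual annotation.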
FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosure
Fan Zhang | Mingzi Song | Rania Elbadry | Yankai Chen | Shaobo Wang | Yixi Zhou | Xunwen Zheng | Yueru He | Yuyang Dai | Georgi Nenkov Georgiev | Ayesha Gull | Muhammad Usman Safder | Fan Wu | Liyuan Meng | Fengxian Ji | Junning Zhao | Xueqing Peng | Jimin Huang | YU Chen | Xue Liu | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Financial reporting systems increasingly leverage Large Language Models (LLMs) to extract and summarize corporate disclosures. However, most existing approaches assume a single-market setting and overlook structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions introduce substantial challenges for semantic alignment and reliable verification. Here, we aim to bridge this gap. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system constructs a unified canonical ontology spanning the income statement, balance sheet, and cash flow statement, and decomposes reporting into auditable stages, including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than treating LLMs as free-form generators, FinReporting employs them as constrained verifiers operating under explicit decision rules with evidence grounding. Evaluated on annual filings from the USA, Japan, and China, FinReporting improves consistency and reliability under heterogeneous reporting regimes. We further release an interactive demo that enables cross-market inspection and supports structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. A video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk.
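The canonical mapping and anomaly logging stages could be pictured roughly as follows. This is a minimal sketch under stated assumptions, not FinReporting's actual code: the jurisdiction codes, the local labels in `CANONICAL_MAP`, the canonical field names, and the function `map_items` are all hypothetical and chosen only for illustration.

```python
# Hypothetical lookup from (jurisdiction, local label) to a canonical field.
# All labels here are invented for illustration, not real taxonomy tags.
CANONICAL_MAP = {
    ("US", "Revenues"): "revenue",
    ("JP", "NetSales"): "revenue",
    ("CN", "OperatingRevenue"): "revenue",
    ("US", "NetIncomeLoss"): "net_income",
    ("JP", "ProfitLoss"): "net_income",
}


def map_items(jurisdiction: str, items: dict):
    """Map extracted line items to canonical fields.

    Returns (mapped, anomalies): unmapped labels are logged as anomalies
    instead of being guessed, so each decision stays auditable.
    """
    mapped = {}
    anomalies = []
    for label, value in items.items():
        field = CANONICAL_MAP.get((jurisdiction, label))
        if field is None:
            anomalies.append(f"{jurisdiction}: unmapped label '{label}'")
        elif field in mapped and mapped[field] != value:
            anomalies.append(f"{jurisdiction}: conflicting values for '{field}'")
        else:
            mapped[field] = value
    return mapped, anomalies


mapped, anomalies = map_items("JP", {"NetSales": 100.0, "Mystery": 5.0})
```

In this toy version the mapping is a fixed table; in the workflow the abstract describes, an LLM acts as a constrained verifier at this stage, which the sketch does not attempt to reproduce.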
2024
Factuality of Large Language Models: A Survey
Yuxia Wang | Minghan Wang | Muhammad Arslan Manzoor | Fei Liu | Georgi Nenkov Georgiev | Rocktim Jyoti Das | Preslav Nakov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs), especially when instruction-tuned for chat, have become part of our daily lives, freeing people from the process of searching, extracting, and integrating information from multiple sources by offering a straightforward answer to a variety of questions in a single place. Unfortunately, in many cases, LLM responses are factually incorrect, which limits their applicability in real-world scenarios. As a result, research on evaluating and improving the factuality of LLMs has attracted considerable attention recently. In this survey, we critically analyze existing work with the aim of identifying the major challenges and their associated causes, pointing to potential solutions for improving the factuality of LLMs, and analyzing the obstacles to automated factuality evaluation for open-ended text generation. We further offer an outlook on where future research should go.
OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
Hasan Iqbal | Yuxia Wang | Minghan Wang | Georgi Nenkov Georgiev | Jiahui Geng | Iryna Gurevych | Preslav Nakov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.
Co-authors
- Preslav Nakov 4
- Yuxia Wang 3
- Rania Elbadry 2
- Jimin Huang 2
- Xueqing Peng 2
- Minghan Wang 2
- Zhuohan Xie 2
- Fan Zhang 2
- Sophia Ananiadou 1
- Debopriyo Banerjee 1
- Tanmoy Chakraborty 1
- YU Chen (陈昱) 1
- Yankai Chen 1
- Yuyang Dai 1
- Rocktim Jyoti Das 1
- Jiahui Geng 1
- Ayesha Gull 1
- Iryna Gurevych 1
- Yueru He 1
- Hasan Iqbal 1
- Fengxian Ji 1
- Fajri Koto 1
- Ivan Koychev 1
- Salem Lahlou 1
- Haonan Li 1
- Fei Liu 1
- Xue Liu 1
- Hachem Madmoun 1
- Muhammad Arslan Manzoor 1
- Liyuan Meng 1
- Daniil Orel 1
- Lingfei Qian 1
- Muhammad Usman Safder 1
- Dhruv Sahnan 1
- Aaryamonvikram Singh 1
- Mingzi Song 1
- Veselin Stoyanov 1
- Jinyan Su 1
- Rushil Thareja 1
- Shaobo Wang 1
- Fan Wu 1
- Rui Xing 1
- Chen Xu 1
- Junning Zhao 1
- Xunwen Zheng 1
- Yixi Zhou 1