K.P. Subbalakshmi


2025

Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field faces two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities such as stocks and cryptocurrencies, as well as exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source datasets and developed a comprehensive suite of environments for financial decision-making. Together, these establish a highly accessible platform for evaluating financial agents' performance across various scenarios.
Large language model (LLM)-based research assistant tools demonstrate impressive capabilities, yet their outputs may contain hallucinations that compromise reliability. Detecting hallucinations in automatically generated scientific content is therefore essential. The SciHal2025: Hallucination Detection for Scientific Content challenge at ACL 2025 provides a valuable platform for advancing this goal. This paper presents our solution to the SciHal2025 challenge. Our method combines several prompting strategies with fine-tuned base LLMs. We first benchmark multiple LLMs on the SciHal dataset. Next, we develop a detection pipeline that integrates few-shot and chain-of-thought prompting. Hidden representations extracted from the LLMs serve as features for an auxiliary classifier, further improving accuracy. Finally, we fine-tune the selected base LLMs to enhance end-to-end performance. We present comprehensive experimental results and discuss the implications of our findings for future research on hallucination detection in scientific content.
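The idea of using hidden representations as features for an auxiliary classifier can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not say which classifier, layer, or feature dimension was used, so the sketch swaps in a plain logistic-regression probe trained on synthetic vectors that stand in for pooled LLM hidden states (label 1 = hallucinated, 0 = supported). The helper names `make_synthetic_hidden_states`, `train_probe`, and `predict` are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_hidden_states(n_per_class, dim, seed=0):
    # ASSUMPTION: stand-in for real pooled hidden states from an LLM.
    # Each class is drawn from a Gaussian with a different mean.
    rng = random.Random(seed)
    features, labels = [], []
    for label, mean in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            features.append([rng.gauss(mean, 1.0) for _ in range(dim)])
            labels.append(label)
    return features, labels


def train_probe(features, labels, lr=0.1, epochs=200):
    # Fit a logistic-regression probe with batch gradient descent.
    # ASSUMPTION: the source does not name its classifier; this is one option.
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    # Return 1 (hallucinated) when the predicted probability is at least 0.5.
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


features, labels = make_synthetic_hidden_states(n_per_class=50, dim=8)
weights, bias = train_probe(features, labels)
correct = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
accuracy = correct / len(labels)
print(f"training accuracy on synthetic features: {accuracy:.2f}")
```

In a real pipeline the synthetic vectors would be replaced by representations extracted from the chosen LLM for each claim, and accuracy would be measured on held-out data rather than on the training set as done here.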