Large Language Models (LLMs) are capable of producing highly fluent and convincing text; however, their outputs can contain factual errors and misleading information. Consequently, LLMs have emerged as tools for the rapid and cost-effective generation of financial misinformation, enabling bad actors to harm individual investors and attempt to manipulate markets. In this study, we instruction-tune a Generative Pre-trained Transformer (GPT-4o-mini) to detect financial misinformation and to produce concise explanations of why a given claim or statement is classified as misinformation, leveraging the contextual information provided. Our model achieved fourth place in the Financial Misinformation Detection (FMD) shared task, with a micro F1 score of 0.788 and a ROUGE-1 score of 0.743 on the private test set of the FACT-checking within the FINancial domain (FIN-FACT) dataset provided by the shared task organizers.
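A minimal sketch of how instruction-tuning data for this kind of claim-plus-context classification and explanation task can be formatted, assuming the OpenAI chat-style JSONL fine-tuning format used for models such as GPT-4o-mini; the system prompt, the example claim, and the file name are hypothetical placeholders and are not drawn from the FIN-FACT dataset.

```python
# Sketch: format claim/context/label/explanation tuples as chat-style JSONL
# records for instruction-tuning. The example data below is invented.
import json

SYSTEM_PROMPT = (
    "You are a financial fact-checking assistant. Given a claim and its "
    "context, label the claim as 'misinformation' or 'accurate' and give "
    "a concise explanation."
)

examples = [
    {
        "claim": "Company X's revenue tripled last quarter.",
        "context": "The quarterly filing reports revenue growth of 12%.",
        "label": "misinformation",
        "explanation": "The filing shows 12% growth, not a threefold increase.",
    },
]

with open("fmd_finetune.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user",
                 "content": f"Claim: {ex['claim']}\nContext: {ex['context']}"},
                {"role": "assistant",
                 "content": f"Label: {ex['label']}\n"
                            f"Explanation: {ex['explanation']}"},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

Each record pairs the claim and its context with the gold label and a concise explanation, mirroring the output the fine-tuned model is expected to produce.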
Large Language Models (LLMs) have greatly advanced the field of Natural Language Generation (NLG). Despite their remarkable capabilities, their tendency to hallucinate, producing inaccurate or misleading information, remains a barrier to wider adoption. Current hallucination detection methods mainly employ coarse-grained binary classification at the sentence or document level, overlooking the need to precisely identify the specific text spans that contain hallucinations. In this paper, we propose a methodology that generates supplementary context and processes the text with an LLM to extract internal representations (features) from its various layers. These extracted features serve as input to a neural network classifier designed to perform token-level binary detection of hallucinations. Subsequently, we map the resulting token-level predictions to character-level predictions, enabling the identification of spans of hallucinated text, which we refer to as hallucination spans. Our model achieved a top-ten ranking in 13 of the 14 languages and secured first place for French in the SemEval Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes (Mu-SHROOM), using the Mu-SHROOM dataset provided by the task organizers.
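As a rough illustration of the span-detection pipeline described above, the sketch below extracts hidden-state features from several layers of a Hugging Face encoder, scores each token with a small (untrained here) binary classifier, and maps flagged tokens back to character offsets; the model checkpoint, layer selection, classifier architecture, and threshold are assumptions made for illustration rather than the system's actual configuration.

```python
# Sketch: layer-wise hidden states -> token-level classifier -> character spans.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"   # placeholder multilingual encoder
LAYERS = (-1, -4, -8)             # hypothetical choice of layers to concatenate

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

# Token-level binary head; untrained here, shown only for the data flow.
classifier = torch.nn.Sequential(
    torch.nn.Linear(model.config.hidden_size * len(LAYERS), 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

def hallucination_spans(text: str, threshold: float = 0.5):
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        hidden = model(**enc).hidden_states
        # Concatenate features from the selected layers for each token.
        feats = torch.cat([hidden[layer][0] for layer in LAYERS], dim=-1)
        probs = torch.sigmoid(classifier(feats)).squeeze(-1)
    # Map flagged tokens back to character-level spans via the offset mapping.
    spans, current = [], None
    for (start, end), p in zip(offsets.tolist(), probs.tolist()):
        if end == start:              # skip special tokens with empty offsets
            continue
        if p >= threshold:
            current = (current[0], end) if current else (start, end)
        elif current:
            spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

print(hallucination_spans("The Eiffel Tower was completed in 1789."))
```

The offset mapping returned by the fast tokenizer is what allows the token-level predictions to be converted into character-level hallucination spans.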
Accurate text summarization is one of the most common and important tasks performed by Large Language Models, where the cost of human review of an entire document may be high, but the cost of errors in summarization may be even greater. We propose Detecting Errors through Ensembling Prompts (DEEP), an end-to-end large language model framework for detecting factual errors in text summarization. Our framework uses a diverse set of LLM prompts to identify factual inconsistencies, treating their outputs as binary features that are then fed into ensembling models. We then calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent, or free of hallucination. We demonstrate that prior models for detecting factual errors in summaries perform significantly worse when their decision thresholds are not optimized on subsets of the evaluated dataset. Our framework achieves state-of-the-art (SOTA) balanced accuracy on the AggreFact-XSUM FTSOTA, TofuEval Summary-Level, and HaluEval Summarization benchmarks in detecting factual errors within transformer-generated text summaries. It does so without any fine-tuning of the language model or reliance on thresholding techniques that are unavailable in practical settings.
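A minimal sketch of the ensembling and calibration step, assuming the per-prompt LLM judgements have already been collected as binary features; the synthetic data, logistic-regression ensembler, and isotonic calibration are illustrative stand-ins rather than the exact models used by DEEP.

```python
# Sketch: binary prompt outputs -> ensemble classifier -> calibrated probabilities.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# X[i, j] = 1 if prompt j flagged summary i as factually inconsistent.
X_train = rng.integers(0, 2, size=(500, 8))
y_train = rng.integers(0, 2, size=500)     # 1 = summary contains a factual error
X_new = rng.integers(0, 2, size=(3, 8))

# Ensemble the binary prompt outputs, then calibrate so the scores can be
# read as empirically accurate probabilities of a factual error.
ensemble = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=5)
ensemble.fit(X_train, y_train)

print(ensemble.predict_proba(X_new)[:, 1])  # P(factual error) for each summary
```

Because calibration is learned on held-out folds, the resulting probabilities can be used directly without tuning a decision threshold on the evaluation data.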