Denys Katerenchuk

2026

FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking
Denys Katerenchuk | Pablo Duboue | Keelan Evanini
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Large language models (LLMs) are rapidly being adopted across various domains. However, their adoption in the banking industry faces resistance due to demands for high accuracy, regulatory compliance, and the need for verifiable and grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality, outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yields a 12% “I don’t know” rate, substantially improving over the base model’s unsafe 4.3% rate while avoiding GPT-4.1’s over-refusal (20.2%). Third, we present an end-to-end methodology spanning from data curation to quantized serving. The system is deployed at 40+ financial institutions, achieving a 7.1 percentage point improvement in query resolution (p < 0.001). Additionally, the model delivers 3–5x faster responses at 20–50x lower cost compared to GPT-4.1.

2018

Interpersonal Relationship Labels for the CALLHOME Corpus
Denys Katerenchuk | David Guy Brizan | Andrew Rosenberg
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Age Group Classification with Speech and Metadata Multimodality Fusion
Denys Katerenchuk
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Children comprise a significant proportion of TV viewers and it is worthwhile to customize the experience for them. However, identifying who is a child in the audience can be a challenging task. We present initial studies of a novel method which combines utterances with user metadata. In particular, we develop an ensemble of different machine learning techniques on different subsets of data to improve child detection. Our initial results show a 9.2% absolute improvement over the baseline, leading to state-of-the-art performance.

2016

RankDCG: Rank-Ordering Evaluation Measure
Denys Katerenchuk | Andrew Rosenberg
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Ranking is used for a wide array of problems, most notably information retrieval (search). Kendall’s τ, Average Precision, and nDCG are a few popular approaches to the evaluation of ranking. When dealing with problems such as user ranking or recommendation systems, all these measures suffer from various problems, including the inability to deal with elements of the same rank, inconsistent and ambiguous lower bound scores, and an inappropriate cost function. We propose a new measure, a modification of the popular nDCG algorithm, named rankDCG, that addresses these problems. We provide a number of criteria for any effective ranking algorithm and show that only rankDCG satisfies them all. Results are presented on constructed and real data sets. We release a publicly available rankDCG evaluation package.
