2025
Low-resource Buryat-Russian neural machine translation
Dari Baturova | Sarana Abidueva | Dmitrii Lichko | Ivan Bondarenko
Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics
This paper presents a study on the development of a neural machine translation (NMT) system for the Russian-Buryat language pair, focusing on addressing the challenges of low-resource translation. We also present a parallel corpus, constructed by processing existing texts and organizing the translation process, supplemented by data augmentation techniques to enhance model training. We achieved BLEU scores of 20 and 35 for translation into Buryat and Russian, respectively. Native speakers evaluated the translations as acceptable. Future directions include expanding and cleaning the dataset, improving model training techniques, and exploring dialectal variations within the Buryat language.
Pisets: A Robust Speech Recognition System for Lectures and Interviews
Ivan Bondarenko | Daniil Grebenkin | Oleg Sedukhin | Mikhail Klementev | Derunets Roman | Lyudmila Budneva
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
This work presents “Pisets”, a speech-to-text system for scientists and journalists based on a three-component architecture aimed at improving speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the utilization of diverse Russian-language speech corpora significantly enhanced the system’s effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcription of long audio data across various acoustic conditions compared to WhisperX and the standard Whisper model. The source code of the “Pisets” system is publicly available on GitHub: https://github.com/bond005/pisets.
TabaQA at SemEval-2025 Task 8: Column Augmented Generation for Question Answering over Tabular Data
Ekaterina Antropova | Egor Kratkov | Roman Derunets | Margarita Trofimova | Ivan Bondarenko | Alexander Panchenko | Vasily Konovalov | Maksim Savkin
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
The DataBench shared task in the SemEval-2025 competition aims to tackle the problem of QA over data in tables. Given the diversity of table structures, there are different approaches to retrieving the answer. Although Retrieval-Augmented Generation (RAG) is a viable solution, extracting relevant information from tables remains challenging. In addition, a table can be prohibitively large for direct integration into the LLM context. In this paper, we address QA over tabular data in three steps: first we identify the relevant columns that might contain the answer, then the LLM generates an answer given the context of those columns, and finally the LLM refines its answer. This approach secured us 7th place in the DataBench lite category.
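The three steps described in this abstract (column selection, answer generation, refinement) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the prompts, the function names `select_columns` and `answer_question`, the `max_columns` parameter, and the `fake_llm` stub standing in for a real model are all hypothetical.

```python
from typing import Callable, Dict, List

# A table is modelled here as a mapping from column name to its values (assumption).
Table = Dict[str, List[object]]
LLM = Callable[[str], str]  # any function that maps a prompt to a text reply


def select_columns(question: str, table: Table, llm: LLM, max_columns: int = 3) -> List[str]:
    """Step 1: ask the model which columns may contain the answer."""
    prompt = (
        f"Question: {question}\n"
        f"Columns: {', '.join(table)}\n"
        "List the relevant columns, comma-separated."
    )
    chosen = [name.strip() for name in llm(prompt).split(",")]
    # Keep only names that really exist in the table; fall back to the first columns.
    valid = [name for name in chosen if name in table]
    return valid[:max_columns] or list(table)[:max_columns]


def answer_question(question: str, table: Table, llm: LLM) -> str:
    """Steps 2 and 3: generate an answer from the selected columns, then refine it."""
    columns = select_columns(question, table, llm)
    context = {name: table[name] for name in columns}
    draft = llm(f"Question: {question}\nData: {context}\nAnswer:")
    final = llm(
        f"Question: {question}\nData: {context}\n"
        f"Draft answer: {draft}\nRefine the draft answer:"
    )
    return final.strip()


def fake_llm(prompt: str) -> str:
    """Hypothetical stub used only to make the sketch runnable without a model."""
    if "List the relevant columns" in prompt:
        return "age, unknown"
    if "Refine" in prompt:
        return " 31 "
    return "31?"


table = {"name": ["Ann", "Bob"], "age": [31, 27]}
result = answer_question("How old is Ann?", table, fake_llm)
```

Passing only the selected columns to the model keeps the prompt small when the full table would not fit into the context window, which is the motivation the abstract gives.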
FactDebug at SemEval-2025 Task 7: Hybrid Retrieval Pipeline for Identifying Previously Fact-Checked Claims Across Multiple Languages
Evgenii Nikolaev | Ivan Bondarenko | Islam Aushev | Vasilii Krikunov | Andrei Glinskii | Vasily Konovalov | Julia Belikova
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
The proliferation of multilingual misinformation demands robust systems for crosslingual fact-checked claim retrieval. This paper addresses SemEval-2025 Shared Task 7, which challenges participants to retrieve fact-checks for social media posts across 14 languages, even when posts and fact-checks are in different languages. We propose a hybrid retrieval pipeline that combines sparse lexical matching (BM25, BGE-m3) and dense semantic retrieval (pretrained and fine-tuned BGE-m3) with dynamic fusion and curriculum-trained rerankers. Our system achieves 67.2% crosslingual and 86.01% monolingual accuracy on the Shared Task MultiClaim dataset.
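To illustrate how rankings from a sparse and a dense retriever can be combined, here is a minimal sketch using reciprocal rank fusion (RRF). RRF is a common fixed fusion rule and is used here only as a stand-in: the abstract mentions "dynamic fusion" without specifying it, so this is not the authors' method. The document identifiers and the constant `k=60` are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 fact-check ids returned by a sparse and a dense retriever.
sparse_ranking = ["fc3", "fc1", "fc7"]
dense_ranking = ["fc1", "fc5", "fc3"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Documents found by both retrievers accumulate score from each list, so they tend to rise above documents that only one retriever returned. In a pipeline like the one described, the fused list would then be passed to a reranker.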
2018
Conditional Random Fields for Metaphor Detection
Anna Mosolova | Ivan Bondarenko | Vadim Fomin
Proceedings of the Workshop on Figurative Language Processing
We present an algorithm for detecting metaphor in sentences, which was used in the Shared Task on Metaphor Detection at the First Workshop on Figurative Language Processing. The algorithm is based on a set of different features and Conditional Random Fields.