Andrey Savchenko


2025

pdf bib
ATGen: A Framework for Active Text Generation
Akim Tsvigun | Daniil Vasilev | Ivan Tsvigun | Ivan Lysenko | Talgat Bektleuov | Aleksandr Medvedev | Uliana Vinogradova | Nikita Severin | Mikhail Mozikov | Andrey Savchenko | Ilya Makarov | Grigorev Rostislav | Ramil Kuleev | Fedor Zhdanov | Artem Shelmanov
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Active learning (AL) has demonstrated remarkable potential in reducing the annotation effort required for training machine learning models. However, despite the surging popularity of natural language generation (NLG) tasks in recent years, the application of AL to NLG has been limited. In this paper, we introduce Active Text Generation (ATGen) - a comprehensive framework that bridges AL with text generation tasks, enabling the application of state-of-the-art AL strategies to NLG. Our framework simplifies AL-empowered annotation in NLG tasks using both human annotators and automatic annotation agents based on large language models (LLMs). The framework supports LLMs deployed as a service, such as ChatGPT and Claude, or operated on-premises. Furthermore, ATGen provides a unified platform for smooth implementation and benchmarking of novel AL strategies tailored to NLG tasks. Finally, we present experimental results across multiple text generation tasks where we compare the performance of state-of-the-art AL strategies in various settings. We demonstrate that ATGen can reduce both the effort of human annotators and costs for API calls to automatic annotation agents based on LLMs.

2024

pdf bib
Leveraging Summarization for Unsupervised Dialogue Topic Segmentation
Aleksei Artemiev | Daniil Parinov | Alexey Grishanov | Ivan Borisov | Alexey Vasilev | Daniil Muravetskii | Aleksey Rezvykh | Aleksei Goncharov | Andrey Savchenko
Findings of the Association for Computational Linguistics: NAACL 2024

Traditional approaches to dialogue segmentation perform reasonably well on synthetic or written dialogues but suffer when dealing with spoken, noisy dialogs. In addition, such methods require careful tuning of hyperparameters. We propose to leverage a novel approach that is based on dialogue summaries. Experiments on different datasets showed that the new approach outperforms popular state-of-the-art algorithms in unsupervised topic segmentation and requires less setup.

pdf bib
Lost in Translation: Chemical Language Models and the Misunderstanding of Molecule Structures
Veronika Ganeeva | Andrey Sakhovskiy | Kuzma Khrabrov | Andrey Savchenko | Artur Kadurin | Elena Tutubalina
Findings of the Association for Computational Linguistics: EMNLP 2024

The recent integration of chemistry with natural language processing (NLP) has advanced drug discovery. Molecule representation in language models (LMs) is crucial in enhancing chemical understanding. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for assessment of Chemistry LMs of different natures: trained solely on molecules for chemical tasks and on a combined corpus of natural language texts and string-based structures. The framework relies on molecule augmentations that preserve an underlying chemical, such as kekulization and cycle replacements. We evaluate encoder-only and generative LMs by calculating a metric based on the similarity score between distributed representations of molecules and their augmentations. Our experiments on ChEBI-20 and QM9 benchmarks show that these models exhibit significantly lower scores than graph-based molecular models trained without language modeling objectives. Additionally, our results on the molecule captioning task for cross-domain models, MolT5 and Text+Chem T5, demonstrate that the lower the representation-based evaluation metrics, the lower the classical text generation metrics like ROUGE and METEOR.

2020

pdf bib
Ad Lingua: Text Classification Improves Symbolism Prediction in Image Advertisements
Andrey Savchenko | Anton Alekseev | Sejeong Kwon | Elena Tutubalina | Evgeny Myasnikov | Sergey Nikolenko
Proceedings of the 28th International Conference on Computational Linguistics

Understanding image advertisements is a challenging task, often requiring non-literal interpretation. We argue that standard image-based predictions are insufficient for symbolism prediction. Following the intuition that texts and images are complementary in advertising, we introduce a multimodal ensemble of a state of the art image-based classifier, a classifier based on an object detection architecture, and a fine-tuned language model applied to texts extracted from ads by OCR. The resulting system establishes a new state of the art in symbolism prediction.