William Soto Martinez
Also published as: William Soto, William Soto-Martinez, William Eduardo Soto Martinez
2025
Fine-Tuning, Prompting and RAG for Knowledge Graph-to-Russian Text Generation. How do these Methods generalise to Out-of-Distribution Data?
Anna Nikiforovskaya | William Soto-Martinez | Evan Chapple | Claire Gardent
Proceedings of the 18th International Natural Language Generation Conference
Prior work on Knowledge Graph-to-Text generation has mostly evaluated models on in-domain test sets and/or with English as the target language. In contrast, we focus on Russian and assess how various generation methods perform on out-of-domain, unseen data. Previous studies have shown that enriching the input with target-language verbalisations of entities and properties substantially improves the performance of fine-tuned models for Russian. We compare multiple variants of two contemporary paradigms, LLM prompting and Retrieval-Augmented Generation (RAG), and investigate alternative ways to integrate such external knowledge into the generation process. Using automatic metrics and human evaluation, we find that the fine-tuned model consistently underperforms on unseen data, revealing limited generalisation capacity; that while prompting outperforms RAG by a small margin on most datasets, it generates less fluent text; and that, conversely, RAG generates text that is less faithful to the input. Overall, both LLM prompting and RAG outperform fine-tuning across all unseen test sets. The code for this paper is available at https://github.com/Javanochka/KG-to-text-fine-tuning-prompting-rag
Semantic Evaluation of Multilingual Data-to-Text Generation via NLI Fine-Tuning: Precision, Recall and F1 scores
William Soto Martinez | Yannick Parmentier | Claire Gardent
Findings of the Association for Computational Linguistics: ACL 2025
Performance in the KG-to-Text task has improved over the years, particularly in English. However, models are still prone to mistakes such as additions and omissions. Furthermore, few languages are taken into account, since neither training nor test data are readily available for most of them. In this paper, we hope to facilitate the development and improvement of multilingual KG-to-Text models by providing a multilingual evaluation framework that is reference-less (no need for test data) and permits estimating how much a KG-to-Text model under-generates (omissions) or over-generates (additions). We focus on two high-resource (English, Russian) and five low-resource (Breton, Irish, Maltese, Welsh, Xhosa) languages and show that our metric has fair to moderate correlation with reference-based metrics, positioning it as a consistent alternative when no references are available. We also show that our metric outperforms prior reference-less metrics in correlation with existing human judgments. Additional human evaluation shows moderate to strong correlation with human annotators in assessing precision and recall at a higher granularity level than shown in previous studies. Since our metric provides scores for precision and recall, it helps better assess the level of over- or under-generation of multilingual KG-to-Text models.
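The idea of casting faithfulness evaluation as precision and recall can be sketched in a few lines. The sketch below is illustrative only and not the paper's implementation: `nli_entailment` is a hypothetical stand-in (here a crude word-overlap proxy) for a fine-tuned multilingual NLI model, and `precision_recall_f1` shows how entailment in the two directions maps to additions (precision) and omissions (recall).

```python
# Hypothetical sketch of a reference-less, NLI-based precision/recall metric
# for KG-to-Text generation. A real system would replace `nli_entailment`
# with a fine-tuned multilingual NLI model.

def nli_entailment(premise: str, hypothesis: str) -> float:
    """Placeholder entailment score: fraction of hypothesis tokens found
    in the premise (a real NLI model returns an entailment probability)."""
    hyp = set(hypothesis.lower().split())
    shared = set(premise.lower().split()) & hyp
    return len(shared) / max(len(hyp), 1)

def precision_recall_f1(triples: list[str], generated: str) -> tuple[float, float, float]:
    """Precision: is the generated text supported by the graph (no additions)?
    Recall: is each input triple expressed in the text (no omissions)?"""
    graph = ". ".join(triples)
    precision = nli_entailment(graph, generated)  # graph entails the text
    recall = sum(nli_entailment(generated, t) for t in triples) / len(triples)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because the two directions are scored separately, an over-generating model lowers precision while an omission-prone model lowers recall, which is exactly the diagnostic the abstract describes.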
Multilingual Verbalisation of Knowledge Graphs
Yifei Song | William Soto Martinez | Anna Nikiforovskaya | Evan Chapple | Claire Gardent
Findings of the Association for Computational Linguistics: EMNLP 2025
Most work on Knowledge Graph (KG) verbalisation is monolingual, leaving open the question of how to scale KG-to-Text generation to languages with varying amounts of resources. In this work, we explore KG-to-Text generation on nine languages including five high-resource (HR) languages (English, Chinese, French, Spanish, Russian) and four low-resource (LR) languages (Breton, Irish, Maltese, Welsh). We first construct silver multilingual training data for all nine languages and new gold out-of-domain test data for the five HR languages. Using this data and already available in-domain test sets for seven of our nine languages, we then compare three strategies: (1) NLG+MT, a state-of-the-art KG-to-English model followed by Machine Translation (MT) into the target language; (2) FTMT, multilingual MT models fine-tuned end-to-end on the silver data; and (3) FewShot, few-shot LLM prompting comparing four LLMs. We explore different prompting strategies and show that our best prompting strategy performs best on all nine languages; we also discuss the relative performance of the three approaches on low- vs high-resource languages and on in- vs out-of-domain data. The models, the test set, and the silver training data are available at https://github.com/MeloS7/Multilingual-KG-Verbalisation.
2024
Generating from AMRs into High and Low-Resource Languages using Phylogenetic Knowledge and Hierarchical QLoRA Training (HQL)
William Soto Martinez | Yannick Parmentier | Claire Gardent
Proceedings of the 17th International Natural Language Generation Conference
Multilingual generation from Abstract Meaning Representations (AMRs) verbalises AMRs into multiple languages. Previous work has focused on high- and medium-resource languages relying on large amounts of training data. In this work, we consider both high- and low-resource languages, capping training data size at the lower bound set by our low-resource languages, i.e. 31K. We propose a straightforward technique to enhance results on low-resource languages while preserving performance on high-resource languages. We iteratively refine a multilingual model into a set of monolingual models using Low-Rank Adaptation with a training curriculum based on a tree structure; this permits investigating how the languages used at each iteration impact generation performance on high- and low-resource languages. We show an improvement over both mono- and multilingual approaches. Comparing different ways of grouping languages at each iteration step, we find two working configurations: grouping related languages, which promotes transfer, or grouping distant languages, which facilitates regularisation.
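The tree-based curriculum can be pictured as follows. This is a minimal sketch, not the paper's code: the tree, language codes, and grouping are hypothetical, and it only enumerates the training stages (multilingual root, then ever-narrower subgroups, down to monolingual leaves) at which the abstract says adapters are refined.

```python
# Illustrative sketch of a hierarchical training curriculum over a language tree:
# each stage lists the languages whose data is pooled when refining the adapter
# at that node, walking top-down from multilingual to monolingual.

TREE = {  # hypothetical phylogenetic grouping
    "all": ["germanic", "celtic"],
    "germanic": ["en", "de"],
    "celtic": ["ga", "cy"],
}

def leaves(node: str) -> list[str]:
    """Languages dominated by a tree node (a leaf is its own language)."""
    if node not in TREE:
        return [node]
    return [lang for child in TREE[node] for lang in leaves(child)]

def curriculum(node: str = "all") -> list[list[str]]:
    """Training stages in top-down order, one per tree node."""
    stages = [leaves(node)]
    for child in TREE.get(node, []):
        stages.extend(curriculum(child))
    return stages
```

Swapping which languages share an internal node is how one tests the related-vs-distant grouping contrast the abstract reports.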
2023
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh Dhole | Varun Gangal | Sebastian Gehrmann | Aadesh Gupta | Zhenhao Li | Saad Mahamood | Abinaya Mahadiran | Simon Mille | Ashish Shrivastava | Samson Tan | Tongshang Wu | Jascha Sohl-Dickstein | Jinho Choi | Eduard Hovy | Ondřej Dušek | Sebastian Ruder | Sajant Anand | Nagender Aneja | Rabin Banjade | Lisa Barthe | Hanna Behnke | Ian Berlot-Attwell | Connor Boyle | Caroline Brun | Marco Antonio Sobrevilla Cabezudo | Samuel Cahyawijaya | Emile Chapuis | Wanxiang Che | Mukund Choudhary | Christian Clauss | Pierre Colombo | Filip Cornell | Gautier Dagan | Mayukh Das | Tanay Dixit | Thomas Dopierre | Paul-Alexis Dray | Suchitra Dubey | Tatiana Ekeinhor | Marco Di Giovanni | Tanya Goyal | Rishabh Gupta | Louanes Hamla | Sang Han | Fabrice Harel-Canada | Antoine Honoré | Ishan Jindal | Przemysław Joniak | Denis Kleyko | Venelin Kovatchev | Kalpesh Krishna | Ashutosh Kumar | Stefan Langer | Seungjae Ryan Lee | Corey James Levinson | Hualou Liang | Kaizhao Liang | Zhexiong Liu | Andrey Lukyanenko | Vukosi Marivate | Gerard de Melo | Simon Meoni | Maxine Meyer | Afnan Mir | Nafise Sadat Moosavi | Niklas Meunnighoff | Timothy Sum Hon Mun | Kenton Murray | Marcin Namysl | Maria Obedkova | Priti Oli | Nivranshu Pasricha | Jan Pfister | Richard Plant | Vinay Prabhu | Vasile Pais | Libo Qin | Shahab Raji | Pawan Kumar Rajpoot | Vikas Raunak | Roy Rinberg | Nicholas Roberts | Juan Diego Rodriguez | Claude Roux | Vasconcellos Samus | Ananya Sai | Robin Schmidt | Thomas Scialom | Tshephisho Sefara | Saqib Shamsi | Xudong Shen | Yiwen Shi | Haoyue Shi | Anna Shvets | Nick Siegel | Damien Sileo | Jamie Simon | Chandan Singh | Roman Sitelew | Priyank Soni | Taylor Sorensen | William Soto | Aman Srivastava | Aditya Srivatsa | Tony Sun | Mukund Varma | A Tabassum | Fiona Tan | Ryan Teehan | Mo Tiwari | Marie Tolkiehn | Athena Wang | Zijian Wang | Zijie Wang | Gloria Wang | Fuxuan Wei | Bryan Wilie | Genta Indra Winata | Xinyu Wu | Witold Wydmanski | Tianbao Xie | Usama Yaseen | Michael Yee | Jing Zhang | Yue Zhang
Northern European Journal of Language Technology, Volume 9
Data augmentation is an important method both for evaluating the robustness of natural language processing (NLP) models and for enhancing the diversity of their training data. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks, annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically valid style and syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.
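The transformation/filter split at the heart of the framework can be sketched as follows. This is a toy illustration of the pattern, not NL-Augmenter's actual base classes: the class names, the one-typo "butter fingers" rule, and the length threshold are all hypothetical simplifications.

```python
# Minimal sketch of the transformation/filter pattern described above
# (illustrative only; the real framework defines richer interfaces and tags).

class Transformation:
    """A transformation modifies the data, producing perturbed variants."""
    def generate(self, sentence: str) -> list[str]:
        raise NotImplementedError

class ButterFingers(Transformation):
    """Toy transformation: simulate one accidental keyboard typo."""
    def generate(self, sentence: str) -> list[str]:
        return [sentence.replace("e", "r", 1)] if "e" in sentence else [sentence]

class Filter:
    """A filter splits the data according to a specific feature."""
    def keep(self, sentence: str) -> bool:
        raise NotImplementedError

class LengthFilter(Filter):
    """Toy filter: keep only sentences up to a word-count threshold."""
    def __init__(self, max_words: int = 10):
        self.max_words = max_words

    def keep(self, sentence: str) -> bool:
        return len(sentence.split()) <= self.max_words
```

Robustness analysis then amounts to running a model on both the original sentences and the transformed ones and comparing scores on the splits a filter defines.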
The 2023 WebNLG Shared Task on Low Resource Languages. Overview and Evaluation Results (WebNLG 2023)
Liam Cripwell | Anya Belz | Claire Gardent | Albert Gatt | Claudia Borg | Marthese Borg | John Judge | Michela Lorandi | Anna Nikiforovskaya | William Soto-Martinez | Craig Thomson
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)
The WebNLG task consists of mapping a knowledge graph to a text verbalising the content of that graph. The 2017 WebNLG edition required participating systems to generate English text from a set of DBpedia triples, while the 2020 WebNLG+ challenge additionally included generation into Russian and semantic parsing of English and Russian texts. In contrast, WebNLG 2023 focuses on four under-resourced languages which are severely under-represented in research on text generation, namely Breton, Irish, Maltese and Welsh. In addition, WebNLG 2023 once again includes Russian. In this paper, we present the organisation of the shared task (data, timeline, evaluation), briefly describe the participating systems and summarise their results.
Phylogeny-Inspired Soft Prompts For Data-to-Text Generation in Low-Resource Languages
William Soto Martinez | Yannick Parmentier | Claire Gardent
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Co-authors
- Claire Gardent 6
- Anna Nikiforovskaya 3
- Yannick Parmentier 3
- Evan Chapple 2
- Sajant Anand 1
- Nagender Aneja 1
- Rabin Banjade 1
- Lisa Barthe 1
- Hanna Behnke 1
- Anja Belz 1
- Ian Berlot-Attwell 1
- Claudia Borg 1
- Marthese Borg 1
- Connor Boyle 1
- Caroline Brun 1
- Samuel Cahyawijaya 1
- Emile Chapuis 1
- Wanxiang Che 1
- Jinho D. Choi 1
- Mukund Choudhary 1
- Christian Clauss 1
- Pierre Colombo 1
- Filip Cornell 1
- Liam Cripwell 1
- Gautier Dagan 1
- Mayukh Das 1
- Gerard De Melo 1
- Kaustubh Dhole 1
- Marco Di Giovanni 1
- Tanay Dixit 1
- Thomas Dopierre 1
- Paul-Alexis Dray 1
- Suchitra Dubey 1
- Ondřej Dušek 1
- Tatiana Ekeinhor 1
- Varun Gangal 1
- Albert Gatt 1
- Sebastian Gehrmann 1
- Tanya Goyal 1
- Aadesh Gupta 1
- Rishabh Gupta 1
- Louanes Hamla 1
- Sang Han 1
- Fabrice Harel-Canada 1
- Antoine Honoré 1
- Eduard Hovy 1
- Ishan Jindal 1
- Przemysław Joniak 1
- John Judge 1
- Denis Kleyko 1
- Venelin Kovatchev 1
- Kalpesh Krishna 1
- Ashutosh Kumar 1
- Stefan Langer 1
- Seungjae Ryan Lee 1
- Corey James Levinson 1
- Zhenhao Li 1
- Hualou Liang 1
- Kaizhao Liang 1
- Zhexiong Liu 1
- Michela Lorandi 1
- Andrey Lukyanenko 1
- Abinaya Mahadiran 1
- Saad Mahamood 1
- Vukosi Marivate 1
- Simon Meoni 1
- Niklas Meunnighoff 1
- Maxine Meyer 1
- Simon Mille 1
- Afnan Mir 1
- Nafise Sadat Moosavi 1
- Timothy Sum Hon Mun 1
- Kenton Murray 1
- Marcin Namysl 1
- Maria Obedkova 1
- Priti Oli 1
- Vasile Pais 1
- Nivranshu Pasricha 1
- Jan Pfister 1
- Richard Plant 1
- Vinay Prabhu 1
- Libo Qin 1
- Shahab Raji 1
- Pawan Kumar Rajpoot 1
- Vikas Raunak 1
- Roy Rinberg 1
- Nicholas Roberts 1
- Juan Diego Rodriguez 1
- Claude Roux 1
- Sebastian Ruder 1
- Ananya Sai 1
- Vasconcellos Samus 1
- Robin Schmidt 1
- Thomas Scialom 1
- Tshephisho Sefara 1
- Saqib Shamsi 1
- Xudong Shen 1
- Yiwen Shi 1
- Freda Shi 1
- Ashish Shrivastava 1
- Anna Shvets 1
- Nick Siegel 1
- Damien Sileo 1
- Jamie Simon 1
- Chandan Singh 1
- Roman Sitelew 1
- Marco Antonio Sobrevilla Cabezudo 1
- Jascha Sohl-Dickstein 1
- Yifei Song 1
- Priyank Soni 1
- Taylor Sorensen 1
- Aman Srivastava 1
- Aditya Srivatsa 1
- Tony Sun 1
- A Tabassum 1
- Samson Tan 1
- Fiona Tan 1
- Ryan Teehan 1
- Craig Thomson 1
- Mo Tiwari 1
- Marie Tolkiehn 1
- Mukund Varma 1
- Athena Wang 1
- Zijian Wang 1
- Zijie Wang 1
- Gloria Wang 1
- Fuxuan Wei 1
- Bryan Wilie 1
- Genta Indra Winata 1
- Tongshang Wu 1
- Xinyu Wu 1
- Witold Wydmanski 1
- Tianbao Xie 1
- Usama Yaseen 1
- Michael Yee 1
- Jing Zhang 1
- Yue Zhang 1