Patrícia Schmidtová

Also published as: Patricia Schmidtova


2024

Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation
Simone Balloccu | Zdeněk Kasner | Ondřej Plátek | Patrícia Schmidtová | Kristýna Onderková | Mateusz Lango | Ondřej Dušek | Lucie Flek | Ehud Reiter | Dimitra Gkatzia | Simon Mille
Proceedings of the 2nd Workshop on Practical LLM-assisted Data-to-Text Generation

Automatic Metrics in Natural Language Generation: A survey of Current Evaluation Practices
Patricia Schmidtova | Saad Mahamood | Simone Balloccu | Ondrej Dusek | Albert Gatt | Dimitra Gkatzia | David M. Howcroft | Ondrej Platek | Adarsa Sivaprasad
Proceedings of the 17th International Natural Language Generation Conference

Automatic metrics are extensively used to evaluate Natural Language Processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.

factgenie: A Framework for Span-based Evaluation of Generated Texts
Zdeněk Kasner | Ondrej Platek | Patricia Schmidtova | Simone Balloccu | Ondrej Dusek
Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations

We present ‘factgenie’: a framework for annotating and visualizing word spans in textual model outputs. Annotations can capture various span-based phenomena such as semantic inaccuracies or irrelevant text. With ‘factgenie’, the annotations can be collected both from human crowdworkers and large language models. Our framework consists of a web interface for data visualization and gathering text annotations, powered by an easily extensible codebase.

Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
Simone Balloccu | Patrícia Schmidtová | Mateusz Lango | Ondrej Dusek
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI’s GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI’s data usage policy, we extensively document the amount of data leaked to these models during the first year after the model’s release. We report that these models have been globally exposed to ∼4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.

Faithfulness of Natural Language Generation
Patricia Schmidtova
Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems

In this position paper, I present my research interest in the faithfulness of natural language generation, i.e. the adherence to the data provided by a user or the dialog state. I motivate the task and present my progress and plans on the topic. I propose my position on the future of research dialog systems and share topics I would like to discuss during the roundtables.

ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
Mateusz Lango | Patricia Schmidtova | Simone Balloccu | Ondrej Dusek
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

In this paper, we describe several reproductions of a human evaluation experiment measuring the quality of automatic dialogue summarization (Feng et al., 2021). We investigate the impact of the annotators’ highest level of education, field of study, and native language on the evaluation of the informativeness of the summary. We find that the evaluation is relatively consistent regardless of these factors, but the biggest impact seems to be a prior specific background in natural language processing (as opposed to, e.g. a background in computer science). We also find that the experiment setup (asking for single vs. multiple criteria) may have an impact on the results.

2023

Semantic Accuracy in Natural Language Generation: A Thesis Proposal
Patricia Schmidtova
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

With the fast-growing popularity of current large pre-trained language models (LLMs), it is necessary to dedicate efforts to making them more reliable. In this thesis proposal, we aim to improve the reliability of natural language generation systems (NLG) by researching the semantic accuracy of their outputs. We look at this problem from the outside (evaluation) and from the inside (interpretability). We propose a novel method for evaluating semantic accuracy and discuss the importance of working towards a unified and objective benchmark for NLG metrics. We also review interpretability approaches which could help us pinpoint the sources of inaccuracies within the models and explore potential mitigation strategies.

Three Ways of Using Large Language Models to Evaluate Chat
Ondřej Plátek | Vojtech Hudecek | Patricia Schmidtova | Mateusz Lango | Ondrej Dusek
Proceedings of The Eleventh Dialog System Technology Challenge

This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three different approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs). We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT. We also analyze the performance of the other two approaches and report needed improvements for future work. We developed the three systems over just two weeks, showing the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way as ChatGPT.

Proceedings of the 19th Annual Meeting of the Young Reseachers' Roundtable on Spoken Dialogue Systems
Vojtech Hudecek | Patricia Schmidtova | Tanvi Dinkar | Javier Chiyah-Garcia | Weronika Sieinska
Proceedings of the 19th Annual Meeting of the Young Reseachers' Roundtable on Spoken Dialogue Systems

2022

THEaiTRobot: An Interactive Tool for Generating Theatre Play Scripts
Rudolf Rosa | Patrícia Schmidtová | Alisa Zakhtarenko | Ondrej Dusek | Tomáš Musil | David Mareček | Saad Ul Islam | Marie Novakova | Klara Vosecka | Daniel Hrbek | David Kostak
Proceedings of the 15th International Conference on Natural Language Generation: System Demonstrations

We present a free online demo of THEaiTRobot, an open-source bilingual tool for interactively generating theatre play scripts, in two versions. THEaiTRobot 1.0 uses the GPT-2 language model with minimal adjustments. THEaiTRobot 2.0 uses two models created by fine-tuning GPT-2 on purposefully collected and processed datasets and several other components, generating play scripts in a hierarchical fashion (title → synopsis → script). The underlying tool is used in the THEaiTRE project to generate scripts for plays, which are then performed on stage by a professional theatre.

GPT-2-based Human-in-the-loop Theatre Play Script Generation
Rudolf Rosa | Patrícia Schmidtová | Ondřej Dušek | Tomáš Musil | David Mareček | Saad Obaid | Marie Nováková | Klára Vosecká | Josef Doležal
Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)

We experiment with adapting generative language models for the generation of long coherent narratives in the form of theatre plays. Since fully automatic generation of whole plays is not currently feasible, we created an interactive tool that allows a human user to steer the generation somewhat while minimizing intervention. We pursue two approaches to long-text generation: a flat generation with summarization of context, and a hierarchical text-to-text two-stage approach, where a synopsis is generated first and then used to condition generation of the final script. Our preliminary results and discussions with theatre professionals show improvements over vanilla language model generation, but also identify important limitations of our approach.