Workshop on Creating, Using And Linking Parliamentary Corpora (2022)


Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

Darja Fišer | Maria Eskevich | Jakob Lenardič | Franciska de Jong

ParlaMint II: The Show Must Go On
Maciej Ogrodniczuk | Petya Osenova | Tomaž Erjavec | Darja Fišer | Nikola Ljubešić | Çağrı Çöltekin | Matyáš Kopp | Katja Meden

In ParlaMint I, a CLARIN ERIC-supported project carried out in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments was developed and released in 2021. For 2022 and 2023, the project has been extended to ParlaMint II, again with CLARIN ERIC financial support, in order to enhance the existing corpora with new data and metadata; upgrade the XML schema; add corpora for 10 new parliaments; provide more application scenarios; and carry out additional experiments. The paper reports on these planned steps, including some that have already been taken, and outlines future plans.

How GermaParl Evolves: Improving Data Quality by Reproducible Corpus Preparation and User Involvement
Andreas Blaette | Julia Rakers | Christoph Leonhardt

The development and curation of large-scale corpora of plenary debates requires not only care and attention to detail when the data is created but also effective means of sustainable quality control. This paper makes two contributions: Firstly, it presents an updated version of the GermaParl corpus of parliamentary debates in the German *Bundestag*. Secondly, it shows how the corpus preparation pipeline is designed to serve the quality of the resource by facilitating effective community involvement. Centered around a workflow which combines reproducibility, transparency and version control, the pipeline allows for continuous improvements to the corpus.

Between History and Natural Language Processing: Study, Enrichment and Online Publication of French Parliamentary Debates of the Early Third Republic (1881-1899)
Marie Puren | Aurélien Pellet | Nicolas Bourgeois | Pierre Vernus | Fanny Lebreton

We present the AGODA (Analyse sémantique et Graphes relationnels pour l’Ouverture des Débats à l’Assemblée nationale) project, which aims to create a platform for consulting and exploring digitised French parliamentary debates (1881-1940) available in the digital library of the National Library of France. This project brings together historians and NLP specialists: parliamentary debates are indeed an essential source for French history of the contemporary period, but also for linguistics. This project therefore aims to produce a corpus of texts that can be easily exploited with computational methods and that respects the TEI standard. Historical parliamentary debates are also an excellent case study for the development and application of tools for publishing and exploring large historical corpora. In this paper, we present the steps necessary to produce such a corpus. We detail the processing and publication chain of these documents, in particular the problems linked to the extraction of texts from digitised images. We also introduce the first analyses that we have carried out on this corpus with “bag-of-words” techniques that are not too sensitive to OCR quality (namely topic modelling and word embedding).

A French Corpus of Québec’s Parliamentary Debates
Pierre André Ménard | Desislava Aleksandrova

Parliamentary debates offer a window on political stances as well as a repository of linguistic and semantic knowledge. They provide insights and reasons for laws and regulations that impact electors in their everyday life. One such resource is the transcribed debates available online from the Assemblée Nationale du Québec (ANQ). This paper describes the effort to convert the online ANQ debates from various HTML formats into a standardized ParlaMint TEI annotated corpus and to enrich it with annotations extracted from related unstructured members and political parties list. The resulting resource includes 88 years of debates over a span of 114 years with more than 33.3 billion words. The addition of linguistic annotations is detailed as well as a quantitative analysis of part-of-speech tags and distribution of utterances across the corpus.

Parliamentary Corpora and Research in Political Science and Political History
Luke Blaxill

This keynote reflects on some of the barriers to digitised parliamentary resources achieving greater impact as research tools in political history and political science. As well as providing a view on researchers’ priorities for resource enhancement, I also argue that one of the main challenges for historians and political scientists is simply establishing how to make best use of these datasets through asking new research questions and through understanding and embracing unfamiliar and controversial methods that enable their analysis. I suggest parliamentary resources should be designed and presented to support pioneers trying to publish in often sceptical and traditional fields.

Error Correction Environment for the Polish Parliamentary Corpus
Maciej Ogrodniczuk | Michał Rudolf | Beata Wójtowicz | Sonia Janicka

The paper introduces the environment for detecting and correcting various kinds of errors in the Polish Parliamentary Corpus. After performing a language model-based error detection experiment which resulted in too many false positives, a simpler rule-based method was introduced and is currently used in the process of manual verification of corpus texts. The paper presents types of errors detected in the corpus, the workflow of the correction process and the tools newly implemented for this purpose. To facilitate comparison of a target corpus XML file with its usually graphical PDF source, a new mechanism for inserting PDF page markers into XML was developed and is used for displaying a single source page corresponding to a given place in the resulting XML directly in the error correction environment.

Clustering Similar Amendments at the Italian Senate
Tommaso Agnoloni | Carlo Marchetti | Roberto Battistoni | Giuseppe Briotti

In this paper we describe an experiment for the application of text clustering techniques to dossiers of amendments to proposed legislation discussed in the Italian Senate. The aim is to assist the Senate staff in the detection of groups of amendments similar in their textual formulation in order to schedule their simultaneous voting. Experiments show that the exploitation (extraction, annotation and normalization) of domain features is crucial to improve the clustering performance in many problematic cases not properly dealt with by standard approaches. The similarity engine was implemented and integrated as an experimental feature in the internal application used for the management of amendments in the Senate Assembly and Committees. Thanks to the Open Data strategy pursued by the Senate for several years, all documents and data produced by the institution are publicly available for reuse in open formats.

Entity Linking in the ParlaMint Corpus
Ruben van Heusden | Maarten Marx | Jaap Kamps

The ParlaMint corpus is a multilingual corpus consisting of the parliamentary debates of seventeen European countries over a span of roughly five years. The automatically annotated versions of these corpora provide us with a wealth of linguistic information, including Named Entities. In order to further increase the research opportunities that can be created with this corpus, the linking of Named Entities to a knowledge base is a crucial step. If this can be done successfully and accurately, a lot of additional information can be gathered from the entities, such as political stance and party affiliation, not only within countries but also between the parliaments of different countries. However, due to the nature of the ParlaMint dataset, this entity linking task is challenging. In this paper, we investigate the task of linking entities from ParlaMint in different languages to a knowledge base, and evaluate the performance of three entity linking methods. We use DBpedia Spotlight, Wikidata and YAGO as the entity linking tools, and evaluate them on local politicians from several countries. We discuss two problems that arise with entity linking in the ParlaMint corpus, namely inflection, and aliasing or the existence of name variants in text. This paper provides a first baseline on entity linking performance on multiple multilingual parliamentary debates, describes the problems that occur when attempting to link entities in ParlaMint, and makes a first attempt at tackling the aforementioned problems with existing methods.

Visualizing Parliamentary Speeches as Networks: the DYLEN Tool
Seung-bin Yim | Katharina Wünsche | Asil Cetin | Julia Neidhardt | Andreas Baumann | Tanja Wissik

In this paper, we present a web based interactive visualization tool for lexical networks based on the utterances of Austrian Members of Parliament. The tool is designed to compare two networks in parallel and is composed of graph visualization, node-metrics comparison and time-series comparison components that are interconnected with each other.

Emotions Running High? A Synopsis of the state of Turkish Politics through the ParlaMint Corpus
Gül M. Kurtoğlu Eskişar | Çağrı Çöltekin

We present the initial results of our quantitative study on emotions (Anger, Disgust, Fear, Happiness, Sadness and Surprise) in the Turkish parliament (2011–2021). We use machine learning models to assign emotion scores to all speeches delivered in the parliament during this period, and observe any changes to them in relation to major political and social events in Turkey. We highlight a number of interesting observations, such as anger being the dominant emotion in parliamentary speeches, and the ruling party showing more stable emotions compared to the political opposition, despite its depiction as a populist party in the literature.

Immigration in the Manifestos and Parliament Speeches of Danish Left and Right Wing Parties between 2009 and 2020
Costanza Navarretta | Dorte Haltrup Hansen | Bart Jongejan

The paper presents a study of how seven Danish left and right wing parties addressed immigration in their 2011, 2015 and 2019 manifestos and in their speeches in the Danish Parliament from 2009 to 2020. The annotated manifestos are produced by the Comparative Manifesto Project, while the parliamentary speeches annotated with policy areas (subjects) have been recently released under CLARIN-DK. In the paper, we investigate how often the seven parties addressed immigration in the manifestos and parliamentary debates, and we analyse both datasets after having applied NLP tools to them. A sentiment analysis tool was run on the manifestos and its results were compared with the manifestos’ annotations, while topic modeling was applied to the parliamentary speeches in order to outline central themes in the immigration debates. Many of the resulting topic groups are related to cultural, religious and integration aspects which were heavily debated by politicians and media when discussing immigration policy during the past decade. Our analyses also show differences and similarities between parties and indicate how the 2015 immigrant crisis is reflected in the two types of data. Finally, we discuss advantages and limitations of our quantitative and tool-based analyses.

Parliamentary Discourse Research in Sociology: Literature Review
Jure Skubic | Darja Fišer

One of the major sociological research interests has always been the study of political discourse. This literature review gives an overview of the most prominent topics addressed and the most popular methods used by sociologists. We identify the commonalities and the differences of the approaches established in sociology with corpus-driven approaches in order to establish how parliamentary corpora and corpus-based approaches could be successfully integrated in sociological research. We also highlight how parliamentary corpora could be made even more useful for sociologists. Keywords: parliamentary discourse, sociology, parliamentary corpora

FrameASt: A Framework for Second-level Agenda Setting in Parliamentary Debates through the Lense of Comparative Agenda Topics
Christopher Klamm | Ines Rehbein | Simone Paolo Ponzetto

This paper presents a framework for studying second-level political agenda setting in parliamentary debates, based on the selection of policy topics used by political actors to discuss a specific issue on the parliamentary agenda. For example, the COVID-19 pandemic as an agenda item can be contextualised as a health issue or as a civil rights issue, as a matter of macroeconomics or can be discussed in the context of social welfare. Our framework allows us to observe differences regarding how different parties discuss the same agenda item by emphasizing different topical aspects of the item. We apply and evaluate our framework on data from the German Bundestag and discuss the merits and limitations of our approach. In addition, we present a new annotated data set of parliamentary debates, following the coding schema of policy topics developed in the Comparative Agendas Project (CAP), and release models for topic classification in parliamentary debates.

Comparing Formulaic Language in Human and Machine Translation: Insight from a Parliamentary Corpus
Yves Bestgen

A recent study has shown that, compared to human translations, neural machine translations contain more strongly-associated formulaic sequences made of relatively high-frequency words, but far less strongly-associated formulaic sequences made of relatively rare words. These results were obtained on the basis of translations of quality newspaper articles in which human translations can be thought to be not very literal. The present study attempts to replicate this research using a parliamentary corpus. The results confirm the observations on the news corpus, but the differences are less strong. They suggest that the use of text genres that usually result in more literal translations, such as parliamentary corpora, might be preferable when comparing human and machine translations.

Adding the Basque Parliament Corpus to ParlaMint Project
Jon Alkorta | Mikel Iruskieta Quintian

The aim of this work is to describe the collection created from transcripts of the Basque parliamentary speeches. This corpus follows the constraints of the ParlaMint project. The Basque ParlaMint corpus consists of two versions: the first version reflects what was said in the Basque Parliament, that is, the original bilingual corpus in Basque and in Spanish, to analyse what was said and how, while the second is only in Basque, with the original and translated passages, to promote studies on the content of the parliamentary speeches.

ParlaSpeech-HR - a Freely Available ASR Dataset for Croatian Bootstrapped from the ParlaMint Corpus
Nikola Ljubešić | Danijel Koržinek | Peter Rupnik | Ivo-Pavao Jazbec

This paper presents our bootstrapping efforts to produce the first large freely available Croatian automatic speech recognition (ASR) dataset, 1,816 hours in size, obtained from parliamentary transcripts and recordings from the ParlaMint corpus. The bootstrapping approach to building the dataset relies on a commercial ASR system for initial data alignment, and on building a multilingual-transformer-based ASR system from the initial data for full data alignment. Experiments on the resulting dataset show that the difference between the spoken content and the parliamentary transcripts is present in ~4-5% of words, which is also the word error rate of our best-performing ASR system. Interestingly, fine-tuning transformer models on either normalized or original data does not show a difference in performance. Models pre-trained on a subset of raw speech data consisting of Slavic languages only are shown to perform better than those pre-trained on a wider set of languages. With our public release of data, models and code, we are paving the way forward for the preparation of the multi-modal corpus of Croatian parliamentary proceedings, as well as for the development of similar free datasets, models and corpora for other under-resourced languages.

Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus
Tommaso Agnoloni | Roberto Bartolini | Francesca Frontini | Simonetta Montemagni | Carlo Marchetti | Valeria Quochi | Manuela Ruisi | Giulia Venturi

This paper describes the process of acquisition, cleaning, interpretation, coding and linguistic annotation of a collection of parliamentary debates from the Senate of the Italian Republic covering the COVID-19 period and a former period for reference and comparison, according to the CLARIN ParlaMint guidelines and prescriptions. The corpus contains 1199 sessions and 79,373 speeches, for a total of about 31 million words, and was encoded according to the ParlaCLARIN TEI XML format, as well as in CoNLL-UD format. It includes extensive metadata about the speakers, the sessions, the political parties and Parliamentary groups. As required by the ParlaMint initiative, the corpus was also linguistically annotated for sentences, tokens, POS tags, lemmas and dependency syntax according to the Universal Dependencies guidelines. Named entity classification was also included. All linguistic annotation was performed automatically using state-of-the-art NLP technology with no manual revision. The Italian dataset is freely available as part of the larger ParlaMint 2.1 corpus deposited and archived in the CLARIN repository together with all other national corpora. It is also available for direct analysis and inspection via various CLARIN services and has already been used both for research and educational purposes.

ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions
Baybars Kulebi | Carme Armentano-Oller | Carlos Rodriguez-Penagos | Marta Villegas

Recently, various end-to-end architectures of Automatic Speech Recognition (ASR) are being showcased as an important step towards providing language technologies to all languages instead of a select few such as English. However, many languages are still suffering due to the “digital gap,” lacking the thousands of hours of openly accessible transcribed speech data that are necessary to train modern ASR architectures. Although Catalan already has access to various open speech corpora, these corpora lack diversity and are limited in total volume. In order to address this lack of resources for the Catalan language, in this work we present ParlamentParla, a corpus of more than 600 hours of speech from Catalan Parliament sessions. This corpus has already been used in the training of state-of-the-art ASR systems and proof-of-concept text-to-speech (TTS) models. In this work we explain in detail the pipeline that allows the information publicly available on the parliamentary website to be converted to a speech corpus compatible with the training of ASR and possibly TTS models.

ParlaMint-RO: Chamber of the Eternal Future
Petru Rebeja | Mădălina Chitez | Roxana Rogobete | Andreea Dincă | Loredana Bercuci

The present paper aims to describe the collection of the ParlaMint-RO corpus and to analyse several trends in parliamentary debates (plenary sessions of the Lower House) held between 2000 and 2020. After a short description of the data collection (of existing transcripts), the workflow of data processing (text extraction, conversion, encoding, linguistic annotation), and an overview of the corpus, the paper will move on to a multi-layered linguistic analysis to validate interdisciplinary perspectives. We use computational methods and corpus linguistics approaches to scrutinize the future tense forms used by Romanian speakers, in order to create a data-supported profile of the parliamentary group strategies and planning.