The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Torino, Italia
May, 2024

Volumes

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 1555 papers
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries 14 papers
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024 16 papers
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024 9 papers
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024 34 papers
Proceedings of the Workshop on Cognitive Aspects of the Lexicon @ LREC-COLING 2024 20 papers
Proceedings of the First Workshop on Language-driven Deliberation Technology (DELITE) @ LREC-COLING 2024 8 papers
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024 19 papers
Proceedings of the Workshop on Deep Learning and Linked Data (DLnLD) @ LREC-COLING 2024 9 papers
Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024 18 papers
Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024 16 papers
Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024 9 papers
Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing 35 papers
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024 13 papers
Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024 10 papers
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024 27 papers
Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024 19 papers
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024 16 papers
Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024 12 papers
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024 34 papers
Proceedings of the 2nd Workshop on Mathematical Natural Language Processing @ LREC-COLING 2024 6 papers
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024 28 papers
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024 6 papers
Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024 17 papers
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024 18 papers
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024 26 papers
Proceedings of the Second Workshop on Natural Language Processing for Political Sciences @ LREC-COLING 2024 11 papers
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024 18 papers
Proceedings of the Fifth Workshop on Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments @LREC-COLING 2024 12 papers
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024 10 papers
Proceedings of the First Workshop on Reference, Framing, and Perspective @ LREC-COLING 2024 6 papers
Proceedings of Safety4ConvAI: The Third Workshop on Safety for Conversational AI @ LREC-COLING 2024 6 papers
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources 46 papers
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024 51 papers
Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024 7 papers
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024 18 papers
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024 17 papers
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation 12 papers

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multimodal machine translation (MMT) is a challenging task that seeks to improve translation quality by incorporating visual information. However, recent studies have indicated that the visual information provided by existing MMT datasets is insufficient, causing models to disregard it and overestimate their capabilities. This issue presents a significant obstacle to the development of MMT research. This paper presents a novel solution to this issue by introducing 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese, each with corresponding images. Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets. We utilize a word sense disambiguation model to select ambiguous data from vision-and-language datasets, resulting in a more challenging dataset. We further benchmark several state-of-the-art MMT models on our proposed dataset. Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets. Our work provides a valuable resource for researchers in the field of multimodal learning and encourages further exploration in this area. The data, code and scripts are freely available at https://github.com/MaxyLee/3AM.

A Benchmark Evaluation of Clinical Named Entity Recognition in French
Nesrine Bannour | Christophe Servan | Aurélie Névéol | Xavier Tannier

Background: Transformer-based language models have shown strong performance on many Natural Language Processing (NLP) tasks. Masked Language Models (MLMs) attract sustained interest because they can be adapted to different languages and sub-domains through training or fine-tuning on specific corpora while remaining lighter than modern Large Language Models (LLMs). Recently, several MLMs have been released for the biomedical domain in French, and experiments suggest that they outperform standard French counterparts. However, no systematic evaluation comparing all models on the same corpora is available. Objective: This paper presents an evaluation of masked language models for biomedical French on the task of clinical named entity recognition. Material and methods: We evaluate biomedical models CamemBERT-bio and DrBERT and compare them to standard French models CamemBERT, FlauBERT and FrAlBERT as well as multilingual mBERT using three publicly available corpora for clinical named entity recognition in French. The evaluation set-up relies on gold-standard corpora as released by the corpus developers. Results: Results suggest that CamemBERT-bio outperforms DrBERT consistently while FlauBERT offers competitive performance and FrAlBERT achieves the lowest carbon footprint. Conclusion: This is the first benchmark evaluation of biomedical masked language models for French clinical entity recognition that compares model performance consistently on nested entity recognition using metrics covering performance and environmental impact.

This paper introduces a novel benchmark that has been designed as a test bed for evaluating whether artificial agents are able to understand how to perform everyday activities, with a focus on the cooking domain. Understanding how to cook recipes is a highly challenging endeavour due to the underspecified and grounded nature of recipe texts, combined with the fact that recipe execution is a knowledge-intensive and precise activity. The benchmark comprises a corpus of recipes, a procedural semantic representation language of cooking actions, qualitative and quantitative kitchen simulators, and a standardised evaluation procedure. Concretely, the benchmark task consists in mapping a recipe formulated in natural language to a set of cooking actions that is precise enough to be executed in the simulated kitchen and yields the desired dish. To overcome the challenges inherent to recipe execution, this mapping process needs to incorporate reasoning over the recipe text, the state of the simulated kitchen environment, common-sense knowledge, knowledge of the cooking domain, and the action space of a virtual or robotic chef. This benchmark thereby addresses the growing interest in human-centric systems that combine natural language processing and situated reasoning to perform everyday activities.

ABLE: Agency-BeLiefs Embedding to Address Stereotypical Bias through Awareness Instead of Obliviousness
Michelle YoungJin Kim | Junghwan Kim | Kristen Johnson

Natural Language Processing (NLP) models tend to inherit and amplify stereotypical biases present in their training data, leading to harmful societal consequences. Current efforts to rectify these biases typically revolve around making models oblivious to bias, which is at odds with the idea that humans require increased awareness to tackle these biases better. This prompts a fundamental research question: are bias-oblivious models the only viable solution to combat stereotypical biases? This paper answers this question by proposing the Agency-BeLiefs Embedding (ABLE) model, a novel approach that actively encodes stereotypical biases into the embedding space. ABLE draws upon social psychological theory to acquire and represent stereotypical biases in the form of agency and belief scores rather than directly representing stereotyped groups. Our experimental results showcase ABLE’s effectiveness in learning agency and belief stereotypes while preserving the language model’s proficiency. Furthermore, we underscore the practical significance of incorporating stereotypes within the ABLE model by demonstrating its utility in various downstream tasks. Our approach exemplifies the potential benefits of addressing bias through awareness, as opposed to the prevailing approach of mitigating bias through obliviousness.

Abstractive Multi-Video Captioning: Benchmark Dataset Construction and Extensive Evaluation
Rikito Takahashi | Hirokazu Kiyomaru | Chenhui Chu | Sadao Kurohashi

This paper introduces a new task, abstractive multi-video captioning, which focuses on abstracting multiple videos with natural language. Unlike conventional video captioning tasks, which generate a specific caption for a single video, our task generates an abstract caption of the shared content in a video group containing multiple videos. To address our task, models must learn to understand each video in detail and have strong abstraction abilities to find commonalities among videos. We construct a benchmark dataset for abstractive multi-video captioning named AbstrActs. AbstrActs contains 13.5k video groups and corresponding abstract captions. As abstractive multi-video captioning models, we explore two approaches: end-to-end and cascade. For evaluation, we propose a new metric, CocoA, which evaluates model performance based on the abstractness of the generated captions. In experiments, we report the impact of the method used to combine multiple video features, the overall model architecture, and the number of input videos.

Abstract-level Deductive Reasoning for Pre-trained Language Models
Xin Wu | Yi Cai | Ho-fung Leung

Pre-trained Language Models have been shown to be able to emulate deductive reasoning in natural language. However, PLMs are easily affected by irrelevant information (e.g., entity) in instance-level proofs when learning deductive reasoning. To address this limitation, we propose an Abstract-level Deductive Reasoner (ADR). ADR is trained to predict the abstract reasoning proof of each sample, which guides PLMs to learn general reasoning patterns rather than instance-level knowledge. Experimental results demonstrate that ADR significantly reduces the impact of PLMs learning instance-level knowledge (over 70%).

Text generation with beam search has proven successful in a wide range of applications. We point out that, though largely overlooked in the literature, the commonly-used implementation of beam decoding (e.g., Hugging Face Transformers and fairseq) uses a first come, first served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. Based on this finding, we introduce a patience factor, a simple modification to this beam decoding implementation, that generalizes the stopping criterion and provides flexibility to the depth of search. Empirical results demonstrate that adjusting this patience factor improves decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can be thus readily incorporated in any implementation. Further, we find that different versions of beam decoding result in large performance differences in summarization, demonstrating the need for clarity in specifying the beam search implementation in research work. Our code will be available upon publication.
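The stopping rule described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' code: the toy scoring function `toy_log_probs`, the choice to inspect the top `2 * beam_size` candidates per step, and in particular the form of the generalized criterion, which is assumed here to be "stop once `patience * beam_size` sequences have finished", so that `patience=1.0` reproduces the first come, first served behaviour.

```python
import math

EOS = "</s>"


def toy_log_probs(prefix):
    """Hypothetical stand-in for a model: next-token log-probabilities."""
    # End-of-sequence becomes more likely as the prefix grows (capped at 0.9).
    p_eos = min(0.9, 0.2 * len(prefix))
    p_other = (1.0 - p_eos) / 2
    probs = {"a": p_other, "b": p_other, EOS: p_eos}
    return {tok: math.log(p) for tok, p in probs.items() if p > 0.0}


def beam_search(log_probs_fn, beam_size=2, patience=1.0, max_len=10):
    """Return the finished (sequence, score) pairs, best score first."""
    beams = [((), 0.0)]
    finished = []
    # Assumed form of the generalized stopping criterion.
    limit = patience * beam_size
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates = [
            (prefix + (tok,), score + lp)
            for prefix, score in beams
            for tok, lp in log_probs_fn(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Only the top 2 * beam_size candidates are inspected (a simplification).
        for seq, score in candidates[: 2 * beam_size]:
            if seq[-1] == EOS:
                # First come, first served: completed sequences are collected
                # in the order they appear, and the search stops at the limit.
                finished.append((seq, score))
                if len(finished) >= limit:
                    return sorted(finished, key=lambda c: c[1], reverse=True)
            elif len(beams) < beam_size:
                beams.append((seq, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)
```

In this toy setting, calling `beam_search(toy_log_probs, patience=2.0)` searches one step deeper than `patience=1.0` before stopping, which is the kind of flexibility in search depth the abstract refers to.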

A Canonical Form for Flexible Multiword Expressions
Jan Odijk | Martin Kroon

This paper proposes a canonical form for Multiword Expressions (MWEs), in particular for the Dutch language. The canonical form can be enriched with all kinds of annotations that can be used to describe the properties of the MWE and its components. It also introduces the DUCAME (DUtch CAnonical Multiword Expressions) lexical resource with more than 11k MWEs in canonical form. DUCAME is used in MWE-Finder to automatically generate queries for searching for flexible MWEs in large text corpora.

Empowered by large-scale pretrained language models, existing dialogue systems have demonstrated impressive performance in conducting fluent and natural-sounding conversations. However, they are still plagued by the hallucination problem, which causes unpredictable factual errors in the generated responses. Recently, knowledge-grounded dialogue generation (KGD) models, which intentionally invoke external knowledge resources to produce more informative responses, have also proven effective in reducing hallucination. Following the idea of obtaining high-quality knowledge, a few efforts have achieved good performance on this issue. As some inevitable knowledge noise may also lead to hallucinations, it is urgent to investigate the reasons and future directions for building noise-tolerant methods in KGD tasks. In this paper, we analyze the causal story behind this problem with counterfactual reasoning methods. Based on the causal effect analysis, we propose a possible solution for alleviating hallucination in KGD by exploiting the dialogue-knowledge interaction. Experimental results of our example implementation show that this method can reduce hallucination without disrupting other dialogue performance, while remaining adaptable to different generation models. We hope our efforts can support and call for more attention to developing lightweight techniques towards robust and trustworthy dialogue systems.

Access Control Framework for Language Collections
Ben Foley | Peter Sefton | Simon Musgrave | Moises Sacal Bonequi

This paper introduces the licence-based access control framework developed by the Language Data Commons of Australia (LDaCA) for a range of language collections, with examples given of implementation for significant Indigenous and Australian English collections. Language collections may be curated for many reasons, such as documentation for language revival, for research, security or commercial purposes. Some language collections are created with the intention of being “Open Access”; publicly available with no restriction. Other collections require that access be limited to individuals or groups of people, either at the collection level or at the level of individual items, such as a recording. To facilitate access, while respecting the intended access conditions for a collection, or collection items, some form of user identification and authorisation process is typically required. The access control framework described in this paper is based upon descriptions of access conditions in easy-to-read licences which are stored alongside data files in the collections; and is implemented using identity-based authentication and authorisation systems where required. The framework accommodates accessibility needs from unrestricted to extremely limited access, is dynamic, and able to be modified in response to changes in access needs. Storing licences with the data is a significant development in separating language data and access requirements from access infrastructure.

Previous stance detection studies typically concentrate on evaluating stances within individual instances, thereby exhibiting limitations in effectively modeling multi-party discussions concerning the same specific topic, as naturally transpire in authentic social media interactions. This constraint arises primarily due to the scarcity of datasets that authentically replicate real social media contexts, hindering the research progress of conversational stance detection. In this paper, we introduce a new multi-turn conversation stance detection dataset (called MT-CSD), which encompasses multiple targets for conversational stance detection. To derive stances from this challenging dataset, we propose a global-local attention network (GLAN) to address both long and short-range dependencies inherent in conversational data. Notably, even state-of-the-art stance detection methods, exemplified by GLAN, exhibit an accuracy of only 50.47%, highlighting the persistent challenges in conversational stance detection. Furthermore, our MT-CSD dataset serves as a valuable resource to catalyze advancements in cross-domain stance detection, where a classifier is adapted from a different yet related target. We believe that MT-CSD will contribute to advancing real-world applications of stance detection research. Our source code, data, and models are available at https://github.com/nfq729/MT-CSD.

A Closer Look at Clustering Bilingual Comparable Corpora
Anna Laskina | Eric Gaussier | Gaelle Calvary

We study in this paper the problem of clustering comparable corpora, building upon the observation that different types of clusters can be present in such corpora: monolingual clusters comprising documents in a single language, and bilingual or multilingual clusters comprising documents written in different languages. Based on a state-of-the-art deep variant of Kmeans, we propose new clustering models fully adapted to comparable corpora and illustrate their behavior on several bilingual collections (in English, French, German and Russian) created from Wikipedia.

AcnEmpathize: A Dataset for Understanding Empathy in Dermatology Conversations
Gyeongeun Lee | Natalie Parde

Empathy is critical for effective communication and mental health support, and in many online health communities people anonymously engage in conversations to seek and provide empathetic support. The ability to automatically recognize and detect empathy contributes to the understanding of human emotions expressed in text, therefore advancing natural language understanding across various domains. Existing empathy and mental health-related corpora focus on broader contexts and lack domain specificity, but similarly to other tasks (e.g., learning distinct patterns associated with COVID-19 versus skin allergies in clinical notes), observing empathy within different domains is crucial to providing tailored support. To address this need, we introduce AcnEmpathize, a dataset that captures empathy expressed in acne-related discussions from forum posts focused on its emotional and psychological effects. We find that transformer-based models trained on our dataset demonstrate excellent performance at empathy classification. Our dataset is publicly released to facilitate analysis of domain-specific empathy in online conversations and advance research in this challenging and intriguing domain.

A Collection of Pragmatic-Similarity Judgments over Spoken Dialog Utterances
Nigel Ward | Divette Marco

Automatic measures of similarity between sentences or utterances are invaluable for training speech synthesizers, evaluating machine translation, and assessing learner productions. While there exist measures for semantic similarity and prosodic similarity, there are as yet none for pragmatic similarity. To enable the training of such measures, we developed the first collection of human judgments of pragmatic similarity between utterance pairs. Nine judges listened to 220 utterance pairs, each consisting of an utterance extracted from a recorded dialog and a re-enactment of that utterance under various conditions designed to create various degrees of similarity. Each pair was rated on a continuous scale. The average inter-judge correlation was 0.45. We make this data available at https://github.com/divettemarco/PragSim.

A Community-Driven Data-to-Text Platform for Football Match Summaries
Pedro Fernandes | Sérgio Nunes | Luís Santos

Data-to-text systems offer a transformative approach to generating textual content in data-rich environments. This paper describes the architecture and deployment of Prosebot, a community-driven data-to-text platform tailored for generating textual summaries of football matches derived from match statistics. The system enhances the visibility of lower-tier matches, traditionally accessible only through data tables. Prosebot uses a template-based Natural Language Generation (NLG) module to generate initial drafts, which are subsequently refined by the reading community. Comprehensive evaluations, encompassing both human-mediated and automated assessments, were conducted to assess the system’s efficacy. Analysis of the community-edited texts reveals that significant segments of the initial automated drafts are retained, suggesting their high quality and acceptance by the collaborators. Preliminary surveys conducted among platform users highlight a predominantly positive reception within the community.

A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking the Privacy-Utility Trade-off
Stephen Meisenbacher | Nihildev Nandakumar | Alexandra Klymenko | Florian Matthes

The application of Differential Privacy to Natural Language Processing techniques has grown in relevance in recent years, with an increasing number of studies published in established NLP outlets. In particular, the adaptation of Differential Privacy for use in NLP tasks first focused on the word level, where calibrated noise is added to word embedding vectors to achieve “noisy” representations. To this end, several implementations have appeared in the literature, each presenting an alternative method of achieving word-level Differential Privacy. Although each of these includes its own evaluation, no comparative analysis has been performed to investigate the performance of such methods relative to each other. In this work, we conduct such an analysis, comparing seven different algorithms on two NLP tasks with varying hyperparameters, including the epsilon parameter, or privacy budget. In addition, we provide an in-depth analysis of the results with a focus on the privacy-utility trade-off, and we open-source our implementation code for further reproduction. As a result of our analysis, we give insight into the benefits and challenges of word-level Differential Privacy, and accordingly, we suggest concrete steps forward for the research field.
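To make "calibrated noise is added to word embedding vectors" concrete, here is a minimal sketch of one commonly described word-level mechanism. It draws noise with density proportional to exp(-epsilon * ||z||), adds it to the word's vector, and maps the result back to the nearest word in the vocabulary. It is not claimed to be one of the seven algorithms compared in the paper, and the two-dimensional `embeddings` table and the function names are hypothetical.

```python
import math
import random


def sample_noise(dim, epsilon, rng):
    """Sample noise with density proportional to exp(-epsilon * ||z||)."""
    # Uniform random direction: normalise a standard Gaussian vector.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    # Magnitude follows a Gamma distribution with shape dim, scale 1/epsilon.
    magnitude = rng.gammavariate(dim, 1.0 / epsilon)
    return [magnitude * x / norm for x in direction]


def privatize_word(word, embeddings, epsilon, rng):
    """Perturb the word's vector, then return the nearest vocabulary word."""
    vector = embeddings[word]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + n for v, n in zip(vector, noise)]
    return min(embeddings, key=lambda w: math.dist(embeddings[w], noisy))


# Hypothetical toy embedding table, for illustration only.
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.3],
    "car": [-1.0, 0.2],
}
```

A smaller `epsilon` (a tighter privacy budget) yields larger noise, so the returned word differs from the input more often; a very large `epsilon` leaves the word essentially unchanged. This is the privacy-utility trade-off in miniature.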

While extensive work has examined the explicit and implicit biases in large language models (LLMs), little research explores the relation between these two types of biases. This paper presents a comparative study of the explicit and implicit biases in LLMs grounded in social psychology. Social psychology distinguishes between explicit and implicit biases by whether the bias can be self-recognized by individuals. Aligning with this conceptualization, we propose a self-evaluation-based two-stage measurement of explicit and implicit biases within LLMs. First, the LLM is prompted to automatically fill templates with social targets to measure implicit bias toward these targets, where the bias is less likely to be self-recognized by the LLM. Then, the LLM is prompted to self-evaluate the templates filled by itself to measure explicit bias toward the same targets, where the bias is more likely to be self-recognized by the LLM. Experiments conducted on state-of-the-art LLMs reveal human-like inconsistency between explicit and implicit occupational gender biases. This work bridges a critical gap where prior studies concentrate solely on either explicit or implicit bias. We advocate that future work highlight the relation between explicit and implicit biases in LLMs.

Dehumanisation involves the perception and/or treatment of a social group’s members as less than human. This phenomenon is rarely addressed with computational linguistic techniques. We adapt a recently proposed approach for English, making it easier to transfer to other languages and to evaluate, introducing a new sentiment resource, the use of zero-shot cross-lingual valence and arousal detection, and a new method for statistical significance testing. We then apply it to study attitudes to migration expressed in Slovene newspapers, to examine changes in the Slovene discourse on migration between the 2015-16 migration crisis following the war in Syria and the 2022-23 period following the war in Ukraine. We find that while this discourse became more negative and more intense over time, it is less dehumanising when specifically addressing Ukrainian migrants compared to others.

A Computational Approach to Quantifying Grammaticization of English Deverbal Prepositions
Ryo Nagata | Yoshifumi Kawasaki | Naoki Otani | Hiroya Takamura

This paper explores the grammaticization of deverbal prepositions through a computational approach based on corpus data. Deverbal prepositions are words or phrases, such as “regarding” and “according to”, that are derived from a verb and behave as prepositions. Linguistic studies have revealed important aspects of the grammaticization of deverbal prepositions. This paper augments them with methods for measuring the degree of grammaticization of deverbal prepositions based on non-contextualized or contextualized word vectors. Experiments show that the methods correlate well with human judgements (as high as 0.69 in Spearman’s rank correlation coefficient). Using the best-performing method, this paper further shows that the methods support previous findings in linguistics, including that (i) deverbal prepositions are marginal in terms of prepositionality; and (ii) the process by which verbs are grammaticized into prepositions is gradual. As a pilot study, it also conducts a diachronic analysis of the grammaticization of deverbal prepositions.
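For readers unfamiliar with the evaluation measure mentioned above, Spearman's rank correlation coefficient is the Pearson correlation computed on ranks rather than raw values. The sketch below is a generic implementation with average ranks for ties. It is not the authors' evaluation code.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks matter, a method's scores need not be on the same scale as the human judgements. Any monotonically increasing relationship gives a coefficient of 1.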

A Computational Model of Latvian Morphology
Peteris Paikens | Lauma Pretkalniņa | Laura Rituma

In this paper we describe a computational model of Latvian morphology that provides a formal structure for Latvian word form inflection and has been implemented in software for generation, analysis and lemmatization of Latvian word forms. The work was motivated by the need for an NLP inflection model that can cover all the complexity of the Latvian language and explicitly enumerate and handle the many exceptions to the general Latvian inflection principles. This is an evolution of earlier work, extending the initial proof-of-concept model to properly cover the Latvian language. We provide a set of morphological paradigms that differ from current linguistic tradition and a set of systematic stem changes, and combine them with an extensive lexicon that includes paradigm information and structured morphological attributes for 118 000 lexemes. This model has been applied to both dictionary and corpus data, demonstrating that it provides good coverage of the modern Latvian literary language. We also see good potential to extend it to the related Latgalian language.

A Concept Based Approach for Translation of Medical Dialogues into Pictographs
Johanna Gerlach | Pierrette Bouillon | Jonathan Mutal | Hervé Spechbach

Pictographs have been found to improve patient comprehension of medical information or instructions. However, tools to produce pictograph representations from natural language are still scarce. In this contribution we describe a system that automatically translates French speech into pictographs to enable diagnostic interviews in emergency settings, thereby providing a tool to overcome the language barrier or provide support in Augmentative and Alternative Communication (AAC) contexts. Our approach is based on a semantic gloss that serves as pivot between spontaneous language and pictographs, with medical concepts represented using the UMLS ontology. In this study we evaluate different available pre-trained models fine-tuned on artificial data to translate French into this semantic gloss. On unseen data collected in real settings, consisting of questions and instructions by physicians, the best model achieves an F0.5 score of 86.7. A complementary human evaluation of the semantic glosses differing from the reference shows that 71% of these would be usable to transmit the intended meaning. Finally, a human evaluation of the pictograph sequences derived from the gloss reveals very few additions, omissions or order issues (<3%), suggesting that the gloss as designed is well suited as a pivot for translation into pictographs.
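The F0.5 score reported above is an instance of the general F-beta measure, which weights precision more heavily than recall when beta is below 1. A minimal sketch of the standard formula follows. The counts in the example are hypothetical, and how the paper derives true and false positives from the semantic gloss is not specified here.

```python
def f_beta(true_pos, false_pos, false_neg, beta=0.5):
    """F-beta from counts: (1 + b^2) * P * R / (b^2 * P + R)."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision 0.8, recall 0.5.
example_f05 = f_beta(8, 2, 8, beta=0.5)
example_f1 = f_beta(8, 2, 8, beta=1.0)
```

With these counts the F0.5 value is higher than the F1 value, because the measure rewards the higher precision and penalises the lower recall less.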

A Construction Grammar Corpus of Varying Schematicity: A Dataset for the Evaluation of Abstractions in Language Models
Claire Bonial | Harish Tayyar Madabushi

Large Language Models (LLMs) have been developed without a theoretical framework, yet we posit that evaluating and improving LLMs will benefit from the development of theoretical frameworks that enable comparison of the structures of human language and the model of language built up by LLMs through the processing of text. In service of this goal, we develop the Construction Grammar Schematicity (“CoGS”) corpus of 10 distinct English constructions, where the constructions vary with respect to schematicity, or in other words the level to which constructional slots require specific, fixed lexical items, or can be filled with a variety of elements that fulfill a particular semantic role of the slot. Our corpus constructions are carefully curated to range from substantive, frozen constructions (e.g., Let-alone) to entirely schematic constructions (e.g., Resultative). The corpus was collected to allow us to probe LLMs for constructional information at varying levels of abstraction. We present our own probing experiments using this corpus, which clearly demonstrate that even the largest LLMs are limited to more substantive constructions and do not exhibit recognition of the similarity of purely schematic constructions. We publicly release our dataset, prompts, and associated model responses.

A Controlled Reevaluation of Coreference Resolution Models
Ian Porada | Xiyuan Zou | Jackie Chi Kit Cheung

All state-of-the-art coreference resolution (CR) models involve finetuning a pretrained language model. Whether the superior performance of one CR model over another is due to the choice of language model or other factors, such as the task-specific architecture, is difficult or impossible to determine due to lack of a standardized experimental setup. To resolve this ambiguity, we systematically evaluate five CR models and control for certain design decisions including the pretrained language model used by each. When controlling for language model size, encoder-based CR models outperform more recent decoder-based models in terms of both accuracy and inference speed. Surprisingly, among encoder-based CR models, more recent models are not always more accurate, and the oldest CR model that we test generalizes the best to out-of-domain textual genres. We conclude that controlling for the choice of language model reduces most, but not all, of the increase in F1 score reported in the past five years.

A Corpus and Method for Chinese Named Entity Recognition in Manufacturing
Ruiting Li | Peiyan Wang | Libang Wang | Danqingxin Yang | Dongfeng Cai

Manufacturing specifications are documents entailing different techniques, processes, and components involved in manufacturing. With the development of smart manufacturing, there is a growing demand for named entity recognition (NER) resources and techniques for manufacturing-specific named entities. In this paper, we introduce a corpus of Chinese manufacturing specifications, named MS-NERC, including 4,424 sentences and 16,383 entities. We also propose an entity recognizer named Trainable State Transducer (TST), which is initialized with a finite state transducer describing the morphological patterns of entities. It can directly recognize entities based on prior morphological knowledge without training. Experimental results show that TST achieves an overall 82.05% F1 score for morphological-specific entities in the zero-shot setting. TST can be improved through training, the result of which outperforms neural methods in few-shot and rich-resource settings. We believe that our corpus and model will be valuable resources for NER research not only in manufacturing but also in other low-resource domains.

We develop novel annotation guidelines for sentence-level subjectivity detection, which are not limited to language-specific cues. We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics. Our corpus paves the way for subjectivity detection in English and across other languages without relying on language-specific tools, such as lexicons or machine translation. We evaluate state-of-the-art multilingual transformer-based models on the task in mono-, multi-, and cross-language settings. For this purpose, we re-annotate an existing Italian corpus. We observe that models trained in the multilingual setting achieve the best performance on the task.

A Corpus of German Abstract Meaning Representation (DeAMR)
Christoph Otto | Jonas Groschwitz | Alexander Koller | Xiulin Yang | Lucia Donatelli

We present the first comprehensive set of guidelines for German Abstract Meaning Representation (Deutsche AMR, DeAMR) along with an annotated corpus of 400 DeAMR. Taking English AMR (EnAMR) as our starting point, we propose significant adaptations to faithfully represent the structure and semantics of German, focusing particularly on verb frames, compound words, and modality. We validate our annotation through inter-annotator agreement and further evaluate our corpus with a comparison of structural divergences between EnAMR and DeAMR on parallel sentences, replicating previous work that finds both cases of cross-lingual structural alignment and cases of meaningful linguistic divergence. Finally, we fine-tune state-of-the-art multi-lingual and cross-lingual AMR parsers on our corpus and find that, while our small corpus is insufficient to produce quality output, there is a need to continue to develop and evaluate against gold non-English AMR data.

A Corpus of Spontaneous L2 English Speech for Real-situation Speaking Assessment
Sylvain Coulange | Marie-Hélène Fries | Monica Masperi | Solange Rossato

When assessing second language (L2) proficiency, evaluation of spontaneous speech performance is crucial. This paper presents a corpus of spontaneous L2 English speech, focusing on the speech performance of B1 and B2 proficiency speakers. Two hundred and sixty university students were recorded during a speaking task as part of a French national certificate in English. This task entailed a 10-minute role-play among 2 or 3 candidates, arguing about a controversial topic, in order to reach a negotiated compromise. Each student’s performance was evaluated by two experts, categorizing them into B2, B1 or below B1 speaking proficiency levels. Automatic diarization, transcription, and alignment at the word level were performed on the recorded conversations, in order to analyse lexical stress realisation in polysyllabic plain words of B1 and B2 proficiency students. Results showed that only 35.4% of the 6,350 targeted words had stress detected on the expected syllable, revealing a common stress shift to the final syllable. Besides a substantial inter-speaker variability (0% to 68.4%), B2 speakers demonstrated a slightly higher stress accuracy (36%) compared to B1 speakers (29.6%). Those with accurate stress placement utilized F0 and intensity to mark syllable prominence, while speakers with lower accuracy tended to lengthen words on their last syllables, with minimal changes in other dimensions.

Action and Reaction Go Hand in Hand! a Multi-modal Dialogue Act Aided Sarcasm Identification
Mohit Singh Tomar | Tulika Saha | Abhisek Tiwari | Sriparna Saha

Sarcasm primarily involves saying something but “meaning the opposite” or “meaning something completely different” in order to convey a particular tone or mood. In both the above cases, the “meaning” is reflected by the communicative intention of the speaker, known as dialogue acts. In this paper, we seek to investigate a novel phenomenon of analyzing sarcasm in the context of dialogue acts with the hypothesis that the latter helps to understand the former better. Toward this aim, we extend the multi-modal MUStARD dataset to enclose dialogue acts for each dialogue. To demonstrate the utility of our hypothesis, we develop a dialogue act-aided multi-modal transformer network for sarcasm identification (MM-SARDAC), leveraging interrelation between these tasks. In addition, we introduce an order-infused, multi-modal infusion mechanism into our proposed model, which allows for a more intuitive combined modality representation by selectively focusing on relevant modalities in an ordered manner. Extensive empirical results indicate that dialogue act-aided sarcasm identification achieved better performance compared to performing sarcasm identification alone. The dataset and code are available at https://github.com/mohit2b/MM-SARDAC.

Sign language is the primary communication medium for people who are deaf or have hearing loss. However, given the divergent range of sensory abilities of these individuals, there is a communication gap that needs to be addressed. In this paper, we present action-concentrated embedding (ACE), which is a novel sign token embedding framework. Additionally, to provide a more structured foundation for sign language analysis, we introduce a dedicated notation system tailored for sign language that endeavors to encapsulate the nuanced gestures and movements that are integral to sign communication. The proposed ACE approach tracks a signer’s actions based on human posture estimation. Tokenizing these actions and capturing the token embedding using a short-time Fourier transform encapsulates the time-based behavioral changes. Hence, ACE offers input embedding to translate sign language into natural language sentences. When tested against a disaster sign language dataset using automated machine translation measures, ACE notably surpasses prior research in terms of translation capabilities, improving the performance by up to 5.79% for BLEU-4 and 5.46% for the ROUGE-L metric.

Active Learning Design Choices for NER with Transformers
Robert Vacareanu | Enrique Noriega-Atala | Gus Hahn-Powell | Marco A. Valenzuela-Escarcega | Mihai Surdeanu

We explore multiple important choices that have not been analyzed in conjunction regarding active learning for token classification using transformer networks. These choices are: (i) how to select what to annotate, (ii) whether to annotate entire sentences or smaller sentence fragments, (iii) how to train with incomplete annotations at token-level, and (iv) how to select the initial seed dataset. We explore whether annotating at sub-sentence level can translate to an improved downstream performance by considering two different sub-sentence annotation strategies: (i) entity-level, and (ii) token-level. These approaches result in some sentences being only partially annotated. To address this issue, we introduce and evaluate multiple strategies to deal with partially-annotated sentences during the training process. We show that annotating at the sub-sentence level achieves comparable or better performance than sentence-level annotations with a smaller number of annotated tokens. We then explore the extent to which the performance gap remains once accounting for the annotation time and find that both annotation schemes perform similarly.

We present and describe two language resources in this paper: CATalog 1.0, the largest text corpus in Catalan to date, and CURATE (Corpus Utility for RAting TExt), a modular, parallelizable pipeline used for processing and scoring documents based on text quality that we have optimised to run in High Performance Cluster (HPC) environments. In the coming sections we describe our data preprocessing pipeline at length; traditional pipelines usually implement a set of binary filters such that a given document is either in or out. In our experience with Catalan, in lower-resource settings it is more practical to instead assign a document a soft score to allow for more flexible decision-making. We describe how the document score is calculated and highlight its interpretability by showing that it is significantly correlated with human judgements as obtained from a comparative judgement experiment. We additionally describe the different subcorpora that make up CATalog 1.0.

AdaKron: An Adapter-based Parameter Efficient Model Tuning with Kronecker Product
Marco Braga | Alessandro Raganato | Gabriella Pasi

The fine-tuning paradigm has been widely adopted to train neural models tailored for specific tasks. However, the recent upsurge of Large Language Models (LLMs), characterized by billions of parameters, has introduced profound computational challenges to the fine-tuning process. This has fueled intensive research on Parameter-Efficient Fine-Tuning (PEFT) techniques, usually involving the training of a selective subset of the original model parameters. One of the most used approaches is Adapters, which add trainable lightweight layers to the existing pretrained weights. Within this context, we propose AdaKron, an Adapter-based fine-tuning with the Kronecker product. In particular, we leverage the Kronecker product to combine the output of two small networks, resulting in a final vector whose dimension is the product of the dimensions of the individual outputs, allowing us to train only 0.55% of the model’s original parameters. We evaluate AdaKron performing a series of experiments on the General Language Understanding Evaluation (GLUE) benchmark, achieving results in the same ballpark as recent state-of-the-art PEFT methods, despite training fewer parameters.
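
The dimension property mentioned above can be illustrated with a minimal sketch: the Kronecker product of two vectors of sizes m and n is a vector of size m * n. The helper name `kron_vec` and the toy vectors are assumptions made for illustration; this is not the AdaKron architecture itself, only the operation it is said to build on.

```python
def kron_vec(u, v):
    """Kronecker product of two vectors: every element of u times every element of v."""
    return [a * b for a in u for b in v]


# Toy outputs of two hypothetical small networks (sizes 2 and 3).
out_a = [1.0, 2.0]
out_b = [3.0, 4.0, 5.0]

combined = kron_vec(out_a, out_b)
# combined has len(out_a) * len(out_b) = 6 entries:
# [3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
```

The point of the sketch is that two small outputs can be expanded into a larger vector without any extra trainable parameters in the combining step.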

Adaptive Reinforcement Tuning Language Models as Hard Data Generators for Sentence Representation
Bo Xu | Yifei Wu | Shouang Wei | Ming Du | Hongya Wang

Sentence representation learning is a fundamental task in NLP. Existing methods use contrastive learning (CL) to learn effective sentence representations, which benefit from high-quality contrastive data but require extensive human annotation. Large language models (LLMs) like ChatGPT and GPT4 can automatically generate such data. However, this alternative strategy also encounters challenges: 1) obtaining high-quality generated data from small-parameter LLMs is difficult, and 2) inefficient utilization of the generated data. To address these challenges, we propose a novel adaptive reinforcement tuning (ART) framework. Specifically, to address the first challenge, we introduce a reinforcement learning approach for fine-tuning small-parameter LLMs, enabling the generation of high-quality hard contrastive data without human feedback. To address the second challenge, we propose an adaptive iterative framework to guide the small-parameter LLMs to generate progressively harder samples through multiple iterations, thereby maximizing the utility of generated data. Experiments conducted on seven semantic text similarity tasks demonstrate that the sentence representation models trained using the synthetic data generated by our proposed method achieve state-of-the-art performance. Our code is available at https://github.com/WuNein/AdaptCL.

Traditional non-simultaneous Sign Language Translation (SLT) methods, while effective for pre-recorded videos, face challenges in real-time scenarios due to inherent inference delays. The emerging field of simultaneous SLT aims to address this issue by progressively translating incrementally received sign video. However, the sole existing work in simultaneous SLT adopts a fixed gloss-based policy, which suffers from limitations in boundary prediction and contextual comprehension. In this paper, we delve deeper into this area and propose an adaptive policy for simultaneous SLT. Our approach introduces the concept of “confident translation length”, denoting the maximum accurate translation achievable from the current input. An estimator measures this length for streaming sign video, enabling the model to make informed decisions on whether to wait for more input or proceed with translation. To train the estimator, we construct training data of confident translation length based on the longest common prefix between translations of partial and complete inputs. Furthermore, we incorporate adaptive training, utilizing pseudo prefix pairs, to refine the offline translation model for optimal performance in simultaneous scenarios. Experimental results on PHOENIX2014T and CSL-Daily demonstrate the superiority of our adaptive policy over existing methods, particularly excelling in situations requiring extremely low latency.
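
The longest-common-prefix labelling described above can be sketched as follows. This is a minimal illustration under stated assumptions: translations are compared token by token after whitespace splitting, and the function name and example sentences are hypothetical rather than taken from the paper.

```python
def confident_translation_length(partial_translation, full_translation):
    """Number of leading tokens shared by the two token sequences."""
    length = 0
    for partial_tok, full_tok in zip(partial_translation, full_translation):
        if partial_tok != full_tok:
            break
        length += 1
    return length


# Hypothetical translations of a partial and a complete sign video.
partial = "the weather is".split()
full = "the weather will be sunny tomorrow".split()

label = confident_translation_length(partial, full)  # 2 shared leading tokens
```

Under this reading, the label says how many tokens of the partial translation would survive unchanged once the full input has been seen.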

A Dataset for Named Entity Recognition and Entity Linking in Chinese Historical Newspapers
Baptiste Blouin | Cécile Armand | Christian Henriot

In this study, we present a novel historical Chinese dataset for named entity recognition, entity linking, coreference and entity relations. We use data from Chinese newspapers from 1872 to 1949 and multilingual bibliographic resources from the same period. The period and the language are the main strength of the present work, offering a resource which covers different styles and language uses, as well as the largest historical Chinese NER dataset with manual annotations from this transitional period. After detailing the selection and annotation process, we present the very first results that can be obtained from this dataset. Texts and annotations are freely downloadable from the GitHub repository.

User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.

Adding SPICE to Life: Speaker Profiling in Multiparty Conversations
Shivani Kumar | Rishabh Gupta | Md. Shad Akhtar | Tanmoy Chakraborty

In the realm of conversational dynamics, individual idiosyncrasies challenge the suitability of a one-size-fits-all approach for dialogue agent responses. Prior studies often assumed the speaker’s persona’s immediate availability, a premise not universally applicable. To address this gap, we explore the Speaker Profiling in Conversations (SPC) task, aiming to synthesize persona attributes for each dialogue participant. SPC comprises three core subtasks: persona discovery, persona-type identification, and persona-value extraction. The first subtask identifies persona-related utterances, the second classifies specific attributes, and the third extracts precise values for the persona. To confront this multifaceted challenge, we’ve diligently compiled SPICE, an annotated dataset, underpinning our thorough evaluation of diverse baseline models. Additionally, we benchmark these findings against our innovative neural model, SPOT, presenting an exhaustive analysis encompassing a nuanced assessment of quantitative and qualitative merits and limitations.

ADEA: An Argumentative Dialogue Dataset on Ethical Issues Concerning Future A.I. Applications
Christian Hauptmann | Adrian Krenzer | Antonia Krause | Frank Puppe

Introducing ADEA: a German dataset that captures online dialogues and focuses on ethical issues related to future AI applications. This dataset, which includes over 2800 labeled user utterances on four different topics, is specifically designed for the training of chatbots that can navigate the complexities of real-world ethical AI conversations. The creation of these dialogues is the result of two carefully conducted studies in which university students interacted with an argumentative dialogue system. A fundamental part of our methodology is the use of German argument graphs. These graphs not only form the knowledge base of the dialogue system but also serve as an effective annotation scheme for the dialogues. Apart from the introduction of the dataset and the argument graphs, we provide a preliminary benchmark using GPT-4 via the OpenAI API. This provides researchers with a concrete reference point while demonstrating the potential of our dataset. We make our dataset and argument graphs available at https://github.com/HaupChris/ADEA-Dialogue-Dataset.

A Decade of Scholarly Research on Open Knowledge Graphs
Houcemeddine Turki | Abraham Toluwase Owodunni | Mohamed Ali Hadj Taieb | René Fabrice Bile | Mohamed Ben Aouicha

The proliferation of open knowledge graphs has led to a surge in scholarly research on the topic over the past decade. This paper presents a bibliometric analysis of the scholarly literature on open knowledge graphs published between 2013 and 2023. The study aims to identify the trends, patterns, and impact of research in this field, as well as the key topics and research questions that have emerged. The work uses bibliometric techniques to analyze a sample of 4445 scholarly articles retrieved from Scopus. The findings reveal an ever-increasing number of publications on open knowledge graphs published every year, particularly in developed countries (+50 per year). These outputs are published in highly-referred scholarly journals and conferences. The study identifies three main research themes: (1) knowledge graph construction and enrichment, (2) evaluation and reuse, and (3) fusion of knowledge graphs into NLP systems. Within these themes, the study identifies specific tasks that have received considerable attention, including entity linking, knowledge graph embedding, and graph neural networks.

A Differentiable Integer Linear Programming Solver for Explanation-Based Natural Language Inference
Mokanarangan Thayaparan | Marco Valentino | André Freitas

Integer Linear Programming (ILP) has been proposed as a formalism for encoding precise structural and semantic constraints for Natural Language Inference (NLI). However, traditional ILP frameworks are non-differentiable, posing critical challenges for the integration of continuous language representations based on deep learning. In this paper, we introduce a novel approach, named Diff-Comb Explainer, a neuro-symbolic architecture for explanation-based NLI based on Differentiable BlackBox Combinatorial Solvers (DBCS). Differently from existing neuro-symbolic solvers, Diff-Comb Explainer does not necessitate a continuous relaxation of the semantic constraints, enabling a direct, more precise, and efficient incorporation of neural representations into the ILP formulation. Our experiments demonstrate that Diff-Comb Explainer achieves superior performance when compared to conventional ILP solvers, neuro-symbolic black-box solvers, and Transformer-based encoders. Moreover, a deeper analysis reveals that Diff-Comb Explainer can significantly improve the precision, consistency, and faithfulness of the constructed explanations, opening new opportunities for research on neuro-symbolic architectures for explainable and transparent NLI in complex domains.

A Document-Level Text Simplification Dataset for Japanese
Yoshinari Nagai | Teruaki Oka | Mamoru Komachi

Document-level text simplification, a task that combines single-document summarization and intra-sentence simplification, has garnered significant attention. However, studies have primarily focused on languages such as English and German, leaving Japanese and similar languages underexplored because of a scarcity of linguistic resources. In this study, we devised JADOS, the first Japanese document-level text simplification dataset based on newspaper articles and Wikipedia. Our dataset focuses on simplification, to enhance readability by reducing the number of sentences and tokens in a document. We conducted investigations using our dataset. Firstly, we analyzed the characteristics of Japanese simplification by comparing it across different domains and with English counterparts. Moreover, we experimentally evaluated the performances of text summarization methods, transformer-based text simplification models, and large language models. In terms of D-SARI scores, the transformer-based models performed best across all domains. Finally, we manually evaluated several model outputs and target articles, demonstrating the need for document-level text simplification models in Japanese.

A Dual-View Approach to Classifying Radiology Reports by Co-Training
Yutong Han | Yan Yuan | Lili Mou

Radiology report analysis provides valuable information that can aid with public health initiatives, and has been attracting increasing attention from the research community. In this work, we present a novel insight that the structure of a radiology report (namely, the Findings and Impression sections) offers different views of a radiology scan. Based on this intuition, we further propose a co-training approach, where two machine learning models are built upon the Findings and Impression sections, respectively, and use each other’s information to boost performance with massive unlabeled data in a semi-supervised manner. We conducted experiments in a public health surveillance study, and results show that our co-training approach is able to improve performance using the dual views and surpass competing supervised and semi-supervised methods.

Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms
Wonkee Lee | Seong-Hwan Heo | Jong-Hyeok Lee

Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create high-quality synthetic data. Given that APE takes as input a machine-translation result that might include errors, we present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data. We introduce a noising-based data-synthesis method by adapting the masked language model approach, generating a noisy text from a clean text by infilling masked tokens with erroneous tokens. Moreover, we propose selective corpus interleaving that combines two separate synthetic datasets by taking only the advantageous samples to enhance the quality of the synthetic data further. Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.

Topic segmentation and outline generation strive to divide a document into coherent topic sections and generate corresponding subheadings, unveiling the discourse topic structure of a document. Compared with sentence-level topic structure, paragraph-level topic structure makes it possible to quickly grasp and understand the overall context of the document from a higher level, benefitting many downstream tasks such as summarization, discourse parsing, and information retrieval. However, the lack of large-scale, high-quality Chinese paragraph-level topic structure corpora has restrained related research and applications. To fill this gap, we build the Chinese paragraph-level topic representation, corpus, and benchmark in this paper. Firstly, we propose a hierarchical paragraph-level topic structure representation with three layers to guide the corpus construction. Then, we employ a two-stage man-machine collaborative annotation method to construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), achieving high quality. We also build several strong baselines, including ChatGPT, to validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation) and preliminarily verify its usefulness for the downstream task (discourse parsing).

Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, developing such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures. We provide a report on the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we aim to broaden the scope of the NLP research directions and enable the development of industrial solutions for the Russian language.

With an auxiliary corpus (non-target speaker corpus) for model pre-training, Text-to-Speech (TTS) methods can generate high-quality speech with a limited target speaker corpus. However, this approach comes with expensive training costs. To overcome the challenge, a high-quality TTS method is proposed, significantly reducing training costs while maintaining the naturalness of synthesized speech. In this paper, we propose an auxiliary corpus compression algorithm that reduces the training cost while the naturalness of the synthesized speech is not significantly degraded. We then use the compressed corpus to pre-train the proposed TTS model CMDTTS, which fuses phoneme and word multi-level prosody modeling components and denoises the generated mel-spectrograms using denoising diffusion probabilistic models (DDPMs). In addition, a fine-tuning step using a conditional generative adversarial network (cGAN) is introduced to embed the target speaker feature and improve speech quality using the target speaker corpus. Experiments are conducted on Chinese and English single-speaker corpora, and the results show that the method effectively balances the model training speed and the synthesized speech quality and outperforms the current models.

We introduce a frustratingly simple, highly efficient, and surprisingly effective decoding method, termed Frustratingly Simple Decoding (FSD), for neural text generation. The idea behind FSD is straightforward: We construct an anti-language model (anti-LM) based on previously generated text, which is employed to penalize the future generation of repetitive content. The anti-LM can be implemented as simply as an n-gram language model or a vectorized variant. In this way, FSD incurs no additional model parameters and negligible computational overhead (FSD can be as fast as greedy search). Despite its simplicity, FSD is surprisingly effective and generalizes across different datasets, models, and languages. Extensive experiments show that FSD outperforms established strong baselines in terms of generation quality, decoding speed, and universality.
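
The anti-LM idea above can be sketched with a tiny n-gram model built from the text generated so far. Everything below is an illustrative assumption: the class and function names, the bigram order, the weight `alpha`, and the linear penalty form are hypothetical choices, since the abstract does not give the exact scoring rule.

```python
from collections import Counter, defaultdict


class NGramAntiLM:
    """N-gram counts over previously generated tokens (hypothetical helper)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)

    def update(self, tokens):
        """Record the n-gram that ends at the most recent token."""
        if len(tokens) >= self.n:
            context = tuple(tokens[-self.n:-1])
            self.counts[context][tokens[-1]] += 1

    def prob(self, generated, token):
        """Relative frequency of `token` after the current context (0.0 if unseen)."""
        context = tuple(generated[-(self.n - 1):]) if self.n > 1 else ()
        seen = self.counts.get(context)
        if not seen:
            return 0.0
        return seen[token] / sum(seen.values())


def pick_next(lm_probs, generated, anti_lm, alpha=0.5):
    """Greedy choice after subtracting an assumed anti-LM penalty from the LM score."""
    def score(tok):
        return (1 - alpha) * lm_probs[tok] - alpha * anti_lm.prob(generated, tok)
    return max(lm_probs, key=score)


# Toy usage: "a b" was already generated, so "b" after "a" gets penalized.
generated = ["a", "b", "a"]
anti_lm = NGramAntiLM(n=2)
for i in range(1, len(generated) + 1):
    anti_lm.update(generated[:i])

choice = pick_next({"b": 0.6, "c": 0.4}, generated, anti_lm)
```

In this toy run the model's preferred token "b" would repeat the bigram "a b", so the penalized score favors "c" instead; with `alpha=0` the choice falls back to plain greedy search.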

A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions
Shun Inadumi | Seiya Kawano | Akishige Yuguchi | Yasutomo Kawanishi | Koichiro Yoshino

Situated conversations, which refer to visual information as visual question answering (VQA), often contain ambiguities caused by reliance on directive information. This problem is exacerbated because some languages, such as Japanese, often omit subjective or objective terms. Such ambiguities in questions are often clarified by the contexts in conversational situations, such as joint attention with a user or user gaze information. In this study, we propose the Gaze-grounded VQA dataset (GazeVQA) that clarifies ambiguous questions using gaze information by focusing on a clarification process complemented by gaze information. We also propose a method that utilizes gaze target estimation results to improve the accuracy of GazeVQA tasks. Our experimental results showed that the proposed method improved the performance in some cases of a VQA system on GazeVQA and identified some typical problems of GazeVQA tasks that need to be improved.

Agenda-Driven Question Generation: A Case Study in the Courtroom Domain
Yi Fung | Anoop Kumar | Aram Galstyan | Heng Ji | Prem Natarajan

This paper introduces a novel problem of automated question generation for courtroom examinations, CourtQG. While question generation has been studied in domains such as educational testing and product description, CourtQG poses several unique challenges owing to its non-cooperative and agenda-driven nature. Specifically, not only do the generated questions need to be relevant to the case and underlying context, but they also have to achieve certain objectives such as challenging the opponent’s arguments and/or revealing potential inconsistencies in their answers. We propose to leverage large language models (LLMs) for CourtQG by fine-tuning them on two auxiliary tasks, agenda explanation (i.e., uncovering the underlying intents) and question type prediction. We additionally propose cold-start generation of questions from background documents without relying on examination history. We construct a dataset to evaluate our proposed method and show that it generates better questions according to standard metrics when compared to several baselines.

A Generative Model for Lambek Categorial Sequents
Jinman Zhao | Gerald Penn

In this work, we introduce a generative model, PLC+, for generating Lambek Categorial Grammar (LCG) sequents. We also introduce a simple method to numerically estimate the model’s parameters from an annotated corpus. Then we compare our model with probabilistic context-free grammars (PCFGs) and show that PLC+ simultaneously assigns a higher probability to a common corpus and has greater coverage.

Agent-based Modeling of Language Change in a Small-world Network
Dalmo Buzato | Evandro Cunha

Language change has been the subject of numerous studies in linguistics. However, due to the dynamic and complex nature of this phenomenon, and to the difficulty of obtaining extensive real data of language in use, some of its aspects remain obscure. In recent years, nonetheless, research has used computational modeling to simulate features related to variation, change, propagation, and evolution of languages in speech communities, finding compelling results. In this article, agent-based modeling and simulation is used to study language change. Drawing on previous studies, a speech community was modeled using Zachary’s karate club network, a well-established small-world network model in the field of complex systems. Idiolects were assigned through numerical values for each agent. The results demonstrate that the centrality of each agent in the network, interpreted as social prestige, appears to be a factor influencing change. Additionally, the nature of idiolects also seems to impact the spread of linguistic variants in the language change process. These findings complement the theoretical understanding of the language change phenomenon with new simulation data and provide new avenues for research.

Agettivu, Aggitivu o Aghjettivu? POS Tagging Corsican Dialects
Alice Millour | Lorenza Brasile | Alberto Ghia | Laurent Kevers

In this paper we present a series of experiments towards POS tagging Corsican, a less-resourced language spoken in Corsica and linguistically related to Italian. The first contribution is Corsican-POS, the first gold standard POS-tagged corpus for Corsican, composed of 500 sentences manually annotated with the Universal POS tagset. Our second contribution is a set of experiments and evaluation of POS tagging models which starts with a baseline model for Italian and is aimed at finding the best training configuration, namely in terms of the size and combination strategy of the existing raw and annotated resources. These experiments result in (i) the first POS tagger for Corsican, reaching an accuracy of 93.38%, and (ii) a quantification of the gain provided by the use of each available resource. We find that the optimal configuration uses Italian word embeddings further specialized with Corsican embeddings and trained on the largest gold corpus for Corsican available so far.

Recent advancements in Chain-of-Thought prompting have facilitated significant breakthroughs for Large Language Models (LLMs) in complex reasoning tasks. Current research enhances the reasoning performance of LLMs by sampling multiple reasoning chains and ensembling based on the answer frequency. However, this approach fails in scenarios where the correct answers are in the minority. We identify this as a primary factor constraining the reasoning capabilities of LLMs, a limitation that cannot be resolved solely based on the predicted answers. To address this shortcoming, we introduce a hierarchical reasoning aggregation framework AoR (Aggregation of Reasoning), which selects answers based on the evaluation of reasoning chains. Additionally, AoR incorporates dynamic sampling, adjusting the number of reasoning chains in accordance with the complexity of the task. Experimental results on a series of complex reasoning tasks show that AoR outperforms prominent ensemble methods. Further analysis reveals that AoR not only adapts to various LLMs but also achieves a superior performance ceiling when compared to current methods.
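
The failure mode described above, where frequency-based ensembling picks a wrong answer because the correct one is in the minority, can be illustrated with a minimal sketch. This is not the authors' AoR implementation: the per-chain quality scores below are hypothetical placeholders for whatever evaluation of reasoning chains a system provides, and the function names are invented for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(chains):
    """Return the most frequent final answer (frequency-based ensembling)."""
    return Counter(answer for answer, _ in chains).most_common(1)[0][0]


def select_by_reasoning_score(chains):
    """Return the answer backed by the highest-scored reasoning chain.

    Scores are assumed to be non-negative; how they are produced is
    outside the scope of this sketch.
    """
    best = defaultdict(float)
    for answer, score in chains:
        best[answer] = max(best[answer], score)
    return max(best, key=best.get)


# Each item: (final answer, hypothetical reasoning-quality score in [0, 1]).
chains = [("12", 0.35), ("12", 0.40), ("12", 0.30), ("15", 0.90), ("15", 0.85)]

print(majority_vote(chains))              # "12": three of five chains agree
print(select_by_reasoning_score(chains))  # "15": the best-scored chain wins
```

The two selectors disagree exactly when the minority answer is supported by better-rated reasoning, which is the situation the abstract identifies.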

A Hierarchical Sequence-to-Set Model with Coverage Mechanism for Aspect Category Sentiment Analysis
Siyu Wang | Jianhui Jiang | Shengran Dai | Jiangtao Qiu

Aspect category sentiment analysis (ACSA) aims to simultaneously detect aspect categories and their corresponding sentiment polarities (category-sentiment pairs). Some recent studies have used pre-trained generative models to complete ACSA and achieved good results. However, for ACSA, generative models still face three challenges. First, addressing the missing predictions in ACSA is crucial, which involves accurately predicting all category-sentiment pairs within a sentence. Second, category-sentiment pairs are inherently a disordered set. Consequently, the model incurs a penalty even when its predictions are correct, but the predicted order is inconsistent with the ground truths. Third, different aspect categories should focus on relevant sentiment words, and the polarity of the aspect category should be the aggregation of the polarities of these sentiment words. This paper proposes a hierarchical generative model with a coverage mechanism using sequence-to-set learning to tackle all three challenges simultaneously. Our model’s superior performance is demonstrated through extensive experiments conducted on several datasets.
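
The second challenge above (a correct prediction is penalised when its order differs from the ground truth) can be made concrete with a small sketch. It only contrasts an order-sensitive comparison with a set-based one; it is not the paper's model or loss, and the category names are made up for illustration.

```python
def sequence_match(pred, gold):
    """Order-sensitive comparison: the pairs must appear in the same order."""
    return list(pred) == list(gold)


def set_f1(pred, gold):
    """Order-insensitive F1 over category-sentiment pairs."""
    pred_set, gold_set = set(pred), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: same pairs, different order.
gold = [("food", "positive"), ("service", "negative")]
pred = [("service", "negative"), ("food", "positive")]

print(sequence_match(pred, gold))  # False: the order differs
print(set_f1(pred, gold))          # 1.0: the sets are identical
```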

A Hong Kong Sign Language Corpus Collected from Sign-interpreted TV News
Zhe Niu | Ronglai Zuo | Brian Mak | Fangyun Wei

This paper introduces TVB-HKSL-News, a new Hong Kong sign language (HKSL) dataset collected from a TV news program over a period of 7 months. The dataset is collected to enrich resources for HKSL and support research in large-vocabulary continuous sign language recognition (SLR) and translation (SLT). It consists of 16.07 hours of sign videos of two signers with a vocabulary of 6,515 glosses (for SLR) and 2,850 Chinese characters or 18K Chinese words (for SLT). One signer has 11.66 hours of sign videos and the other has 4.41 hours. One objective in building the dataset is to support the investigation of how well large-vocabulary continuous sign language recognition/translation can be done for a single signer given a (relatively) large amount of his/her training data, which could potentially lead to the development of new modeling methods. Besides, most parts of the data collection pipeline are automated with little human intervention; we believe that our collection method can be scaled up to collect more sign language data easily for SLT in the future for any sign languages if such sign-interpreted videos are available. We also run a SOTA SLR/SLT model on the dataset and get a baseline SLR word error rate of 34.08% and a baseline SLT BLEU-4 score of 23.58 for benchmarking future research on the dataset.
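
For readers unfamiliar with the word error rate reported above, the following is a minimal sketch of the standard definition (word-level edit distance divided by reference length). The gloss strings are invented for illustration and are not taken from the dataset; the paper's exact scoring script is not known here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deletion (WEATHER) and one substitution (RAIN -> SUN) over 4 words.
print(word_error_rate("TODAY WEATHER HONGKONG RAIN", "TODAY HONGKONG SUN"))  # 0.5
```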

A Hybrid Approach to Aspect Based Sentiment Analysis Using Transfer Learning
Gaurav Negi | Rajdeep Sarkar | Omnia Zayed | Paul Buitelaar

Aspect-Based Sentiment Analysis (ABSA) aims to identify terms or multiword expressions (MWEs) on which sentiments are expressed and the sentiment polarities associated with them. The development of supervised models has been at the forefront of research in this area. However, training these models requires the availability of manually annotated datasets, which is both expensive and time-consuming. Furthermore, the available annotated datasets are tailored to a specific domain, language, and text type. In this work, we address this notable challenge in current state-of-the-art ABSA research. We propose a hybrid approach for Aspect Based Sentiment Analysis using transfer learning. The approach focuses on generating weakly-supervised annotations by exploiting the strengths of both large language models (LLM) and traditional syntactic dependencies. We utilise syntactic dependency structures of sentences to complement the annotations generated by LLMs, as they may overlook domain-specific aspect terms. Extensive experimentation on multiple datasets is performed to demonstrate the efficacy of our hybrid method for the tasks of aspect term extraction and aspect sentiment classification.

A Japanese News Simplification Corpus with Faithfulness
Toru Urakawa | Yuya Taguchi | Takuro Niitsuma | Hideaki Tamori

Text Simplification enhances the readability of texts for specific audiences. However, automated models may introduce unwanted content or omit essential details, necessitating a focus on maintaining faithfulness to the original input. Furthermore, existing simplified corpora contain instances of low faithfulness. Motivated by this issue, we present a new Japanese simplification corpus designed to prioritize faithfulness. Our collection comprises 7,075 paired sentences simplified from newspaper articles. This process involved collaboration with language education experts who followed guidelines balancing readability and faithfulness. Through corpus analysis, we confirmed that our dataset preserves the content of the original text, including personal names, dates, and city names. Manual evaluation showed that our corpus robustly maintains faithfulness to the original text, surpassing other existing corpora. Furthermore, evaluation by non-native readers confirmed its readability to the target audience. Through fine-tuning and in-context learning experiments, we demonstrated that our corpus enhances faithful sentence simplification.

Knowledge-based, open-domain dialogue generation aims to build chit-chat systems that talk to humans using mined support knowledge. Many types and sources of knowledge have previously been shown to be useful as support knowledge. Even in the era of large language models, response generation grounded in knowledge retrieved from additional up-to-date sources remains a practically important approach. While prior work using single-source knowledge has shown a clear positive correlation between the performances of knowledge selection and response generation, there are no existing multi-source datasets for evaluating support knowledge retrieval. Further, prior work has assumed that the knowledge sources available at test time are the same as during training. This unrealistic assumption unnecessarily handicaps models, as new knowledge sources can become available after a model is trained. In this paper, we present a high-quality benchmark named multi-source Wizard of Wikipedia (Ms.WoW) for evaluating multi-source dialogue knowledge selection and response generation. Unlike existing datasets, it contains clean support knowledge, grounded at the utterance level and partitioned into multiple knowledge sources. We further propose a new challenge, dialogue knowledge plug-and-play, which aims to test an already trained dialogue model on using new support knowledge from previously unseen sources in a zero-shot fashion.

A Large Annotated Reference Corpus of New High German Poetry
Thomas Haider

This paper introduces a large annotated corpus of public domain German poetry, covering the time period from 1600 to the 1920s with 65k poems. We describe how the corpus was compiled, how it was cleaned (including duplicate detection), and how it looks now in terms of size, format, temporal distribution, and automatic annotation. Besides metadata, the corpus contains reliable annotation of tokens, syllables, part-of-speech, and meter and verse measure. Finally, we give some statistics on the annotation and an overview of other poetry corpora.

Cross-lingual pre-training methods mask and predict tokens in multilingual text to generalize diverse multilingual information. However, due to the lack of sufficient aligned multilingual resources in the pre-training process, these methods may not fully explore the multilingual correlation of masked tokens, resulting in the limitation of multilingual information interaction. In this paper, we propose a lifelong multilingual multi-granularity semantic alignment approach, which continuously extracts massive aligned linguistic units from noisy data via a maximum co-occurrence probability algorithm. Then, the approach releases a version of the multilingual multi-granularity semantic alignment resource, supporting seven languages, namely English, Czech, German, Russian, Romanian, Hindi and Turkish. Finally, we propose how to use this resource to improve the translation performance on WMT14-18 benchmarks in twelve directions. Experimental results show an average of 0.3-1.1 BLEU improvements in all translation benchmarks. The analysis and discussion also demonstrate the superiority and potential of the proposed approach. The resource used in this work will be publicly available.

A Lightweight Approach to a Giga-Corpus of Historical Periodicals: The Story of a Slovenian Historical Newspaper Collection
Filip Dobranić | Bojan Evkoski | Nikola Ljubešić

Preparing historical newspaper collections is a complicated endeavour, consisting of multiple steps that have to be carefully adapted to the specific content in question, including imaging, layout prediction, optical character recognition, and linguistic annotation. To address the high costs associated with the process, we present a lightweight approach to producing high-quality corpora and apply it to a massive collection of Slovenian historical newspapers from the 18th, 19th and 20th century resulting in a billion-word giga-corpus. We start with noisy OCR-ed data produced by different technologies in varying periods by the National and University Library of Slovenia. To address the inherent variability in the quality of textual data, a challenge commonly encountered in digital libraries globally, we perform a targeted post-digitisation correction procedure, coupled with a robust curation mechanism for noisy texts via language model inference. Subsequently, we subject the corrected and filtered output to comprehensive linguistic annotation, enriching the corpus with part-of-speech tags, lemmas, and named entity labels. Finally, we perform an analysis through topic modeling at the noun lemma level, along with a frequency analysis of the named entities, to confirm the viability of our corpus preparation method.

Aligning the Norwegian UD Treebank with Entity and Coreference Information
Tollef Emil Jørgensen | Andre Kåsen

This paper presents a merged collection of entity and coreference annotated data grounded in the Universal Dependencies (UD) treebanks for the two written forms of Norwegian: Bokmål and Nynorsk. The aligned and converted corpora are the Norwegian Named Entities (NorNE) and Norwegian Anaphora Resolution Corpus (NARC). While NorNE is aligned with an older version of the treebank, NARC is misaligned and requires extensive transformation from the original annotations to the UD structure and CoNLL-U format. Here, we demonstrate the conversion and alignment processes, along with an analysis of discovered issues and errors in the data, some of which include data split overlaps in the original treebank. These procedures and the developed system may prove helpful for future work on processing and aligning data from universal dependencies. The merged corpora comprise the first Norwegian UD treebank enriched with named entities and coreference information, supporting the standardized format for the CorefUD initiative.

The visual question localized-answering (VQLA) system has garnered increasing attention due to its potential as a knowledgeable assistant in surgical education. Apart from providing text-based answers, VQLA can also pinpoint the specific region of interest for better surgical scene understanding. Although recent Transformer-based models for VQLA have obtained promising results, they (1) conduct vanilla text-to-image cross attention, leading to unidirectional and coarse-grained alignment; (2) ignore exploiting the semantics of answers to further boost performance. In this paper, we propose a novel model termed OTAS, which first introduces optimal transport to achieve bidirectional and fine-grained alignment between images and questions, enabling more precise localization. Besides, OTAS incorporates a set of learnable candidate answer embeddings to query the probability of each answer class for a given image-question pair. Through Transformer attention, the candidate answer embeddings interact with the fused features of the image-question pair to make the answer decision. Extensive experiments on two widely-used benchmark datasets demonstrate the superiority of our model over state-of-the-art methods.

The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation (NMT). Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches to Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the “Align-to-Distill” (A2D) strategy, designed to address the feature mapping problem by adaptively aligning student attention heads with their teacher counterparts during training. The Attention Alignment Module (AAM) in A2D performs a dense head-by-head comparison between student and teacher attention heads across layers, turning the combinatorial mapping heuristics into a learning problem. Our experiments show the efficacy of A2D, demonstrating gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De→Dsb and WMT-2014 En→De, respectively, compared to Transformer baselines. The code and data are available at https://github.com/ncsoft/Align-to-Distill.

A Linguistically-Informed Annotation Strategy for Korean Semantic Role Labeling
Yige Chen | KyungTae Lim | Jungyeul Park

Semantic role labeling is an essential component of semantic and syntactic processing of natural languages, which reveals the predicate-argument structure of the language. Despite its importance, semantic role labeling for the Korean language has not been studied extensively. One notable issue is the lack of uniformity among data annotation strategies across different datasets, which often lack thorough rationales. In this study, we suggest an annotation strategy for Korean semantic role labeling that is in line with the previously proposed linguistic theories as well as the distinct properties of the Korean language. We further propose a simple yet viable conversion strategy from the Sejong verb dictionary to a CoNLL-style dataset for Korean semantic role labeling. Experiment results using a transformer-based sequence labeling model demonstrate the reliability and trainability of the converted dataset.

Alleviating Exposure Bias in Abstractive Summarization via Sequentially Generating and Revising
Jiaxin Duan | Fengyu Lu | Junfei Liu

Abstractive summarization commonly suffers from exposure bias caused by supervised teacher-forcing learning: a model predicts the next token conditioned on the accurate pre-context during training but on its own preceding outputs at inference. Existing solutions bridge this gap through un- or semi-supervised holistic learning yet still leave the risk of error accumulation while generating a summary. In this paper, we attribute this problem to the limitation of unidirectional autoregressive text generation and introduce post-processing steps to alleviate it. Specifically, we reformat abstractive summarization to sequential generation and revision (SeGRe), i.e., a model in the revision phase re-inputs the generated summary and refines it by contrasting it with the source document. This provides the model additional opportunities to assess the flawed summary from a global view and thereby modify inappropriate expressions. Moreover, we train the SeGRe model with a regularized minimum-risk policy to ensure effective generation and revision. Extensive comparative experiments on two well-known datasets show that SeGRe achieves new or matched state-of-the-art performance.

This paper presents ALLIES, a meta corpus which gathers and extends existing French corpora collected from radio and TV shows. The corpus contains 1048 audio files for about 500 hours of speech. Agglomeration of data is always a difficult issue, as the guidelines used to collect, annotate and transcribe speech are generally different from one corpus to another. ALLIES intends to homogenize and correct speaker labels among the different files by integrating human feedback within a speaker verification system. The main contribution of this article is the design of a protocol to properly evaluate speech segmentation (including music and overlap detection), speaker diarization, speech transcription and speaker change detection. As part of it, a test partition has been carefully and manually 1) segmented and annotated according to speech, music, noise, and speaker labels, with specific guidelines for overlapping speech, and 2) orthographically transcribed. As a second contribution, this article also provides baseline results for several speech processing tasks.

A Logical Pattern Memory Pre-trained Model for Entailment Tree Generation
Li Yuan | Yi Cai | Haopeng Ren | Jiexin Wang

Generating coherent and credible explanations remains a significant challenge in the field of AI. In recent years, researchers have delved into the utilization of entailment trees to depict explanations, which exhibit a reasoning process of how a hypothesis is deduced from the supporting facts. However, existing models often overlook the importance of generating intermediate conclusions with logical consistency from the given facts, leading to inaccurate conclusions and undermining the overall credibility of entailment trees. To address this limitation, we propose the logical pattern memory pre-trained model (LMPM). LMPM incorporates an external memory structure to learn and store the latent representations of logical patterns, which aids in generating logically consistent conclusions. Furthermore, to mitigate the influence of logically irrelevant domain knowledge in the Wikipedia-based data, we introduce an entity abstraction approach to construct the dataset for pre-training LMPM. The experimental results highlight the effectiveness of our approach in improving the quality of entailment tree generation. By leveraging logical entailment patterns, our model produces more coherent and reasonable conclusions that closely align with the underlying premises.

The task of financial analysis primarily encompasses two key areas: stock trend prediction and the corresponding financial question answering. Currently, machine learning and deep learning algorithms (ML&DL) have been widely applied for stock trend predictions, leading to significant progress. However, these methods fail to provide reasons for predictions, lacking interpretability and reasoning processes. Also, they cannot integrate textual information such as financial news or reports. Meanwhile, large language models (LLMs) have remarkable textual understanding and generation ability. But due to the scarcity of financial training datasets and limited integration with real-time knowledge, LLMs still suffer from hallucinations and are unable to keep up with the latest information. To tackle these challenges, we first release AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data. It has a positive impact on training LLMs for completing financial analysis. We then use AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task, which integrates retrieval-augmented generation (RAG) techniques. Extensive experiments are conducted to demonstrate the effectiveness of our framework on financial analysis.

A Luxembourgish Corpus as a Gender Bias Evaluation Testset
Dimitra Anastasiou | Carole Blond-Hanten | Marie Gallais

According to the United Nations Development Programme, gender inequality is a metric that is composed of three dimensions: reproductive health, empowerment, and the labour market. Gender inequality is an obstacle to equal opportunities in society as a whole. In this paper we present our work-in-progress of designing and playing a physical game with digital elements. We currently conduct Conversation Analysis of transcribed speech of 58,567 words and document bias. We also test OpenAI’s ChatGPT for bias in quiz-like gender-related questions.

A Matter of Perspective: Building a Multi-Perspective Annotated Dataset for the Study of Literary Quality
Yuri Bizzoni | Pascale Feldkamp Moreira | Ida Marie S. Lassen | Mads Rosendahl Thomsen | Kristoffer Nielbo

Studies on literary quality have constantly stimulated the interest of critics, both in theoretical and empirical fields. To examine the perceived quality of literary works, some approaches have focused on data annotated through crowd-sourcing platforms, and others relied on available expert annotated data. In this work, we contribute to the debate by presenting a dataset collecting quality judgments on 9,000 19th and 20th century English-language literary novels by 3,150 predominantly Anglophone authors. We incorporate expert opinions and crowd-sourced annotations to allow comparative analyses between different literary quality evaluations. We also provide several textual metrics chosen for their potential connection with literary reception and engagement. While a large part of the texts is subject to copyright, we release quality and reception measures together with stylometric and sentiment data for each of the 9,000 novels to promote future research and comparison.

AMenDeD: Modelling Concepts by Aligning Mentions, Definitions and Decontextualised Embeddings
Amit Gajbhiye | Zied Bouraoui | Luis Espinosa Anke | Steven Schockaert

Contextualised Language Models (LM) improve on traditional word embeddings by encoding the meaning of words in context. However, such models have also made it possible to learn high-quality decontextualised concept embeddings. Three main strategies for learning such embeddings have thus far been considered: (i) fine-tuning the LM to directly predict concept embeddings from the name of the concept itself, (ii) averaging contextualised representations of mentions of the concept in a corpus, and (iii) encoding definitions of the concept. As these strategies have complementary strengths and weaknesses, we propose to learn a unified embedding space in which all three types of representations can be integrated. We show that this allows us to outperform existing approaches in tasks such as ontology completion, which heavily depends on access to high-quality concept embeddings. We furthermore find that mentions and definitions are well-aligned in the resulting space, enabling tasks such as target sense verification, even without the need for any fine-tuning.

We present a corpus of 100 documents, named OBSINFOX, selected from 17 sources of French press considered unreliable by expert agencies, annotated using 11 labels by 8 annotators. By collecting more labels than usual, by more annotators than is typically done, we can identify features that humans consider as characteristic of fake news, and compare them to the predictions of automated classifiers. We present a topic and genre analysis using Gate Cloud, indicative of the prevalence of satire-like text in the corpus. We then use the subjectivity analyzer VAGO, and a neural version of it, to clarify the link between ascriptions of the label Subjective and ascriptions of the label Fake News. The annotated dataset is available online at the following URL: https://github.com/obs-info/obsinfox
Keywords: Fake News, Multi-Labels, Subjectivity, Vagueness, Detail, Opinion, Exaggeration, French Press

A Multi-layered Approach to Physical Commonsense Understanding: Creation and Evaluation of an Italian Dataset
Giulia Pensa | Begoña Altuna | Itziar Gonzalez-Dios

In this paper, we explore physical commonsense reasoning of large language models (LLMs) and propose a specific methodology to evaluate low-level understanding of the physical world. Specifically, the goal is to create a test set to analyze physical commonsense reasoning in large language models for Italian and focus on a trustworthy analysis of the results. To that end, we present a tiered Italian dataset, called Graded Italian Annotated dataset (GITA), written and thoroughly annotated by a professional linguist, which allows us to concentrate on three different levels of commonsense understanding. Moreover, we create a semi-automated system to complete the accurate annotation of the dataset. We also validate our dataset by carrying out three tasks with a multilingual model (XLM-RoBERTa) and propose a qualitative analysis of the results. We found that, although the model may perform at a high level in classification tasks, its reasoning is inconsistent and unverifiable, since it does not capture intermediate evidence.

A Multilingual Parallel Corpus for Aromanian
Iulia Petrariu | Sergiu Nisioi

We report the creation of the first high-quality corpus of Aromanian - an endangered Romance language spoken in the Balkans - and the equivalent sentence-aligned translations into Romanian, English, and French. The corpus is released publicly using several orthographic standards and consists of short stories collected in the ’70s in Romania. Additionally, we provide a corpus-based analysis of Aromanian linguistic particularities and the overall demographic and political context which impacts the contemporary development of the language.

The automatic translation of spoken language into pictogram units can facilitate communication involving individuals with language impairments. However, there is no established translation formalism or publicly available datasets for training end-to-end speech translation systems. This paper introduces the first aligned speech, text, and pictogram translation dataset ever created in any language. We provide a French dataset that contains 230 hours of speech resources. We create a rule-based pictogram grammar with a restricted vocabulary and include a discussion of the strategic decisions involved. It takes advantage of an in-depth linguistic study of resources taken from the ARASAAC website. We validate these rules through multiple post-editing phases by expert annotators. The constructed dataset is then used to experiment with a Speech-to-Pictogram cascade model, which employs state-of-the-art Automatic Speech Recognition models. The dataset is freely available under a non-commercial licence. This marks a starting point to conduct research into the automatic translation of speech into pictogram units.

In this paper, we propose a new setting for generating product descriptions from images, augmented by marketing keywords. It leverages the combined power of visual and textual information to create descriptions that are more tailored to the unique features of products. For this setting, previous methods utilize visual and textual encoders to encode the image and keywords and employ a language model-based decoder to generate the product description. However, the generated description is often inaccurate and generic since same-category products have similar copy-writings, and optimizing the overall framework on large-scale samples makes models concentrate on common words yet ignore the product features. To alleviate the issue, we present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference and utilizes the in-context learning capability of language models to produce the description. During training, we keep the visual encoder and language model frozen, focusing on optimizing the modules responsible for creating multimodal in-context references and dynamic prompts. This approach preserves the language generation prowess of large language models (LLMs), facilitating a substantial increase in description diversity. To assess the effectiveness of ModICT across various language model scales and types, we collect data from three distinct product categories within the E-commerce domain. Extensive experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods. Our findings underscore the potential of ModICT as a valuable tool for enhancing the automatic generation of product descriptions in a wide range of applications. Data and code are at https://github.com/HITsz-TMG/Multimodal-In-Context-Tuning
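
The diversity figure above is reported on "D-5". Assuming this refers to the common distinct-n measure with n = 5 (the share of unique n-grams among all n-grams in the generated texts), a minimal sketch looks as follows. The tokenisation by whitespace and the example descriptions are assumptions made for illustration, not details taken from the paper.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams over a list of texts."""
    unique, total = set(), 0
    for text in texts:
        tokens = text.split()  # whitespace tokenisation (an assumption)
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Hypothetical generated descriptions; n = 2 keeps the example readable.
descriptions = ["soft cotton shirt", "soft cotton dress"]
print(distinct_n(descriptions, 2))  # 0.75: 3 unique bigrams out of 4
```

A model that repeats the same generic phrasing across same-category products yields a low ratio, which is the behaviour the abstract attributes to earlier methods.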

A Multi-Task Transformer Model for Fine-grained Labelling of Chest X-Ray Reports
Yuanyi Zhu | Maria Liakata | Giovanni Montana

Precise understanding of free-text radiology reports through localised extraction of clinical findings can enhance medical imaging applications like computer-aided diagnosis. We present a new task, that of segmenting radiology reports into topically meaningful passages (segments) and a transformer-based model that both segments reports into semantically coherent segments and classifies each segment using a set of 37 radiological abnormalities, thus enabling fine-grained analysis. This contrasts with prior work that performs classification on full reports without localisation. Trained on over 2.7 million unlabelled chest X-ray reports and over 28k segmented and labelled reports, our model achieves state-of-the-art performance on report segmentation (0.0442 WinDiff) and multi-label classification (0.84 report-level macro F1) over 37 radiological labels and 8 NLP-specific labels. This work establishes new benchmarks for fine-grained understanding of free-text radiology reports, with precise localisation of semantics unlocking new opportunities to improve computer vision model training and clinical decision support. We open-source our annotation tool, model code and pretrained weights to encourage future research.
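
As an illustration of the segmentation score quoted above, the following is a minimal sketch that assumes "WinDiff" refers to the WindowDiff metric in one common formulation. The function name `window_diff`, the 0/1 boundary encoding, and the default window size are assumptions made for this example, not details taken from the paper.

```python
def window_diff(reference, hypothesis, k=None):
    """WindowDiff between two segmentations given as 0/1 boundary indicators.

    reference[i] == 1 means a segment boundary follows unit i.
    Lower is better; identical segmentations score 0.0.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    n = len(reference)
    if k is None:
        # Assumed default: about half the average reference segment length.
        num_segments = sum(reference) + 1
        k = max(1, n // (2 * num_segments))
    if not 1 <= k <= n:
        raise ValueError("window size k must be between 1 and the sequence length")

    num_windows = n - k + 1
    mismatches = 0
    for start in range(num_windows):
        ref_boundaries = sum(reference[start:start + k])
        hyp_boundaries = sum(hypothesis[start:start + k])
        # A window counts as an error when the boundary counts disagree.
        if ref_boundaries != hyp_boundaries:
            mismatches += 1
    return mismatches / num_windows


reference = [0, 0, 1, 0, 0, 0, 1, 0]
hypothesis = [0, 0, 0, 1, 0, 0, 1, 0]  # first boundary placed one unit late
score = window_diff(reference, hypothesis, k=2)
```

Because the metric counts disagreeing windows rather than exact boundary matches, a near miss such as the shifted boundary above is penalised less than a missing boundary would be under strict matching.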

Analysis of Sensation-transfer Dialogues in Motorsports
Takeru Isaka | Atsushi Otsuka | Iwaki Toshima

Clarifying the effects of subjective ideas on group performance is essential for future dialogue systems to improve mutual understanding among humans and group creativity. However, dialogue research has paid little attention to quantitatively analyzing how the quality and quantity of subjective information contained in dialogues affect group performance. We hypothesize that the more subjective information interlocutors exchange, the better the group performance in collaborative work. To verify this hypothesis, we collected dialogues between drivers and engineers in motorsports as they decided how the car should be tuned. Our analysis suggests that the greater the amount of subjective information (which we defined as “sensation”) in the driver’s utterances, the greater the race performance and driver satisfaction with the car’s tuning. The results indicate that it is essential for the development of dialogue research to create a corpus of situations that require high performance through collaboration among experts with different backgrounds but who have mastered their respective fields.

Analysis on Unsupervised Acquisition Process of Bilingual Vocabulary through Iterative Back-Translation
Takuma Tanigawa | Tomoyosi Akiba | Hajime Tsukada

In this paper, we investigate how new bilingual vocabulary is acquired through Iterative Back-Translation (IBT), which is known as a data augmentation method for machine translation from monolingual data of both source and target languages. To reveal the acquisition process, we first identify the word translation pairs in the test data that do not appear in the bilingual data but appear only in the two monolingual data sets, then observe how many of these pairs are successfully translated by the translation model trained through IBT. We experimented with domain adaptation settings on two language pairs. Our experimental evaluation showed that more than 60% of the new bilingual vocabulary is successfully acquired through IBT, along with an improvement in translation quality in terms of BLEU. It also revealed that new bilingual vocabulary was gradually acquired by repeating IBT iterations. From the results, we present our hypothesis on the process of new bilingual vocabulary acquisition, in which the context of the words plays a critical role in the success of the acquisition.

Chain-of-Thought (CoT) prompting combined with large language models (LLM) has shown great potential in improving performance on challenging reasoning tasks. While understanding why CoT prompting is effective is crucial for the application and improvement of CoT prompting, few studies have addressed this issue. Besides, almost no prior work has conducted theoretical analysis on CoT prompting in the context of black-box models. In this paper, we approach the analysis of CoT prompting in black-box LLMs from an information-theoretic perspective. Specifically, we propose a new metric, EPVI (Estimated Pointwise V-Information), which extends the concept of pointwise V-information to black-box models, quantifying the label-relevant new information introduced by CoT prompting beyond the pre-existing information in the input. Based on this, we conduct a series of experiments at both the task and instance levels to analyze CoT prompting, demonstrating that the effectiveness of CoT prompting can be attributed to its capacity to influence the difficulty of model inference by augmenting or reducing the model-usable information. Furthermore, we show that selecting high-quality demonstrations of CoT reasoning based on EPVI can improve the downstream performance of reasoning tasks.

Analyzing Effects of Learning Downstream Tasks on Moral Bias in Large Language Models
Niklas Kiehne | Alexander Ljapunov | Marc Bätje | Wolf-Tilo Balke

Pre-training and fine-tuning large language models (LMs) is currently the state-of-the-art methodology for enabling data-scarce downstream tasks. However, the derived models still tend to replicate and perpetuate social biases. To understand this process in more detail, this paper investigates the actual effects of learning downstream tasks on moral bias in LMs. We develop methods to assess the agreement of LMs to explicitly codified norms in both pre-training and fine-tuning stages. Even if a pre-trained foundation model exhibits consistent norms, we find that introducing downstream tasks may indeed lead to unexpected inconsistencies in norm representation. Specifically, we observe two phenomena during fine-tuning across both masked and causal LMs: (1) pre-existing moral bias may be mitigated or amplified even when presented with opposing views and (2) prompt sensitivity may be negatively impacted. We provide empirical evidence of models deteriorating into conflicting states, where contradictory answers can easily be triggered by slight modifications in the input sequence. Our findings thus raise concerns about the general ability of LMs to mitigate moral biases effectively.

Word Sense Disambiguation (WSD) is a key task in Natural Language Processing (NLP), aiming to assign the correct meaning (sense) to a word in context. However, traditional WSD systems rely on WordNet as the underlying sense inventory, often differentiating meticulously between subtle nuances of word meanings, which may lead to excessive complexity and reduced practicality of WSD systems in today’s NLP. Indeed, current Pretrained Language Models (PLMs) do seem to be able to perform disambiguation, but it is not clear to what extent, or to what level of granularity, they actually operate. In this paper, we address these points and, firstly, introduce a new large-scale resource that leverages homonymy relations to systematically cluster WordNet senses, effectively reducing the granularity of word senses to a very coarse-grained level; secondly, we use this resource to train Homonymy Disambiguation systems and investigate whether PLMs are inherently able to differentiate coarse-grained word senses. Our findings demonstrate that, while state-of-the-art models still struggle to choose the correct fine-grained meaning of a word in context, Homonymy Disambiguation systems are able to differentiate homonyms with up to 95% accuracy scores even without fine-tuning the underlying PLM. We release our data and code at https://github.com/SapienzaNLP/homonymy-wsd.

Analyzing Interpretability of Summarization Model with Eye-gaze Information
Fariz Ikhwantri | Hiroaki Yamada | Takenobu Tokunaga

Interpretation methods provide saliency scores indicating the importance of input words for neural summarization models. Prior work has analyzed models by comparing them to human behavior, often using eye-gaze as a proxy for human attention in reading tasks such as classification. This paper presents a framework to analyze the model behavior in summarization by comparing it to human summarization behavior using eye-gaze data. We examine two research questions: RQ1) whether model saliency conforms to human gaze during summarization and RQ2) how model saliency and human gaze affect summarization performance. For RQ1, we measure conformity by calculating the correlation between model saliency and human fixation counts. For RQ2, we conduct ablation experiments removing words/sentences considered important by models or humans. Experiments on two datasets with human eye-gaze during summarization partially confirm that model saliency aligns with human gaze (RQ1). However, ablation experiments show that removing highly-attended words/sentences from the human gaze does not significantly degrade performance compared with the removal by the model saliency (RQ2).
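
The conformity measurement for RQ1 can be sketched as a rank correlation between per-word saliency scores and fixation counts. The abstract does not say which correlation coefficient is used, so Spearman's rho is an assumption here, and the function names and toy values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: one saliency score and one fixation count per input word.
model_saliency = [0.1, 0.4, 0.2, 0.9]
human_fixations = [0, 3, 1, 5]
rho = spearman(model_saliency, human_fixations)
```

A rank-based coefficient is a natural fit for this kind of comparison because saliency scores and fixation counts live on different scales, but other choices are equally plausible.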

Analyzing Large Language Models’ Capability in Location Prediction
Zhaomin Xiao | Yan Huang | Eduardo Blanco

In this paper, we investigate and evaluate large language models’ capability in location prediction. We present experimental results with four models—FLAN-T5, FLAN-UL2, FLAN-Alpaca, and ChatGPT—in various instruction finetuning and exemplar settings. We analyze whether taking into account the context—tweets published before and after the tweet mentioning a location—is beneficial. Additionally, we conduct an ablation study to explore whether instruction modification is beneficial. Lastly, our qualitative analysis sheds light on the errors made by the best-performing model.

Analyzing Occupational Distribution Representation in Japanese Language Models
Katsumi Ibaraki | Winston Wu | Lu Wang | Rada Mihalcea

Recent advances in large language models (LLMs) have enabled users to generate fluent and seemingly convincing text. However, these models have uneven performance in different languages, which is also associated with undesirable societal biases toward marginalized populations. Specifically, there is relatively little work on Japanese models, despite it being the thirteenth most widely spoken language. In this work, we first develop three Japanese language prompts to probe LLMs’ understanding of Japanese names and their association between gender and occupations. We then evaluate a variety of English, multilingual, and Japanese models, correlating the models’ outputs with occupation statistics from the Japanese Census Bureau from the last 100 years. Our findings indicate that models can associate Japanese names with the correct gendered occupations when using constrained decoding. However, with sampling or greedy decoding, Japanese language models have a preference for a small set of stereotypically gendered occupations, and multilingual models, though trained on Japanese, are not always able to understand Japanese prompts.

The ever-growing number of people suffering from mental distress has motivated significant research initiatives towards automated depression estimation. Despite the multidisciplinary nature of the task, very few of these approaches include medical professionals in their research process, thus ignoring a vital source of domain knowledge. In this paper, we propose to bring the domain experts back into the loop and incorporate their knowledge within the gold-standard DAIC-WOZ dataset. In particular, we define a novel transformer-based architecture and analyse its performance in light of our expert annotations. Overall findings demonstrate a strong correlation between the psychological tendencies of medical professionals and the behavior of the proposed model, which additionally provides new state-of-the-art results.

The discourse surrounding climate change on social media platforms has emerged as a significant avenue for understanding public sentiments, perspectives, and engagement with this critical global issue. The unavailability of publicly available datasets, coupled with ignoring the multi-aspect analysis of climate discourse on social media platforms, has underscored the necessity for further advancement in this area. To address this gap, in this paper, we present an extensive exploration of the intricate realm of climate change discourse on Twitter, leveraging a meticulously annotated ClimaConvo dataset comprising 15,309 tweets. Our annotations encompass a rich spectrum, including aspects like relevance, stance, hate speech, the direction of hate, and humor, offering a nuanced understanding of the discourse dynamics. We address the challenges inherent in dissecting online climate discussions and detail our comprehensive annotation methodology. In addition to annotations, we conduct benchmarking assessments across various algorithms for six tasks: relevance detection, stance detection, hate speech identification, direction and target, and humor analysis. This assessment enhances our grasp of sentiment fluctuations and linguistic subtleties within the discourse. Our analysis extends to exploratory data examination, unveiling tweet distribution patterns, stance prevalence, and hate speech trends. Employing sophisticated topic modeling techniques uncovers underlying thematic clusters, providing insights into the diverse narrative threads woven within the discourse. The findings present a valuable resource for researchers, policymakers, and communicators seeking to navigate the intricacies of climate change discussions. The dataset and resources for this paper are available at https://github.com/shucoll/ClimaConvo.

Analyzing the Performance of Large Language Models on Code Summarization
Rajarshi Haldar | Julia Hockenmaier

Large language models (LLMs) such as Llama 2 perform very well on tasks that involve both natural language and source code, particularly code summarization and code generation. We show that for the task of code summarization, the performance of these models on individual examples often depends on the amount of (subword) token overlap between the code and the corresponding reference natural language descriptions in the dataset. This token overlap arises because the reference descriptions in standard datasets (corresponding to docstrings in large code bases) are often highly similar to the names of the functions they describe. We also show that this token overlap occurs largely in the function names of the code and compare the relative performance of these models after removing function names versus removing code structure. We also show that using multiple evaluation metrics like BLEU and BERTScore gives us very little additional insight since these metrics are highly correlated with each other.
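
The token overlap discussed above can be approximated with a short sketch. It assumes a simple identifier splitter (snake_case and camelCase) as a stand-in for a real subword tokenizer, and it measures overlap as the fraction of distinct reference tokens that also occur in the code; the paper's exact tokenization and overlap definition may differ, and the function names are hypothetical.

```python
import re


def split_subtokens(text):
    """Lowercased sub-tokens from identifiers and words.

    Splits on non-letters (including underscores) and on camelCase humps,
    e.g. "getUserName" -> ["get", "user", "name"].
    """
    tokens = []
    for chunk in re.findall(r"[A-Za-z]+", text):
        pieces = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk)
        tokens.extend(piece.lower() for piece in pieces)
    return tokens


def reference_overlap(code, reference):
    """Fraction of distinct reference tokens that also appear in the code."""
    reference_tokens = set(split_subtokens(reference))
    if not reference_tokens:
        return 0.0
    code_tokens = set(split_subtokens(code))
    return len(reference_tokens & code_tokens) / len(reference_tokens)


code = "def read_file(path):\n    return open(path).read()"
reference = "Read file from path"
overlap = reference_overlap(code, reference)
```

In this toy case most of the overlap comes from the function name `read_file`, which is the kind of effect the abstract attributes to docstring-derived references.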

Analyzing the Understanding of Morphologically Complex Words in Large Language Models
Marion Weller-Di Marco | Alexander Fraser

We empirically study the ability of a Large Language Model (gpt-3.5-turbo-instruct) to understand morphologically complex words. In our experiments, we looked at a variety of tasks to analyse German compounds with regard to compositional word formation and derivation, such as identifying the head noun of existing and novel compounds, identifying the shared verb stem between two words, or recognizing words constructed with inappropriately used derivation morphemes as invalid. Our results show that the language model is generally capable of solving most tasks, except for the task of identifying ill-formed word forms. While the model demonstrated a good overall understanding of complex words and their word-internal structure, the results also suggest that there is no formal knowledge of derivational rules, but rather an interpretation of the observed word parts to derive the meaning of a word.

An Argument for Symmetric Coordination from Dependency Length Minimization: A Replication Study
Adam Przepiórkowski | Magdalena Borysiak | Adam Głowacki

It is well known that left conjuncts tend to be shorter in English coordinate structures. On the basis of Penn Treebank, Przepiórkowski and Woźniak 2023 (in ACL 2023 proceedings) show that this tendency depends on the difference between lengths of conjuncts: the larger the difference, the stronger the tendency for the shorter conjunct to occur on the left. However, this dynamics is observed only when the governor of the coordinate structure is on the left of the coordination (e.g., “Bring apples and oranges!”) or when it is absent (e.g., “Come and sing!”), and not when it is on the right (e.g., “Apples and oranges fell”). Given the principle of Dependency Length Minimization, this turns out to provide an argument for the symmetric structure of coordination. We replicate and sharpen this result on the basis of a much larger dataset: parts of the COCA corpus parsed with Stanza. We also investigate the dependence of this result on the assumed unit of length (word vs. character) and on genre.

A Natural Approach for Synthetic Short-Form Text Analysis
Ruiting Shao | Ryan Schwarz | Christopher Clifton | Edward Delp

Detecting synthetically generated text in the wild has become increasingly difficult with advances in Natural Language Generation techniques and the proliferation of freely available Large Language Models (LLMs). Social media and news sites can be flooded with synthetically generated misinformation via tweets and posts while authentic users can inadvertently spread this text via shares and retweets. Most modern natural language processing techniques designed to detect synthetically generated text focus primarily on long-form content, such as news articles, or incorporate stylometric characteristics and metadata during their analysis. Unfortunately, for short form text like tweets, this information is often unavailable, usually detached from its original source, displayed out of context, and is often too short or informal to yield significant information from stylometry. This paper proposes a method of detecting synthetically generated tweets via a Transformer architecture and incorporating unique style-based features. Additionally, we have created a new dataset consisting of human-generated and Large Language Model generated tweets for 4 topics and another dataset consisting of tweets paraphrased by 3 different paraphrase models.

Data availability is crucial for advancing artificial intelligence applications, including voice-based technologies. As content creation, particularly in social media, experiences increasing demand, translation and text-to-speech (TTS) technologies have become essential tools. Notably, the performance of these TTS technologies is highly dependent on the quality of the training data, emphasizing the mutual dependence of data availability and technological progress. This paper introduces an end-to-end tool to generate high-quality datasets for text-to-speech (TTS) models to address this critical need for high-quality data. The contributions of this work are manifold and include: the integration of language-specific phoneme distribution into sample selection, automation of the recording process, automated and human-in-the-loop quality assurance of recordings, and processing of recordings to meet specified formats. The proposed application aims to streamline the dataset creation process for TTS models through these features, thereby facilitating advancements in voice-based technologies.

Anchor and Broadcast: An Efficient Concept Alignment Approach for Evaluation of Semantic Graphs
Haibo Sun | Nianwen Xue

In this paper, we present AnCast, an intuitive and efficient tool for evaluating graph-based meaning representations (MR). AnCast implements evaluation metrics that are well understood in the NLP community, including concept F1, unlabeled relation F1, labeled relation F1, and weighted relation F1. The efficiency of the tool comes from a novel anchor broadcast alignment algorithm that is not subject to the trappings of local maxima. We show through experimental results that the AnCast score is highly correlated with the widely used Smatch score, but its computation takes only about 40% of the time.

With the increasing availability of multimodal content on social media, consisting primarily of text and images, multimodal named entity recognition (MNER) has gained widespread attention. A fundamental challenge of MNER lies in effectively aligning different modalities. However, the majority of current approaches rely on a word-based sequence labeling framework and align the image and text at inconsistent semantic levels (whole image-words or regions-words). This misalignment may lead to inferior entity recognition performance. To address this issue, we propose an effective span-based method, named SMNER, which achieves a more consistent multimodal alignment from an information-theoretic perspective and a cross-modal interaction perspective. Specifically, we first introduce a cross-modal information bottleneck module for the global-level multimodal alignment (whole image-whole text). This module aims to encourage the semantic distribution of the image to be closer to the semantic distribution of the text, which can enable the filtering out of visual noise. Next, we introduce a cross-modal attention module for the local-level multimodal alignment (regions-spans), which captures the correlations between regions in the image and spans in the text, enabling a more precise alignment of the two modalities. Extensive experiments conducted on two benchmark datasets demonstrate that SMNER outperforms the state-of-the-art baselines.

An Empirical Study of Synthetic Data Generation for Implicit Discourse Relation Recognition
Kazumasa Omura | Fei Cheng | Sadao Kurohashi

Implicit Discourse Relation Recognition (IDRR), which is the task of recognizing the semantic relation between given text spans that do not contain overt clues, is a long-standing and challenging problem. In particular, the paucity of training data for some error-prone discourse relations makes the problem even more challenging. To address this issue, we propose a method of generating synthetic data for IDRR using a large language model. The proposed method has two parts: extraction of confusing discourse relation pairs based on false negative rate, and synthesis of data focused on the confusion. The key points of our proposed method are utilizing a confusion matrix and adopting two-stage prompting to obtain effective synthetic data. Using the proposed method, we generated synthetic data several times larger than the training examples for some error-prone discourse relations and incorporated it into training. In experiments, we achieved state-of-the-art macro-F1 performance thanks to the synthetic data without sacrificing micro-F1 performance and demonstrated its positive effects, especially on recognizing some infrequent discourse relations.

An Empirical Study on the Robustness of Massively Multilingual Neural Machine Translation
Supryadi Supryadi | Leiyu Pan | Deyi Xiong

Massively multilingual neural machine translation (MMNMT) has been proven to enhance the translation quality of low-resource languages. In this paper, we empirically investigate the translation robustness of Indonesian-Chinese translation in the face of various types of naturally occurring noise. To assess this, we create a robustness evaluation benchmark dataset for Indonesian-Chinese translation. This dataset is automatically translated into Chinese using four NLLB-200 models of different sizes. We conduct both automatic and human evaluations. Our in-depth analysis reveals the correlations between translation error types and the types of noise present, how these correlations change across different model sizes, and the relationships between automatic evaluation indicators and human evaluation indicators. The dataset is publicly available at https://github.com/tjunlp-lab/ID-ZH-MTRobustEval.

An Evaluation of Croatian ASR Models for Čakavian Transcription
Shulin Zhang | John Hale | Margaret Renwick | Zvjezdana Vrzić | Keith Langston

To assist in the documentation of Čakavian, an endangered language variety closely related to Croatian, we test four currently available ASR models that are trained with Croatian data and assess their performance in the transcription of Čakavian audio data. We compare the models’ word error rates, analyze the word-level error types, and showcase the most frequent Deletion and Substitution errors. The evaluation results indicate that the best-performing system for transcribing Čakavian was a CTC-based variant of the Conformer model.

An Event-based Abductive Learning for Hard Time-sensitive Question Answering
Shaojuan Wu | Jitong Li | Xiaowang Zhang | Zhiyong Feng

Time-Sensitive Question Answering (TSQA) is the task of answering questions qualified for a certain timestamp based on the given document. It is split into easy and hard modes depending on whether the document contains the time qualifiers mentioned in the question. While existing models have performed well on easy mode, their performance is significantly reduced when answering hard time-sensitive questions, whose time qualifiers are implicit in the document. An intuitive idea is to match temporal events in the given document by treating a time-sensitive question as a temporal event with missing objects. However, not all temporal events extracted from the document have explicit time qualifiers. In this paper, we propose an Event-AL framework, in which a graph pruning model is designed to locate the timespan of implicit temporal events by capturing temporal relations between events. Moreover, we present an abductive reasoning module to determine proper objects while providing explanations. Besides, as the same relation may be scattered throughout the document in diverse expressions, a relation-based prompt is introduced to instruct LLMs in extracting candidate temporal events. We conduct extensive experiments, and the results show that Event-AL outperforms strong baselines for hard time-sensitive questions, with a 12.7% improvement in EM scores. In addition, it also exhibits great superiority for multi-answer and beyond hard time-sensitive questions.

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

An LCF-IDF Document Representation Model Applied to Long Document Classification
Renzo Arturo Alva Principe | Nicola Chiarini | Marco Viviani

A document representation model that has been used for years in NLP and Text Mining tasks is TF-IDF (Term Frequency-Inverse Document Frequency). This model is indeed effective for various tasks like Information Retrieval and Document Classification. However, it may fall short when it comes to capturing the deeper semantic and contextual meaning of a text, which is where Transformer-based Pre-trained Language Models (PLMs) such as BERT have been gaining significant traction in recent years. Despite this, these models also face specific challenges related to Transformers and their attention mechanism limits, especially when dealing with long documents. Therefore, this paper proposes a novel approach to exploit the advantages of the TF-IDF representation while incorporating semantic context, by introducing a Latent Concept Frequency-Inverse Document Frequency (LCF-IDF) document representation model. Its effectiveness is tested with respect to the Long Document Classification task. The results obtained show promising performance of the proposed solution compared to TF-IDF and BERT-like representation models, including those specifically for long documents such as Longformer as well as those designed for particular domains, especially when it comes to Single Label Multi-Class (SLMC) classification.
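
For readers unfamiliar with the TF-IDF baseline named above, here is a minimal sketch of one common variant (relative term frequency multiplied by the natural log of inverse document frequency). It illustrates only the baseline weighting, not the proposed LCF-IDF model, whose details the abstract does not give; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for tokenized documents.

    documents: list of token lists.
    Returns one {term: weight} dict per document, where
    weight = (count / document length) * ln(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document it appears in.
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["long", "document", "classification"],
    ["long", "document", "retrieval"],
]
weights = tf_idf(docs)
```

Terms shared by every document ("long", "document") receive weight zero under this variant, while terms unique to one document carry all of the signal. Libraries often apply smoothing to avoid such zeros.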

An LLM-Enhanced Adversarial Editing System for Lexical Simplification
Keren Tan | Kangyang Luo | Yunshi Lan | Zheng Yuan | Jinlong Shu

Lexical Simplification (LS) aims to simplify text at the lexical level. Existing methods rely heavily on annotated data, making them challenging to apply in low-resource scenarios. In this paper, we propose a novel LS method without parallel corpora. This method employs an Adversarial Editing System with guidance from a confusion loss and an invariance loss to predict lexical edits in the original sentences. Meanwhile, we introduce an innovative LLM-enhanced loss to enable the distillation of knowledge from Large Language Models (LLMs) into a small-size LS system. From that, complex words within sentences are masked and a Difficulty-aware Filling module is crafted to replace masked positions with simpler words. Finally, extensive experimental results and analyses on three benchmark LS datasets demonstrate the effectiveness of our proposed method.

Monitoring the threat landscape to be aware of actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed using natural language reports. Natural language processing can help with managing this large amount of unstructured information, yet to date, the topic has received little attention. With this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and the MITRE ATT&CK knowledge base, the most widely-used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out-of-context; our dataset annotates entire documents in a much finer-grained way. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.

Annotate Chinese Aspect with UMR——a Case Study on the Little Prince
Sijia Ge | Zilong Li | Alvin Po-Chun Chen | Guanchao Wang

Aspect is a valuable tool for determining the perspective from which an event is observed, allowing for viewing both at the situation and viewpoint level. Uniform Meaning Representation (UMR) seeks to provide a standard, typologically-informed representation of aspects across languages. It employs an aspectual lattice to adapt to different languages and design values that encompass both viewpoint aspect and situation aspects. In annotating the Chinese version of The Little Prince, we paid particular attention to the interactions between aspect values and aspect markers, and we also examined annotation effectiveness and challenges under the UMR aspectual lattice. During our annotation process, we identified the relationships between aspectual markers and labels. We further categorized and analyzed complex examples that led to low inter-annotator agreement. The factors contributing to disagreement among annotators included the interpretations of lexical semantics, implications, and the influence of aspectual markers, which is related to the inclination of the situation aspect and the exclusivity between the two aspects’ perspectives. Overall, our work sheds light on the challenges of aspect annotation in Chinese and highlights the need for more comprehensive guidelines.

Annotate the Way You Think: An Incremental Note Generation Framework for the Summarization of Medical Conversations
Longxiang Zhang | Caleb D. Hart | Susanne Burger | Thomas Schaaf

The scarcity of public datasets for the summarization of medical conversations has been a limiting factor for advancing NLP research in the healthcare domain, and the structure of the existing data is largely limited to the simple format of conversation-summary pairs. We therefore propose a novel Incremental Note Generation (ING) annotation framework capable of greatly enriching summarization datasets in the healthcare domain and beyond. Our framework is designed to capture the human summarization process via an annotation task by instructing the annotators to first incrementally create a draft note as they accumulate information through a conversation transcript (Generation) and then polish the draft note into a reference note (Rewriting). The annotation results include both the reference note and a comprehensive editing history of the draft note in tabular format. Our pilot study on the task of SOAP note generation showed reasonable consistency between four expert annotators, established a solid baseline for quantitative targets of inter-rater agreement, and demonstrated the ING framework as an improvement over the traditional annotation process for future modeling of summarization.

Annotating Chinese Word Senses with English WordNet: A Practice on OntoNotes Chinese Sense Inventories
Hongzhi Xu | Jingxia Lin | Sameer Pradhan | Mitchell Marcus | Ming Liu

In this paper, we present our exploration of annotating Chinese word senses using English WordNet synsets, with examples extracted from OntoNotes Chinese sense inventories. Given a target word along with the example that contains it, the annotators select a WordNet synset that best describes the meaning of the target word in the context. The result demonstrates an inter-annotator agreement of 38% between two annotators. We delve into the instances of disagreement by comparing the two annotated synsets, including their positions within the WordNet hierarchy. The examination reveals intriguing patterns among closely related synsets, shedding light on similar concepts represented within the WordNet structure. The data offers an indirect linking of Chinese word senses defined in OntoNotes Chinese sense inventories to WordNet synsets, and thus promotes the value of the OntoNotes corpus. Compared to a direct linking of Chinese word senses to WordNet synsets, the example-based annotation has the merit of not being affected by inaccurate sense definitions and thus offers a new way of mapping WordNets of different languages. At the same time, the annotated data also serves as a valuable linguistic resource for exploring potential lexical differences between English and Chinese, with potential contributions to the broader understanding of cross-linguistic semantic mapping.

Annotating Customer-Oriented Behaviour in Call Centre Sales Dialogues
Jutta Stock | Volha Petukhova | Dietrich Klakow

Customer-oriented behaviour (COB) plays an important role in call centre interactions, particularly in the context of successful sales negotiation. However, the evaluation of COB in customer-agent conversations often lacks clarity in its definition and robust computational assessment methods. This paper addresses these challenges by presenting a comprehensive conceptual and empirical framework. We conducted multidimensional dialogue act annotations on authentic call centre interactions using the ISO 24617-2 taxonomy, capturing the multifaceted nature of these interactions. This process led to the identification of relevant dialogue act categories, proposed extensions concerning relationship-building aspects, and derived corpus statistics. The findings highlight specific facets of COB that positively impact Customer Satisfaction (CS), as determined through correlation analysis. Additionally, we delved into the dependencies between COB and feedback acts, leveraging the hierarchical structure of the DIT++ model. This framework improves our understanding of the dynamics shaping sales strategies in call centres and holds promise for practical applications in optimising customer-agent interactions.

Annotation and Classification of Relevant Clauses in Terms-and-Conditions Contracts
Pietro Giovanni Bizzaro | Elena Della Valentina | Maurizio Napolitano | Nadia Mana | Massimo Zancanaro

In this paper, we propose a new annotation scheme to classify different types of clauses in Terms-and-Conditions contracts with the ultimate goal of supporting legal experts to quickly identify and assess problematic issues in this type of legal documents. To this end, we built a small corpus of Terms-and-Conditions contracts and finalized an annotation scheme of 14 categories, eventually reaching an inter-annotator agreement of 0.92. Then, for 11 of them, we experimented with binary classification tasks using few-shot prompting with a multilingual T5 and two fine-tuned versions of two BERT-based LLMs for Italian. Our experiments showed the feasibility of automatic classification of our categories by reaching accuracies ranging from .79 to .95 on validation tasks.

Annotation of Japanese Discourse Relations Focusing on Concessive Inferences
Ai Kubota | Takuma Sato | Takayuki Amamoto | Ryota Akiyoshi | Koji Mineshima

In this study, we focus on the inference presupposed in the concessive discourse relation and present the discourse relation annotation for the Japanese connectives ‘nagara’ and ‘tsutsu’, both of which have two usages: Synchronous and Concession, just like English while. We also present the annotation for ‘tokorode’, which is ambiguous in three ways: Temporal, Location, and Concession. While corpora containing concessive discourse relations already exist, the distinctive feature of our study is that it aims to identify the concessive inferential relations by writing out the implicit presupposed inferences. In this paper, we report on the annotation methodology and its results, as well as the characteristics of concession that became apparent during annotation.

Few speech resources describe interruption phenomena, especially for TV and media content. The description of these phenomena may vary across authors: it thus leaves room for improved annotation protocols. We present an annotation of Transition-Relevance Places (TRP) and Floor-Taking event types on an existing French TV and Radio broadcast corpus to facilitate studies of interruptions and turn-taking. Each speaker change is annotated with the presence or absence of a TRP, and a classification of the next-speaker floor-taking as Smooth, Backchannel or different types of turn violations (cooperative or competitive, successful or attempted interruption). An inter-rater agreement analysis shows such annotations’ moderate to substantial reliability. The inter-annotator agreement for TRP annotation reaches κ=0.75, κ=0.56 for Backchannel and κ=0.5 for the Interruption/non-interruption distinction. More precise differences linked to cooperative or competitive behaviors lead to lower agreements. These results underline the importance of low-level features like TRP to derive a classification of turn changes that would be less subject to interpretation. The analysis of the presence of overlapping speech highlights the existence of interruptions without overlaps and smooth transitions with overlaps. These annotations are available at https://lium.univ-lemans.fr/corpus-allies/.
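
The κ values reported above are chance-corrected agreement scores. As a minimal sketch, assuming two annotators and Cohen's κ (the abstract does not say which κ variant or how many raters were used), the statistic can be computed as follows; the function name `cohen_kappa` and the toy labels are hypothetical and not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: 1 = TRP present, 0 = TRP absent at a speaker change.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

On this toy data the annotators agree on 6 of 8 items, but because both use label 1 most of the time, a large part of that agreement is expected by chance, which is why κ is well below the raw agreement rate.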

Annotations for Exploring Food Tweets from Multiple Aspects
Matiss Rikters | Rinalds Vīksna | Edison Marrese-Taylor

This research builds upon the Latvian Twitter Eater Corpus (LTEC), which is focused on the narrow domain of tweets related to food, drinks, eating and drinking. LTEC has been collected for more than 12 years and has reached almost 3 million tweets, with basic information as well as extended automatically and manually annotated metadata. In this paper we supplement the LTEC with manually annotated subsets of evaluation data for machine translation, named entity recognition, timeline-balanced sentiment analysis, and text-image relation classification. We experiment with each of the data sets using baseline models and highlight future challenges for various modelling approaches.

Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Oana Ignat | Longju Bai | Joan C. Nwatu | Rada Mihalcea

Current foundation models have shown impressive performance across various tasks. However, several studies have revealed that these models are not effective for everyone due to the imbalanced geographical and economic representation of the data used in the training process. Most of this data comes from Western countries, leading to poor results for underrepresented countries. To address this issue, more data needs to be collected from these countries, but the cost of annotation can be a significant bottleneck. In this paper, we propose methods to identify the data to be annotated to balance model performance and annotation costs. Our approach first involves finding the countries with images of topics (objects and actions) most visually distinct from those already in the training datasets used by current large vision-language foundation models. Next, we identify countries with higher visual similarity for these topics and show that using data from these countries to supplement the training data improves model performance and reduces annotation costs. The resulting lists of countries and corresponding topics are made available at https://github.com/MichiganNLP/visual_diversity_budget.

AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech Technologies
José-M. Acosta-Triana | David Gimeno-Gómez | Carlos-D. Martínez-Hinarejos

More than 7,000 known languages are spoken around the world. However, due to the lack of annotated resources, only a small fraction of them are currently covered by speech technologies. Although self-supervised speech representations, recent massive speech corpora collections, and the organization of challenges have alleviated this inequality, most studies are mainly benchmarked on English. This situation is aggravated when tasks involving both acoustic and visual speech modalities are addressed. In order to promote research on low-resource languages for audio-visual speech technologies, we present AnnoTheia, a semi-automatic annotation toolkit that detects when a person speaks on the scene and the corresponding transcription. In addition, to show the complete process of preparing AnnoTheia for a language of interest, we also describe the adaptation of a pre-trained model for active speaker detection to Spanish, using a database not initially conceived for this type of task. Prior evaluations show that the toolkit is able to speed up the annotation process by up to four times. The AnnoTheia toolkit, tutorials, and pre-trained models are available at https://github.com/joactr/AnnoTheia/.

Announcing the Prague Discourse Treebank 3.0
Pavlína Synková | Jiří Mírovský | Lucie Poláková | Magdaléna Rysová

We present the Prague Discourse Treebank 3.0 – a new version of the annotation of discourse relations marked by primary and secondary discourse connectives in the data of the Prague Dependency Treebank. Compared to the previous version (PDiT 2.0), version 3.0 comes with three types of major updates: (i) it brings a largely revised annotation of discourse relations: pragmatic relations have been thoroughly reworked, many inconsistencies across all discourse types have been fixed and previously unclear cases marked in annotators’ comments have been resolved, (ii) it achieves consistency with a Lexicon of Czech Discourse Connectives (CzeDLex), and (iii) it provides the data not only in its native format (Prague Markup Language, discourse relations annotated at the top of the dependency trees), but also in the Penn Discourse Treebank 3.0 format (plain text plus a stand-off discourse annotation) and sense taxonomy. PDiT 3.0 contains 21,662 discourse relations (plus 445 list relations) in 49 thousand sentences.

Medical imaging is critical to the diagnosis, surveillance, and treatment of many health conditions, including oncological, neurological, cardiovascular, and musculoskeletal disorders, among others. Radiologists interpret these complex, unstructured images and articulate their assessments through narrative reports that remain largely unstructured. This unstructured narrative must be converted into a structured semantic representation to facilitate secondary applications such as retrospective analyses or clinical decision support. Here, we introduce the Corpus of Annotated Medical Imaging Reports (CAMIR), which includes 609 annotated radiology reports from three imaging modality types: Computed Tomography, Magnetic Resonance Imaging, and Positron Emission Tomography-Computed Tomography. Reports were annotated using an event-based schema that captures clinical indications, lesions, and medical problems. Each event consists of a trigger and multiple arguments, and a majority of the argument types, including anatomy, normalize the spans to pre-defined concepts to facilitate secondary use. CAMIR uniquely combines a granular event structure and concept normalization. To extract CAMIR events, we explored two BERT (Bi-directional Encoder Representation from Transformers)-based architectures, including an existing architecture (mSpERT) that jointly extracts all event information and a multi-step approach (PL-Marker++) that we augmented for the CAMIR schema.

A Novel Three-stage Framework for Few-shot Named Entity Recognition
Shengjie Ji | Fang Kong

Different from most existing tasks relying on abundant labeled data, Few-shot Named Entity Recognition (NER) aims to develop NER systems that are capable of learning from a small set of labeled samples and then generalizing well to new, unseen data. In this paper, with the intention of obtaining a model that can better adapt to new domains, we design a novel three-stage framework for Few-shot NER, including teacher span recognizer, student span recognizer and entity classifier. We first train a teacher span recognizer which is based on a global boundary matrix to obtain soft boundary labels. Then we leverage the soft boundary labels learned by the teacher model to assist in training the student span recognizer, which can smooth the training process of span recognizer. Finally, we adopt the traditional prototypical network as entity classifier and incorporate the idea of prompt learning to construct a more generalizable semantic space. Extensive experiments on various benchmarks demonstrate that our approach surpasses prior methods.
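
The entity classifier mentioned above builds on the prototypical network idea: each class is represented by the mean of its support embeddings, and a query is assigned to the nearest prototype. The following is a minimal sketch of that general idea only, assuming Euclidean distance and hand-written 2-dimensional embeddings; it does not reproduce the paper's span recognizers or prompt learning, and the names `build_prototypes` and `classify` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_prototypes(support):
    """support maps each label to the embeddings of its few labelled spans."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-shot support set with toy embeddings.
support = {
    "PER": [[1.0, 0.0], [0.8, 0.2]],
    "LOC": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
predicted = classify([0.7, 0.3], prototypes)
print(predicted)
```

In a real system the embeddings would come from a trained encoder; the nearest-prototype step itself stays this simple, which is what makes the approach attractive when only a handful of labelled examples per class are available.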

Argument mining aims to detect all possible argumentative components and identify their relationships automatically. As argument mining is a thriving task in natural language processing, a large number of corpora have been built for academic study and application development in this field. However, the research in this area is still constrained by the inherent limitations of existing datasets. Specifically, all the publicly available datasets are relatively small in scale, and few of them provide information from other modalities to facilitate the learning process. Moreover, the statements and expressions in these corpora are usually in a compact form, which restricts the generalization ability of models. To this end, we collect a novel dataset AntCritic to serve as a helpful complement to this area, which consists of about 10k free-form and visually-rich financial comments and supports both argument component detection and argument relation prediction tasks. Besides, to cope with the challenges brought by scenario expansion, we thoroughly explore the fine-grained relation prediction and structure reconstruction scheme and discuss the encoding mechanism for visual styles and layouts. On this basis, we design two simple but effective model architectures and conduct various experiments on this dataset to provide benchmark performances as a reference and verify the practicability of our proposed architecture. We release our data and code in this link, and this dataset is released under the CC BY-NC-ND 4.0 license.

An Unsupervised Framework for Adaptive Context-aware Simplified-Traditional Chinese Character Conversion
Wei Li | Shutan Huang | Yanqiu Shao

Traditional Chinese characters are an important carrier of Chinese culture and are still actively used in many areas. Automatic conversion between traditional and simplified Chinese characters can help modern people understand traditional culture and facilitate communication among different regions. Previous conversion methods rely on rule-based mapping or shallow feature-based machine learning models, which struggle to convert simplified characters with different origins, and constructing training data for them is costly. In this study, we propose an unsupervised adaptive context-aware conversion model that learns to convert between simplified and traditional Chinese characters under a denoising auto-encoder framework requiring no labeled data. Our model includes a Latent Generative Adversarial Encoder that transforms vectors to a latent space with a generative adversarial network, which adds noise as an inevitable side effect. Based on this, a Context-aware Semantic Reconstruction Decoder restores the original input while considering a broader range of context with a pretrained language model. Additionally, we propose to apply an early exit mechanism during inference to reduce the computation complexity and improve the generalization ability. To test the effectiveness of our model, we construct a high quality test dataset with simplified-traditional Chinese character text pairs. Experiment results and extensive analysis demonstrate that our model outperforms strong unsupervised baselines and yields better conversion results for one-to-many cases.

Preprocessing tasks such as tokenization and sentence boundary detection (SBD) have commonly been considered NLP challenges that have already been solved. This perception is due to their generally good performance and the presence of pre-tokenized data. However, it’s important to note that the low error rates of current methods are mainly specific to certain tasks, and rule-based tokenization can be difficult to use across different systems. Despite being subtle, these limitations are significant in the context of the NLP pipeline. In this paper, we introduce a novel evaluation algorithm for the preprocessing task, including both tokenization and SBD results. This algorithm aims to enhance the reliability of evaluations by reevaluating the counts of true positive cases for F1 measures in both preprocessing tasks jointly. It achieves this through an alignment-based approach inspired by sentence and word alignments used in machine translation. Our evaluation algorithm not only allows for precise counting of true positive tokens and sentence boundaries but also combines these two evaluation tasks into a single organized pipeline. To illustrate and clarify the intricacies of this calculation and integration, we provide detailed pseudo-code configurations for implementation. Additionally, we offer empirical evidence demonstrating how sentence and word alignment can improve evaluation reliability and present case studies to further support our approach.
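
To illustrate why counting true positives for tokenization needs some form of alignment, here is a minimal sketch that is not the paper's algorithm: it swaps in a much simpler technique, aligning gold and predicted tokens by their character offsets in the whitespace-free text and counting a true positive only when both boundaries match. The function names and the example sentence are hypothetical.

```python
def char_spans(tokens):
    """Map tokens to (start, end) offsets in the whitespace-free text."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def token_f1(gold_tokens, pred_tokens):
    """Precision, recall and F1 over tokens whose offsets match exactly."""
    if "".join(gold_tokens) != "".join(pred_tokens):
        raise ValueError("both tokenizations must cover the same characters")
    gold, pred = char_spans(gold_tokens), char_spans(pred_tokens)
    tp = len(gold & pred)  # tokens with identical start and end offsets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the two tokenizers disagree on "Dr." and "didn't".
gold = ["Dr.", "Smith", "did", "n't", "go", "."]
pred = ["Dr", ".", "Smith", "didn't", "go", "."]
precision, recall, f1 = token_f1(gold, pred)
print(precision, recall, f1)
```

Comparing the two lists position by position would score only the final two tokens as correct, because the early split of "Dr." shifts every later index. Offset-based matching recovers "Smith" as well. This sketch assumes both tokenizers preserve the original characters; handling normalized or rewritten tokens is where a fuller alignment method becomes necessary.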

Machine Translation (MT) has greatly advanced over the years due to the developments in deep neural networks. However, the emergence of Large Language Models (LLMs) like GPT-4 and ChatGPT is introducing a new phase in the MT domain. In this context, we believe that the future of MT is intricately tied to the capabilities of LLMs. These models not only offer vast linguistic understandings but also bring innovative methodologies, such as prompt-based techniques, that have the potential to further elevate MT. In this paper, we provide an overview of the significant enhancements in MT that are influenced by LLMs and advocate for their pivotal role in upcoming MT research and implementations. We highlight several new MT directions, emphasizing the benefits of LLMs in scenarios such as Long-Document Translation, Stylized Translation, and Interactive Translation. Additionally, we address the important concern of privacy in LLM-driven MT and suggest essential privacy-preserving strategies. By showcasing practical instances, we aim to demonstrate the advantages that LLMs offer, particularly in tasks like translating extended documents. We conclude by emphasizing the critical role of LLMs in guiding the future evolution of MT and offer a roadmap for future exploration in the sector.

A Persona-Based Corpus in the Diabetes Self-Care Domain - Applying a Human-Centered Approach to a Low-Resource Context
Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Alves

While Natural Language Processing (NLP) models have gained substantial attention, only in recent years has research opened new paths for tackling Human-Computer Design (HCD) from the perspective of natural language. We focus on developing a human-centered corpus, more specifically, a persona-based corpus in a particular healthcare domain (diabetes mellitus self-care). In order to follow an HCD approach, we created personas to model interpersonal interaction (expert and non-expert users) in that specific domain. We show that an HCD approach benefits language generation from different perspectives, from machines to humans - contributing with new directions for low-resource contexts (languages other than English and sensitive domains) where the need to promote effective communication is essential.

Long-form numerical reasoning aims to generate a reasoning program to calculate the answer for a given question. Previous work followed a retriever-generator framework, where the retriever selects key facts from a long-form document, and the generator generates a reasoning program based on the retrieved facts. However, they treated all facts equally without considering the different contributions of facts with and without numerical information. Furthermore, they ignored program consistency, leading to the wrong punishment of programs that differed from the ground truth. In order to address these issues, we propose APOLLO (An optimized training aPproach fOr Long-form numericaL reasOning) to improve long-form numerical reasoning. APOLLO includes a number-aware negative sampling strategy for the retriever to discriminate key numerical facts, and a consistency-based reinforcement learning with target program augmentation for the generator to ultimately increase the execution accuracy. Experimental results on the FinQA and ConvFinQA leaderboards verify the effectiveness of our proposed methods, achieving a new state of the art.

Applying Transfer Learning to German Metaphor Prediction
Maria Berger | Nieke Kiwitt | Sebastian Reimann

This paper presents results in transfer-learning metaphor recognition in German. Starting from an English language corpus annotated for metaphor at the sentence level, and its machine-translation to German, we annotate 1000 sentences of the German part to use it as a Gold standard for two different metaphor prediction setups: i) a sequence labeling setup (on the token level), and ii) a classification (based on sentences) setup. We test two transfer learning approaches: i) a group of transformer models, and ii) a technique that utilizes bilingual embeddings together with an RNN classifier. We find that the transformer models perform moderately in a zero-shot scenario (up to 61% F1 for classification) and the embedding approaches do not even beat the guessing baseline (36% F1 for classification). We use our Gold data to fine-tune the classification tasks on target-language data, achieving up to 90% F1 with both the multilingual BERT and the bilingual embeddings. We also publish the annotated bilingual corpus.

Empathy is essential in healthcare communication. We introduce an annotation approach that draws on well-established frameworks for clinical empathy and breaking bad news (BBN) conversations for considering the interactive dynamics of discourse relations. We construct Empathy in BBNs, a span-relation task dataset of simulated BBN conversations in German, using our annotation scheme, in collaboration with a large medical school to support research on educational tools for medical didactics. The annotation is based on 1) Pounds (2011)’s appraisal framework for clinical empathy, which is grounded in systemic functional linguistics, and 2) the SPIKES protocol for breaking bad news (Baile et al., 2000), commonly taught in medical didactics training. This approach presents novel opportunities to study clinical empathic behavior and enables the training of models to detect causal relations involving empathy, a highly desirable feature of systems that can provide feedback to medical professionals in training. We present illustrative examples, discuss applications of the annotation scheme, and insights we can draw from the framework.

Approaches and Challenges for Resolving Different Representations of Fictional Characters for Chinese Novels
Li Song | Ying Liu

Due to the huge scale of literary works, automatic text analysis technologies are urgently needed for literary studies such as Digital Humanities. However, the domain-generality of existing NLP technologies limits their effectiveness on in-depth literary studies. It is valuable to explore how to adapt NLP technologies to the literary-specific tasks. Fictional characters are the most essential elements of a novel, and thus crucial to understanding the content of novels. The prerequisite of collecting a character’s information is to resolve its different representations. It is a specific problem of anaphora resolution which is a classical and open-domain NLP task. We adapt a state-of-the-art anaphora resolution model to resolve character representations in Chinese novels by making some modifications, and train a widely used BERT fine-tuned model for speaker extraction as assistance. We also analyze the challenges and potential solutions for character-resolution in Chinese novels according to the resolution results on a specific Chinese novel.

A Preliminary Study of ChatGPT for Spanish E2R Text Adaptation
Margot Madina | Itziar Gonzalez-Dios | Melanie Siegel

The process of adapting and creating Easy-to-Read (E2R) texts is very expensive and time-consuming. Due to the success of Large Language Models (LLMs) such as ChatGPT and their ability to generate written language, it is reasonable to think that such models can help in the adaptation or creation of text in E2R. In this paper, we explore the concept of E2R, its underlying principles and applications, and provide a preliminary study on the usefulness of ChatGPT-4 for E2R text adaptation. We focus on the Spanish language and its E2R variant, Lectura Fácil (LF). We consider a range of prompts that can be used and the differences in output that they produce. We then carry out a three-fold evaluation on 10 texts adapted by ChatGPT-4: (1) an automated evaluation to check values related to the readability of texts, (2) a checklist-based manual evaluation (for which we also propose three new capabilities) and (3) a users’ evaluation with people with cognitive disabilities. We show that it is difficult to choose the best prompt to make ChatGPT-4 adapt texts to LF. Furthermore, the generated output does not follow the E2R text rules, so it is often not suitable for the target audience.

A Quantum-Inspired Matching Network with Linguistic Theories for Metaphor Detection
Wenbo Qiao | Peng Zhang | ZengLai Ma

Equipping machines with the capability to recognize and comprehend metaphors is a crucial step toward achieving artificial intelligence. In linguistic theories, metaphor can be identified through Metaphor Identification Procedure (MIP) or Selectional Preference Violation (SPV), both of which are typically considered as matching tasks in the field of natural language processing. However, the implementation of MIP poses a challenge due to the semantic uncertainty and ambiguity of literal meanings of words. Simultaneously, SPV often struggles to recognize conventional metaphors. Inspired by Quantum Language Model (QLM) for modeling semantic uncertainty and fine-grained feature matching, we propose a quantum-inspired matching network for metaphor detection. Specifically, we use the density matrix to explicitly characterize the literal meanings of the target word for MIP, in order to model the uncertainty and ambiguity of the literal meanings of words. This can make SPV effective even in the face of conventional metaphors. MIP and SPV are then achieved by fine-grained feature matching. Experimental results demonstrate that our approach is highly competitive.

Arabic diacritic recovery, i.e., diacritization, is necessary for proper vocalization and is an enabler for downstream applications such as language learning and text to speech. Diacritics come in two varieties, namely core-word diacritics and case endings. In this paper we introduce a highly effective morphologically informed character-level model that can recover both types of diacritics simultaneously. The model uses a Recurrent Neural Network (RNN) based architecture that takes in text as a sequence of characters, with markers for morphological segmentation, and outputs a sequence of diacritics. We also introduce a character-based morphological segmentation model that we train for Modern Standard Arabic (MSA) and dialectal Arabic. We demonstrate the efficacy of our diacritization model on Classical Arabic, MSA, and two dialectal (Moroccan and Tunisian) texts. We achieve the lowest reported word-level diacritization error rate for MSA (3.4%), match the best results for Classical Arabic (5.4%), and report competitive results for dialectal Arabic.

Arbitrary Time Information Modeling via Polynomial Approximation for Temporal Knowledge Graph Embedding
Zhiyu Fang | Jingyan Qin | Xiaobin Zhu | Chun Yang | Xu-Cheng Yin

Distinguished from traditional knowledge graphs (KGs), temporal knowledge graphs (TKGs) must explore and reason over temporally evolving facts adequately. However, existing TKG approaches still face two main challenges, i.e., the limited capability to model arbitrary timestamps continuously and the lack of rich inference patterns under temporal constraints. In this paper, we propose an innovative TKGE method (PTBox) via polynomial decomposition-based temporal representation and box embedding-based entity representation to tackle the above-mentioned problems. Specifically, we decompose time information by polynomials and then enhance the model’s capability to represent arbitrary timestamps flexibly by incorporating the learnable temporal basis tensor. In addition, we model every entity as a hyperrectangle box and define each relation as a transformation on the head and tail entity boxes. The entity boxes can capture complex geometric structures and learn robust representations, improving the model’s inductive capability for rich inference patterns. Theoretically, our PTBox can encode arbitrary time information or even unseen timestamps while capturing rich inference patterns and higher-arity relations of the knowledge base. Extensive experiments on real-world datasets demonstrate the effectiveness of our method.

ARBRES Kenstur: A Breton-French Parallel Corpus Rooted in Field Linguistics
Loïc Grobol | Mélanie Jouitteau

ARBRES is an ongoing project of open science implemented as a platform (“wikigrammar”) documenting both the Breton language itself and the state of research and engineering work in linguistics and NLP. Over its nearly 15 years of operation, it has aggregated a wealth of linguistic data in the form of interlinear glosses with translations illustrating lexical items, grammatical features, and dialectal variations. While these glosses were primarily meant for human consumption, their volume and the regular format imposed by the wiki engine used for the website also make them suitable for machine processing. ARBRES Kenstur is a new parallel corpus derived from the glosses in ARBRES, including about 5k phrases and sentences in Breton along with translations in standard French. The nature of the original data (sourced from field linguistic inquiries meant to document the structure of Breton) leads to a resource that is mechanically more concerned with the internal variations of the language and rare phenomena than typical parallel corpora. Preliminary experiments in using this corpus show that it can help improve machine translation for Breton, demonstrating that sourcing data from field linguistic documentation can be a way to help provide NLP tools for minority and low-resource languages.

A Regularization-based Transfer Learning Method for Information Extraction via Instructed Graph Decoder
Kedi Chen | Jie Zhou | Qin Chen | Shunyu Liu | Liang He

Information extraction (IE) aims to extract complex structured information from the text. Numerous datasets have been constructed for various IE tasks, leading to time-consuming and labor-intensive data annotations. Nevertheless, most prevailing methods focus on training task-specific models, while the common knowledge among different IE tasks is not explicitly modeled. Moreover, the same phrase may have inconsistent labels in different tasks, which poses a big challenge for knowledge transfer using a unified model. In this study, we propose a regularization-based transfer learning method for IE (TIE) via an instructed graph decoder. Specifically, we first construct an instruction pool for datasets from all well-known IE tasks, and then present an instructed graph decoder, which decodes various complex structures into a graph uniformly based on corresponding instructions. In this way, the common knowledge shared with existing datasets can be learned and transferred to a new dataset with new labels. Furthermore, to alleviate the label inconsistency problem among various IE tasks, we introduce a task-specific regularization strategy, which does not update the gradients of two tasks with ‘opposite direction’. We conduct extensive experiments on 12 datasets spanning four IE tasks, and the results demonstrate the great advantages of our proposed method.

Due to the lack of parallel data, the mainstream fine-tuning-based domain adaptation methods suffer from overfitting when translating low-resource domains, and it is difficult for the model to learn in-domain generalization knowledge. To address this issue, we propose a novel Reinforcement Learning Domain Adaptation method for Neural Machine Translation (RLDA-NMT) in the low-resource domain. RLDA-NMT utilizes in-domain source monolingual data to make up for the lack of parallel data, and reinforces domain feature learning to make the translation model learn the domain-specific knowledge more fully. Specifically, we first train a ranking-based model with a small-scale in-domain parallel corpus, and then adopt it as the reward model to select higher-quality generated translations for reinforcement when fine-tuning the pre-trained NMT model using in-domain source monolingual data. We conduct experiments on Education, Laws, Thesis, and Patent domains of Chinese⇔English translation tasks. Experimental results demonstrate that RLDA-NMT can alleviate overfitting and reinforce the NMT model to learn domain-specific knowledge. Additionally, the results show that RLDA-NMT and back-translation (BT) are complementary: combining RLDA-NMT with BT can further improve translation quality.

Are Large Language Models Good at Lexical Semantics? A Case of Taxonomy Learning
Viktor Moskvoretskii | Alexander Panchenko | Irina Nikishina

Recent studies on LLMs do not pay enough attention to linguistic and lexical semantic tasks, such as taxonomy learning. In this paper, we explore the capacities of Large Language Models, featuring LLaMA-2 and Mistral, for several taxonomy-related tasks. We introduce a new methodology and algorithm for data collection via stochastic graph traversal, leading to controllable data collection. Collected cases provide the ability to form nearly any type of graph operation. We test the collected dataset for learning taxonomy structure based on English WordNet and compare different input templates for fine-tuning LLMs. Moreover, we apply the models fine-tuned on such datasets to downstream tasks, achieving state-of-the-art results on the TexEval-2 dataset.

Are Text Classifiers Xenophobic? A Country-Oriented Bias Detection Method with Least Confounding Variables
Valentin Barriere | Sebastian Cifuentes

Classical bias detection methods used in Machine Learning are themselves biased because of the different confounding variables implied in the assessment of the initial biases. First, they use templates that are syntactically simple and distant from the target data on which the model will be deployed. Second, current methods assess biases in pre-trained language models or in datasets, but not directly on the fine-tuned classifier that can actually produce harms. We propose a simple method to detect the biases of a specific fine-tuned classifier on any type of unlabeled data. The idea is to study the classifier behavior by creating counterfactual examples directly on the target data distribution and quantifying the amount of change. In this work, we focus on named entity perturbations by applying Named Entity Recognition to target-domain data and replacing the detected entities with the most common names or locations of a target group (gender and country), and we do this for several morphosyntactically different languages spoken in the countries of the target groups. We used our method on two open-source models that are likely to be deployed by industry, and on two tasks and domains. We first assess the bias of an open-source multilingual sentiment analysis model trained on tweets in multiple languages, and then a multilingual stance recognition model trained over several languages and assessed on English. Finally, we propose to link the perplexity of each example with the bias of the model, by looking at the change in label distribution with respect to the language of the target group. Our work offers a fine-grained analysis of the interactions between names and languages, revealing significant biases in multilingual models.

The computational treatment of arguments on controversial issues has been subject to extensive NLP research, due to its envisioned impact on opinion formation, decision making, writing education, and the like. A critical task in any such application is the assessment of an argument’s quality - but it is also particularly challenging. In this position paper, we start from a brief survey of argument quality research, where we identify the diversity of quality notions and the subjectiveness of their perception as the main hurdles towards substantial progress on argument quality assessment. We argue that the capabilities of instruction-following large language models (LLMs) to leverage knowledge across contexts enable a much more reliable assessment. Rather than just fine-tuning LLMs towards leaderboard chasing on assessment tasks, they need to be instructed systematically with argumentation theories and scenarios as well as with ways to solve argument-related problems. We discuss the real-world opportunities and ethical issues emerging thereby.

Article Classification with Graph Neural Networks and Multigraphs
Khang Ly | Yury Kashnitsky | Savvas Chamezopoulos | Valeria Krzhizhanovskaya

Classifying research output into context-specific label taxonomies is a challenging and relevant downstream task, given the volume of existing and newly published articles. We propose a method to enhance the performance of article classification by enriching simple Graph Neural Network (GNN) pipelines with multi-graph representations that simultaneously encode multiple signals of article relatedness, e.g. references, co-authorship, shared publication source, shared subject headings, as distinct edge types. Fully supervised transductive node classification experiments are conducted on the Open Graph Benchmark OGBN-arXiv dataset and the PubMed diabetes dataset, augmented with additional metadata from Microsoft Academic Graph and PubMed Central, respectively. The results demonstrate that multi-graphs consistently improve the performance of a variety of GNN models compared to the default graphs. When deployed with SOTA textual node embedding methods, the transformed multi-graphs enable simple and shallow 2-layer GNN pipelines to achieve results on par with more complex architectures.

We introduce the Alternating Reading Task (ART) Corpus, a collection of dyadic sentence reading for studying the entrainment and imitation behaviour in speech communication. The ART corpus features three experimental conditions - solo reading, alternating reading, and deliberate imitation - as well as three subcorpora encompassing French-, Italian-, and Slovak-accented English. This design allows systematic investigation of speech entrainment in a controlled and less spontaneous setting. Alongside detailed transcriptions, it includes English proficiency scores, demographics, and in-experiment questionnaires for probing linguistic, personal and interpersonal influences on entrainment. Our presentation covers its design, collection, annotation processes, initial analysis, and future research prospects.

Simile tasks are challenging in natural language processing (NLP) because models require adequate world knowledge to produce predictions. In recent years, pre-trained language models (PLMs) have succeeded in NLP since they learn generic knowledge from a large corpus. The knowledge embedded in PLMs can be used for different kinds of simile tasks. However, previous work usually explored one type of simile knowledge for a specific simile task; how to fully utilize the different types of knowledge embedded in PLMs requires further exploration. This paper proposes a self-verified method for exploring simile knowledge from PLMs, which allows the PLMs to leverage one type of simile knowledge to self-validate another. To this end, we first enhance PLMs with a novel multi-level simile recognition (MLSR) task that trains PLMs to evaluate the quality of similes. Then the PLMs leverage this evaluation score to assist the simile interpretation and generation tasks. In this way, we connect different types of simile knowledge in PLMs and make better use of them. Experiments on different pre-trained models and multiple publicly available datasets show that our method works for different kinds of PLMs and can explore more accurate simile knowledge for PLMs. Our code/data will be released on GitHub.

Document-level Event Argument Extraction (DEAE) aims to identify arguments and their specific roles from an unstructured document. The advanced approaches on DEAE utilize prompt-based methods to guide pre-trained language models (PLMs) in extracting arguments from input documents. They mainly concentrate on establishing relations between triggers and entity mentions within documents, leaving two unresolved problems: a) independent modeling of entity mentions; b) document-prompt isolation. To this end, we propose a semantic mention Graph Augmented Model (GAM) to address these two problems in this paper. Firstly, GAM constructs a semantic mention graph that captures relations within and between documents and prompts, encompassing co-existence, co-reference and co-type relations. Furthermore, we introduce an ensemble graph transformer module to address mentions and their three semantic relations effectively. Later, the graph-augmented encoder-decoder module incorporates the relation-specific graph into the input embedding of PLMs and optimizes the encoder section with topology information, enhancing the relations comprehensively. Extensive experiments on the RAMS and WikiEvents datasets demonstrate the effectiveness of our approach, surpassing baseline methods and achieving a new state-of-the-art performance.

ASEM: Enhancing Empathy in Chatbot through Attention-based Sentiment and Emotion Modeling
Omama Hamad | Khaled Shaban | Ali Hamdi

Effective feature representations play a critical role in enhancing the performance of text generation models that rely on deep neural networks. However, current approaches suffer from several drawbacks, such as the inability to capture the deep semantics of language and sensitivity to minor input variations, resulting in significant changes in the generated text. In this paper, we present a novel solution to these challenges by employing a mixture of experts, multiple encoders, to offer distinct perspectives on the emotional state of the user’s utterance while simultaneously enhancing performance. We propose an end-to-end model architecture called ASEM that performs emotion analysis on top of sentiment analysis for open-domain chatbots, enabling the generation of empathetic responses that are fluent and relevant. In contrast to traditional attention mechanisms, the proposed model employs a specialized attention strategy that uniquely zeroes in on sentiment and emotion nuances within the user’s utterance. This ensures the generation of context-rich representations tailored to the underlying emotional tone and sentiment intricacies of the text. Our approach outperforms existing methods for generating empathetic embeddings, providing empathetic and diverse responses. The performance of our proposed model significantly exceeds that of existing models, enhancing emotion detection accuracy by 6.2% and lexical diversity by 1.4%. ASEM code is released at https://github.com/MIRAH-Official/Empathetic-Chatbot-ASEM.git

A Single Linear Layer Yields Task-Adapted Low-Rank Matrices
Hwichan Kim | Shota Sasaki | Sho Hoshino | Ukyo Honda

Low-Rank Adaptation (LoRA) is a widely used Parameter-Efficient Fine-Tuning (PEFT) method that updates an initial weight matrix W₀ with a delta matrix ΔW composed of two low-rank matrices A and B. A previous study suggested that there is a correlation between W₀ and ΔW. In this study, we aim to delve deeper into the relationships between W₀ and the low-rank matrices A and B to further comprehend the behavior of LoRA. In particular, we analyze a conversion matrix that transforms W₀ into low-rank matrices, which encapsulates information about the relationships. Our analysis reveals that the conversion matrices are similar across layers. Inspired by these findings, we hypothesize that a single linear layer, which takes each layer’s W₀ as input, can yield task-adapted low-rank matrices. To confirm this hypothesis, we devise a method named Conditionally Parameterized LoRA (CondLoRA) that updates initial weight matrices with low-rank matrices derived from a single linear layer. Our empirical results show that CondLoRA maintains a performance on par with LoRA, despite the fact that the trainable parameters of CondLoRA are fewer than those of LoRA. Therefore, we conclude that “a single linear layer yields task-adapted low-rank matrices.” The code used in our experiments is available at https://github.com/CyberAgentAILab/CondLoRA.
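
For readers unfamiliar with the update being analyzed, the sketch below shows the standard LoRA form W = W₀ + BA in plain Python, together with the number of trainable values that A and B contribute for one weight matrix. It illustrates ordinary LoRA only; the abstract does not give the form of CondLoRA's linear layer, so that part is not shown. The helper names (`matmul`, `lora_update`, `lora_param_count`) and the toy matrices are assumptions made for this illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product of x (n x m) and y (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_update(w0: Matrix, b: Matrix, a: Matrix) -> Matrix:
    """Return W0 + B A, where B is (d_out x r) and A is (r x d_in)."""
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable values in A and B for a single weight matrix."""
    return rank * (d_out + d_in)


# Toy example with d_out = 2, d_in = 3 and rank r = 1 (values are made up).
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0, 1.0]]
print(lora_update(w0, b, a))
print(lora_param_count(768, 768, 8))
```

Because ordinary LoRA learns a separate A and B for every adapted matrix, its trainable parameter count grows with the number of layers, which is the cost that a shared mapping from W₀ to the low-rank matrices would avoid.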

Asking and Answering Questions to Extract Event-Argument Structures
Md Nayem Uddin | Enfa Rose George | Eduardo Blanco | Steven R. Corman

This paper presents a question-answering approach to extract document-level event-argument structures. We automatically ask and answer questions for each argument type an event may have. Questions are generated using manually defined templates and generative transformers. Template-based questions are generated using predefined role-specific wh-words and event triggers from the context document. Transformer-based questions are generated using large language models trained to formulate questions based on a passage and the expected answer. Additionally, we develop novel data augmentation strategies specialized in inter-sentential event-argument relations. We use a simple span-swapping technique, coreference resolution, and large language models to augment the training instances. Our approach enables transfer learning without any corpora-specific modifications and yields competitive results with the RAMS dataset. It outperforms previous work, and it is especially beneficial to extract arguments that appear in different sentences than the event trigger. We also present detailed quantitative and qualitative analyses shedding light on the most common errors made by our best model.

AssameseBackTranslit: Back Transliteration of Romanized Assamese Social Media Text
Hemanta Baruah | Sanasam Ranbir Singh | Priyankoo Sarmah

This paper presents a novel back transliteration dataset capturing native language text originally composed in the Roman/Latin script, harvested from popular social media platforms, along with its corresponding representation in the native Assamese script. Assamese, categorized as a low-resource language within the Indo-Aryan language family and predominantly spoken in the north-east Indian state of Assam, faces a scarcity of linguistic resources. The dataset comprises a total of 60,312 Roman-native parallel transliterated sentences. This paper diverges from conventional forward transliteration datasets consisting mainly of named entities and technical terms, instead presenting a novel transliteration dataset cultivated from three prominent social media platforms, Facebook, Twitter (currently X), and YouTube, in the backward transliteration direction. The paper offers a comprehensive examination of ten state-of-the-art word-level transliteration models within the context of this dataset, encompassing transliteration evaluation benchmarks, extensive performance assessments, and a discussion of the unique challenges encountered during the processing of transliterated social media content. Our approach involves the initial use of two statistical transliteration models, followed by the training of two state-of-the-art neural network-based transliteration models, evaluation of three publicly available pre-trained models, and ultimately fine-tuning one existing state-of-the-art multilingual transliteration model along with two pre-trained large language models using the collected datasets. Notably, the Neural Transformer model outperforms all other baseline transliteration models, achieving the lowest Word Error Rate (WER) and Character Error Rate (CER), and the highest BLEU (up to 4 gram) score of 55.05, 19.44, and 69.15, respectively.
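
The WER and CER figures above are edit-distance metrics. The sketch below shows the usual definition (edit operations divided by reference length, in percent) at word and character level; the abstract does not say how the paper computes them, so this is an assumption, and the function names and example strings are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(references: List[str], hypotheses: List[str], unit: str = "word") -> float:
    """Corpus-level error rate in percent; `unit` is "word" (WER) or "char" (CER)."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * errors / total


# Made-up Latin-script strings standing in for reference and system output.
refs = ["abc de"]
hyps = ["abd de"]
print(error_rate(refs, hyps, unit="word"))
print(error_rate(refs, hyps, unit="char"))
```

A single wrong character makes the whole word count as an error, which is why WER is typically much higher than CER for transliteration output.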

Assessing Online Writing Feedback Resources: Generative AI vs. Good Samaritans
Shabnam Behzad | Omid Kashefi | Swapna Somasundaran

Providing constructive feedback on student essays is a critical factor in improving educational results; however, it presents notable difficulties and may demand substantial time investments, especially when aiming to deliver individualized and informative guidance. This study undertakes a comparative analysis of two readily available online resources for students seeking to hone their skills in essay writing for English proficiency tests: 1) essayforum.com, a widely used platform where students can submit their essays and receive feedback from volunteer educators at no cost, and 2) Large Language Models (LLMs) such as ChatGPT. By contrasting the feedback obtained from these two resources, we posit that they can mutually reinforce each other and are more helpful if employed in conjunction when seeking no-cost online assistance. The findings of this research shed light on the challenges of providing personalized feedback and highlight the potential of AI in advancing the field of automated essay evaluation.

Assessing the Capabilities of Large Language Models in Coreference: An Evaluation
Yujian Gan | Massimo Poesio | Juntao Yu

This paper offers a nuanced examination of the role Large Language Models (LLMs) play in coreference resolution, aimed at guiding the future direction in the era of LLMs. We carried out both manual and automatic analyses of different LLMs’ abilities, employing different prompts to examine their performance and obtain a comprehensive view of their strengths and weaknesses. We found that LLMs show exceptional ability in understanding coreference. However, harnessing this ability to achieve state-of-the-art results on traditional datasets and benchmarks isn’t straightforward. Given these findings, we propose that future efforts should: (1) Improve the scope, data, and evaluation methods of traditional coreference research to adapt to the development of LLMs. (2) Enhance the fine-grained language understanding capabilities of LLMs.

Assessing the Efficacy of Grammar Error Correction: A Human Evaluation Approach in the Japanese Context
Qiao Wang | Zheng Yuan

In this study, we evaluated the performance of the state-of-the-art sequence tagging grammar error detection and correction model (SeqTagger) using Japanese university students’ writing samples. With an automatic annotation toolkit, ERRANT, we first evaluated SeqTagger’s performance on error correction with human expert correction as the benchmark. Then a human-annotated approach was adopted to evaluate SeqTagger’s performance in error detection using a subset of the writing dataset. Results indicated a precision of 63.66% and a recall of 20.19% for error correction in the full dataset. For the subset, after manual exclusion of irrelevant errors such as semantic and mechanical ones, the model shows an adjusted precision of 97.98% and an adjusted recall of 42.98% for error detection, indicating the model’s high accuracy but also its conservativeness. Thematic analysis of errors undetected by the model revealed that determiners and articles, especially the latter, were predominant. Specifically, in terms of context-independent errors, the model occasionally overlooked basic ones and faced challenges with overly erroneous or complex structures. Meanwhile, context-dependent errors, notably those related to tense and noun number, as well as those possibly influenced by the students’ first language (L1), remained particularly challenging.
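
To make the precision and recall figures concrete, the sketch below computes both from two sets of edits, each edit written as a hypothetical `(start, end, correction)` tuple. This is only an illustration of the metric definitions; it does not use or reproduce ERRANT, the example edits are made up, and returning 1.0 for an empty denominator is a convention chosen here.

```python
from typing import Iterable, Tuple

Edit = Tuple[int, int, str]  # hypothetical (start token, end token, correction) format


def precision_recall(system_edits: Iterable[Edit], gold_edits: Iterable[Edit]) -> Tuple[float, float]:
    """Precision and recall of system edits against gold edits, using exact matches."""
    system, gold = set(system_edits), set(gold_edits)
    true_positives = len(system & gold)
    # Convention chosen for this sketch: an empty denominator counts as a perfect score.
    precision = true_positives / len(system) if system else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Made-up example: the system proposes few edits, and all of them are correct.
gold = [(1, 2, "the"), (4, 5, "went"), (7, 8, "books"), (9, 9, "a")]
system = [(1, 2, "the"), (4, 5, "went")]
precision, recall = precision_recall(system, gold)
print(precision, recall)
```

The toy case mirrors the pattern reported above: a conservative model that proposes few edits can reach high precision while leaving many gold errors untouched, which lowers recall.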

A Streamlined Span-based Factorization Method for Few Shot Named Entity Recognition
Wenjie Xu | Yidan Chen | Jianquan Ouyang

Few-shot named entity recognition (NER) is a challenging task that aims to recognize new named entities with only a limited amount of labeled examples. In this paper, we introduce SSF, which is a streamlined span-based factorization method that addresses the problem of few-shot NER. Our approach formulates few-shot NER as a span-level alignment problem between query and support instances. To achieve this goal, SSF decomposes the span-level alignment problem into several refined span-level procedures. The proposed approach encompasses several key modules such as the Span Boosting Module, Span Prototypical Module, Span Alignment Module, and Span Optimization Module. Our experimental results demonstrate a significant improvement over the previous state-of-the-art performance. Specifically, compared to previous methods, our proposed approach achieves an average F1 score improvement of 12 points on the FewNERD dataset and 10 points on the SNIPS dataset. Moreover, our approach has surpassed the latest state-of-the-art performance on both datasets.

A Study on How Attention Scores in the BERT Model Are Aware of Lexical Categories in Syntactic and Semantic Tasks on the GLUE Benchmark
Dongjun Jang | Sungjoo Byun | Hyopil Shin

This study examines whether the attention scores between tokens in the BERT model significantly vary based on lexical categories during the fine-tuning process for downstream tasks. Drawing inspiration from the notion that in human language processing, syntactic and semantic information is parsed differently, we categorize tokens in sentences according to their lexical categories and focus on changes in attention scores among these categories. Our hypothesis posits that in downstream tasks that prioritize semantic information, attention scores centered on content words are enhanced, while in cases emphasizing syntactic information, attention scores centered on function words are intensified. Through experimentation conducted on six tasks from the GLUE benchmark dataset, we substantiate our hypothesis regarding the fine-tuning process. Furthermore, our additional investigations reveal the presence of BERT layers that consistently assign more bias to specific lexical categories, irrespective of the task, highlighting the existence of task-agnostic lexical category preferences.

A Survey on Natural Language Processing for Programming
Qingfu Zhu | Xianzhen Luo | Fang Liu | Cuiyun Gao | Wanxiang Che

Natural language processing for programming aims to use NLP techniques to assist programming. It is increasingly prevalent for its effectiveness in improving productivity. Distinct from natural language, a programming language is highly structured and functional. Constructing a structure-based representation and a functionality-oriented algorithm is at the heart of program understanding and generation. In this paper, we conduct a systematic review covering tasks, datasets, evaluation methods, techniques, and models from the perspective of the structure-based and functionality-oriented property, aiming to understand the role of the two properties in each component. Based on the analysis, we illustrate unexplored areas and suggest potential directions for future work.

A Tool for Determining Distances and Overlaps between Multimodal Annotations
Camila Antonio Barros | Jorge Francisco Ciprián-Sánchez | Saulo Mendes Santos

Comparing annotations is a constant and necessary step in corpus analysis. Although the nature of these annotations is normally research-specific, the tools used for this purpose do not have to be. Here, we present a tool for extracting and comparing annotations from ELAN, despite their idiosyncrasies. The intention behind this tool is to provide a handy way to analyze ELAN annotated files, by comparing tiers to a reference unit. Using the presented tool, it is possible to see how tiers overlap (even if they are of symbolic type), to which ratio, and the displacement regarding a reference unit. We present an example of multimodal corpus analysis, regarding the coordination between speech and gesture units based on a pragmatic reference. We argue that looking into overlap ratios can be more informative of the association between speech and gestures, and that considering a time buffer between speech and gestural events can be misleading.
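
One plausible reading of "overlap ratio" and "displacement" for time-aligned annotations is sketched below. The abstract does not define either quantity, so both definitions are assumptions: the ratio is taken as the share of the reference unit covered by the other annotation, and the displacement as the difference between onsets. The function names and the millisecond intervals are hypothetical, and the code does not read ELAN files.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms), as in a time-aligned annotation


def overlap_ratio(unit: Interval, reference: Interval) -> float:
    """Fraction of the reference interval that is covered by `unit` (assumed definition)."""
    start = max(unit[0], reference[0])
    end = min(unit[1], reference[1])
    overlap = max(0, end - start)
    return overlap / (reference[1] - reference[0])


def displacement(unit: Interval, reference: Interval) -> int:
    """Onset displacement in ms; positive when `unit` starts after the reference."""
    return unit[0] - reference[0]


# Made-up gesture and speech units.
speech = (1000, 1800)
gesture = (1200, 2000)
print(overlap_ratio(gesture, speech))
print(displacement(gesture, speech))
```

Unlike a fixed time buffer around each event, a ratio of this kind scales with the length of the reference unit, which is one reason overlap-based measures can be more informative.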

A Treebank of Asia Minor Greek
Eleni Vligouridou | Inessa Iliadou | Çağrı Çöltekin

Asia Minor Greek (AMG) dialects are endangered dialects rich in history and culture that face a dire struggle for preservation due to a declining speaker base and scarce linguistic resources. To address this need, we introduce a Universal Dependencies treebank of Pharasiot Greek, one of the severely endangered AMG dialects. The present treebank is fully manually annotated and currently consists of 350 sentences from six fairy tales in the Pharasiot dialect. Besides describing the treebank and the annotation process, we provide and discuss interesting phenomena we observed in the treebank. Most phenomena we discuss are related to contact-induced linguistic changes that these dialects are well known for. Beyond linguistic inquiry, like other treebanks for truly low-resource languages, the AMG treebank we present offers potential for diverse applications, such as language preservation and revitalization, as well as NLP tools that have to be developed with scarce resources.

A Trusted Multi-View Evidential Fusion Framework for Commonsense Reasoning
Shuo Yang

While deep learning models are powerful, they have limitations in tasks that require commonsense reasoning, as these tasks often involve interpreting information that may not be directly available in the input. Providing evidence has been proven to significantly enhance performance in commonsense reasoning tasks. However, there are various perspectives on evidence, including natural language explanations generated by pre-trained language models, facts derived from world knowledge like text corpora and knowledge bases, and rationales extracted from the input context. Hence, it is crucial to determine how to estimate the confidence degree of different evidence and how to combine them reliably. To address these challenges, this study proposes a trusted multi-view evidential fusion framework for reliable commonsense reasoning tasks that dynamically assesses the confidence of evidence and combines different views of evidence in a trustworthy manner. The proposed method is applied to three commonsense question-answering benchmarks, demonstrating that this approach can effectively reason with multi-view evidence and can compete with state-of-the-art performance.

Attack Named Entity Recognition by Entity Boundary Interference
Yifei Yang | Hongqiu Wu | Hai Zhao

Named Entity Recognition (NER) is a cornerstone natural language processing task, yet its robustness has been given little attention. This paper rethinks the principles of the conventional text attack, as they can easily violate the label consistency between the original and adversarial NER samples. This is due to the fine-grained nature of NER, as even minor word changes in the sentence can result in the emergence or mutation of any entity, producing invalid adversarial samples. To this end, we propose a novel one-word modification NER attack based on a key insight: NER models are always vulnerable to the boundary position of an entity when making their decision. We thus strategically insert a new boundary into the sentence and trigger the victim model to make a wrong recognition either on this boundary word or on other words in the sentence. We call this attack Virtual Boundary Attack (ViBA), which is shown to be remarkably effective when attacking both English and Chinese models, with a 70%-90% attack success rate on state-of-the-art language models, and also significantly faster than previous methods.

At the Crossroad of Cuneiform and NLP: Challenges for Fine-grained Part-of-speech Tagging
Gustav Ryberg Smidt | Els Lefever | Katrien de Graef

The study of ancient Middle Eastern cultures is dominated by the vast number of cuneiform texts. Multiple languages and language families were expressed in cuneiform. The most dominant language written in cuneiform is the Semitic Akkadian, which is the focus of this paper. We are specifically focusing on letters written in the dialect used in modern-day Baghdad and south towards the Persian Gulf during the Old Babylonian period (c. 2000-1600 B.C.E.). The Akkadian language was rediscovered in the 19th century and is now being scrutinised by Natural Language Processing (NLP) methods. However, existing Akkadian text publications are not always suitable for digital editions. We therefore risk applying NLP methods to renderings of Akkadian unfit for the purpose. In this paper we want to investigate the input material and try to initiate a discussion about best practices at the crossroads where NLP meets cuneiform studies. Specifically, we want to question the use of pre-trained embeddings, sentence segmentation and the type of cuneiform input used to fine-tune language models for the task of fine-grained part-of-speech tagging. We examine the issues by theoretical and practical approaches in a way that we hope spurs discussions that are relevant for automatic processing of other ancient languages.

A Tulu Resource for Machine Translation
Manu Narayanan | Noëmi Aepli

We present the first parallel dataset for English–Tulu translation. Tulu, classified within the South Dravidian linguistic family branch, is predominantly spoken by approximately 2.5 million individuals in southwestern India. Our dataset is constructed by integrating human translations into the multilingual machine translation resource FLORES-200. Furthermore, we use this dataset for evaluation purposes in developing our English–Tulu machine translation model. For the model’s training, we leverage resources available for related South Dravidian languages. We adopt a transfer learning approach that exploits similarities between high-resource and low-resource languages. This method enables the training of a machine translation system even in the absence of parallel data between the source and target language, thereby overcoming a significant obstacle in machine translation development for low-resource languages. Our English–Tulu system, trained without using parallel English–Tulu data, outperforms Google Translate by 19 BLEU points (in September 2023). The dataset and code are available here: https://github.com/manunarayanan/Tulu-NMT.

A Two-Stage Framework with Self-Supervised Distillation for Cross-Domain Text Classification
Yunlong Feng | Bohan Li | Libo Qin | Xiao Xu | Wanxiang Che

Cross-domain text classification is a crucial task as it enables models to adapt to a target domain that lacks labeled data. It leverages or reuses rich labeled data from different but related source domain(s) and unlabeled data from the target domain. To this end, previous work focuses on either extracting domain-invariant features or task-agnostic features, ignoring domain-aware features that may be present in the target domain and could be useful for the downstream task. In this paper, we propose a two-stage framework for cross-domain text classification. In the first stage, we fine-tune the model with masked language modeling (MLM) and labeled data from the source domain. In the second stage, we further fine-tune the model with self-supervised distillation (SSD) and unlabeled data from the target domain. We evaluate its performance on a public cross-domain text classification benchmark and the experiment results show that our method achieves new state-of-the-art results for both single-source domain adaptations (94.17% +1.03%) and multi-source domain adaptations (95.09% +1.34%).

A Two-Stage Prediction-Aware Contrastive Learning Framework for Multi-Intent NLU
Guanhua Chen | Yutong Yao | Derek F. Wong | Lidia S. Chao

Multi-intent natural language understanding (NLU) presents a formidable challenge due to the model confusion arising from multiple intents within a single utterance. While previous works train the model contrastively to increase the margin between different multi-intent labels, they are less suited to the nuances of multi-intent NLU. They ignore the rich information between the shared intents, which is beneficial to constructing a better embedding space, especially in low-data scenarios. We introduce a two-stage Prediction-Aware Contrastive Learning (PACL) framework for multi-intent NLU to harness this valuable knowledge. Our approach capitalizes on shared intent information by integrating word-level pre-training and prediction-aware contrastive fine-tuning. We construct a pre-training dataset using a word-level data augmentation strategy. Subsequently, our framework dynamically assigns roles to instances during contrastive fine-tuning while introducing a prediction-aware contrastive loss to maximize the impact of contrastive learning. We present experimental results and empirical analysis conducted on three widely used datasets, demonstrating that our method surpasses the performance of three prominent baselines on both low-data and full-data scenarios.

A Typology of Errors for User Utterances in Chatbots
Anu Singh | Esme Manandise

This paper discusses the challenges that non-prescriptive language uses in chatbot communication create for Semantic Parsing (SP). To help SP developers improve their systems, we propose a flexible error typology based on an analysis of a sample of non-prescriptive language uses mined from domain-specific chatbot logs. This typology is not tied to any specific language model. We also present a framework for automatically mapping these errors to the typology. Finally, we show how our framework can help evaluate SP systems from a linguistic robustness perspective. Our framework can be expanded to include new classes of errors across different domains and user demographics.

Audiocite.net : A Large Spoken Read Dataset in French
Soline Felice | Solene Virginie Evain | Solange Rossato | François Portet

The advent of self-supervised learning (SSL) in speech processing has allowed the use of large unlabeled datasets to learn pre-trained models, serving as powerful encoders for various downstream tasks. However, the application of these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech datasets. To advance the emergence of pre-trained models for French speech, we present the Audiocite.net corpus, composed of 6,682 hours of audiobook recordings from the audiocite.net website, shared by 130 readers. In addition to describing the creation process and final statistics, we also show how this dataset impacted the models of the LeBenchmark project in its 14k version for speech processing downstream tasks.

AuRoRA: A One-for-all Platform for Augmented Reasoning and Refining with Task-Adaptive Chain-of-Thought Prompting
Anni Zou | Zhuosheng Zhang | Hai Zhao

Large language models (LLMs) empowered by chain-of-thought (CoT) prompting have yielded remarkable prowess in reasoning tasks. Nevertheless, current methods predominantly lean on handcrafted or task-specific demonstrations, lack reliable knowledge basis and thus struggle for trustworthy responses in an automated pattern. While recent works endeavor to improve upon one certain aspect, they ignore the importance and necessity of establishing an integrated and interpretable reasoning system. To address these drawbacks and provide a universal solution, we propose AuRoRA: a one-for-all platform for augmented reasoning and refining based on CoT prompting that excels in adaptability, reliability, integrity, and interpretability. The system exhibits superior performances across six reasoning tasks and offers real-time visual analysis, which has pivotal academic and application value in the era of LLMs. The AuRoRA platform is available at https://huggingface.co/spaces/Anni123/AuRoRA.

Automated Extraction of Prosodic Structure from Unannotated Sign Language Video
Antonio F. G. Sevilla | José María Lahoz-Bengoechea | Alberto Diaz

As in oral phonology, prosody is an important carrier of linguistic information in sign languages. One of the most prominent ways this reveals itself is in the time structure of signs: their rhythm and intensity of articulation. To be able to empirically see these effects, the velocity of the hands can be computed throughout the execution of a sign. In this article, we propose a method for extracting this information from unlabeled videos of sign language, exploiting CoTracker, a recent advancement in computer vision which can track every point in a video without the need of any calibration or fine-tuning. The dominant hand is identified via clustering of the computed point velocities, and its dynamic profile plotted to make apparent the prosodic structure of signing. We apply our method to different datasets and sign languages, and perform a preliminary visual exploration of results. This exploration supports the usefulness of our methodology for linguistic analysis, though issues to be tackled remain, such as bi-manual signs and a formal and numerical evaluation of accuracy. Nonetheless, the absence of any preprocessing requirements may make it useful for other researchers and datasets.
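
As a rough illustration of the velocity computation this abstract describes, the sketch below derives a speed profile from one tracked point using finite differences between consecutive frames. The input format (a list of `(x, y)` pixel positions per frame) and the function name `speed_profile` are assumptions made for this example; they are not taken from CoTracker or from the authors' code, and the clustering step that identifies the dominant hand is not shown.

```python
import math


def speed_profile(track, fps):
    """Return the speed (pixels per second) between consecutive frames.

    `track` is a hypothetical list of (x, y) positions of one tracked
    point, one entry per video frame; `fps` is the video frame rate.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    speeds = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        # Displacement per frame times frames per second gives speed.
        speeds.append(math.hypot(x1 - x0, y1 - y0) * fps)
    return speeds


if __name__ == "__main__":
    # A point that moves, pauses for one frame, then moves again.
    print(speed_profile([(0, 0), (3, 4), (3, 4), (6, 8)], fps=10))
```

Plotting such a profile over time would show pauses as dips towards zero, which is the kind of rhythm information the abstract refers to.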

Automatically Estimating Textual and Phonemic Complexity for Cued Speech: How to See the Sounds from French Texts
Núria Gala | Brigitte Bigi | Marie Bauer

In this position paper we present a methodology to automatically annotate French text for Cued Speech (CS), a communication system developed for people with hearing loss to complement speech reading at the phonetic level. This visual communication mode uses handshapes in different placements near the face in combination with the mouth movements (called ‘cues’ or ‘keys’) to make the phonemes of spoken language look different from each other. CS is used to acquire skills in lip reading, in oral communication and for reading. Despite many studies demonstrating its benefits, there are few resources available for learning and practicing it, especially in French. We thus propose a methodology to phonemize written corpora so that each word is aligned with the corresponding CS key(s). This methodology is proposed as part of a wider project aimed at creating an augmented reality system displaying a virtual coding hand where the user will be able to choose a text upon its complexity for cueing.

Automatic Animacy Classification for Romanian Nouns
Maria Tepei | Jelke Bloem

We introduce the first Romanian animacy classifier, specifically a type-based binary classifier of Romanian nouns into the classes human/non-human, using pre-trained word embeddings and animacy information derived from Romanian WordNet. By obtaining a seed set of labeled nouns and their embeddings, we are able to train classifiers that generalize to unseen nouns. We compare three different architectures and observe good performance on classifying word types. In addition, we manually annotate a small corpus for animacy to perform a token-based evaluation of Romanian animacy classification in a naturalistic setting, which reveals limitations of the type-based classification approach.

Automatic Annotation of Grammaticality in Child-Caregiver Conversations
Mitja Nikolaus | Abhishek Agrawal | Petros Kaklamanis | Alex Warstadt | Abdellah Fourtassi

The acquisition of grammar has been a central question to adjudicate between theories of language acquisition. In order to conduct faster, more reproducible, and larger-scale corpus studies on grammaticality in child-caregiver conversations, tools for automatic annotation can offer an effective alternative to tedious manual annotation. We propose a coding scheme for context-dependent grammaticality in child-caregiver conversations and annotate more than 4,000 utterances from a large corpus of transcribed conversations. Based on these annotations, we train and evaluate a range of NLP models. Our results show that fine-tuned Transformer-based models perform best, achieving human inter-annotation agreement levels. As a first application and sanity check of this tool, we use the trained models to annotate a corpus almost two orders of magnitude larger than the manually annotated data and verify that children’s grammaticality shows a steady increase with age. This work contributes to the growing literature on applying state-of-the-art NLP methods to help study child language acquisition at scale.

Automatic Authorship Analysis in Human-AI Collaborative Writing
Aquia Richburg | Calvin Bao | Marine Carpuat

As the quality of AI-generated text increases with the development of new Large Language Models, people use them to write in a variety of contexts. Human-AI collaborative writing poses a potential challenge for existing AI analysis techniques, which have been primarily tested either on human-written text only, or on samples independently generated by humans and AI. In this work, we investigate the extent to which existing AI detection and authorship analysis models can perform classification on data generated in human-AI collaborative writing sessions. Results show that, for AI text detection in the cowriting setting, classifiers based on authorship embeddings (Rivera-Soto et al., 2021) outperform classifiers used in prior work distinguishing AI vs. human text generated independently. However, these embeddings are not optimal for finer-grained authorship identification tasks: for authorship verification, n-gram based models are more robust to human-AI co-written text, and authorship attribution performance degrades compared to baselines that use human-written text only. Taken together, this suggests that the rise of human-AI co-written text will require adapting AI detection tools and authorship analysis techniques in the near future. We release our code at https://github.com/AARichburg/Human-AI_Authorship_Analysis.

Automatic Coding of Contingency in Child-Caregiver Conversations
Abhishek Agrawal | Mitja Nikolaus | Benoit Favre | Abdellah Fourtassi

One of the most important communicative skills children have to learn is to engage in meaningful conversations with people around them. At the heart of this learning lies the mastery of contingency, i.e., the ability to contribute to an ongoing exchange in a relevant fashion (e.g., by staying on topic). Current research on this question relies on the manual annotation of a small sample of children, which limits our ability to draw general conclusions about development. Here, we propose to mitigate the limitations of manual labor by relying on automatic tools for contingency judgment in children’s early natural interactions with caregivers. Drawing inspiration from the field of dialogue systems evaluation, we built and compared several automatic classifiers. We found that a Transformer-based pre-trained language model – when fine-tuned on a relatively small set of data we annotated manually (around 3,500 turns) – provided the best predictions. We used this model to automatically annotate new, large-scale data, almost two orders of magnitude larger than our fine-tuning set. It was able to replicate existing results and generate new data-driven hypotheses. The broad impact of the work is to provide resources that can help the language development community study communicative development at scale, leading to more robust theories.

Automatic Construction of a Chinese Review Dataset for Aspect Sentiment Triplet Extraction via Iterative Weak Supervision
Chia-Wen Lu | Ching-Wen Yang | Wei-Yun Ma

Aspect Sentiment Triplet Extraction (ASTE), introduced in 2020, is a task that involves the extraction of three key elements: target aspects, descriptive opinion spans, and their corresponding sentiment polarity. This process, however, faces a significant hurdle, particularly when applied to Chinese languages, due to the lack of sufficient datasets for model training, largely attributable to the arduous manual labeling process. To address this issue, we present an innovative framework that facilitates the automatic construction of ASTE via Iterative Weak Supervision, negating the need for manual labeling, aided by a discriminator to weed out subpar samples. The objective is to successively improve the quality of this raw data and generate supplementary data. The effectiveness of our approach is underscored by our results, which include the creation of a substantial Chinese review dataset. This dataset encompasses over 60,000 Google restaurant reviews in Chinese and features more than 200,000 extracted triplets. Moreover, we have also established a robust baseline model by leveraging a novel method of weak supervision. Both our dataset and model are openly accessible to the public.

Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
Keyaki Ohno | Hirotaka Kameko | Keisuke Shirai | Taichi Nishimura | Shinsuke Mori

Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of the expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from limited coverage of location expressions in general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, the location expressions in the article title and in hyperlinks to other articles are assigned coordinates. By utilizing hyperlinks, we can accurately assign coordinates to location expressions even when they are ambiguous in the texts. Experimental results show that there remains room for improvement by disambiguating location expressions.

Automatic Data Visualization Generation from Chinese Natural Language Questions
Yan Ge | Victor Junqiu Wei | Yuanfeng Song | Jason Chen Zhang | Raymond Chi-Wing Wong

Data visualization has emerged as an effective tool for getting insights from massive datasets. Due to the difficulty of working with the programming languages of data visualization, automatic data visualization generation from natural languages (Text-to-Vis) is becoming increasingly popular. Despite the plethora of research effort on English Text-to-Vis, studies have yet to be conducted on data visualization generation from questions in Chinese. Motivated by this, we propose a Chinese Text-to-Vis dataset in this paper and demonstrate our first attempt to tackle this problem. Our model integrates multilingual BERT as the encoder, boosts the cross-lingual ability, and infuses the n-gram information into our word representation learning. Our experimental results show that our dataset is challenging and deserves further research.

Automatic Decomposition of Text Editing Examples into Primitive Edit Operations: Toward Analytic Evaluation of Editing Systems
Daichi Yamaguchi | Rei Miyata | Atsushi Fujita | Tomoyuki Kajiwara | Satoshi Sato

This paper presents our work on a task of automatic decomposition of text editing examples into primitive edit operations. Toward a detailed analysis of the behavior of text editing systems, identification of fine-grained edit operations performed by the systems is essential. Given a pair of source and edited sentences, the goal of our task is to generate a non-redundant sequence of primitive edit operations, i.e., the semantically minimal edit operations preserving grammaticality, that iteratively converts the source sentence to the edited sentence. First, we formalize this task, explaining its significant features and specifying the constraints that primitive edit operations should satisfy. Then, we propose a method to automate this task, which consists of two steps: generation of an edit operation lattice and selection of an optimal path. To obtain a wide range of edit operation candidates in the first step, we combine a phrase aligner and a large language model. Experimental results show that our method perfectly decomposes 44% and 64% of editing examples in the text simplification and machine translation post-editing datasets, respectively. Detailed analyses also provide insights into the difficulties of this task, suggesting directions for improvement.

Automatic Extraction of Language-Specific Biomarkers of Healthy Aging in Icelandic
Elena Callegari | Iris Edda Nowenstein | Ingunn Jóhanna Kristjánsdóttir | Anton Karl Ingason

This study examines the influence of task type and healthy aging on various automatically extracted part-of-speech features in Icelandic. We administered three language tasks to participants aged 60–80: picture description, trip planning, and description of one’s childhood home. Our findings reveal significant task effects on 11 out of 14 linguistic variables studied, highlighting the substantial influence of sampling methods on language production. Among the variables showing statistically significant task effects, we find the rate of the genitive and subjunctive, variables which can only be studied in morphologically richer languages like Icelandic. On the other hand, rates of pronouns, adverbs, and prepositions remained stable across task types. Aging effects were more subtle, being evident in 3 of the 14 variables, including an interaction with task type for dative case marking. These findings underscore the significance of task selection in studies targeting linguistic features but also emphasize the need to examine languages other than English to fully understand the effects of aging on language production. Additionally, the results have clinical implications: understanding healthy aging’s impact on language can help us better identify and study changes caused by Alzheimer’s Disease in older adults’ speech.

Automatic Extraction of Nominal Phrases from German Learner Texts of Different Proficiency Levels
Ronja Laarmann-Quante | Marco Müller | Eva Belke

Correctly inflecting determiners and adjectives so that they agree with the noun in nominal phrases (NPs) is a big challenge for learners of German. Given the increasing number of available learner corpora, a large-scale corpus-based study on the acquisition of this aspect of German morphosyntax would be desirable. In this paper, we present a pilot study in which we investigate how well nouns, their grammatical heads and the dependents that have to agree with the noun can be extracted automatically via dependency parsing. For six samples of the German learner corpus MERLIN (one per proficiency level), we found that in spite of many ungrammatical sentences in texts of low proficiency levels, human annotators find only a few true ambiguities that would make the extraction of NPs and their heads infeasible. The automatic parsers, however, perform rather poorly on extracting the relevant elements for texts on CEFR levels A1-B1 (< 70%) but quite well from level B2 onwards ( 90%). We discuss the sources of errors and how performance could potentially be increased in the future.

We are concerned with mapping the discursive landscape of conspiracy narratives surrounding the COVID-19 pandemic. In the present study, we analyse a corpus of more than 1,000 German Telegram posts tagged with 14 fine-grained conspiracy narrative labels by three independent annotators. Since emerging narratives on social media are short-lived and notoriously hard to track, we experiment with different state-of-the-art approaches to few-shot and zero-shot text classification. We report performance in terms of ROC-AUC and in terms of optimal F1, and compare fine-tuned methods with off-the-shelf approaches and human performance.
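
As a hedged illustration of the two metrics named in this abstract, the sketch below computes ROC-AUC and the optimal F1 (the best F1 over all decision thresholds) for a single binary label. The function names and the toy scores are assumptions for this example; the abstract does not say how the authors computed or averaged these metrics over the 14 narrative labels.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def optimal_f1(labels, scores):
    """Return (best F1, threshold), predicting positive when score >= threshold."""
    best_f1, best_threshold = 0.0, None
    for threshold in sorted(set(scores)):
        pairs = list(zip(labels, scores))
        tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
        fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
        fn = sum(1 for y, s in pairs if s < threshold and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    p = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(y, p), optimal_f1(y, p))
```

Optimal F1 is chosen with knowledge of the gold labels, so it is an upper bound on what a fixed threshold would achieve on unseen data.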

Automatic Partitioning of a Code-Switched Speech Corpus Using Mixed-Integer Programming
Joshua Miles Jansen van Vüren | Febe de Wet | Thomas Niesler

Defining training, development and test set partitions for speech corpora is usually accomplished by hand. However, for the dataset under investigation, which contains a large number of speakers, eight different languages and code-switching between all the languages, this style of partitioning is not feasible. Therefore, we view the partitioning task as a resource allocation problem and propose to solve it automatically and optimally by the application of mixed-integer linear programming. Using this approach, we are able to partition a new 41.6-hour multilingual corpus of code-switched speech into training, development and testing partitions while maintaining a fixed number of speakers and a specific amount of code-switched speech in the development and test partitions. For this newly partitioned corpus, we present baseline speech recognition results using a state-of-the-art multilingual transformer model (Wav2Vec2-XLS-R) and show that the exclusion of very short utterances (<1s) results in substantially improved speech recognition performance.
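
To make the resource-allocation view in this abstract concrete, the toy sketch below selects a fixed number of speakers for a test partition so that their total code-switched speech is as close as possible to a target. It uses exhaustive search over speaker subsets rather than mixed-integer linear programming, so it only works for a handful of speakers; the speaker IDs, durations, and the function name `pick_test_speakers` are all invented for this example and do not come from the paper.

```python
from itertools import combinations


def pick_test_speakers(cs_seconds, n_speakers, target_cs):
    """Choose `n_speakers` whose code-switched seconds best match `target_cs`.

    `cs_seconds` maps a (hypothetical) speaker ID to that speaker's
    amount of code-switched speech in seconds. Each candidate subset
    plays the role of one 0/1 assignment in a MILP formulation.
    """
    def gap(combo):
        return abs(sum(cs_seconds[s] for s in combo) - target_cs)

    best = min(combinations(sorted(cs_seconds), n_speakers), key=gap)
    return set(best)


if __name__ == "__main__":
    durations = {"s1": 120, "s2": 300, "s3": 80, "s4": 200}
    print(pick_test_speakers(durations, n_speakers=2, target_cs=280))
```

The number of subsets grows combinatorially with the number of speakers, which is why a solver-based formulation is preferable at corpus scale.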

With the widespread adoption of automatic transcription tools, acquiring speech transcriptions within seconds has become a reality. Nonetheless, many of these tools yield unpunctuated outputs, potentially incurring additional costs. This paper presents a novel approach to integrating punctuation into the transcriptions generated by such automatic tools, specifically focusing on Spanish-speaking contexts. Leveraging the RoBERTa-bne model pre-trained with data from the Spanish National Library, our training proposal is augmented with additional corpora to enhance performance on less common punctuation marks, such as question marks. Also, the proposed model has been trained through fine-tuning pre-trained models, involving adjustments for token classification and using SoftMax to identify the highest probability token. The proposed model obtains promising results when compared with other Spanish reference paper models. Ultimately, this model aims to facilitate punctuation on live transcriptions seamlessly and accurately. The proposed model will be applied to a real-case education project to improve the readability of the transcriptions.
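
The final step mentioned in this abstract, applying SoftMax to token-classification scores and keeping the most probable class, can be sketched as follows. The label set and the logits are invented for illustration; the abstract does not list the model's actual punctuation classes, and no real model is loaded here.

```python
import math

# Hypothetical punctuation label set, for illustration only.
LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]


def softmax(logits):
    """Numerically stable softmax over one token's class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_labels(token_logits):
    """Return (label, probability) of the most probable class per token."""
    predictions = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append((LABELS[best], probs[best]))
    return predictions


if __name__ == "__main__":
    scores = [[2.0, 0.1, 0.1, 0.1], [0.0, 0.0, 3.0, 0.0]]
    for label, prob in predict_labels(scores):
        print(label, round(prob, 3))
```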

Automatic Speech Interruption Detection: Analysis, Corpus, and System
Martin Lebourdais | Marie Tahon | Antoine Laurent | Sylvain Meignier

Interruption detection is a new yet challenging task in the field of speech processing. This article presents a comprehensive study on automatic speech interruption detection, from the definition of the task, through the assembly of a specialized corpus, to the development of an initial baseline system. We provide three main contributions: Firstly, we define the task, taking into account the nuanced nature of interruptions within spontaneous conversations. Secondly, we introduce a new corpus of conversational data, annotated for interruptions, to facilitate research in this domain. This corpus serves as a valuable resource for evaluating and advancing interruption detection techniques. Lastly, we present a first baseline system, which uses speech processing methods to automatically identify interruptions in speech with promising results. In this article, we start from theoretical notions of interruption to build a simplification of this notion based on overlapped speech detection. Our findings can not only serve as a foundation for further research in the field but also provide a benchmark for assessing future advancements in automatic speech interruption detection.

This paper describes different approaches for developing, for the first time, an automatic speech recognition system for two of the main dialects of Occitan, namely Gascon and Languedocian, and the results obtained in them. The difficulty of the task lies in the fact that Occitan is a less-resourced language. Although a great effort has been made to collect or create corpora of each variant (transcribed speech recordings for the acoustic models and two text corpora for the language models), the sizes of the corpora obtained are far from those of successful systems reported in the literature, and thus we have tested different techniques to compensate for the lack of resources. We have developed classical systems using Kaldi, creating an acoustic model for each variant and also creating language models from the collected corpora and from machine translated texts. We have also tried fine-tuning a Whisper model with our speech corpora. We report word error rates of 20.86 for Gascon and 13.52 for Languedocian with the Kaldi systems and 16.37 for Gascon and 11.74 for Languedocian with Whisper.

Automatic Speech Recognition System-Independent Word Error Rate Estimation
Chanho Park | Mingjie Chen | Thomas Hain

Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches a similar performance to ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by relative 17.58% and 18.21%, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
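
For readers unfamiliar with the metric, the sketch below computes the standard reference-based WER: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis, divided by the number of reference words. This is only the textbook definition; it is not the paper's estimator, which predicts WER without access to a reference transcript. The function name and whitespace tokenization are assumptions for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.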

Automating Dataset Production Using Generative Text and Image Models
Christopher Thierauf | Mitchell Abrams | Matthias Scheutz

Practical and ethical dataset collection remains a challenge blocking many empirical methods in natural language processing, resulting in a lack of benchmarks or data on which to test hypotheses. We propose a solution to some of these areas by presenting a pipeline to reduce the research burden of producing image and text datasets when datasets may not exist. Our approach, with accompanying software tools, involves (1) generating text with LLMs; (2) creating accompanying image vignettes with text–to–image transformers; and (3) low-cost human validation. Based on existing literature that has struggled with quantitative evaluation (due to difficulty of data collection), we present the creation of 3 relevant datasets, and conduct a user study that demonstrates this approach is able to aid researchers in obtaining previously-challenging datasets. We provide sample data generated with this technique, the source code used to produce it, and discuss applicability and limitations.

Autonomous Aspect-Image Instruction a2II: Q-Former Guided Multimodal Sentiment Classification
Junjia Feng | Mingqian Lin | Lin Shang | Xiaoying Gao

The multimodal aspect-oriented sentiment classification (MABSC) task, which aims to identify the sentiment polarities of aspects by combining both language and vision information, has garnered significant attention. However, the limited multimodal data in this task has become a major obstacle for vision-language multimodal fusion. While large-scale vision-language pretrained models have been adapted to multiple tasks, their use for the MABSC task is still in a nascent stage. In this work, we present an attempt to apply the instruction tuning paradigm to the MABSC task and leverage the ability of large vision-language models to alleviate the limitation in the fusion of textual and image modalities. To tackle the problem of potential irrelevance between aspects and images, we propose a plug-and-play selector to autonomously choose the most appropriate instruction from the instruction pool, thereby reducing the impact of irrelevant image noise on the final sentiment classification results. We conduct extensive experiments in various scenarios and our model achieves state-of-the-art performance on benchmark datasets, as well as in few-shot settings.

Auxiliary Knowledge-Induced Learning for Automatic Multi-Label Medical Document Classification
Xindi Wang | Robert E. Mercer | Frank Rudzicz

The International Classification of Diseases (ICD) is an authoritative medical classification system of different diseases and conditions for clinical and management purposes. ICD indexing aims to assign a subset of ICD codes to a medical record. Since human coding is labour-intensive and error-prone, many studies employ machine learning techniques to automate the coding process. ICD coding is a challenging task, as it needs to assign multiple codes to each medical document from an extremely large hierarchically organized collection. In this paper, we propose a novel approach for ICD indexing that adopts three ideas: (1) we use a multi-level deep dilated residual convolution encoder to aggregate the information from the clinical notes and learn document representations across different lengths of the texts; (2) we formalize the task of ICD classification with auxiliary knowledge of the medical records, which incorporates not only the clinical texts but also different clinical code terminologies and drug prescriptions for better inferring the ICD codes; and (3) we introduce a graph convolutional network to leverage the co-occurrence patterns among ICD codes, aiming to enhance the quality of label representations. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.
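
The co-occurrence patterns mentioned in idea (3) can be illustrated with a simple count of how often two codes are assigned to the same record. The sketch below only builds such pair counts, which could serve as edge weights of a label graph; it does not implement a graph convolutional network, and the records, code strings, and function name are placeholders chosen for this example rather than data from the paper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how often each unordered pair of codes shares a record.

    `records` is a list of code lists, one list per medical record.
    Pairs are stored as alphabetically sorted tuples.
    """
    counts = Counter()
    for codes in records:
        # Deduplicate within a record so each pair counts once per record.
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    toy_records = [
        ["I10", "E11.9"],
        ["I10", "E11.9", "N18.3"],
        ["I10"],
    ]
    for pair, n in sorted(cooccurrence_counts(toy_records).items()):
        print(pair, n)
```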

In this work, we present two datasets for the development of virtual patients and the first evaluation results. We first introduce a Spanish corpus of medical dialogue questions annotated with intents, built upon prior research in French. We also propose a second dataset of dialogues using a novel annotation approach that involves doctor questions, patient answers, and corresponding clinical records, organized as triples of the form (clinical report, question, patient answer). In this way, the doctor-patient conversation is modeled as a question-answering system that tries to find responses to questions taking a clinical record as input. This approach can help to eliminate the need for manually structured patient records, as commonly used in previous studies, thereby expanding the pool of diverse virtual patients available. Leveraging these annotated corpora, we develop and assess an automatic system designed to answer medical dialogue questions posed by medical students to simulated patients in medical exams. Our approach demonstrates robust generalization, relying solely on medical records to generate new patient cases. The two datasets and the code will be freely available for the research community.

This paper presents a new web portal with information about the state of the art of natural language processing tasks in Spanish. It provides information about forums, competitions, tasks and datasets in Spanish, that would otherwise be spread in multiple articles and web sites. The portal consists of overview pages where information can be searched for and filtered by several criteria and individual pages with detailed information and hyperlinks to facilitate navigation. Information has been manually curated from publications that describe competitions and NLP tasks from 2013 until 2023 and will be updated as new tasks appear. A total of 185 tasks and 128 datasets from 94 competitions have been introduced.

We describe ongoing work for developing a workflow for the applied use case of classifying diachronic and regional language variation in Pre-Modern Slavic texts. The data were obtained via handwritten text recognition (HTR) on medieval manuscripts and printings and partly by manual transcription. Our goal is to develop a workflow for such historical language data, covering HTR-postprocessing, annotating and classifying the digitized texts. We test and adapt existing language resources to fit the pipeline with low-barrier tooling, accessible for Humanists with limited experience in research data infrastructures, computational analysis or advanced methods of natural language processing (NLP). The workflow starts by addressing ground truth (GT) data creation for diagnosing and correcting HTR errors via string metrics and data-driven methods. On GT and on HTR data, we subsequently show classification results using transfer learning on sentence-level text snippets. Next, we report on our token-level data labeling efforts. Each step of the workflow is complemented with describing current limitations and our corresponding work in progress.

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks
Yanis Labrak | Mickael Rouvier | Richard Dufour

The recent emergence of Large Language Models (LLMs) has enabled significant advances in the field of Natural Language Processing (NLP). While these new models have demonstrated superior performance on various tasks, their application and potential are still underexplored, both in terms of the diversity of tasks they can handle and their domain of application. In this context, we evaluate four state-of-the-art instruction-tuned LLMs (ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca) on a set of 13 real-world clinical and biomedical NLP tasks in English, including named-entity recognition (NER), question-answering (QA), relation extraction (RE), and more. Our overall results show that these evaluated LLMs approach the performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, particularly excelling in the QA task, even though they have never encountered examples from these tasks before. However, we also observe that the classification and RE tasks fall short of the performance achievable with specifically trained models designed for the medical field, such as PubMedBERT. Finally, we note that no single LLM outperforms all others across all studied tasks, with some models proving more suitable for certain tasks than others.

Backdoor NLP Models via AI-Generated Text
Wei Du | Tianjie Ju | Ge Ren | GaoLei Li | Gongshen Liu

Backdoor attacks pose a critical security threat to natural language processing (NLP) models by establishing covert associations between trigger patterns and target labels without affecting normal accuracy. Existing attacks usually disregard fluency and semantic fidelity of poisoned text, rendering the malicious data easily detectable. However, text generation models can produce coherent and content-relevant text given prompts. Moreover, potential differences between human-written and AI-generated text may be captured by NLP models while being imperceptible to humans. More insidious threats could arise if attackers leverage latent features of AI-generated text as trigger patterns. We comprehensively investigate backdoor attacks on NLP models using AI-generated poisoned text obtained via continued writing or paraphrasing, exploring three attack scenarios: data, model and pre-training. For data poisoning, we fine-tune generators with attribute control to enhance the attack performance. For model poisoning, we leverage downstream tasks to derive specialized generators. For pre-training poisoning, we train multiple attribute-based generators and align their generated text with pre-defined vectors, enabling task-agnostic migration attacks. Experiments demonstrate that our method achieves effective attacks while maintaining fluency and semantic similarity across all scenarios. We hope this work can raise awareness of the security risks hidden in AI-generated text.

Open speech corpora of substantial size are seldom available for less-spoken languages, and until recently this was also the case for Latvian with its 1.5M native speakers. While there exist several closed Latvian speech corpora of 100+ hours, used to train competitive models for automatic speech recognition (ASR), there were only a few tiny open datasets available at the beginning of 2023, the 18-hour Latvian Common Voice 13.0 dataset being the largest one. As a result of a successful national crowdsourcing initiative, organised jointly by several institutions, the size and speaker diversity of the Latvian Common Voice 17.0 release have increased more than tenfold in less than a year. A successful follow-up initiative was also launched for Latgalian, which has been recognized as an endangered historic variant of Latvian with 150k speakers. The goal of these initiatives is not only to enlarge the datasets but also to make them more diverse in terms of speakers and accents, text genres and styles, intonations, grammar and lexicon. They have already become considerable language resources for both improving ASR and conducting linguistic research. Since we use the Mozilla Common Voice platform to record and validate speech samples, this paper focuses on (i) the selection of text snippets to enrich the language data and to stimulate various intonations, (ii) an indicative evaluation of the acquired corpus and the first ASR models fine-tuned on this data, and (iii) our social campaigns to boost and maintain this initiative.

BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models
Zican Dong | Tianyi Tang | Junyi Li | Wayne Xin Zhao | Ji-Rong Wen

Large language models (LLMs) have achieved remarkable proficiency on NLP tasks of normal length. Recently, multiple studies have been devoted to extending the context length and enhancing the long text modeling capabilities of LLMs. To comprehensively evaluate the long context ability of LLMs, we propose BAMBOO, a multi-task long context benchmark. BAMBOO has been designed with four principles: comprehensive capacity evaluation, avoidance of data contamination, accurate automatic evaluation, and different length levels. It consists of 10 datasets from 5 different long text understanding tasks, i.e., question answering, hallucination detection, text sorting, language modeling, and code completion, to cover various domains and core capacities of LLMs. We conduct experiments with five widely-used long-context models and further discuss five key questions for long text research. In the end, we discuss problems of current long-context models and point out future directions for enhancing long text modeling capacities. We release our data, prompts, and code at https://anonymous.4open.science/r/BAMBOO/.

BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering
Azmine Toushik Wasi | Taki Hasan Rafi | Raima Islam | Dong-Kyu Chae

Knowledge Graphs (KGs) have proven essential in information processing and reasoning applications because they link related entities and give context-rich information, supporting efficient information retrieval and knowledge discovery and presenting information flow in a very effective manner. Despite being widely used globally, Bangla is relatively underrepresented in KGs due to a lack of comprehensive datasets, encoders, NER (named entity recognition) models, POS (part-of-speech) taggers, and lemmatizers, hindering efficient information processing and reasoning applications in the language. Addressing the KG scarcity in Bengali, we propose BanglaAutoKG, a pioneering framework that is able to automatically construct Bengali KGs from any Bangla text. We utilize multilingual LLMs to understand various languages and correlate entities and relations universally. By employing a translation dictionary to identify English equivalents and extracting word features from pre-trained BERT models, we construct the foundational KG. To reduce noise and align word embeddings with our goal, we employ graph-based polynomial filters. Lastly, we implement a GNN-based semantic filter, which elevates contextual understanding and trims unnecessary edges, culminating in the formation of the definitive KG. Empirical findings and case studies demonstrate the universal effectiveness of our model, capable of autonomously constructing semantically enriched KGs from any text. Data and code are available here: https://github.com/azminewasi/BanglaAutoKG

Since the Internet is flooded with hate, it is one of the main tasks for NLP experts to master automated online content moderation. However, advancements in this field require improved access to publicly available accurate and non-synthetic datasets of social media content. For the Polish language, such resources are very limited. In this paper, we address this gap by presenting a new open dataset of offensive social media content for the Polish language. The dataset comprises content from Wykop.pl, a popular online service often referred to as the Polish Reddit, reported by users and banned in the internal moderation process. It contains a total of 691,662 posts and comments, evenly divided into two categories: harmful and neutral (non-harmful). The anonymized subset of the BAN-PL dataset, consisting of 24,000 pieces (12,000 for each class), has been made publicly available along with preprocessing scripts. Furthermore, the paper offers valuable insights into real-life content moderation processes and delves into an analysis of linguistic features and content characteristics of the dataset. Moreover, a comprehensive anonymization procedure has been meticulously described and applied. The prevalent biases encountered in similar datasets, including post-moderation and pre-selection biases, are also discussed.

In the realm of artificial intelligence and linguistics, the automatic generation of humor, particularly puns, remains a complex task. This paper introduces an innovative approach that employs a Generative Adversarial Network (GAN) and semantic pruning techniques to generate humorous puns. We initiate our process by identifying potential pun candidates via semantic pruning. This is followed by the use of contrastive learning to decode the unique characteristics of puns, emphasizing both correct and incorrect interpretations. The learned features from contrastive learning are utilized within our GAN model to better capture the semantic nuances of puns. Specifically, the generator exploits the pruned semantic tree to generate pun texts, while the discriminator evaluates the generated puns, ensuring both linguistic correctness and humor. Evaluation results highlight our model’s capacity to produce semantically coherent and humorous puns, demonstrating an enhancement over prior methods and approaching human-level performance. This work contributes significantly to the field of computational humor, advancing the capabilities of automatic pun generation.

Basque and Spanish Counter Narrative Generation: Data Creation and Evaluation
Jaione Bengoetxea | Yi-Ling Chung | Marco Guerini | Rodrigo Agerri

Counter Narratives (CNs) are non-negative textual responses to Hate Speech (HS) aiming at defusing online hatred and mitigating its spreading across media. Despite the recent increase in HS content posted online, research on automatic CN generation has been relatively scarce and predominantly focused on English. In this paper, we present CONAN-EUS, a new Basque and Spanish dataset for CN generation developed by means of Machine Translation (MT) and professional post-edition. Being a parallel corpus, also with respect to the original English CONAN, it enables novel research on multilingual and crosslingual automatic generation of CNs. Our experiments on CN generation with mT5, a multilingual encoder-decoder model, show that generation greatly benefits from training on post-edited data, as opposed to relying on silver MT data only. These results are confirmed by their correlation with a qualitative manual evaluation, demonstrating that manually revised training data remains crucial for the quality of the generated CNs. Furthermore, multilingual data augmentation improves results over monolingual settings for structurally similar languages such as English and Spanish, while being detrimental for Basque, a language isolate. Similar findings occur in zero-shot crosslingual evaluations, where model transfer (fine-tuning in English and generating in a different target language) outperforms fine-tuning mT5 on machine translated data for Spanish but not for Basque. This provides an interesting insight into the asymmetry in the multilinguality of generative models, a challenging topic which is still open to research. Data and code will be made publicly available upon publication.

Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus
Carme Armentano-Oller | Montserrat Marimon | Marta Villegas

Collecting voice resources for speech recognition systems is a multifaceted challenge, involving legal, technical, and diversity considerations. However, it is crucial to ensure fair access to voice-driven technology across diverse linguistic backgrounds. We describe an ongoing effort to create an extensive, high-quality, publicly available voice dataset for future development of speech technologies in Catalan through the Mozilla Common Voice crowd-sourcing platform. We detail the specific approaches used to address the challenges faced in recruiting contributors and managing the collection, validation, and recording of sentences. This detailed overview can serve as a source of guidance for similar initiatives across other projects and linguistic contexts. The success of this project is evident in the latest corpus release, version 16.1, where Catalan ranks as the most prominent language in the corpus, both in terms of recorded hours and when considering validated hours. This establishes Catalan as a language with significant speech resources for language technology development and significantly raises its international visibility.

BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
Konrad Wojtasik | Kacper Wołowiec | Vadim Shishkin | Arkadiusz Janz | Maciej Piasecki

The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR), garnering considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to the English language. Our objective is to establish extensive large-scale resources for IR in the Polish language, thereby advancing the research in this NLP area. In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish, and we introduced the BEIR-PL benchmark, a new benchmark which comprises 13 datasets, facilitating further development, training and evaluation of modern Polish language models for IR tasks. We executed an evaluation and comparison of numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. BEIR-PL is included in the MTEB Benchmark and is also available with trained models at https://huggingface.co/clarin-knext.

Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies
Flavio Petruzzellis | Alberto Testolin | Alessandro Sperduti

Large Language Models (LLMs) have revolutionized the field of Natural Language Processing thanks to their ability to reuse knowledge acquired on massive text corpora on a wide variety of downstream tasks, with minimal (if any) tuning steps. At the same time, it has been repeatedly shown that LLMs lack systematic generalization, that is, the ability to extrapolate the learned statistical regularities outside the training distribution. In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available, on three algorithmic tasks characterized by the possibility of controlling the problem difficulty with two parameters. We compare the performance of GPT-4 with that of its predecessor (GPT-3.5) and with a variant of the Transformer-Encoder architecture recently introduced to solve similar tasks, the Neural Data Router. We find that the deployment of advanced prompting techniques allows GPT-4 to reach superior accuracy on all tasks, demonstrating that state-of-the-art LLMs constitute a very strong baseline also in challenging tasks that require systematic generalization.

Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks. However, they are susceptible to producing unreliable conjectures in ambiguous contexts, a phenomenon called hallucination. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on the unanswerable math word problem (MWP). To support this approach, we develop a dataset called Unanswerable Math Word Problem (UMWP), which comprises 5200 questions across five categories. We develop an evaluation methodology combining text similarity and mathematical expression detection to determine whether an LLM considers the question unanswerable. The results of extensive experiments conducted on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning with human feedback (RLHF) training significantly enhance the model’s ability to avoid hallucination. We show that utilizing MWP is a reliable and effective approach to assess hallucination. Our code and data are available at https://github.com/Yuki-Asuuna/UMWP.

This paper explores the efficacy of large language models (LLMs) for Persian. While ChatGPT and consequent LLMs have shown remarkable performance in English, their efficiency for more low-resource languages remains an open question. We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks. Our primary focus is on GPT-3.5-turbo, but we also include GPT-4 and OpenChat-3.5 to provide a more holistic evaluation. Our assessment encompasses a diverse set of tasks categorized into classic, reasoning, and knowledge-based domains. To enable a thorough comparison, we evaluate LLMs against existing task-specific fine-tuned models. Given the limited availability of Persian datasets for reasoning tasks, we introduce two new benchmarks: one based on elementary school math questions and another derived from the entrance exams for 7th and 10th grades. Our findings reveal that while LLMs, especially GPT-4, excel in tasks requiring reasoning abilities and a broad understanding of general knowledge, they often lag behind smaller pretrained models fine-tuned specifically for particular tasks. Additionally, we observe improved performance when test sets are translated to English before inputting them into GPT-3.5. These results highlight the significant potential for enhancing LLM performance in the Persian language. This is particularly noteworthy due to the unique attributes of Persian, including its distinct alphabet and writing styles. We have made our codes, prompts, and data available here: https://github.com/Ipouyall/Benchmarking_ChatGPT_for_Persian.

Benchmarking the Performance of Machine Translation Evaluation Metrics with Chinese Multiword Expressions
Huacheng Song | Hongzhi Xu

To investigate the impact of Multiword Expressions (MWEs) on the fine-grained performance of the state-of-the-art metrics for Machine Translation Evaluation (MTE), we conduct experiments on the WMT22 Metrics Shared Task dataset with a preliminary focus on the Chinese-to-English language pair. We further annotate 28 types of Chinese MWEs on the source texts and then examine the performance of 31 MTE metrics on groups of sentences containing different MWEs. We report three findings: 1) Machine Translation (MT) systems tend to perform worse on most Chinese MWE categories, confirming the previous claim that MWEs are a bottleneck of MT; 2) automatic metrics tend to overrate the translation of sentences containing MWEs; 3) most neural-network-based metrics perform better than string-overlap-based metrics. We conclude that both MT systems and MTE metrics still suffer from MWEs, which suggests that richer annotation of data is needed to facilitate MWE-aware automatic MTE and MT.

Benchmarking the Simplification of Dutch Municipal Text
Daniel Vlantis | Iva Gornishka | Shuai Wang

Text simplification (TS) makes written information more accessible to all people, especially those with cognitive or language impairments. Despite much progress in TS due to advances in NLP technology, the bottleneck issue of lack of data for low-resource languages persists. Dutch is one of these languages that lack a monolingual simplification corpus. In this paper, we use English as a pivot language for the simplification of Dutch medical and municipal text. We experiment with augmenting training data and corpus choice for this pivot-based approach. We compare the results to a baseline and an end-to-end LLM approach using the GPT 3.5 Turbo model. Our evaluation shows that, while we can substantially improve the results of the pivot pipeline, the zero-shot end-to-end GPT-based simplification performs better on all metrics. Our work shows how an existing pivot-based pipeline can be improved for simplifying Dutch medical text. Moreover, we provide baselines for the comparison in the domain of Dutch municipal text and make our corresponding evaluation dataset publicly available.

BengaliLCP: A Dataset for Lexical Complexity Prediction in the Bengali Texts
Nabila Ayman | Md. Akram Hossain | Abdul Aziz | Rokan Uddin Faruqui | Abu Nowshed Chy

Encountering intricate or ambiguous terms within a sentence makes comprehension difficult for the reader. Lexical Complexity Prediction (LCP) deals with predicting the complexity score of a word or a phrase considering its context. This task poses several challenges, including ambiguity, context sensitivity, and subjectivity in perceiving complexity. Despite having 300 million native speakers and ranking as the seventh most spoken language in the world, Bengali falls behind in the research on lexical complexity when compared to other languages. To bridge this gap, we introduce the first annotated Bengali dataset that supports the task of LCP in this language. In addition, we propose a transformer-based deep neural approach with a pairwise multi-head attention mechanism and LSTM model to predict the lexical complexity of Bengali tokens. The outcomes demonstrate that the proposed neural approach outperformed the existing state-of-the-art models for the Bengali language.

Large Language Models (LLMs) have emerged as one of the most important breakthroughs in natural language processing (NLP) for their impressive skills in language generation and other language-specific tasks. Though LLMs have been evaluated in various tasks, mostly in English, they have not yet undergone thorough evaluation in under-resourced languages such as Bengali (Bangla). To this end, this paper introduces BenLLM-Eval, which consists of a comprehensive evaluation of LLMs to benchmark their performance in the low-resourced Bangla language. In this regard, we select various important and diverse Bangla NLP tasks, such as text summarization, question answering, paraphrasing, natural language inference, text classification, and sentiment analysis for zero-shot evaluation of popular LLMs, namely, ChatGPT, LLaMA-2, and Claude-2. Our experimental results demonstrate that while in some Bangla NLP tasks zero-shot LLMs could achieve performance on par with, or even better than, current SOTA fine-tuned models, in most tasks their performance is quite poor in comparison to the current SOTA results (with the performance of open-source LLMs like LLaMA-2 being particularly poor). Therefore, further efforts are needed to develop a better understanding of LLMs in low-resource languages like Bangla.

Recently, we have witnessed a significant performance boost on the dialogue response selection task achieved by Cross-Encoder based models. However, such models directly feed the concatenation of context and response into the pre-trained model for interactive inference, ignoring the comprehensive, independent representation modeling of context and response. Moreover, randomly sampling negative responses from other dialogue contexts is simplistic, and the learned models have poor generalization capability in realistic scenarios. In this paper, we propose a response selection model called BERT-BC that combines the representation-based Bi-Encoder and interaction-based Cross-Encoder. Three contrastive learning methods are devised for the Bi-Encoder to align context and response and obtain a better semantic representation. Meanwhile, according to the alignment difficulty of context and response semantics, the harder samples are dynamically selected from the same batch with negligible cost and sent to the Cross-Encoder to enhance the model’s interactive reasoning ability. Experimental results show that BERT-BC can achieve state-of-the-art performance on three benchmark datasets for multi-turn response selection.

In the digital age, cyberbullying (CB) poses a significant concern, impacting individuals as early as primary school and leading to severe or lasting consequences, including an increased risk of self-harm. CB incidents are not limited to bullies and victims, but include bystanders with various roles, and usually have numerous sub-categories and variations of online harms. This position paper emphasises the complexity of CB incidents by drawing on insights from psychology, social sciences, and computational linguistics. While awareness of CB complexities is growing, existing computational techniques tend to oversimplify CB as a binary classification task, often relying on training datasets that capture peripheries of CB behaviours. Inconsistent definitions and categories of CB-related online harms across various platforms further complicate the issue. Ethical concerns arise when CB research involves children role-playing CB incidents to curate datasets. Through multi-disciplinary collaboration, we propose strategies for consideration when developing CB detection systems. We present our position on leveraging large language models (LLMs) such as Claude-2 and Llama2-Chat as an alternative approach to generate CB-related role-playing datasets. Our goal is to assist researchers, policymakers, and online platforms in making informed decisions regarding the automation of CB incident detection and intervention. By addressing these complexities, our research contributes to a more nuanced and effective approach to combating CB, especially in young people.

Beyond Canonical Fine-tuning: Leveraging Hybrid Multi-Layer Pooled Representations of BERT for Automated Essay Scoring
Eujene Nikka V. Boquio | Prospero C. Naval, Jr.

The challenging yet relevant task of automated essay scoring (AES) has continuously gained attention from multiple disciplines over the years. With the advent of pre-trained large language models such as BERT, fine-tuning those models has become the dominant technique in various natural language processing (NLP) tasks. Several studies fine-tune BERT for the AES task but only utilize the final pooled output from its last layer. With BERT’s multi-layer architecture that encodes hierarchical linguistic information, we believe we can improve overall essay scoring performance by leveraging information from its intermediate layers. In this study, we diverge from the canonical fine-tuning paradigm by exploring different combinations of model outputs and single- and multi-layer pooling strategies, as well as architecture modifications to the task-specific component of the model. Using a hybrid pooling strategy, experimental results show that our best essay representation combined with a simple architectural modification outperforms the average QWK score of the basic fine-tuned BERT with default output on the ASAP AES dataset, suggesting its effectiveness for the AES task and potentially other long-text tasks.
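
As an illustration of the evaluation metric named in this abstract, the following is a minimal sketch of quadratic weighted kappa, assuming that QWK here denotes the standard formulation (one minus the ratio of weighted observed disagreement to weighted chance disagreement). The function name, its arguments, and the treatment of the degenerate case are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two lists of integer scores (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    num_levels = max_rating - min_rating + 1
    if num_levels < 2:
        raise ValueError("at least two rating levels are required")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (score_a, score_b)
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0.0:
        # Both raters gave one identical constant score; treating this as
        # perfect agreement is a convention chosen for this sketch.
        return 1.0
    return 1.0 - numerator / denominator


# Identical scores give full agreement; fully swapped binary scores give -1.
print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))
print(quadratic_weighted_kappa([0, 1], [1, 0], 0, 1))
```

Because the weights grow with the squared distance between the two scores, a prediction that misses the human score by several points is penalised much more heavily than a near miss.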

Beyond Code: Evaluate Thought Steps for Complex Code Generation
Liuwen Cao | Yi Cai | Jiexin Wang | Hongkui He | Hailin Huang

Code generation aims to generate code in a general-purpose programming language, such as C++, based on natural language intents. Existing efforts primarily focus on relatively simple programming problems and fail to evaluate the thought process involved in complex programming scenarios. In this paper, we introduce “steps-guided code generation,” a task that assesses the quality of both thought steps and code implementation to evaluate how well a complex programming problem is handled overall. To support this task, we construct CodeStepsEval, a real-world scenario dataset of complex programming problems in the C++ programming language with varying levels of difficulty. Comprehensive experiments on this dataset demonstrate the importance of high-quality steps in enhancing code generation performance and the challenges faced by code LLMs in this task.

Low-Rank Adaptation (LoRA) is a widespread parameter-efficient fine-tuning algorithm for large-scale language models. It has been commonly accepted that LoRA mostly achieves promising results in single-task, low-resource settings, and struggles to handle multi-task instruction tuning scenarios. In this paper, we conduct a systematic study of LoRA on diverse tasks and rich resources with different learning capacities, examining its performance on seen tasks during training and its cross-task generalization on unseen tasks. Our findings challenge the prevalent assumption that the limited learning capacity will inevitably result in performance decline. In fact, our study reveals that when configured with an appropriate rank, LoRA can achieve remarkable performance in high-resource and multi-task scenarios, even comparable to that achieved through full fine-tuning. It turns out that the constrained learning capacity encourages LoRA to prioritize conforming to instruction requirements rather than memorizing specialized features of particular tasks or instances. This study reveals the underlying connection between learning capacity and generalization capabilities for robust parameter-efficient fine-tuning, highlighting a promising direction for the broader application of LoRA across various tasks and settings.
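
To make the notion of rank-limited learning capacity concrete, here is a minimal sketch assuming the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is adapted by the product of two trainable matrices of shapes d_out x r and r x d_in. The function names and the layer dimensions are hypothetical and are not taken from the paper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: rank * (d_in + d_out)."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return rank * (d_in + d_out)


def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


# Hypothetical 4096 x 4096 projection layer, chosen only for illustration.
d_in, d_out = 4096, 4096
for rank in (8, 64, 512):
    lora = lora_trainable_params(d_in, d_out, rank)
    full = full_finetune_params(d_in, d_out)
    print(f"rank={rank}: {lora} trainable parameters, {lora / full:.4%} of full")
```

The count grows linearly with the rank, so the rank acts as a direct dial on how much the adapter can learn, which is the quantity the study varies.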

Emotion recognition in conversation (ERC) is essential for dialogue systems to identify the emotions expressed by speakers. Although previous studies have made significant progress, accurately recognizing and interpreting similar fine-grained emotions while properly accounting for individual variability remains a challenge. One particular under-explored area is the role of individual beliefs and desires in modelling emotion. Inspired by the Belief-Desire Theory of Emotion, we propose a novel method for conversational emotion recognition that incorporates both belief and desire to accurately identify emotions. We extract emotion-eliciting events from utterances and construct graphs that represent beliefs and desires in conversations. By applying message passing between nodes, our graph effectively models the utterance context, speaker’s global state, and the interaction between emotional beliefs, desires, and utterances. We evaluate our model’s performance by conducting extensive experiments on four popular ERC datasets and comparing it with multiple state-of-the-art models. The experimental results demonstrate the superiority of our proposed model and validate the effectiveness of each module in the model.

Beyond Model Performance: Can Link Prediction Enrich French Lexical Graphs?
Hee-Soo Choi | Priyansh Trivedi | Mathieu Constant | Karen Fort | Bruno Guillaume

This paper presents a resource-centric study of link prediction approaches over French lexical-semantic graphs. Our study incorporates two graphs, RezoJDM16k and RL-fr, and we evaluated seven link prediction models, with CompGCN-ConvE emerging as the best performer. We also conducted a qualitative analysis of the predictions using manual annotations. Based on this, we found that predictions with higher confidence scores were more valid for inclusion. Our findings highlight different benefits for the dense graph RezoJDM16k compared to the sparser graph RL-fr. While the addition of new triples to RezoJDM16k offers limited advantages, RL-fr can benefit substantially from our approach.

With the rise of Large Language Models (LLMs), AI assistants’ ability to utilize tools, especially through API calls, has advanced notably. This progress has necessitated more accurate evaluation methods. Many existing studies adopt static evaluation, where they assess AI assistants’ API calls based on pre-defined dialogue histories. However, such an evaluation method can be misleading, as an AI assistant might fail to generate API calls from preceding human interaction in real cases. Instead of the resource-intensive method of direct human-machine interactions, we propose Automated Dynamic Evaluation (AutoDE) to assess an assistant’s API call capability without human involvement. In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions, using an LLM-based user agent equipped with a user script to ensure human alignment. Experimental results highlight that AutoDE uncovers errors overlooked by static evaluations, aligning more closely with human assessment. Testing four AI assistants using our crafted benchmark, our method mirrored human evaluation more closely than conventional static evaluations.

Out-of-domain (OOD) intent detection aims to examine whether the user’s query falls outside the predefined domain of the system, which is crucial for the proper functioning of task-oriented dialogue (TOD) systems. Previous methods address it by fine-tuning discriminative models. Recently, some studies have been exploring the application of large language models (LLMs) represented by ChatGPT to various downstream tasks, but their ability on the OOD detection task is still unclear. This paper conducts a comprehensive evaluation of LLMs under various experimental settings, and then outlines the strengths and weaknesses of LLMs. We find that LLMs exhibit strong zero-shot and few-shot capabilities, but are still at a disadvantage compared to models fine-tuned with full resources. Going deeper, through a series of additional analysis experiments, we discuss and summarize the challenges faced by LLMs and provide guidance for future work, including injecting domain knowledge, strengthening knowledge transfer from in-domain (IND) to OOD, and understanding long instructions.

Beyond Words: Decoding Facial Expression Dynamics in Motivational Interviewing
Nezih Younsi | Catherine Pelachaud | Laurence Chaby

This paper focuses on studying the facial expressions of both client and therapist in the context of Motivational Interviewing (MI). The annotation system Motivational Interview Skill Code (MISC) defines three types of talk, namely sustain, change, and neutral for the client and information, question, or reflection for the therapist. Most studies on MI look at the verbal modality. Our research aims to understand the variation and dynamics of facial expressions of both interlocutors over a counseling session. We apply a sequence mining algorithm to identify categories of facial expressions for each type. Using co-occurrence analysis, we derive the correlation between the facial expressions and the different types of talk, as well as the interplay between interlocutors’ expressions.

BigNLI: Native Language Identification with Big Bird Embeddings
Sergey Kramp | Giovanni Cassani | Chris Emmery

Native Language Identification (NLI) intends to classify an author’s native language based on their writing in another language. Historically, the task has heavily relied on time-consuming linguistic feature engineering, and NLI transformer models have thus far failed to offer effective, practical alternatives. The current work shows input size is a limiting factor, and that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models (for which we reproduce previous work) by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample (Europe subreddit) and out-of-domain (TOEFL-11) performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.

Biomedical Concept Normalization over Nested Entities with Partial UMLS Terminology in Russian
Natalia Loukachevitch | Andrey Sakhovskiy | Elena Tutubalina

We present a new manually annotated dataset of PubMed abstracts for concept normalization in Russian. It contains over 23,641 entity mentions in 756 documents linked to 4,544 unique concepts from the UMLS ontology. Compared to existing corpora, we explore two novel annotation characteristics: the nestedness of named entities and the incompleteness of the Russian medical terminology in UMLS. 4,424 entity mentions are linked to 1,535 unique English concepts absent in the Russian part of the UMLS ontology. We present several baselines for normalization over nested named entities obtained with state-of-the-art models such as SapBERT. Our experimental results show that models pre-trained on graph structural data from UMLS achieve superior performance in a zero-shot setting on bilingual terminology.

Biomedical Entity Linking as Multiple Choice Question Answering
Zhenxi Lin | Ziheng Zhang | Xian Wu | Yefeng Zheng

Although biomedical entity linking (BioEL) has made significant progress with pre-trained language models, challenges still exist for fine-grained and long-tailed entities. To address these challenges, we present BioELQA, a novel model that treats Biomedical Entity Linking as Multiple Choice Question Answering. BioELQA first obtains candidate entities with a fast retriever, jointly presents the mention and candidate entities to a generator, and then outputs the predicted symbol associated with its chosen entity. This formulation enables explicit comparison of different candidate entities, thus capturing fine-grained interactions between mentions and entities, as well as among entities themselves. To improve generalization for long-tailed entities, we retrieve similar labeled training instances as clues and concatenate the input with retrieved instances for the generator. Extensive experimental results show that BioELQA outperforms state-of-the-art baselines on several datasets.
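
The multiple-choice formulation can be sketched as follows. This is an illustrative sketch only: the prompt wording, the function names, and the example mention and candidates are assumptions, not the format used by BioELQA. It shows candidates being presented jointly with option symbols and the predicted symbol being mapped back to an entity.

```python
import string

def build_multiple_choice_prompt(mention, context, candidates):
    """Present a mention and its retrieved candidates as lettered options.

    Assumes at most 26 candidates, one uppercase letter each.
    Returns the prompt text and a symbol -> entity mapping.
    """
    symbols = string.ascii_uppercase[:len(candidates)]
    lines = [
        f"Context: {context}",
        f"Mention: {mention}",
        "Which entity does the mention refer to?",
    ]
    lines += [f"({symbol}) {entity}" for symbol, entity in zip(symbols, candidates)]
    lines.append("Answer:")
    return "\n".join(lines), dict(zip(symbols, candidates))

def resolve_answer(symbol, symbol_to_entity):
    """Map a generated option symbol back to its entity, or None if unknown."""
    return symbol_to_entity.get(symbol.strip().strip("()").upper())

# Hypothetical example; in practice a retriever would supply the candidates.
prompt, mapping = build_multiple_choice_prompt(
    mention="heart attack",
    context="The patient was admitted after a heart attack.",
    candidates=["Myocardial infarction", "Cardiac arrest", "Angina pectoris"],
)
predicted = resolve_answer("A", mapping)
```

Because all candidates appear in one input, a generator can compare them directly before emitting a single symbol, which is the property the abstract highlights.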

Bits and Pieces: Investigating the Effects of Subwords in Multi-task Parsing across Languages and Domains
Daniel Dakota | Sandra Kübler

Neural parsing is very dependent on the underlying language model. However, very little is known about how choices in the language model affect parsing performance, especially in multi-task learning. We investigate how the choice of subwords affects parsing and how subword sharing is responsible for gains or negative transfer in a multi-task setting where each task is parsing of a specific domain of the same language. More specifically, we investigate these issues across four languages: English, German, Italian, and Turkish. We find a general preference for averaged or last subwords across languages and domains. However, specific POS tags may require different subwords, and the distributional overlap between subwords across domains is perhaps a more influential factor in determining positive or negative transfer than discrepancies in the data sizes.

BiVert: Bidirectional Vocabulary Evaluation Using Relations for Machine Translation
Carinne Cherf | Yuval Pinter

Neural machine translation (NMT) has progressed rapidly in the past few years, promising improvements and quality translations for different languages. Evaluation of this task is crucial to determine the quality of the translation. Overall, insufficient emphasis is placed on the actual sense of the translation in traditional methods. We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text. This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet. Through the calculation of the semantic distance between the source and the back translation of the output, our method introduces a quantifiable approach that enables sentence comparison on the same linguistic level. Factual analysis shows a strong correlation between the average evaluation scores generated by our method and the human assessments across various machine translation systems for the English-German language pair. Finally, our method proposes a new multilingual approach to rank MT systems without the need for parallel corpora.

BKEE: Pioneering Event Extraction in the Vietnamese Language
Thi-Nhung Nguyen | Bang Tien Tran | Trong-Nghia Luu | Thien Huu Nguyen | Kiem-Hieu Nguyen

Event Extraction (EE) is a fundamental task in information extraction, aimed at identifying events and their associated arguments within textual data. It holds significant importance in various applications and serves as a catalyst for the development of related tasks. Despite the availability of numerous datasets and methods for event extraction in various languages, there has been a notable absence of a dedicated dataset for the Vietnamese language. To address this limitation, we propose BKEE, a novel event extraction dataset for Vietnamese. BKEE encompasses over 33 distinct event types and 28 different event argument roles, providing a labeled dataset for entity mentions, event mentions, and event arguments on 1066 documents. Additionally, we establish robust baselines for potential downstream tasks on this dataset, facilitating the analysis of challenges and future development prospects in the field of Vietnamese event extraction.

BlendX: Complex Multi-Intent Detection with Blended Patterns
Yejin Yoon | Jungyeon Lee | Kangsan Kim | Chanhee Park | Taeuk Kim

Task-oriented dialogue (TOD) systems are commonly designed with the presumption that each utterance represents a single intent. However, this assumption may not accurately reflect real-world situations, where users frequently express multiple intents within a single utterance. While there is an emerging interest in multi-intent detection (MID), existing in-domain datasets such as MixATIS and MixSNIPS have limitations in their formulation. To address these issues, we present BlendX, a suite of refined datasets featuring more diverse patterns than their predecessors, elevating both its complexity and diversity. For dataset construction, we utilize both rule-based heuristics as well as a generative tool—OpenAI’s ChatGPT—which is augmented with a similarity-driven strategy for utterance selection. To ensure the quality of the proposed datasets, we also introduce three novel metrics that assess the statistical properties of an utterance related to word count, conjunction use, and pronoun usage. Extensive experiments on BlendX reveal that state-of-the-art MID models struggle with the challenges posed by the new datasets, highlighting the need to reexamine the current state of the MID field. The dataset is available at https://github.com/HYU-NLP/BlendX.

BLN600: A Parallel Corpus of Machine/Human Transcribed Nineteenth Century Newspaper Texts
Callum William Booth | Alan Thomas | Robert Gaizauskas

We present a publicly available corpus of nineteenth-century newspaper text focused on crime in London, derived from the Gale British Library Newspapers corpus parts 1 and 2. The corpus comprises 600 newspaper excerpts and for each excerpt contains the original source image, the machine transcription of that image as found in the BLN and a gold standard manual transcription that we have created. We envisage the corpus will be helpful for the training and development of OCR and post-OCR correction methodologies for historical newspaper machine transcription—for which there is currently a dearth of publicly available resources. In this paper, we discuss the rationale behind gathering such a corpus, the methodology used to select, process, and align the data, and the corpus’ potential utility for historians and digital humanities researchers—particularly within the realms of neural machine translation-based post-OCR correction approaches, and other natural language processing tasks that are critically affected by erroneous OCR.

Bootstrapping UMR Annotations for Arapaho from Language Documentation Resources
Matthew J. Buchholz | Julia Bonn | Claire Benet Post | Andrew Cowell | Alexis Palmer

Uniform Meaning Representation (UMR) is a semantic labeling system in the AMR family designed to be uniformly applicable to typologically diverse languages. The UMR labeling system is quite thorough and can be time-consuming to execute, especially if annotators are starting from scratch. In this paper, we focus on methods for bootstrapping UMR annotations for a given language from existing resources, and specifically from typical products of language documentation work, such as lexical databases and interlinear glossed text (IGT). Using Arapaho as our test case, we present and evaluate a bootstrapping process that automatically generates UMR subgraphs from IGT. Additionally, we describe and evaluate a method for bootstrapping valency lexicon entries from lexical databases for both the target language and English. We are able to generate enough basic structure in UMR graphs from the existing Arapaho interlinearized texts to automate UMR labeling to a significant extent. Our method thus has the potential to streamline the process of building meaning representations for new languages without existing large-scale computational resources.

BootTOD: Bootstrap Task-oriented Dialogue Representations by Aligning Diverse Responses
Weihao Zeng | Keqing He | Yejie Wang | Dayuan Fu | Weiran Xu

Pre-trained language models have been successful in many scenarios. However, their usefulness in task-oriented dialogues is limited due to the intrinsic linguistic differences between general text and task-oriented dialogues. Current task-oriented dialogue pre-training methods rely on a contrastive framework, which faces challenges such as selecting true positives and hard negatives, as well as lacking diversity. In this paper, we propose a novel dialogue pre-training model called BootTOD. It learns task-oriented dialogue representations via a self-bootstrapping framework. Unlike contrastive counterparts, BootTOD aligns context and context+response representations and dismisses the requirements of contrastive pairs. BootTOD also uses multiple appropriate response targets to model the intrinsic one-to-many diversity of human conversations. Experimental results show that BootTOD outperforms strong TOD baselines on diverse downstream dialogue tasks.

Text image machine translation (TIMT) aims at translating source language texts in images into another target language, which has been proven successful by bridging a text image recognition encoder and a text translation decoder. However, it is still an open question how to incorporate fine-grained knowledge supervision to keep the recognition and translation modules consistent. In this paper, we propose a novel TIMT method named BabyNet, which is optimized with hierarchical parental supervision to improve translation performance. Inspired by genetic recombination and variation in the field of genetics, the proposed BabyNet is inherited from the recognition and translation parent models with a variation module whose parameters can be updated when training on the TIMT task. Meanwhile, hierarchical and multi-granularity supervision from parent models is introduced to bridge the gap between inherited modules in BabyNet. Extensive experiments on both synthetic and real-world TIMT tests show that our proposed method significantly outperforms existing methods. Further analyses of various parent model combinations show the good generalization of our method.

BP4ER: Bootstrap Prompting for Explicit Reasoning in Medical Dialogue Generation
Yuhong He | Yongqi Zhang | Shizhu He | Jun Wan

Medical dialogue generation (MDG) has gained increasing attention due to its substantial practical value. Previous works typically employ a sequence-to-sequence framework to generate medical responses by modeling dialogue context as sequential text with annotated medical entities. While these methods have been successful in generating fluent responses, they fail to provide process explanations of reasoning and require extensive entity annotation. To address these limitations, we propose the method Bootstrap Prompting for Explicit Reasoning in MDG (BP4ER), which explicitly models MDG’s multi-step reasoning process and iteratively enhances this reasoning process. We employ a least-to-most prompting strategy to guide a large language model (LLM) in explicit reasoning, breaking down MDG into simpler sub-questions. These sub-questions build on answers from previous ones. Additionally, we introduce two distinct bootstrapping techniques for prompting, which autonomously correct errors and facilitate the LLM’s explicit reasoning. This approach eliminates the need for entity annotation and increases the transparency of the MDG process by explicitly generating the intermediate reasoning chain. Experimental results on two public datasets show that BP4ER outperforms state-of-the-art methods across both objective and subjective evaluation.

Multimodal sarcasm detection has received considerable attention due to its unique role in social networks. Existing methods often rely on feature concatenation to fuse different modalities or model the inconsistencies among modalities. However, sarcasm is often embodied in local and momentary nuances in a subtle way, which causes difficulty for sarcasm detection. To effectively incorporate these nuances, this paper presents Context-Aware Self-Attention Fusion (CAAF) to integrate local and momentary multimodal information into specific words. Furthermore, due to the instantaneous nature of sarcasm, the connotative meanings of words post-multimodal integration generally deviate from their denotative meanings. Therefore, Word Weight Calculation (WWC) is presented to compute the weight of specific words based on CAAF’s fusion nuances, illustrating the inconsistency between connotation and denotation. We evaluate our method on the MUStARD dataset, achieving an accuracy of 76.9 and an F1 score of 76.1, which surpasses the current state-of-the-art IWAN model by 1.7 and 1.6 respectively.

Bridging Computational Lexicography and Corpus Linguistics: A Query Extension for OntoLex-FrAC
Christian Chiarcos | Ranka Stanković | Maxim Ionov | Gilles Sérasset

OntoLex, the dominant community standard for machine-readable lexical resources in the context of RDF, Linked Data and Semantic Web technologies, is currently being extended with a designated module for Frequency, Attestations and Corpus-based Information (OntoLex-FrAC). We propose a novel component for OntoLex-FrAC, addressing the incorporation of corpus queries for (a) linking dictionaries with corpus engines, (b) enabling RDF-based web services to exchange corpus queries and response data dynamically, and (c) using conventional query languages to formalize the internal structure of collocations, word sketches, and colligations. The primary field of application of the query extension is in digital lexicography and corpus linguistics, and we present a proof-of-principle implementation in backend components of a novel platform designed to support digital lexicography for the Serbian language.

Bridging Textual and Tabular Worlds for Fact Verification: A Lightweight, Attention-Based Model
Shirin Dabbaghi Varnosfaderani | Canasai Kruengkrai | Ramin Yahyapour | Junichi Yamagishi

FEVEROUS is a benchmark and research initiative focused on fact extraction and verification tasks involving unstructured text and structured tabular data. In FEVEROUS, existing works often rely on extensive preprocessing and utilize rule-based transformations of data, leading to potential context loss or misleading encodings. This paper introduces a simple yet powerful model that nullifies the need for modality conversion, thereby preserving the original evidence’s context. By leveraging pre-trained models on diverse text and tabular datasets and by incorporating a lightweight attention-based mechanism, our approach efficiently exploits latent connections between different data types, thereby yielding comprehensive and reliable verdict predictions. The model’s modular structure adeptly manages multi-modal information, ensuring the integrity and authenticity of the original evidence are uncompromised. Comparative analyses reveal that our approach exhibits competitive performance, aligning itself closely with top-tier models on the FEVEROUS benchmark.

Bridging the Code Gap: A Joint Learning Framework across Medical Coding Systems
Geunyeong Jeong | Seokwon Jeong | Juoh Sun | Harksoo Kim

Automated Medical Coding (AMC) is the task of automatically converting free-text medical documents into predefined codes according to a specific medical coding system. Although deep learning has significantly advanced AMC, the class imbalance problem remains a significant challenge. To address this issue, most existing methods consider only a single coding system and disregard the potential benefits of reflecting the relevance between different coding systems. To bridge this gap, we introduce a Joint learning framework for Across Medical coding Systems (JAMS), which jointly learns different coding systems through multi-task learning. It learns various representations using a shared encoder and explicitly captures the relationships across these coding systems using the medical code attention network, a modification of the graph attention network. In the experiments on the MIMIC-IV ICD-9 and MIMIC-IV ICD-10 datasets, connected through General Equivalence Mappings, JAMS improved the performance consistently regardless of the backbone models. This result demonstrates its model-agnostic characteristic, which is not constrained by specific model structures. Notably, JAMS significantly improved the performance of low-frequency codes. Our analysis shows that these performance gains are due to the connections between the codes of the different coding systems.

Temporal knowledge graph forecasting aims to reason over known facts to complete the missing links in the future. Existing methods are highly dependent on the structures of temporal knowledge graphs and commonly utilize recurrent or graph neural networks for forecasting. However, entities that are infrequently observed or have not been seen recently face challenges in learning effective knowledge representations due to insufficient structural contexts. To address the above disadvantages, in this paper, we propose a Contrastive Prompt-based framework with Entity background information for TKG forecasting, which we named CoPET. Specifically, to bring the time-invariant entity background information to time-variant structural information, we employ a dual encoder architecture consisting of a candidate encoder and a query encoder. A contrastive learning framework is used to encourage the query representation to be closer to the candidate representation. We further propose three kinds of trainable time-variant prompts aimed at capturing temporal structural information. Experiments on two datasets demonstrate that our method is effective and stays competitive in inference with limited structural information. Our code is available at https://github.com/qianxinying/CoPET.
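
As a rough illustration of the contrastive objective between query and candidate representations, the sketch below uses a generic InfoNCE loss with in-batch negatives in NumPy. It is a stand-in under stated assumptions, not CoPET's actual loss: the temperature value, the random vectors that stand in for encoder outputs, and the function name `info_nce_loss` are all hypothetical.

```python
import numpy as np

def info_nce_loss(query, candidates, temperature=0.1):
    """InfoNCE with in-batch negatives.

    Row i of `candidates` is the positive for row i of `query`;
    every other row in the batch acts as a negative.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = q @ c.T / temperature                    # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
candidate_repr = rng.normal(size=(4, 16))             # stand-in for candidate encoder outputs
noise = 0.01 * rng.normal(size=(4, 16))

# Queries close to their own candidates give a low loss ...
aligned_loss = info_nce_loss(candidate_repr + noise, candidate_repr)
# ... while queries paired with the wrong candidates give a high loss.
shuffled_loss = info_nce_loss(candidate_repr[::-1] + noise, candidate_repr)
```

Minimizing such a loss pulls each query representation toward its matching candidate and away from the others in the batch.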

This paper reports the first release of the UMR (Uniform Meaning Representation) data set. UMR is a graph-based meaning representation formalism consisting of a sentence-level graph and a document-level graph. The sentence-level graph represents predicate-argument structures, named entities, word senses, aspectuality of events, as well as person and number information for entities. The document-level graph represents coreferential, temporal, and modal relations that go beyond sentence boundaries. UMR is designed to capture the commonalities and variations across languages, and this is done through the use of a common set of abstract concepts, relations, and attributes as well as concrete concepts derived from words from individual languages. This UMR release includes annotations for six languages (Arapaho, Chinese, English, Kukama, Navajo, Sanapana) that vary greatly in terms of their linguistic properties and resource availability. We also describe on-going efforts to enlarge this data set and extend it to other genres and modalities, and briefly describe the available infrastructure (UMR annotation guidelines and tools) that others can use to create similar data sets.

Building a Database of Conversational Routines
Polina Bychkova | Alyaxey Yaskevich | Serafima Gyulasaryan | Ekaterina Rakhilina

This paper discusses the Routinicon, a new constructicographic resource for the description of conversational routines. Conversational routines are defined as conventional formulaic expressions that language speakers use in standard extralinguistic situations (cf. Bless you! as a reaction to sneezing or Who’s there? as a typical answer to a knock on the door). The Routinicon’s goal is to accumulate the routines that constitute the inventory of conventional expressions in the Russian language and systematically describe them in a way that would enable future cross-linguistic comparison and typological research. Conceptually, the Routinicon is a natural extension of such projects as the Russian Constructicon and Pragmaticon. It inherits their approach to the systematization of phraseological units as well as to the data collection. At the same time, the new project focuses on a fundamentally different domain of units and hence offers a radically new structure of linguistic annotation. Its principles and challenges are addressed in the paper.

Current LLM-based applications are becoming steadily available for everyone with reliable access to technology and the internet. These applications offer benefits to their users that leave those without access to them at a serious disadvantage. Given the vast amount of data needed to train LLMs, the gap between languages with access to such a quantity of data and those without it is currently larger than ever. Aimed at bridging this gap, the Aina Project was created to provide Catalan with the necessary resources to remain relevant in the context of AI/NLP applications based on LLMs. We thus present a set of strategies to consider when improving technology support for a mid- or low-resource language, especially addressing the sustainability of high-quality data acquisition and the challenges involved in the process. We also introduce a large amount of new annotated data for Catalan. Our hope is that those interested in replicating this work for another language can learn from what worked for us, the challenges that we faced, and the sometimes disheartening truth of working with mid- and low-resource languages.

Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer
Youmi Ma | An Wang | Naoaki Okazaki

Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset are observed to suffer from low recall. We investigate the error cases and attribute the failure to the different surface structures and semantics of documents translated from English and those written by native speakers. We thus switch to exploring whether the transferred dataset can assist human annotation on Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce approximately 50% of the human edit steps compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.

Building MUSCLE, a Dataset for MUltilingual Semantic Classification of Links between Entities
Lucia Pitarch | Carlos Bobed Lisbona | David Abián | Jorge Gracia | Jordi Bernad

In this paper we introduce MUSCLE, a dataset for MUltilingual lexico-Semantic Classification of Links between Entities. The MUSCLE dataset was designed to train and evaluate Lexical Relation Classification (LRC) systems with 27K pairs of universal concepts selected from Wikidata, a large and highly multilingual factual Knowledge Graph (KG). Each pair of concepts includes its lexical forms in 25 languages and is labeled with up to five possible lexico-semantic relations between the concepts: hypernymy, hyponymy, meronymy, holonymy, and antonymy. Inspired by Semantic Map theory, the dataset bridges lexical and conceptual semantics, is more challenging and robust than previous datasets for LRC, avoids lexical memorization, is domain-balanced across entities, and enables enrichment and hierarchical information retrieval.

Building Question-Answer Data Using Web Register Identification
Anni Eskelinen | Amanda Myntti | Erik Henriksson | Sampo Pyysalo | Veronika Laippala

This article introduces a resource-efficient method for developing question-answer (QA) datasets by extracting QA pairs from web-scale data using machine learning (ML). Our method benefits from recent advances in web register (genre) identification and consists of two ML steps with an additional post-processing step. First, using XLM-R and the multilingual CORE web register corpus series with categories such as QA Forum, we train a multilingual classifier to retrieve documents that are likely to contain QA pairs from web-scale data. Second, we develop a NER-style token classifier to identify the QA text spans within these documents. To this end, we experiment with training on a semi-synthetic dataset built on top of the English LFQA, a small set of manually cleaned web QA pairs in English and Finnish, and a Finnish web QA pair dataset cleaned using ChatGPT. The evaluation of our pipeline demonstrates its capability to efficiently retrieve a substantial volume of QA pairs. While the approach is adaptable to any language given the availability of language models and extensive web data, we showcase its efficiency in English and Finnish, developing the first open, non-synthetic and non-machine translated QA dataset for Finnish – Turku WebQA – comprising over 200,000 QA pairs.

CAGK: Collaborative Aspect Graph Enhanced Knowledge-based Recommendation
Xiaotong Song | Huiping Lin | Jiatao Zhu | Xinyi Gong

Auxiliary information, such as knowledge graphs (KGs), has become increasingly crucial in recommender systems. However, current KG-based recommendation still has some limitations: (1) low link rates between items and KG entities, and (2) redundant knowledge in the KG. In this paper, we introduce aspects, which are keywords describing item attributes in reviews, to KG-based recommendation, and propose a new model, Collaborative Aspect Graph enhanced Knowledge-based Network (CAGK). Firstly, CAGK builds a Collaborative Aspect Graph (CAG) with user-item interactions, aspects and KG, where aspects can fill most of the sparsity. Secondly, we leverage interactive information and aspect features to generate aspect-aware guidance signals to customize knowledge extraction and eliminate redundant knowledge. Lastly, we utilize low ratings and negative aspect sentiment to capture features that users dislike to prevent repetitive recommendations of disliked items. Experimental results on two widely used benchmark datasets, Amazon-book and Yelp2018, confirm the superiority of CAGK.

CALAMR: Component ALignment for Abstract Meaning Representation
Paul Landes | Barbara Di Eugenio

We present Component ALignment for Abstract Meaning Representation (Calamr), a novel method for graph alignment that can support summarization and its evaluation. First, our method produces graphs that explain what is summarized through their alignments, which can be used to train graph-based summarization learners. Second, although numerous scoring methods have been proposed for abstract meaning representation (AMR) that evaluate semantic similarity, no AMR-based summarization metrics exist despite years of work using AMR for this task. Calamr provides alignments on which new scores can be based. The contributions of this work include a) a novel approach to aligning AMR graphs, b) new summarization-based scoring methods for similarity of AMR subgraphs composed of one or more sentences, and c) the entire reusable source code to reproduce our results.

Recent advancements in large language models (LLMs) and their emergent capabilities make LLMs promising reference-free evaluators of the quality of natural language generation, and a competent alternative to human evaluation. However, because many LLMs are closed-source or computationally demanding to host and tune, there is little established practice for further calibrating an off-the-shelf LLM-based evaluator towards better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Then, an initial set of scoring criteria is drafted by the language model itself, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration. Our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.

Comparative Question Answering (CompQA) is a Natural Language Processing task that combines Question Answering and Argument Mining approaches to answer subjective comparative questions in an efficient argumentative manner. In this paper, we present an end-to-end (full pipeline) system for answering comparative questions called CAM 2.0 as well as a public leaderboard called CompUGE that unifies the existing datasets under a single easy-to-use evaluation suite. As compared to previous web-form-based CompQA systems, it features question identification, object and aspect labeling, stance classification, and summarization using up-to-date models. We also select the most time- and memory-effective pipeline by comparing separately fine-tuned Transformer Encoder models which show state-of-the-art performance on the subtasks with Generative LLMs in few-shot and LoRA setups. We also conduct a user study for a whole-system evaluation.

CAMAL: A Novel Dataset for Multi-label Conversational Argument Move Analysis
Viet Dac Lai | Duy Ngoc Pham | Jonathan Steinberg | Jamie Mikeska | Thien Huu Nguyen

Understanding the discussion moves that teachers and students use to engage in classroom discussions is important to support pre-service teacher learning and teacher educators. This work introduces a novel conversational multi-label corpus of teaching transcripts collected from a simulated classroom environment for Conversational Argument Move AnaLysis (CAMAL). The dataset offers various argumentation moves used by pre-service teachers and students in mathematics and science classroom discussions. The dataset includes 165 transcripts from these discussions that pre-service elementary teachers facilitated in a simulated classroom environment of five student avatars. The discussion transcripts were annotated by education assessment experts for nine argumentation moves (a.k.a. intents) used by the pre-service teachers and students during the discussions. In this paper, we describe the dataset, our annotation framework, and the models we employed to detect argumentation moves. Our experiments with state-of-the-art models demonstrate the complexity of the CAMAL task presented in the dataset. The results reveal that models combining CNN and LSTM structures with speaker ID graphs improved the F1-score of our baseline models for detecting speakers’ intents by a large margin. Given its complexity, the CAMAL task creates research opportunities for future studies. We share the dataset, the source code, and the annotation framework publicly at http://github.com/uonlp/camal-dataset.

Camel Morph MSA: A Large-Scale Open-Source Morphological Analyzer for Modern Standard Arabic
Christian Khairallah | Salam Khalifa | Reham Marzouk | Mayar Nassar | Nizar Habash

We present Camel Morph MSA, the largest open-source Modern Standard Arabic morphological analyzer and generator. Camel Morph MSA has over 100K lemmas, and includes rarely modeled morphological features of Modern Standard Arabic with Classical Arabic origins. Camel Morph MSA can produce ∼1.45B analyses and ∼535M unique diacritizations, almost an order of magnitude more than SAMA (Maamouri et al., 2010c), in addition to having a ∼36% lower OOV rate than SAMA on a 10B word corpus. Furthermore, Camel Morph MSA fills the gaps of many lemma paradigms by modeling linguistic phenomena consistently. Camel Morph MSA seamlessly integrates with the Camel Tools Python toolkit (Obeid et al., 2020), ensuring ease of use and accessibility.

CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data
Rian Touchent | Éric de la Clergerie

Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However, these documents are unstructured, and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these models are trained for plain language and are less efficient on biomedical data. Addressing this gap, we introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset. Through continual pre-training of the original CamemBERT, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks, reinforcing the potential of continual pre-training as an equally proficient yet less computationally intensive alternative to training from scratch. Additionally, we highlight the importance of using a standard evaluation protocol that provides a clear view of the current state-of-the-art for French biomedical models.

CAMERA³: An Evaluation Dataset for Controllable Ad Text Generation in Japanese
Go Inoue | Akihiko Kato | Masato Mita | Ukyo Honda | Peinan Zhang

Ad text generation is the task of creating compelling text from an advertising asset that describes products or services, such as a landing page. In advertising, diversity plays an important role in enhancing the effectiveness of an ad text, mitigating a phenomenon called “ad fatigue,” where users become disengaged due to repetitive exposure to the same advertisement. Despite numerous efforts in ad text generation, the aspect of diversifying ad texts has received limited attention, particularly in non-English languages like Japanese. To address this, we present CAMERA³, an evaluation dataset for controllable text generation in the advertising domain in Japanese. Our dataset includes 3,980 ad texts written by expert annotators, taking into account various aspects of ad appeals. We make CAMERA³ publicly available, allowing researchers to examine the capabilities of recent NLG models in controllable text generation in a real-world scenario.

Can Factual Statements Be Deceptive? The DeFaBel Corpus of Belief-based Deception
Aswathy Velutharambath | Amelie Wührl | Roman Klinger

If a person firmly believes in a non-factual statement, such as “The Earth is flat”, and argues in its favor, there is no inherent intention to deceive. As the argumentation stems from genuine belief, it may be unlikely to exhibit the linguistic properties associated with deception or lying. This interplay of factuality, personal belief, and intent to deceive remains an understudied area. Disentangling the influence of these variables in argumentation is crucial to gain a better understanding of the linguistic properties attributed to each of them. To study the relation between deception and factuality, based on belief, we present the DeFaBel corpus, a crowd-sourced resource of belief-based deception. To create this corpus, we devise a study in which participants are instructed to write arguments supporting statements like “eating watermelon seeds can cause indigestion”, regardless of the statements’ factual accuracy or their personal beliefs about them. In addition to the generation task, we ask them to disclose their belief about the statement. The collected instances are labelled as deceptive if the arguments are in contradiction to the participants’ personal beliefs. Each instance in the corpus is thus annotated (or implicitly labelled) with personal beliefs of the author, factuality of the statement, and the intended deceptiveness. The DeFaBel corpus contains 1031 texts in German, out of which 643 are deceptive and 388 are non-deceptive. It is the first publicly available corpus for studying deception in German. In our analysis, we find that people are more confident in the persuasiveness of their arguments when the statement is aligned with their belief, but surprisingly less confident when they are generating arguments in favor of facts. The DeFaBel corpus can be obtained from https://www.ims.uni-stuttgart.de/data/defabel.
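
The labelling rule described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a participant's disclosed belief is recorded as a single boolean; the corpus itself may record belief differently, and the function name is hypothetical.

```python
def is_deceptive(believes_statement: bool) -> bool:
    """Label an argument under the rule stated in the abstract.

    Every collected argument supports the statement, so the argument
    contradicts the author's belief exactly when the author does not
    believe the statement. The boolean belief flag is an assumption.
    """
    return not believes_statement


# Counts reported in the abstract; the two classes add up to the corpus size.
N_DECEPTIVE = 643
N_NON_DECEPTIVE = 388
TOTAL = N_DECEPTIVE + N_NON_DECEPTIVE
```

For example, an author who does not believe the watermelon-seed statement but argues for it would be labelled deceptive under this sketch.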

Can GPT-4 Identify Propaganda? Annotation and Detection of Propaganda Spans in News Articles
Maram Hasanain | Fatema Ahmad | Firoj Alam

The use of propaganda has spiked on mainstream and social media, aiming to manipulate or mislead users. While efforts to automatically detect propaganda techniques in textual, visual, or multimodal content have increased, most of them primarily focus on English content. The majority of the recent initiatives targeting medium to low-resource languages produced relatively small annotated datasets, with a skewed distribution, posing challenges for the development of sophisticated propaganda detection models. To address this challenge, we carefully develop the largest propaganda dataset to date, ArPro, comprised of 8K paragraphs from newspaper articles, labeled at the text span level following a taxonomy of 23 propagandistic techniques. Furthermore, our work offers the first attempt to understand the performance of large language models (LLMs), using GPT-4, for fine-grained propaganda detection from text. Results showed that GPT-4’s performance degrades as the task moves from simply classifying a paragraph as propagandistic or not, to the fine-grained task of detecting propaganda techniques and their manifestation in text. Compared to models fine-tuned on the dataset for propaganda detection at different classification granularities, GPT-4 is still far behind. Finally, we evaluate GPT-4 on a dataset consisting of six other languages for span detection, and results suggest that the model struggles with the task across languages. We made the dataset publicly available for the community.

Textual domain is a crucial property within the Natural Language Processing (NLP) community due to its effects on downstream model performance. The concept itself is, however, loosely defined and, in practice, refers to any non-typological property, such as genre, topic, medium or style of a document. We investigate the core notion of domains via human proficiency in identifying related intrinsic textual properties, specifically the concepts of genre (communicative purpose) and topic (subject matter). We publish our annotations in TGeGUM: A collection of 9.1k sentences from the GUM dataset (Zeldes, 2017) with single sentence and larger context (i.e., prose) annotations for one of 11 genres (source type), and its topic/subtopic as per the Dewey Decimal library classification system (Dewey, 1979), consisting of 10/100 hierarchical topics of increased granularity. Each instance is annotated by three annotators, for a total of 32.7k annotations, allowing us to examine the level of human disagreement and the relative difficulty of each annotation task. With a Fleiss’ kappa of at most 0.53 on the sentence level and 0.66 at the prose level, it is evident that despite the ubiquity of domains in NLP, there is little human consensus on how to define them. By training classifiers to perform the same task, we find that this uncertainty also extends to NLP models.
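
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of the standard Fleiss' kappa computation for a fixed number of annotators per item. The function name and the toy labels are illustrative assumptions, not taken from TGeGUM or the authors' code.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels.

    Assumes every item was labelled by the same number of raters
    (three per instance in the abstract). Undefined when all labels
    fall into a single category (expected agreement of 1).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the pairwise agreement rate.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement from the overall category proportions.
    p_cats = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_cats)

    return (p_bar - p_e) / (1 - p_e)


# Toy data: four sentences, three annotators each (made-up labels).
toy = [
    ["news", "news", "news"],
    ["fiction", "fiction", "news"],
    ["fiction", "fiction", "fiction"],
    ["news", "fiction", "news"],
]
kappa = fleiss_kappa(toy)
```

The statistic corrects raw agreement for the agreement expected by chance, which is why values such as 0.53 and 0.66 can indicate only moderate consensus even when annotators often coincide.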

Can Language Models Learn Embeddings of Propositional Logic Assertions?
Nurul Fajrin Ariyani | Zied Bouraoui | Richard Booth | Steven Schockaert

Natural language offers an appealing alternative to formal logics as a vehicle for representing knowledge. However, using natural language means that standard methods for automated reasoning can no longer be used. A popular solution is to use transformer-based language models (LMs) to directly reason about knowledge expressed in natural language, but this has two important limitations. First, the set of premises is often too large to be directly processed by the LM. This means that we need a retrieval strategy which can select the most relevant premises when trying to infer some conclusion. Second, LMs have been found to learn shortcuts and thus lack robustness, putting in doubt to what extent they actually understand the knowledge that is expressed. Given these limitations, we explore the following alternative: rather than using LMs to perform reasoning directly, we use them to learn embeddings of individual assertions. Reasoning is then carried out by manipulating the learned embeddings. We show that this strategy is feasible to some extent, while at the same time also highlighting the limitations of directly fine-tuning LMs to learn the required embeddings.

Can Large Language Models Automatically Score Proficiency of Written Essays?
Watheq Ahmad Mansour | Salam Albatarni | Sohaila Eltanbouly | Tamer Elsayed

Although several methods have been proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to be desired in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring out their maximum potential on this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences
Sai Koneru | Jian Wu | Sarah Rajtmajer

Hypothesis formulation and testing are central to empirical research. A strong hypothesis is a best guess based on existing evidence and informed by a comprehensive view of relevant literature. However, with the exponential increase in the number of scientific articles published annually, manual aggregation and synthesis of evidence related to a given hypothesis is a challenge. Our work explores the ability of current large language models (LLMs) to discern evidence that supports or refutes specific hypotheses based on the text of scientific abstracts. We share a novel dataset for the task of scientific hypothesis evidencing using community-driven annotations of studies in the social sciences. We compare the performance of LLMs to several state-of-the-art methods and highlight opportunities for future research in this area. Our dataset is shared with the research community: https://github.com/Sai90000/ScientificHypothesisEvidencing.git

Can Large Language Models Learn Translation Robustness from Noisy-Source In-context Demonstrations?
Leiyu Pan | Yongqi Leng | Deyi Xiong

Large language models (LLMs) have been used for machine translation. When provided with prompts and source sentences, LLMs can achieve impressive translation results. However, the robustness of these LLMs remains a significant challenge, as they often struggle to accurately translate sentences in the presence of noise, even when using similarity-based in-context learning methods. This work proposes a research scheme for studying machine translation robustness on LLMs, investigating whether LLMs can learn translation robustness from noisy-source demonstration examples. Through experiments on different models, languages, and noise types, we empirically demonstrate that LLMs can learn how to handle noise and translation methods from noisy-source demonstration examples, thereby improving their translation performance on noisy sentences. Furthermore, we find that appropriately increasing the noise ratio of the noisy-source demonstration examples can enhance the translation robustness of LLMs. Additionally, we investigate scenarios where LLMs are more likely to learn translation robustness for mixed and specific types of noise. We find that the model’s performance varies across different noise settings.

Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning?
Shaoxiong Ji | Timothee Mickus | Vincent Segonne | Jörg Tiedemann

Multilingual pretraining and fine-tuning have remarkably succeeded in various natural language processing tasks. Transferring representations from one language to another is especially crucial for cross-lingual learning. One can expect machine translation objectives to be well suited to fostering such capabilities, as they involve the explicit alignment of semantically equivalent sentences from different languages. This paper investigates the potential benefits of employing machine translation as a continued training objective to enhance language representation learning, bridging multilingual pretraining and cross-lingual applications. We study this question through two lenses: a quantitative evaluation of the performance of existing models and an analysis of their latent representations. Our results show that, contrary to expectations, machine translation as a continued training objective fails to enhance cross-lingual representation learning in multiple cross-lingual natural language understanding tasks. We conclude that explicit sentence-level alignment in the cross-lingual scenario is detrimental to cross-lingual transfer pretraining, which has important implications for future cross-lingual transfer studies. We furthermore provide evidence through similarity measures and investigation of parameters that this lack of positive influence is due to output separability, which we argue is of use for machine translation but detrimental elsewhere.

Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLMs’ capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQs’ efficacy, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit an order sensitivity in bilingual MCQs, favoring answers located at specific positions, i.e., the first position. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs’ output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the idea that the higher the consistency, the greater the accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is not only reflected in the evaluation performance but also in the embedding space. Our code and models can be accessed at https://github.com/Meetyou-AI-Lab/Can-MC-Evaluate-LLMs.
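
Expected calibration error, mentioned above, is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence per bin. The sketch below shows that common equal-width-bin form; it is an assumption that the paper uses this variant, and the function name and toy numbers are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over bins.

    confidences: predicted probabilities in [0, 1] for the chosen answer.
    correct: 1 if the chosen answer was right, else 0.
    Uses equal-width bins (lo, hi]; a confidence of exactly 0 goes to bin 0.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [
            i
            for i, c in enumerate(confidences)
            if (lo < c <= hi) or (b == 0 and c == 0.0)
        ]
        if not idx:
            continue  # empty bins contribute nothing
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: two confident correct answers, two less confident mixed ones.
toy_ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

A lower value means that stated confidence tracks actual accuracy more closely, which is the sense in which one question format can be called more reliable than another.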

We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., <1B) language model (LM) for guiding a black-box large (i.e., >10B) LM in reasoning tasks. Specifically, the lightweight LM first generates a rationale for each input instance. The frozen large LM is then prompted to predict a task output based on the rationale generated by the lightweight LM. Our approach is resource-efficient in the sense that it only requires training the lightweight LM. We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals. We assess our method with two multi-hop extractive question answering (QA) benchmarks, HotpotQA and 2WikiMultiHopQA. Experimental results show that our approach outperforms all baselines regarding answer prediction accuracy. We also find that reinforcement learning helps the model to produce higher-quality rationales with improved QA performance.
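
The two-step inference flow described in this abstract can be sketched as follows. This is only an illustration of the control flow: the `LM` callable interface, the function name, and the prompt wording are hypothetical and do not come from the paper, and the training stages (distillation and reinforcement learning) are not shown.

```python
from typing import Callable

# Hypothetical interface: a language model is any function from prompt to text.
LM = Callable[[str], str]


def lm_guided_cot(question: str, context: str, small_lm: LM, large_lm: LM) -> str:
    """Answer a question using a rationale from a lightweight LM.

    Step 1: the lightweight LM writes a rationale for this instance.
    Step 2: the frozen large LM is prompted with that rationale and
    predicts the answer. The prompt templates are illustrative.
    """
    rationale = small_lm(f"Context: {context}\nQuestion: {question}\nRationale:")
    return large_lm(
        f"Context: {context}\nQuestion: {question}\n"
        f"Rationale: {rationale}\nAnswer:"
    )


# Stub models standing in for real LMs, to show the data flow only.
stub_small = lambda prompt: "R"
stub_large = lambda prompt: "seen" if "Rationale: R" in prompt else "missing"
result = lm_guided_cot("Who?", "Some passage.", stub_small, stub_large)
```

Because only the first callable would ever be trained in such a setup, the large model can remain a black box behind an API.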

Can We Identify Stance without Target Arguments? A Study for Rumour Stance Classification
Yue Li | Carolina Scarton

Considering a conversation thread, rumour stance classification aims to identify the opinion (e.g. agree or disagree) of replies towards a target (rumour story). Although the target is expected to be an essential component in traditional stance classification, we show that rumour stance classification datasets contain a considerable amount of real-world data whose stance could be naturally inferred directly from the replies, contributing to the strong performance of the supervised models without awareness of the target. We find that current target-aware models underperform in cases where the context of the target is crucial. Finally, we propose a simple yet effective framework to enhance reasoning with the targets, achieving state-of-the-art performance on two benchmark datasets.

Can We Learn Question, Answer, and Distractors All from an Image? A New Task for Multiple-choice Visual Question Answering
Wenjian Ding | Yao Zhang | Jun Wang | Adam Jatowt | Zhenglu Yang

Multiple-choice visual question answering (MC VQA) requires an answer picked from a list of distractors, based on a question and an image. This research has attracted wide interest from the fields of visual question answering, visual question generation, and visual distractor generation. However, these fields still stay in their own territories, and how to jointly generate meaningful questions, correct answers, and challenging distractors remains unexplored. In this paper, we introduce a novel task, Visual Question-Answer-Distractors Generation (VQADG), which can bridge this research gap and serve as a cornerstone for improving existing VQA models. Specific to the VQADG task, we present a novel framework consisting of a vision-and-language model to encode the given image and generate QADs jointly, and contrastive learning to ensure the consistency of the generated question, answer, and distractors. Empirical evaluations on the benchmark dataset validate the performance of our model in the VQADG task.

CARE: Co-Attention Network for Joint Entity and Relation Extraction
Wenjun Kong | Yamei Xia

Joint entity and relation extraction is the fundamental task of information extraction, consisting of two subtasks: named entity recognition and relation extraction. However, most existing joint extraction methods suffer from issues of feature confusion or inadequate interaction between the two subtasks. Addressing these challenges, in this work, we propose a Co-Attention network for joint entity and Relation Extraction (CARE). Our approach includes adopting a parallel encoding strategy to learn separate representations for each subtask, aiming to avoid feature overlap or confusion. At the core of our approach is the co-attention module that captures two-way interaction between the two subtasks, allowing the model to leverage entity information for relation prediction and vice versa, thus promoting mutual enhancement. Through extensive experiments on three benchmark datasets for joint entity and relation extraction (NYT, WebNLG, and SciERC), we demonstrate that our proposed model outperforms existing baseline models. Our code will be available at https://github.com/kwj0x7f/CARE.

CareCorpus: A Corpus of Real-World Solution-Focused Caregiver Strategies for Personalized Pediatric Rehabilitation Service Design
Mina Valizadeh | Vera C. Kaelin | Mary A. Khetani | Natalie Parde

In pediatric rehabilitation services, one intervention approach involves using solution-focused caregiver strategies to support children in their daily life activities. The manual sharing of these strategies is not scalable, warranting the need for an automated approach to recognize and select relevant strategies. We introduce CareCorpus, a dataset of 780 real-world strategies written by caregivers. Strategies underwent dual-annotation by three trained annotators according to four established rehabilitation classes (i.e., environment/context, n=325 strategies; a child’s sense of self, n=151 strategies; a child’s preferences, n=104 strategies; and a child’s activity competences, n=62 strategies) and a no-strategy class (n=138 instances) for irrelevant or indeterminate instances. The average percent agreement was 80.18%, with a Cohen’s Kappa of 0.75 across all classes. To validate this dataset, we propose multi-grained classification tasks for detecting and categorizing strategies, and establish new performance benchmarks ranging from F1=0.53 to F1=0.79. Our results provide a first step towards a smart option to sort caregiver strategies for use in designing pediatric rehabilitation care plans. This novel, interdisciplinary resource and application is also anticipated to generalize to other pediatric rehabilitation service contexts that target children with developmental needs.
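
The two agreement figures quoted above are related: Cohen's kappa is percent agreement corrected for chance. A minimal sketch for one pair of annotators follows; the function names, the shortened class labels, and the toy annotations are illustrative assumptions, while the class counts are the ones reported in the abstract.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    p_e = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Class sizes as reported in the abstract (keys are shortened labels).
CLASS_COUNTS = {
    "environment/context": 325,
    "sense of self": 151,
    "preferences": 104,
    "activity competences": 62,
    "no-strategy": 138,
}
TOTAL_INSTANCES = sum(CLASS_COUNTS.values())

# Toy annotations from two hypothetical annotators.
ann_1 = ["env", "env", "self", "self"]
ann_2 = ["env", "env", "self", "env"]
toy_kappa = cohens_kappa(ann_1, ann_2)
```

On the toy data the annotators agree on three of four items, yet kappa is lower than that raw rate because some of the agreement is attributable to chance.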

CASIMIR: A Corpus of Scientific Articles Enhanced with Multiple Author-Integrated Revisions
Léane Jourdan | Florian Boudin | Nicolas Hernandez | Richard Dufour

Writing a scientific article is a challenging task as it is a highly codified and specific genre; consequently, proficiency in written communication is essential for effectively conveying research findings and ideas. In this article, we propose an original textual resource on the revision step of the writing process of scientific articles. This new dataset, called CASIMIR, contains the multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews. Pairs of consecutive versions of an article are aligned at sentence-level while keeping paragraph location information as metadata for supporting future revision studies at the discourse level. Each pair of revised sentences is enriched with automatically extracted edits and associated revision intention. To assess the initial quality of the dataset, we conducted a qualitative study of several state-of-the-art text revision approaches and compared various evaluation metrics. Our experiments led us to question the relevance of the current evaluation methods for the text revision task.

Categorial Grammar Induction with Stochastic Category Selection
Christian Clark | William Schuler

Grammar induction, the task of learning a set of syntactic rules from minimally annotated training data, provides a means of exploring the longstanding question of whether humans rely on innate knowledge to acquire language. Of the various formalisms available for grammar induction, categorial grammars provide an appealing option due to their transparent interface between syntax and semantics. However, to obtain competitive results, previous categorial grammar inducers have relied on shortcuts such as part-of-speech annotations or an ad hoc bias term in the objective function to ensure desirable branching behavior. We present a categorial grammar inducer that eliminates both shortcuts: it learns from raw data, and does not rely on a biased objective function. This improvement is achieved through a novel stochastic process used to select the set of available syntactic categories. On a corpus of English child-directed speech, the model attains a recall-homogeneity of 0.48, a large improvement over previous categorial grammar inducers.

Causal Intersectionality and Dual Form of Gradient Descent for Multimodal Analysis: A Case Study on Hateful Memes
Yosuke Miyanishi | Minh Le Nguyen

Amidst the rapid expansion of Machine Learning (ML) and Large Language Models (LLMs), understanding the semantics within their mechanisms is vital. Causal analyses define semantics, while gradient-based methods are essential to eXplainable AI (XAI), interpreting the model’s ‘black box’. Integrating these, we investigate how a model’s mechanisms reveal its causal effect on evidence-based decision-making. Research indicates intersectionality - the combined impact of an individual’s demographics - can be framed as an Average Treatment Effect (ATE). This paper demonstrates that hateful meme detection can be viewed as an ATE estimation using intersectionality principles, and summarized gradient-based attention scores highlight distinct behaviors of three Transformer models. We further reveal that LLM Llama-2 can discern the intersectional aspects of the detection through in-context learning and that the learning process could be explained via meta-gradient, a secondary form of gradient. In conclusion, this work furthers the dialogue on Causality and XAI. Our code is available online (see External Resources section).

CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models
Yufei Huang | Deyi Xiong

Holistically measuring societal biases of large language models is crucial for detecting and reducing ethical risks in highly capable AI models. In this work, we present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models, covering stereotypes and societal biases in 14 social dimensions related to Chinese culture and values. The curation process contains 4 essential steps: bias identification, ambiguous context generation, AI-assisted disambiguous context generation, and manual review and recomposition. The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control. The dataset exhibits wide coverage and high diversity. Extensive experiments demonstrate the effectiveness of the dataset in evaluating model bias, with all 12 publicly available Chinese large language models exhibiting strong bias in certain categories. Additionally, we observe from our experiments that fine-tuned models could, to a certain extent, heed instructions and avoid generating harmful outputs, in the way of “moral self-correction”. Our dataset is available at https://anonymous.4open.science/r/CBBQ-B860/.

CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering
Hongbin Na

The recent advancements in artificial intelligence highlight the potential of language models in psychological health support. While models trained on data from mental health service platforms have achieved preliminary success, challenges persist in areas such as data scarcity, quality, and ensuring a solid foundation in psychological techniques. To address these challenges, this study introduces a novel approach to enhance the precision and efficacy of psychological support through large language models. Specifically, we design a prompt derived from principles of Cognitive Behavioral Therapy (CBT) and generate the CBT QA dataset for Chinese psychological health Q&A based on CBT structured intervention strategies. Unlike previous methods, our dataset emphasizes professional and structured responses. Utilizing this dataset, we fine-tuned a large language model to obtain CBT-LLM, a large-scale language model specifically designed for Cognitive Behavioral Therapy techniques. Empirical evaluations demonstrate that CBT-LLM excels in generating structured, professional, and highly relevant responses in psychological health support tasks, showcasing its practicality and quality. The model is available on Hugging Face: https://huggingface.co/Hongbin37/CBT-LLM.

End-to-end automatic speech recognition (ASR) systems often struggle to recognize rare named entities, such as personal names, organizations and terminologies that are not frequently encountered in the training data. This paper presents Contextual Biasing Whisper (CB-Whisper), a novel ASR system based on OpenAI’s Whisper model that can recognize user-defined named entities by performing open-vocabulary keyword-spotting (KWS) before the decoder. The KWS module leverages text-to-speech (TTS) techniques and a convolutional neural network (CNN) classifier to match the features between the entities and the utterances. To integrate the recognized entities into the Whisper decoder and avoid hallucinations, we carefully crafted multiple prompts with spoken form hints. Experiments show that the KWS module based on Whisper encoder’s features can recognize unseen user-defined keywords effectively. More importantly, the proposed CB-Whisper substantially improves the mixed-error-rate (MER) and entity recall compared to the original Whisper model on three internal datasets and two publicly available datasets including Aishell and ACL datasets that cover English-only, Chinese-only, and code-switching scenarios.
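
For readers unfamiliar with the mixed-error-rate (MER) named in this abstract, the sketch below shows one common way such a metric is computed for Chinese-English code-switched text: Chinese is scored per character, other text per whitespace-delimited word, and the error rate is the token-level edit distance divided by the reference length. This is an assumption-laden illustration, not the authors' scoring script; the tokenization rule and the names `tokenize_mixed`, `edit_distance`, and `mixed_error_rate` are hypothetical.

```python
import re

# Assumed tokenization: one token per CJK character, otherwise one token
# per whitespace-delimited run of non-CJK characters.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")


def tokenize_mixed(text):
    """Split text into CJK characters and non-CJK words."""
    return _TOKEN.findall(text)


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(reference, hypothesis):
    """Token-level edit distance divided by the reference length."""
    ref = tokenize_mixed(reference)
    hyp = tokenize_mixed(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)
```

For example, scoring the hypothesis "我喜 apply pie" against the reference "我喜欢 apple pie" counts one deleted character and one substituted word over five reference tokens.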

CEPT: A Contrast-Enhanced Prompt-Tuning Framework for Emotion Recognition in Conversation
Qingqing Gao | Jiuxin Cao | Biwei Cao | Xin Guan | Bo Liu

Emotion Recognition in Conversation (ERC) has attracted increasing attention due to its wide applications in public opinion analysis, empathetic conversation generation, and so on. However, ERC research suffers from the problems of data imbalance and the presence of similar linguistic expressions for different emotions. These issues can result in limited learning for minority emotions, biased predictions for common emotions, and the misclassification of different emotions with similar linguistic expressions. To alleviate these problems, we propose a Contrast-Enhanced Prompt-Tuning (CEPT) framework for ERC. We transform the ERC task into a Masked Language Modeling (MLM) generation task and generate the emotion for each utterance in the conversation based on the prompt-tuning of the Pre-trained Language Model (PLM), where a novel mixed prompt template and a label mapping strategy are introduced for better context and emotion feature modeling. Moreover, Supervised Contrastive Learning (SCL) is employed to help the PLM mine more information from the labels and learn a more discriminative representation space for utterances with different emotions. We conduct extensive experiments and the results demonstrate that CEPT outperforms the state-of-the-art methods on all three benchmark datasets and excels in recognizing minority emotions.

CE-VDG: Counterfactual Entropy-based Bias Reduction for Video-grounded Dialogue Generation
Hongcheng Liu | Pingjie Wang | Zhiyuan Zhu | Yanfeng Wang | Yu Wang

The Video-Grounded Dialogue generation (VDG) is a challenging task requiring a comprehensive understanding of the multi-modal information to produce a pertinent response. However, VDG models may rely on dataset bias as a shortcut and fail to learn the multi-modal knowledge from both video and audio. Counterfactual reasoning is an effective method that can estimate and eliminate bias on some special aspects of classification tasks. However, conventional counterfactual reasoning cannot be applied to VDG tasks directly due to the BPE algorithm. In this paper, we reformulate the counterfactual reasoning from the information entropy perspective and extend it from the classification task to the generative task, which can effectively reduce the question-related bias in the auto-regressive generation task. We design CE-VDG to demonstrate the effectiveness in bias elimination of the reformulated counterfactual reasoning by using the proposed counterfactual entropy as an external loss. Extensive experiment results on two popular VDG datasets show the superiority of CE-VDG over the existing baseline method, demonstrating the effective debiasing capability in our model considering counterfactual entropy.

ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting
Xiaoxue Cheng | Junyi Li | Wayne Xin Zhao | Ji-Rong Wen

Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs), establishing itself as a primary approach to solving complex reasoning tasks. Existing CoT synthesis approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts. In response to this challenge, we present an empirical investigation of CoT prompting and introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts. CoTGenius is developed based on three major evolution strategies (complicate, diversify, and specify) alongside two filtering mechanisms: evolutionary success judgement and correctness verification. We further employ CoTGenius to create an extensive CoT dataset, and subsequently fine-tune the Llama 2-Chat 7B and 13B models on this dataset. We call the resulting models ChainLM. To deal with the cumulative error issue in reasoning steps, we propose a step-level debating method, wherein multiple debaters discuss each reasoning step to arrive at the correct answer. Extensive experiments demonstrate that our ChainLM models exhibit enhanced proficiency in addressing a spectrum of complex reasoning problems compared to existing models. In addition, we conduct an in-depth analysis of the impact of data categories within CoTGenius on the model performance. We release our dataset and code at https://github.com/RUCAIBox/ChainLM.

ChainNet: Structured Metaphor and Metonymy in WordNet
Rowan Hall Maudslay | Simone Teufel | Francis Bond | James Pustejovsky

The senses of a word exhibit rich internal structure. In a typical lexicon, this structure is overlooked: A word’s senses are encoded as a list, without inter-sense relations. We present ChainNet, a lexical resource which for the first time explicitly identifies these structures, by expressing how senses in the Open English Wordnet are derived from one another. In ChainNet, every nominal sense of a word is either connected to another sense by metaphor or metonymy, or is disconnected (in the case of homonymy). Because WordNet senses are linked to resources which capture information about their meaning, ChainNet represents the first dataset of grounded metaphor and metonymy.

Challenges in Pre-Training Graph Neural Networks for Context-Based Fake News Detection: An Evaluation of Current Strategies and Resource Limitations
Gregor Donabauer | Udo Kruschwitz

Pre-training of neural networks has recently revolutionized the field of Natural Language Processing (NLP) and had earlier demonstrated its effectiveness in computer vision. At the same time, advances around the detection of fake news were mainly driven by the context-based paradigm, where different types of signals (e.g. from social media) form graph-like structures that hold contextual information apart from the news article to classify. We propose to merge these two developments by applying pre-training of Graph Neural Networks (GNNs) in the domain of context-based fake news detection. Our experiments provide an evaluation of different pre-training strategies for graph-based misinformation detection and demonstrate that transfer learning does not currently lead to significant improvements over training a model from scratch in the domain. We argue that a major current issue is the lack of suitable large-scale resources that can be used for pre-training.

Challenging Negative Gender Stereotypes: A Study on the Effectiveness of Automated Counter-Stereotypes
Isar Nejadgholi | Kathleen C. Fraser | Anna Kerkhof | Svetlana Kiritchenko

Gender stereotypes are pervasive beliefs about individuals based on their gender that play a significant role in shaping societal attitudes, behaviours, and even opportunities. Recognizing the negative implications of gender stereotypes, particularly in online communications, this study investigates eleven strategies to automatically counteract and challenge these views. We present AI-generated gender-based counter-stereotypes to (self-identified) male and female study participants and ask them to assess their offensiveness, plausibility, and potential effectiveness. The strategies of counter-facts and broadening universals (i.e., stating that anyone can have a trait regardless of group membership) emerged as the most robust approaches, while humour, perspective-taking, counter-examples, and empathy for the speaker were perceived as less effective. Also, the differences in ratings were more pronounced for stereotypes about the different targets than between the genders of the raters. Alarmingly, many AI-generated counter-stereotypes were perceived as offensive and/or implausible. Our analysis and the collected dataset offer foundational insight into counter-stereotype generation, guiding future efforts to develop strategies that effectively challenge gender stereotypes in online interactions.

Characteristic AI Agents via Large Language Models
Xi Wang | Hongliang Dai | Shen Gao | Piji Li

The advancement of Large Language Models (LLMs) has led to significant enhancements in the performance of chatbot systems. Many researchers have dedicated their efforts to giving chatbots distinct characteristics. While there have been commercial products for developing role-driven chatbots using LLMs, it is worth noting that academic research in this area remains relatively scarce. Our research focuses on investigating the performance of LLMs in constructing Characteristic AI Agents by simulating real-life individuals across different settings. Current investigations have primarily focused on acting out roles with simple profiles. In response to this research gap, we create a benchmark for the characteristic AI agents task, including dataset, techniques, and evaluation metrics. A dataset called “Character100” is built for this benchmark, comprising the most-visited people on Wikipedia for language models to role-play. With the constructed dataset, we conduct a comprehensive assessment of LLMs across various settings. In addition, we devise a set of automatic metrics for quantitative performance evaluation. The experimental results underscore the potential directions for further improvement in the capabilities of LLMs in constructing characteristic AI agents. The benchmark is available at https://github.com/nuaa-nlp/Character100.

Character-level Language Models for Abbreviation and Long-form Detection
Leonardo Zilio | Shenbin Qian | Diptesh Kanojia | Constantin Orasan

Abbreviations and their associated long forms are important textual elements that are present in almost every scientific communication, and having information about these forms can help improve several NLP tasks. In this paper, our aim is to fine-tune language models for automatically identifying abbreviations and long forms. We used existing datasets which are annotated with abbreviations and long forms to train and test several language models, including transformer models, character-level language models, stacking of different embeddings, and ensemble methods. Our experiments showed that it was possible to achieve state-of-the-art results by stacking RoBERTa embeddings with domain-specific embeddings. However, the analysis of our first run showed that one of the datasets had issues in the BIO annotation, which led us to propose a revised dataset. After re-training selected models on the revised dataset, results show that character-level models achieve comparable results, especially when detecting abbreviations, but both RoBERTa large and the stacking of embeddings presented better results on biomedical data. When tested on a different subdomain (segments extracted from computer science texts), an ensemble method proved to yield the best results for the detection of long forms, and a character-level model had the best performance in detecting abbreviations.

We present Charles Translator, a machine translation system between Ukrainian and Czech, developed as part of a society-wide effort to mitigate the impact of the Russian-Ukrainian war on individuals and society. The system was developed in the spring of 2022 with the help of many language data providers in order to quickly meet the demand for such a service, which was not available at the time in the required quality. The translator was later implemented as an online web interface and as an Android app with speech input, both featuring Cyrillic-Latin script transliteration. The system translates directly, in contrast to other available systems that use English as a pivot, and thus takes advantage of the typological similarity of the two languages. It uses the block back-translation method, which allows for efficient use of monolingual training data. The paper describes the development process, including data collection, implementation, and evaluation, mentions several use cases, and outlines possibilities for further development of the system for educational purposes.

Charting the Linguistic Landscape of Developing Writers: An Annotation Scheme for Enhancing Native Language Proficiency
Miguel Da Corte | Jorge Baptista

This study describes a pilot annotation task designed to capture orthographic, grammatical, lexical, semantic, and discursive patterns exhibited by college native English speakers participating in developmental education (DevEd) courses. The paper introduces an annotation scheme developed by two linguists aiming at pinpointing linguistic challenges that hinder effective written communication. The scheme builds upon patterns supported by the literature, which are known as predictors of student placement in DevEd courses and English proficiency levels. Other novel, multilayered, linguistic aspects that the literature has not yet explored are also presented. The scheme and its primary categories are succinctly presented and justified. Two trained annotators used this scheme to annotate a sample of 103 text units (3 during the training phase and 100 during the annotation task proper). Texts were randomly selected from a population of 290 community college intending students. An in-depth quality assurance inspection was conducted to assess tagging consistency between annotators and to discern (and address) annotation inaccuracies. Krippendorff’s Alpha (K-alpha) interrater reliability coefficients were calculated, revealing a K-alpha score of k=0.40, which corresponds to a moderate level of agreement, deemed adequate for the complexity and length of the annotation task.
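
To make the reliability coefficient reported in this abstract concrete, the following sketch computes Krippendorff's alpha for nominal labels from a coincidence matrix, as alpha = 1 - D_o / D_e (observed over expected disagreement). It is an illustration only: the abstract does not state which distance metric or tooling the authors used, and the function name `krippendorff_alpha_nominal` and the input layout (one tuple of annotator labels per text unit, `None` for a missing label) are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data (hypothetical helper).

    units: iterable of label tuples, one tuple per annotated unit,
           one label per annotator; None marks a missing annotation.
    """
    coincidences = Counter()
    for values in units:
        values = [v for v in values if v is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two labels carry no pairing information
        # Each ordered pair of labels from different annotators adds 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Nominal metric: any two different labels count as full disagreement.
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

With two annotators who agree on two of four units, for instance `[("a", "a"), ("b", "b"), ("a", "b"), ("b", "a")]`, the observed disagreement is 0.5 and the expected disagreement is 32/56, giving an alpha of 0.125.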

ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization
Mengsha Liu | Daoyuan Chen | Yaliang Li | Guian Fang | Ying Shen

Data visualization serves as a critical means for presenting data and mining its valuable insights. The task of chart summarization, through natural language processing techniques, facilitates in-depth data analysis of charts. However, existing approaches still show notable deficiencies in visual-language matching and reasoning ability. To address these limitations, this study constructs a large-scale dataset of comprehensive chart-caption pairs and fine-tuning instructions on each chart. Thanks to the broad coverage of various topics and visual styles within this dataset, a better matching degree can be achieved from the view of training data. Moreover, we propose an innovative chart summarization method, ChartThinker, which synthesizes deep analysis based on chains of thought and strategies of context retrieval, aiming to improve the logical coherence and accuracy of the generated summaries. Built upon the curated datasets, our trained model consistently exhibits superior performance in chart summarization tasks, surpassing 8 state-of-the-art models over 7 evaluation metrics. Our dataset and code are publicly accessible.

ChatASU: Evoking LLM’s Reflexion to Truly Understand Aspect Sentiment in Dialogues
Yiding Liu | Jingjing Wang | Jiamin Luo | Tao Zeng | Guodong Zhou

Aspect Sentiment Understanding (ASU) in interactive scenarios (e.g., Question-Answering and Dialogue) has attracted ever-more interest in recent years and achieved important progress. However, existing studies on interactive ASU largely ignore the coreference issue for opinion targets (i.e., aspects), although this phenomenon is ubiquitous in interactive scenarios, especially in dialogues, which limits ASU performance. Recently, large language models (LLMs) have shown a powerful ability to integrate various NLP tasks with the chat paradigm. Accordingly, this paper proposes a new Chat-based Aspect Sentiment Understanding (ChatASU) task, aiming to explore LLMs’ ability in understanding aspect sentiments in dialogue scenarios. Particularly, this ChatASU task introduces a sub-task, i.e., Aspect Chain Reasoning (ACR) task, to address the aspect coreference issue. On this basis, we propose a Trusted Self-reflexion Approach (TSA) with ChatGLM as backbone to ChatASU. Specifically, this TSA treats the ACR task as an auxiliary task to boost the performance of the primary ASU task, and further integrates trusted learning into reflexion mechanisms to alleviate the LLMs-intrinsic factual hallucination problem in TSA. Furthermore, a high-quality ChatASU dataset is annotated to evaluate TSA, and extensive experiments show that our proposed TSA can significantly outperform several state-of-the-art baselines, justifying the effectiveness of TSA to ChatASU and the importance of considering the coreference and hallucination issues in ChatASU.

ChatEL: Entity Linking with Chatbots
Yifan Ding | Qingkai Zeng | Tim Weninger

Entity Linking (EL) is an essential and challenging task in natural language processing that seeks to link some text representing an entity within a document or sentence with its corresponding entry in a dictionary or knowledge base. Most existing approaches focus on creating elaborate contextual models that look for clues in the words surrounding the entity-text to help solve the linking problem. Although these fine-tuned language models tend to work, they can be unwieldy, difficult to train, and do not transfer well to other domains. Fortunately, Large Language Models (LLMs) like GPT provide a highly-advanced solution to the problems inherent in EL models, but naive prompts to LLMs do not work well. In the present work, we define ChatEL, which is a three-step framework to prompt LLMs to return accurate results. Overall the ChatEL framework improves the average F1 performance across 10 datasets by more than 2%. Finally, a thorough error analysis shows that in many instances the ground truth labels were actually incorrect, and the labels predicted by ChatEL were actually correct. This indicates that the quantitative results presented in this paper may be a conservative estimate of the actual performance. All data and code are available as an open-source package on GitHub at https://github.com/yifding/In_Context_EL.

Large language models (LLMs) have made significant progress in NLP. However, their ability to memorize, represent, and leverage commonsense knowledge has been a well-known pain point. In this paper, we specifically focus on ChatGPT, a widely used and easily accessible LLM, and ask the following questions: (1) Can ChatGPT effectively answer commonsense questions? (2) Is ChatGPT aware of the underlying commonsense knowledge for answering a specific question? (3) Is ChatGPT knowledgeable in commonsense? (4) Can ChatGPT effectively leverage commonsense for answering questions? We conduct a series of experiments on 11 datasets to evaluate ChatGPT’s commonsense abilities, including answering commonsense questions, identifying necessary knowledge, generating knowledge descriptions, and using knowledge descriptions to answer questions again. Experimental results show that: (1) ChatGPT can achieve good QA accuracies in commonsense tasks, while still struggling with certain domains of datasets. (2) ChatGPT is knowledgeable, and can accurately generate most of the commonsense knowledge using knowledge prompts. (3) Despite its knowledge, ChatGPT is an inexperienced commonsense problem solver, which cannot precisely identify the needed commonsense for answering a specific question. These findings raise the need to explore improved mechanisms for effectively incorporating commonsense into LLMs like ChatGPT, such as better instruction following and commonsense guidance.

ChatGPT Rates Natural Language Explanation Quality like Humans: But on Which Scales?
Fan Huang | Haewoon Kwak | Kunwoo Park | Jisun An

As AI becomes more integral in our lives, the need for transparency and responsibility grows. While natural language explanations (NLEs) are vital for clarifying the reasoning behind AI decisions, evaluating them through human judgments is complex and resource-intensive due to subjectivity and the need for fine-grained ratings. This study explores the alignment between ChatGPT and human assessments across multiple scales (i.e., binary, ternary, and 7-point Likert scale). We sample 300 data instances from three NLE datasets and collect 900 human annotations for both informativeness and clarity scores as the text quality measurement. We further conduct paired comparison experiments under different ranges of subjectivity scores, where the baseline comes from 8,346 human annotations. Our results show that ChatGPT aligns better with humans in more coarse-grained scales. Also, paired comparisons and dynamic prompting (i.e., providing semantically similar examples in the prompt) improve the alignment. This research advances our understanding of large language models’ capabilities to assess the text explanation quality in different configurations for responsible AI development.

ChatGPT Role-play Dataset: Analysis of User Motives and Model Naturalness
Yufei Tao | Ameeta Agrawal | Judit Dombi | Tetyana Sydorenko | Jung In Lee

Recent advances in interactive large language models like ChatGPT have revolutionized various domains; however, their behavior in natural and role-play conversation settings remains underexplored. In our study, we address this gap by investigating in depth how ChatGPT behaves during conversations, analyzing its interactions in both a normal setting and a role-play setting. We introduce a novel dataset of a broad range of human-AI conversations annotated with user motives and model naturalness to examine (i) how humans engage with the conversational AI model, and (ii) how natural the AI model’s responses are. Our study highlights the diversity of user motives when interacting with ChatGPT and variable AI naturalness, showing not only the nuanced dynamics of natural conversations between humans and AI, but also providing new avenues for improving the effectiveness of human-AI communication.

ChatUIE: Exploring Chat-based Unified Information Extraction Using Large Language Models
Jun Xu | Mengshu Sun | Zhiqiang Zhang | Jun Zhou

Recent advancements in large language models have shown impressive performance in general chat. However, their domain-specific capabilities, particularly in information extraction, have certain limitations. Extracting structured information from natural language that deviates from known schemas or instructions has proven challenging for previous prompt-based methods. This motivated us to explore domain-specific modeling in chat-based language models as a solution for extracting structured information from natural language. In this paper, we present ChatUIE, an innovative unified information extraction framework built upon ChatGLM. Simultaneously, reinforcement learning is employed to improve and align various tasks that involve confusing and limited samples. Furthermore, we integrate generation constraints to address the issue of generating elements that are not present in the input. Our experimental results demonstrate that ChatUIE can significantly improve the performance of information extraction with a slight decrease in chatting ability.

Existing studies of naturally occurring language-in-interaction have largely focused on the two ends of the developmental spectrum, i.e., early childhood and adulthood, leaving a gap in our knowledge about how development unfolds, especially across middle childhood. The current work contributes to filling this gap by introducing CHICA (for Child Interpersonal Communication Analysis), a developmental corpus of child-caregiver conversations at home, involving groups of French-speaking children aged 7, 9, and 11 years old. Each dyad was recorded twice: once in a face-to-face setting and once using computer-mediated video calls. For the face-to-face settings, we capitalized on recent advances in mobile, lightweight eye-tracking and head motion detection technology to optimize the naturalness of the recordings, allowing us to obtain both precise and ecologically valid data. Further, we mitigated the challenges of manual annotation by relying – to the extent possible – on automatic tools in speech processing and computer vision. Finally, to demonstrate the richness of this corpus for the study of child communicative development, we provide preliminary analyses comparing several measures of child-caregiver conversational dynamics across developmental age, modality, and communicative medium. We hope the current corpus will allow new discoveries into the properties and mechanisms of multimodal communicative development across middle childhood.

Chinese Morpheme-informed Evaluation of Large Language Models
Yaqi Yin | Yue Wang | Yang Liu

Previous evaluations of large language models (LLMs) focused on the perspective of various tasks or abilities. In this paper, we propose to evaluate from a linguistic viewpoint and argue that the morpheme, a potential linguistic feature that captures both word-formation and lexical semantics, is another suitable component for evaluation that remains largely unexplored. In light of this, we construct MorphEval, a morpheme-informed benchmark, including three datasets following the bottom-up levels of characters, words, and sentences in Chinese, and then evaluate representative LLMs with both zero- and few-shot settings under two metrics. From this perspective, we reveal three kinds of issues that current LLMs encounter: dysfunctions in morphology and syntax, challenges with the long-tailed distribution of semantics, and difficulties from cultural implications. In these scenarios, even a smaller Chinese-targeted model may outperform ChatGPT, highlighting the actual challenges LLMs face and the necessity of language-specific improvements when applied to non-English languages. This new approach could also help guide model enhancements and be extended to other languages.

Chinese sequence labeling tasks are sensitive to word boundaries. Although pretrained language models (PLMs) have achieved considerable success in these tasks, current PLMs rarely consider boundary information explicitly. An exception to this is BABERT, which incorporates unsupervised statistical boundary information into Chinese BERT’s pre-training objectives. Building upon this approach, we input supervised high-quality boundary information to enhance BABERT’s learning, developing a semi-supervised boundary-aware PLM. To assess PLMs’ ability to encode boundaries, we introduce a novel “Boundary Information Metric” that is both simple and effective. This metric allows comparison of different PLMs without task-specific fine-tuning. Experimental results on Chinese sequence labeling datasets demonstrate that the improved BABERT version outperforms the vanilla version, not only in these tasks but also in broader Chinese natural language understanding tasks. Additionally, our proposed metric offers a convenient and accurate means of evaluating PLMs’ boundary awareness.

CHisIEC: An Information Extraction Corpus for Ancient Chinese History
Xuemei Tang | Qi Su | Jun Wang | Zekun Deng

Natural Language Processing (NLP) plays a pivotal role in the realm of Digital Humanities (DH) and serves as the cornerstone for advancing the structural analysis of historical and cultural heritage texts. This is particularly true for the domains of named entity recognition (NER) and relation extraction (RE). In our commitment to expediting the study of ancient history and culture, we present the “Chinese Historical Information Extraction Corpus” (CHisIEC). CHisIEC is a meticulously curated dataset designed to develop and evaluate NER and RE tasks, offering a resource to facilitate research in the field. Encompassing data from 13 dynasties over a span of more than 1830 years, CHisIEC epitomizes the extensive temporal range and text heterogeneity inherent in Chinese historical documents. The dataset encompasses four distinct entity types and twelve relation types, resulting in a meticulously labeled dataset comprising 14,194 entities and 8,609 relations. To establish the robustness and versatility of our dataset, we have undertaken comprehensive experimentation involving models of various sizes and paradigms. Additionally, we have evaluated the capabilities of Large Language Models (LLMs) in the context of tasks related to ancient Chinese history. The dataset and code are available at https://github.com/tangxuemei1995/CHisIEC.

Chitchat as Interference: Adding User Backstories to Task-Oriented Dialogues
Armand Stricker | Patrick Paroubek

During task-oriented dialogues (TODs), human users naturally introduce chitchat that is beyond the immediate scope of the task, interfering with the flow of the conversation. To address this issue without the need for expensive manual data creation, we use few-shot prompting with Llama-2-70B to enhance the MultiWOZ dataset with user backstories, a typical example of chitchat interference in TODs. We assess the impact of this addition by testing two models: one trained solely on TODs and another trained on TODs with a preliminary chitchat interaction. Our analysis demonstrates that our enhanced dataset poses a challenge for these systems. Moreover, we demonstrate that our dataset can be effectively used for training purposes, enabling a system to consistently acknowledge the user’s backstory while also successfully moving the task forward in the same turn, as confirmed by human evaluation. These findings highlight the benefits of generating novel chitchat-TOD scenarios to test TOD systems more thoroughly and improve their resilience to natural user interferences.

Choice-75: A Dataset on Decision Branching in Script Learning
Zhaoyi Hou | Li Zhang | Chris Callison-Burch

Script learning studies how daily events unfold. It enables machines to reason about narratives with implicit information. Previous works mainly consider a script as a linear sequence of events while ignoring the potential branches that arise due to people’s circumstantial choices. We hence propose Choice-75, the first benchmark that challenges intelligent systems to make decisions given descriptive scenarios, containing 75 scripts and more than 600 scenarios. We also present preliminary results with current large language models (LLMs). Although they demonstrate decent overall performance, there is still notable headroom in hard scenarios.

C-Journal: A Journaling Application for Detecting and Classifying Cognitive Distortions Using Deep-Learning Based on a Crowd-sourced Dataset
Nada Elsharawi | Alia El Bolock

Cognitive distortions are negatively biased thinking patterns and erroneous self-statements resulting from and leading to logical errors in one’s own internal reasoning. Cognitive distortions have an adverse effect on mental health and can lead to mental health disorders in extreme cases. This paper is part of a larger project that aims to provide an application for detecting and classifying cognitive distortions in texts. As no public datasets were available for the task, the first contribution of the proposed work lies in providing an open-source labeled dataset of 14 cognitive distortions consisting of 34,370 entries collected via crowd-sourcing, user questionnaires, and re-purposing an emotions dataset from social media. The dataset was collected in cooperation with a licensed psychologist. We implemented a baseline model using Naïve Bayes and Count Vectorizer, as well as different CNN, LSTM, and DNN classifiers, to classify cognitive distortions based on the dataset. We investigated the usage of different word embeddings with the best-performing models. The best-performing model relied on a CNN with pre-trained Sentence-BERT embeddings and achieved an F1-score of 84% for classifying cognitive distortions. The best-performing model was built into C-Journal, a free journaling and mood-tracking mobile application that pinpoints potential thinking distortions to the users.

Claim-Centric and Sentiment Guided Graph Attention Network for Rumour Detection
Sajad Ramezani | Mauzama Firdaus | Lili Mou

Automatic rumour detection has gained attention due to the influence of social media on individuals and its pervasiveness. In this work, we construct a representation that takes into account the claim in the source tweet, considering both the propagation graph and the accompanying text alongside tweet sentiment. This is achieved through the implementation of a hierarchical attention mechanism, which not only captures the embedding of documents from individual word vectors but also combines these document representations as nodes within the propagation graph. Furthermore, to address potential overfitting concerns, we employ generative models to augment the existing datasets. This involves rephrasing the claims initially made in the source tweet, thereby creating a more diverse and robust dataset. In addition, we augment the dataset with sentiment labels to improve the performance of the rumour detection task. This holistic and refined approach yields a significant enhancement in the performance of our model across three distinct datasets designed for rumour detection. Quantitative and qualitative analysis proves the effectiveness of our methodology, surpassing the achievements of prior methodologies.

Clarifying Completions: Evaluating How LLMs Respond to Incomplete Questions
Angus Addlesee | Oliver Lemon | Arash Eshghi

People understand and produce language incrementally on a word-by-word basis. This gives rise to many characteristic conversational phenomena including long mid-sentence pauses that are followed by incremental clarification requests (iCRs) intended to recover the rest of the truncated turn (see Fig. 1; (A), (B), (C)). The ability to generate iCRs is important in natural conversational AI systems, and crucial to their accessibility to users with memory impairment. In this paper, we collect, release and analyse SLUICE-CR: a large corpus of 3000 human-produced iCRs. We then use this corpus to probe the incremental processing capability of a number of state-of-the-art LLMs by evaluating the quality of the models’ generated iCRs in response to incomplete questions. Our evaluations show that the ability to generate contextually appropriate iCRs only emerges at larger LLM sizes, and only when prompted with example iCRs from our corpus. They also indicate that autoregressive LMs are, in principle, able to both understand and generate language incrementally.

Classifying Social Media Users before and after Depression Diagnosis via Their Language Usage: A Dataset and Study
Falwah Alhamed | Julia Ive | Lucia Specia

Mental illness can significantly impact individuals’ quality of life. Analysing social media data to uncover potential mental health issues in individuals via their posts is a popular research direction. However, most studies focus on the classification of users suffering from depression versus healthy users, or on the detection of suicidal thoughts. In this paper, we instead aim to understand and model linguistic changes that occur when users transition from a healthy to an unhealthy state. Addressing this gap could lead to better approaches for earlier depression detection when signs are not as obvious as in cases of severe depression or suicidal ideation. In order to achieve this goal, we have collected the first dataset of textual posts by the same users before and after reportedly being diagnosed with depression. We then use this data to build multiple predictive models (based on SVM, Random Forests, BERT, RoBERTa, MentalBERT, GPT-3, GPT-3.5, Bard, and Alpaca) for the task of classifying user posts. Transformer-based models achieved the best performance, while large language models used off-the-shelf proved less effective as they produced random guesses (GPT and Bard) or hallucinations (Alpaca).

Class-Incremental Few-Shot Event Detection
Kailin Zhao | Xiaolong Jin | Long Bai | Jiafeng Guo | Xueqi Cheng

Event detection is one of the fundamental tasks in information extraction and knowledge graph construction. However, a realistic event detection system often needs to constantly deal with new event classes. These new classes usually have only a few labeled instances, as it is time-consuming and labor-intensive to annotate a large number of unlabeled instances. Therefore, this paper proposes a new task, called class-incremental few-shot event detection. This task raises two problems: old knowledge forgetting and new class overfitting. To solve these problems, this paper further presents a novel knowledge distillation and prompt learning based method, called Prompt-KD. Specifically, to reduce the forgetting of old knowledge, Prompt-KD develops an attention based multi-teacher knowledge distillation framework, where the ancestor teacher model pre-trained on base classes is reused in all learning sessions, and the father teacher model derives the current student model via adaptation. On the other hand, in order to cope with the few-shot learning scenario and alleviate the corresponding new class overfitting problem, Prompt-KD is also equipped with a prompt learning mechanism. Extensive experiments on two benchmark datasets, i.e., FewEvent and MAVEN, demonstrate the state-of-the-art performance of Prompt-KD.

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation
Nikola Ljubešić | Taja Kuzman

This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. The comparability of the corpora is ensured by a comparable crawling setup and the usage of identical crawling and post-processing technology. All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline, and enriched with document-level genre information via the Transformer-based multilingual X-GENRE classifier, which further enhances comparability at the level of linguistic annotation and metadata enrichment. The genre-focused analysis of the resulting corpora shows a rather consistent distribution of genres throughout the seven corpora, with variations in the most prominent genre categories being well-explained by the economic strength of each language community. A comparison of the distribution of genre categories across the corpora indicates that web corpora from less developed countries primarily consist of news articles. Conversely, web corpora from economically more developed countries exhibit a smaller proportion of news content, with a greater presence of promotional and opinionated texts.

CLAUSE-ATLAS: A Corpus of Narrative Information to Scale up Computational Literary Analysis
Enrica Troiano | Piek T.J.M. Vossen

We introduce CLAUSE-ATLAS, a resource of 19th- and 20th-century English novels annotated automatically. This corpus, which contains 41,715 labeled clauses, makes it possible to study stories as sequences of eventive, subjective and contextual information. We use it to investigate whether recent large language models, in particular gpt-3.5-turbo with 16k tokens of context, constitute promising tools to annotate large amounts of data for literary studies (we show that this is the case). Moreover, by analyzing the annotations so collected, we find that our clause-based approach to literature captures structural patterns within books, as well as qualitative differences between them.

CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments
Savitha Sam Abraham | Marjan Alirezaie | Luc de Raedt

The integration of learning and reasoning is high on the research agenda in AI. Nevertheless, little attention has been paid to using existing background knowledge for reasoning about partially observed scenes to answer questions about the scene. Yet, we as humans use such knowledge frequently to infer plausible answers to visual questions (by eliminating all inconsistent ones). Such knowledge often comes in the form of constraints about objects and it tends to be highly domain or environment specific. We contribute a novel benchmark called CLEVR-POC for reasoning-intensive visual question answering (VQA) in partially observable environments under constraints. In CLEVR-POC, knowledge in the form of logical constraints needs to be leveraged in order to generate plausible answers to questions about a hidden object in a given partial scene. For instance, if one has the knowledge that all cups are colored either red, green or blue and that there is only one green cup, it becomes possible to deduce the color of an occluded cup as either red or blue, provided that all other cups, including the green one, are observed. Through experiments we observe that the performance of pre-trained vision language models like CLIP (approx. 22%) and a large language model (LLM) like GPT-4 (approx. 46%) on CLEVR-POC is not satisfactory, underscoring the need for frameworks that can handle reasoning-intensive tasks where environment-specific background knowledge is available and crucial. Furthermore, our demonstration illustrates that a neuro-symbolic model, which integrates an LLM like GPT-4 with a visual perception network and a formal logical reasoner, exhibits exceptional performance on CLEVR-POC.
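
The cup example in this abstract can be sketched as elimination over candidate answers. This is only an illustration of the reasoning pattern, not the benchmark's actual constraint language or solver; the function name `plausible_colors` and the exact-count encoding of "only one green cup" are assumptions made for the sketch.

```python
def plausible_colors(all_colors, observed_colors, exact_count):
    """Return the colors the hidden cup may have.

    exact_count maps a color to the exact number of cups of that color
    in the full scene (hypothetical encoding of the constraint).
    """
    plausible = []
    for candidate in all_colors:
        # Complete the scene with the candidate and test every constraint.
        scene = observed_colors + [candidate]
        if all(scene.count(color) == n for color, n in exact_count.items()):
            plausible.append(candidate)
    return plausible


colors = ["red", "green", "blue"]
constraints = {"green": 1}  # "there is only one green cup"

# The green cup is among the observed ones, so the hidden cup is red or blue.
print(plausible_colors(colors, ["red", "green", "blue"], constraints))
```

If the green cup were not among the observed cups, the same check would leave green as the only consistent answer.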

CLFFRD: Curriculum Learning and Fine-grained Fusion for Multimodal Rumor Detection
Fan Xu | Lei Zeng | Bowei Zou | Ai Ti Aw | Huan Rong

In an era where rumors can propagate rapidly across social media platforms such as Twitter and Weibo, automatic rumor detection has garnered considerable attention from both academia and industry. Existing multimodal rumor detection models often overlook the intricacies of sample difficulty, e.g., text-level difficulty, image-level difficulty, and multimodal-level difficulty, as well as their order when training. Inspired by the concept of curriculum learning, we propose the Curriculum Learning and Fine-grained Fusion-driven multimodal Rumor Detection (CLFFRD) framework, which employs curriculum learning to automatically select and train samples according to their difficulty at different training stages. Furthermore, we introduce a fine-grained fusion strategy that unifies entities from text and objects from images, enhancing their semantic cohesion. We also propose a novel data augmentation method that utilizes linear interpolation between textual and visual modalities to generate diverse data. Additionally, our approach incorporates deep fusion for both intra-modality (e.g., text entities and image objects) and inter-modality (e.g., CLIP and social graph) features. Extensive experimental results demonstrate that CLFFRD outperforms state-of-the-art models on both English and Chinese benchmark datasets for rumor detection in social media.

Reinforcement learning from human feedback (RLHF) is a crucial technique in aligning large language models (LLMs) with human preferences, ensuring these LLMs behave in beneficial and comprehensible ways to users. However, a longstanding challenge in human alignment techniques based on reinforcement learning lies in their inherent complexity and difficulty in training. To address this challenge, we present a simple yet effective Contrastive Learning Framework for Human Alignment (CLHA) to align LLMs with human preferences directly. CLHA employs a novel rescoring strategy to evaluate the noise within the data by considering its inherent quality and dynamically adjusting the training process. Simultaneously, CLHA utilizes pairwise contrastive loss and adaptive supervised fine-tuning loss to adaptively modify the likelihood of generating responses, ensuring enhanced alignment with human preferences. Using advanced methods, CLHA surpasses other algorithms, showcasing superior performance in terms of reward model scores, automatic evaluations, and human assessments on the widely used “Helpful and Harmless” dataset.

Despite the rapid development of large language models (LLMs) for the Korean language, there remains an obvious lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmark datasets are derived from their English counterparts through translation, they often overlook the different cultural contexts. The few benchmark datasets that are sourced from Korean data capturing cultural knowledge offer only narrow tasks such as hate speech detection. To address this gap, we introduce a benchmark of Cultural and Linguistic Intelligence in Korean (CLIcK), a dataset comprising 1,995 QA pairs. CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture. For each instance in CLIcK, we provide fine-grained annotation of which cultural and linguistic knowledge is required to correctly answer the question. Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performance across the categories, as well as the diverse factors affecting their comprehension. CLIcK offers the first large-scale comprehensive Korean-centric analysis of LLMs’ proficiency in Korean language and culture.

Crossword puzzles are popular linguistic games often used as tools to engage students in learning. Educational crosswords are characterized by less cryptic and more factual clues that distinguish them from traditional crossword puzzles. Although several publicly available clue-answer pair databases exist for traditional crosswords, educational clue-answer pair datasets are missing. In this article, we propose a methodology to build educational clue generation datasets that can be used to instruct Large Language Models (LLMs). By gathering informative content associated with relevant keywords from Wikipedia pages, we use Large Language Models to automatically generate pedagogical clues related to the given input keyword and its context. With such an approach, we created clue-instruct, a dataset containing 44,075 unique examples with text-keyword pairs associated with three distinct crossword clues. We used clue-instruct to instruct different LLMs to generate educational clues from a given input content and keyword. Both human and automatic evaluations confirmed the quality of the generated clues, thus validating the effectiveness of our approach.

Metaphors are a prominent linguistic device in human language and literature, as they add color, imagery, and emphasis to enhance effective communication. This paper introduces a large-scale, high-quality annotated Chinese Metaphor Corpus, which comprises around 28K sentences drawn from a diverse range of Chinese literary sources, such as poems, prose, song lyrics, etc. To ensure the accuracy and consistency of our annotations, we introduce a comprehensive set of guidelines. These guidelines address the facets of metaphor annotation, from identifying tenors, vehicles, and grounds to handling the complexities of similes, personifications, juxtapositions, and hyperboles. Breaking with tradition, our approach to metaphor generation emphasizes tenors and their distinct features rather than the conventional combination of tenors and vehicles. By integrating “ground” as a CoT (Chain of Thoughts) input, we are able to generate metaphors that resonate more with real-world intuition. We test generative models such as Belle, Baichuan, and Chinese-alpaca-33B using our annotated corpus. These models are able to generate creative and fluent metaphor sentences more frequently when induced by selected samples from our dataset, demonstrating the value of our corpus for Chinese metaphor research.

Extracting structured event knowledge, including event triggers and corresponding arguments, from military texts is fundamental to many applications, such as intelligence analysis and decision assistance. However, event extraction in the military field faces the data scarcity problem, which impedes the research of event extraction models in this domain. To alleviate this problem, we propose CMNEE, a large-scale, document-level open-source Chinese Military News Event Extraction dataset. It contains 17,000 documents and 29,223 events, which are all manually annotated based on a pre-defined schema for the military domain including 8 event types and 11 argument role types. We designed a two-stage, multi-turn annotation strategy to ensure the quality of CMNEE and reproduced several state-of-the-art event extraction models with a systematic evaluation. The experimental results on CMNEE fall noticeably short of those on datasets from other domains, which demonstrates that event extraction for the military domain poses unique challenges and requires further research efforts. Our code and data can be obtained from https://github.com/Mzzzhu/CMNEE. Keywords: Corpus, Information Extraction, Information Retrieval, Knowledge Discovery/Representation

CM-Off-Meme: Code-Mixed Hindi-English Offensive Meme Detection with Multi-Task Learning by Leveraging Contextual Knowledge
Gitanjali Kumari | Dibyanayan Bandyopadhyay | Asif Ekbal | Vinutha B. NarayanaMurthy

Detecting offensive content in internet memes is challenging as it needs additional contextual knowledge. While previous works have only focused on detecting offensive memes, classifying them further into implicit and explicit categories depending on their severity is still a challenging and underexplored area. In this work, we present an end-to-end multitask model for addressing this challenge by empirically investigating two correlated tasks simultaneously: (i) offensive meme detection and (ii) explicit-implicit offensive meme detection by leveraging two self-supervised pre-trained models. The first pre-trained model, referred to as the “knowledge encoder”, incorporates contextual knowledge of the meme. The second model, referred to as the “fine-grained information encoder”, is trained to understand the obscure psycho-linguistic information of the meme. Our proposed model utilizes contrastive learning to integrate these two pre-trained models, resulting in a more comprehensive understanding of the meme and its potential for offensiveness. To support our approach, we create a large-scale dataset, CM-Off-Meme, as no such dataset is publicly available for the code-mixed Hindi-English (Hinglish) domain. Empirical evaluation, including both qualitative and quantitative analysis, on the CM-Off-Meme dataset demonstrates the effectiveness of the proposed model in terms of cross-domain generalization.

Generative query rewrite generates reconstructed query rewrites using the conversation history but relies heavily on gold rewrite pairs that are expensive to obtain. Recently, few-shot learning has been gaining popularity for this task, but these methods are sensitive to the inherent noise due to limited data size. Besides, both approaches suffer performance degradation when there is a language style shift between training and testing cases. To this end, we study low-resource generative conversational query rewrite that is robust to both noise and language style shift. The core idea is to utilize massive unlabeled data to make further improvements via a contrastive co-training paradigm. Specifically, we co-train two dual models (namely Rewriter and Simplifier) such that each of them provides extra guidance through pseudo-labeling for enhancing the other in an iterative manner. We also leverage contrastive learning with data augmentation, which enables our model to pay more attention to the truly valuable information than to the noise. Extensive experiments demonstrate the superiority of our model under both few-shot and zero-shot scenarios. We also verify the better generalization ability of our model when encountering language style shift.

CoANZSE Audio: Creation of an Online Corpus for Linguistic and Phonetic Analysis of Australian and New Zealand Englishes
Steven Coats

CoANZSE Audio is a searchable online version of the Corpus of Australian and New Zealand Spoken English, a 195-million-word collection of geo-located YouTube transcripts of local government channels. In addition to the part-of-speech-tagged and lemmatized transcript data, CoANZSE Audio provides access to almost all of the underlying audio, as well as to forced alignments of the audio with transcript content, in Praat’s TextGrid format. This paper describes the methods used to create the corpus from open-source tools and the architecture of the CoANZSE Audio website. Two possible linguistic analyses based on CoANZSE Audio data are described: use of double modals, a rare syntactic feature, and raising of the mid front vowel /ɛ/ in New Zealand English. CoANZSE Audio can be considered to be among the first large, free, fully searchable online corpora containing data suitable for acoustic phonetic analyses in addition to lexical, grammatical, and discourse properties of Australian and New Zealand Englishes.

Coarse-Tuning for Ad-hoc Document Retrieval Using Pre-trained Language Models
Atsushi Keyaki | Ribeka Keyaki

Fine-tuning in information retrieval systems using pre-trained language models (PLM-based IR) requires learning query representations and query-document relations, in addition to downstream task-specific learning. This study introduces coarse-tuning as an intermediate learning stage that bridges pre-training and fine-tuning. By learning query representations and query-document relations in coarse-tuning, we aim to reduce the load of fine-tuning and improve the learning effect of downstream IR tasks. We propose Query-Document Pair Prediction (QDPP) for coarse-tuning, which predicts the appropriateness of query-document pairs. Evaluation experiments show that the proposed method significantly improves MRR and/or nDCG@5 in four ad-hoc document retrieval datasets. Furthermore, the results of the query prediction task suggested that coarse-tuning facilitated learning of query representation and query-document relations.
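
For readers unfamiliar with the two metrics named in this abstract, the following is a minimal sketch of how MRR and nDCG@5 are commonly computed. It is not the authors' evaluation code. The function names are hypothetical, the DCG uses linear gains (a `2^rel - 1` gain is another common variant), and the ideal ranking is taken over the retrieved list only, which is a simplifying assumption.

```python
import math


def mean_reciprocal_rank(ranked_labels):
    """ranked_labels: one list per query of 0/1 relevance labels in ranked order."""
    total = 0.0
    for labels in ranked_labels:
        for rank, rel in enumerate(labels, start=1):
            if rel > 0:
                total += 1.0 / rank  # reciprocal rank of the first relevant hit
                break
    return total / len(ranked_labels)


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k=5):
    """DCG normalised by the DCG of the same gains sorted in ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Two queries: first relevant document at rank 2 and at rank 1.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
print(ndcg_at_k([0, 1], k=5))
```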

CoBaLD Annotation: The Enrichment of the Enhanced Universal Dependencies with the Semantical Pattern
Maria Andreevna Petrova | Alexandra M. Ivoylova | Anastasia Tishchenkova

The paper is devoted to an annotation format aimed at morphological, syntactic and especially semantic markup. The format combines the Enhanced UD morphosyntax and the Compreno semantic pattern, enriching the UD annotation with word meanings and labels for semantic relations between words. To adapt the Compreno semantics for the current purpose, we reduced the number of the semantic fields denoting lexical meanings by using hyperonym fields. Moreover, we used a generalized variant of the semantic relations, as the original roles possess rather narrow meanings, which makes them too numerous. Creating such a format demands the Compreno-to-UD morphosyntax conversion as well, which, in turn, demands solving the asymmetry problem between the models. The asymmetry concerns tokenization, lemmatization, POS-tagging, sets of grammatical features and dependency heads. To overcome this problem, the Compreno-to-UD converter was created. As an application, the work presents a 150,000-token corpus of English news annotated according to the standard.

While pre-trained language models (LM) for code have achieved great success in code completion, they generate code conditioned only on the contents within the file, i.e., in-file context, but ignore the rich semantics in other files within the same project, i.e., project-level cross-file context, a critical source of information that is especially useful in modern modular software development. Such overlooking constrains code LMs’ capacity in code completion, leading to unexpected behaviors such as generating hallucinated class member functions or function calls with unexpected arguments. In this work, we propose CoCoMIC, a novel framework that jointly learns the in-file and cross-file context on top of code LMs. To empower CoCoMIC, we develop CCFinder, a static-analysis-based tool that locates and retrieves the most relevant project-level cross-file context for code completion. CoCoMIC successfully improves the existing code LM with a 33.94% relative increase in exact match and 28.69% in identifier matching for code completion when the cross-file context is provided. Finally, we perform a series of ablation studies and share valuable insights for future research on integrating cross-file context into code LMs.

Code Defect Detection Using Pre-trained Language Models with Encoder-Decoder via Line-Level Defect Localization
Jimin An | YunSeok Choi | Jee-Hyong Lee

Recently, code Pre-trained Language Models (PLMs), trained on large amounts of code and comments, have shown great success in code defect detection tasks. However, most PLMs simply treat the code as a single sequence and use only the encoder of PLMs to determine whether defects exist in the entire code. For a more analyzable and explainable approach, it is crucial to identify which lines contain defects. In this paper, we propose a novel method for code defect detection that integrates line-level defect localization into a unified training process. To identify code defects at the line level, we convert the code into a sequence separated by lines using a special token. Then, to exploit the fact that the encoder and decoder of PLMs process information differently, we leverage both the encoder and decoder for line-level defect localization. By learning code defect detection and line-level defect localization tasks in a unified manner, our proposed method promotes knowledge sharing between the two tasks. We demonstrate that our proposed method significantly improves performance on four benchmark datasets for code defect detection. Additionally, we show that our method can be easily integrated with ChatGPT.
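
The line-separation step described in this abstract might look roughly like the sketch below. The abstract does not name the special token or the label format, so `<line>`, `to_line_sequence`, and `line_labels` are hypothetical choices made for illustration; the sketch also drops blank lines and indentation, which a real pipeline may want to keep.

```python
LINE_SEP = "<line>"  # hypothetical special token; the paper's token is not given


def to_line_sequence(code, sep=LINE_SEP):
    """Flatten source code into one sequence with a marker before each line."""
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    return " ".join(f"{sep} {ln}" for ln in lines)


def line_labels(num_lines, defective_lines):
    """Per-line 0/1 targets; defective_lines holds 1-based line numbers."""
    return [1 if i in defective_lines else 0 for i in range(1, num_lines + 1)]


snippet = "a = 1\n\nb = a / 0\n"
print(to_line_sequence(snippet))
print(line_labels(2, {2}))
```

Each marker position can then carry a per-line prediction, while a separate output covers the whole snippet.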

Code-Mixed Probes Show How Pre-Trained Models Generalise on Code-Switched Text
Frances Adriana Laureano De Leon | Harish Tayyar Madabushi | Mark Lee

Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained Language Models handle code-switched text in three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on abilities of these models to generalise representations to CS corpora. We release all our code and data, including the novel corpus, at https://github.com/francesita/code-mixed-probes.

Code-Mixed Text Augmentation for Latvian ASR
Martins Kronis | Askars Salimbajevs | Mārcis Pinnis

Code-mixing has become mainstream in the modern, globalised world and affects low-resource languages, such as Latvian, in particular. Solutions to developing an automatic speech recognition (ASR) system for code-mixed speech often rely on specially created audio-text corpora, which are expensive and time-consuming to create. In this work, we attempt to tackle code-mixed Latvian-English speech recognition by improving the language model (LM) of a hybrid ASR system. We make a distinction between inflected transliterations and phonetic transcriptions as two different foreign word types. We propose an inflected transliteration model and a phonetic transcription model for the automatic generation of said word types. We then leverage a large human-translated English-Latvian parallel text corpus to generate synthetic code-mixed Latvian sentences by substituting in generated foreign words. Using the newly created augmented corpora, we train a new LM and combine it with our existing Latvian acoustic model (AM). For evaluation, we create a specialised foreign word test set on which our methods yield up to 15% relative CER improvement. We then further validate these results in a human evaluation campaign.
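
As a reading aid for the "relative CER improvement" reported in this abstract, here is a minimal sketch of character error rate and relative improvement under their usual definitions. It is not the authors' scoring script; the function names are hypothetical, and the error rates in the usage lines are made-up placeholders, not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


print(cer("abcd", "abxd"))
# Placeholder error rates, chosen only to illustrate the arithmetic.
print(relative_improvement(0.20, 0.17))
```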

Cognitive Information Bottleneck: Extracting Minimal Sufficient Cognitive Language Processing Signals
Yuto Harada | Yohei Oseki

In Reinforcement Learning from Human Feedback (RLHF), explicit human feedback, such as rankings, is employed to align Natural Language Processing (NLP) models with human preferences. In contrast, the potential of implicit human feedback, encompassing cognitive processing signals like eye-tracking and brain activity, remains underexplored. These signals capture unconscious human responses but are often marred by noise and redundancy, complicating their application to specific tasks. To address this issue, we introduce the Cognitive Information Bottleneck (CIB), a method that extracts only the task-relevant information from cognitive processing signals. Grounded in the principles of the information bottleneck, CIB aims to learn representations that maximize the mutual information between the representations and targets while minimizing the mutual information between inputs and representations. By employing CIB to filter out redundant information from cognitive processing signals, our goal is to provide representations that are both minimal and sufficient. This approach enables more efficient fitting of models to inputs. Our results show that the proposed method outperforms existing methods in efficiently compressing various cognitive processing signals and significantly enhances performance on downstream tasks. Evaluated on public datasets, our model surpasses contemporary state-of-the-art models. Furthermore, by analyzing these compressed representations, we offer insights into how cognitive processing signals can be leveraged to improve performance.
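The trade-off described in the abstract above matches the standard information bottleneck objective, sketched here for orientation. The symbols are assumptions introduced for illustration: X is the input cognitive signal, Y the task target, Z the learned representation, and β a trade-off weight. The exact objective optimised by CIB may differ.

```latex
\max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(X; Z), \qquad \beta > 0
```

The first term rewards representations that are sufficient for the target, while the second penalises information retained from the input, which pushes the representation toward being minimal.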

In order to construct or extend entity-centric and event-centric knowledge graphs (KG and EKG), an information extraction (IE) annotation toolkit is essential. However, existing IE toolkits have several non-trivial problems, such as not supporting multiple tasks and not supporting automatic updates. In this work, we present CollabKG, a learnable human-machine-cooperative IE toolkit for KG and EKG construction. Specifically, for the multi-task issue, CollabKG unifies different IE subtasks, including named entity recognition (NER), entity-relation triple extraction (RE), and event extraction (EE), and supports both KG and EKG. Then, combining advanced prompting-based IE technology, we present a human-machine-cooperation mechanism with Large Language Models (LLMs) as the assistant machine, which can provide lower cost as well as higher performance. Lastly, owing to the two-way interaction between the human and machine, CollabKG with learning ability allows self-renewal. Besides, CollabKG has several appealing features (e.g., customization, training-free operation, and label propagation) that make the system powerful and highly productive. We holistically compare our toolkit with other existing tools on these features. Human evaluation quantitatively illustrates that CollabKG significantly improves annotation quality, efficiency, and stability simultaneously.

Collecting and Analyzing Dialogues in a Tagline Co-Writing Task
Xulin Zhou | Takuma Ichikawa | Ryuichiro Higashinaka

The potential usage scenarios of dialogue systems will be greatly expanded if they are able to collaborate more creatively with humans. Many studies have examined ways of building such systems, but most of them focus on problem-solving dialogues, and relatively little research has been done on systems that can engage in creative collaboration with users. In this study, we designed a tagline co-writing task in which two people collaborate to create taglines via text chat, created an interface for data collection, and collected dialogue logs, editing logs, and questionnaire results. In total, we collected 782 Japanese dialogues. We describe the characteristic interactions comprising the tagline co-writing task and report the results of our analysis, in which we examined the kind of utterances that appear in the dialogues as well as the most frequent expressions found in highly rated dialogues in subjective evaluations. We also analyzed the relationship between subjective evaluations and workflow utilized in the dialogues and the interplay between taglines and utterances.

Collecting Human-Agent Dialogue Dataset with Frontal Brain Signal toward Capturing Unexpressed Sentiment
Shun Katada | Ryu Takeda | Kazunori Komatani

Multimodal information such as text and audiovisual data has been used for emotion/sentiment estimation during human-agent dialogue; however, user sentiments are not necessarily expressed explicitly during dialogues. Biosignals such as brain signals recorded using an electroencephalogram (EEG) sensor have been a focus in the field of affective computing for capturing unexpressed emotional changes in a controlled experimental environment. In this study, we collect and analyze multimodal data with an EEG during a human-agent dialogue toward capturing unexpressed sentiment. Our contributions are as follows: (1) A new multimodal human-agent dialogue dataset is created, which includes not only text and audiovisual data but also frontal EEGs and physiological signals during the dialogue. In total, about 500 minutes of chat dialogues were collected from thirty participants aged 20 to 70. (2) We present a novel method for removing eye-blink noise from frontal EEGs. This method applies facial landmark tracking to detect and delete eye-blink noise. (3) An experimental evaluation showed the effectiveness of the frontal EEGs: although they have only three channels, they improved sentiment estimation performance when combined with other modalities through multimodal fusion.

This paper reports on the experience of collecting a number of corpora of Nordic languages spoken by children. The aim of the data collection is to provide annotated data for developing and evaluating computer-assisted pronunciation assessment systems, both for non-native children learning a Nordic language (L2) and for L1 children with speech sound disorder (SSD). The paper presents the challenges encountered when recording and annotating data for Finnish, Swedish and Norwegian, as well as the ethical considerations related to making this data publicly available. We hope that sharing this experience will encourage others to collect similar data for other languages. Of the different data collections, we were able to make the Norwegian corpus publicly available in the hope that it will serve as a reference in pronunciation assessment research.

Combining Discourse Coherence with Large Language Models for More Inclusive, Equitable, and Robust Task-Oriented Dialogue
Katherine Atwell | Mert Inan | Anthony B. Sicilia | Malihe Alikhani

Large language models (LLMs) are capable of generating well-formed responses, but using LLMs to generate responses on the fly is not yet feasible for many task-oriented systems. Modular architectures are often still required for safety and privacy guarantees on the output. We hypothesize that an offline generation approach using discourse theories, formal grammar rules, and LLMs can allow us to generate human-like, coherent text in a more efficient, robust, and inclusive manner within a task-oriented setting. To this end, we present the first discourse-aware multimodal task-oriented dialogue system that combines discourse theories with offline LLM generation. We deploy our bot as an app to the general public and keep track of the user ratings for six months. Our user ratings show an improvement from 2.8 to 3.5 out of 5 with the introduction of discourse coherence theories. We also show that our model reduces misunderstandings in the dialect of African-American Vernacular English from 93% to 57%. While terms of use prevent us from releasing our entire codebase, we release our code in a format that can be integrated into most existing dialogue systems.

COMET for Low-Resource Machine Translation Evaluation: A Case Study of English-Maltese and Spanish-Basque
Júlia Falcão | Claudia Borg | Nora Aranberri | Kurt Abela

Trainable metrics for machine translation evaluation have been scoring the highest correlations with human judgements in the latest meta-evaluations, outperforming traditional lexical overlap metrics such as BLEU, which is still widely used despite its well-known shortcomings. In this work we look at COMET, a prominent neural evaluation system proposed in 2020, to analyze the extent of its language support restrictions, and to investigate strategies to extend this support to new, under-resourced languages. Our case study focuses on English-Maltese and Spanish-Basque. We run a crowd-based evaluation campaign to collect direct assessments and use the annotated dataset to evaluate COMET-22, further fine-tune it, and train COMET models from scratch for the two language pairs. Our analysis suggests that COMET’s performance can be improved with fine-tuning, and that COMET can be highly susceptible to the distribution of scores in the training data, which especially impacts low-resource scenarios.

COMICORDA: Dialogue Act Recognition in Comic Books
Jiri Martinek | Pavel Kral | Ladislav Lenc | Josef Baloun

Dialogue act (DA) recognition is usually realized from a speech signal that is transcribed and segmented into text. However, only a little work in DA recognition from images exists. Therefore, this paper concentrates on this modality and presents a novel DA recognition approach for image documents, namely comic books. To the best of our knowledge, this is the first study investigating dialogue acts from comic books and represents the first steps to building a model for comic book understanding. The proposed method is composed of the following steps: speech balloon segmentation, optical character recognition (OCR), and DA recognition itself. We use YOLOv8 for balloon segmentation, Google Vision for OCR, and Transformer-based models for DA classification. The experiments are performed on a newly created dataset comprising 1,438 annotated comic panels. It contains bounding boxes, transcriptions, and dialogue act annotation. We have achieved nearly 98% average precision for speech balloon segmentation and exceeded the accuracy of 70% for the DA recognition task. We also present an analysis of dialogue structure in the comics domain and compare it with the standard DA datasets, representing another contribution of this paper.

The Common European Language Data Space (LDS) is an integral part of the EU data strategy, which aims at developing a single market for data. Its decentralised technical infrastructure and governance scheme are currently being developed by the LDS project, which also has dedicated tasks for proof-of-concept prototypes, handling legal aspects, raising awareness and promoting the LDS through events and social media channels. The LDS is part of a broader vision for establishing all necessary components to develop European large language models.

Within Dialogue Modeling research in AI and NLP, considerable attention has been spent on “dialogue state tracking” (DST), which is the ability to update the representations of the speaker’s needs at each turn in the dialogue by taking into account the past dialogue moves and history. Less studied but just as important to dialogue modeling, however, is “common ground tracking” (CGT), which identifies the shared belief space held by all of the participants in a task-oriented dialogue: the task-relevant propositions all participants accept as true. In this paper we present a method for automatically identifying the current set of shared beliefs and “questions under discussion” (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth, establishing a benchmark in this novel, challenging task.

Comparative Analysis of Sign Language Interpreting Agents Perception: A Study of the Deaf
Alfarabi Imashev | Nurziya Oralbayeva | Gulmira Baizhanova | Anara Sandygulova

Prior research on sign language recognition has already demonstrated encouraging outcomes in achieving highly accurate and dependable automatic sign language recognition. The use of virtual characters as virtual assistants has significantly increased in the past decade. However, the progress in sign language generation and output that closely resembles physiologically believable human motions is still in its early stages. This assertion explains the lack of progress in virtual intelligent signing generative systems. Aside from the development of signing systems, scholarly research has revealed a significant deficiency in the evaluation of sign language generation systems by those who are deaf and use sign language. This paper presents the findings of a user study conducted with deaf signers. The study is aimed at comparing a state-of-the-art sign language generation system with a skilled sign language interpreter. The study focused on testing established metrics to gain insights into the usability of such metrics for deaf signers and how deaf signers perceive signing agents.

Comparing Static and Contextual Distributional Semantic Models on Intrinsic Tasks: An Evaluation on Mandarin Chinese Datasets
A Pranav | Yan Cong | Emmanuele Chersoni | Yu-Yin Hsu | Alessandro Lenci

The field of Distributional Semantics has recently undergone important changes, with the contextual representations produced by Transformers taking the place of static word embedding models. Notably, previous studies comparing the two types of vectors have only focused on the English language and a limited number of models. In our study, we present a comparative evaluation of static and contextualized distributional models for Mandarin Chinese, focusing on a range of intrinsic tasks. Our results reveal that static models remain stronger for some of the classical tasks that consider word meaning independent of context, while contextualized models excel in identifying semantic relations between word pairs and in the categorization of words into abstract semantic classes.

Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition
David Gimeno-Gómez | Carlos-D. Martínez-Hinarejos

Thanks to the rise of deep learning and the availability of large-scale audio-visual databases, recent advances have been achieved in Visual Speech Recognition (VSR). Similar to other speech processing tasks, these end-to-end VSR systems are usually based on encoder-decoder architectures. While encoders are somewhat general, multiple decoding approaches have been explored, such as the conventional hybrid model based on Deep Neural Networks combined with Hidden Markov Models (DNN-HMM) or the Connectionist Temporal Classification (CTC) paradigm. However, there are languages and tasks in which data is scarce, and in this situation, there is no clear comparison between different types of decoders. Therefore, we focused our study on how the conventional DNN-HMM decoder and its state-of-the-art CTC/Attention counterpart behave depending on the amount of data used for their estimation. We also analyzed to what extent our visual speech features were able to adapt to scenarios for which they were not explicitly trained, either considering a similar dataset or another collected for a different language. Results showed that, in data-scarcity scenarios, the conventional paradigm reached recognition rates that improve on those of the CTC/Attention model, with a reduced training time and fewer parameters.

Comparison of the Intimacy Process between Real and Acting-based Long-term Text Chats
Tsunehiro Arimoto | Hiroaki Sugiyama | Hiromi Narimatsu | Masahiro Mizukami

Long-term chatbots are expected to develop relationships with users. The major trend in this field’s recent long-term chatbot studies is to train systems with virtual long-term chat data called Multi-Session Chat (MSC), which collects text chat from multiple sessions of crowd workers playing the roles of speakers with defined personas. However, no investigation has attempted to determine whether such virtual long-term chat can successfully simulate relationship-building between speakers. To clarify the difference between an actual long-term intimacy process and an MSC intimacy process, this study collects real long-term chat and MSC in Japanese and compares them in terms of speech form and dialogue acts. The results of analyzing these factors suggest that MSC have an unnatural tendency to behave as if they have a close relationship with non-polite speech levels compared to actual long-term chats, but also as if they have a shallow relationship with more questions than real long-term chats.

Complex Word Identification: A Comparative Study between ChatGPT and a Dedicated Model for This Task
Abdelhak Kelious | Mathieu Constant | Christophe Coeur

There are several works in natural language processing on identifying lexical complexity. This can be for various reasons, such as simplification, the selection of more suitable content, or other specific tasks. Words can have multiple definitions and degrees of complexity depending on the context in which they appear. One solution being investigated is lexical complexity prediction, where computational methods are used to evaluate the difficulty of vocabulary for language learners and offer personalized assistance. In this work, we explore deep learning methods to assess the complexity of a word based on its context. Specifically, we investigate how to use pre-trained language models to encode both the sentence and the target word, and then fine-tune them by combining them with additional frequency-based features. Our approach achieved superior results compared to the best systems in SemEval-2021 (Shardlow et al., 2021), as demonstrated by an R2 score of 0.65. Finally, we carry out a comparative study with ChatGPT to assess its potential for predicting lexical complexity and to see whether prompt engineering can be an alternative for this task. We discuss the advantages and limitations of ChatGPT.

Recent advances in natural language processing (NLP) can be largely attributed to the advent of pre-trained language models such as BERT and RoBERTa. While these models demonstrate remarkable performance on general datasets, they can struggle in specialized domains such as medicine, where unique domain-specific terminologies, domain-specific abbreviations, and varying document structures are common. This paper explores strategies for adapting these models to domain-specific requirements, primarily through continuous pre-training on domain-specific data. We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data. The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering. Our results suggest that models augmented by clinical and translation-based pre-training typically outperform general domain models in medical contexts. We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch. Furthermore, pre-training on clinical data or leveraging translated texts have proven to be reliable methods for domain adaptation in medical NLP tasks.

Computational Modelling of Plurality and Definiteness in Chinese Noun Phrases
Yuqi Liu | Guanyi Chen | Kees van Deemter

Theoretical linguists have suggested that some languages (e.g., Chinese and Japanese) are “cooler” than other languages based on the observation that the intended meaning of phrases in these languages depends more on their contexts. As a result, many expressions in these languages are shortened, and their meaning is inferred from the context. In this paper, we focus on the omission of the plurality and definiteness markers in Chinese noun phrases (NPs) to investigate the predictability of their intended meaning given the contexts. To this end, we built a corpus of Chinese NPs, each of which is accompanied by its corresponding context, and by labels indicating its singularity/plurality and definiteness/indefiniteness. We carried out corpus assessments and analyses. The results suggest that Chinese speakers indeed drop plurality and definiteness markers very frequently. Building on the corpus, we train a bank of computational models using both classic machine learning models and state-of-the-art pre-trained language models to predict the plurality and definiteness of each NP. We report on the performance of these models and analyse their behaviours.

CONAN-MT-SP: A Spanish Corpus for Counternarrative Using GPT Models
María Estrella Vallecillo Rodríguez | Maria Victoria Cantero Romero | Isabel Cabrera De Castro | Arturo Montejo Ráez | María Teresa Martín Valdivia

This paper describes the automated generation of CounterNarratives (CNs) for Hate Speech (HS) in Spanish using GPT-based models. Our primary objective is to evaluate the performance of these models in comparison to human capabilities. For this purpose, the English CONAN Multitarget corpus is taken as a starting point and we use the DeepL API to automatically translate into Spanish. Two GPT-based models, GPT-3 and GPT-4, are applied to the HS segment through a few-shot prompting strategy to generate a new CN. As a consequence of our research, we have created a high quality corpus in Spanish that includes the original HS-CN pairs translated into Spanish, in addition to the CNs generated automatically with the GPT models and that have been evaluated manually. The resulting CONAN-MT-SP corpus and its evaluation will be made available to the research community, representing the most extensive linguistic resource of CNs in Spanish to date. The results demonstrate that, although the effectiveness of GPT-4 outperforms GPT-3, both models can be used as systems to automatically generate CNs to combat the HS. Moreover, these models consistently outperform human performance in most instances.

Conceptual Pacts for Reference Resolution Using Small, Dynamically Constructed Language Models: A Study in Puzzle Building Dialogues
Julian Hough | Sina Zarrieß | Casey Kennington | David Schlangen | Massimo Poesio

Using Brennan and Clark’s theory of a Conceptual Pact, that when interlocutors agree on a name for an object, they are forming a temporary agreement on how to conceptualize that object, we present an extension to a simple reference resolver which simulates this process over time with different conversation pairs. In a puzzle construction domain, we model pacts with small language models for each referent which update during the interaction. When features from these pact models are incorporated into a simple bag-of-words reference resolver, the accuracy increases compared to using a standard pre-trained model. The model performs equally to a competitor using the same data but with exhaustive re-training after each prediction, while also being more transparent, faster and less resource-intensive. We also experiment with reducing the number of training interactions, and can still achieve reference resolution accuracies of over 80% in testing from observing a single previous interaction, over 20% higher than a pre-trained baseline. While this is a limited domain, we argue the model could be applicable to larger real-world applications in human and human-robot interaction and is an interpretable and transparent model.

Knowing the particular context associated with a conversation can help improve the performance of an automatic speech recognition (ASR) system. For example, if we are provided with a list of in-context words or phrases (such as the speaker’s contacts or recent song playlists) during inference, we can bias the recognition process towards this list. There are many works addressing contextual ASR; however, there are few publicly available real benchmarks for evaluation, making it difficult to compare different solutions. To this end, we provide a corpus (“ConEC”) and baselines to evaluate contextual ASR approaches, grounded on real-world applications. The ConEC corpus is based on public-domain earnings calls (ECs) and associated supplementary materials, such as presentation slides, earnings news releases, and a list of meeting participants’ names and affiliations. We demonstrate that such real contexts are noisier than artificially synthesized contexts that contain the ground truth, yet they still leave great room for future improvement of contextual ASR technology.

Prompt-based methods have been widely used in few-shot named entity recognition (NER). In this paper, we first conduct a preliminary experiment and observe that the key to affecting the performance of prompt-based NER models is the capability to detect entity boundaries. However, most existing models fail to boost such capability. To solve the issue, we propose a novel model, ParaBART, which consists of a BART encoder and a specially designed parabiotic decoder. Specifically, the parabiotic decoder includes two BART decoders and a conjoint module. The two decoders are responsible for entity boundary detection and entity type classification, respectively. They are connected by the conjoint module, which is used to replace unimportant tokens’ embeddings in one decoder with the average embedding of all the tokens in the other. We further present a novel boundary expansion strategy to enhance the model’s capability in entity type classification. Experimental results show that ParaBART can achieve significant performance gains over state-of-the-art competitors.

CoNLL#: Fine-grained Error Analysis and a Corrected Test Set for CoNLL-03 English
Andrew Rueda | Elena Alvarez-Mellado | Constantine Lignos

Modern named entity recognition systems have steadily improved performance in the age of larger and more powerful neural models. However, over the past several years, the state-of-the-art has seemingly hit another plateau on the benchmark CoNLL-03 English dataset. In this paper, we perform a deep dive into the test outputs of the highest-performing NER models, conducting a fine-grained evaluation of their performance by introducing new document-level annotations on the test set. We go beyond F1 scores by categorizing errors in order to interpret the true state of the art for NER and guide future work. We review previous attempts at correcting the various flaws of the test set and introduce CoNLL#, a new corrected version of the test set that addresses its systematic and most prevalent errors, allowing for low-noise, interpretable error analysis.

Contrary to common belief, there are rich and diverse data sources available for many thousands of languages, which can be used to develop technologies for these languages. In this paper, we provide an overview of some of the major online data sources, the types of data that they provide access to, potential applications of this data, and the number of languages that they cover. Even this covers only a small fraction of the data that exists; for example, printed books are published in many languages but few online aggregators exist.

Constructing a Dependency Treebank for Second Language Learners of Korean
Hakyung Sung | Gyu-Ho Shin

We introduce a manually annotated syntactic treebank based on Universal Dependencies, derived from the written data of second language (L2) Korean learners. In developing this new dataset, we critically evaluated previous works and revised the annotation guidelines to better reflect the linguistic properties of Korean and the characteristics of L2 learners. The L2 Korean treebank encompasses 7,530 sentences (66,982 words; 129,333 morphemes) and is publicly available at: https://github.com/NLPxL2Korean/L2KW-corpus.

Constructing Indonesian-English Travelogue Dataset
Eunike Andriani Kardinata | Hiroki Ouchi | Taro Watanabe

Research on low-resource languages is often hampered by the under-representation of how the language is used in reality. This is particularly true for Indonesian, because there is a limited variety of textual datasets and the majority were acquired from official sources with a formal writing style. This holds all the more for the task of geoparsing, which could be implemented for navigation and travel planning applications: such datasets are rare even in high-resource languages such as English. Being aware of the need for a new resource in both languages for this specific task, we constructed a new dataset comprising both Indonesian and English from personal travelogue articles. Our dataset consists of 88 articles, exactly half of them written in each language. We covered both named and nominal expressions of four entity types related to travel: location, facility, transportation, and line. We also conducted experiments by training classifiers to recognise named entities and their nominal expressions. The results of our experiments showed promise for future use of our dataset, as we obtained F1-scores above 0.9 for both languages.

Constructing Korean Learners’ L2 Speech Corpus of Seven Languages for Automatic Pronunciation Assessment
Seunghee Han | Sunhee Kim | Minhwa Chung

Multilingual L2 speech corpora for developing automatic speech assessment are currently available, but they lack comprehensive annotations of L2 speech from non-native speakers of various languages. This study introduces the methodology of designing a Korean learners’ L2 speech corpus of seven languages: English, Japanese, Chinese, French, German, Spanish, and Russian. We describe the development of reading scripts, reading tasks, scoring criteria, and expert evaluation methods in detail. Our corpus contains 1,200 hours of L2 speech data from Korean learners (400 hours for English, 200 hours each for Japanese and Chinese, 100 hours each for French, German, Spanish, and Russian). The corpus is annotated with spelling and pronunciation transcription, expert pronunciation assessment scores (accuracy of pronunciation and fluency of prosody), and metadata such as gender, age, self-reported language proficiency, and pronunciation error types. We also propose a practical verification method and a reliability threshold to ensure the reliability and objectivity of large-scale subjective evaluation data.

Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However, models trained on datasets where KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find that noisier datasets do indeed lead to more hallucination. We argue that the ability of forward and reverse models trained on a dataset to cyclically regenerate source KG or text is a proxy for the equivalence between the KG and the text in the dataset. Using cyclic evaluation we find that manually created WebNLG is much better than automatically created TeKGen and T-REx. Informed by these observations, we construct a new, improved dataset called LAGRANGE using heuristics meant to improve equivalence between KG and text and show the impact of each of the heuristics on cyclic evaluation. We also construct two synthetic datasets using large language models (LLMs), and observe that these are conducive to models that perform notably well on cyclic generation of text, but less so on cyclic generation of KGs, probably because of a lack of a consistent underlying ontology.
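The cyclic evaluation idea in the abstract above (KG to text and back to KG, then compare with the source) can be sketched as follows. This is an illustration only: the triple-level F1 metric, the function names, and the toy string-based "models" are assumptions made for the sketch, not the paper's models or metrics.

```python
from typing import Callable, Set, Tuple

Triple = Tuple[str, ...]  # (subject, predicate, object) in this sketch


def triple_f1(gold: Set[Triple], pred: Set[Triple]) -> float:
    """F1 overlap between the source triples and the regenerated triples."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def cyclic_kg_score(kg: Set[Triple],
                    kg_to_text: Callable[[Set[Triple]], str],
                    text_to_kg: Callable[[str], Set[Triple]]) -> float:
    """Run KG -> text -> KG and score how much of the source KG survives."""
    return triple_f1(kg, text_to_kg(kg_to_text(kg)))


# Hypothetical stand-ins for trained forward and reverse models.
def toy_kg_to_text(kg: Set[Triple]) -> str:
    return " ; ".join(" | ".join(t) for t in sorted(kg))


def toy_text_to_kg(text: str) -> Set[Triple]:
    return {tuple(part.split(" | ")) for part in text.split(" ; ") if part}


def lossy_kg_to_text(kg: Set[Triple]) -> str:
    # Simulates a forward model that omits one triple (poor recall).
    return toy_kg_to_text(set(sorted(kg)[:-1]))
```

With the lossless toy pair the source graph is regenerated exactly, while the lossy forward model lowers the score, mirroring the intuition that a less equivalent KG-text pairing degrades cyclic regeneration.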

In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLMs’ understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don’t adequately represent their meaning or capture the lexical properties of phrasal heads.

Context-Aware Non-Autoregressive Document-Level Translation with Sentence-Aligned Connectionist Temporal Classification
Hao Yu | Kaiyu Huang | Anqi Zhao | Junpeng Liu | Degen Huang

Previous studies employ the autoregressive translation (AT) paradigm in document-to-document neural machine translation. These methods extend the translation unit from a single sentence to a pseudo-document and encode the full pseudo-document, avoiding the redundant computation problem in context. However, the AT methods cannot parallelize decoding and struggle with error accumulation, especially when the length of sentences increases. In this work, we propose a context-aware non-autoregressive framework with the sentence-aligned connectionist temporal classification (SA-CTC) loss for document-level neural machine translation. In particular, the SA-CTC loss reduces the search space of the decoding path by fixing the positions of the beginning and end tokens for each sentence in the document. Meanwhile, the context-aware architecture introduces preset nodes to represent sentence-level information and utilizes a hierarchical attention structure to regulate the attention hypothesis space. Experimental results show that our proposed method can achieve competitive performance compared with several strong baselines. Our method implements non-autoregressive modeling in a Doc-to-Doc translation manner, achieving an average 46X decoding speedup compared to the document-level AT baselines on three benchmarks.

Context Matters: Enhancing Metaphor Recognition in Proverbs
Gamze Goren | Carlo Strapparava

Despite the remarkable achievements of Large Language Models (LLMs) in various Natural Language Processing tasks, their competence in abstract language understanding remains a relatively under-explored territory. Figurative language interpretation serves as an ideal testbed for assessing this, as it requires models to navigate beyond the literal meaning and delve into the underlying semantics of figurative expressions. In this paper, we examine the performance of GPT-3.5 in a zero-shot setting through word-level metaphor detection. Specifically, we frame the task as annotation of word-level metaphors in proverbs. To this end, we employ a dataset of English proverbs and evaluate the model’s performance under different prompting strategies. Our results show that the model performs satisfactorily at identifying word-level metaphors, particularly when it is prompted with a hypothetical context preceding the proverb. This observation underscores the pivotal role of well-designed prompts for zero-shot settings through which these models can be leveraged as annotators for subjective NLP tasks.

Context Shapes Emergent Communication about Concepts at Different Levels of Abstraction
Kristina Kobrock | Xenia Isabel Ohmer | Elia Bruni | Nicole Gotzner

We study the communication of concepts at different levels of abstraction and in different contexts in an agent-based, interactive reference game. While playing the concept-level reference game, the neural network agents develop a communication system from scratch. We use a novel symbolic dataset that disentangles concept type (ranging from specific to generic) and context (ranging from fine to coarse) to study the influence of these factors on the emerging language. We compare two game scenarios: one in which speaker agents have access to context information (context-aware) and one in which the speaker agents do not have access to context information (context-unaware). First, we find that the agents learn higher-level concepts from the object inputs alone. Second, an analysis of the emergent communication system shows that only context-aware agents learn to communicate efficiently by adapting their messages to the context conditions and relying on context for unambiguous reference. Crucially, this behavior is not explicitly incentivized by the game, but efficient communication emerges and is driven by the availability of context alone. The emerging language we observe is reminiscent of evolutionary pressures on human languages and highlights the pivotal role of context in a communication system.

Contextualizing Generated Citation Texts
Biswadip Mandal | Xiangci Li | Jessica Ouyang

Abstractive citation text generation is usually framed as an infilling task, where a sequence-to-sequence model is trained to generate a citation given a reference paper and the context window around the target; the generated citation should be a brief discussion of the reference paper as it relates to the citing context. However, examining a recent LED-based citation generation system, we find that many of the generated citations are generic summaries of the reference paper’s main contribution, ignoring the citation context’s focus on a different topic. To address this problem, we propose a simple modification to the citation text generation task: the generation target is not only the citation itself, but the entire context window, including the target citation. This approach can be easily applied to any abstractive citation generation system, and our experimental results show that training in this way is preferred by human readers and allows the generation model to make use of contextual clues about what topic to discuss and what stance to take.

Contextual information, including the sentences in the same document and in other documents of the dataset, plays a crucial role in improving the accuracy of document-level ASR Error Correction (AEC), while most previous works ignore this. In this paper, we propose a context-aware method that utilizes a k-Nearest Neighbors (kNN) approach to enhance the AEC model by retrieving a datastore containing contextual information. We conduct experiments on two English and two Chinese datasets, and the results demonstrate that our proposed model can effectively utilize contextual information to improve document-level AEC. Furthermore, the context information from the whole dataset provides even better results.
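The retrieval step behind a kNN-augmented model can be sketched as a nearest-neighbour lookup in a datastore of (context vector, value) pairs. This is only a toy illustration: the vectors, the stored values, and the Euclidean distance are assumptions, and the abstract does not describe how retrieved neighbours are combined with the AEC model's own predictions.

```python
import math
from typing import List, Sequence, Tuple

Entry = Tuple[Sequence[float], str]


def knn_retrieve(query: Sequence[float], datastore: List[Entry], k: int = 2) -> List[Entry]:
    """Return the k entries whose key vectors are closest to the query."""
    ranked = sorted(datastore, key=lambda entry: math.dist(query, entry[0]))
    return ranked[:k]


# Hypothetical datastore: context representation -> token seen in that context.
DATASTORE: List[Entry] = [
    ((0.0, 0.0), "their"),
    ((1.0, 1.0), "there"),
    ((5.0, 5.0), "they're"),
]

neighbours = knn_retrieve((0.9, 0.8), DATASTORE, k=2)
retrieved_tokens = [token for _, token in neighbours]
```

In the document-level setting described above, the datastore would hold representations built from other sentences of the same document or from the whole dataset, rather than hand-written vectors.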

Traditional continual event detection relies on abundant labeled data for training, which is often impractical to obtain in real-world applications. In this paper, we introduce continual few-shot event detection (CFED), a more commonly encountered scenario when a substantial number of labeled samples are not accessible. The CFED task is challenging as it involves memorizing previous event types and learning new event types with few-shot samples. To mitigate these challenges, we propose a memory-based framework: Hierarchical Augmentation Network (HANet). To memorize previous event types with limited memory, we incorporate prototypical augmentation into the memory set. For the issue of learning new event types in few-shot scenarios, we propose a contrastive augmentation module for token representations. Besides comparing with previous state-of-the-art methods, we also conduct comparisons with ChatGPT. Experimental results demonstrate that our method significantly outperforms all of these methods in multiple continual few-shot event detection tasks.

Continual Reinforcement Learning for Controlled Text Generation
Velizar Shulev | Khalil Sima’an

Controlled Text Generation (CTG) steers the generation of continuations of a given context (prompt) by a Large Language Model (LLM) towards texts possessing a given attribute (e.g., topic, sentiment). In this paper we view CTG as a Continual Learning problem: how to learn at every step to steer next-word generation, without having to wait for end-of-sentence. This continual view is useful for online applications such as CTG for speech, where end-of-sentence is often uncertain. We depart from an existing model, the Plug-and-Play Language Model (PPLM), which perturbs the context at each step to better predict next words that possess the desired attribute. While PPLM is intricate and has many hyper-parameters, we provide a proof that the PPLM objective function can be reduced to a Continual Reinforcement Learning (CRL) reward function, thereby simplifying PPLM and endowing it with a better understood learning framework. Subsequently, we present the first CTG algorithm that is fully based on CRL and report promising empirical results.

Continued Pre-training on Sentence Analogies for Translation with Small Data
Liyan Wang | Haotong Wang | Yves Lepage

This paper introduces Continued Pre-training on Analogies (CPoA) to equip pre-trained language models with analogical abilities, aiming at improving performance in low-resource translation without data augmentation. We continue training the models on sentence analogies retrieved from a translation corpus. Considering the sparsity of analogy in corpora, especially in low-resource scenarios, we propose exploring approximate analogies between sentences. We attempt to find sentence analogies that might not conform to formal criteria for entire sentences but do so for parts of them. When training the models, we introduce a weighting scalar pertaining to the quality of analogies to adjust their influence: emphasizing closer analogies while diminishing the impact of distant ones. We evaluate our approach on a low-resource translation task: German-Upper Sorbian. The results show that CPoA using 10 times fewer instances can effectively attain gains of +1.4 and +1.3 BLEU points over the original model in two translation directions. This improvement is more pronounced when there are fewer parallel examples.

Continuous Relational Diffusion Driven Topic Model with Multi-grained Text for Microblog
Chenhao Wu | Ruifang He | Chang Liu | Bo Wang

A topic model is a statistical model that leverages unsupervised learning to mine hidden topics in document collections. The data sparsity and colloquialism of social texts make it difficult to accurately mine the topics. Traditional methods assume that there are only 0/1-state relationships between the two parties in social networks, but the relationship status in real life is more complicated, such as continuously changing relationships with different degrees of intimacy. This paper proposes a continuous relational diffusion driven topic model (CRTM) with multi-grained text for microblogs to realize the continuous representation of the relationship state and make up for the context and structural information lost by previous representation methods. Multi-grained text representation learning further distinguishes the impact of formal and informal expression on the topics and alleviates colloquialism problems. Specifically, based on the original social network, the reconstructed social network with continuous relationship status is obtained by using information diffusion technology. The graph convolution model is utilized to learn node embeddings through the new social network. Finally, neural variational inference is applied to generate topics according to continuous relationships. We validate CRTM on three real datasets, and the experimental results show the effectiveness of the scheme.

ContrastWSD: Enhancing Metaphor Detection with Word Sense Disambiguation Following the Metaphor Identification Procedure
Mohamad Elzohbi | Richard Zhao

This paper presents ContrastWSD, a RoBERTa-based metaphor detection model that integrates the Metaphor Identification Procedure (MIP) and Word Sense Disambiguation (WSD) to extract and contrast the contextual meaning with the basic meaning of a word to determine whether it is used metaphorically in a sentence. By utilizing the word senses derived from a WSD model, our model enhances the metaphor detection process and outperforms other methods that rely solely on contextual embeddings or integrate only the basic definitions and other external knowledge. We evaluate our approach on various benchmark datasets and compare it with strong baselines, demonstrating its effectiveness in advancing metaphor detection.

Contribution of Move Structure to Automatic Genre Identification: An Annotated Corpus of French Tourism Websites
Rémi Cardon | Trang Tran Hanh Pham | Julien Zakhia Doueihi | Thomas François

The present work studies the contribution of move structure to automatic genre identification. This concept - well known in other branches of genre analysis - seems to have little application in natural language processing. We describe how we collect a corpus of websites in French related to tourism and annotate it with move structure. We conduct experiments on automatic genre identification with our corpus. Our results show that our approach for informing a model with move structure can increase its performance for automatic genre identification, and reduce the need for annotated data and computational power.

Controllable Paraphrase Generation for Semantic and Lexical Similarities
Yuya Ogasa | Tomoyuki Kajiwara | Yuki Arase

We developed a controllable paraphrase generation model for semantic and lexical similarities using a simple and intuitive mechanism: attaching tags to specify these values at the head of the input sentence. Lexically diverse paraphrases have long been coveted for data augmentation. However, their generation is not straightforward because diversifying surface forms easily degrades semantic similarity. Furthermore, our experiments revealed two critical features in data augmentation by paraphrasing: appropriate similarities of paraphrases are highly downstream task-dependent, and mixing paraphrases of various similarities negatively affects the downstream tasks. These features indicated that controllability in paraphrase generation is crucial for successful data augmentation. We tackled these challenges by fine-tuning a pre-trained sequence-to-sequence model employing tags that indicate the semantic and lexical similarities of synthetic paraphrases selected carefully based on the similarities. The resultant model could paraphrase an input sentence according to the tags specified. Extensive experiments on data augmentation for contrastive learning and pre-fine-tuning of pretrained masked language models confirmed the effectiveness of the proposed model. We release our paraphrase generation model and a corpus of 87 million diverse paraphrases. (https://github.com/Ogamon958/ConPGS)
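The tagging mechanism described above can be pictured with a small sketch that prefixes an input sentence with two discretised similarity tags. The tag names (`<SEM_..>`, `<LEX_..>`) and the bucket width of 5 are hypothetical choices made for this example; the abstract does not give the model's actual tag vocabulary.

```python
def add_control_tags(
    sentence: str,
    semantic_sim: float,
    lexical_sim: float,
    bucket: int = 5,
) -> str:
    """Prefix a sentence with similarity tags (hypothetical tag format)."""

    def to_bucket(value: float) -> int:
        if not 0.0 <= value <= 1.0:
            raise ValueError("similarity must lie in [0, 1]")
        percent = round(value * 100)
        # Round down to the nearest multiple of `bucket`.
        return (percent // bucket) * bucket

    return f"<SEM_{to_bucket(semantic_sim)}> <LEX_{to_bucket(lexical_sim)}> {sentence}"


tagged = add_control_tags("The cat sat on the mat.", semantic_sim=0.93, lexical_sim=0.27)
```

At training time the tags would be computed from each source and paraphrase pair; at inference time the user picks the values, so a sequence-to-sequence model fine-tuned on tagged inputs can be steered towards the requested similarities.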

Controllable Sentence Simplification in Swedish Using Control Prefixes and Mined Paraphrases
Julius Monsen | Arne Jonsson

Making information accessible to diverse target audiences, including individuals with dyslexia and cognitive disabilities, is crucial. Automatic Text Simplification (ATS) systems aim to facilitate readability and comprehension by reducing linguistic complexity. However, they often lack customizability to specific user needs, and training data for smaller languages can be scarce. This paper addresses ATS in a Swedish context, using methods that provide more control over the simplification. A dataset of Swedish paraphrases is mined from large amounts of text and used to train ATS models utilizing prefix-tuning with control prefixes. We also introduce a novel data-driven method for selecting complexity attributes for controlling the simplification and compare it with previous approaches. Evaluation of the trained models using SARI and BLEU demonstrates significant improvements over the baseline — a fine-tuned Swedish BART model — and compared to previous Swedish ATS results. These findings highlight the effectiveness of employing paraphrase data in conjunction with controllable generation mechanisms for simplification. Additionally, the set of explored attributes yields similar results compared to previously used attributes, indicating their ability to capture important simplification aspects.

Controlled Generation with Prompt Insertion for Natural Language Explanations in Grammatical Error Correction
Masahiro Kaneko | Naoaki Okazaki

In Grammatical Error Correction (GEC), it is crucial that users understand the reason for each correction. Existing studies present tokens, examples, and hints for corrections, but do not directly explain the reasons in natural language. Although methods that use Large Language Models (LLMs) to provide direct explanations in natural language have been proposed for various tasks, no such method exists for GEC. Generating explanations for GEC corrections involves aligning input and output tokens, identifying correction points, and presenting corresponding explanations consistently. However, it is not straightforward to specify a complex format to generate explanations, because explicit control of generation is difficult with prompts. This study introduces a method called controlled generation with Prompt Insertion (PI) so that LLMs can explain the reasons for corrections in natural language. In PI, LLMs first correct the input text, and then we automatically extract the correction points based on rules. The extracted correction points are sequentially inserted into the LLM’s explanation output as prompts, guiding the LLMs to generate explanations for the correction points. We also create an Explainable GEC (XGEC) dataset of correction reasons by annotating NUCLE, CoNLL2013, and CoNLL2014. Although generations from GPT-3.5 and ChatGPT using original prompts miss some correction points, generation control using PI can explicitly guide the LLMs to explain all correction points, contributing to improved performance in generating correction reasons.
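The extraction and insertion steps can be sketched as follows. This is not the authors' rule set: token alignment with Python's `difflib.SequenceMatcher` stands in for their rule-based extraction, the prompt format is hypothetical, and the LLM call that would continue each inserted prompt is omitted.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def extract_correction_points(source: str, corrected: str) -> List[Tuple[str, str]]:
    """Align tokens and return (before, after) pairs for every edited span."""
    src, tgt = source.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    points = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace, delete, or insert
            points.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return points


def insertion_prompts(points: List[Tuple[str, str]]) -> List[str]:
    """One prompt per correction point, to be inserted into the explanation output."""
    return [
        f'{n}. "{before}" -> "{after}":'
        for n, (before, after) in enumerate(points, start=1)
    ]


points = extract_correction_points(
    "He go to school yesterday", "He went to school yesterday"
)
prompts = insertion_prompts(points)
```

Each prompt would be appended to the explanation in turn and the LLM asked to continue it, so that every extracted correction point receives an explanation.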

ControversialQA: Exploring Controversy in Question Answering
Zhen Wang | Peide Zhu | Jie Yang

Controversy is widespread online. Previous studies mainly define controversy based on vague assumptions of its relation to sentiment such as hate speech and offensive words. This paper introduces the first question-answering dataset that defines content controversy by user perception, i.e., votes from many users. It contains nearly 10K questions, and each question has a best answer and a most controversial answer. Experimental results reveal that controversy detection in question answering is essential and challenging, and there is no strong correlation between controversy and sentiment tasks. We also show that controversial answers and most acceptable answers cannot be distinguished by retrieval-based QA models, which may cause controversy issues. With these insights, we believe ControversialQA can inspire future research on controversy in QA systems.

Conversational Grounding: Annotation and Analysis of Grounding Acts and Grounding Units
Biswesh Mohapatra | Seemab Hassan | Laurent Romary | Justine Cassell

Successful conversations often rest on common understanding, where all parties are on the same page about the information being shared. This process, known as conversational grounding, is crucial for building trustworthy dialog systems that can accurately keep track of and recall the shared information. The proficiency of an agent in grounding the conveyed information significantly contributes to building a reliable dialog system. Despite recent advancements in dialog systems, there exists a noticeable deficit in their grounding capabilities. Traum (1995) provided a framework for conversational grounding introducing Grounding Acts and Grounding Units, but substantial progress, especially in the realm of Large Language Models, remains lacking. To bridge this gap, we present the annotation of two dialog corpora employing Grounding Acts, Grounding Units, and a measure of their degree of grounding. We discuss our key findings during the annotation and also provide a baseline model to test the performance of current Language Models in categorizing the grounding acts of the dialogs. Our work aims to provide a useful resource for further research in making conversations with machines better understood and more reliable in natural day-to-day collaborative dialogs.

Converting Legacy Data to CLDF: A FAIR Exit Strategy for Linguistic Web Apps
Robert Forkel | Daniel G. Swanson | Steven Moran

In the mid-2000s, several large-scale US National Science Foundation (NSF) grants were awarded to projects aiming at developing digital infrastructure and standards for different forms of linguistic data. For example, MultiTree encoded language family trees as phylogenies in XML and LL-MAP converted detailed geographic maps of endangered languages into KML. As early stand-alone website applications, these projects allowed researchers interested in comparative linguistics to explore language genealogies and areality, respectively. However, as time passed, the technologies that supported these web apps became deprecated, unsupported, and inaccessible. Here we take a future-oriented approach to digital obsolescence and illustrate how to convert legacy linguistic resources into FAIR data via the Cross-Linguistic Data Formats (CLDF). CLDF is built on the W3C recommendations Model for Tabular Data and Metadata on the Web and Metadata Vocabulary for Tabular Data developed by the CSVW (CSV on the Web) working group. Thus, each dataset is modeled as a set of tabular data files described by metadata in JSON. These standards and the tools built to validate and manipulate them provide an accessible and extensible format for converting legacy linguistic web apps into FAIR datasets.
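The "tabular data files described by metadata in JSON" model can be sketched with the standard library alone. The metadata below uses CSVW properties (`tables`, `url`, `tableSchema`, `columns`, `datatype`), but the file name, columns, and rows are invented for illustration, and the fragment omits the CLDF-specific properties a real CLDF dataset would carry.

```python
import csv
import io
import json

# Hypothetical CSVW-style description of one table.
METADATA = json.loads("""
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [
    {
      "url": "languages.csv",
      "tableSchema": {
        "columns": [
          {"name": "ID", "datatype": "string"},
          {"name": "Name", "datatype": "string"},
          {"name": "Latitude", "datatype": "number"}
        ]
      }
    }
  ]
}
""")

# Invented rows standing in for the content of languages.csv.
CSV_TEXT = "ID,Name,Latitude\nlang1,Example A,12.5\nlang2,Example B,-3.25\n"

# Only the datatypes used in this sketch are mapped to Python types.
CASTS = {"string": str, "integer": int, "number": float}


def read_table(metadata: dict, url: str, csv_text: str) -> list:
    """Parse CSV text and cast each cell using the column datatypes in the metadata."""
    table = next(t for t in metadata["tables"] if t["url"] == url)
    casts = {
        col["name"]: CASTS[col.get("datatype", "string")]
        for col in table["tableSchema"]["columns"]
    }
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: casts[name](value) for name, value in row.items()} for row in reader]


rows = read_table(METADATA, "languages.csv", CSV_TEXT)
```

Because the schema lives in the JSON file rather than in application code, the data remains readable with generic CSV tooling after the original web app is gone.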

This paper introduces CookingSense, a descriptive collection of knowledge assertions in the culinary domain extracted from various sources, including web data, scientific papers, and recipes, from which knowledge covering a broad range of aspects is acquired. CookingSense is constructed through a series of dictionary-based filtering and language model-based semantic filtering techniques, which results in a rich knowledgebase of multidisciplinary food-related assertions. Additionally, we present FoodBench, a novel benchmark to evaluate culinary decision support systems. From evaluations with FoodBench, we empirically prove that CookingSense improves the performance of retrieval augmented language models. We also validate the quality and variety of assertions in CookingSense through qualitative analysis.

Automatic International Classification of Diseases (ICD) coding plays a crucial role in the extraction of relevant information from clinical notes for proper recording and billing. One of the most important directions for boosting the performance of automatic ICD coding is modeling ICD code relations. However, current methods insufficiently model the intricate relationships among ICD codes and often overlook the importance of context in clinical notes. In this paper, we propose a novel approach, a contextualized and flexible framework, to enhance the learning of ICD code representations. Our approach, unlike existing methods, employs a dependent learning paradigm that considers the context of clinical notes in modeling all possible code relations. We evaluate our approach on six public ICD coding datasets and the experimental results demonstrate the effectiveness of our approach compared to state-of-the-art baselines.

Naively assuming English as a source language may hinder cross-lingual transfer for many languages by failing to consider the importance of language contact. Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages; for many languages, the set of closely related languages does not include English. In this work, we study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language. We also construct a novel benchmark dataset for close contact Chinese-Japanese-Korean-Vietnamese (CJKV) languages to further encourage in-depth studies of language contact. To comprehensively capture contact between these languages, we propose to integrate Romanized transcription beyond textual scripts via Contrastive Learning objectives, leading to enhanced cross-lingual representations and effective zero-shot cross-lingual transfer.

Corpus Creation and Automatic Alignment of Historical Dutch Dialect Speech
Martijn Bentum | Eric Sanders | Antal P.J. van den Bosch | Douwe Zeldenrust | Henk van den Heuvel

The Dutch Dialect Database (also known as the ‘Nederlandse Dialectenbank’) contains dialectal variations of Dutch that were recorded all over the Netherlands in the second half of the twentieth century. A subset of these recordings, about 300 hours, was enriched with manual orthographic transcriptions, using non-standard approximations of dialectal speech. In this paper we describe the creation of a corpus containing both the audio recordings and their corresponding transcriptions and focus on our method for aligning the recordings with the transcriptions and the metadata.

Corpus Services: A Framework to Curate XML Corpus Data
Aleksandr Riaposov | Elena Lazarenko

This paper provides a comprehensive description of the Corpus Services framework, a collection of Java validation tools for language corpora compiled in XML-based data formats, in particular those using EXMARaLDA corpus software. Having successfully found application in several research projects, the core functionality of the framework is currently integrated in the automated curation and publication workflows for EXMARaLDA-driven corpora of Northern Eurasian languages, as developed by the long-term project INEL. Preliminary stages of development and examples of practical use cases are covered, and a structured explanation of the framework’s current functionality and operational mechanisms is provided. Furthermore, the utilization of Corpus Services is extensively illustrated within the context of INEL workflows.

Correcting Language Model Bias for Text Classification in True Zero-Shot Learning
Feng Zhao | Wan Xianlin | Cheng Yan | Chu Kiong Loo

Combining pre-trained language models (PLMs) and manual templates is a common practice for text classification in zero-shot scenarios. However, the effect of this approach is highly volatile, ranging from random guesses to near state-of-the-art results, depending on the quality of the manual templates. In this paper, we show that this instability stems from the fact that language models tend toward predicting certain label words of text classification, and manual templates can influence this tendency. To address this, we develop a novel pipeline for annotating and filtering a few examples from unlabeled examples. Moreover, we propose a new method to measure model bias on label words that utilizes unlabeled examples as a validation set when tuning language models. Our approach does not require any pre-labeled examples. Experimental results on six text classification tasks demonstrate that the proposed approach significantly outperforms standard prompt learning in zero-shot settings, achieving up to 19.7% absolute improvement and 13.8% average improvement. More surprisingly, on IMDB and SST-2, our approach even exceeds all few-shot baselines.

Correcting Pronoun Homophones with Subtle Semantics in Chinese Speech Recognition
Zhaobo Zhang | Rui Gan | Pingpeng Yuan | Hai Jin

Speech recognition is becoming prevalent in daily life. However, because the entities share similar semantic contexts and Chinese pronunciations overlap, pronoun homophones, especially “他/她/它” (he/she/it), all pronounced “Tā”, are usually recognized incorrectly. Correcting them automatically during the post-processing of Chinese speech recognition is challenging. In this paper, we propose three models to address the common confusion issues in this domain, tailored to various application scenarios. We implement the language model, the LSTM model with semantic features, and the rule-based assisted Ngram model, enabling our models to adapt to a wide range of requirements, from high-precision to low-resource offline devices. Extensive experiments show that our models achieve the highest recognition rate for “Tā” correction, with improvements from 70% in popular voice input methods up to 90%. Further ablation analysis underscores the effectiveness of our models in enhancing recognition accuracy. Therefore, our models improve the overall experience of Chinese speech recognition of “Tā” and reduce the burden of manual transcription corrections.

Correlations between Multilingual Language Model Geometry and Crosslingual Transfer Performance
Cheril Shah | Yashashree Chandak | Atharv Mahesh Mane | Benjamin Bergen | Tyler A. Chang

A common approach to interpreting multilingual language models is to evaluate their internal representations. For example, studies have found that languages occupy distinct subspaces in the models’ representation spaces, and geometric distances between languages often reflect linguistic properties such as language families and typological features. In our work, we investigate whether geometric distances between language representations correlate with zero-shot crosslingual transfer performance for POS-tagging and NER in three multilingual language models. We consider four distance metrics, including new metrics that identify a basis for a multilingual representation space that sorts axes based on their language-separability. We find that each distance metric either only moderately correlates or does not correlate with crosslingual transfer performance, and metrics do not generalize well across models, layers, and tasks. Although pairwise language separability is a reasonable predictor of crosslingual transfer, representational geometry overall is an inconsistent predictor for the crosslingual performance of multilingual language models.
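The kind of analysis described above can be sketched as a rank correlation between a per-language-pair distance and the corresponding transfer score. The abstract does not state which correlation coefficient the authors use; Spearman's rank correlation is assumed here, and the numbers are toy values for illustration, not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values: one entry per (source, target) language pair.
distances = [0.1, 0.4, 0.2, 0.9]
transfer_scores = [0.8, 0.5, 0.7, 0.1]
rho = spearman(distances, transfer_scores)
```

A coefficient near -1 would mean that smaller representational distance goes with better transfer; the abstract reports that, for real models, such correlations are at best moderate and do not generalize well across models, layers, and tasks.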

Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank
Jiří Mírovský | Pavlína Synková | Lucie Polakova | Marie Paclíková

We present a cost-effective method for obtaining a high-quality annotation of explicit discourse relations in the Czech part of the Prague Czech–English Dependency Treebank, a corpus of almost 50 thousand sentences coming from the Czech translation of the Wall Street Journal part of the Penn Treebank. We use three different sources of information and combine them to obtain the discourse annotation: (i) annotation projection from the Penn Discourse Treebank 3.0, (ii) manual tectogrammatical (deep syntax) representation of sentences of the corpus, and (iii) the Lexicon of Czech Discourse Connectives CzeDLex. After solving as many discrepancies as possible automatically, the final discourse annotation is achieved by manual inspection of the remaining problematic cases. The discourse annotation of the corpus will be available both in the Prague format (on top of tectogrammatical trees) with the Prague taxonomy of discourse types, and in the Penn format (on plain texts) with the Penn Discourse Treebank 3.0 sense taxonomy.

Counterfactual Dialog Mixing as Data Augmentation for Task-Oriented Dialog Systems
Sebastian Steindl | Ulrich Schäfer | Bernd Ludwig

High-quality training data for Task-Oriented Dialog (TOD) systems is costly to come by if no corpora are available. One method to extend available data is data augmentation. Yet, the research into and adaptation of data augmentation techniques for TOD systems is limited in comparison with other data modalities. We propose a novel, causally-flavored data augmentation technique called Counterfactual Dialog Mixing (CDM) that generates realistic synthetic dialogs via counterfactuals to increase the amount of training data. We demonstrate the method on a benchmark dataset and show that a model trained to classify the counterfactuals from the original data fails to do so, which strengthens the claim of creating realistic synthetic dialogs. To evaluate the effectiveness of CDM, we train a current architecture on a benchmark dataset and compare the performance with and without CDM. By doing so, we achieve state-of-the-art on some metrics. We further investigate the external generalizability and a lower resource setting. To evaluate the models, we adopted an interactive evaluation scheme.

pdf abs
Creating Terminological Resources in the Digital Age for Less-resourced Languages
Mercè Vàzquez

Multilingual terminological resources contain the most representative knowledge of specialized domains and allow professionals to create and translate specialized content in order to spread knowledge. Today, representative and useful multilingual terminological resources are available for the most resourced languages. This reduces or limits the development of knowledge in less-resourced languages across different specialized domains, mainly those that are constantly evolving and creating or adapting new concepts as needed. In this paper we present our methodology for carrying out terminological projects in Catalan, based entirely on open access linguistic resources and using natural language processing tools. The main objective of this research is to maximize the Catalan terminology currently available in open access, using a combination of natural language processing tools. The results are reviewed by linguists and expert terminologists before being made publicly available. The findings of our research provide a new approach to terminology work, making it possible to design high-volume multilingual terminological projects that are manually revised by linguists and terminologists in the context of less-resourced languages.

The landscape of privacy laws and regulations around the world is complex and ever-changing. National and super-national laws, agreements, decrees, and other government-issued rules form a patchwork that companies must follow to operate internationally. To examine the status and evolution of this patchwork, we introduce the Privacy Law Corpus, of 1,043 privacy laws, regulations, and guidelines, covering 183 jurisdictions. This corpus enables a large-scale quantitative and qualitative examination of legal focus on privacy. We examine the temporal distribution of when privacy laws were created and illustrate the dramatic increase in privacy legislation over the past 50 years, although a finer-grained examination reveals that the rate of increase varies depending on the personal data types that privacy laws address. Our exploration also demonstrates that most privacy laws respectively address relatively few personal data types. Additionally, topic modeling results show the prevalence of common themes in privacy laws, such as finance, healthcare, and telecommunications. Finally, we release the corpus to the research community to promote further study.

pdf abs
Croatian Idioms Integration: Enhancing the LIdioms Multilingual Linked Idioms Dataset
Ivana Filipović Petrović | Miguel López Otal | Slobodan Beliga

Idioms, also referred to as phraseological units in some language terminologies, are a subset within the broader category of multi-word expressions. However, there is a lack of representation of idioms in Croatian, a low-resourced language, in the Linguistic Linked Open Data cloud (LLOD). To address this gap, we propose an extension of an existing RDF-based multilingual representation of idioms, referred to as the LIdioms dataset, which currently includes idioms from English, German, Italian, Portuguese, and Russian. This paper expands the existing resource by incorporating 1,042 Croatian idioms in an Ontolex Lemon format. In addition, to foster translation initiatives and facilitate intercultural exchange, these added Croatian idioms have also been linked to other idioms of the LIdioms dataset, with which they share similar meanings despite their differences in the expression aspect. This addition enriches the knowledge base of the LLOD community with a new language resource that includes Croatian idioms.

pdf abs
CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization
Ruochen Zhang | Carsten Eickhoff

Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and the advancements of multilingual language models. However, given the rarity of naturally occurring CLS resources, the majority of datasets are forced to rely on translation, which can contain overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alternation between languages in mid-message is a common phenomenon in multilingual settings yet has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-written Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing CLS resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of current datasets. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses.

pdf abs
Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish
Recep Firat Cekinel | Çağrı Çöltekin | Pinar Karagoz

The rapid spread of misinformation through social media platforms has raised concerns regarding its impact on public opinion. While misinformation is prevalent in other languages, the majority of research in this field has concentrated on the English language. Hence, there is a scarcity of datasets for other languages, including Turkish. To address this concern, we have introduced the FCTR dataset, consisting of 3238 real-world claims. This dataset spans multiple domains and incorporates evidence collected from three Turkish fact-checking organizations. Additionally, we aim to assess the effectiveness of cross-lingual transfer learning for low-resource languages, with a particular focus on Turkish. We demonstrate in-context learning (zero-shot and few-shot) performance of large language models in this context. The experimental results indicate that the dataset has the potential to advance research in the Turkish language.

pdf abs
Cross-lingual Named Entity Corpus for Slavic Languages
Jakub Piskorski | Michał Marcińczuk | Roman Yangarber

This paper presents a corpus manually annotated with named entities for six Slavic languages — Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017–2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits — single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models — XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.

pdf abs
Cross-Lingual NLU: Mitigating Language-Specific Impact in Embeddings Leveraging Adversarial Learning
Saedeh Tahery | Sahar Kianian | Saeed Farzi

Low-resource languages and computational expenses pose significant challenges in the domain of large language models (LLMs). Currently, researchers are actively involved in various efforts to tackle these challenges. Cross-lingual natural language processing (NLP) remains one of the most promising strategies to address these issues. In this paper, we introduce a novel approach that utilizes adversarial techniques to mitigate the impact of language-specific information in contextual embeddings generated by large multilingual language models, with potential applications in cross-lingual tasks. The study encompasses five different languages, including both Latin and non-Latin ones, in the context of two fundamental tasks in natural language understanding: intent detection and slot filling. The results primarily show that our current approach excels in zero-shot scenarios for Latin languages like Spanish. However, it encounters limitations when applied to languages distant from English, such as Thai and Persian. This highlights that while our approach effectively reduces the effect of language-specific information on the core meaning, it performs better for Latin languages that share language-specific nuances with English, as certain characteristics persist in the overall meaning within embeddings.

pdf abs
Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity
Sho Hoshino | Akihiko Kato | Soichiro Murakami | Peinan Zhang

Learning better sentence embeddings leads to improved performance for natural language understanding tasks including semantic textual similarity (STS) and natural language inference (NLI). As prior studies leverage large-scale labeled NLI datasets for fine-tuning masked language models to yield sentence embeddings, task performance for languages other than English is often left behind. In this study, we directly compared two data augmentation techniques as potential solutions for monolingual STS: (a) cross-lingual transfer that exploits English resources alone as training data to yield non-English sentence embeddings as zero-shot inference, and (b) machine translation that converts English data into pseudo non-English training data in advance. In our experiments on monolingual STS in Japanese and Korean, we find that the two data augmentation techniques yield performance on par. In addition, we find a superiority of Wikipedia domain over NLI domain as unlabeled training data for these languages. Combining our findings, we further demonstrate that the cross-lingual transfer of Wikipedia data exhibits improved performance.

pdf abs
Cross-Lingual Transfer Robustness to Lower-Resource Languages on Adversarial Datasets
Shadi Manafi | Nikhil Krishnaswamy

Multilingual Language Models (MLLMs) exhibit robust cross-lingual transfer capabilities, or the ability to leverage information acquired in a source language and apply it to a target language. These capabilities find practical applications in well-established Natural Language Processing (NLP) tasks such as Named Entity Recognition (NER). This study aims to investigate the effectiveness of a source language when applied to a target language, particularly in the context of perturbing the input test set. We evaluate on 13 pairs of languages, each including one high-resource language (HRL) and one low-resource language (LRL) with a geographic, genetic, or borrowing relationship. We evaluate two well-known MLLMs—MBERT and XLM-R—on these pairs, in native LRL and cross-lingual transfer settings, in two tasks, under a set of different perturbations. Our findings indicate that NER cross-lingual transfer depends largely on the overlap of entity chunks. If a source and target language have more entities in common, the transfer ability is stronger. Models using cross-lingual transfer also appear to be somewhat more robust to certain perturbations of the input, perhaps indicating an ability to leverage stronger representations derived from the HRL. Our research provides valuable insights into cross-lingual transfer and its implications for NLP applications, and underscores the need to consider linguistic nuances and potential limitations when employing MLLMs across distinct languages.
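The finding about shared entities can be made concrete with a small overlap measure. The abstract does not state how entity overlap is computed, so the Jaccard ratio below is only an assumed stand-in, and the function name and entity lists are hypothetical.

```python
def entity_overlap(source_entities, target_entities):
    """Jaccard overlap between the entity strings seen in two languages."""
    source, target = set(source_entities), set(target_entities)
    union = source | target
    # Two empty inventories are treated as having no overlap.
    return len(source & target) / len(union) if union else 0.0

# Hypothetical entity inventories from a high-resource source language
# and a low-resource target language.
source = ["Torino", "UNESCO", "Danube"]
target = ["UNESCO", "Danube", "Dili"]
print(entity_overlap(source, target))
```

Under the abstract's finding, language pairs with a higher value of such a measure would be expected to show stronger NER transfer.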

pdf abs
CrossTune: Black-Box Few-Shot Classification with Label Enhancement
Danqing Luo | Chen Zhang | Yan Zhang | Haizhou Li

Training or finetuning large-scale language models (LLMs) requires substantial computation resources, motivating recent efforts to explore parameter-efficient adaptation to downstream tasks. One approach is to treat these models as black boxes and use forward passes (Inference APIs) to interact with them. Current research focuses on adapting these black-box models to downstream tasks using gradient-free prompt optimization, but this often involves an expensive process of searching task-specific prompts. Therefore, we are motivated to study black-box language model adaptation without prompt search. Specifically, we introduce a label-enhanced cross-attention network called CrossTune, which models the semantic relatedness between the input text sequence and task-specific label descriptions. Its effectiveness is examined in the context of few-shot text classification. To improve the generalization of CrossTune, we utilize ChatGPT to generate additional training data through in-context learning. A switch mechanism is implemented to exclude low-quality ChatGPT-generated data. Through extensive experiments on seven benchmark text classification datasets, we demonstrate that our proposed approach outperforms the previous state-of-the-art gradient-free black-box tuning method by 5.7% on average. Even without using ChatGPT-augmented data, CrossTune performs better than or comparably to previous black-box tuning methods, suggesting the effectiveness of our approach.

pdf abs
Cross-type French Multiword Expression Identification with Pre-trained Masked Language Models
Van-Tuan Bui | Agata Savary

Multiword expressions (MWEs) pose difficulties for natural language processing (NLP) due to their linguistic features, such as syntactic and semantic properties, which distinguish them from regular word groupings. This paper describes a combination of two systems: one that learns verbal multiword expressions (VMWEs) and another that learns non-verbal MWEs (nVMWEs). Together, these systems leverage training data from both types of MWEs to enhance performance on a cross-type dataset containing both VMWEs and nVMWEs. Such scenarios emerge when datasets are developed using differing annotation schemes. We explore the fine-tuning of several state-of-the-art neural transformers for each MWE type. Our experiments demonstrate the advantages of the combined system over multi-task approaches or single-task models, addressing the challenges posed by diverse tagsets within the training data. Specifically, we evaluated the combined system on a French treebank named Sequoia, which features an annotation layer encompassing all syntactic types of French MWEs. With this combined approach, we improved the F1-score by approximately 3% on the Sequoia dataset.

pdf abs
CSSWiki: A Chinese Sentence Simplification Dataset with Linguistic and Content Operations
Fengkai Liu | John S. Y. Lee

Sentence Simplification aims to make sentences easier to read and understand. With most effort on corpus development focused on English, the amount of annotated data is limited in Chinese. To address this need, we introduce CSSWiki, an open-source dataset for Chinese sentence simplification based on Wikipedia. This dataset contains 1.6k source sentences paired with their simplified versions. Each sentence pair is annotated with operation tags that distinguish between linguistic and content modifications. We analyze differences in annotation scheme and data statistics between CSSWiki and existing datasets. We then report baseline sentence simplification performance on CSSWiki using zero-shot and few-shot approaches with Large Language Models.

pdf abs
CTSM: Combining Trait and State Emotions for Empathetic Response Model
Yufeng Wang | Chao Chen | Zhou Yang | Shuhui Wang | Xiangwen Liao

Empathetic response generation endeavors to empower dialogue systems to perceive speakers’ emotions and generate empathetic responses accordingly. Psychological research demonstrates that emotion, as an essential factor in empathy, encompasses trait emotions, which are static and context-independent, and state emotions, which are dynamic and context-dependent. However, previous studies treat them in isolation, leading to insufficient emotional perception of the context, and subsequently, less effective empathetic expression. To address this problem, we propose Combining Trait and State emotions for Empathetic Response Model (CTSM). Specifically, to sufficiently perceive emotions in dialogue, we first construct and encode trait and state emotion embeddings, and then we further enhance emotional perception capability through an emotion guidance module that guides emotion representation. In addition, we propose a cross-contrastive learning decoder to enhance the model’s empathetic expression capability by aligning trait and state emotions between generated responses and contexts. Both automatic and manual evaluation results demonstrate that CTSM outperforms state-of-the-art baselines and can generate more empathetic responses. Our code is available at https://github.com/wangyufeng-empty/CTSM

Extensive training datasets represent one of the important factors for the impressive learning capabilities of large language models (LLMs). However, these training datasets for current LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable datasets to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is released on Hugging Face to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.

pdf abs
Curation of Benchmark Templates for Measuring Gender Bias in Named Entity Recognition Models
Ana Cimitan | Ana Alves Pinto | Michaela Geierhos

Named Entity Recognition (NER) constitutes a popular machine learning technique that empowers several natural language processing applications. As with other machine learning applications, NER models have been shown to be susceptible to gender bias. The latter is often assessed using benchmark datasets, which in turn are curated specifically for a given Natural Language Processing (NLP) task. In this work, we investigate the robustness of benchmark templates to detect gender bias and propose a novel method to improve the curation of such datasets. The method, based on masked token prediction, aims to filter out benchmark templates with a higher probability of detecting gender bias in NER models. We tested the method for English and German, using the corresponding fine-tuned BERT base model (cased) as the NER model. The gender gaps detected with templates classified as appropriate by the method were statistically larger than those detected with inappropriate templates. The results were similar for both languages and support the use of the proposed method in the curation of templates designed to detect gender bias.

pdf abs
CuRIAM: Corpus Re Interpretation and Metalanguage in U.S. Supreme Court Opinions
Michael Kranzlein | Nathan Schneider | Kevin Tobia

Most judicial decisions involve the interpretation of legal texts. As such, judicial opinions use language as the medium to comment on or draw attention to other language (for example, through definitions and hypotheticals about the meaning of a term from a statute). Language used this way is called metalanguage. Focusing on the U.S. Supreme Court, we view metalanguage as reflective of justices’ interpretive processes, bearing on current debates and theories about textualism in law and political science. As a step towards large-scale metalinguistic analysis with NLP, we identify 9 categories prominent in metalinguistic discussions, including key terms, definitions, and different kinds of sources. We annotate these concepts in a corpus of U.S. Supreme Court opinions. Our analysis of the corpus reveals high interannotator agreement, frequent use of quotes and sources, and several notable frequency differences between majority, concurring, and dissenting opinions. We observe fewer instances than expected of several legal interpretive categories. We discuss some of the challenges in developing the annotation schema and applying it and provide recommendations for how this corpus can be used for broader analyses.

pdf abs
Curriculum Learning Meets Directed Acyclic Graph for Multimodal Emotion Recognition
Cam-Van Thi Nguyen | Cao-Bach Nguyen | Duc-Trong Le | Quang-Thuy Ha

Emotion recognition in conversation (ERC) is a crucial task in natural language processing and affective computing. This paper proposes MultiDAG+CL, a novel approach for multimodal ERC that employs a Directed Acyclic Graph (DAG) to integrate textual, acoustic, and visual features within a unified framework. The model is enhanced by Curriculum Learning (CL) to address challenges related to emotional shifts and data imbalance. Curriculum learning facilitates the learning process by gradually presenting training samples in a meaningful order, thereby improving the model’s performance in handling emotional variations and data imbalance. Experimental results on the IEMOCAP and MELD datasets demonstrate that the MultiDAG+CL models outperform baseline models. We release the code and experiments: https://github.com/vanntc711/MultiDAG-CL.

pdf abs
CuSINeS: Curriculum-driven Structure Induced Negative Sampling for Statutory Article Retrieval
Santosh T.y.s.s. | Kristina Kaiser | Matthias Grabmair

In this paper, we introduce CuSINeS, a negative sampling approach to enhance the performance of Statutory Article Retrieval (SAR). CuSINeS offers three key contributions. Firstly, it employs a curriculum-based negative sampling strategy guiding the model to focus on easier negatives initially and progressively tackle more difficult ones. Secondly, it leverages the hierarchical and sequential information derived from the structural organization of statutes to evaluate the difficulty of samples. Lastly, it introduces a dynamic semantic difficulty assessment using the being-trained model itself, surpassing conventional static methods like BM25, adapting the negatives to the model’s evolving competence. Experimental results on a real-world expert-annotated SAR dataset validate the effectiveness of CuSINeS across four different baselines, demonstrating its versatility.
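The easy-to-hard idea behind curriculum-based negative sampling can be sketched as follows. This is a minimal illustration under assumed details, not the CuSINeS procedure: the widening-pool schedule, the `difficulty` callback, and all names are hypothetical, and in the paper the difficulty comes from statute structure and from the model being trained.

```python
import random

def curriculum_negatives(candidates, difficulty, progress, k, rng=random):
    """Sample k negatives from an easiest-first pool that widens over training.

    candidates: negative examples to choose from
    difficulty: function mapping a candidate to a score (higher = harder)
    progress:   training progress in [0, 1]
    rng:        object with a sample() method (the random module by default)
    """
    ranked = sorted(candidates, key=difficulty)
    fraction = min(max(progress, 0.0), 1.0)
    # Early on, only the k easiest negatives are eligible; by the end, all are.
    cutoff = max(k, int(len(ranked) * fraction))
    pool = ranked[:cutoff]
    return rng.sample(pool, min(k, len(pool)))

# Hypothetical usage: ten candidate articles whose difficulty equals their index.
articles = list(range(10))
early = curriculum_negatives(articles, lambda a: a, progress=0.0, k=3)
late = curriculum_negatives(articles, lambda a: a, progress=1.0, k=3)
print(sorted(early), sorted(late))
```

Recomputing `difficulty` with the current model before each sampling round would approximate the dynamic assessment the abstract contrasts with static scorers such as BM25.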

pdf abs
CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural Topic Modeling
Zheng Fang | Yulan He | Rob Procter

Most existing topic models rely on bag-of-words (BOW) representation, which limits their ability to capture word order information and leads to challenges with out-of-vocabulary (OOV) words in new documents. Contextualized word embeddings, however, show superiority in word sense disambiguation and effectively address the OOV issue. In this work, we introduce a novel neural topic model called the Contextualized Word Topic Model (CWTM), which integrates contextualized word embeddings from BERT. The model is capable of learning the topic vector of a document without BOW information. In addition, it can also derive the topic vectors for individual words within a document based on their contextualized word embeddings. Experiments across various datasets show that CWTM generates more coherent and meaningful topics compared to existing topic models, while also accommodating unseen words in newly encountered documents.

pdf abs
CyberAgressionAdo-v2: Leveraging Pragmatic-Level Information to Decipher Online Hate in French Multiparty Chats
Anais Ollagnier

As a part of the release of the CyberAgressionAdo-V2 dataset, this paper introduces a new tagset that includes tags marking pragmatic-level information occurring in cyberbullying situations. The previous version of this dataset, CyberAgressionAdo-V1, consists of aggressive multiparty chats in French annotated using a hierarchical tagset developed to describe bullying narrative events including the participant roles, the presence of hate speech, the type of verbal abuse, among others. In contrast, CyberAgressionAdo-V2 uses a multi-label, fine-grained tagset marking the discursive role of exchanged messages as well as the context in which they occur — for instance, attack (ATK), defend (DFN), counterspeech (CNS), abet/instigate (AIN), gaslight (GSL), etc. This paper provides a comprehensive overview of the annotation tagset and presents statistical insights derived from its application. Additionally, we address the challenges encountered when annotating pragmatic-level information in this context, conducting a thorough analysis of annotator disagreements. The resulting dataset comprises 19 conversations that have been manually annotated and is now available to facilitate further research in the field.

pdf abs
Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks
Jakub Šmíd | Pavel Přibáň | Ondrej Prazak | Pavel Kral

In this paper, we introduce a novel Czech dataset for aspect-based sentiment analysis (ABSA), which consists of 3.1K manually annotated reviews from the restaurant domain. The dataset is built upon the older Czech dataset, which contained only separate labels for the basic ABSA tasks such as aspect term extraction or aspect polarity detection. Unlike its predecessor, our new dataset is specifically designed to allow its usage for more complex tasks, e.g. target-aspect-category detection. These advanced tasks require a unified annotation format, seamlessly linking sentiment elements (labels) together. Our dataset follows the format of the well-known SemEval-2016 datasets. This design choice allows effortless application and evaluation in cross-lingual scenarios, ultimately fostering cross-language comparisons with equivalent counterpart datasets in other languages. The annotation process engaged two trained annotators, yielding an impressive inter-annotator agreement rate of approximately 90%. Additionally, we provide 24M reviews without annotations suitable for unsupervised learning. We present robust monolingual baseline results achieved with various Transformer-based models and insightful error analysis to supplement our contributions. Our code and dataset are freely available for non-commercial research purposes.

pdf abs
DACL: Disfluency Augmented Curriculum Learning for Fluent Text Generation
Rohan Chaudhury | Maria Teleki | Xiangjue Dong | James Caverlee

Voice-driven software systems are in abundance. However, language models that power these systems are traditionally trained on fluent, written text corpora. Hence there can be a misalignment between the inherent disfluency of transcribed spoken content and the fluency of the written training data. Furthermore, gold-standard disfluency annotations of various complexities for incremental training can be expensive to collect. So, we propose in this paper a Disfluency Augmented Curriculum Learning (DACL) approach to tackle the complex structure of disfluent sentences and generate fluent texts from them, by using Curriculum Learning (CL) coupled with our synthetically augmented disfluent texts of various levels. DACL harnesses the tiered structure of our generated synthetic disfluent data using CL, by training the model on basic samples (i.e. more fluent) first before training it on more complex samples (i.e. more disfluent). In contrast to the random data exposure paradigm, DACL focuses on a simple-to-complex learning process. We comprehensively evaluate DACL on Switchboard Penn Treebank-3 and compare it to the state-of-the-art disfluency removal models. Our model surpasses existing techniques in word-based precision (by up to 1%) and has shown favorable recall and F1 scores.

pdf abs
DADIT: A Dataset for Demographic Classification of Italian Twitter Users and a Comparison of Prediction Methods
Lorenzo Lupo | Paul Bose | Mahyar Habibi | Dirk Hovy | Carlo Schwarz

Social scientists increasingly use demographically stratified social media data to study the attitudes, beliefs, and behavior of the general public. To facilitate such analyses, we construct, validate, and release publicly the representative DADIT dataset of 30M tweets of 20k Italian Twitter users, along with their bios and profile pictures. We enrich the user data with high-quality labels for gender, age, and location. DADIT enables us to train and compare the performance of various state-of-the-art models for the prediction of the gender and age of social media users. In particular, we investigate if tweets contain valuable information for the task, since popular classifiers like M3 don’t leverage them. Our best XLM-based classifier improves upon the commonly used competitor M3 by up to 53% F1. Especially for age prediction, classifiers profit from including tweets as features. We also confirm these findings on a German test set.

pdf abs
DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition
Yi-Cheng Wang | Hsin-Wei Wang | Bi-Cheng Yan | Chi-Han Lin | Berlin Chen

End-to-end automatic speech recognition (E2E ASR) systems often suffer from mistranscription of domain-specific phrases, such as named entities, sometimes leading to catastrophic failures in downstream tasks. A family of fast and lightweight named entity correction (NEC) models for ASR have recently been proposed, which normally build on phonetic-level edit distance algorithms and have shown impressive NEC performance. However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information to facilitate mitigation of phonetic confusion for NEC on ASR transcription. To this end, an efficient entity description augmented masked language model (EDA-MLM) comprised of a dense retrieval model is introduced, enabling MLM to adapt swiftly to domain-specific entities for the NEC task. A series of experiments conducted on the AISHELL-1 and Homophone datasets confirm the effectiveness of our modeling approach. DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), by a relative character error rate (CER) reduction of about 7% on AISHELL-1 for named entities. More notably, when tested on Homophone, which contains named entities of high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC for named entities. The code is available at https://github.com/Amiannn/Dancer.
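To show why phonetic edit distance alone struggles with homophones, here is a minimal sketch of edit-distance-based entity lookup. It illustrates the general baseline idea only, not the PED-NEC or DANCER implementation; the phoneme sequences and entity names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (e.g. lists of phonemes)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def closest_entity(heard, entity_phonemes):
    """Return the entity whose phoneme sequence is nearest to what was heard."""
    return min(entity_phonemes,
               key=lambda name: edit_distance(heard, entity_phonemes[name]))

# Hypothetical entity list: two entries that are pronounced identically.
entities = {
    "entity_A": ["l", "i", "w", "ei"],
    "entity_B": ["l", "i", "w", "ei"],
    "entity_C": ["zh", "a", "ng", "s", "a", "n"],
}
heard = ["l", "i", "w", "e"]
print(closest_entity(heard, entities))
print(edit_distance(heard, entities["entity_A"]) ==
      edit_distance(heard, entities["entity_B"]))
```

Because the two homophones are equally distant from the heard sequence, the lookup can only break the tie arbitrarily, which is the kind of ambiguity that additional information such as entity descriptions is meant to resolve.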

DanteLLM: Let’s Push Italian LLM Research Forward!
Andrea Bacciu | Cesare Campagnano | Giovanni Trappolini | Fabrizio Silvestri

In recent years, the dominance of Large Language Models (LLMs) in the English language has become evident. However, there remains a pronounced gap in resources and evaluation tools tailored for non-English languages, underscoring a significant disparity in the global AI landscape. This paper seeks to bridge this gap, specifically focusing on the Italian linguistic context. We introduce a novel benchmark, and an open LLM Leaderboard, designed to evaluate LLMs’ performance in Italian, providing a rigorous framework for comparative analysis. In our assessment of currently available models, we highlight their respective strengths and limitations against this standard. Crucially, we propose “DanteLLM”, a state-of-the-art LLM dedicated to Italian. Our empirical evaluations underscore Dante’s superiority, as it emerges as the most performant model on our benchmark, with improvements of up to 6 points. This research not only marks a significant stride in Italian-centric natural language processing but also offers a blueprint for the development and evaluation of LLMs in other languages, championing a more inclusive AI paradigm. Our code is available at: https://github.com/RSTLess-research/DanteLLM

In this paper, we present the DARIUS (Digital Argumentation Instruction for Science) corpus for argumentation quality on 4589 essays written by 1839 German secondary school students. The corpus is annotated according to a fine-grained annotation scheme, ranging from a broader perspective like content zones, to more granular features like argumentation coverage/reach and argumentative discourse units like claims and warrants. The features have inter-annotator agreements up to 0.83 Krippendorff’s α. The corpus and dataset are publicly available for further research in argument mining.

Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus
Gabriel de Jesus | Sérgio Sobral Nunes

This paper proposes Labadain Crawler, a data collection pipeline tailored to automate and optimize the process of constructing textual corpora from the web, specifically targeting low-resource languages. The system is built on top of Nutch, an open-source web crawler and data extraction framework, and incorporates language processing components such as a tokenizer and a language identification model. The pipeline’s efficacy is demonstrated through successful testing with Tetun, one of Timor-Leste’s official languages, resulting in the construction of a high-quality Tetun text corpus comprising 321.7k sentences extracted from over 22k web pages. The contributions of this paper include the development of a Tetun tokenizer, a Tetun language identification model, and a Tetun text corpus, marking an important milestone in Tetun text information retrieval.

Clinical NLP research faces a scarcity of publicly available datasets due to privacy concerns. MIMIC-III marked a significant milestone, enabling substantial progress, and now, with MIMIC-IV, the dataset has expanded significantly, offering a broader scope. In this paper, we focus on the task of predicting clinical outcomes from clinical text. This is crucial in modern healthcare, aiding in preventive care, differential diagnosis, and capacity planning. We introduce a novel clinical outcome prediction dataset derived from MIMIC-IV. Furthermore, we provide initial insights into the performance of models trained on MIMIC-III when applied to our new dataset, with specific attention to potential data drift. We investigate challenges tied to evolving documentation standards and changing codes in the International Classification of Diseases (ICD) taxonomy, such as the transition from ICD-9 to ICD-10. We also explore variations in clinical text across different hospital wards. Our study aims to probe the robustness and generalization of clinical outcome prediction models, contributing to the ongoing advancement of clinical NLP in healthcare.

Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks
Ileana Rugina | Rumen Dangovski | Li Jing | Preslav Nakov | Marin Soljacic

Attention mechanisms play a crucial role in the neural revolution of Natural Language Processing (NLP). With the growth of attention-based models, several pruning techniques have been developed to identify and exploit sparseness, making these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, guiding future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding in the development of improved models for existing or new NLP applications. We have demonstrated this with encoder and autoregressive transformer models using Triton GPU kernels and make our code publicly available at https://github.com/irugina/AP
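
The idea of deriving a single global sparseness mask from attention patterns observed on a fixed dataset can be sketched as follows. This is only an illustration under stated assumptions, not the released AP code: averaging the attention maps and pruning the smallest averaged entries is an assumed selection rule, and the function names and toy matrices are hypothetical. A real implementation would skip the masked computation (the abstract mentions Triton GPU kernels) rather than zero it afterwards.

```python
def global_sparsity_mask(attention_maps, prune_fraction):
    """Average attention maps and keep only the largest averaged entries.

    attention_maps: list of n x n matrices (lists of lists), one per example.
    prune_fraction: fraction of positions to mask out (0 <= fraction < 1).
    Returns an n x n matrix of booleans; True means the position is kept.
    """
    n = len(attention_maps[0])
    count = len(attention_maps)
    mean = [[sum(m[i][j] for m in attention_maps) / count for j in range(n)]
            for i in range(n)]
    # Sort positions by averaged attention weight, smallest first.
    flat = sorted((mean[i][j], i, j) for i in range(n) for j in range(n))
    n_prune = int(prune_fraction * n * n)
    pruned = {(i, j) for _, i, j in flat[:n_prune]}
    return [[(i, j) not in pruned for j in range(n)] for i in range(n)]


def apply_mask(attention, mask):
    """Zero out masked positions and renormalise each row to sum to 1."""
    out = []
    for row, keep in zip(attention, mask):
        kept = [a if k else 0.0 for a, k in zip(row, keep)]
        total = sum(kept)
        out.append([a / total if total > 0 else 0.0 for a in kept])
    return out


# Two made-up 2 x 2 attention maps standing in for the fixed dataset.
maps = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]
mask = global_sparsity_mask(maps, prune_fraction=0.5)
masked = apply_mask(maps[0], mask)
```

Because the mask is computed once from the dataset and then reused for every input, it is global in the sense the abstract describes, in contrast to pruning decisions made per example.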

Dataset for Identification of Homophobia and Transphobia for Telugu, Kannada, and Gujarati
Prasanna Kumar Kumaresan | Rahul Ponnusamy | Dhruv Sharma | Paul Buitelaar | Bharathi Raja Chakravarthi

Users of social media platforms are negatively affected by the proliferation of hate or abusive content. There has been a rise in homophobic and transphobic content in recent years targeting LGBT+ individuals. The increasing levels of homophobia and transphobia online can make online platforms harmful and threatening for LGBT+ persons, potentially inhibiting equality, diversity, and inclusion. We introduce a new dataset for three languages, namely Telugu, Kannada, and Gujarati. Additionally, we have created an expert-labeled dataset to automatically identify homophobic and transphobic content within comments collected from YouTube. We provided comprehensive annotation rules to educate annotators in this process. We collected approximately 10,000 comments from YouTube for all three languages. As this is the first dataset for this task in these languages, we also developed a baseline model with pre-trained transformers.

Dataset of Quotation Attribution in German News Articles
Fynn Petersen-Frey | Chris Biemann

Extracting who says what to whom is a crucial part of analyzing human communication in today’s abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, creative-commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema enabling various downstream uses for the dataset. The annotations not only specify who said what but also how, in which context, and to whom, and define the type of quotation. We specify our annotation schema, describe the creation of the dataset and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset and outline use cases of our dataset in downstream tasks.

DC-MBR: Distributional Cooling for Minimum Bayesian Risk Decoding
Jianhao Yan | Jin Xu | Fandong Meng | Jie Zhou | Yue Zhang

Minimum Bayesian Risk Decoding (MBR) emerges as a promising decoding algorithm in Neural Machine Translation. However, MBR performs poorly with label smoothing, which is surprising as label smoothing provides decent improvement with beam search and improves generality in various tasks. In this work, we show that the issue arises from the inconsistency of label smoothing on the token-level and sequence-level distributions. We demonstrate that even though label smoothing only causes a slight change in the token level, the sequence-level distribution is highly skewed. We coin the issue autoregressive over-smoothness. To address this issue, we propose a simple and effective method, Distributional Cooling MBR (DC-MBR), which manipulates the entropy of output distributions by tuning down the Softmax temperature. We theoretically prove the equivalence between the pre-tuning label smoothing factor and distributional cooling. Extensive experiments on NMT benchmarks validate that distributional cooling improves MBR in various settings.
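
The effect of tuning down the Softmax temperature on the entropy of an output distribution can be seen in a small sketch. It shows only the cooling step on a single made-up token distribution, not MBR decoding or the authors' implementation; the logits and function names are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, 0.1]                   # made-up token logits
warm = softmax(logits, temperature=1.0)         # unmodified distribution
cool = softmax(logits, temperature=0.5)         # "cooled" distribution
```

With these values, `entropy(cool)` is lower than `entropy(warm)` and the most likely token gains probability mass, which is the sense in which cooling counteracts the flattening that label smoothing introduces.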

Differential diagnosis (DDx) is vital for physicians and challenging due to the existence of numerous diseases and their complex symptoms. Model training for this task is generally hindered by limited data access due to privacy concerns. To address this, we present DDxGym, a specialized OpenAI Gym environment for clinical differential diagnosis. DDxGym formulates DDx as a natural-language-based reinforcement learning (RL) problem, where agents emulate medical professionals, selecting examinations and treatments for patients with randomly sampled diseases. This RL environment utilizes data labeled from online resources, evaluated by medical professionals for accuracy. Transformers, while effective for encoding text in DDxGym, are unstable in online RL. For that reason we propose a novel training method using an auxiliary masked language modeling objective for policy optimization, resulting in model stabilization and significant performance improvement over strong baselines. Following this approach, our agent effectively navigates large action spaces and identifies universally applicable actions. All data, environment details, and implementation, including experiment reproduction code, are made publicly available.

Dealing with Data Scarcity in Spoken Question Answering
Merve Ünlü Menevşe | Yusufcan Manav | Ebru Arisoy | Arzucan Özgür

This paper focuses on dealing with data scarcity in spoken question answering (QA) using automatic question-answer generation and a carefully selected fine-tuning strategy that leverages limited annotated data (paragraphs and question-answer pairs). Spoken QA is a challenging task because it relies on spoken documents, i.e., erroneous automatic speech recognition (ASR) transcriptions, and because spoken QA data is scarce. We propose a framework for utilizing limited annotated data effectively to improve spoken QA performance. To deal with data scarcity, we train a question-answer generation model with annotated data and then produce large amounts of question-answer pairs from unannotated data (paragraphs). Our experiments demonstrate that incorporating limited annotated data and the automatically generated data through a carefully selected fine-tuning strategy leads to a 5.5% relative F1 gain over the model trained only with annotated data. Moreover, the proposed framework is also effective under high ASR error rates.

Debiasing Multi-Entity Aspect-Based Sentiment Analysis with Norm-Based Data Augmentation
Scott Friedman | Joan Zheng | Hillel Steinmetz

Bias in NLP models may arise from using pre-trained transformer models trained on biased corpora, or by training or fine-tuning directly on corpora with systemic biases. Recent research has explored strategies for reducing measurable biases in NLP predictions while maintaining prediction accuracy on held-out test sets, e.g., by modifying word embedding geometry after training, using purpose-built neural modules for training, or automatically augmenting training data with examples designed to reduce bias. This paper focuses on a debiasing strategy for aspect-based sentiment analysis (ABSA) by augmenting the training data using norm-based language templates derived from previous language resources. We show that the baseline model predicts lower sentiment toward some topics and individuals than others and has relatively high prediction bias (measured by standard deviation), even when the context is held constant. Our results show that our norm-based data augmentation reduces topical bias to less than half while maintaining prediction quality (measured by RMSE), by augmenting the training data by only 1.8%.

Deciphering Emotional Landscapes in the Iliad: A Novel French-Annotated Dataset for Emotion Recognition
Davide Picca | John Pavlopoulos

One of the most significant pieces of ancient Greek literature, the Iliad, is part of humanity’s collective cultural heritage. This work aims to provide the scientific community with an emotion-labeled dataset for classical literature and Western mythology in particular. To model the emotions of the poem, we use a multi-variate time series. We also evaluate the dataset by means of two methods: we compare the manual classification against a dictionary-based benchmark, and we employ a state-of-the-art deep learning masked language model that has been tuned using our data. Both evaluations return encouraging results (MSE and MAE Macro Avg 0.101 and 0.188 respectively) and highlight some interesting phenomena.

DECM: Evaluating Bilingual ASR Performance on a Code-switching/mixing Benchmark
Enes Yavuz Ugan | Ngoc-Quan Pham | Alexander Waibel

Automatic Speech Recognition has made significant progress, but challenges persist. Code-switched (CSW) speech presents one such challenge, involving the mixing of multiple languages by a speaker. Even when multilingual ASR models are trained, each utterance on its own usually remains monolingual. We introduce an evaluation dataset for German-English CSW, with German as the matrix language and English as the embedded language. The dataset comprises spontaneous speech from diverse domains, enabling realistic CSW evaluation in German-English. It includes splits with varying degrees of CSW to facilitate specialized model analysis. As it is difficult to collect CSW data for all language pairs, the provision of such evaluation data is crucial for developing and analyzing ASR models capable of generalizing across unseen pairs. Detailed data statistics are presented, and state-of-the-art (SOTA) multilingual models are evaluated, showing the challenges of CSW speech.

Large language models have demonstrated exceptional capability in natural language understanding and generation. However, their generation speed is limited by the inherently sequential nature of their decoding process, posing challenges for real-time applications. This paper introduces Lexical Unit Decoding (LUD), a novel decoding methodology implemented in a data-driven manner, accelerating the decoding process without sacrificing output quality. The core of our approach is the observation that a pre-trained language model can confidently predict multiple contiguous tokens, forming the basis for a lexical unit, in which these contiguous tokens could be decoded in parallel. Extensive experiments validate that our method substantially reduces decoding time while maintaining generation quality, i.e., 33% speed up on natural language generation with no quality loss, and 30% speed up on code generation with a negligible quality loss of 3%. Distinctively, LUD requires no auxiliary models and does not require changes to existing architectures. It can also be integrated with other decoding acceleration methods, thus achieving an even more pronounced inference efficiency boost. We posit that the foundational principles of LUD could define a new decoding paradigm for future language models, enhancing their applicability for a broader spectrum of applications. All code is publicly available at https://github.com/tjunlp-lab/Lexical-Unit-Decoding-LUD-.

Decoding Probing: Revealing Internal Linguistic Structures in Neural Language Models Using Minimal Pairs
Linyang He | Peili Chen | Ercong Nie | Yuanning Li | Jonathan R. Brennan

Inspired by cognitive neuroscience studies, we introduce a novel “decoding probing” method that uses a minimal pairs benchmark (BLiMP) to probe internal linguistic characteristics in neural language models layer by layer. By treating the language model as the brain and its representations as “neural activations”, we decode grammaticality labels of minimal pairs from the intermediate layers’ representations. This approach reveals: 1) Self-supervised language models capture abstract linguistic structures in intermediate layers that GloVe and RNN language models cannot learn. 2) Information about syntactic grammaticality is robustly captured through the first third layers of GPT-2 and also distributed in later layers. As sentence complexity increases, more layers are required for learning grammatical capabilities. 3) Morphological and semantics/syntax interface-related features are harder to capture than syntax. 4) For Transformer-based models, both embeddings and attentions capture grammatical features but show distinct patterns. Different attention heads exhibit similar tendencies toward various linguistic phenomena, but with varied contributions.

Multi-modal Named Entity Recognition, a fundamental task for multi-modal knowledge graph construction, requires integrating multi-modal information to extract named entities from text. Previous research has explored the integration of multi-modal representations at different granularities. However, these approaches struggle to integrate all these multi-modal representations to provide rich contextual information to improve multi-modal named entity recognition. In this paper, we propose DPE-MNER, which is an iterative reasoning framework that dynamically incorporates all the diverse multi-modal representations following the strategy of “decompose, prioritize, and eliminate”. Within the framework, the fusion of diverse multi-modal representations is decomposed into hierarchically connected fusion layers that are easier to handle. The incorporation of multi-modal information prioritizes transitioning from “easy-to-hard” and “coarse-to-fine”. The explicit modeling of cross-modal relevance eliminates the irrelevant information that would mislead the MNER prediction. Extensive experiments on two public datasets have demonstrated the effectiveness of our approach.

Deconstructing In-Context Learning: Understanding Prompts via Corruption
Namrata Shivagunde | Vladislav Lialin | Sherin Muckatira | Anna Rumshisky

The ability of large language models (LLMs) to “learn in context” based on the provided prompt has led to an explosive growth in their use, culminating in the proliferation of AI assistants such as ChatGPT, Claude, and Bard. These AI assistants are known to be robust to minor prompt modifications, mostly due to alignment techniques that use human feedback. In contrast, the underlying pre-trained LLMs they use as a backbone are known to be brittle in this respect. Building high-quality backbone models remains a core challenge, and a common approach to assessing their quality is to conduct few-shot evaluation. Such evaluation is notorious for being highly sensitive to minor prompt modifications, as well as the choice of specific in-context examples. Prior work has examined how modifying different elements of the prompt can affect model performance. However, these earlier studies tended to concentrate on a limited number of specific prompt attributes and often produced contradictory results. Additionally, previous research either focused on models with fewer than 15 billion parameters or exclusively examined black-box models like GPT-3 or PaLM, making replication challenging. In the present study, we decompose the entire prompt into four components: task description, demonstration inputs, labels, and inline instructions provided for each demonstration. We investigate the effects of structural and semantic corruptions of these elements on model performance. We study models ranging from 1.5B to 70B in size, using ten datasets covering classification and generation tasks. We find that repeating text within the prompt boosts model performance, and bigger models (≥30B) are more sensitive to the semantics of the prompt. Finally, we observe that adding task and inline instructions to the demonstrations enhances model performance even when the instructions are semantically corrupted. The code is available at this URL.

DEEM: Dynamic Experienced Expert Modeling for Stance Detection
Xiaolong Wang | Yile Wang | Sijie Cheng | Peng Li | Yang Liu

Recent work has made a preliminary attempt to use large language models (LLMs) to solve the stance detection task, showing promising results. However, considering that stance detection usually requires detailed background knowledge, the vanilla reasoning method may neglect the domain knowledge needed to make a professional and accurate analysis. Thus, there is still room for improving LLM reasoning, especially in leveraging the generation capability of LLMs to simulate specific experts (i.e., multi-agents) to detect the stance. In this paper, different from existing multi-agent works that require detailed descriptions and use fixed experts, we propose a Dynamic Experienced Expert Modeling (DEEM) method which can leverage the generated experienced experts and let LLMs reason in a semi-parametric way, making the experts more generalizable and reliable. Experimental results demonstrate that DEEM consistently achieves the best results on three standard benchmarks, outperforms methods with self-consistency reasoning, and reduces the bias of LLMs.

Food touches our lives through various endeavors, including flavor, nourishment, health, and sustainability. Recipes are cultural capsules transmitted across generations via unstructured text. Automated protocols for recognizing named entities, the building blocks of recipe text, are of immense value for various applications ranging from information extraction to novel recipe generation. Named entity recognition is a technique for extracting information from unstructured or semi-structured data with known labels. Starting with manually-annotated data of 6,611 ingredient phrases, we created an augmented dataset of 26,445 phrases cumulatively. Simultaneously, we systematically cleaned and analyzed ingredient phrases from RecipeDB, the gold-standard recipe data repository, and annotated them using the Stanford NER. Based on the analysis, we sampled a subset of 88,526 phrases using a clustering-based approach while preserving the diversity to create the machine-annotated dataset. A thorough investigation of NER approaches on these three datasets, involving statistical models, fine-tuning of deep learning-based language models, and few-shot prompting of large language models (LLMs), provides deep insights. We conclude that few-shot prompting on LLMs has abysmal performance, whereas the fine-tuned spaCy-transformer emerges as the best model with macro-F1 scores of 95.9%, 96.04%, and 95.71% for the manually-annotated, augmented, and machine-annotated datasets, respectively.

Deep Reinforcement Learning-based Dialogue Policy with Graph Convolutional Q-network
Kai Xu | Zhengyu Wang | Yuxuan Long | Qiaona Zhao

Deep reinforcement learning (DRL) has been successfully applied to the dialogue policy of task-oriented dialogue systems. However, one challenge in the existing DRL-based dialogue policy methods is their unstructured state-action representations, which lack the ability to learn the relationship between dialogue states and actions. To alleviate this problem, we propose a graph-structured dialogue policy framework for task-oriented dialogue systems. More specifically, we use an unsupervised approach to construct two different bipartite graphs. Then, we generate the user-related and knowledge-related subgraphs by matching dialogue sub-states with bipartite graph nodes. A variant of the graph convolutional network is employed to encode dialogue subgraphs. After that, we use a bidirectional gated recurrent unit (BGRU) and a self-attention mechanism to obtain the high-level historical state representations and employ a neural network for the high-level current state representations. The two state representations are joined to learn the action value of the dialogue policy. Experiments implemented with different DRL algorithms demonstrate that the proposed framework significantly improves the effectiveness and stability of dialogue policies.

Deep Reinforcement Learning with Hierarchical Action Exploration for Dialogue Generation
Itsugun Cho | Ryota Takahashi | Yusaku Yanase | Hiroaki Saito

Traditionally, approximate dynamic programming is employed in dialogue generation with greedy policy improvement through action sampling, as the natural language action space is vast. However, this practice is inefficient for reinforcement learning (RL) due to the sparsity of eligible responses with high action values, which leads to weak improvement sustained by random sampling. This paper presents theoretical analysis and experiments that reveal that the performance of the dialogue policy is positively correlated with the sampling size. To overcome this limitation, we introduce a novel dual-granularity Q-function that explores the most promising response category to intervene in the sampling process. Our approach extracts actions based on a grained hierarchy, thereby achieving the optimum with fewer policy iterations. Additionally, we use offline RL and learn from multiple reward functions designed to capture emotional nuances in human interactions. Empirical studies demonstrate that our algorithm outperforms baselines across automatic metrics and human evaluations. Further testing reveals that our algorithm exhibits both explainability and controllability, and generates responses with higher expected rewards.

In today’s rapidly evolving digital age, disinformation poses a significant threat to public sentiment and socio-political dynamics. To address this, we introduce a new dataset, “DeFaktS”, designed to understand and counter disinformation within German media. Distinctively curated across various news topics, DeFaktS offers an unparalleled insight into the diverse facets of disinformation. Our dataset, containing 105,855 posts with 20,008 meticulously labeled tweets, serves as a rich platform for in-depth exploration of disinformation’s diverse characteristics. A key attribute that sets DeFaktS apart is its fine-grained annotations based on polarized categories. Our annotation framework, grounded in the textual characteristics of news content, eliminates the need for external knowledge sources. Unlike most existing corpora that typically assign a singular global veracity value to news, our methodology seeks to annotate every structural component and semantic element of a news piece, ensuring a comprehensive and detailed understanding. In our experiments, we employed a mix of classical machine learning and advanced transformer-based models. The results underscored the potential of DeFaktS, with transformer models, especially the German variant of BERT, exhibiting pronounced effectiveness in both binary and fine-grained classifications.

A text corpus centered on events is foundational to research concerning the detection, representation, reasoning, and harnessing of online events. The majority of current event-based datasets mainly target sentence-level tasks. To advance event-related research from the sentence level to the document level, this paper introduces DEIE, a unified large-scale document-level event information extraction dataset with over 56,000 events and 242,000 arguments. Three key features stand out: large-scale manual annotation (20,000 documents), comprehensive unified annotation (encompassing event trigger/argument, summary, and relation at once), and emergency events annotation (covering 19 emergency types). Notably, our experiments reveal that current event-related models struggle with DEIE, signaling a pressing need for more advanced event-related research in the future.

Vision-and-Language navigation (VLN) requires an agent to navigate in unseen environments by following natural language instructions. For task completion, the agent needs to align and integrate various navigation modalities, including instruction, observation and navigation history. Existing works primarily concentrate on cross-modal attention at the fusion stage to achieve this objective. Nevertheless, modality features generated by disparate uni-encoders reside in their own spaces, leading to a decline in the quality of cross-modal fusion and decision. To address this problem, we propose a Dual-levEL AligNment (DELAN) framework based on cross-modal contrastive learning. This framework is designed to align various navigation-related modalities before fusion, thereby enhancing cross-modal interaction and action decision-making. Specifically, we divide the pre-fusion alignment into dual levels: instruction-history level and landmark-observation level according to their semantic correlations. We also reconstruct a dual-level instruction for adaptation to the dual-level alignment. As the training signals for pre-fusion alignment are extremely limited, self-supervised contrastive learning strategies are employed to enforce the matching between different modalities. Our approach seamlessly integrates with the majority of existing models, resulting in improved navigation performance on various VLN benchmarks, including R2R, R4R, RxR and CVDN.

Demonstration Retrieval-Augmented Generative Event Argument Extraction
Shiming He | Yu Hong | Shuai Yang | Jianmin Yao | Guodong Zhou

We tackle Event Argument Extraction (EAE) in the manner of template-based generation. Our exploration shows that generative EAE suffers from several issues, such as multiple arguments of one role, generating words out of context and inconsistency with the prescribed format. We attribute these issues to weakness in following complex input prompts. To address these problems, we propose the demonstration retrieval-augmented generative EAE (DRAGEAE), containing two components: an event knowledge-injected generator (EKG) and a demonstration retriever (DR). EKG employs event knowledge prompts to capture role dependencies and semantics. DR aims to search for informative demonstrations from training data, facilitating the conditional generation of EKG. To train DR, we use the probability-based rankings from large language models (LLMs) as supervised signals. Experimental results on ACE-2005, RAMS and WIKIEVENTS demonstrate that our method outperforms all strong baselines and that it can be generalized to various datasets. Further analysis is conducted to discuss the impact of diverse LLMs and to show that our model alleviates the above issues.

Denoising Labeled Data for Comment Moderation Using Active Learning
Andraž Pelicon | Mladen Karan | Ravi Shekhar | Matthew Purver | Senja Pollak

Noisily labeled textual data is ample on internet platforms that allow user-created content. Training models, such as offensive language detection models for comment moderation, on such data may prove difficult as the noise in the labels prevents the model from converging. In this work, we propose to use active learning methods for the purposes of denoising training data for model training. The goal is to sample the most informative examples with noisy labels with active learning and send them to the oracle for reannotation, thus reducing the overall cost of reannotation. In this setting we tested three existing active learning methods, namely DBAL, Variance of Gradients (VoG) and BADGE. The proposed approach to data denoising is tested on the problem of offensive language detection. We observe that active learning can be effectively used for the purposes of data denoising; however, care should be taken when choosing the algorithm for this purpose.

Denoising Table-Text Retrieval for Open-Domain Question Answering
Deokhyung Kang | Baikjin Jung | Yunsu Kim | Gary Geunbae Lee

In table-text open-domain question answering, a retriever system retrieves relevant evidence from tables and text to answer questions. Previous studies in table-text open-domain question answering have two common challenges: firstly, their retrievers can be affected by false-positive labels in training datasets; secondly, they may struggle to provide appropriate evidence for questions that require reasoning across the table. To address these issues, we propose Denoised Table-Text Retriever (DoTTeR). Our approach involves utilizing a denoised training dataset with fewer false positive labels by discarding instances with lower question-relevance scores measured through a false positive detection model. Subsequently, we integrate table-level ranking information into the retriever to assist in finding evidence for questions that demand reasoning across the table. To encode this ranking information, we fine-tune a rank-aware column encoder to identify minimum and maximum values within a column. Experimental results demonstrate that DoTTeR significantly outperforms strong baselines on both retrieval recall and downstream QA tasks. Our code is available at https://github.com/deokhk/DoTTeR.

Purpose: Based on the examples of English and German, we investigate to what extent parsers trained on modern variants of these languages can be transferred to older language levels without loss. Methods: We developed a treebank called DoTT (https://github.com/texttechnologylab/DoTT) which covers, roughly, the time period from 1800 until today, in conjunction with the further development of the annotation tool DependencyAnnotator. DoTT consists of a collection of diachronic corpora enriched with dependency annotations using 3 parsers, 6 pre-trained language models, 5 newly trained models for German, and two tag sets (TIGER and Universal Dependencies). To assess how the different parsers perform on texts from different time periods, we created a gold standard sample as a benchmark. Results: We found that the parsers/models perform quite well on modern texts (document-level LAS ranging from 82.89 to 88.54) and slightly worse on older texts, as expected (average document-level LAS 84.60 vs. 86.14), but not significantly. For German texts, the (German) TIGER scheme achieved slightly better results than UD. Conclusion: Overall, this result speaks for the transferability of parsers to past language levels, at least dating back until around 1800. This very transferability, it is however argued, means that studies of language change in the field of dependency syntax can draw on dependency distance but miss out on some grammatical phenomena.
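
For readers unfamiliar with the metric, the labeled attachment score (LAS) reported above is the share of tokens whose predicted head and dependency label both match the gold annotation. The following is a minimal sketch, assuming each token is represented as a hypothetical (head index, label) pair; the function name and the toy sentence are illustrative and are not taken from DoTT.

```python
def labeled_attachment_score(gold, predicted):
    """Return LAS in percent for two aligned lists of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0
    # A token counts as correct only if both head and label match.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

# Toy sentence with four tokens; head 0 denotes the root (an assumed convention).
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "nmod")]
las = labeled_attachment_score(gold, pred)  # 3 of 4 tokens match
```

A document-level score, as used in the abstract, would apply the same ratio to all tokens of one document rather than to a single sentence.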

Continual learning is an emerging area of machine learning that deals with the issue where models adapt well to the latest data but lose the ability to remember past data due to changes in the data source. A widely adopted solution is to keep a small memory of previously learned data for replay. Most previous studies on continual learning focused on classification tasks, such as image classification and text classification, where the model needs only to categorize the input data. Inspired by the human ability to incrementally learn knowledge and solve different problems using learned knowledge, we consider a more practical scenario: knowledge-based question answering under continual learning. In this scenario, each question differs from the others (that is, different fact triples are needed to answer them), while classification tasks only need to find the feature boundaries of different categories, which are the curves or surfaces that separate categories in the feature space. To address this issue, we propose a depth-aware hierarchical replay framework, which includes a tree-structured classifier to gain a sense of the knowledge distribution and bridge the gap between text classification and question answering for continual learning, a local sampler to capture critical samples, and a depth-aware learning network to reconstruct the feature space of a single learning round. Our experiments demonstrate that the proposed model outperforms previous continual learning methods in mitigating catastrophic forgetting.

Depth-Wise Attention (DWAtt): A Layer Fusion Method for Data-Efficient Classification
Muhammad ElNokrashy | Badr AlKhamissi | Mona Diab

Language Models pretrained on large textual data have been shown to encode different types of knowledge simultaneously. Traditionally, only the features from the last layer are used when adapting to new tasks or data. We put forward that, when using or finetuning deep pretrained models, intermediate layer features that may be relevant to the downstream task are buried too deep to be used efficiently in terms of needed samples or steps. To test this, we propose a new layer fusion method: Depth-Wise Attention (DWAtt), to help re-surface signals from non-final layers. We compare DWAtt to a basic concatenation-based layer fusion method (Concat), and compare both to a deeper model baseline, all kept within a similar parameter budget. Our findings show that DWAtt and Concat are more step- and sample-efficient than the baseline, especially in the few-shot setting. DWAtt outperforms Concat on larger data sizes. On CoNLL-03 NER, layer fusion shows 3.68–9.73% F1 gain at different few-shot sizes. The layer fusion models presented significantly outperform the baseline in various training scenarios with different data sizes, architectures, and training constraints.

Deriving Entity-Specific Embeddings from Multi-Entity Sequences
Connor Heaton | Prasenjit Mitra

Underpinning much of the recent progress in deep learning is the transformer architecture, which takes as input a sequence of embeddings E and emits an updated sequence of embeddings E’. A special [CLS] embedding is often included in this sequence, serving as a description of the sequence once processed and used as the basis for subsequent sequence-level tasks. The processed [CLS] embedding loses utility, however, when the model is presented with a multi-entity sequence and asked to perform an entity-specific task. When processing a multi-speaker dialogue, for example, the [CLS] embedding describes the entire dialogue, not any individual utterance/speaker. Existing methods toward entity-specific prediction involve redundant computation or post-processing outside of the transformer. We present a novel methodology for deriving entity-specific embeddings from a multi-entity sequence completely within the transformer, with a loose definition of entity amenable to many problem spaces. To show the generic applicability of our method, we apply it to widely different tasks: emotion recognition in conversation and player performance projection in baseball and show that it can be used to achieve SOTA in both. Code can be found at https://github.com/c-heat16/EntitySpecificEmbeddings.

DET: A Dual-Encoding Transformer for Relational Graph Embedding
Lingbing Guo | Zhuo Chen | Jiaoyan Chen | Qiang Zhang | Huajun Chen

Despite recent successes in natural language processing and computer vision, the Transformer faces scalability issues when processing graphs; e.g., computing full node-to-node attention on knowledge graphs (KGs) with millions of entities is still infeasible. Existing methods mitigate this problem by considering only the local neighbors, sacrificing the Transformer’s ability to attend to elements at any distance. This paper proposes a new Transformer architecture called Dual-Encoding Transformer (DET). DET comprises a structural encoder to aggregate information from nearby neighbors, and a semantic encoder to seek semantically relevant nodes. We adopt a semantic neighbor search approach inspired by multiple sequence alignment (MSA) algorithms used in biological sciences. By stacking the two encoders alternately, similar to the MSA Transformer for protein representation, our method achieves superior performance compared to state-of-the-art attention-based methods on complex relational graphs like KGs and citation networks. Additionally, DET remains competitive for smaller graphs such as molecules.

Detecting Conceptual Abstraction in LLMs
Michaela Regneri | Alhassan Abdelhalim | Soeren Laue

We show a novel approach to detecting noun abstraction within a large language model (LLM). Starting from a psychologically motivated set of noun pairs in taxonomic relationships, we instantiate surface patterns indicating hypernymy and analyze the attention matrices produced by BERT. We compare the results to two sets of counterfactuals and show that we can detect hypernymy in the abstraction mechanism, which cannot solely be related to the distributional similarity of noun pairs. Our findings are a first step towards the explainability of conceptual abstraction in LLMs.

Recent machine translation (MT) systems have overcome language barriers for a wide range of users, yet they still carry the risk of critical meaning deviation. Critical error detection (CED) is a task that identifies an inherent risk of catastrophic meaning distortions in the machine translation output. With the importance of reflecting cultural elements in detecting critical errors, we introduce the culture-aware “Politeness” type in detecting English-Korean critical translation errors. Besides, we facilitate two tasks by providing multiclass labels: critical error detection and critical error type classification (CETC). Empirical evaluations reveal that our introduced data augmentation approach using a newly presented perturber significantly outperforms existing baselines in both tasks. Further analysis highlights the significance of multiclass labeling by demonstrating its superior effectiveness compared to binary labels.

Cybercrime is a serious and growing threat affecting millions of people worldwide. Detecting cybercrimes from text messages is challenging, as it requires understanding the linguistic and cultural nuances of different languages and regions. Roman Urdu is a widely used language in Pakistan and other South Asian countries; however, it lacks sufficient resources and tools for natural language processing and cybercrime detection. To address this problem, we make three main contributions in this paper. (1) We create and release CRU, a benchmark dataset for text-based cybercrime detection in Roman Urdu, which covers a number of cybercrimes as defined by the Prevention of Electronic Crimes Act (PECA) of Pakistan. This dataset is annotated by experts following a standardized procedure based on Pakistan’s legal framework. (2) We perform experiments on four pre-trained language models (PLMs) for cybercrime text classification in Roman Urdu. Our results show that xlm-roberta-base is the best model for this task, achieving the highest performance on all metrics. (3) We explore the utility of prompt engineering techniques, namely prefix and cloze prompts, for enhancing the performance of PLMs for low-resource languages such as Roman Urdu. We analyze the impact of different prompt shapes and k-shot settings on the performance of xlm-roberta-base and bert-base-multilingual-cased. We find that prefix prompts are more effective than cloze prompts for Roman Urdu classification tasks, as they provide more contextually relevant completions for the models. Our work provides useful insights and resources for future research on cybercrime detection and text classification in low-resource languages.

We explore a strategy to handle controversial topics in LLM-based chatbots based on Wikipedia’s Neutral Point of View (NPOV) principle: acknowledge the absence of a single true answer and surface multiple perspectives. We frame this as retrieval augmented generation, where perspectives are retrieved from a knowledge base and the LLM is tasked with generating a fluent and faithful response from the given perspectives. As a starting point, we use a deterministic retrieval system and then focus on common LLM failure modes that arise during this approach to text generation, namely hallucination and coverage errors. We propose and evaluate three methods to detect such errors based on (1) word-overlap, (2) salience, and (3) LLM-based classifiers. Our results demonstrate that LLM-based classifiers, even when trained only on synthetic errors, achieve high error detection performance, with ROC AUC scores of 95.3% for hallucination and 90.5% for coverage error detection on unambiguous error cases. We show that when no training data is available, our other methods still yield good results on hallucination (84.0%) and coverage error (85.2%) detection.
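
The word-overlap idea in (1) can be illustrated with a small sketch. It is a toy under stated assumptions, not the authors' detector: the tokenizer, the threshold of 0.5, the function names, and the example texts are all hypothetical. A response sentence is flagged as a possible hallucination when too few of its words appear in any retrieved perspective, and a perspective is flagged as a possible coverage error when too few of its words appear in the response.

```python
import re

def tokens(text):
    """Lowercased word set; a deliberately crude tokenizer (assumption)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap(source, target):
    """Fraction of the words in `source` that also occur in `target`."""
    src = tokens(source)
    if not src:
        return 0.0
    return len(src & tokens(target)) / len(src)

def flag_errors(perspectives, response_sentences, threshold=0.5):
    """Return (possible hallucinations, possible coverage errors)."""
    response = " ".join(response_sentences)
    # A sentence is suspicious if no single perspective supports it well.
    hallucinated = [
        s for s in response_sentences
        if max((overlap(s, p) for p in perspectives), default=0.0) < threshold
    ]
    # A perspective is uncovered if the response reuses few of its words.
    uncovered = [p for p in perspectives if overlap(p, response) < threshold]
    return hallucinated, uncovered

perspectives = [
    "Supporters argue the policy reduces traffic in the city centre.",
    "Critics argue the policy hurts small local businesses.",
]
response = [
    "Supporters argue the policy reduces traffic in the city centre.",
    "The policy was invented by a famous physicist in 1905.",
]
hallucinated, uncovered = flag_errors(perspectives, response)
```

In this toy case the second response sentence is flagged as unsupported and the second perspective as uncovered. Plain word overlap ignores paraphrase, which is one plausible reason a trained classifier could do better.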

Impact assessment is an evolving area of research that aims at measuring and predicting the potential effects of projects or programs. Measuring the impact of scientific research is a vibrant subdomain, closely intertwined with impact assessment. A recurring obstacle pertains to the absence of an efficient framework which can facilitate the analysis of lengthy reports and text labeling. To address this issue, we propose a framework for automatically assessing the impact of scientific research projects by identifying pertinent sections in project reports that indicate the potential impacts. We leverage a mixed-method approach, combining manual annotations with supervised machine learning, to extract these passages from project reports. We experiment with different machine learning algorithms, including traditional statistical models as well as pre-trained transformer language models. Our experiments show that our proposed method achieves accuracy scores up to 0.81, and that our method is generalizable to scientific research from different domains and different languages.

Detecting Loanwords in Emakhuwa: An Extremely Low-Resource Bantu Language Exhibiting Significant Borrowing from Portuguese
Felermino Dario Mario Ali | Henrique Lopes Cardoso | Rui Sousa-Silva

The accurate identification of loanwords within a given text holds significant potential as a valuable tool for addressing data augmentation and mitigating data sparsity issues. Such identification can improve the performance of various natural language processing tasks, particularly in the context of low-resource languages that lack standardized spelling conventions. This research proposes a supervised method to identify loanwords in Emakhuwa, borrowed from Portuguese. Our methodology encompasses a two-fold approach. Firstly, we employ traditional machine learning algorithms incorporating handcrafted features, including language-specific and similarity-based features. We build upon prior studies to extract similarity features and propose utilizing two external resources: a Sequence-to-Sequence model and a dictionary. This innovative approach allows us to identify loanwords solely by analyzing the target word without prior knowledge about its donor counterpart. Furthermore, we fine-tune the pre-trained CANINE model for the downstream task of loanword detection, which culminates in the impressive achievement of the F1-score of 93%. To the best of our knowledge, this study is the first of its kind focusing on Emakhuwa, and the preliminary results are promising as they pave the way to further advancements.

While detecting offensive language in online spaces remains an important societal issue, there is still a significant gap in existing research and practical datasets specific to chatbots. Furthermore, many of the current efforts by service providers to automatically filter offensive language are vulnerable to users’ deliberate text manipulation tactics, such as misspelling words. In this study, we analyze offensive language patterns in real logs of 6,254,261 chat utterance pairs from the commercial chat service Simsimi, which cover a variety of conversation topics. Based on the observed patterns, we introduce a novel offensive language detection method: a contrastive learning model that embeds chat content with a random masking strategy. We show that this model outperforms existing models in detecting offensive language in open-domain chat conversations while also demonstrating robustness against users’ deliberate text manipulation tactics when using offensive language. We release our curated chatbot dataset to foster research on offensive language detection in open-domain conversations and share lessons learned from mitigating offensive language on a live platform.

Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts
Thibault Clerice

In this study, we propose to evaluate the use of deep learning methods for semantic classification at the sentence level to accelerate the process of corpus building in the field of humanities and linguistics, a traditional and time-consuming task. We introduce a novel corpus comprising around 2500 sentences spanning from 300 BCE to 900 CE including sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (centuries, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach, achieving high precision and true positive rates (TPR) of respectively 70.60% and 86.33% using HAN. We evaluate the impact of the dataset size on the model performances (420 instead of 2013 training samples), and show that, while our models perform worse, they still offer a high enough precision and TPR, even without MLM, respectively 69% and 51%. Given the result, we provide an analysis of the attention mechanism as a supporting added value for humanists in order to produce more data.

Large Language Models (LLMs) have made significant progress recently. However, their practical use in healthcare is hindered by their tendency to generate hallucinations. One specific type, called snowballing hallucination, occurs when LLMs encounter misleading information, and poses a security threat to LLMs. To understand how well LLMs can resist these hallucinations, we create the Chinese Medical Hallucination Evaluation benchmark (CMHE). This benchmark can be used to evaluate LLMs’ ability to detect medical hallucinations, make accurate diagnoses in noisy conditions, and provide plausible explanations. The creation of this benchmark involves a combination of manual and model-based approaches. In addition, we use ICD-10 as well as MeSH, two specialized glossaries, to aid in the evaluation. Our experiments show that the LLM struggles to identify fake medical terms and makes poor diagnoses in distracting environments. However, improving the model’s understanding of medical concepts can help it resist interference to some extent.

Pronunciation of the phonemic inventory of a new language often presents difficulties to second language (L2) learners. These challenges can be alleviated by the development of pronunciation feedback tools that take speech input from learners and return information about errors in the utterance. This paper presents the development of a corpus designed for use in pronunciation feedback research. The corpus is comprised of gold standard recordings from isiZulu teachers and recordings from isiZulu L2 learners that have been annotated for pronunciation errors. Exploring the potential benefits of word-level versus phoneme-level feedback necessitates a speech corpus that has been annotated for errors on the phoneme-level. To aid in this discussion, this corpus of isiZulu L2 speech has been annotated for phoneme-errors in utterances, as well as suprasegmental errors in tone.

Developing a Rhetorical Structure Theory Treebank for Czech
Lucie Polakova | Jiří Mírovský | Šárka Zikánová | Eva Hajicova

We introduce the first version of the Czech RST Discourse Treebank, a collection of Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST), a global coherence model proposed by Mann and Thompson (1988). Each document in the corpus is represented as a single tree-like structure, where discourse units are interconnected through hierarchical rhetorical relations and their relative importance for the main purpose of a text is modeled by the nuclearity principle. The treebank is freely available in the LINDAT/CLARIAH-CZ repository under the Creative Commons license; for some documents, it includes two gold annotations representing divergent yet relevant interpretations. The paper outlines the annotation process, provides corpus statistics and evaluation, and discusses the issue of consistency associated with the global level of textual interpretation. In general, good agreement on the structure and labeling could be achieved on the lowest, local tree level and on the identification of the most central (nuclear) elementary discourse units. Disagreements mostly concerned segmentation and, in the structure, differences in the stepwise process of linking the largest text blocks. The project contributes to the advancement of RST research and its application to real-world text analysis challenges.

Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts
Ali Al-Laith | Alexander Conroy | Jens Bjerring-Hansen | Daniel Hershcovich

We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.

In this paper we describe the development of a text-to-speech system for Māori ‘Avaiki Nui (Cook Islands Māori). We provide details about the process of community collaboration that was followed throughout the project, a continued engagement where we are trying to develop speech and language technology for the benefit of the community. During this process we gathered a group of recordings that we used to train a TTS system. For training we used two approaches, the HMM-based system MaryTTS (Schröder et al., 2011) and the deep learning system FastSpeech2 (Ren et al., 2020). We performed two evaluation tasks on the models. First, we measured their quality by having the synthesized speech transcribed by ASR. The human-produced ground truth had the lowest error rates (CER=4.3, WER=18), but the FastSpeech2 audio had lower error rates (CER=11.8 and WER=42.7) than the MaryTTS voice (CER=17.9 and WER=48.1). The second evaluation was a survey amongst speakers of the language so they could judge the voice’s quality. The ground truth was rated with the highest quality (MOS=4.6), but the FastSpeech2 voice had an overall quality of MOS=3.2, which was significantly higher than that of the MaryTTS synthesized recordings (MOS=2.0). We intend to use the FastSpeech2 model to create language learning tools for community members both on the Cook Islands and in the diaspora.
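
Both ASR-based scores in this abstract are edit-distance ratios. The word error rate (WER) divides the word-level Levenshtein distance between the ASR transcript and the reference text by the number of reference words, and the character error rate (CER) does the same over characters. The following is a minimal sketch that reports percentages; the function names and the toy strings are illustrative and are not data or code from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion of a reference unit
                current[j - 1] + 1,          # insertion of a hypothesis unit
                previous[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        previous = current
    return previous[-1]

def error_rate(ref_units, hyp_units):
    """Edit distance normalized by reference length, in percent."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)

def wer(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())

def cer(reference, hypothesis):
    return error_rate(list(reference), list(hypothesis))

# Toy example: the ASR transcript drops one letter from one word.
reference = "the quick brown fox"
hypothesis = "the quick brwn fox"
word_error = wer(reference, hypothesis)  # 1 of 4 words is wrong
char_error = cer(reference, hypothesis)  # 1 of 19 characters is missing
```

Because a single wrong character makes a whole word count as an error, WER is typically higher than CER for the same transcript, which matches the pattern in the figures above.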

DGoT: Dynamic Graph of Thoughts for Scientific Abstract Generation
Xinyu Ning | Yutong Zhao | Yitong Liu | Hongwen Yang

Training language models on domain datasets has achieved significant results in the task of generating scientific paper abstracts. However, such models face problems of generalization and expensive training costs. Using large language models (LLMs) to generate paper abstracts saves the cost of model training. However, due to the hallucination problem of LLMs, it is often necessary to improve the reliability of the results through a multi-round query prompt approach such as Graph of Thoughts (GoT), which also brings additional reasoning costs. In this paper, we propose a Dynamic Graph of Thought (DGoT). It not only inherits the advantages of the existing GoT prompt approach, but also dynamically adjusts the graph structure according to data characteristics while reducing model reasoning cost. Experimental results show that our method’s cost-effectiveness in abstract generation tasks is only 43.7% to 56.4% of other multi-round query prompt approaches. Our code is available at https://github.com/JayceNing/DGoT.

We present the acquisition process and the data of DGS-Fabeln-1, a parallel corpus of German text and videos containing German fairy tales interpreted into the German Sign Language (DGS) by a native DGS signer. The corpus contains 573 segments of videos with a total duration of 1 hour and 32 minutes, corresponding to 1428 written sentences. It is the first corpus of semi-naturally expressed DGS that has been filmed from 7 angles, and one of the few sign language (SL) corpora globally which have been filmed from more than 3 angles and where the listener has been simultaneously filmed. The corpus aims at aiding research in SL linguistics, SL machine translation and affective computing, and is freely available for research purposes at the following address: https://doi.org/10.5281/zenodo.10822097.

Dialogue Systems Can Generate Appropriate Responses without the Use of Question Marks?– a Study of the Effects of “?” for Spoken Dialogue Systems –
Tomoya Mizumoto | Takato Yamazaki | Katsumasa Yoshikawa | Masaya Ohagi | Toshiki Kawamoto | Toshinori Sato

When individuals engage in spoken discourse, various phenomena can be observed that differ from those that are apparent in text-based conversation. While written communication commonly uses a question mark to denote a query, in spoken discourse, queries are frequently indicated by a rising intonation at the end of a sentence. However, numerous speech recognition engines do not append a question mark to recognized queries, presenting a challenge when creating a spoken dialogue system. Specifically, the absence of a question mark at the end of a sentence can impede the generation of appropriate responses to queries in spoken dialogue systems. Hence, we investigate the impact of question marks on dialogue systems, with the results showing that they have a significant impact. Moreover, we analyze specific examples in an effort to determine which types of utterances have an impact on dialogue systems.

We introduce DiaSet, a novel dataset of dialectal Arabic speech, manually transcribed and annotated for two specific downstream tasks: sentiment analysis and named entity recognition. The dataset encapsulates the Palestine dialect, predominantly spoken in Palestine, Israel, and Jordan. Our dataset incorporates authentic conversations between YouTube influencers and their respective guests. Furthermore, we have enriched the dataset with simulated conversations initiated by inviting participants from various locales within the said regions. The participants were encouraged to engage in dialogues with our interviewer. Overall, DiaSet consists of 644.8K tokens and 23.2K annotated instances. Uniform writing standards were upheld during the transcription process. Additionally, we established baseline models by leveraging some of the pre-existing Arabic BERT language models, showcasing the potential applications and efficiencies of our dataset. We make DiaSet publicly available for further research.

Did You Get It? A Zero-Shot Approach to Locate Information Transfers in Conversations
Eliot Maës | Hossam Boudraa | Philippe Blache | Leonor Becerra-Bonache

Interaction theories suggest that the emergence of mutual understanding between speakers in natural conversations depends on the construction of a shared knowledge base (common ground), but no model explains which information is memorized or under which circumstances. Previous works have looked at metrics derived from Information Theory to quantify the dynamics of information exchanged between participants, but do not provide an efficient way to locate information that will enter the common ground. We propose a new method based on the segmentation of a conversation into themes followed by their summarization. We then obtain the location of information transfers by computing the distance between the theme summary and the different utterances produced by a speaker. We evaluate two Large Language Models (LLMs) on this pipeline, on the French conversational corpus Paco-Cheese. More generally, we explore how recent developments in the field of LLMs provide us with the means to implement these new methods and support research into questions that usually rely heavily on human annotators.
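
The last step of this pipeline, comparing a theme summary with each utterance, can be sketched as follows. This is a simplified stand-in under stated assumptions: the paper relies on LLMs, whereas the sketch uses a bag-of-words cosine distance, and the function names and the toy English dialogue are hypothetical. The utterance closest to the summary is taken as the candidate location of the information transfer.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Word counts of the lowercased text (a crude representation, by assumption)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_distance(a, b):
    """1 minus the cosine similarity of the two bag-of-words vectors."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def locate_transfer(summary, utterances):
    """Index of the utterance with the smallest distance to the summary."""
    distances = [cosine_distance(summary, u) for u in utterances]
    return min(range(len(utterances)), key=distances.__getitem__)

summary = "The speaker explains that the cheese is aged in a cave."
utterances = [
    "So what did you do last weekend?",
    "We visited a farm where the cheese is aged in a cave.",
    "Oh really, that sounds nice.",
]
best = locate_transfer(summary, utterances)
```

Here the second utterance (index 1) is selected. Swapping `cosine_distance` for a distance over sentence embeddings would keep the same structure while handling paraphrase better.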

This paper presents novel techniques for enhancing the performance of knowledge tracing (KT) models by focusing on the crucial factor of question and concept difficulty level. Despite the acknowledged significance of difficulty, previous KT research has yet to exploit its potential for model optimization and has struggled to predict difficulty from unseen data. To address these problems, we propose a difficulty-centered contrastive learning method for KT models and a Large Language Model (LLM)-based framework for difficulty prediction. These innovative methods seek to improve the performance of KT models and provide accurate difficulty estimates for unseen data. Our ablation study confirms the efficacy of these techniques by showing enhanced KT model performance. Nonetheless, the complex relationship between language and difficulty merits further investigation.

Diffusion Based Counterfactual Augmentation for Dual Sentiment Classification
Dancheng Xin | Jiawei Yuan | Yang Li

State-of-the-art NLP models have demonstrated exceptional performance across various tasks, including sentiment analysis. However, concerns have been raised about their robustness and susceptibility to systematic biases in both training and test data, which may lead to performance challenges when these models encounter out-of-distribution data in real-world applications. Although various data augmentation and adversarial perturbation techniques have shown promise in tackling these issues, prior methods such as word embedding perturbation or synonymous sentence expansion have failed to mitigate the spurious association problem inherent in the original data. Recent counterfactual augmentation methods have attempted to tackle this issue, but they have been limited by rigid rules, resulting in inconsistent context and disrupted semantics. In response to these challenges, we introduce a diffusion-based counterfactual data augmentation (DCA) framework. It utilizes an antonymous paradigm to guide the continuous diffusion model and employs reinforcement learning in combination with contrastive learning to optimize algorithms for generating counterfactual samples with high diversity and quality. Furthermore, we use a dual sentiment classifier to validate the generated antonymous samples and subsequently perform sentiment classification. Our experiments on four benchmark datasets demonstrate that DCA achieves state-of-the-art performance in sentiment classification tasks.

In real-life conversations, the content is diverse, and there exist one-to-many problems that require diverse generation. Previous studies attempted to introduce discrete or Gaussian-based latent variables to address the one-to-many problem, but the diversity is limited. Recently, diffusion models have made breakthroughs in computer vision and some attempts have been made in natural language processing. In this paper, we propose DiffusionDialog, a novel approach to enhance the diversity of dialogue generation with the help of a diffusion model. In our approach, we introduce continuous latent variables into the diffusion model instead of the discrete ones or VAE, which are often used in previous studies. The problem with using discrete variables in the dialog task is how to build an effective prior over the latent space and an inference process that infers the proper latent given the context. Combining the encoder and the latent-based diffusion model, we encode the latent of the response in a continuous space as the prior, instead of the fixed Gaussian distribution in VAE or simply discrete ones, and we infer the latent by denoising step by step with the diffusion model. The experimental results show that our model greatly enhances the diversity of dialog responses while keeping the coherence. In further analysis, we find that our diffusion model achieves high inference efficiency, which is the main challenge of applying diffusion models in natural language processing.

DimA: A Parameter-efficient Fine-tuning Method with Knowledge Transfer Based on Transformer
Wenxuan Zhang | Min Huang | Zhuoyang Song | Qinghai Miao

Fine-tuning is a widely used technique for leveraging pre-trained language models (PLMs) in downstream tasks, but it can be computationally expensive and storage-intensive. To address this challenge, researchers have developed parameter-efficient methods that balance performance and resource cost. However, these methods often come with trade-offs like increased inference latency, token length usage, or limited adaptability for multitasking scenarios. This paper introduces a novel parameter-efficient method called DimA (Dimensionality Augmentation), which enhances the Transformer architecture by increasing the dimensionality. DimA achieves state-of-the-art results in GLUE and XSUM tasks while utilizing less than 1% of the original model’s parameters. Moreover, DimA introduces a novel approach to knowledge transfer that enables the simultaneous utilization of knowledge learned from multiple tasks to handle new tasks. This method significantly enhances the performance of the model on new tasks. Its versatility in model structure also enables its application to various Transformer-based models.

Disambiguating Homographs and Homophones Simultaneously: A Regrouping Method for Japanese
Yo Sato

We present a method that re-groups surface forms into clusters representing synonyms, and helps disambiguate homographs as well as homophones. The method is applied post-hoc to trained contextual word embeddings. It is beneficial to languages where both homographs and homophones abound, which compromise the efficiency of language models and cause the underestimation problem in evaluation. Taking Japanese as an example, we evaluate how accurate such disambiguation can be, and how much the underestimation can be mitigated.

DiscoGeM 2.0: A Parallel Corpus of English, German, French and Czech Implicit Discourse Relations
Frances Yung | Merel Scholman | Sarka Zikanova | Vera Demberg

We present DiscoGeM 2.0, a crowdsourced, parallel corpus of 12,834 implicit discourse relations, with English, German, French and Czech data. We propose and validate a new single-step crowdsourcing annotation method and apply it to collect new annotations in German, French and Czech. The corpus was constructed by having crowdsourced annotators choose a suitable discourse connective for each relation from a set of unambiguous candidates. Every instance was annotated by 10 workers. Our corpus hence represents the first multi-lingual resource that contains distributions of discourse interpretations for implicit relations. The results show that the connective insertion method of discourse annotation can be reliably extended to other languages. The resulting multi-lingual annotations also reveal that implicit relations inferred in one language may differ from those inferred in the translation, meaning the annotations are not always directly transferable. DiscoGeM 2.0 promotes the investigation of cross-linguistic differences in discourse marking and could improve automatic discourse parsing applications. It is openly downloadable here: https://github.com/merelscholman/DiscoGeM.

Discourse Structure for the Minecraft Corpus
Kate Thompson | Julie Hunter | Nicholas Asher

We provide a new linguistic resource: The Minecraft Structured Dialogue Corpus (MSDC), a discourse-annotated version of the Minecraft Dialogue Corpus (MDC; Narayan-Chen et al., 2019), with complete, situated discourse structures in the style of SDRT (Asher and Lascarides, 2003). Our structures feature both linguistic discourse moves and nonlinguistic actions. To show computational tractability, we train a discourse parser with a novel “2 pass architecture” on MSDC that gives excellent results on attachment prediction and relation labeling tasks, especially for long-distance attachments.

Discriminative Language Model as Semantic Consistency Scorer for Prompt-based Few-Shot Text Classification
Zhipeng Xie | Yahe Li

A successful prompt-based finetuning method should have three prerequisites: task compatibility, input compatibility, and evidence abundance. Bearing this belief in mind, this paper designs a novel prompt-based method (called DLM-SCS) for few-shot text classification, which utilizes the discriminative language model ELECTRA that is pretrained to distinguish whether a token is original or replaced. The method is built upon the intuitive idea that the prompt instantiated with the true label should have a higher semantic consistency score than other prompts with false labels. Since a prompt usually consists of several components (or parts), its semantic consistency can be decomposed accordingly, which means each part can provide information for semantic consistency discrimination. The semantic consistency of each component is then computed by making use of the pretrained ELECTRA model, with no extra parameters introduced. Extensive experiments have shown that our model outperforms several state-of-the-art prompt-based few-shot methods on 10 widely-used text classification tasks.

Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation
Frederikus Hudi | Zhi Qu | Hidetaka Kamigaito | Taro Watanabe

Multilingual neural machine translation aims to encapsulate multiple languages into a single model. However, it requires an enormous dataset, leaving the low-resource language (LRL) underdeveloped. As LRLs may benefit from shared knowledge of multilingual representation, we aspire to find effective ways to integrate unseen languages in a pre-trained model. Nevertheless, the intricacy of shared representation among languages hinders its full utilisation. To resolve this problem, we employed target language prediction and a central language-aware layer to improve representation in integrating LRLs. Focusing on improving LRLs in the linguistically diverse country of Indonesia, we evaluated five languages using a parallel corpus of 1,000 instances each, with experimental results measured by BLEU showing zero-shot improvement of 7.4 from the baseline score of 7.1 to a score of 15.5 at best. Further analysis showed that the gains in performance are attributed more to the disentanglement of multilingual representation in the encoder with the shift of the target language-specific representation in the decoder.

This paper presents DISRPT, a multilingual, multi-domain, and cross-framework benchmark dataset for discourse processing, covering the tasks of discourse unit segmentation, connective identification, and relation classification. DISRPT includes 13 languages, with data from 24 corpora covering about 4 million tokens and around 250,000 discourse relation instances from 4 discourse frameworks: RST, SDRT, PDTB, and Discourse Dependencies. We present an overview of the data, its development across three NLP shared tasks on discourse processing carried out in the past five years, and the latest modifications and added extensions. We also carry out an evaluation of state-of-the-art multilingual systems trained on the data for each task, showing plateau performance on segmentation, but important room for improvement for connective identification and relation classification. The DISRPT benchmark employs a unified format that we make available on GitHub and HuggingFace in order to encourage future work on discourse processing across languages, domains, and frameworks.

Code summarization provides a natural language description for a given piece of code. In this work, we focus on scripting code—programming languages that interact with specific devices through commands. The low-resource nature of scripting languages makes traditional code summarization methods challenging to apply. To address this, we introduce a novel framework: distantly supervised contrastive learning for low-resource scripting language summarization. This framework leverages limited atomic commands and category constraints to enhance code representations. Extensive experiments demonstrate our method’s superiority over competitive baselines.

Distillation with Explanations from Large Language Models
Hanyu Zhang | Xiting Wang | Xiang Ao | Qing He

Free-text explanations are crucial for enhancing the interpretability of AI models. However, training models to generate high-quality free-text explanations is challenging, primarily due to the requirement of a substantial amount of human-written explanations, which can be expensive. Recently, large language models (LLMs) like ChatGPT and GPT-4 have made remarkable progress in various NLP tasks while also providing explanations alongside their answers. Leveraging LLMs for data labeling offers a more cost-effective alternative. However, a key concern arises from the fact that the answers provided by LLMs are not entirely accurate, potentially introducing noise to both task outputs and explanation generation. To remedy this, we propose a new mechanism, Distillation with Explanations from LLMs. We observe that despite the incorrectness in LLM-generated answers, their explanations are consistent with their answers. Leveraging this consistency, our method combines the ground truth labels and the answers and explanations generated by LLMs to simultaneously generate more accurate answers and the corresponding free-text explanations. Experimental results demonstrate that our approach achieves improved predictive performance and also generates explanations that exhibit greater alignment with the model’s task outputs.

Distill, Fuse, Pre-train: Towards Effective Event Causality Identification with Commonsense-Aware Pre-trained Model
Peixin Huang | Xiang Zhao | Minghao Hu | Zhen Tan | Weidong Xiao

Event Causality Identification (ECI) aims to detect causal relations between events in unstructured texts. This task is challenged by the lack of data and explicit causal clues. Some methods incorporate explicit knowledge from external knowledge graphs (KGs) into Pre-trained Language Models (PLMs) to tackle these issues, achieving certain accomplishments. However, they ignore that existing KGs usually contain trivial knowledge which may prejudice the performance. Moreover, they simply integrate the concept triplets, underutilizing the deep interaction between the text and external graph. In this paper, we propose an effective pipeline DFP, i.e., Distill, Fuse and Pre-train, to build a commonsense-aware pre-trained model which integrates reliable task-specific knowledge from commonsense graphs. This pipeline works as follows: (1) To leverage the reliable knowledge, commonsense graph distillation is proposed to distill commonsense graphs and obtain the meta-graph, which contains credible task-oriented knowledge. (2) To model the deep interaction between the text and external graph, heterogeneous information fusion is proposed to fuse them through a commonsense-aware memory network. (3) Continual pre-training designs three continual pre-training tasks to further align and fuse the text and the commonsense meta-graph. Through extensive experiments on two benchmarks, we demonstrate the validity of our pipeline.

Distilling Causal Effect of Data in Continual Few-shot Relation Learning
Weihang Ye | Peng Zhang | Jing Zhang | Hui Gao | Moyao Wang

Continual Few-Shot Relation Learning (CFRL) aims to learn an increasing number of new relational patterns from a data stream. However, due to the limited number of samples and the continual training mode, this method frequently encounters catastrophic forgetting. The research on causal inference suggests that this issue is caused by the loss of causal effects from old data during the new training process. Inspired by the causal graph, we propose a unified causal framework for CFRL to restore the causal effects. Specifically, we establish two additional causal paths from old data to predictions by having the new data and memory data collide with old data separately in the old feature space. This augmentation allows us to preserve causal effects effectively and enhance the utilization of valuable information within memory data, thereby alleviating the phenomenon of catastrophic forgetting. Furthermore, we introduce a self-adaptive weight to achieve a delicate balance of causal effects between the new and old relation types. Extensive experiments demonstrate the superiority of our method over existing state-of-the-art approaches in CFRL task settings. Our code is publicly available at: https://github.com/ywh140/CECF.

Distractor Generation Using Generative and Discriminative Capabilities of Transformer-based Models
Shiva Taslimipoor | Luca Benedetto | Mariano Felice | Paula Buttery

Multiple Choice Questions (MCQs) are very common in both high-stakes and low-stakes examinations, and their effectiveness in assessing students relies on the quality and diversity of distractors, which are the incorrect answer options provided alongside the correct answer. Motivated by the progress in generative language models, we propose a two-step automatic distractor generation approach which is based on text-to-text transfer transformer models. Unlike most previous methods for distractor generation, our approach does not rely on the correct answer options. Instead, it first generates both correct and incorrect answer options, and then discriminates potential correct options from distractors. Identified distractors are finally categorised based on semantic similarity scores into separate clusters, and the cluster heads are selected as our final distinct distractors. Experiments on two publicly available datasets show that our approach outperforms previous models both in the case of single-word answer options and longer-sequence reading comprehension questions.

Traditional automated metrics for evaluating conditional natural language generation rely on pairwise comparisons between a single generated text and the best-matching gold-standard reference. This method is effective when ground truth data diversity can be attributed to noise; however, it falls short when diversity in references holds valuable contextual information, as in visual description or summarization, because it does not evaluate the ability of a model to generate text matching the diversity of the ground truth samples. In this paper, we challenge the adequacy of existing metrics in such semantically diverse contexts and introduce a novel approach for evaluating conditional language generation models, leveraging a family of meta-metrics that build on existing pairwise distance functions. These meta-metrics assess not just single samples, but distributions of reference and model-generated captions using small sample sets. We demonstrate our approach through a case study of visual description in the English language which reveals not only how current models prioritize single-description quality over diversity, but further sheds light on the impact of sampling methods and temperature settings on description quality and diversity.

Diversifying Question Generation over Knowledge Base via External Natural Questions
Shasha Guo | Jing Zhang | Xirui Ke | Cuiping Li | Hong Chen

Previous methods on knowledge base question generation (KBQG) primarily focus on refining the quality of a single generated question. However, considering the remarkable paraphrasing ability of humans, we believe that diverse texts can express identical semantics through varied expressions. The above insights make diversifying question generation an intriguing task, where the first challenge is evaluation metrics for diversity. Current metrics inadequately assess the aforementioned diversity. They calculate the ratio of unique n-grams in the generated question, which tends to measure duplication rather than true diversity. Accordingly, we devise a new diversity evaluation metric, which measures the diversity among top-k generated questions for each instance while ensuring their relevance to the ground truth. Clearly, the second challenge is how to enhance diversifying question generation. To address this challenge, we introduce a dual model framework interwoven by two selection strategies to generate diverse questions leveraging external natural questions. The main idea of our dual framework is to extract more diverse expressions and integrate them into the generation model to enhance diversifying question generation. Extensive experiments on widely used benchmarks for KBQG show that our approach can outperform pre-trained language model baselines and text-davinci-003 in diversity while achieving comparable performance with ChatGPT.
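The unique n-gram ratio that the abstract criticizes can be illustrated with a minimal sketch. The function name `distinct_n`, the whitespace tokenization, and the example questions are assumptions made for illustration; this is not the new metric proposed in the paper, only the kind of per-question ratio it argues against.

```python
def distinct_n(question: str, n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams in one whitespace-tokenized text."""
    tokens = question.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Two made-up paraphrases of the same question. Neither repeats a bigram,
# so both get the maximum score, although the score says nothing about
# how different the two questions are from each other.
q1 = "who directed the film inception"
q2 = "which person was the director of inception"
scores = [distinct_n(q1), distinct_n(q2)]

# A degenerate output with internal repetition is penalized instead.
repeated = distinct_n("the film the film the film")
```

Under these assumptions the ratio reacts to repetition inside a single output rather than to variety across the top-k outputs, which is the gap the abstract describes.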

DMON: A Simple Yet Effective Approach for Argument Structure Learning
Sun Wei | Mingxiao Li | Jingyuan Sun | Jesse Davis | Marie-Francine Moens

Argument structure learning (ASL) entails predicting relations between arguments. Because it can structure a document to facilitate its understanding, it has been widely applied in many fields (medical, commercial, and scientific domains). Despite its broad utilization, ASL remains a challenging task because it involves examining the complex relationships between the sentences in a potentially unstructured discourse. To resolve this problem, we have developed a simple yet effective approach called Dual-tower Multi-scale cOnvolution neural Network (DMON) for the ASL task. Specifically, we organize arguments into a relationship matrix that together with the argument embeddings forms a relationship tensor and design a mechanism to capture relations with contextual arguments. Experimental results on three different-domain argument mining datasets demonstrate that our framework outperforms state-of-the-art models. We will release the code after paper acceptance.

Table-text document (e.g., financial reports) understanding has attracted increasing attention in the past two years. TAT-DQA is a realistic setting for the understanding of visually-rich table-text documents, which involves answering associated questions requiring discrete reasoning. Most existing work relies on token-level semantics, falling short in reasoning across document elements such as quantities and dates. To address this limitation, we propose a novel Doc2SoarGraph model that exploits element-level semantics and employs Semantic-oriented hierarchical Graph structures to capture the differences and correlations among different elements within the given document and question. Extensive experiments on the TAT-DQA dataset reveal that our model surpasses the state-of-the-art conventional method (i.e., MHST) and large language model (i.e., ChatGPT) by 17.73 and 6.49 points respectively in terms of the Exact Match (EM) metric, demonstrating exceptional effectiveness.

We propose DOC-RAG - Domain-distributed Co-occurrence Retrieval Augmentation for ASR language model personalization aiming to improve the automatic speech recognition of rare word patterns in unseen domains. Our approach involves contrastively training a document retrieval module to rank external knowledge domains based on their semantic similarity with respect to the input query. We further use n-gram co-occurrence distribution to recognize rare word patterns associated with specific domains. We aggregate the next word probability distribution based on the relative importance of different domains. Extensive experiments on three user-specific speech-to-text tasks for meetings, TED talks, and financial earnings calls show that DOC-RAG significantly outperforms strong baselines with an 8-15% improvement in terms of perplexity and a 4-7% reduction in terms of Word Error Rates in various settings.
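The aggregation step mentioned in the abstract can be pictured as a weighted mixture of per-domain next-word distributions. The sketch below is a simplified illustration under stated assumptions: the function `mix_distributions`, the relevance scores, and the toy vocabularies are all hypothetical, and the paper's actual weighting scheme is not given in the abstract.

```python
def mix_distributions(domain_probs, domain_scores):
    """Combine per-domain next-word distributions using normalized domain scores."""
    total = sum(domain_scores.values())
    weights = {d: s / total for d, s in domain_scores.items()}
    mixed = {}
    for domain, probs in domain_probs.items():
        for word, p in probs.items():
            mixed[word] = mixed.get(word, 0.0) + weights[domain] * p
    return mixed


# Hypothetical relevance scores, e.g. as produced by a retrieval module.
scores = {"finance": 3.0, "sports": 1.0}

# Hypothetical next-word distributions estimated separately per domain.
probs = {
    "finance": {"ebitda": 0.5, "revenue": 0.5},
    "sports": {"goal": 0.8, "revenue": 0.2},
}

mixed = mix_distributions(probs, scores)
```

With these toy numbers the rare finance term keeps most of its probability mass because the finance domain carries three quarters of the weight.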

We present a novel task of document-level script event prediction, which aims to predict the next event given a candidate list of narrative events in long-form documents. To enable this, we introduce DocSEP, a challenging dataset in two new domains - contractual documents and Wikipedia articles, where timeline events may be paragraphs apart and may require multi-hop temporal and causal reasoning. We benchmark existing baselines and present a novel architecture called DocScript to learn sequential ordering between events at the document scale. Our experimental results on the DocSEP dataset demonstrate that learning longer-range dependencies between events is a key challenge and show that contemporary LLMs such as ChatGPT and FlanT5 struggle to solve this task, indicating their lack of reasoning abilities for understanding causal relationships and temporal sequences within long texts.

Document-level Event Extraction (DEE) is a vital task in NLP as it seeks to automatically recognize and extract event information from a document. However, current approaches often overlook intricate relationships among events and subtle correlations among arguments within a document, which can significantly impact the effectiveness of event type recognition and the extraction of cross-sentence arguments in the DEE task. This paper proposes a novel Correlation Association Interactive Network (CAINet), comprising two key components: an event relationship graph and an argument correlation graph. In particular, the event relationship graph models the relationship among various events through structural associations among event nodes and sentence nodes, to improve the accuracy of event recognition. On the other hand, the argument correlation graph models the correlations among arguments by quantifying the strength of association among arguments, to effectively aggregate cross-sentence arguments, contributing to the overall success of DEE. Furthermore, we also run DEE experiments with a large language model. Experimental results show the proposed CAINet outperforms existing state-of-the-art models and large language models in terms of F1-score across two benchmark datasets.

The Document Set Expansion (DSE) task involves identifying relevant documents from large collections based on a limited set of example documents. Previous research has highlighted Positive and Unlabeled (PU) learning as a promising approach for this task. However, most PU methods rely on the unrealistic assumption of knowing the class prior for positive samples in the collection. To address this limitation, this paper introduces a novel PU learning framework that utilizes intractable density estimation models. Experiments conducted on PubMed and Covid datasets in a transductive setting showcase the effectiveness of the proposed method for DSE. Code is available from https://github.com/Beautifuldog01/Document-set-expansion-puDE.

Despite their superior performance, Large Language Models (LLMs) require significant computational resources for deployment and use. To overcome this issue, quantization methods have been widely applied to reduce the memory footprint of LLMs as well as increase the inference rate. However, a major challenge is that low-bit quantization methods often lead to performance degradation. It is important to understand how quantization impacts the capacity of LLMs. Different from previous studies focused on overall performance, this work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models. Specifically, we examine the abilities of in-context learning, chain-of-thought reasoning, and instruction-following in quantized LLMs. Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation on the test of these abilities. To improve the performance of low-bit models, we conduct two special experiments: (1) fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning. Our work derives a series of important findings to understand the impact of quantization on emergent abilities and sheds light on the possibilities of extremely low-bit quantization for LLMs.
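To give a feel for why 2-bit settings are so much harsher than 4-bit ones, the sketch below applies a plain symmetric round-to-nearest quantizer to a handful of made-up weights. This is an assumption chosen for illustration, not the quantization method evaluated in the paper, and the helper names are hypothetical.

```python
def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest uniform quantization, mapped back to floats."""
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    restored = []
    for w in weights:
        q = round(w / scale)
        q = max(-levels, min(levels, q))  # clamp to the representable range
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(x - y) for x, y in zip(original, restored)) / len(original)


# Made-up weights, not taken from any real model.
weights = [0.9, -0.55, 0.31, 0.12, -0.07, 0.02, -0.8, 0.44]
err4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
err2 = mean_abs_error(weights, quantize_dequantize(weights, 2))
```

On this toy vector the 2-bit reconstruction error is several times the 4-bit error, since only the values -1, 0, and +1 times the scale remain available.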

Recently, ChatGPT has demonstrated remarkable performance in various downstream tasks such as open-domain question answering, machine translation, and code generation. As a general-purpose task solver, an intriguing inquiry arises: Does ChatGPT itself know that it does not know, without any access to internal states? In response to this query, we present an initial evaluation of ChatGPT for black-box calibration. We designed three types of proxy confidence from three perspectives to assess its performance. Experiments are conducted on five datasets, spanning four tasks, and the results show that ChatGPT has a degree of capability for black-box calibration. Specifically, proxy confidence displayed a significantly positive Pearson correlation (95.16%) with accuracy in the TruthfulQA dataset, while revealing a negative correlation in the ModAr dataset. We delved deeper into ChatGPT’s black-box calibration ability by examining failure cases in the ModAr dataset. Our analysis revealed that ChatGPT’s tendency to exhibit overconfidence may stem from its reliance on semantic priors. Furthermore, we investigated why ChatGPT performs relatively well in TruthfulQA. The findings suggest that ChatGPT might implicitly acquire calibration skills during the reinforcement learning process, rather than relying solely on simplistic heuristics.
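The Pearson correlation between proxy confidence and accuracy reported above can be computed as in the following minimal sketch. The confidence bins and accuracy values are invented for illustration and are not the paper's data; they merely show a positively and a negatively correlated case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Invented proxy-confidence bins and per-bin accuracies.
confidence = [0.2, 0.4, 0.6, 0.8, 1.0]
accuracy_calibrated = [0.25, 0.35, 0.65, 0.75, 0.95]   # rises with confidence
accuracy_overconfident = [0.7, 0.6, 0.5, 0.45, 0.3]    # falls with confidence

r_positive = pearson(confidence, accuracy_calibrated)
r_negative = pearson(confidence, accuracy_overconfident)
```

A coefficient near +1 indicates that higher stated confidence goes with higher accuracy, while a negative value corresponds to the overconfidence pattern the abstract reports for ModAr.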

The present study introduces the knowledge-augmented generator, which is specifically designed to produce information that remains grounded in contextual knowledge, regardless of alterations in the context. Previous research has predominantly focused on examining hallucinations stemming from static input, such as in the domains of summarization or machine translation. However, our investigation delves into the faithfulness of generative question answering in the presence of dynamic knowledge. Our objective is to explore the existence of hallucinations arising from parametric memory when contextual knowledge undergoes changes, while also analyzing the underlying causes for their occurrence. In order to efficiently address this issue, we propose a straightforward yet effective measure for detecting such hallucinations. Intriguingly, our investigation uncovers that all models exhibit a tendency to generate previous answers as hallucinations. To gain deeper insights into the underlying causes of this phenomenon, we conduct a series of experiments that verify the critical role played by context in hallucination, both during training and testing, from various perspectives.

Does the Language Matter? Curriculum Learning over Neo-Latin Languages
Leonardo Ranaldi | Giulia Pucci | André Freitas

Curriculum Learning (CL) has emerged as an effective technique for improving the performance and reducing the cost of pre-training Large Language Models (LLMs). The efficacy of CL, demonstrated in different scenarios, lies in training LLMs by organizing examples from the simplest to the most complex. Although improvements have been shown extensively, this approach was used for pre-training, leaving novel fine-tuning approaches such as instruction-tuning unexplored. In this paper, we propose a novel complexity measure to empower the instruction-tuning method using the CL paradigm. To complement previous works, we propose cognitively motivated measures to determine the complexity of training demonstrations used in the instruction-tuning paradigm. Hence, we experiment with the proposed heuristics first in English and then in other languages. The downstream results show that delivering training examples by complexity ranking is also effective for instruction tuning, as it improves downstream performance while reducing costs. Furthermore, the technique can be easily transferred to languages other than English, e.g., Italian and French, without any adaptation, maintaining functionality and effectiveness.
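Ordering demonstrations from simple to complex can be sketched in a few lines. The complexity function below, a plain token count, is a hypothetical stand-in and not the cognitively motivated measure proposed in the paper; the example records and field names are likewise invented.

```python
def complexity(example):
    """Hypothetical stand-in: whitespace token count of instruction plus response."""
    return len(example["instruction"].split()) + len(example["response"].split())


def curriculum_order(examples):
    """Return the examples sorted from least to most complex."""
    return sorted(examples, key=complexity)


# Invented instruction-tuning demonstrations.
demos = [
    {
        "id": "b",
        "instruction": "Explain why the sky appears blue during the day",
        "response": "Sunlight scatters off air molecules and blue light scatters most",
    },
    {"id": "a", "instruction": "Say hello", "response": "Hello"},
    {"id": "c", "instruction": "Translate cat into Italian", "response": "gatto"},
]

ordered = curriculum_order(demos)
order_ids = [d["id"] for d in ordered]
```

A training loop would then consume `ordered` in sequence, so that short, simple demonstrations are seen before longer ones.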

Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion’s share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we perform an intrinsic evaluation by performing a human evaluation of the quality of samples taken from different corpora; then, we assess the practical impact of the qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in quality of the corpora, with MaCoCu and OSCAR obtaining the best results. However, during the extrinsic evaluation, we actually find that the CC100 corpus achieves the highest scores. We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.

Do Large Language Models Understand Mansplaining? Well, Actually...
Carla Perez Almendros | Jose Camacho-Collados

Gender bias has been widely studied by the NLP community. However, other more subtle variations of it, such as mansplaining, have so far received little attention. Mansplaining is a discriminatory behaviour that consists of a condescending treatment or discourse towards women. In this paper, we introduce and analyze Well, actually..., a corpus of 886 mansplaining stories experienced by women. We analyze the corpus in terms of features such as offensiveness, sentiment or misogyny, among others. We also explore to what extent Large Language Models (LLMs) can understand and identify mansplaining and other gender-related microaggressions. Specifically, we experiment with ChatGPT-3.5-Turbo and LLaMA-2 (13b and 70b), with both targeted and open questions. Our findings suggest that, although they can identify mansplaining to some extent, LLMs still struggle to point out this attitude and will even reproduce some of the social patterns behind mansplaining situations, for instance by praising men for giving unsolicited advice to women.

Domain Adaptation for Dense Retrieval and Conversational Dense Retrieval through Self-Supervision by Meticulous Pseudo-Relevance Labeling
Minghan Li | Eric Gaussier

Recent studies have demonstrated that the ability of dense retrieval models to generalize to target domains with different distributions is limited, which contrasts with the results obtained with interaction-based models. Prior attempts to mitigate this challenge involved leveraging adversarial learning and query generation approaches, but both approaches nevertheless resulted in limited improvements. In this paper, we propose to combine the query-generation approach with a self-supervision approach in which pseudo-relevance labels are automatically generated on the target domain. To accomplish this, a T5-3B model is utilized for pseudo-positive labeling, and meticulous hard negatives are chosen. We also apply this strategy to a conversational dense retrieval model for conversational search. A similar pseudo-labeling approach is used, but with the addition of a query-rewriting module to rewrite conversational queries for subsequent labeling. This proposed approach enables a model’s domain adaptation with real queries and documents from the target dataset. Experiments on standard dense retrieval and conversational dense retrieval models both demonstrate improvements over baseline models when they are fine-tuned on the pseudo-relevance labeled data.

Domain-Agnostic Adapter Architecture for Deception Detection: Extensive Evaluations with the DIFrauD Benchmark
Dainis A. Boumber | Fatima Zahra Qachfar | Rakesh Verma

Despite significant strides in training expansive transformer models, their deployment for niche tasks remains intricate. This paper delves into deception detection, assessing domain adaptation methodologies from a cross-domain lens using transformer Large Language Models (LLMs). We roll out a new corpus with roughly 100,000 honest and misleading statements in seven domains, designed to serve as a benchmark for multidomain deception detection. As a primary contribution, we present a novel parameter-efficient finetuning adapter, PreXIA, which was proposed and implemented as part of this work. The design is model-, domain- and task-agnostic, with broad applications that are not limited by the confines of deception or classification tasks. We comprehensively analyze and rigorously evaluate LLM tuning methods and our original design using the new benchmark, highlighting their strengths, pointing out weaknesses, and suggesting potential areas for improvement. The proposed adapter consistently outperforms all competition on the DIFrauD benchmark used in this study. To the best of our knowledge, it improves on the state-of-the-art in its class for the deception task. In addition, the evaluation process leads to unexpected findings that, at the very least, cast doubt on the conclusions of some recently published research regarding the unequivocal dominance of reasoning ability over representation quality in their relative contributions to a model’s performance and predictions.

Few-shot relation extraction (FSRE) can alleviate the data scarcity problem in relation extraction. However, FSRE models often suffer a significant decline in performance when adapting to new domains. To overcome this issue, many researchers have focused on domain adaption FSRE (DAFSRE). Nevertheless, existing approaches primarily concentrate on the source domain, which makes it difficult to accurately transfer useful knowledge to the target domain. Additionally, the lack of distinction between relations further restricts the model performance. In this paper, we propose the domain-aware and co-adaptive feature transformation approach to address these issues. Specifically, we introduce a domain-aware transformation module that leverages the target domain distribution features to guide the domain-aware feature transformations. This can enhance the model’s adaptability across domains, leading to improved target domain performance. Furthermore, we design co-adaptive prototypical networks to perform co-adaptive feature transformation through a transformer mechanism. This results in more robust and distinguishable relation prototypes. Experiments on DAFSRE benchmark datasets demonstrate the effectiveness of our method, which outperforms existing models and achieves state-of-the-art performance.

Domain adaptation has been widely adopted for cross-domain sentiment analysis to transfer knowledge from the source domain to the target domain. However, most methods assume that the target (test) domain is known, so they generalize poorly when the test data is unknown, as target-domain data is not always available in practice. In this paper, we focus on the problem of domain generalization for cross-domain sentiment analysis. Specifically, we propose a backdoor adjustment-based causal model to disentangle the domain-specific and domain-invariant representations that play essential roles in tackling domain shift. First, we rethink the cross-domain sentiment analysis task in a causal view to model the cause-and-effect relationships among different variables. Then, to learn an invariant feature representation, we remove the effect of domain confounders (e.g., domain knowledge) using the backdoor adjustment. A series of experiments over many homologous and diverse datasets show the strong performance and robustness of our model in comparison with state-of-the-art domain generalization baselines.

Interviews are an effective method to elicit critical skills to perform particular processes in various domains. In order to understand the knowledge structure of these domain-specific processes, we consider semantic role and predicate annotation based on Frame Semantics. We introduce a dataset of interview dialogues with experts in the culinary and gardening domains, each annotated with semantic frames. This dataset consists of (1) 308 interview dialogues related to the culinary domain, originally assembled by Okahisa et al. (2022), and (2) 100 interview dialogues associated with the gardening domain, which we newly acquired. The labeling specifications take into account the domain-transferability by adopting domain-agnostic labels for frame elements. In addition, we conducted domain transfer experiments from the culinary domain to the gardening domain to examine the domain transferability with our dataset. The experimental results showed the effectiveness of our domain-agnostic labeling scheme.

Do Neural Language Models Inferentially Compose Concepts the Way Humans Can?
Amilleah Rodriguez | Shaonan Wang | Liina Pylkkänen

While compositional interpretation is the core of language understanding, humans also derive meaning via inference. For example, while the phrase “the blue hat” introduces a blue hat into the discourse via the direct composition of “blue” and “hat,” the same discourse entity is introduced by the phrase “the blue color of this hat” despite the absence of any local composition between “blue” and “hat.” Instead, we infer that if the color is blue and it belongs to the hat, the hat must be blue. We tested the performance of neural language models and humans on such inferentially driven conceptual compositions, eliciting probability estimates for a noun in a minimally composed phrase, “This blue hat”, following contexts that had introduced the conceptual combinations of those nouns and adjectives either syntactically or inferentially. Surprisingly, our findings reveal significant disparities between the performance of neural language models and human judgments. Among the eight models evaluated, RoBERTa, BERT-large, and GPT-2 exhibited the closest resemblance to human responses, while other models faced challenges in accurately identifying compositions in the provided contexts. Our study reveals that language models and humans may rely on different approaches to represent and compose lexical items across sentence structure. All data and code are accessible at https://github.com/wangshaonan/BlueHat.

DORE: A Dataset for Portuguese Definition Generation
Anna Beatriz Dimas Furtado | Tharindu Ranasinghe | Frederic Blain | Ruslan Mitkov

Definition modelling (DM) is the task of automatically generating a dictionary definition of a specific word. Computational systems that are capable of DM can have numerous applications benefiting a wide range of audiences. As DM is considered a supervised natural language generation problem, these systems require large annotated datasets to train the machine learning (ML) models. Several DM datasets have been released for English and other high-resource languages. While Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, there is no DM dataset available for Portuguese. In this research, we fill this gap by introducing DORE, the first dataset for Definition MOdelling for PoRtuguEse, containing more than 100,000 definitions. We also evaluate several deep learning-based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research and study of Portuguese in wider contexts.

DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures
Agrima Seth | Sanchit Ahuja | Kalika Bali | Sunayana Sitaram

Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering. To be effective globally, these models must be aware of and account for local socio-cultural contexts, making it necessary to have benchmarks to evaluate the models for their cultural familiarity. Since the training data for LLMs is web-based and the Web is limited in its representation of information, it does not capture knowledge present within communities that are not on the Web. Thus, these models exacerbate the inequities, semantic misalignment, and stereotypes from the Web. There has been a growing call for community-centered participatory research methods in NLP. In this work, we respond to this call by using participatory research methods to introduce DOSA, the first community-generated Dataset of 615 Social Artifacts, by engaging with 260 participants from 19 different Indian geographic subcultures. We use a gamified framework that relies on collective sensemaking to collect the names and descriptions of these artifacts such that the descriptions semantically align with the shared sensibilities of the individuals from those cultures. Next, we benchmark four popular LLMs and find that they show significant variation across regional sub-cultures in their ability to infer the artifacts.

DP-CRE: Continual Relation Extraction via Decoupled Contrastive Learning and Memory Structure Preservation
Mengyi Huang | Meng Xiao | Ludi Wang | Yi Du

Continuous Relation Extraction (CRE) aims to incrementally learn relation knowledge from a non-stationary stream of data. Since the introduction of new relational tasks can overshadow previously learned information, catastrophic forgetting becomes a significant challenge in this domain. Current replay-based training paradigms prioritize all data uniformly and train memory samples through multiple rounds, which would result in overfitting old tasks and pronounced bias towards new tasks because of the imbalances of the replay set. To handle the problem, we introduce the DecouPled CRE (DP-CRE) framework that decouples the process of prior information preservation and new knowledge acquisition. This framework examines alterations in the embedding space as new relation classes emerge, distinctly managing the preservation and acquisition of knowledge. Extensive experiments show that DP-CRE significantly outperforms other CRE baselines across two datasets.

Open Domain Multi-Hop Question Answering (ODMHQA) plays a crucial role in Natural Language Processing (NLP) by aiming to answer complex questions through multi-step reasoning over retrieved information from external knowledge sources. Recently, Large Language Models (LLMs) have demonstrated remarkable performance in solving ODMHQA owing to their capabilities including planning, reasoning, and utilizing tools. However, LLMs may generate off-topic answers when attempting to solve ODMHQA, namely the generated answers are irrelevant to the original questions. This issue of off-topic answers accounts for approximately one-third of incorrect answers, yet remains underexplored despite its significance. To alleviate this issue, we propose the Discriminate→Re-Compose→Re-Solve→Re-Decompose (Dr3) mechanism. Specifically, the Discriminator leverages the intrinsic capabilities of LLMs to judge whether the generated answers are off-topic. In cases where an off-topic answer is detected, the Corrector performs step-wise revisions along the reversed reasoning chain (Re-Compose→Re-Solve→Re-Decompose) until the final answer becomes on-topic. Experimental results on the HotpotQA and 2WikiMultiHopQA datasets demonstrate that our Dr3 mechanism considerably reduces the occurrence of off-topic answers in ODMHQA by nearly 13%, improving the performance in Exact Match (EM) by nearly 3% compared to the baseline method without the Dr3 mechanism.

DRAMA: Dynamic Multi-Granularity Graph Estimate Retrieval over Tabular and Textual Question Answering
Ruize Yuan | Xiang Ao | Li Zeng | Qing He

The TableTextQA task requires finding the answer to the question from a combination of tabular and textual data, which has been gaining increasing attention. The row-based approaches have demonstrated remarkable effectiveness. However, they suffer from the following limitations: (1) a lack of interaction between rows; (2) excessively long input lengths; and (3) question attention shifts in the multi-hop QA task. To this end, we propose a novel method: Dynamic Multi-Granularity Graph Estimate Retrieval - DRAMA. Our method incorporates an interaction mechanism among multiple rows. Specifically, we utilize a memory bank to store the features of each row, thereby facilitating the construction of a heterogeneous graph with multi-row information. Besides, a Dynamic Graph Attention Network (DGAT) module is engaged to gauge the attention shift in the multi-hop question and eliminate the noise information dynamically. Empirical results on the widely used HybridQA and TabFact datasets demonstrate that the proposed model is effective.

The biomedical domain has sparked significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for the assessment of intrinsic PLM qualities from various perspectives. Although still limited to a few languages, notably English and Chinese, this initiative has been undertaken in the biomedical field. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English-specific MLMs to assess their cross-lingual capabilities. Our experiments reveal that no single model excels across all tasks, while generalist models are sometimes still competitive.

Dual Complex Number Knowledge Graph Embeddings
Yao Dong | Qingchao Kong | Lei Wang | Yin Luo

Knowledge graph embedding, which aims to learn representations of entities and relations in large-scale knowledge graphs, plays a crucial part in various downstream applications. The performance of knowledge graph embedding models mainly depends on the ability to model relation patterns, such as symmetry/antisymmetry, inversion and composition (commutative composition and non-commutative composition). Most existing methods fail to model the non-commutative composition patterns. Several methods support this kind of pattern by modeling in quaternion space or dihedral group. However, extending to such sophisticated spaces leads to a substantial increase in the number of parameters, which greatly reduces the parameter efficiency. In this paper, we propose a new knowledge graph embedding method called dual complex number knowledge graph embeddings (DCNE), which maps entities to the dual complex number space, and represents relations as rotations in 2D space via dual complex number multiplication. The non-commutativity of the dual complex number multiplication empowers DCNE to model the non-commutative composition patterns. In the meantime, modeling relations as rotations in 2D space can effectively improve the parameter efficiency. Extensive experiments on multiple benchmark knowledge graphs empirically show that DCNE achieves significant performance in link prediction and path query answering.
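
The non-commutativity mentioned in the abstract above can be illustrated with a minimal sketch. It assumes one common convention for dual complex numbers, z + w·ε with ε² = 0 and ε·z = conj(z)·ε; whether DCNE uses exactly this convention is an assumption, and the class name `DualComplex` is hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualComplex:
    """z + w*eps under the assumed rules eps**2 = 0 and eps*z = conj(z)*eps."""
    z: complex  # ordinary complex part (a unit-modulus z acts as a 2D rotation)
    w: complex  # dual part

    def __mul__(self, other: "DualComplex") -> "DualComplex":
        # (z1 + w1*eps)(z2 + w2*eps) = z1*z2 + (z1*w2 + w1*conj(z2))*eps
        return DualComplex(
            self.z * other.z,
            self.z * other.w + self.w * other.z.conjugate(),
        )


a = DualComplex(1j, 1 + 0j)
b = DualComplex(1 + 0j, 1 + 0j)
ab = a * b  # dual part: 1j*1 + 1*conj(1)  = 1 + 1j
ba = b * a  # dual part: 1*1  + 1*conj(1j) = 1 - 1j
```

Here `ab` and `ba` share the same complex part but differ in the dual part, so the order in which two such elements are composed matters.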

Dual Encoder: Exploiting the Potential of Syntactic and Semantic for Aspect Sentiment Triplet Extraction
Xiaowei Zhao | Yong Zhou | Xiujuan Xu

Aspect Sentiment Triplet Extraction (ASTE) is an emerging task in fine-grained sentiment analysis. Recent studies have employed Graph Neural Networks (GNN) to model the syntax-semantic relationships inherent in triplet elements. However, they have yet to fully tap into the vast potential of syntactic and semantic information within the ASTE task. In this work, we propose a Dual Encoder: Exploiting the potential of Syntactic and Semantic model (D2E2S), which maximizes the syntactic and semantic relationships among words. Specifically, our model utilizes a dual-channel encoder with a BERT channel to capture semantic information, and an enhanced LSTM channel for comprehensive syntactic information capture. Subsequently, we introduce the heterogeneous feature interaction module to capture intricate interactions between dependency syntax and attention semantics, and to dynamically select vital nodes. We leverage the synergy of these modules to harness the significant potential of syntactic and semantic information in ASTE tasks. Testing on public benchmarks, our D2E2S model surpasses the current state-of-the-art (SOTA), demonstrating its effectiveness.

DuetSim: Building User Simulator with Dual Large Language Models for Task-Oriented Dialogues
Xiang Luo | Zhiwen Tang | Jin Wang | Xuejie Zhang

User Simulators play a pivotal role in training and evaluating task-oriented dialogue systems. Traditional user simulators typically rely on human-engineered agendas, resulting in generated responses that often lack diversity and spontaneity. Although large language models (LLMs) exhibit a remarkable capacity for generating coherent and contextually appropriate utterances, they may fall short when tasked with generating responses that effectively guide users towards their goals, particularly in dialogues with intricate constraints and requirements. This paper introduces DuetSim, a novel framework designed to address the intricate demands of task-oriented dialogues by leveraging LLMs. DuetSim stands apart from conventional approaches by employing two LLMs in tandem: one dedicated to response generation and the other focused on verification. This dual LLM approach empowers DuetSim to produce responses that not only exhibit diversity but also demonstrate accuracy and are preferred by human users. We validate the efficacy of our method through extensive experiments conducted on the MultiWOZ dataset, highlighting improvements in response quality and correctness, largely attributed to the incorporation of the second LLM.

Dynamic Knowledge Prompt for Chest X-ray Report Generation
Shenshen Bu | Yujie Song | Taiji Li | Zhiming Dai

Automatic generation of radiology reports can relieve the burden on radiologists. In the radiology library, the biased dataset and the sparse features of chest X-ray images make it difficult to generate reports. Many approaches strive to integrate prior information to enhance generation, but they fail to dynamically utilize pulmonary lesion knowledge at the instance level. To alleviate this problem, we propose a novel Dynamic Knowledge Prompt (DKP) framework for chest X-ray report generation. The DKP can dynamically incorporate the pulmonary lesion information at the instance level to facilitate report generation. Initially, we design a knowledge prompt for each pulmonary lesion using numerous radiology reports. After that, the DKP uses an anomaly detector to generate the dynamic knowledge prompt by extracting discriminative lesion features in the corresponding X-ray image. Finally, the knowledge prompt is encoded and fused with hidden states extracted from the decoder, to form multi-modal features that guide visual features to generate reports. Extensive experiments on the public datasets MIMIC-CXR and IU X-Ray show that our approach achieves state-of-the-art performance.

Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
Do June Min | Veronica Perez-Rosas | Ken Resnicow | Rada Mihalcea

In this paper, we study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities for natural language generation. We focus on the task of counselor reflection generation, where we optimize the generators to simultaneously improve the fluency, coherence, and reflection quality of generated counselor responses. We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously. Specifically, we employ non-contextual and contextual multi-arm bandits to dynamically adjust multiple reward weights during training. Through automatic and manual evaluations, we show that our proposed techniques, DynaOpt and C-DynaOpt, outperform existing naive and bandit baselines, showcasing their potential for enhancing language models.
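
The general idea of using a non-contextual bandit to adjust reward weights during training can be sketched as follows. This is an illustrative epsilon-greedy stand-in, not the DynaOpt or C-DynaOpt algorithm; the class `RewardWeightBandit`, its parameters, and the placeholder gains are all hypothetical.

```python
import random


class RewardWeightBandit:
    """Epsilon-greedy bandit: arm i means 'put more weight on reward i'."""

    def __init__(self, n_rewards, epsilon=0.1, step=0.05, seed=0):
        self.n = n_rewards
        self.epsilon = epsilon
        self.step = step
        self.rng = random.Random(seed)
        self.weights = [1.0 / n_rewards] * n_rewards  # start uniform
        self.counts = [0] * n_rewards
        self.values = [0.0] * n_rewards  # running mean gain per arm

    def pull(self):
        # Explore with probability epsilon, otherwise pick the best arm so far.
        if self.rng.random() < self.epsilon:
            arm = self.rng.randrange(self.n)
        else:
            arm = max(range(self.n), key=lambda i: self.values[i])
        # Boost the chosen reward's weight, then renormalise to sum to 1.
        self.weights[arm] += self.step
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]
        return arm

    def update(self, arm, gain):
        # Incremental mean of the observed gain for this arm.
        self.counts[arm] += 1
        self.values[arm] += (gain - self.values[arm]) / self.counts[arm]

    def combine(self, rewards):
        # Collapse several reward signals into the single scalar used for RL.
        return sum(w * r for w, r in zip(self.weights, rewards))


bandit = RewardWeightBandit(n_rewards=3)
for _ in range(20):
    arm = bandit.pull()
    gain = [0.1, 0.5, 0.2][arm]  # placeholder for a measured improvement
    bandit.update(arm, gain)
scalar_reward = bandit.combine([1.0, 1.0, 1.0])
```

In an actual training loop, the gain would be a measured change in validation reward after a training interval, and `combine` would supply the scalar reward for each generated response.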

Dynamic Spatial-Temporal Aggregation for Skeleton-Aware Sign Language Recognition
Lianyu Hu | Liqing Gao | Zekang Liu | Wei Feng

Skeleton-aware sign language recognition (SLR) has gained popularity due to its ability to remain unaffected by background information and its lower computational requirements. Current methods utilize spatial graph modules and temporal modules to capture spatial and temporal features, respectively. However, their spatial graph modules are typically built on fixed graph structures such as graph convolutional networks or a single learnable graph, which only partially explore joint relationships. Additionally, a simple temporal convolution kernel is used to capture temporal information, which may not fully capture the complex movement patterns of different signers. To overcome these limitations, we propose a new spatial architecture consisting of two concurrent branches, which build input-sensitive joint relationships and incorporate specific domain knowledge for recognition, respectively. These two branches are followed by an aggregation process to distinguish important joint connections. We then propose a new temporal module to model multi-scale temporal information to capture complex human dynamics. Our method achieves state-of-the-art accuracy compared to previous skeleton-aware methods on four large-scale SLR benchmarks. Moreover, our method demonstrates superior accuracy compared to RGB-based methods in most cases while requiring far fewer computational resources, yielding a better accuracy-computation trade-off. Code is available at https://github.com/hulianyuyy/DSTA-SLR.

EcoVerse: An Annotated Twitter Dataset for Eco-Relevance Classification, Environmental Impact Analysis, and Stance Detection
Francesca Grasso | Stefano Locci | Giovanni Siragusa | Luigi Di Caro

The anthropogenic ecological crisis constitutes a significant challenge that all within the academy must urgently face, including the Natural Language Processing (NLP) community. While recent years have seen increasing work revolving around climate-centric discourse, crucial environmental and ecological topics outside of climate change remain largely unaddressed, despite their prominent importance. Mainstream NLP tasks, such as sentiment analysis, dominate the scene, but there remains an untouched space in the literature involving the analysis of environmental impacts of certain events and practices. To address this gap, this paper presents EcoVerse, an annotated English Twitter dataset of 3,023 tweets spanning a wide spectrum of environmental topics. We propose a three-level annotation scheme designed for Eco-Relevance Classification, Stance Detection, and Environmental Impact Analysis, introducing an original approach for the latter. We detail the data collection, filtering, and labeling process that led to the creation of the dataset. Remarkable Inter-Annotator Agreement indicates that the annotation scheme produces consistent annotations of high quality. Subsequent classification experiments using BERT-based models, including ClimateBERT, are presented. These yield encouraging results, while also indicating room for a model specifically tailored for environmental texts. The dataset is made freely available to stimulate further research.

ECtHR-PCR: A Dataset for Precedent Understanding and Prior Case Retrieval in the European Court of Human Rights
Santosh T.y.s.s. | Rashid Haddad | Matthias Grabmair

In common law jurisdictions, legal practitioners rely on precedents to construct arguments, in line with the doctrine of stare decisis. As the number of cases grows over the years, prior case retrieval (PCR) has garnered significant attention. Besides lacking real-world scale, existing PCR datasets do not simulate a realistic setting, because their queries use complete case documents while only masking references to prior cases. The query is thereby exposed to legal reasoning not yet available when constructing an argument for an undecided case as well as spurious patterns left behind by citation masks, potentially short-circuiting a comprehensive understanding of case facts and legal principles. To address these limitations, we introduce a PCR dataset based on judgements from the European Court of Human Rights (ECtHR), which explicitly separate facts from arguments and exhibit precedential practices, helping us to develop this PCR dataset to foster systems’ comprehensive understanding. We benchmark different lexical and dense retrieval approaches with various negative sampling strategies, adapting them to deal with long text sequences using hierarchical variants. We found that difficulty-based negative sampling strategies were not effective for the PCR task, highlighting the need for investigation into domain-specific difficulty criteria. Furthermore, we observe that the performance of the dense models degrades over time, which calls for further research into temporal adaptation of retrieval models. Additionally, we assess the influence of different views, Halsbury’s and Goodhart’s, in practice in the ECtHR jurisdiction using the PCR task.

Stance detection aims to determine the attitude expressed in text towards a given target. Zero-shot stance detection (ZSSD) has emerged to classify stances towards unseen targets during inference. Recent data augmentation techniques for ZSSD increase transferable knowledge between targets through text or target augmentation. However, these methods exhibit limitations. Target augmentation lacks logical connections between generated targets and source text, while text augmentation relies solely on training data, resulting in insufficient generalization. To address these issues, we propose an encoder-decoder data augmentation (EDDA) framework. The encoder leverages large language models and chain-of-thought prompting to summarize texts into target-specific if-then rationales, establishing logical relationships. The decoder generates new samples based on these expressions using a semantic correlation word replacement strategy to increase syntactic diversity. We also analyze the generated expressions to develop a rationale-enhanced network that fully utilizes the augmented data. Experiments on benchmark datasets demonstrate our approach substantially improves over state-of-the-art ZSSD techniques. The proposed EDDA framework increases semantic relevance and syntactic variety in augmented texts while enabling interpretable rationale-based learning.

We present EDEN, the first Norwegian dataset annotated with event information at the sentence level, adapting the widely used ACE event schema to Norwegian. The paper describes the manual annotation of Norwegian text as well as transcribed speech in the news domain, together with inter-annotator agreement and discussions of relevant dataset statistics. We also present preliminary modeling results using a graph-based event parser. The resulting dataset will be freely available for download and use.

This paper describes a corpus consisting of real-world dialogues in English between users and a task-oriented conversational agent, with interactions revolving around the description of finite state automata. The creation of this corpus is part of a larger research project aimed at developing tools for an easier access to educational content, especially in STEM fields, for users with visual impairments. The development of this corpus was precisely motivated by the aim of providing a useful resource to support the design of such tools. The core feature of this corpus is that its creation involved both sighted and visually impaired participants, thus allowing for a greater diversity of perspectives and giving the opportunity to identify possible differences in the way the two groups of participants interacted with the agent. The paper introduces this corpus, giving an account of the process that led to its creation, i.e. the methodology followed to obtain the data, the annotation scheme adopted, and the analysis of the results. Finally, the paper reports the results of a classification experiment on the annotated corpus, and an additional experiment to assess the annotation capabilities of three large language models, in view of a further expansion of the corpus.

EEE-QA: Exploring Effective and Efficient Question-Answer Representations
Zhanghao Hu | Yijun Yang | Junjie Xu | Yifu Qiu | Pinzhen Chen

Current approaches to question answering rely on pre-trained language models (PLMs) like RoBERTa. This work challenges the existing question-answer encoding convention and explores finer representations. We begin by testing various pooling methods against the use of the begin-of-sentence token as a question representation, aiming for better quality. Next, we explore opportunities to simultaneously embed all answer candidates with the question. This enables cross-reference between answer choices and improves inference throughput via reduced memory usage. Despite their simplicity and effectiveness, these methods have yet to be widely studied in current frameworks. We experiment with different PLMs, and with and without the integration of knowledge graphs. Results demonstrate the memory efficiency of the proposed techniques with little sacrifice in performance. Practically, our work increases throughput by 38-100% with 26-65% speedups on consumer-grade GPUs by allowing for considerably larger batch sizes. Our work sends a message to the community with promising directions in both representation quality and efficiency for the question-answering task in natural language processing.
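
The contrast between taking the begin-of-sentence token and pooling over all tokens can be sketched in a few lines. This is a minimal illustration with made-up vectors, assuming masked mean pooling as one of the possible pooling methods; the function names are hypothetical and no particular PLM library is used.

```python
def first_token_pool(token_vectors):
    """Use the first (begin-of-sentence) token's vector as the representation."""
    return list(token_vectors[0])


def mean_pool(token_vectors, mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy encoder output: 4 token positions, hidden size 2; the last one is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

bos_repr = first_token_pool(tokens)   # [1.0, 0.0]
mean_repr = mean_pool(tokens, mask)   # [3.0, 2.0]
```

The pooled vector reflects every non-padding token, whereas the first-token vector depends on a single position.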

Eesthetic: A Paralex Lexicon of Estonian Paradigms
Sacha Beniamine | Mari Aigro | Matthew Baerman | Jules Bouton | Maria Copot

We introduce Eesthetic, a comprehensive Estonian noun and verb lexicon sourced from the Ekilex database. It documents 5475 nouns inflecting for 28 paradigm cells and 5076 verbs inflecting for 51 cells, and comprises a total of 452885 inflected forms. Our openly accessible machine-readable dataset adheres to the Paralex standard. It comprises CSV tables linked by formal relationships. Metadata in JSON format, following the Frictionless standard, provides detailed descriptions of the tables and dataset. The lexicon offers extensive linguistic annotations, including orthographic forms, automatically transcribed phonemic transcriptions, non-canonical morphological phenomena such as overabundance and defectiveness, rich mapping of the paradigm cells and feature-values to other notation schemes, a decomposition of phonemes in distinctive features, and annotation of inflection classes. It is suited for both monolingual and comparative research, enabling qualitative and quantitative analysis. This paper outlines the creation process, rationale, and resulting structure, along with our set of rules for automatic orthography-to-phonemic transcription conversion.

Effective Distillation of Table-based Reasoning Ability from LLMs
Bohao Yang | Chen Tang | Kun Zhao | Chenghao Xiao | Chenghua Lin

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their enormous parameter size and extremely high requirements for compute power pose challenges for their practical deployment. Recent research has revealed that specific capabilities of LLMs, such as numerical reasoning, can be transferred to smaller models through distillation. Some studies explore the potential of leveraging LLMs to perform table-based reasoning. However, there has been no prior work focusing on table reasoning skills in smaller models specifically tailored for scientific table-to-text generation tasks. In this paper, we propose a novel table-based reasoning distillation approach, with the aim of distilling LLMs into tailored smaller models. Our experimental results have shown that a 220 million parameter model (Flan-T5-base) fine-tuned using distilled data, not only achieves a significant improvement compared to traditionally fine-tuned baselines, but also surpasses specific LLMs on a scientific table-to-text generation dataset. Our code is available at https://github.com/Bernard-Yang/DistillTableCoT.

Effective Integration of Text Diffusion and Pre-Trained Language Models with Linguistic Easy-First Schedule
Yimin Ou | Ping Jian

Diffusion models have become a powerful generative modeling paradigm, achieving great success in continuous data patterns. However, the discrete nature of text data results in compatibility issues between continuous diffusion models (CDMs) and pre-trained language models (PLMs). That is, the performance of diffusion models even degrades when combined with PLMs. To alleviate this issue, we propose to utilize a pre-trained decoder to convert the denoised embedding vectors into natural language instead of using the widely used rounding operation. In this way, CDMs can be more effectively combined with PLMs. Additionally, considering that existing noise schedules in text diffusion models do not take into account the linguistic differences among tokens, which violates the easy-first policy for text generation, we propose a linguistic easy-first schedule that incorporates the measure of word importance, conforming to easy-first-generation linguistic features and bringing about improved generation quality. Experiment results on the E2E dataset and five controllable tasks show that our approach can combine the merits of CDMs and PLMs, significantly outperforming other diffusion-based models.

Efficiency and Effectiveness in Task-Oriented Dialogue: On Construction Repetition, Information Rate, and Task Success
Jun Sen Yee | Mario Giulianelli | Arabella J. Sinclair

We investigate the roles that efficiency and effectiveness play in speakers’ repetition of shared word sequences, or constructions, in task-oriented dialogue. We find that repeating constructions has negative effects on information rate and positive effects on rate of delivery, that information rate managing strategies are predictive of task success, and that this varies by the communicative function of the constructions being repeated. More effective dialogue is characterised by greater levels of shared construction usage and more efficient task-related repetition; while task-agnostic repetition can seem redundant, it can serve important efficiency and effectiveness functions. Our results provide a nuanced picture of the importance of repetition and of developing a shared lexicon for both efficiency and effectiveness in task-oriented dialogue.

Efficient AMR Parsing with CLAP: Compact Linearization with an Adaptable Parser
Abelardo Carlos Martinez Lorenzo | Roberto Navigli

Sequence-to-sequence models have become the de facto standard for Abstract Meaning Representation (AMR) parsing due to their high-quality performance. However, these systems face efficiency challenges because of their large model size and computational time, which limit their accessibility within the research community. This paper aims to break down these barriers by introducing a novel linearization and system that significantly enhance the efficiency and accessibility of previous AMR parsers. First, we propose our novel Compact linearization that simplifies encoding, thereby reducing the number of tokens by between 40% and 50%. Second, we present CLAP, an innovative modular system that maintains the model’s high performance while achieving a remarkable 80% reduction in training and inference times. Furthermore, CLAP is compatible with multiple autoregressive Language Models (LMs) and tokenizers, such as BART, T5, and others. These advancements underscore the importance of optimizing sequence-to-sequence models in AMR parsing, thus democratizing access to high-quality semantic analysis. Our code is publicly available at https://github.com/SapienzaNLP/clap/.

The efficacy of neural “retrieve and generate” systems is well established for question answering (QA) over unstructured text. Recent efforts seek to extend this approach to knowledge graph (KG) QA by converting structured triples to unstructured text. However, the relevance of KG triples retrieved by these systems limits their accuracy. In this paper, we improve the relevance of retrieved triples using a carefully designed re-ranker. Specifically, our pipeline (i) retrieves over documents of triples grouped by entity, (ii) re-ranks triples from these documents with context: triples in the 1-hop neighborhood of the documents’ subject entity, and (iii) generates an answer from highly relevant re-ranked triples. To train our re-ranker, we propose a novel “triple-level” labeling strategy that infers fine-grained labels and show that these significantly improve the relevance of retrieved information. We show that the resulting “retrieve, re-rank, and generate” pipeline significantly improves upon prior KGQA systems, achieving a new state-of-the-art on FreebaseQA by 5.56% Exact Match. We perform multiple ablations that reveal the distinct benefits of our contextual re-ranker and labeling strategy and conclude with a case study that highlights opportunities for future work.

EFTNAS: Searching for Efficient Language Models in First-Order Weight-Reordered Super-Networks
Juan Pablo Munoz | Yi Zheng | Nilesh Jain

Transformer-based models have demonstrated outstanding performance in natural language processing (NLP) tasks and many other domains, e.g., computer vision. Depending on the size of these models, which have grown exponentially in the past few years, machine learning practitioners might be restricted from deploying them in resource-constrained environments. This paper discusses the compression of transformer-based models for multiple resource budgets. Integrating neural architecture search (NAS) and network pruning techniques, we effectively generate and train weight-sharing super-networks that contain efficient, high-performing, and compressed transformer-based models. A common challenge in NAS is the design of the search space, for which we propose a method to automatically obtain the boundaries of the search space and then derive the rest of the intermediate possible architectures using a first-order weight importance technique. The proposed end-to-end NAS solution, EFTNAS, discovers efficient subnetworks that have been compressed and fine-tuned for downstream NLP tasks. We demonstrate EFTNAS on the General Language Understanding Evaluation (GLUE) benchmark and the Stanford Question Answering Dataset (SQuAD), obtaining high-performing smaller models with a reduction of more than 5x in size and little or no degradation in performance.

Behavioral coding (BC) in motivational interviewing (MI) holds great potential for enhancing the efficacy of MI counseling. However, manual coding is labor-intensive, and automation efforts are hindered by the lack of data due to the privacy of psychotherapy. To address these challenges, we introduce BiMISC, a bilingual dataset of MI conversations in English and Dutch, sourced from real counseling sessions. Expert annotations in BiMISC adhere strictly to the motivational interviewing skills code (MISC) scheme, offering a pivotal resource for MI research. Additionally, we present a novel approach to elicit the MISC expertise from Large language models (LLMs) for MI coding. Through the in-depth analysis of BiMISC and the evaluation of our proposed approach, we demonstrate that the LLM-based approach yields results closely aligned with expert annotations and maintains consistent performance across different languages. Our contributions not only furnish the MI community with a valuable bilingual dataset but also spotlight the potential of LLMs in MI coding, laying the foundation for future MI research.

ELLEN: Extremely Lightly Supervised Learning for Efficient Named Entity Recognition
Haris Riaz | Razvan Gabriel Dumitru | Mihai Surdeanu

In this work, we revisit the problem of semi-supervised named entity recognition (NER) focusing on extremely light supervision, consisting of a lexicon containing only 10 examples per class. We introduce ELLEN, a simple, fully modular, neuro-symbolic method that blends fine-tuned language models with linguistic rules. These rules include insights such as “One Sense Per Discourse”, using a Masked Language Model as an unsupervised NER, leveraging part-of-speech tags to identify and eliminate unlabeled entities as false negatives, and other intuitions about classifier confidence scores in local and global context. ELLEN achieves very strong performance on the CoNLL-2003 dataset when using the minimal supervision from the lexicon above. It also outperforms most existing (and considerably more complex) semi-supervised NER methods under the same supervision settings commonly used in the literature (i.e., 5% of the training data). Further, we evaluate our CoNLL-2003 model in a zero-shot scenario on WNUT-17 where we find that it outperforms GPT-3.5 and achieves comparable performance to GPT-4. In a zero-shot setting, ELLEN also achieves over 75% of the performance of a strong, fully supervised model trained on gold data. Our code is publicly available.

EMAD: A Bridge Tagset for Unifying Arabic POS Annotations
Omar Kallas | Go Inoue | Nizar Habash

There have been many attempts to model the morphological richness and complexity of Arabic, leading to numerous Part-of-Speech (POS) tagsets that differ in terms of (a) which morphological features they represent, (b) how they represent them, and (c) the degree of specification of said features. Tagset granularity plays an important role in determining how annotated data can be used and for what applications. Due to the diversity among existing tagsets, many annotated corpora for Arabic cannot be easily combined, which exacerbates the Arabic resource poverty situation. In this work, we propose an intermediate tagset designed to facilitate the conversion and unification of different tagsets used to annotate Arabic corpora. This new tagset acts as a bridge between different annotation schemes, simplifying the integration of annotated corpora and promoting collaboration across the projects using them.

Emancipating Event Extraction from the Constraints of Long-Tailed Distribution Data Utilizing Large Language Models
Zhigang Kan | Liwen Peng | Linbo Qiao | Dongsheng Li

Event Extraction (EE) is a challenging task that aims to extract structural event-related information from unstructured text. Traditional methods for EE depend on manual annotations, which are both expensive and scarce. Furthermore, the existing datasets mostly follow the long-tail distribution, severely hindering the previous methods of modeling tail types. Two techniques can address this issue: transfer learning and data generation. However, the existing methods based on transfer learning still rely on pre-training with a large amount of labeled data in the source domain. Additionally, the quality of data generated by previous data generation methods is difficult to control. In this paper, leveraging Large Language Models (LLMs), we propose novel methods for event extraction and generation based on dialogues, overcoming the problems of relying on source domain data and maintaining data quality. Specifically, this paper innovatively transforms the EE task into multi-turn dialogues, guiding LLMs to learn event schemas from historical dialogue information and output structural events. Furthermore, we introduce a novel LLM-based method for generating high-quality data, significantly improving traditional models’ performance with various paradigms and structures, especially on tail types. Adequate experiments on real-world datasets demonstrate the effectiveness of the proposed event extraction and data generation methods.

EMOLIS App and Dataset to Find Emotionally Close Cartoons
Soëlie Lerch | Patrice Bellot | Elisabeth Murisasco | Emmanuel Bruno

We propose the EMOLIS Dataset, which contains annotated emotional transcripts of scenes from Walt Disney cartoons together with physiological signals recorded from spectators (breathing, ECG, eye movements). The dataset is used in the EMOLIS App, our second proposal. The EMOLIS App displays the identified emotions while a video is playing and suggests emotionally comparable videos. We propose to estimate an emotional distance between videos using multimodal neural representations (text, audio, video) that also combine physiological signals. This enables personalized results that can be used for cognitive therapies focusing on awareness of felt emotions. The dataset is designed to be suitable for all audiences, including autistic people who have difficulty recognizing and expressing emotions.

EmoProgress: Cumulated Emotion Progression Analysis in Dreams and Customer Service Dialogues
Eileen Wemmer | Sofie Labat | Roman Klinger

Emotion analysis often involves the categorization of isolated textual units, but these are parts of longer discourses, like dialogues or stories. This leads to two different established emotion classification setups: (1) Classification of a longer text into one or multiple emotion categories. (2) Classification of the parts of a longer text (sentences or utterances), either (2a) with or (2b) without consideration of the context. None of these settings, however, makes it possible to answer the question of which emotion is presumably experienced at a specific moment in time. For instance, a customer’s request of “My computer broke.” would be annotated with anger. This emotion persists in a potential follow-up reply “It is out of warranty.” which would also correspond to the global emotion label. An alternative reply “We will send you a new one.” might, in contrast, lead to relief. Modeling these label relations requires classification of textual parts under consideration of the past, but without access to the future. Consequently, we propose a novel annotation setup for emotion categorization corpora, in which the annotations reflect the emotion up to the annotated sentence. We ensure this by uncovering the textual parts step-by-step to the annotator, asking for a label in each step. This perspective is important to understand the final, global emotion, while having access to the individual sentence’s emotion contributions to this final emotion. In modeling experiments, we use these data to check if the context is indeed required to automatically predict such cumulative emotion progressions.

The main focus of emotion-cause pair extraction (ECPE) is on extracting all potential emotion clauses and corresponding cause clauses from unannotated documents. Existing methods achieve promising results with the help of fine-tuning and prompt paradigms, but they present three downsides. First, most approaches cannot distinguish between the emotion-cause pairs that belong to different types of emotions, limiting the existing approaches’ applicability. Second, existing prompt methods utilize a one-to-one mapping relation to map label words to categories, which brings considerable bias to the results. Third, existing methods perform the cause extraction task with explicit semantic understanding or basic prompt templates, ignoring the implicit information contained in the cause clauses themselves. To solve these issues, we propose an Emotion knowledge-aware Prompt-tuning for Emotion-Cause Pair Extraction (EmoPrompt-ECPE) method, which integrates the knowledge of emotion categories in the ECPE task and mines the implicit knowledge of cause clauses. Specifically, we inject the latent knowledge of the cause clauses and the emotion types into the prompt template. Besides, we extend the emotion labels for many-to-one mapping of label words to categories with an external emotion word base. Furthermore, we utilize the cosine similarity filtering of the label word base to reduce the noise caused by knowledge introduction. Experiments on both Chinese and English benchmark datasets show that our approach can achieve state-of-the-art results. Our code and data can be found at: https://github.com/xy-xiaotudou/EmoPrompt-ECPE.
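
The cosine-similarity filtering of a label word base mentioned in this abstract can be illustrated with a small sketch. It is a generic illustration under stated assumptions, not the EmoPrompt-ECPE code: the function names, the toy two-dimensional embeddings, and the threshold value are all hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_label_words(
    category_vec: List[float],
    candidates: Dict[str, List[float]],
    threshold: float,
) -> List[str]:
    """Keep candidate label words whose embedding is close to the category."""
    return [
        word
        for word, vec in candidates.items()
        if cosine(category_vec, vec) >= threshold
    ]


# Hypothetical embeddings for an emotion category and candidate label words.
joy = [1.0, 0.0]
words = {"joyful": [0.9, 0.1], "table": [0.0, 1.0], "glad": [1.0, 1.0]}
print(filter_label_words(joy, words, threshold=0.8))
```

Raising the threshold trades coverage of the extended label set for less noise from loosely related words.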

We developed a web app for ascribing verbal descriptions to expressive audiovisual utterances. These descriptions are limited to lists of adjectives that are either suggested via navigation in emotional latent spaces built using discriminant analysis of BERT embeddings or entered freely by subjects. We show that such verbal descriptions, collected online via Prolific on massive data (310 participants, 12620 labelled utterances so far), provide Expressive Multimodal Text-to-Speech Synthesis with precise verbal control over desired emotional content.

Emotion Analysis in NLP: Trends, Gaps and Roadmap for Future Directions
Flor Miriam Plaza-del-Arco | Alba A. Cercas Curry | Amanda Cercas Curry | Dirk Hovy

Emotions are a central aspect of communication. Consequently, emotion analysis (EA) is a rapidly growing field in natural language processing (NLP). However, there is no consensus on scope, direction, or methods. In this paper, we conduct a thorough review of 154 relevant NLP publications from the last decade. Based on this review, we address four different questions: (1) How are EA tasks defined in NLP? (2) What are the most prominent emotion frameworks and which emotions are modeled? (3) Is the subjectivity of emotions considered in terms of demographics and cultural factors? and (4) What are the primary NLP applications for EA? We take stock of trends in EA and tasks, emotion frameworks used, existing datasets, methods, and applications. We then discuss four lacunae: (1) the absence of demographic and cultural aspects does not account for the variation in how emotions are perceived, but instead assumes they are universally experienced in the same manner; (2) the poor fit of emotion categories from the two main emotion theories to the task; (3) the lack of standardized EA terminology hinders gap identification, comparison, and future goals; and (4) the absence of interdisciplinary research isolates EA from insights in other fields. Our work will enable more focused research into EA and a more holistic approach to modeling emotions in NLP.

Emotion recognition in conversation (ERC) is a field that aims to classify the emotion of each utterance within conversational contexts. This presents significant challenges, particularly in handling emotional ambiguity across various speakers and contextual factors. Existing ERC approaches have primarily focused on modeling conversational contexts while incorporating only superficial speaker attributes such as names, memories, and interactions. Recent works introduce personality as an essential deep speaker factor for emotion recognition, but rely on static personality, overlooking dynamic variability during conversations. Advances in personality psychology conceptualize personality as dynamic, proposing that personality states can change across situations. In this paper, we introduce ERC-DP, a novel model considering the dynamic personality of speakers during conversations. ERC-DP treats past utterances from the same speaker as the situation impacting dynamic personality. It combines personality modeling with prompt design and fine-grained classification modules. Through a series of comprehensive experiments, ERC-DP demonstrates superior performance on three benchmark conversational datasets.

In an emotional conversation, emotions are causally transmitted among communication participants, constituting a fundamental conversational feature that can facilitate the comprehension of intricate changes in emotional states during the conversation and contribute to neutralizing emotional semantic bias in utterances caused by the absence of modality information. Therefore, emotional transition (ET) plays a crucial role in the task of Emotion Recognition in Conversation (ERC), yet it has not received sufficient attention in current research. In light of this, an Emotional Transition-based Emotion Recognizer (EmoTrans) is proposed in this paper. Specifically, we concatenate the most recent utterances with their corresponding speakers to construct the model input, known as samples, each with several placeholders to implicitly express the emotions of contextual utterances. Based on these placeholders, two components are developed to make the model sensitive to emotions and effectively capture the ET features in the sample. Furthermore, an ET-based Contrastive Learning (CL) method is developed to compact the representation space, helping the model achieve more robust sample representations. We conducted exhaustive experiments on four widely used datasets and obtained competitive experimental results, including new state-of-the-art results on MELD and IEMOCAP, demonstrating the superiority of EmoTrans.

EmpCRL: Controllable Empathetic Response Generation via In-Context Commonsense Reasoning and Reinforcement Learning
Mingxiu Cai | Daling Wang | Shi Feng | Yifei Zhang

Empathetic response generation aims to understand the user’s feelings emotionally and generate responses with appropriate emotion. According to psychological theories, empathy consists of two main aspects: affection and cognition. However, existing works lack the perception of fine-grained dialogue emotion propagation and are limited in reasoning about users’ intentions on the cognitive side, both of which affect the quality of empathetic responses. To this end, we propose to generate Empathetic response based on in-context Commonsense reasoning and Reinforcement Learning (EmpCRL). First, we use a currently popular large language model combined with multi-view contextual reasoning to broaden the cognitive boundaries through in-context learning. Furthermore, we infer the response emotion by jointly modeling the dialogue history and emotion flow, and achieve the control of response emotion and diversity through reinforcement learning. Extensive experiments on the EmpatheticDialogues dataset show that our model outperforms state-of-the-art models in both automatic and human evaluation.

Empowering Low-Resource Regional Languages with Lexicons : A Comparative Study of NLP Tools for Morphosyntactic Analysis
Cristina Garcia Holgado | Marianne Vergez-Couret

We investigate the effect of integrating lexicon information for an extremely low-resource language when annotated data for morpho-syntactic analysis is scarce. Obtaining such data and linguistic resources for these languages is usually constrained by a lack of human and financial resources, making this task particularly challenging. In this paper, we describe the collection and use of a bilingual lexicon for Poitevin-Saintongeais, a regional language of France, to create augmented data through a neighbor-based distributional method. We assess how this lexicon-driven approach improves POS tagging while using different lexicon and augmented data sizes. To evaluate this strategy, we compare two distinct paradigms: neural networks, which typically require extensive data, and a conventional probabilistic approach, in which a lexicon is instrumental in its performance. Our findings reveal that the lexicon is a valuable asset for all models, but in particular for neural ones, demonstrating an enhanced generalization across diverse classes without requiring an extensive lexicon size.

Empowering Oneida Language Revitalization: Development of an Oneida Verb Conjugator
Yanfei Lu | Patrick Littell | Keren Rice

In this paper, we present the development of a digital Oneida verb conjugator using the Gramble framework. This project is a collaborative effort with the Twatati Adult Oneida Language program. Oneida is a polysynthetic North American Indigenous language. Its verb roots can be conjugated with multiple affixes, and long verbal complexes can be used as utterances. Each Oneida affix encodes important grammatical information, and its form often varies based on various factors, such as its position in the utterance and its phonological environment. The distinct morphosyntactic structures complicate acquisition of the language by learners who are native speakers of English. With an alarmingly small number of native speakers of Oneida, supporting and accelerating adult second language learners’ acquisition process has become a pressing necessity. The Oneida verb conjugator can show its users the correct conjugations of verbs and can also let learners generate practice materials tailored to their unique learning trajectories. This paper presents the preliminary stages and outcomes of the project and outlines the areas for improvement to be addressed in our subsequent endeavors.

Empowering Small-Scale Knowledge Graphs: A Strategy of Leveraging General-Purpose Knowledge Graphs for Enriched Embeddings
Albert Sawczyn | Jakub Binkowski | Piotr Bielak | Tomasz Kajdanowicz

Knowledge-intensive tasks pose a significant challenge for Machine Learning (ML) techniques. Commonly adopted methods, such as Large Language Models (LLMs), often exhibit limitations when applied to such tasks. Nevertheless, there have been notable endeavours to mitigate these challenges, with a significant emphasis on augmenting LLMs through Knowledge Graphs (KGs). While KGs provide many advantages for representing knowledge, their development costs can deter extensive research and applications. Addressing this limitation, we introduce a framework for enriching embeddings of small-scale domain-specific Knowledge Graphs with well-established general-purpose KGs. Adopting our method, a modest domain-specific KG can benefit from a performance boost in downstream tasks when linked to a substantial general-purpose KG. Experimental evaluations demonstrate a notable enhancement, with up to a 44% increase observed in the Hits@10 metric. This relatively unexplored research direction can catalyze more frequent incorporation of KGs in knowledge-intensive tasks, resulting in more robust, reliable ML implementations that hallucinate less than prevalent LLM solutions.
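
For readers unfamiliar with the Hits@10 metric cited in this abstract, the following minimal sketch shows the standard definition: the fraction of test queries whose correct entity is ranked within the top k candidates. The function name and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of queries whose correct answer has rank <= k (ranks start at 1)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical ranks of the correct entity for five link-prediction queries.
example_ranks = [1, 3, 12, 10, 50]
print(hits_at_k(example_ranks, k=10))
```

Three of the five example ranks fall within the top 10, so the score for this toy input is 0.6.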

Empowering Tree-structured Entailment Reasoning: Rhetorical Perception and LLM-driven Interpretability
Longyin Zhang | Bowei Zou | Ai Ti Aw

The study delves into the construction of entailment trees for science question answering (SQA), employing a novel framework termed Tree-structured Entailment Reasoning (TER). Current research on entailment tree construction faces significant challenges, primarily due to the ambiguities and similarities among candidate science facts, which considerably complicate the fact retrieval process. Moreover, the existing models exhibit limitations in effectively modeling the sequence of reasoning states, understanding the intricate relations between neighboring entailment tree nodes, and generating intermediate conclusions. To this end, we explore enhancing the TER performance from three aspects: first, improving retrieval capabilities by modeling and referring to the chained reasoning states; second, enhancing TER by infusing knowledge that bridges the gap between reasoning types and rhetorical relations; third, exploring a task-specific large language model tuning scheme to mitigate deficiencies in intermediate conclusion generation. Experiments on the English EntailmentBank demonstrate the effectiveness of the proposed methods in augmenting the quality of tree-structured entailment reasoning to a certain extent.

Emstremo: Adapting Emotional Support Response with Enhanced Emotion-Strategy Integrated Selection
Junlin Li | Bo Peng | Yu-Yin Hsu

To provide effective support, it is essential for a skilled supporter to emotionally resonate with the help-seeker’s current emotional state. In conversational interactions, this emotional alignment is further influenced by the comforting strategies employed by the supporter. Different strategies guide the interlocutors to align their emotions in nuanced patterns. However, the incorporation of strategy into emotional alignment in the context of emotional support agents remains underexplored. To address this limitation, we propose an improved emotional support agent called Emstremo. Emstremo aims to achieve strategic control of emotional alignment by perceiving and responding to the user’s emotions. Our system’s state-of-the-art performance emphasizes the importance of integrating emotions and strategies in modeling conversations that provide emotional support.

Encoding Gesture in Multimodal Dialogue: Creating a Corpus of Multimodal AMR
Kenneth Lai | Richard Brutti | Lucia Donatelli | James Pustejovsky

Abstract Meaning Representation (AMR) is a general-purpose meaning representation that has become popular for its clear structure, ease of annotation and available corpora, and overall expressiveness. While AMR was designed to represent sentence meaning in English text, recent research has explored its adaptation to broader domains, including documents, dialogues, spatial information, cross-lingual tasks, and gesture. In this paper, we present an annotated corpus of multimodal (speech and gesture) AMR in a task-based setting. Our corpus is multilayered, containing temporal alignments to both the speech signal and to descriptions of gesture morphology. We also capture coreference relationships across modalities, enabling fine-grained analysis of how the semantics of gesture and natural language interact. We discuss challenges that arise when identifying cross-modal coreference and anaphora, as well as in creating and evaluating multimodal corpora in general. Although we find AMR’s abstraction away from surface form (in both language and gesture) occasionally too coarse-grained to capture certain cross-modal interactions, we believe its flexibility allows for future work to fill in these gaps. Our corpus and annotation guidelines are available at https://github.com/klai12/encoding-gesture-multimodal-dialogue.

Endowing Neural Language Learners with Human-like Biases: A Case Study on Dependency Length Minimization
Yuqing Zhang | Tessa Verhoef | Gertjan van Noord | Arianna Bisazza

Natural languages show a tendency to minimize the linear distance between heads and their dependents in a sentence, known as dependency length minimization (DLM). Such a preference, however, has not been consistently replicated with neural agent simulations. Comparing the behavior of models with that of human learners can reveal which aspects affect the emergence of this phenomenon. In this work, we investigate the minimal conditions that may lead neural learners to develop a DLM preference. We add three factors to the standard neural-agent language learning and communication framework to make the simulation more realistic, namely: (i) the presence of noise during listening, (ii) context-sensitivity of word use through non-uniform conditional word distributions, and (iii) incremental sentence processing, or the extent to which an utterance’s meaning can be guessed before hearing it entirely. While no preference appears in production, we show that the proposed factors can contribute to a small but significant learning advantage of DLM for listeners of verb-initial languages.
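
The quantity that dependency length minimization refers to can be made concrete with a short sketch: the total dependency length of a sentence is the sum of linear distances between each word and its head. The head-index encoding (1-based positions, 0 for the root) and the example parse are assumptions chosen for illustration, not material from the paper.

```python
from typing import List


def total_dependency_length(heads: List[int]) -> int:
    """Sum of |position - head position| over all non-root words.

    heads[i] is the 1-based position of the head of word i + 1;
    a value of 0 marks the root, which has no incoming dependency.
    """
    return sum(
        abs(position - head)
        for position, head in enumerate(heads, start=1)
        if head != 0
    )


# Hypothetical parse of "the dog chased the cat":
# the -> dog, dog -> chased, chased = root, the -> cat, cat -> chased
heads = [2, 3, 0, 5, 3]
print(total_dependency_length(heads))
```

A word order with a lower total for the same dependency tree is the one a DLM preference would favor.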

pdf abs
End-to-end Parsing of Procedural Text into Flow Graphs
Dhaivat J. Bhatt | Seyed Ahmad Abdollahpouri Hosseini | Federico Fancellu | Afsaneh Fazly

We focus on the problem of parsing procedural text into fine-grained flow graphs that encode actions and entities, as well as their interactions. Specifically, we focus on parsing cooking recipes, and address a few limitations of existing parsers. Unlike SOTA approaches to flow graph parsing, which work in two separate stages, identifying actions and entities (tagging) and then encoding their interactions via connecting edges (graph generation), we propose an end-to-end multi-task framework that simultaneously performs tagging and graph generation. In addition, due to the end-to-end nature of our proposed model, we can unify the input representation and use compact encoders, resulting in small models with significantly fewer parameters than SOTA models. Another key challenge in training flow graph parsers is the lack of sufficient annotated data, due to the costly nature of the fine-grained annotations. We address this problem by taking advantage of the abundant unlabelled recipes, and show that pre-training on automatically generated noisy silver annotations (from unlabelled recipes) results in a large improvement in flow graph parsing.

Aspect-category-based sentiment analysis (ACSA), which aims to identify aspect categories and predict their sentiments, has been intensively studied due to its wide range of NLP applications. Most approaches mainly utilize intrasentential features. However, a review often includes multiple different aspect categories, and some of them do not explicitly appear in the review. Even a single sentence may contain more than one aspect category, each with its own sentiment, and these are entangled within the sentence, which makes it hard for a model to preserve all sentiment characteristics distinctly. In this paper, we propose an enhanced coherence-aware network with hierarchical disentanglement (ECAN) for ACSA tasks. Specifically, we explore coherence modeling to capture the contexts across the whole review and to help the implicit aspect and sentiment identification. To address the issue of multiple aspect categories and sentiment entanglement, we propose a hierarchical disentanglement module to extract distinct categories and sentiment features. Extensive experimental and visualization results show that our ECAN effectively decouples multiple categories and sentiments entangled in the coherence representations and achieves state-of-the-art (SOTA) performance. Our code and data are available online: https://github.com/cuijin-23/ECAN.

pdf abs
Enhanced Facet Generation with LLM Editing
Joosung Lee | Jinhong Kim

In information retrieval, facet identification of a user query is an important task. If a search service can recognize the facets of a user’s query, it has the potential to offer users a much broader range of search results. Previous studies enhance facet prediction by leveraging retrieved documents and related queries obtained through a search engine. However, there are challenges in extending this approach to other applications when a search engine operates as part of the model. First, search engines are constantly updated. Therefore, the additional information may change between training and testing, which may reduce performance. The second challenge is that public search engines cannot search for internal documents. Therefore, a separate search system needs to be built to incorporate documents from private domains within the company. We propose two strategies that focus on a framework that can predict facets by taking only queries as input without a search engine. The first strategy is multi-task learning to predict SERP. By leveraging SERP as a target instead of a source, the proposed model deeply understands queries without relying on external modules. The second strategy is to enhance the facets by combining a Large Language Model (LLM) and the small model. Overall performance improves when the small model and the LLM are combined rather than used individually for facet generation.

The widespread use of pre-trained language models (PLMs) in natural language processing (NLP) has greatly improved performance outcomes. However, these models’ vulnerability to adversarial attacks (e.g., camouflaged hints from drug dealers), particularly in the Chinese language with its rich character diversity/variation and complex structures, raises serious concern. In this study, we propose a novel method, CHinese vAriatioN Graph Enhancement (CHANGE), to increase the robustness of PLMs against character variation attacks in Chinese content. CHANGE presents a novel approach to incorporate a Chinese character variation graph into the PLMs. Through designing different supplementary tasks utilizing the graph structure, CHANGE essentially enhances PLMs’ interpretation of adversarially manipulated text. Experiments conducted in a multitude of NLP tasks show that CHANGE outperforms current language models in combating adversarial attacks and serves as a valuable contribution to robust language model research. Moreover, these findings highlight the substantial potential of graph-guided pre-training strategies for real-world applications.

Large Language Models (LLMs) have recently made significant advances in code generation through the ‘Chain-of-Thought’ prompting technique. This technique empowers the model to autonomously devise “solution plans” to tackle intricate programming challenges, thereby improving its performance in code generation. Nevertheless, smaller models have been struggling to keep up with LLMs in deducing these plans, adversely affecting their code generation capabilities. Given the considerable size and associated deployment costs of LLMs, along with concerns about data security, many teams opt for deploying smaller models for code generation. Consequently, there arises a compelling need for transferring LLMs’ code generation reasoning abilities to the smaller models. In this paper, we propose the CodePLAN framework, which aims to transfer LLMs’ reasoning capabilities to smaller models through distillation. We adopt a multi-task learning approach, jointly undertaking code generation and solution plan generation tasks, to enhance the code generation capabilities of the smaller model. To ensure the superior quality of the solution plans, we advocate for the utilization of backward reasoning and plan sampling strategies. Our experiments show that in comparison to the conventional fine-tuning approach, our approach improves the smaller model’s code generation performance (measured in pass@1 metric) by over 130% on the challenging APPS benchmark.
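
For readers unfamiliar with the pass@1 metric mentioned above, the sketch below shows the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass all tests. Whether the paper computes pass@1 in exactly this way is an assumption; the function name and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: number of generated samples for a problem
    c: number of those samples that pass all tests
    k: budget of samples considered (k = 1 for pass@1)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case any k samples must include a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 3 of 10 samples are correct.
print(pass_at_k(10, 3, 1))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n.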

Court View Generation (CVG) is a challenging task in the field of Legal Artificial Intelligence (LegalAI), which aims to generate court views based on the plaintiff claims and the fact descriptions. While Pretrained Language Models (PLMs) have showcased their prowess in natural language generation, their application to the complex, knowledge-intensive domain of CVG often reveals inherent limitations. In this paper, we present a novel approach, named Knowledge Injection and Guidance (KIG), designed to bolster CVG using PLMs. To efficiently incorporate domain knowledge during the training stage, we introduce a knowledge-injected prompt encoder for prompt tuning, thereby reducing computational overhead. Moreover, to further enhance the model’s ability to utilize domain knowledge, we employ a generating navigator, which dynamically guides the text generation process in the inference stage without altering the model’s architecture, making it readily transferable. Comprehensive experiments on real-world data demonstrate the effectiveness of our approach compared to several established baselines, especially in the responsivity of claims, where it outperforms the best baseline by 11.87%.

Existing cross-document event coreference resolution models, which either compute mention similarity directly or enhance mention representation by extracting event arguments (such as location, time, agent, and patient), lack the ability to utilize document-level information. As a result, they struggle to capture long-distance dependencies. This shortcoming leads to their underwhelming performance in determining coreference for events whose argument information relies on long-distance dependencies. In light of these limitations, we propose the construction of document-level Rhetorical Structure Theory (RST) trees and cross-document Lexical Chains to model the structural and semantic information of documents. Subsequently, cross-document heterogeneous graphs are constructed and GAT is utilized to learn the representations of events. Finally, a pair scorer calculates the similarity between each pair of events, and co-referred events can be recognized using a standard clustering algorithm. Additionally, as the existing cross-document event coreference datasets are limited to English, we have developed a large-scale Chinese cross-document event coreference dataset to fill this gap, which comprises 53,066 event mentions and 4,476 clusters. When applied to the English and Chinese datasets respectively, our model outperforms all baselines by large margins.

pdf abs
Enhancing Distantly Supervised Named Entity Recognition with Strong Label Guided Lottery Training
Zhiyuan Ma | Jintao Du | Changhua Meng | Weiqiang Wang

In low-resource Named Entity Recognition (NER) scenarios, only a limited quantity of strongly labeled data is available, while a vast amount of weakly labeled data can be easily acquired through distant supervision. However, weakly labeled data may fail to improve the model performance or even harm it due to the inevitable noise. While training on noisy data, only certain parameters are essential for model learning, termed safe parameters, whereas the other parameters tend to fit noise. In this paper, we propose a noise-robust learning framework where safe parameters can be identified with guidance from the small set of strongly labeled data, and non-safe parameters are suppressed during training on weakly labeled data for better generalization. Our method can effectively mitigate the impact of noise in weakly labeled data, and it can be easily integrated with data level noise-robust learning methods for NER. We conduct extensive experiments on multiple datasets and the results show that our approach outperforms the state-of-the-art methods.

pdf abs
Enhancing Effectiveness and Robustness in a Low-Resource Regime via Decision-Boundary-aware Data Augmentation
Kyohoon Jin | Junho Lee | Juhwan Choi | Sangmin Song | Youngbin Kim

Efforts to leverage deep learning models in low-resource regimes have led to numerous augmentation studies. However, the direct application of methods such as mixup and cutout is limited due to the discrete characteristics of textual data. While methods using pretrained language models have exhibited good efficiency, they require additional considerations for robustness. Inspired by recent studies on decision boundaries, this paper proposes a decision-boundary-aware data augmentation strategy to enhance robustness using pretrained language models. The proposed technique first focuses on shifting the latent features closer to the decision boundary, followed by reconstruction to generate an ambiguous version with a soft label. Additionally, mid-K sampling is suggested to enhance the diversity of the generated sentences. This paper demonstrates the performance of the proposed augmentation strategy compared to other methods through extensive experiments. Furthermore, the ablation study demonstrates the effect of soft labels and mid-K sampling and the extensibility of the method with curriculum data augmentation.

Predicting emotions elicited by news headlines can be challenging as the task is largely influenced by the varying nature of people’s interpretations and backgrounds. Previous works have explored classifying discrete emotions directly from news headlines. We provide a different approach to tackling this problem by utilizing people’s free-text explanations of how they feel after reading a news headline. Using the dataset BU-NEmo+ (Gao et al., 2022), we found that for emotion classification, the free-text explanations have a strong correlation with the dominant emotion elicited by the headlines. The free-text explanations also contain more sentimental context than the news headlines alone and can serve as a better input to emotion classification models. Therefore, in this work we explored generating emotion explanations from headlines by training a sequence-to-sequence transformer model and by using a pretrained large language model, ChatGPT (GPT-4). We then used the generated emotion explanations for emotion classification. In addition, we also experimented with training the pretrained T5 model for the intermediate task of explanation generation before fine-tuning it for emotion classification. Using McNemar’s significance test, methods that incorporate GPT-generated free-text emotion explanations demonstrated significant improvement (P-value < 0.05) in emotion classification from headlines, compared to methods that only use headlines. This underscores the value of using intermediate free-text explanations for emotion prediction tasks with headlines.
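
To make the significance test mentioned above concrete, here is a minimal sketch of the exact (binomial) form of McNemar’s test for comparing two classifiers on the same items. The abstract does not say which variant of the test was used, so the exact form is an assumption, and the function name and counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items classifier A labels correctly and classifier B incorrectly
    c: items classifier B labels correctly and classifier A incorrectly
    Items on which both classifiers agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant item is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: one system wins on 10 items, the other on none.
print(mcnemar_exact(0, 10))
```

A p-value below the chosen threshold (0.05 in the abstract) indicates that the two classifiers differ significantly on the paired items.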

As pretrained language models emerge and consistently develop, prompt-based training has become a well-studied paradigm to improve the exploitation of models for many natural language processing tasks. Furthermore, prompting demonstrates great performance compared to conventional fine-tuning in scenarios with limited annotated data, such as zero-shot or few-shot situations. Verbalizers are crucial in this context, as they help interpret masked word distributions generated by language models into output predictions. This study introduces a benchmarking approach to assess three common baselines of verbalizers for topic classification in few-shot learning scenarios. Additionally, we find that increasing the number of label words for automatic label word searching enhances model performance. Moreover, we investigate the effectiveness of template assembling with various aggregation strategies to develop stronger classifiers that outperform models trained with individual templates. Our approach achieves comparable results to prior research while using significantly fewer resources. Our code is available at https://github.com/quang-anh-nguyen/verbalizer_benchmark.git.

pdf abs
Enhancing Hindi Feature Representation through Fusion of Dual-Script Word Embeddings
Lianxi Wang | Yujia Tian | Zhuowei Chen

Pretrained language models excel in various natural language processing tasks but often neglect the integration of different scripts within a language, constraining their ability to capture richer semantic information, such as in Hindi. In this work, we present a dual-script enhanced feature representation method for Hindi. We combine single-script features from Devanagari and Romanized Hindi Roberta using concatenation, addition, cross-attention, and convolutional networks. The experiment results show that using a dual-script approach significantly improves model performance across various tasks. The addition fusion technique excels in sequence generation tasks, while for text classification, the CNN-based dual-script enhanced representation performs best with longer sentences, and the addition fusion technique is more effective for shorter sequences. Our approach shows significant advantages in multiple natural language processing tasks, providing a new perspective on feature representation for Hindi. Our code has been released on https://github.com/JohnnyChanV/Hindi-Fusion.

pdf abs
Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning
Nurbanu Aksoy | Nishant Ravikumar | Serge Sharoff

Image-to-text generation involves automatically generating descriptive text from images and has applications in medical report generation. However, traditional approaches often exhibit a semantic gap between visual and textual information. In this paper, we propose a multi-task learning framework to leverage both visual and non-imaging data for generating radiology reports. Along with chest X-ray images, 10 additional features comprising numeric, binary, categorical, and text data were incorporated to create a unified representation. The model was trained to generate text, predict the degree of patient severity, and identify medical findings. Multi-task learning, especially with text generation prioritisation, improved performance over single-task baselines across language generation metrics. The framework also mitigated overfitting in auxiliary tasks compared to single-task models. Qualitative analysis showed logically coherent narratives and accurate identification of findings, though some repetition and disjointed phrasing remained. This work demonstrates the benefits of multi-modal, multi-task learning for image-to-text generation applications.

pdf abs
Enhancing Knowledge Retrieval with Topic Modeling for Knowledge-Grounded Dialogue
Nhat Tran | Diane Litman

Knowledge retrieval is one of the major challenges in building a knowledge-grounded dialogue system. A common method is to use a neural retriever with a distributed approximate nearest-neighbor database to quickly find the relevant knowledge sentences. In this work, we propose an approach that utilizes topic modeling on the knowledge base to further improve retrieval accuracy and as a result, improve response generation. Additionally, we experiment with a large language model (LLM), ChatGPT, to take advantage of the improved retrieval performance to further improve the generation results. Experimental results on two datasets show that our approach can increase retrieval and generation performance. The results also indicate that ChatGPT is a better response generator for knowledge-grounded dialogue when relevant knowledge is provided.

pdf abs
Enhancing Knowledge Selection via Multi-level Document Semantic Graph
Haoran Zhang | Tan Yongmei

Knowledge selection is a crucial sub-task of Document Grounded Dialogue Systems. Existing methods view knowledge selection as a sentence matching or classification task. However, those methods cannot capture the semantic relationships within complex documents. We propose a flexible method that automatically constructs a multi-level document semantic graph from the grounding document and effectively stores the semantic relationships within the document. We also devise an auxiliary task that leverages the graph more efficiently and helps the optimization of the knowledge selection task. We conduct extensive experiments on public datasets: WoW(CITATION) and Holl-E(CITATION). We achieve state-of-the-art results on WoW. Our code has been released at https://github.com/ddf62/multi-level-semantic-document-graph.

In this paper, we introduce a novel approach for enhancing the reasoning capabilities of large language models (LLMs) for constraint satisfaction problems (CSPs), by converting reasoning problems into classification tasks. Our method leverages the LLM’s ability to decide when to call a function from a set of logical-linguistic primitives, each of which can interact with a local “scratchpad” memory and logical inference engine. Invocation of these primitives in the correct order writes the constraints to the scratchpad memory and enables the logical engine to verifiably solve the problem. We additionally propose a formal framework for exploring the “linguistic” hardness of CSP reasoning-problems for LLMs. Our experimental results demonstrate that under our proposed method, tasks with significant computational hardness can be converted to a form that is easier for LLMs to solve and yields a 40% improvement over baselines. This opens up new avenues for future research into hybrid cognitive models that integrate symbolic and neural approaches.

Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks. In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but it pays in terms of efficiency, due to the longer input prompt. In this paper, we propose a strategy to make LLMs as efficient as 0-shot text classifiers, while getting comparable or better accuracy than ICL. Our solution targets the low resource setting, i.e., when only 4 examples per class are available. Using a single LLM and few-shot real data we perform a sequence of generation, filtering and Parameter-Efficient Fine-Tuning steps to create a robust and efficient classifier. Experimental results show that our approach leads to competitive results on multiple text classification datasets.

pdf abs
Enhancing Parameter-efficient Fine-tuning with Simple Calibration Based on Stable Rank
Peiyu Liu | Ze-Feng Gao | Xiao Zhang | Wayne Xin Zhao | Ji-Rong Wen

Lightweight fine-tuning is widely used as an important technique for efficiently adapting pre-trained language models (PLMs) to downstream tasks. Despite the reduction in trainable parameters, existing lightweight fine-tuning methods are found to be effective in low-resource settings but often fail in high-resource settings, leading to unreliable outcomes. This limitation can be attributed to inflexible strategies: they select the model parameters to be trained before fine-tuning and keep this selection unchanged, without taking into account the inherent variance of generalization ability in model components (i.e., feed-forward, attention layers) and potential changes during the fine-tuning process. In this paper, we introduce a simple but effective calibration for lightweight fine-tuning of PLMs based on the matrix’s stable rank according to both model components and the training process. We provide both theoretical analyses and experimental verification for the proposed calibration strategy. Considering efficiency, we further propose time-aware and structure-aware strategies to determine the most crucial time to commence the fine-tuning procedure and selectively apply parameter matrices for lightweight fine-tuning, respectively. Extensive experiments demonstrate the superiority of our proposed fine-tuning approach (average improvement 3.1 for GLUE score compared to lightweight fine-tuning method).
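
The stable rank referred to above is, by its standard definition, the squared Frobenius norm of a matrix divided by its squared spectral norm. The sketch below computes it in plain Python, estimating the spectral norm by power iteration. It illustrates only the quantity itself, not the paper’s calibration procedure; the function name, the iteration count, and the example matrices are assumptions.

```python
import math


def stable_rank(matrix, iterations=200):
    """Return ||A||_F^2 / ||A||_2^2 for a matrix given as a list of rows."""
    rows, cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0:
        return 0.0  # convention chosen here for the zero matrix

    # Power iteration on A^T A, starting from a uniform unit vector.
    # Caveat: a start vector orthogonal to the top singular direction
    # would underestimate the spectral norm.
    v = [1.0 / math.sqrt(cols)] * cols
    sigma_sq = 0.0
    for _ in range(iterations):
        av = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]
        w = [sum(matrix[i][j] * av[i] for i in range(rows))
             for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        sigma_sq = norm  # v has unit length, so this estimates sigma_max^2
        v = [x / norm for x in w]
    if sigma_sq == 0:
        raise ValueError("power iteration failed; try another start vector")
    return frob_sq / sigma_sq


# Hypothetical matrices: the identity has stable rank equal to its size,
# while a rank-one matrix has stable rank 1.
print(stable_rank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]))
print(stable_rank([[1.0, 2.0], [2.0, 4.0]]))
```

Unlike the ordinary rank, the stable rank changes smoothly as singular values shrink, which is why it is often used as a soft measure of effective dimensionality.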

pdf abs
Enhancing Phrase Representation by Information Bottleneck Guided Text Diffusion Process for Keyphrase Extraction
Yuanzhen Luo | Qingyu Zhou | Feng Zhou

Keyphrase extraction (KPE) is an important task in Natural Language Processing for many scenarios, which aims to extract keyphrases that are present in a given document. Many existing supervised methods treat KPE as sequential labeling, span-level classification, or generative tasks. However, these methods lack the ability to utilize keyphrase information, which may result in biased results. In this study, we propose Diff-KPE, which leverages the supervised Variational Information Bottleneck (VIB) to guide the text diffusion process for generating enhanced keyphrase representations. Diff-KPE first generates the desired keyphrase embeddings conditioned on the entire document and then injects the generated keyphrase embeddings into each phrase representation. A ranking network and VIB are then optimized together with rank loss and classification loss, respectively. This design of Diff-KPE allows us to rank each candidate phrase by utilizing both the information of keyphrases and the document. Experiments show that Diff-KPE outperforms existing KPE methods on a large open domain keyphrase extraction benchmark, OpenKP, and a scientific domain dataset, KP20K.

pdf abs
Enhancing Scientific Document Summarization with Research Community Perspective and Background Knowledge
Sudipta Singha Roy | Robert E. Mercer

Scientific paper summarization has been the focus of much recent research. Unlike previous research which summarizes only the paper in question, or which summarizes the paper and the papers that it references, or which summarizes the paper and the citing sentences from the papers that cite it, this work puts all three of these summarization techniques together. To accomplish this, we have, by utilizing the citation network, introduced a corpus for scientific document summarization that provides information about the document being summarized, the papers referenced by it, as well as the papers that have cited it. The proposed summarizer model utilizes the referenced articles as background information and citing articles to capture the impact of the scientific document on the research community. Another aspect of the proposed model is its ability to generate both the extractive and abstractive summaries in parallel. The parallel training helps the counterparts to improve their individual performance. Results have shown that the summaries are of high quality when considering the standard metrics.

pdf abs
Enhancing Semantics in Multimodal Chain of Thought via Soft Negative Sampling
Guangmin Zheng | Jin Wang | Xiaobing Zhou | Xuejie Zhang

Chain of thought (CoT) has proven useful for problems requiring complex reasoning. Many of these problems are both textual and multimodal. Given the inputs in different modalities, a model generates a rationale and then uses it to answer a question. Because of the hallucination issue, the generated soft negative rationales with high textual quality but illogical semantics do not always help improve answer accuracy. This study proposes a rationale generation method using soft negative sampling (SNSE-CoT) to mitigate hallucinations in multimodal CoT. Five methods were applied to generate soft negative samples that shared highly similar text but had different semantics from the original. Bidirectional margin loss (BML) was applied to introduce them into the traditional contrastive learning framework that involves only positive and negative samples. Extensive experiments on the ScienceQA dataset demonstrated the effectiveness of the proposed method. Code and data are released at https://github.com/zgMin/SNSE-CoT.

pdf abs
Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems
Bo-Han Lu | Yi-Hsuan Lin | Annie Lee | Richard Tzong-Han Tsai

Machine translation focuses mainly on high-resource languages (HRLs), while low-resource languages (LRLs) like Taiwanese Hokkien are relatively under-explored. The study aims to address this gap by developing a dual translation model between Taiwanese Hokkien and both Traditional Mandarin Chinese and English. We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Taiwanese Hokkien Han and Traditional Mandarin Chinese. Our comprehensive experiments involve translation tasks across various writing systems of Taiwanese Hokkien as well as between Taiwanese Hokkien and other HRLs. We find that the use of a limited monolingual corpus still further improves the model’s Taiwanese Hokkien capabilities. We then utilize our translation model to standardize all Taiwanese Hokkien writing systems into Hokkien Han, resulting in further performance improvements. Additionally, we introduce an evaluation method incorporating back-translation and GPT-4 to ensure reliable translation quality assessment even for LRLs. The study contributes to narrowing the resource gap for Taiwanese Hokkien and empirically investigates the advantages and limitations of pre-training and fine-tuning based on LLaMA 2.

Large language models (LLMs) with prompting have achieved encouraging results on many natural language processing (NLP) tasks based on task-tailored promptings. Text-to-SQL is a critical task that generates SQL queries from natural language questions. However, prompting LLMs has not shown superior performance on the Text-to-SQL task due to the absence of tailored promptings. In this work, we propose three promptings specifically designed for Text-to-SQL: SL-prompt, CC-prompt, and SL+CC prompt. SL-prompt is designed to guide LLMs to identify relevant tables; CC-prompt directs LLMs to generate SQL clause by clause; and SL+CC prompt is proposed to combine the strengths of the two promptings above. The three prompting strategies yield three solutions for Text-to-SQL. Then, another prompting strategy, the RS-prompt, is proposed to direct LLMs to select the best answer from the results of the solutions. We conducted extensive experiments, and experimental results show that our method achieved an execution accuracy of 86.2% and a test-suite accuracy of 76.9%, which are 1.1% and 2.7% higher than the current state-of-the-art Text-to-SQL methods, respectively. The results confirmed that the proposed promptings enhanced the capabilities of LLMs on Text-to-SQL. Experimental results also show that the granularity of schema linking and the order of clause generation have a great impact on performance, aspects that have received little attention in previous research.

pdf abs
Enhancing Translation Ability of Large Language Models by Leveraging Task-Related Layers
Pei Cheng | Xiayang Shi | Yinlin Li

Fine-tuning Large Language Models (LLMs) for machine translation is effective but costly. It also increases the risk of overfitting and catastrophic forgetting, especially when training data is limited. To tackle these challenges, we propose a novel method that involves adjusting task-related layers in large models to better harness their machine translation capabilities. This method aims to retain the model’s knowledge on other tasks while optimizing performance on translation tasks. By revealing the structure and characteristics of attention weights through singular value decomposition (SVD), we can make fine adjustments to specific layers, leveraging the model’s potential for more accurate and efficient translations. Our method not only addresses computational resource consumption and catastrophic forgetting but also offers a new perspective on utilizing the capabilities of large models effectively. Experimental validation shows that adjusting task-related layers significantly improves performance on translation tasks while maintaining stability and accuracy on other tasks. This finding provides valuable insights for fine-tuning and applying large models, advancing the field of machine translation.

pdf abs
Enhancing Unrestricted Cross-Document Event Coreference with Graph Reconstruction Networks
Loic de Langhe | Orphee de Clercq | Veronique Hoste

Event Coreference Resolution remains a challenging discourse-oriented task within the domain of Natural Language Processing. In this paper we propose a methodology where we combine traditional mention-pair coreference models with a lightweight and modular graph reconstruction algorithm. We show that building graph models on top of existing mention-pair models leads to improved performance both for a wide range of baseline mention-pair algorithms and for a recently developed state-of-the-art model, at virtually no added computational cost. Moreover, additional experiments seem to indicate that our method is highly robust in low-data settings and that its performance scales with increases in performance for the underlying mention-pair models.

pdf abs
Enhancing Writing Proficiency Classification in Developmental Education: The Quest for Accuracy
Miguel Da Corte | Jorge Baptista

Developmental Education (DevEd) courses align students’ college-readiness skills with higher education literacy demands. These courses often use automated assessment tools like Accuplacer for student placement. Existing literature raises concerns about these exams’ accuracy and placement precision due to their narrow representation of the writing process. These concerns warrant further attention within the domain of automatic placement systems, particularly in the establishment of a reference corpus of annotated essays for these systems’ machine/deep learning. This study aims at an enhanced annotation procedure to assess college students’ writing patterns more accurately. It examines the efficacy of machine-learning-based DevEd placement, contrasting Accuplacer’s classification of 100 college-intending students’ essays into two levels (Level 1 and 2) against that of 6 human raters. The classification task encompassed the assessment of the 6 textual criteria currently used by Accuplacer: mechanical conventions, sentence variety & style, idea development & support, organization & structure, purpose & focus, and critical thinking. Results revealed low inter-rater agreement, both on the individual criteria and the overall classification, suggesting human assessment of writing proficiency can be inconsistent in this context. To achieve a more accurate determination of writing proficiency and improve DevEd placement, more robust classification methods are thus required.

Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their reasoning often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. These models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming at improving the zero-shot chain-of-thought reasoning ability of large language models, we propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic, particularly Reductio ad Absurdum, to systematically verify and rectify the reasoning processes step by step. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic. The implementation code for LoT can be accessed at: https://github.com/xf-zhao/LoT.

Enough Is Enough! a Case Study on the Effect of Data Size for Evaluation Using Universal Dependencies
Rob van der Goot | Zoey Liu | Max Müller-Eberstein

When creating a new dataset for evaluation, one of the first considerations is the size of the dataset. If our evaluation data is too small, we risk making unsupported claims based on the results on such data. If, on the other hand, the data is too large, we waste valuable annotation time and costs that could have been used to widen the scope of our evaluation (i.e. annotate for more domains/languages). Hence, we investigate the effect of the size and a variety of sampling strategies of evaluation data to optimize annotation efforts, using dependency parsing as a test case. We show that for in-language in-domain datasets, 5,000 tokens is enough to obtain a reliable ranking of different parsers; especially if the data is distant enough from the training split (otherwise, we recommend 10,000). In cross-domain setups, the same amounts are required, but in cross-lingual setups much less (2,000 tokens) is enough.
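The question of how much evaluation data is enough can be explored by checking how often a small random sample reproduces the parser ranking obtained on the full data. The sketch below is a simplified illustration under stated assumptions: it samples individual tokens uniformly (the paper studies several sampling strategies that are not reproduced here), it scores each parser by the share of tokens it gets right, as an attachment score does, and the names `ranking_agreement`, `parser_a`, and `parser_b` as well as the toy data are hypothetical.

```python
import random


def accuracy(flags):
    """Share of tokens marked as correctly parsed."""
    return sum(flags) / len(flags)


def ranking(scores):
    """Parser names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def ranking_agreement(per_token_correct, sample_size, trials=200, seed=0):
    """Fraction of random token samples whose ranking matches the full-data ranking."""
    rng = random.Random(seed)
    n_tokens = len(next(iter(per_token_correct.values())))
    full = ranking({name: accuracy(f) for name, f in per_token_correct.items()})

    matches = 0
    for _ in range(trials):
        idx = rng.sample(range(n_tokens), sample_size)
        sampled = ranking(
            {
                name: accuracy([flags[i] for i in idx])
                for name, flags in per_token_correct.items()
            }
        )
        matches += sampled == full
    return matches / trials


# Hypothetical toy data: per-token correctness for two parsers on 1,000 tokens.
toy = {
    "parser_a": [i % 10 != 0 for i in range(1000)],  # 90% correct
    "parser_b": [i % 7 != 0 for i in range(1000)],   # about 86% correct
}

for size in (20, 100, 500, 1000):
    print(size, ranking_agreement(toy, sample_size=size))
```

Agreement rising toward 1.0 as the sample grows is the kind of signal one could use to decide when further annotation stops changing the conclusions.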

Enriching a Time-Domain Astrophysics Corpus with Named Entity, Coreference and Astrophysical Relationship Annotations
Atilla Kaan Alkan | Felix Grezes | Cyril Grouin | Fabian Schussler | Pierre Zweigenbaum

Interest in Astrophysical Natural Language Processing (NLP) has increased recently, fueled by the development of specialized language models for information extraction. However, the scarcity of annotated resources for this domain is still a significant challenge. Most existing corpora are limited to Named Entity Recognition (NER) tasks, leaving a gap in resource diversity. To address this gap and facilitate a broader spectrum of NLP research in astrophysics, we introduce astroECR, an extension of our previously built Time-Domain Astrophysics Corpus (TDAC). Our contributions involve expanding it to cover named entities, coreferences, annotations related to astrophysical relationships, and normalizing celestial object names. We showcase practical utility through baseline models for four NLP tasks and provide the research community access to our corpus, code, and models.

Enriching Word Usage Graphs with Cluster Definitions
Andrey Kutuzov | Mariia Fedorova | Dominik Schlechtweg | Nikolay Arefyev

We present a dataset of word usage graphs (WUGs), where the existing WUGs for multiple languages are enriched with cluster labels functioning as sense definitions. They are generated from scratch by fine-tuned encoder-decoder language models. The conducted human evaluation has shown that these definitions match the existing clusters in WUGs better than the definitions chosen from WordNet by two baseline systems. At the same time, the method is straightforward to use and easy to extend to new languages. The resulting enriched datasets can be extremely helpful for moving on to explainable semantic change modeling.

Ensembles of Hybrid and End-to-End Speech Recognition.
Aditya Kamlesh Parikh | Louis ten Bosch | Henk van den Heuvel

We propose a method to combine the hybrid Kaldi-based Automatic Speech Recognition (ASR) system with the end-to-end wav2vec 2.0 XLS-R ASR using confidence measures. Our research is focused on the low-resource Irish language. Given the limited available open-source resources, neither the standalone hybrid ASR nor the end-to-end ASR system can achieve optimal performance. By applying the Recognizer Output Voting Error Reduction (ROVER) technique, we illustrate how ensemble learning could facilitate mutual error correction between both ASR systems. This paper outlines the strategies for merging the hybrid Kaldi ASR model and the end-to-end XLS-R model with the help of confidence scores. Although contemporary state-of-the-art end-to-end ASR models face challenges related to prediction overconfidence, we utilize Renyi’s entropy-based confidence approach, tuned with temperature scaling, to align it with the Kaldi ASR confidence. Although there was no significant difference in the Word Error Rate (WER) between the hybrid and end-to-end ASR, we could achieve a notable reduction in WER after ensembling through ROVER. This resulted in an almost 14% Word Error Rate Reduction (WERR) on our primary test set and an approximately 20% WERR on other noisy and imbalanced test data.
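To make the two quantities in this abstract concrete, the sketch below shows one possible form of an entropy-based confidence score with temperature scaling, and the arithmetic behind a relative Word Error Rate Reduction. It is an illustration under assumptions, not the authors' implementation: the normalisation `1 - H / log(V)`, the default `alpha`, the function names, and all numbers are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renyi_confidence(logits, alpha=0.5, temperature=1.0):
    """Confidence in [0, 1] derived from the Renyi entropy of order alpha (alpha != 1).

    H_alpha(p) = log(sum_i p_i ** alpha) / (1 - alpha), with maximum log(V)
    for a uniform distribution over V outputs.
    """
    probs = softmax(logits, temperature)
    entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return 1.0 - entropy / math.log(len(probs))


def werr(baseline_wer, ensemble_wer):
    """Relative Word Error Rate Reduction of the ensemble over a baseline."""
    return (baseline_wer - ensemble_wer) / baseline_wer


# Hypothetical frame-level logits over a 3-symbol vocabulary.
logits = [5.0, 1.0, 0.0]
print(renyi_confidence(logits, temperature=1.0))
print(renyi_confidence(logits, temperature=3.0))  # flatter, so less confident

# Hypothetical WERs: a drop from 20.0% to 17.2% is a 14% relative reduction.
print(werr(0.200, 0.172))
```

Raising the temperature lowers the confidence of an overconfident model, which is one way its scores could be brought onto a scale comparable with those of another recognizer before voting.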

In recent years, Large Language Models (LLMs) have demonstrated exceptional performance in code-generation tasks. However, under enterprise scenarios where private APIs are pre-built, general LLMs often fail to meet expectations. Existing approaches are confronted with drawbacks of high resource consumption and inadequate handling of multi-API tasks. To address these challenges, we propose EpiGEN, an Efficient multi-Api code GENeration framework under enterprise scenario. It consists of three core modules: Task Decomposition Module (TDM), API Retrieval Module (ARM), and Code Generation Module (CGM), in which Langchain played an important role. Through a series of experiments, EpiGEN shows good acceptability and readability, compared to fully fine-tuned LLM with a larger number of parameters. Particularly, in medium and hard level tasks, the performance of EpiGEN on a single-GPU machine even surpasses that of a fully fine-tuned LLM that requires multi-GPU configuration. Generally, EpiGEN is model-size agnostic, facilitating a balance between the performance of code generation and computational requirements.

EpLSA: Synergy of Expert-prefix Mixtures and Task-Oriented Latent Space Adaptation for Diverse Generative Reasoning
Fujun Zhang | Xiangdong Su | Jiang Li | Rong Yan | Guanglai Gao

Existing models for diverse generative reasoning still struggle to generate multiple unique and plausible results. Through an in-depth examination, we argue that it is critical to leverage a mixture of experts as prefixes to enhance the diversity of generated results and make task-oriented adaptation in the latent space of the generation models to improve the quality of the responses. To this end, we propose EpLSA, an innovative model based on the synergy of expert-prefix mixtures and task-oriented latent space adaptation for diverse generative reasoning. Specifically, we use expert-prefix mixtures to encourage the model to create multiple responses with different semantics and design a loss function to address the problem that the expert-prefixes interfere with the semantics. Meanwhile, we design a task-oriented adaptation block to make the pre-trained encoder within the generation model more effectively adapted to the pre-trained decoder in the latent space, thus further improving the quality of the generated text. Extensive experiments on three different types of generative reasoning tasks demonstrate that EpLSA outperforms existing baseline models in terms of both the quality and diversity of the generated outputs. Our code is publicly available at https://github.com/IMU-MachineLearningSXD/EpLSA.

EPOQUE: An English-Persian Quality Estimation Dataset
Mohammed Hossein Jafari Harandi | Fatemeh Azadi | Mohammad Javad Dousti | Heshaam Faili

Translation quality estimation (QE) is an important component in real-world machine translation applications. Unfortunately, human-labeled QE datasets, which play an important role in developing and assessing QE models, are only available for limited language pairs. In this paper, we present the first English-Persian QE dataset, called EPOQUE, which has manually annotated direct assessment labels. EPOQUE contains 1000 sentences translated from English to Persian and annotated by three human annotators. It is publicly available, and thus can be used as a zero-shot test set, or for other scenarios in future work. We also evaluate and report the performance of two state-of-the-art QE models, i.e., Transquest and CometKiwi, as baselines on our dataset. Furthermore, our experiments show that using a small subset of the proposed dataset containing 300 sentences to fine-tune Transquest can improve its performance by more than 8% in terms of the Pearson correlation with a held-out test set.
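QE models are commonly scored by the Pearson correlation between their predictions and human direct assessment (DA) labels. The sketch below illustrates that computation. Averaging the three annotators' scores into one gold label per sentence is an assumption made here for illustration, and the function names and all scores are hypothetical, not taken from EPOQUE.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_da(annotations):
    """Average the annotators' DA scores for each sentence (an assumed aggregation)."""
    return [sum(scores) / len(scores) for scores in annotations]


# Hypothetical DA scores from three annotators for five translations.
human = mean_da([
    [80, 75, 85],
    [40, 50, 45],
    [95, 90, 92],
    [20, 30, 25],
    [60, 65, 70],
])
# Hypothetical QE model predictions for the same five translations.
predicted = [0.71, 0.48, 0.88, 0.35, 0.55]

print(pearson(predicted, human))
```

Because Pearson correlation is invariant to linear rescaling, the model's scores do not need to share the 0 to 100 range of the human labels.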

EROS: Entity-Driven Controlled Policy Document Summarization
Joykirat Singh | Sehban Fazili | Rohan Jain | Md. Shad Akhtar

Privacy policy documents have a crucial role in educating individuals about the collection, usage, and protection of users’ personal data by organizations. However, they are notorious for their lengthy, complex, and convoluted language, especially where privacy-related entities are involved. Hence, they pose a significant challenge to users who attempt to comprehend an organization’s data usage policy. In this paper, we propose to enhance the interpretability and readability of policy documents by using controlled abstractive summarization – we constrain the generated summaries to include critical privacy-related entities (e.g., data and medium) and the organization’s rationale (e.g., target and reason) in collecting those entities. To achieve this, we develop PD-Sum, a policy-document summarization dataset with marked privacy-related entity labels. Our proposed model, EROS, identifies critical entities through a span-based entity extraction model and employs them to control the information content of the summaries using proximal policy optimization (PPO). Comparison shows encouraging improvement over various baselines. Furthermore, we furnish qualitative and human evaluations to establish the efficacy of EROS.

Error Analysis of NLP Models and Non-Native Speakers of English Identifying Sarcasm in Reddit Comments
Oliver Cakebread-Andrews | Le An Ha | Ingo Frommholz | Burcu Can

This paper summarises the differences and similarities found between humans and three natural language processing models when attempting to identify whether English online comments are sarcastic or not. Three models were used to analyse 300 comments from the FigLang 2020 Reddit Dataset, with and without context. The same 300 comments were also given to 39 non-native speakers of English and the results were compared. The aim was to find whether there were any results that could be applied to English as a Foreign Language (EFL) teaching. The results showed that there were similarities between the models and non-native speakers, in particular the logistic regression model. They also highlighted weaknesses with both non-native speakers and the models in detecting sarcasm when the comments included political topics or were phrased as questions. This has potential implications for how the EFL teaching industry could implement the results of error analysis of NLP models in teaching practices.

Error-Robust Retrieval for Chinese Spelling Check
Xunjian Yin | Xinyu Hu | Jin Jiang | Xiaojun Wan

Chinese Spelling Check (CSC) aims to detect and correct error tokens in Chinese contexts, which has a wide range of applications. However, it is confronted with the challenges of insufficient annotated data and the issue that previous methods may actually not fully leverage the existing datasets. In this paper, we introduce our plug-and-play retrieval method with error-robust information for Chinese Spelling Check (RERIC), which can be directly applied to existing CSC models. The datastore for retrieval is built completely based on the training data, with elaborate designs according to the characteristics of CSC. Specifically, we employ multimodal representations that fuse phonetic, morphologic, and contextual information in the calculation of query and key during retrieval to enhance robustness against potential errors. Furthermore, in order to better judge the retrieved candidates, the n-gram surrounding the token to be checked is regarded as the value and utilized for specific reranking. The experiment results on the SIGHAN benchmarks demonstrate that our proposed method achieves substantial improvements over existing work.

EsCoLA: Spanish Corpus of Linguistic Acceptability
Nuria Bel | Marta Punsola | Valle Ruíz-Fernández

Acceptability is one of the General Language Understanding Evaluation Benchmark (GLUE) probing tasks proposed to assess the linguistic capabilities acquired by a deep-learning transformer-based language model (LM). In this paper, we introduce the Spanish Corpus of Linguistic Acceptability EsCoLA. EsCoLA has been developed following the example of other linguistic acceptability data sets for English, Italian, Norwegian or Russian, with the aim of having a complete GLUE benchmark for Spanish. EsCoLA consists of 11,174 sentences and their acceptability judgements as found in well-known Spanish reference grammars. Additionally, all sentences have been annotated with the class of linguistic phenomenon the sentence is an example of, also following previous practices. We also provide as task baselines the results of fine-tuning four different language models with this data set and the results of a human annotation experiment. Results are also analyzed and commented to guide future research. EsCoLA is released under a CC-BY 4.0 license and freely available at https://doi.org/10.34810/data1138.

ESCP: Enhancing Emotion Recognition in Conversation with Speech and Contextual Prefixes
Xiujuan Xu | Xiaoxiao Shi | Zhehuan Zhao | Yu Liu

Emotion Recognition in Conversation (ERC) aims to analyze the speaker’s emotional state in a conversation. Fully mining the information in multimodal and historical utterances plays a crucial role in the performance of the model. However, recent works in ERC focus on historical utterances modeling and generally concatenate the multimodal features directly, which neglects mining deep multimodal information and brings redundancy at the same time. To address the shortcomings of existing models, we propose a novel model, termed Enhancing Emotion Recognition in Conversation with Speech and Contextual Prefixes (ESCP). ESCP employs a directed acyclic graph (DAG) to model historical utterances in a conversation and incorporates a contextual prefix containing the sentiment and semantics of historical utterances. By adding speech and contextual prefixes, the inter- and intra-modal emotion information is efficiently modeled using the prior knowledge of the large-scale pre-trained model. Experiments conducted on several public benchmarks demonstrate that the proposed approach achieves state-of-the-art (SOTA) performances. These results affirm the effectiveness of the novel ESCP model and underscore the significance of incorporating speech and contextual prefixes to guide the pre-trained model.

ESDM: Early Sensing Depression Model in Social Media Streams
Bichen Wang | Yuzhe Zi | Yanyan Zhao | Pengfei Deng | Bing Qin

Depression impacts millions worldwide, with increasing efforts to use social media data for early detection and intervention. Traditional Risk Detection (TRD) uses a user’s complete posting history for predictions, while Early Risk Detection (ERD) seeks early detection in a user’s posting history, emphasizing the importance of prediction earliness. However, ERD remains relatively underexplored due to challenges in balancing accuracy and earliness, especially with evolving partial data. To address this, we introduce the Early Sensing Depression Model (ESDM), which comprises two modules, the classification with partial information module (CPI) and the decision for classification moment module (DMC), alongside an early detection loss function. Experiments show ESDM outperforms benchmarks in both earliness and accuracy.

Esposito: An English-Persian Scientific Parallel Corpus for Machine Translation
Mersad Esalati | Mohammad Javad Dousti | Heshaam Faili

Neural machine translation requires a large number of parallel sentences along with in-domain parallel data to attain the best results. Nevertheless, no scientific parallel corpus for the English-Persian language pair is available. In this paper, a parallel corpus called Esposito is introduced, which contains 3.5 million parallel sentences in the scientific domain for the English-Persian language pair. In addition, we present a manually validated scientific test set that might serve as a baseline for future studies. We show that a system trained using Esposito along with other publicly available data improves the baseline on average by 7.6 and 8.4 BLEU scores for En->Fa and Fa->En directions, respectively. Additionally, domain analysis using the 5-gram KenLM model revealed notable distinctions between our parallel corpus and the existing generic parallel corpus. This dataset will be available to the public upon the acceptance of the paper.

Estimating Lexical Complexity from Document-Level Distributions
Sondre Wold | Petter Mæhlum | Oddbjørn Hove

Existing methods for complexity estimation are typically developed for entire documents. This limitation in scope makes them inapplicable for shorter pieces of text, such as health assessment tools. These typically consist of lists of independent sentences, all of which are too short for existing methods to apply. The choice of wording in these assessment tools is crucial, as both the cognitive capacity and the linguistic competency of the intended patient groups could vary substantially. As a first step towards creating better tools for supporting health practitioners, we develop a two-step approach for estimating lexical complexity that does not rely on any pre-annotated data. We implement our approach for the Norwegian language and verify its effectiveness using statistical testing and a qualitative evaluation of samples from real assessment tools. We also investigate the relationship between our complexity measure and certain features typically associated with complexity in the literature, such as word length, frequency, and the number of syllables.

Estimating the Causal Effects of Natural Logic Features in Transformer-Based NLI Models
Julia Rozanova | Marco Valentino | André Freitas

Rigorous evaluation of the causal effects of semantic features on language model predictions can be hard to achieve for natural language reasoning problems. However, this is such a desirable form of analysis from both an interpretability and model evaluation perspective, that it is valuable to investigate specific patterns of reasoning with enough structure and regularity to identify and quantify systematic reasoning failures in widely-used models. In this vein, we pick a portion of the NLI task for which an explicit causal diagram can be systematically constructed: the case where across two sentences (the premise and hypothesis), two related words/terms occur in a shared context. In this work, we apply causal effect estimation strategies to measure the effect of context interventions (whose effect on the entailment label is mediated by the semantic monotonicity characteristic) and interventions on the inserted word-pair (whose effect on the entailment label is mediated by the relation between these words). Extending related work on causal analysis of NLP models in different settings, we perform an extensive interventional study on the NLI task to investigate robustness to irrelevant changes and sensitivity to impactful changes of Transformers. The results strongly bolster the fact that similar benchmark accuracy scores may be observed for models that exhibit very different behaviour. Moreover, our methodology reinforces previously suspected biases from a causal perspective, including biases in favour of upward-monotone contexts and ignoring the effects of negation markers.

Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language We Prompt Them in
Utkarsh Agarwal | Kumar Tanmay | Aditi Khandelwal | Monojit Choudhury

Ethical reasoning is a crucial skill for Large Language Models (LLMs). However, moral values are not universal, but rather influenced by language and culture. This paper explores how three prominent LLMs – GPT-4, ChatGPT, and Llama2Chat-70B – perform ethical reasoning in different languages and whether their moral judgement depends on the language in which they are prompted. We extend the study of ethical reasoning of LLMs by (CITATION) to a multilingual setup following their framework of probing LLMs with ethical dilemmas and policies from three branches of normative ethics: deontology, virtue, and consequentialism. We experiment with six languages: English, Spanish, Russian, Chinese, Hindi, and Swahili. We find that GPT-4 is the most consistent and unbiased ethical reasoner across languages, while ChatGPT and Llama2Chat-70B show significant moral value bias when we move to languages other than English. Interestingly, the nature of this bias varies significantly across languages for all LLMs, including GPT-4.

Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassing a wide array of scripts, and are imbued with profound religious and cultural significance. This paper introduces EthioLLM – multilingual large language models for five Ethiopian languages (Amharic, Ge’ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark – a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual language models, new benchmark datasets for various downstream tasks, and task-specific fine-tuned language models and discuss the performance of the models. Our dataset and models are available at the https://huggingface.co/EthioNLP repository.

The European Language Grid (ELG) is a cloud platform for the whole European Language Technology community. While the EU project that developed the platform successfully concluded in June 2022, the ELG initiative has continued. This article provides a description of the current state of ELG in terms of user adoption and number of language resources and technologies available in early 2024. It also provides an overview of the various activities with regard to ELG since the end of the project and since the publication of the ELG book, especially the co-authors’ attempt to integrate the ELG platform into various data space initiatives. The article also provides an overview of the Digital Language Equality (DLE) dashboard and the current state of DLE in Europe.

Evaluating Automatic Subtitling: Correlating Post-editing Effort and Automatic Metrics
Alina Karakanta | Mauro Cettolo | Matteo Negri | Luisa Bentivogli

Systems that automatically generate subtitles from video are gradually entering subtitling workflows, both for supporting subtitlers and for accessibility purposes. Even though robust metrics are essential for evaluating the quality of automatically-generated subtitles and for estimating potential productivity gains, there is limited research on whether existing metrics, some of which directly borrowed from machine translation (MT) evaluation, can fulfil such purposes. This paper investigates how well such MT metrics correlate with measures of post-editing (PE) effort in automatic subtitling. To this aim, we collect and publicly release a new corpus containing product-, process- and participant-based data from post-editing automatic subtitles in two language pairs (en→de,it). We find that different types of metrics correlate with different aspects of PE effort. Specifically, edit distance metrics have high correlation with technical and temporal effort, while neural metrics correlate well with PE speed.

Evaluating ChatGPT against Functionality Tests for Hate Speech Detection
Mithun Das | Saurabh Kumar Pandey | Animesh Mukherjee

Large language models like ChatGPT have recently shown great promise in performing several tasks, including hate speech detection. However, it is crucial to comprehend the limitations of these models to build robust hate speech detection systems. To bridge this gap, our study aims to evaluate the strengths and weaknesses of the ChatGPT model in detecting hate speech at a granular level across 11 languages. Our evaluation employs a series of functionality tests that reveal various intricate failures of the model which aggregate metrics like macro F1 or accuracy are not able to uncover. In addition, we investigate the influence of complex emotions, such as the use of emojis in hate speech, on the performance of the ChatGPT model. Our analysis highlights the shortcomings of generative models in detecting certain types of hate speech and the need for further research and improvements in the workings of these models.

Evaluating Code-Switching Translation with Large Language Models
Muhammad Huzaifah | Weihua Zheng | Nattapol Chanpaisit | Kui Wu

Recent advances in large language models (LLMs) have shown they can match or surpass finetuned models on many natural language processing tasks. Currently, more studies are being carried out to assess whether this performance carries over across different languages. In this paper, we present a thorough evaluation of LLMs for the less well-researched code-switching translation setting, where inputs include a mixture of different languages. We benchmark the performance of six state-of-the-art LLMs across seven datasets, with GPT-4 and GPT-3.5 displaying strong ability relative to supervised translation models and commercial engines. GPT-4 was also found to be particularly robust against different code-switching conditions. Several methods to further improve code-switching translation are proposed including leveraging in-context learning and pivot translation. Through our code-switching experiments, we argue that LLMs show promising ability for cross-lingual understanding.

Evaluating Gender Bias of Pre-trained Language Models in Natural Language Inference by Considering All Labels
Panatchakorn Anantaprayoon | Masahiro Kaneko | Naoaki Okazaki

Discriminatory gender biases have been found in Pre-trained Language Models (PLMs) for multiple languages. In Natural Language Inference (NLI), existing bias evaluation methods have focused on the prediction results of one specific label out of three labels, such as neutral. However, such evaluation methods can be inaccurate since unique biased inferences are associated with unique prediction labels. Addressing this limitation, we propose a bias evaluation method for PLMs, called NLI-CoAL, which considers all three labels of the NLI task. First, we create three evaluation data groups that represent different types of biases. Then, we define a bias measure based on the corresponding label output of each data group. In the experiments, we introduce a meta-evaluation technique for NLI bias measures and use it to confirm that our bias measure can distinguish biased incorrect inferences from non-biased incorrect inferences better than the baseline, resulting in a more accurate bias evaluation. We create the datasets in English, Japanese, and Chinese, and successfully validate the compatibility of our bias measure across multiple languages. Lastly, we observe the bias tendencies in PLMs of different languages. To our knowledge, we are the first to construct evaluation datasets and measure PLMs’ bias from NLI in Japanese and Chinese.

Modern Large Language Models (LLMs) have showcased remarkable prowess in various tasks necessitating sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed, where these models underperform in seemingly elementary tasks like relation extraction and event extraction due to two issues in conventional evaluation: (1) the imprecision of existing evaluation metrics, which struggle to effectively gauge semantic consistency between model outputs and ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, resulting in underestimated LLM performances. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. This method innovatively utilizes LLMs, fine-tuned through subjective question correction data, to refine matching between model outputs and golden labels. Additionally, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches golden labels, addressing benchmark incompleteness by acknowledging correct yet previously omitted answers. Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics. Utilizing SQC-Score, we conduct a comprehensive evaluation of the state-of-the-art LLMs and provide insights for future research for information extraction. Dataset and associated codes can be accessed at https://github.com/THU-KEG/SQC-Score.

Evaluating Performance of Pre-trained Word Embeddings on Assamese, a Low-resource Language
Dhrubajyoti Pathak | Sukumar Nandi | Priyankoo Sarmah

Word embeddings and language models are the building blocks of modern Deep Neural Network-based Natural Language Processing. They are extensively explored in high-resource languages and provide state-of-the-art (SOTA) performance for a wide range of downstream tasks. Nevertheless, these word embeddings are not explored in languages such as Assamese, where resources are limited. Furthermore, there has been limited study into the performance evaluation of these word embeddings for low-resource languages in downstream tasks. In this research, we explore the current state of Assamese pre-trained word embeddings. We evaluate these embeddings’ performance on sequence labeling tasks such as Part-of-Speech tagging and Named Entity Recognition. In order to assess the efficiency of the embeddings, experiments are performed utilizing both ensemble and individual word embedding approaches. The ensembling approach that uses three word embeddings outperforms the others. In the paper, the outcomes of the investigations are described. The results of this comparative performance evaluation may assist researchers in choosing an Assamese pre-trained word embedding for subsequent tasks.

Evaluating Prompting Strategies for Grammatical Error Correction Based on Language Proficiency
Min Zeng | Jiexin Kuang | Mengyang Qiu | Jayoung Song | Jungyeul Park

This paper proposes an analysis of prompting strategies for grammatical error correction (GEC) with selected large language models (LLMs) based on language proficiency. GEC using generative LLMs has been known for overcorrection where results obtain higher recall measures than precision measures. The writing examples of English language learners may be different from those of native speakers. Given that there are significant differences in second language (L2) learners’ error types by their proficiency levels, this paper attempts to reduce overcorrection by examining the interaction between LLM performance and L2 language proficiency. Our method focuses on zero-shot and few-shot prompting and fine-tuning models for GEC for learners of English as a foreign language at different proficiency levels. We investigate GEC results and find that overcorrection happens primarily in advanced language learners’ writing (proficiency C) rather than proficiency A (a beginner level) and proficiency B (an intermediate level). Fine-tuned LLMs, and even few-shot prompting with writing examples of English learners, actually tend to exhibit decreased recall measures. To make our claim concrete, we conduct a comprehensive examination of GEC outcomes and their evaluation results based on language proficiency.

Deep learning models have performed well on many NLP tasks. However, their internal mechanisms are typically difficult for humans to understand. The development of methods to explain models has become a key issue in the reliability of deep learning models in many important applications. Various saliency explanation methods, which give each input feature a score proportional to its contribution to the output, have been proposed to determine the part of the input which a model values most. Despite a considerable body of work on the evaluation of saliency methods, whether the results of various evaluation metrics agree with human cognition remains an open question. In this study, we propose a new human-based method to evaluate saliency methods in NLP by crowdsourcing. We recruited 800 crowd workers and empirically evaluated seven saliency methods on two datasets with the proposed method. We analyzed the performance of saliency methods, compared our results with existing automated evaluation methods, and identified notable differences between NLP and computer vision (CV) fields when using saliency methods. The instance-level data of our crowdsourced experiments and the code to reproduce the explanations are available at https://github.com/xtlu/lreccoling_evaluation.

pdf abs
Evaluating Self-Supervised Speech Representations for Indigenous American Languages
Chih-Chen Chen | William Chen | Rodolfo Joel Zevallos | John E. Ortega

The application of self-supervision to speech representation learning has garnered significant interest in recent years, due to its scalability to large amounts of unlabeled data. However, much progress, both in terms of pre-training and downstream evaluation, has remained concentrated in monolingual models that only consider English. Few models consider other languages, and even fewer consider indigenous ones. In this work, we benchmark the efficacy of large SSL models on low-resource ASR for 6 indigenous American languages: Quechua, Guarani, Bribri, Kotiria, Wa’ikhana, and Totonac. Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.

pdf abs
Evaluating Shortest Edit Script Methods for Contextual Lemmatization
Olia Toporkov | Rodrigo Agerri

Modern contextual lemmatizers often rely on automatically induced Shortest Edit Scripts (SES), namely, the number of edit operations to transform a word form into its lemma. In fact, different methods of computing SES have been proposed as an integral component in the architecture of several state-of-the-art contextual lemmatizers currently available. However, previous work has not investigated the direct impact of SES on the final lemmatization performance. In this paper we address this issue by focusing on lemmatization as a token classification task where the only input that the model receives is the word-label pairs in context, where the labels correspond to previously induced SES. Thus, by modifying in our lemmatization system only the SES labels that the model needs to learn, we may then objectively conclude which SES representation produces the best lemmatization results. We experiment with seven languages of different morphological complexity, namely, English, Spanish, Basque, Russian, Czech, Turkish and Polish, using multilingual and language-specific pre-trained masked language encoder-only models as a backbone to build our lemmatizers. Comprehensive experimental results, both in- and out-of-domain, indicate that computing the casing and edit operations separately is beneficial overall, but much more clearly for languages with highly inflected morphology. Notably, multilingual pre-trained language models consistently outperform their language-specific counterparts in every evaluation setting.
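To illustrate what an edit-script label with separately handled casing might look like, here is a minimal Python sketch. It is not the authors' method: the function names (`casing_label`, `edit_script`, `apply_script`) are hypothetical, and the script is derived with the standard library's `difflib.SequenceMatcher`, which yields a valid edit script but is not guaranteed to yield the shortest one.

```python
from difflib import SequenceMatcher


def casing_label(form: str) -> str:
    """Describe the casing of a word form separately from its edit script."""
    if form.islower():
        return "lower"
    if form.isupper():
        return "upper"
    if form[:1].isupper() and form[1:].islower():
        return "title"
    return "mixed"


def edit_script(form: str, lemma: str) -> list:
    """Return the non-'equal' operations that turn the lowercased form into the lowercased lemma.

    Each operation is (tag, start, end, replacement), with positions indexing the lowercased form.
    Assumption: difflib stands in for a true shortest-edit-script algorithm.
    """
    src, tgt = form.lower(), lemma.lower()
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, tgt[j1:j2]))
    return ops


def apply_script(form: str, ops: list) -> str:
    """Apply an edit script to the lowercased form and return the resulting string."""
    src = form.lower()
    pieces = []
    pos = 0
    for _tag, i1, i2, replacement in ops:
        pieces.append(src[pos:i1])   # copy the untouched stretch
        pieces.append(replacement)   # empty string for deletions
        pos = i2
    pieces.append(src[pos:])
    return "".join(pieces)


if __name__ == "__main__":
    for form, lemma in [("Running", "run"), ("cats", "cat"), ("mice", "mouse")]:
        print(form, casing_label(form), edit_script(form, lemma))
```

In this sketch the classifier label for a token would be the pair (casing label, edit script). The absolute character positions used here are a simplification; a practical label set would need a position-independent encoding so that the same label applies to many word forms.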

pdf abs
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model
Siyang Wang | Eva Szekely

Recent advances in generative language modeling applied to discrete speech tokens have presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-speech noises or hallucination. As the adoption of this innovative paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM, through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, and spontaneous behaviour. Our results highlight the model’s strength in generating varied prosody and spontaneous outputs. It is also rated higher in naturalness and context appropriateness in listening tests compared to a conventional TTS. However, the model’s performance in intelligibility and speaker consistency lags behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.

pdf abs
Evaluating the Efficacy of Large Acoustic Model for Documenting Non-Orthographic Tribal Languages in India
Tonmoy Rajkhowa | Amartya Roy Chowdhury | Hrishikesh Ravindra Karande | S. R. Mahadeva Prasanna

Pre-trained Large Acoustic Models (LAMs), when fine-tuned, have largely been shown to improve performance in various tasks related to spoken language technologies. However, they have mostly been evaluated on datasets that contain English or other widely spoken languages, and their potential for novel under-resourced languages is not fully known. In this work, four novel under-resourced tribal languages that do not have a standard writing system were introduced, and the application of such large pre-trained models was assessed to document such languages using Automatic Speech Recognition and Direct Speech-to-Text Translation systems. The transcriptions for these tribal languages were generated by adapting scripts from those languages that held a prominent presence in the geographical regions where these tribal languages are spoken. The results from this study suggest a viable direction to document these languages in the electronic domain by using Spoken Language Technologies that incorporate LAMs. Additionally, this study helped in understanding the varying performances exhibited by the Large Acoustic Model between these four languages. This study not only informs the adoption of appropriate scripts for transliterating spoken-only languages based on the language family but also aids in making informed decisions in analyzing the behavior of a particular Large Acoustic Model in linguistic contexts.

Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take the first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores are well-correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step. We release the collected human-annotated data in order to encourage further investigation.

pdf abs
Evaluating the Potential of Language-family-specific Generative Models for Low-resource Data Augmentation: A Faroese Case Study
Barbara Scalvini | Iben Nyholm Debess

We investigate GPT-SW3, a generative language model for the Nordic languages, to assess its understanding of the low-resourced Faroese language. Our aim is to demonstrate the advantages of using language-family-specific generative models to augment data for related languages with fewer resources. We evaluate GPT-SW3 by prompting it for Faroese-to-English translation in zero-, one-, and few-shot settings. We assess such translations with an ensemble score consisting of the arithmetic average of BLEU and a semantic similarity score (SBERT). Moreover, we challenge the model’s Faroese language understanding capabilities on a small dataset of curated Faroese trick sentences. There, we make a qualitative comparison of the model’s performance with respect to OpenAI’s GPT-3.5 and GPT-4, demonstrating the advantages of using a language-family-specific generative model for navigating non-trivial scenarios. We evaluate the pipeline thus created and use it, as a proof of concept, to create an automatically annotated Faroese semantic textual similarity (STS) dataset.
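The ensemble score described above can be written as a simple mean. The symbol names below are illustrative, and the formula assumes that both component scores have been brought to a common scale (for example, both in the range 0 to 1), which the abstract does not specify.

```latex
S_{\text{ens}} = \frac{\mathrm{BLEU} + \mathrm{sim}_{\text{SBERT}}}{2}
```

Under that assumption, a translation with a BLEU of 0.30 and a semantic similarity of 0.80 would receive an ensemble score of 0.55.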

Pretrained language models and large language models are increasingly used to assist in a great variety of natural language tasks. In this work, we explore their use in evaluating the quality of alternative corpus annotation schemes. For this purpose, we analyze two alternative annotations of the Turkish BOUN treebank, versions 2.8 and 2.11, in the Universal Dependencies framework using large language models. Given a suitable prompt generated from treebank annotations, large language models are used to recover the surface forms of sentences. Based on the idea that large language models capture the characteristics of languages, we expect that the better annotation scheme would yield the sentences with higher success. The experiments conducted on a subset of the treebank show that the new annotation scheme (2.11) yields a successful recovery percentage about 2 points higher. All the code developed for this work is available at https://github.com/boun-tabi/eval-ud.

pdf abs
Evaluating Topic Model on Asymmetric and Multi-Domain Financial Corpus
Corentin Masson | Patrick Paroubek

Multiple recent research works in Finance try to quantify, from text, the exposure of market assets to various risks and how assets react if a risk materializes. We consider risk sections from French Financial Corporate Annual Reports, which are regulated documents with a mandatory section containing important risks the company is facing, to extract an accurate risk profile and exposure of companies. We identify multiple pitfalls of topic models when applied to corporate filing financial domain data for unsupervised risk distribution extraction, which has not yet been studied in this domain. We propose two new metrics to evaluate the behavior of different types of topic models with respect to the previously mentioned pitfalls of document risk distribution extraction. Our evaluation focuses on three aspects: regularizations, down-sampling and data augmentation. In our experiments, we found that classic Topic Models require down-sampling to obtain unbiased risks, while Topic Models using metadata and in-domain pre-trained word-embeddings partially correct the coherence imbalance per subdomain and remove sector-specific language from the detected themes. We then demonstrate the relevance and usefulness of the extracted information with visualizations that help to understand the content of such a corpus and its evolution over the years.

pdf abs
Evaluating Unsupervised Dimensionality Reduction Methods for Pretrained Sentence Embeddings
Gaifan Zhang | Yi Zhou | Danushka Bollegala

Sentence embeddings produced by Pretrained Language Models (PLMs) have received wide attention from the NLP community due to their superior performance when representing texts in numerous downstream applications. However, the high dimensionality of the sentence embeddings produced by PLMs is problematic when representing large numbers of sentences in memory- or compute-constrained devices. As a solution, we evaluate unsupervised dimensionality reduction methods to reduce the dimensionality of sentence embeddings produced by PLMs. Our experimental results show that simple methods such as Principal Component Analysis (PCA) can reduce the dimensionality of sentence embeddings by almost 50%, without incurring a significant loss in performance in multiple downstream tasks. Surprisingly, reducing the dimensionality further improves performance over the original high dimensional versions for the sentence embeddings produced by some PLMs in some tasks.
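As a rough illustration of PCA-based reduction, the sketch below projects vectors onto their top principal components using only the Python standard library. It is a teaching sketch under stated assumptions, not the paper's experimental setup: the function names are hypothetical, the toy vectors stand in for real sentence embeddings, and power iteration with deflation replaces the eigen-solver a numerical library would normally provide.

```python
import math
import random


def fit_pca(rows, k, iters=200, seed=0):
    """Estimate the mean and the top-k principal directions of `rows` (a list of equal-length lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)] for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        # Power iteration: repeatedly multiply by the covariance matrix and renormalise.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflation: remove the variance explained by the direction just found.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    return means, components


def transform(rows, means, components):
    """Project each row onto the fitted components after subtracting the mean."""
    d = len(means)
    return [[sum((r[j] - means[j]) * c[j] for j in range(d)) for c in components] for r in rows]


if __name__ == "__main__":
    rng = random.Random(1)
    embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]  # toy 8-dimensional vectors
    means, components = fit_pca(embeddings, k=4)  # keep half of the dimensions
    reduced = transform(embeddings, means, components)
    print(len(reduced), len(reduced[0]))
```

Halving the dimensionality, as in the usage example, corresponds to choosing `k` equal to half the original embedding size; whether downstream performance is preserved has to be checked on the task at hand.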

pdf abs
Evaluating Webcam-based Gaze Data as an Alternative for Human Rationale Annotations
Stephanie Brandl | Oliver Eberle | Tiago Ribeiro | Anders Søgaard | Nora Hollenstein

Rationales in the form of manually annotated input spans usually serve as ground truth when evaluating explainability methods in NLP. They are, however, time-consuming and often biased by the annotation process. In this paper, we debate whether human gaze, in the form of webcam-based eye-tracking recordings, poses a valid alternative when evaluating importance scores. We evaluate the additional information provided by gaze data, such as total reading times, gaze entropy, and decoding accuracy with respect to human rationale annotations. We compare WebQAmGaze, a multilingual dataset for information-seeking QA, with attention and explainability-based importance scores for 4 different multilingual Transformer-based language models (mBERT, distil-mBERT, XLMR, and XLMR-L) and 3 languages (English, Spanish, and German). Our pipeline can easily be applied to other tasks and languages. Our findings suggest that gaze data offers valuable linguistic insights that could be leveraged to infer task difficulty and further show a comparable ranking of explainability methods to that of human rationales.

pdf abs
Evaluating Word Expansion for Multilingual Sentiment Analysis of Parliamentary Speech
Yana Nikolova | Costanza Navarretta

This paper replicates and evaluates the word expansion (WE) method for sentiment lexicon generation from Rheault et al. (2016), applying it to two novel corpora of parliamentary speech from Denmark and Bulgaria. GloVe embeddings and vector similarity are leveraged to expand synonym seed lists with domain-specific terms from the speech corpora. The resulting Danish and Bulgarian lexica are compared to other multilingual lexica by analyzing a gold standard of speech excerpts annotated for sentiment. WE correlates best with hand-coded annotations for Danish, while a machine-translated Lexicoder dictionary does best for Bulgarian. WE performance is also found to be very sensitive to processing and scoring techniques, though this is also an issue with the other lexica. Overall, automatic lexicon translation best balances computational complexity and accuracy across both languages, but robust language-agnosticism remains elusive. Theoretical and practical problems of WE are discussed.
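The general idea of expanding a seed list through vector similarity can be sketched as follows. This is only an illustration under assumptions: the function names, the tiny hand-made two-dimensional vectors, and the ranking by mean cosine similarity to the seeds are hypothetical choices, not the procedure of Rheault et al. (2016) or of this paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_seeds(seeds, vectors, top_n=2):
    """Rank non-seed vocabulary words by their mean cosine similarity to the seed words."""
    known_seeds = [s for s in seeds if s in vectors]
    scored = []
    for word, vec in vectors.items():
        if word in seeds or not known_seeds:
            continue
        mean_sim = sum(cosine(vec, vectors[s]) for s in known_seeds) / len(known_seeds)
        scored.append((mean_sim, word))
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _sim, word in scored[:top_n]]


if __name__ == "__main__":
    # Toy vectors standing in for embeddings trained on a speech corpus.
    toy_vectors = {
        "good": [1.0, 0.1],
        "great": [0.9, 0.2],
        "excellent": [0.95, 0.05],
        "bad": [-1.0, 0.1],
        "budget": [0.0, 1.0],
    }
    print(expand_seeds({"good", "great"}, toy_vectors, top_n=1))
```

In practice the candidate list would be filtered further, since nearest neighbours in embedding space often include antonyms and topical but sentiment-neutral terms.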

pdf abs
Evaluating Workflows for Creating Orthographic Transcripts for Oral Corpora by Transcribing from Scratch or Correcting ASR-Output
Jan Gorisch | Thomas Schmidt

Research projects incorporating spoken data either select existing speech corpora or plan to record new data. In both cases, recordings need to be transcribed to make them accessible to analysis. Underestimating the effort of transcribing can be risky. Automatic Speech Recognition (ASR) holds the promise to considerably reduce transcription effort. However, few studies have so far attempted to evaluate this potential. The present paper compares efforts for manual transcription vs. correction of ASR-output. We took recordings from corpora of varying settings (interview, colloquial talk, dialectal, historic) and (i) compared two methods for creating orthographic transcripts, transcribing from scratch vs. correcting automatically created transcripts, and (ii) evaluated the influence of the corpus characteristics on the correcting efficiency. Results suggest that for the selected data and transcription conventions, transcribing and correcting still take equally long, at 7 times real-time on average. The more complex the primary data, the more time has to be spent on corrections. Despite the impressive latest developments in speech technology, to be a real help for conversation analysts or dialectologists, ASR systems seem to require even more improvement, or we need sufficient and appropriate data for training such systems.

pdf abs
Evaluation Dataset for Lexical Translation Consistency in Chinese-to-English Document-level Translation
Xiangyu Lei | Junhui Li | Shimin Tao | Hao Yang

Lexical translation consistency is one of the most common discourse phenomena in Chinese-to-English document-level translation. To better evaluate the performance of lexical translation consistency, previous research assumes that all repeated source words should be translated consistently. However, constraining translations of repeated source words to be consistent will hurt word diversity, and human translators tend to use different words in translation. Therefore, in this paper we construct a test set of 310 bilingual news articles to properly evaluate lexical translation consistency. We manually differentiate those repeated source words whose translations are consistent into two types: true consistency and false consistency. Then, based on the constructed test set, we evaluate the performance of lexical translation consistency for several typical NMT systems.

pdf abs
Evaluation of Really Good Grammatical Error Correction
Robert Östling | Katarina Gillholm | Murathan Kurfalı | Marie Mattson | Mats Wirén

Traditional evaluation methods for Grammatical Error Correction (GEC) fail to capture the full range of system capabilities and objectives. The emergence of large language models (LLMs) has further highlighted the shortcomings of these evaluation strategies, emphasizing the need for a paradigm shift in evaluation methodology. In the current study, we perform a comprehensive evaluation of various GEC systems using a recently published dataset of Swedish learner texts. The evaluation is performed using established evaluation metrics as well as human judges. We find that GPT-3 in a few-shot setting by far outperforms previous grammatical error correction systems for Swedish, a language comprising only about 0.1% of its training data. We also find that current evaluation methods contain undesirable biases that a human evaluation is able to reveal. We suggest using human post-editing of GEC system outputs to analyze the amount of change required to reach native-level human performance on the task, and provide a dataset annotated with human post-edits and assessments of grammaticality, fluency and meaning preservation of GEC system outputs.

pdf abs
Event-enhanced Retrieval in Real-time Search
Yanan Zhang | Xiaoling Bai | Tianhua Zhou

The embedding-based retrieval (EBR) approach is widely used in mainstream search engine retrieval systems and is crucial in recent retrieval-augmented methods for eliminating LLM hallucinations. However, existing EBR models often face the “semantic drift” problem and focus insufficiently on key information, leading to a low adoption rate of retrieval results in subsequent steps. This issue is especially noticeable in real-time search scenarios, where the various expressions of popular events on the Internet make real-time retrieval heavily reliant on crucial event information. To tackle this problem, this paper proposes a novel approach called EER, which enhances real-time retrieval performance by improving the dual-encoder model of traditional EBR. We incorporate contrastive learning to accompany pairwise learning for encoder optimization. Furthermore, to strengthen the focus on critical event information, we include a decoder module after the document encoder, introduce a generative event triplet extraction scheme based on prompt-tuning, and correlate the events with query encoder optimization through comparative learning. This decoder module can be removed during inference. Extensive experiments demonstrate that EER can significantly improve real-time search retrieval performance. We believe that this approach will provide new perspectives in the field of information retrieval. The code and dataset are available at https://github.com/open-event-hub/Event-enhanced_Retrieval.

pdf abs
Event Extraction in Basque: Typologically Motivated Cross-Lingual Transfer-Learning Analysis
Mikel Zubillaga | Oscar Sainz | Ainara Estarrona | Oier Lopez de Lacalle | Eneko Agirre

Cross-lingual transfer-learning is widely used in Event Extraction for low-resource languages and involves a Multilingual Language Model that is trained in a source language and applied to the target language. This paper studies whether the typological similarity between source and target languages impacts the performance of cross-lingual transfer, an under-explored topic. We first focus on Basque as the target language, which is an ideal target language because it is typologically different from surrounding languages. Our experiments on three Event Extraction tasks show that the linguistic characteristics shared between source and target languages do have an impact on transfer quality. Further analysis of 72 language pairs reveals that for tasks that involve token classification, such as entity and event trigger identification, common writing script and morphological features produce higher quality cross-lingual transfer. In contrast, for tasks involving structural prediction like argument extraction, common word order is the most relevant feature. In addition, we show that when increasing the training size, not all the languages scale in the same way in the cross-lingual setting. To perform the experiments we introduce EusIE, an event extraction dataset for Basque, which follows the Multilingual Event Extraction dataset (MEE). The dataset and code are publicly available.

Narrative reasoning relies on the understanding of eventualities in story contexts, which requires a wealth of background world knowledge. To help machines leverage such knowledge, existing solutions can be categorized into two groups. Some focus on implicitly modeling eventuality knowledge by pretraining language models (LMs) with eventuality-aware objectives. However, this approach breaks down knowledge structures and lacks interpretability. Others explicitly collect world knowledge of eventualities into structured eventuality-centric knowledge graphs (KGs). However, existing research on leveraging these knowledge sources for free-texts is limited. In this work, we propose an initial comprehensive framework called EventGround, which aims to tackle the problem of grounding free-texts to eventuality-centric KGs for contextualized narrative reasoning. We identify two critical problems in this direction: the event representation and sparsity problems. We provide simple yet effective parsing and partial information extraction methods to tackle these problems. Experimental results demonstrate that our approach consistently outperforms baseline models when combined with graph neural network (GNN) or large language model (LLM) based graph reasoning models. Our framework, incorporating grounded knowledge, achieves state-of-the-art performance while providing interpretable evidence.

pdf abs
Event Representation Learning with Multi-Grained Contrastive Learning and Triple-Mixture of Experts
Tianqi Hu | Lishuang Li | Xueyang Qin | Yubo Feng

Event representation learning plays a crucial role in numerous natural language processing (NLP) tasks, as it facilitates the extraction of semantic features associated with events. Current contrastive-learning-based methods for learning event representations process positive examples with a single-grained random masked language model (MLM), but fall short in learning information inside events from multiple aspects. In this paper, we introduce multi-grained contrastive learning and triple-mixture of experts (MCTM) for event representation learning. Our proposed method extends the random MLM by incorporating a specialized MLM designed to capture different grammatical structures within events, which allows the model to learn token-level knowledge from multiple perspectives. Furthermore, we have observed that mask tokens with different granularities affect the model differently; therefore, we incorporate a mixture of experts (MoE) to learn importance weights associated with different granularities. Our experiments demonstrate that MCTM outperforms other baselines in tasks such as hard similarity and transitive sentence similarity, highlighting the superiority of our method.

pdf abs
Every Verb in Its Right Place? A Roadmap for Operationalizing Developmental Stages in the Acquisition of L2 German
Josef Ruppenhofer | Matthias Schwendemann | Annette Portmann | Katrin Wisniewski | Torsten Zesch

Developmental stages are a linguistic concept claiming that language learning, despite its large inter-individual variance, generally progresses in an ordered, step-like manner. At the core of research has been the acquisition of verb placement by learners, as conceptualized within Processability Theory (Pienemann, 1989). The computational implementation of a system detecting developmental stages is a prerequisite for an automated analysis of L2 language development. However, such an implementation faces two main challenges. The first is the lack of a fully fleshed out, coherent linguistic specification of the stages. The second concerns the translation of the linguistic specification into computational procedures that can extract clauses from learner-produced text and assign them to a developmental stage based on verb placement. Our contribution provides the necessary linguistic specification of the stages as well as a detailed discussion and recommendations regarding computational implementation.

pdf abs
Evidence-guided Inference for Neutralized Zero-shot Transfer
Xiaotong Feng | Meng-Fen Chiang | Wang-Chien Lee | Zixin Kuang

Human annotation is costly and impractical when it comes to scarcely labeled data. Besides, the presence of biased language in well-known benchmarks notably misleads predictive models into performing incredibly well, not because of the model capability but due to the hidden false correlations in the linguistic corpus. Motivated by this, we propose a neutralized Knowledge Transfer framework (NKT) to equip pre-trained language models with neutralized transferability. Specifically, we construct debiased multi-source corpora (CV and EL) for two exemplary knowledge transfer tasks: claim verification and evidence learning, respectively. To counteract biased language, we design a neutralization mechanism in the presence of label skewness. We also design a label adaptation mechanism in light of the mixed label systems in the multi-source corpora. In extensive experiments, the proposed NKT framework shows effective transferability, in contrast to the poor transferability of dominant baselines, particularly in the zero-shot cross-domain transfer setting.

pdf abs
EVil-Probe - a Composite Benchmark for Extensive Visio-Linguistic Probing
Marie Bexte | Andrea Horbach | Torsten Zesch

Research probing the language comprehension of visio-linguistic models has gained traction due to their remarkable performance on various tasks. We introduce EViL-Probe, a composite benchmark that processes existing probing datasets into a unified format and reorganizes them based on the linguistic categories they probe. On top of the commonly used negative probes, this benchmark introduces positive probes to more rigorously test the robustness of models. Since the language side alone may introduce a bias that models could exploit in solving the probes, we estimate the difficulty of the individual subsets with a language-only baseline. Using the benchmark to probe a set of state-of-the-art visio-linguistic models sheds light on how sensitive they are to the different linguistic categories. Results show that the benchmark is challenging for all models we probe, as their performance is around the chance baseline for many of the categories. The only category all models are able to handle relatively well is nouns. Additionally, models that use a Vision Transformer to process the images are also somewhat robust against probes targeting color and image type. Among these models, our enrichment of EViL-Probe with positive probes helps further discriminate performance, showing BLIP to be the overall best-performing model.

pdf abs
EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries
Jing Han Sun | Ali Emami

While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT’s capabilities, we expand our task instances from 182 to 3691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, assessing model stability in dynamic tasks. Our results emphasize the challenge posed by EvoGrad: Even the best performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy without perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.

Large language models (LLMs) have demonstrated remarkable capabilities across various NLP tasks. However, their computational costs are prohibitively high. To address this issue, previous research has attempted to distill the knowledge of LLMs into smaller models by generating annotated data. Nonetheless, these works have mainly focused on the direct use of LLMs for text generation and labeling, without fully exploring their potential to comprehend the target task and acquire valuable knowledge. In this paper, we propose EvoKD: Evolving Knowledge Distillation, which leverages the concept of active learning to interactively enhance the process of data generation using large language models, simultaneously improving the task capabilities of a small domain model (the student model). Unlike previous work, we actively analyze the student model’s weaknesses, and then synthesize labeled samples based on the analysis. In addition, we provide iterative feedback to the LLMs regarding the student model’s performance to continuously construct diversified and challenging samples. Experiments and analysis on different NLP tasks, namely text classification and named entity recognition, show the effectiveness of EvoKD.

pdf abs
Examining Temporalities on Stance Detection towards COVID-19 Vaccination
Yida Mu | Mali Jin | Kalina Bontcheva | Xingyi Song

Previous studies have highlighted the importance of vaccination as an effective strategy to control the transmission of the COVID-19 virus. It is crucial for policymakers to have a comprehensive understanding of the public’s stance towards vaccination on a large scale. However, attitudes towards COVID-19 vaccination, such as pro-vaccine or vaccine hesitancy, have evolved over time on social media. Thus, it is necessary to account for possible temporal shifts when analysing these stances. This study aims to examine the impact of temporal concept drift on stance detection towards COVID-19 vaccination on Twitter. To this end, we evaluate a range of transformer-based models using chronological (splitting the training, validation, and test sets in order of time) and random splits (randomly splitting these three sets) of social media data. Our findings reveal significant discrepancies in model performance between random and chronological splits in several existing COVID-19-related datasets; specifically, chronological splits significantly reduce the accuracy of stance classification. Therefore, real-world stance detection approaches need to be further refined to incorporate temporal factors as a key consideration.
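The two splitting strategies compared above can be sketched in a few lines of Python. The function names, the 70/10/20 proportions, and the toy timestamped posts are assumptions made for illustration; the abstract does not state the proportions used in the study.

```python
import random
from datetime import date


def _cut_points(n, train_frac, val_frac):
    """Return the indices where the training and validation portions end."""
    train_end = int(round(n * train_frac))
    val_end = train_end + int(round(n * val_frac))
    return train_end, val_end


def chronological_split(items, key, train_frac=0.7, val_frac=0.1):
    """Sort by time so that training data precedes validation data, which precedes test data."""
    ordered = sorted(items, key=key)
    train_end, val_end = _cut_points(len(ordered), train_frac, val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


def random_split(items, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle with a fixed seed so that every split mixes posts from all time periods."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train_end, val_end = _cut_points(len(shuffled), train_frac, val_frac)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


if __name__ == "__main__":
    # Toy (timestamp, text) pairs standing in for dated social media posts.
    posts = [(date(2021, 1, day), f"post {day}") for day in range(1, 11)]
    train, val, test = chronological_split(posts, key=lambda post: post[0])
    print(len(train), len(val), len(test))
    print(train[-1][0], "<", test[0][0])
```

With a chronological split the test set contains only posts written after everything the model was trained on, which is what exposes temporal concept drift; a random split hides it by letting the model see posts from every period during training.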

Examining the Limitations of Computational Rumor Detection Models Trained on Static Datasets
Yida Mu | Xingyi Song | Kalina Bontcheva | Nikolaos Aletras

A crucial aspect of a rumor detection model is its ability to generalize, particularly its ability to detect emerging, previously unknown rumors. Past research has indicated that content-based (i.e., using solely the source post as input) rumor detection models tend to perform less effectively on unseen rumors. At the same time, the potential of context-based models remains largely untapped. The main contribution of this paper is an in-depth evaluation of the performance gap between content-based and context-based models, specifically on detecting new, unseen rumors. Our empirical findings demonstrate that context-based models are still overly dependent on the information derived from the rumors’ source post and tend to overlook the significant role that contextual information can play. We also study the effect of data split strategies on classifier performance. Based on our experimental results, the paper also offers practical suggestions on how to minimize the effects of temporal concept drift in static datasets during the training of rumor detection methods.

Executing computer programs described in natural language has long been a pursuit of computer science. With the advent of enhanced natural language understanding capabilities exhibited by large language models (LLMs), the path toward this goal has been illuminated. In this paper, we seek to examine the capacity of present-day LLMs to comprehend and execute algorithms outlined in natural language. We established an algorithm test set sourced from Introduction to Algorithms, a well-known textbook that contains many representative, widely used algorithms. To systematically assess LLMs’ code execution abilities, we selected 30 algorithms, generated 300 randomly sampled instances in total, and evaluated whether popular LLMs can understand and execute these algorithms. Our findings reveal that LLMs, notably GPT-4, can effectively execute programs described in natural language, as long as no heavy numeric computation is involved. We believe our findings contribute to evaluating LLMs’ code execution abilities and would encourage further investigation and application of the computational power of LLMs.

Experimental versus In-Corpus Variation in Referring Expression Choice
T. Mark Ellison | Fahime Same

In this paper, we compare the results of three studies. The first explored feature-conditioned distributions of referring expression (RE) forms in the original corpus from which the contexts were taken. The second is a crowdsourcing study in which we asked participants to express entities within a pre-existing context, given fully specified referents. The third study replicates the crowdsourcing experiment using Large Language Models (LLMs). We evaluate how well the corpus itself can model the variation found when multiple informants (either human participants or LLMs) choose REs in the same contexts. We measure the similarity of the conditional distributions of form categories using the Jensen-Shannon Divergence metric and Description Length metric. We find that the experimental methodology introduces substantial noise, but by taking this noise into account, we can model the variation captured from the corpus and RE form choices made during experiments. Furthermore, we compared the three conditional distributions over the corpus, the human experimental results, and the GPT models. Against our expectations, the divergence is greatest between the corpus and the GPT model.
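For readers unfamiliar with the Jensen-Shannon Divergence mentioned above, the following is a minimal sketch of how it compares two distributions over RE form categories. The category set and the probabilities are invented for illustration and are not taken from the paper; the function names are likewise assumptions.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical distributions over (pronoun, proper name, description)
# for one context feature; the numbers are illustrative only.
corpus_dist = [0.6, 0.3, 0.1]
human_dist = [0.5, 0.3, 0.2]

divergence = js_divergence(corpus_dist, human_dist)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no shared support), and it is symmetric in its two arguments, which makes it convenient for comparing a corpus against experimental data.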

Experiments on Speech Synthesis for Teochew, Can Taiwanese Help ?
Pierre Magistry | Ilaine Wang | Ty Eng Lim

This paper reports on our preliminary experiments in speech processing for Teochew, an under-resourced Sinitic language spoken both in China and around the world in diasporan communities. Following the recent uptick of interest in Teochew from heritage speakers of the diaspora and in order to respond to the needs of this community, we develop a Teochew Text-to-Speech system. We describe experiments to build this system and to assess the possible contribution of available resources in Taiwanese Hokkien, the closest language with a significant body of resources. The results of these experiments are not as conclusive as we expected: the Taiwanese dataset did not help our model significantly, but considering our objectives, we find it encouraging that they show that a large training dataset was not necessary for this precise task. A promising model could still be obtained with only a small dataset of Teochew. We hope that this work inspires other communities of speakers of languages in a revitalization phase.

Explainable Multi-hop Question Generation: An End-to-End Approach without Intermediate Question Labeling
Seonjeong Hwang | Yunsu Kim | Gary Geunbae Lee

In response to the increasing use of interactive artificial intelligence, the demand for the capacity to handle complex questions has increased. Multi-hop question generation aims to generate complex questions that require multi-step reasoning over several documents. Previous studies have predominantly utilized end-to-end models, wherein questions are decoded based on the representation of context documents. However, these approaches lack the ability to explain the reasoning process behind the generated multi-hop questions. Additionally, the question rewriting approach, which incrementally increases the question complexity, also has limitations due to the requirement of labeling data for intermediate-stage questions. In this paper, we introduce an end-to-end question rewriting model that increases question complexity through sequential rewriting. The proposed model has the advantage of training with only the final multi-hop questions, without intermediate questions. Experimental results demonstrate the effectiveness of our model in generating complex questions, particularly 3- and 4-hop questions, which are appropriately paired with input answers. We also prove that our model logically and incrementally increases the complexity of questions, and the generated multi-hop questions are also beneficial for training question answering models.

Explaining Pre-Trained Language Models with Attribution Scores: An Analysis in Low-Resource Settings
Wei Zhou | Heike Adel | Hendrik Schuff | Ngoc Thang Vu

Attribution scores indicate the importance of different input parts and can, thus, explain model behaviour. Currently, prompt-based models are gaining popularity, i.a., due to their easier adaptability in low-resource settings. However, the quality of attribution scores extracted from prompt-based models has not been investigated yet. In this work, we address this topic by analyzing attribution scores extracted from prompt-based models w.r.t. plausibility and faithfulness and comparing them with attribution scores extracted from fine-tuned models and large language models. In contrast to previous work, we introduce training size as another dimension into the analysis. We find that using the prompting paradigm (with either encoder-based or decoder-based models) yields more plausible explanations than fine-tuning the models in low-resource settings and Shapley Value Sampling consistently outperforms attention and Integrated Gradients in terms of leading to more plausible and faithful explanations.

Explicit over Implicit: Explicit Diversity Conditions for Effective Question Answer Generation
Vikas Yadav | Hyuk joon Kwon | Vijay Srinivasan | Hongxia Jin

Question Answer Generation (QAG) is an effective data augmentation technique to improve the accuracy of question answering systems, especially in low-resource domains. While recent pretrained and large language model-based QAG methods have made substantial progress, they face the critical issue of redundant QA pair generation, affecting downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search are proven effective solutions but often yield smaller diversity. We present explicit diversity conditions for QAG, focusing on spatial aspects, question types, and entities, substantially increasing diversity in QA generation. Our work emphasizes the need for explicit diversity conditions for generating diverse question-answer synthetic data by showing significant improvements in the downstream QA task over existing implicit diversity techniques. In particular, generated QA pairs from explicit diversity conditions result in an average 4.1% exact match and 4.5% F1 improvement over implicit sampling techniques on SQuAD-DU. Our work emphasizes the need for explicit diversity conditions even more in low-resource datasets (SubjQA), where average QA performance improvements are ~12% EM.

Recent generative large language models (LLMs) have exhibited incredible instruction-following capabilities while keeping strong task completion ability, even without task-specific fine-tuning. Some works attribute this to the bonus of the new scaling law, in which the continuous improvement of model capacity yields emergent capabilities, e.g., reasoning and universal generalization. However, we point out that recent LLMs still show shortcut learning behavior, where the models tend to exploit spurious correlations between non-robust features and labels for prediction, which might lead to overestimating model capabilities. LLMs memorize more complex spurious correlations (i.e., task ↔ feature ↔ label) compared with those learned under the previous pre-training and task-specific fine-tuning paradigm (i.e., feature ↔ label). Based on our findings, we propose FSLI, a framework for encouraging LLMs to Forget Spurious correlations and Learn from In-context information. Experiments on three tasks show that FSLI can effectively mitigate shortcut learning. Besides, we argue not to overestimate the capabilities of LLMs and conduct evaluations in more challenging and complete test scenarios.

Exploring BERT-Based Classification Models for Detecting Phobia Subtypes: A Novel Tweet Dataset and Comparative Analysis
Anik Das | Milton King | James Alexander Hughes

Phobias, characterized by irrational fears of specific objects or situations, can profoundly affect an individual’s quality of life. This research presents a comprehensive investigation into phobia classification, where we propose a novel dataset of 811,569 English tweets from user timelines spanning 102 phobia subtypes over six months, including 47,614 self-diagnosed phobia users. BERT models were leveraged to differentiate non-phobia from phobia users and classify them into 65 specific phobia subtypes. The study produced promising results, with the highest F1-score of 78.44% in binary classification (phobic user or not phobic user) and 24.01% in multi-class classification (detecting the specific phobia subtype of a user). This research provides insights into people with phobias on social media and emphasizes the capacity of natural language processing and machine learning to automate the evaluation and support of mental health.

Exploring Geometric Representational Disparities between Multilingual and Bilingual Translation Models
Neha Verma | Kenton Murray | Kevin Duh

Multilingual machine translation has proven immensely useful for both parameter efficiency and overall performance across many language pairs via complete multilingual parameter sharing. However, some language pairs in multilingual models can see worse performance than in bilingual models, especially in the one-to-many translation setting. Motivated by their empirical differences, we examine the geometric differences in representations from bilingual models versus those from one-to-many multilingual models. Specifically, we compute the isotropy of these representations using intrinsic dimensionality and IsoScore, in order to measure how the representations utilize the dimensions in their underlying vector space. Using the same evaluation data in both models, we find that for a given language pair, its multilingual model decoder representations are consistently less isotropic and occupy fewer dimensions than comparable bilingual model decoder representations. Additionally, we show that much of the anisotropy in multilingual decoder representations can be attributed to modeling language-specific information, therefore limiting remaining representational capacity.

Exploring Interpretability of Independent Components of Word Embeddings with Automated Word Intruder Test
Tomáš Musil | David Mareček

Independent Component Analysis (ICA) is an algorithm originally developed for finding separate sources in a mixed signal, such as a recording of multiple people in the same room speaking at the same time. Unlike Principal Component Analysis (PCA), ICA permits the representation of a word as an unstructured set of features, without any particular feature being deemed more significant than the others. In this paper, we used ICA to analyze word embeddings. We have found that ICA can be used to find semantic features of the words and these features can easily be combined to search for words that satisfy the combination. We show that most of the independent components represent such features. To quantify the interpretability of the components, we use the word intruder test, performed both by humans and by large language models. We propose to use the automated version of the word intruder test as a fast and inexpensive way of quantifying vector interpretability without the need for human effort.

Exploring Neural Topic Modeling on a Classical Latin Corpus
Ginevra Martinelli | Paola Impicciché | Elisabetta Fersini | Francesco Mambrini | Marco Passarotti

The large availability of processable textual resources for Classical Latin has made it possible to study Latin literature through methods and tools that support distant reading. This paper describes a number of experiments carried out to test the possibility of investigating the thematic distribution of the Classical Latin corpus Opera Latina by means of topic modeling. For this purpose, we train, optimize and compare two neural models, Product-of-Experts LDA (ProdLDA) and Embedded Topic Model (ETM), opportunely revised to deal with the textual data from a Classical Latin corpus, to evaluate which one performs better both on the basis of topic diversity and topic coherence metrics, and from a human judgment point of view. Our results show that the topics extracted by neural models are coherent and interpretable and that they are significant from the perspective of a Latin scholar. The source code of the proposed model is available at https://github.com/MIND-Lab/LatinProdLDA.

Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context
Tuan Nguyen | Corinne Fredouille | Alain Ghio | Mathieu Balaguer | Virginie Woisard

Automatic speech quality assessment has attracted growing attention as an alternative or support to traditional perceptual clinical evaluation. However, most research so far achieves good results only on simple tasks such as binary classification, largely due to data scarcity. To deal with this challenge, current works tend to segment patients’ audio files into many samples to augment the datasets. Nevertheless, this approach has limitations, as it indirectly relates overall audio scores to individual segments. This paper introduces a novel approach where the system learns at the audio level instead of segments despite data scarcity. This paper proposes to use the pre-trained Wav2Vec2 architecture, for both SSL and ASR, as a feature extractor in speech assessment. Carried out on the HNC dataset, our ASR-driven approach established a new baseline compared with other approaches, obtaining average MSE = 0.73 and MSE = 1.15 for the prediction of intelligibility and severity scores respectively, using only 95 training samples. It shows that the ASR-based Wav2Vec2 model brings the best results and may indicate a strong correlation between ASR and speech quality assessment. We also measure its ability on variable segment durations and speech content, exploring factors influencing its decision.

Exploring the Emotional Dimension of French Online Toxic Content
Valentina Dragos | Delphine Battistelli | Fatou Sow | Aline Etienne

One of the biggest hurdles for the effective analysis of data collected on social platforms is the need for deeper insights into the content and meaning of this data. Emotion annotation can bring new perspectives on this issue and can enable the identification of content-specific features. This study aims at investigating the ways in which variation in online content can be explored through emotion annotation and corpus-based analysis. The paper describes the emotion annotation of three data sets in French composed of extremist, sexist and hateful messages respectively. To this end, first a fine-grained corpus annotation scheme was used to annotate the data sets and then several empirical studies were carried out to characterize the content in the light of emotional categories. Results suggest that emotion annotations can provide new insights for online content analysis and a stronger empirical background for automatic content detection.

Exploring the Generalization of Cancer Clinical Trial Eligibility Classifiers across Diseases
Yumeng Yang

Clinical trials are pivotal in medical research, and NLP can enhance their success, with application in recruitment. This study aims to evaluate the generalizability of eligibility classification across a broad spectrum of clinical trials. We start with phase 3 cancer trials annotated with seven eligibility exclusions, then determine how well models can generalize to non-cancer and non-phase 3 trials. To assess this, we have compiled eligibility criteria data for five types of trials: (1) additional phase 3 cancer trials, (2) phase 1 and 2 cancer trials, (3) heart disease trials, (4) type 2 diabetes trials, and (5) observational trials for any disease, comprising 2,490 annotated eligibility criteria across seven exclusion types. Our results show that models trained on the extensive cancer dataset can effectively handle criteria commonly found in non-cancer trials, such as autoimmune diseases. However, they struggle with criteria disproportionately prevalent in cancer trials, like prior malignancy. We also experiment with few-shot learning, demonstrating that a limited number of disease-specific examples can partially overcome this performance gap. We are releasing this new dataset of annotated eligibility statements to promote the development of cross-disease generalization in clinical trial classification.

Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation
Sarah E. Finch | James D. Finch | Jinho D. Choi

Human evaluation has been widely accepted as the standard for evaluating chat-oriented dialogue systems. However, there is a significant variation in previous work regarding who gets recruited as evaluators. Evaluator groups such as domain experts, university students, and crowdworkers have been used to assess and compare dialogue systems, although it is unclear to what extent the choice of an evaluator group can affect results. This paper analyzes the evaluator group impact on dialogue system evaluation by testing 4 state-of-the-art dialogue systems using 4 distinct evaluator groups. Our analysis reveals a robustness towards evaluator groups for Likert evaluations that is not seen for Pairwise, with only minor differences observed when changing evaluator groups. Furthermore, two notable limitations to this robustness are observed, which reveal discrepancies between evaluators with different levels of chatbot expertise and indicate that evaluator objectivity is beneficial for certain dialogue metrics.

Exploring the Potential of Large Language Models (LLMs) for Low-resource Languages: A Study on Named-Entity Recognition (NER) and Part-Of-Speech (POS) Tagging for Nepali Language
Bipesh Subedi | Sunil Regmi | Bal Krishna Bal | Praveen Acharya

Large Language Models (LLMs) have made significant advancements in Natural Language Processing (NLP) by excelling in various NLP tasks. This study specifically focuses on evaluating the performance of LLMs for Named Entity Recognition (NER) and Part-of-Speech (POS) tagging for a low-resource language, Nepali. The aim is to study the effectiveness of these models for languages with limited resources by conducting experiments involving various parameters, with fine-tuning and evaluation on two datasets, namely ILPRL and EBIQUITY. In this work, we have experimented with eight LLMs for Nepali NER and POS tagging. While some prior works utilized larger datasets than ours, our contribution lies in presenting a comprehensive analysis of multiple LLMs in a unified setting. The findings indicate that NepBERTa, trained solely on the Nepali language, demonstrated the highest performance with F1-scores of 0.76 and 0.90 on the ILPRL dataset. Similarly, it achieved 0.79 and 0.97 on the EBIQUITY dataset for NER and POS respectively. This study not only highlights the potential of LLMs in performing classification tasks for low-resource languages but also compares their performance with that of alternative approaches deployed for the tasks.

Exploring the Synergy of Dual-path Encoder and Alignment Module for Better Graph-to-Text Generation
Tianxin Zhao | Yingxin Liu | Xiangdong Su | Jiang Li | Guanglai Gao

The mainstream approaches view knowledge graph-to-text (KG-to-text) generation as a sequence-to-sequence task and fine-tune a pre-trained language model (PLM) to generate the target text from the linearized knowledge graph. However, the linearization of knowledge graphs and the structure of PLMs lead to the loss of a large amount of graph structure information. Moreover, PLMs lack an explicit graph-text alignment strategy because of the discrepancy between structural and textual information. To solve these two problems, we propose a synergetic KG-to-text model with a dual-path encoder, an alignment module, and a guidance module. The dual-path encoder consists of a graph structure encoder and a text encoder, which can better encode the structure and text information of the knowledge graph. The alignment module contains a two-layer Transformer block and an MLP block, which aligns and integrates the information from the dual-path encoder. The guidance module combines an improved pointer network and an MLP block to avoid erroneously generated entities and ensure the fluency and accuracy of the generated text. Our approach obtains very competitive performance on three benchmark datasets. Our code is available from https://github.com/IMu-MachineLearningsxD/G2T.

Exploring the Usability of Persuasion Techniques for Downstream Misinformation-related Classification Tasks
Nikolaos Nikolaidis | Jakub Piskorski | Nicolas Stefanovitch

We systematically explore the predictive power of features derived from Persuasion Techniques detected in texts, for solving different tasks of interest for media analysis, notably detecting mis/disinformation, fake news, propaganda, partisan news and conspiracy theories. Firstly, we propose a set of meaningful features, aiming to capture the persuasiveness of a text. Secondly, we assess the discriminatory power of these features in different text classification tasks on 8 selected datasets from the literature using two metrics. We also evaluate the per-task discriminatory power of each Persuasion Technique and report on different insights. We find that most of these features have a noticeable potential to distinguish conspiracy theories, hyperpartisan news and propaganda, while we observe mixed results in the context of fake news detection.

Extending AZee with Non-manual Gesture Rules for French Sign Language
Camille Challant | Michael Filhol

This paper presents a study on non-manual gestures, using a formal model named AZee. This approach allows Sign Language (SL) discourses to be formally represented and also animated with a virtual signer. As non-manual gestures are essential in SL and therefore necessary for a quality synthesis, we wanted to extend AZee with them, by adding some production rules to the AZee production set. For this purpose, we applied a methodology for finding new production rules to a corpus representing one hour of French Sign Language, the 40 brèves (Challant and Filhol, 2022). 23 production rules for non-manual gestures in LSF have thus been determined. We took advantage of this study to directly insert these new rules in the first corpus of AZee discourse expressions, which describes with AZee the SL productions of the 40 brèves corpus. 533 non-manual rules were inserted in the corpus, and some updates were made. This article proposes a new version of this AZee expressions corpus.

In this system demonstration paper, we describe the Whiteboards extension for an existing web-based platform for digital qualitative discourse analysis. Whiteboards comprise interactive graph-based interfaces to organize and manipulate objects, which can be qualitative research data, such as documents, images, etc., and analyses of these research data, such as annotations, tags, and code structures. The proposed extension offers a customizable view of the material and a wide range of actions that enable new ways of interacting and working with such resources. We show that the visualizations facilitate various use cases of qualitative data analysis, including reflection of the research process through sampling maps, creation of actor networks, and refining code taxonomies.

Automatic Speech Recognition (ASR) technology is fundamental in transcribing spoken language into text, with considerable applications in the clinical realm, including streamlining medical transcription and integrating with Electronic Health Record (EHR) systems. Nevertheless, challenges persist, especially when transcriptions contain noise, leading to significant drops in performance when Natural Language Processing (NLP) models are applied. Named Entity Recognition (NER), an essential clinical task, is particularly affected by such noise, often termed the ASR-NLP gap. Prior works have primarily studied ASR’s efficiency in clean recordings, leaving a research gap concerning the performance in noisy environments. This paper introduces a novel dataset, BioASR-NER, designed to bridge the ASR-NLP gap in the biomedical domain, focusing on extracting adverse drug reactions and mentions of entities from the Brief Test of Adult Cognition by Telephone (BTACT) exam. Our dataset offers a comprehensive collection of almost 2,000 clean and noisy recordings. In addressing the noise challenge, we present an innovative transcript-cleaning method using GPT-4, investigating both zero-shot and few-shot methodologies. Our study further delves into an error analysis, shedding light on the types of errors in transcription software, corrections by GPT-4, and the challenges GPT-4 faces. This paper aims to foster improved understanding and potential solutions for the ASR-NLP gap, ultimately supporting enhanced healthcare documentation practices.

Extracting Financial Events from Raw Texts via Matrix Chunking
Yusheng Huang | Ning Hu | Kunping Li | Nan Wang | Zhouhan Lin

Event Extraction (EE) is widely used in the Chinese financial field to provide valuable structured information. However, there are two key challenges for Chinese financial EE in application scenarios. First, events need to be extracted from raw texts, which sets it apart from previous works like the Automatic Content Extraction (ACE) EE task, where EE is treated as a classification problem given the entity spans. Second, recognizing financial entities can be laborious, as they may involve multiple elements. In this paper, we introduce CFTE, a novel task for Chinese Financial Text-to-Event extraction, which directly extracts financial events from raw texts. We further present FINEED, a Chinese FINancial Event Extraction Dataset, and an efficient MAtrix-ChunKing method called MACK, designed for the extraction of financial events from raw texts. Specifically, FINEED is manually annotated with rich linguistic features. We propose a novel two-dimensional annotation method for FINEED, which can visualize the interactions among text components. Our MACK method is fault-tolerant by preserving the tag frequency distribution when identifying financial entities. We conduct extensive experiments and the results verify the effectiveness of our MACK method.

Social determinants of health (SDoH) play a critical role in shaping health outcomes, particularly in pediatric populations where interventions can have long-term implications. SDoH are frequently studied in the Electronic Health Record (EHR), which provides a rich repository for diverse patient data. In this work, we present a novel annotated corpus, the Pediatric Social History Annotation Corpus (PedSHAC), and evaluate the automatic extraction of detailed SDoH representations using fine-tuned and in-context learning methods with Large Language Models (LLMs). PedSHAC comprises annotated social history sections from 1,260 clinical notes obtained from pediatric patients within the University of Washington (UW) hospital system. Employing an event-based annotation scheme, PedSHAC captures ten distinct health determinants to encompass living and economic stability, prior trauma, education access, substance use history, and mental health with an overall annotator agreement of 81.9 F1. Our proposed fine-tuning LLM-based extractors achieve high performance at 78.4 F1 for event arguments. In-context learning approaches with GPT-4 demonstrate promise for reliable SDoH extraction with limited annotated examples, with extraction performance at 82.3 F1 for event triggers.

Eye-Tracking Features Masking Transformer Attention in Question-Answering Tasks
Leran Zhang | Nora Hollenstein

Eye movement features are considered to be direct signals reflecting human attention distribution with a low cost to obtain, inspiring researchers to augment language models with eye-tracking (ET) data. In this study, we select first fixation duration (FFD) and total reading time (TRT) as the cognitive signals to guide Transformer attention in question-answering (QA) tasks. We design three different ET attention masks based on the two features, either collected from human reading events or generated by a gaze-predicting model. We augment BERT and ALBERT models with attention masks structured based on the ET data. We find that augmenting a model with ET data carries linguistic features complementing the information captured by the model. It improves the models’ performance but compromises the stability. Different Transformer models benefit from different types of ET attention masks, while ALBERT performs better than BERT. Moreover, ET data collected from real-life reading events has better model augmenting ability than the model-predicted data.

Previous Sign Language Translation (SLT) methods achieve superior performance by relying on gloss annotations. However, labeling high-quality glosses is a labor-intensive task, which limits the further development of SLT. Although some approaches work towards gloss-free SLT through jointly training the visual encoder and translation network, these efforts still suffer from poor performance and inefficient use of the powerful Large Language Model (LLM). Most seriously, we find that directly introducing LLM into SLT will lead to insufficient learning of visual representations as LLM dominates the learning curve. To address these problems, we propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free SLT. Concretely, we factorize the training process into two stages. In the visual initialing stage, we employ a lightweight translation model after the visual encoder to pre-train the visual encoder. In the LLM fine-tuning stage, we freeze the acquired knowledge in the visual encoder and integrate it with a pre-trained LLM to inspire the LLM’s translation potential. This factorized training strategy proves to be highly effective as evidenced by significant improvements achieved across three SLT datasets which are all conducted under the gloss-free setting.

FaGANet: An Evidence-Based Fact-Checking Model with Integrated Encoder Leveraging Contextual Information
Weiyao Luo | Junfeng Ran | Zailong Tian | Sujian Li | Zhifang Sui

In the face of the rapidly growing spread of false and misleading information in the real world, manual evidence-based fact-checking efforts become increasingly challenging and time-consuming. In order to tackle this issue, we propose FaGANet, an automated and accurate fact-checking model that leverages the power of sentence-level attention and a graph attention network to enhance performance. This model adeptly integrates encoder-only models with a graph attention network, effectively fusing claims and evidence information for accurate identification of even well-disguised data. Experiment results showcase the significant improvement in accuracy achieved by our FaGANet model, as well as its state-of-the-art performance in the evidence-based fact-checking task. We release our code and data at https://github.com/WeiyaoLuo/FaGANet.

Multi-domain aspect-based sentiment analysis (ABSA) seeks to capture fine-grained sentiment across diverse domains. While existing research narrowly focuses on single-domain applications constrained by methodological limitations and data scarcity, the reality is that sentiment naturally traverses multiple domains. Although large language models (LLMs) offer a promising solution for ABSA, they are difficult to integrate effectively with established techniques, including graph-based models and linguistics, because modifying their internal architecture is not easy. To alleviate this problem, we propose a novel framework, Feature-aware In-context Learning for Multi-domain ABSA (FaiMA). The core insight of FaiMA is to utilize in-context learning (ICL) as a feature-aware mechanism that facilitates adaptive learning in multi-domain ABSA tasks. Specifically, we employ a multi-head graph attention network as a text encoder optimized by heuristic rules for linguistic, domain, and sentiment features. Through contrastive learning, we optimize sentence representations by focusing on these diverse features. Additionally, we construct an efficient indexing mechanism, allowing FaiMA to stably retrieve highly relevant examples across multiple dimensions for any given input. To evaluate the efficacy of FaiMA, we build the first multi-domain ABSA benchmark dataset. Extensive experimental results demonstrate that FaiMA achieves significant performance improvements in multiple domains compared to baselines, increasing F1 by 2.07% on average. Source code and data sets are available at https://github.com/SupritYoung/FaiMA.

FAIRification of LeiLanD
Eric Sanders | Sara Petrollino | Gilles R. Scheifer | Henk van den Heuvel | Christopher Handy

LeiLanD (Leiden Language Data) is a searchable catalogue initiated by the Leiden University Centre for Linguistics (LUCL) with the support of CLARIAH. The catalogue contains metadata about language datasets collected at LUCL and other institutes of Leiden University. This paper describes a project to FAIRify the datasets, increasing their findability and accessibility through the standardised metadata format CMDI, so as to obtain a rich metadata description for all resources and to make them findable through CLARIN’s Virtual Language Observatory. The paper describes the creation of the catalogue and the steps that led from unstructured metadata to CMDI standards. This FAIRification of LeiLanD has enhanced the findability and accessibility of an incredibly diverse collection of language datasets.

FalAI: A Dataset for End-to-end Spoken Language Understanding in a Low-Resource Scenario
Andres Pineiro-Martin | Carmen Garcia-Mateo | Laura Docio-Fernandez | Maria del Carmen Lopez-Perez | Jose Gandarela-Rodriguez

End-to-end (E2E) Spoken Language Understanding (SLU) systems infer structured information directly from the speech signal using a single model. Due to the success of virtual assistants and the increasing demand for speech interfaces, these architectures are being actively researched for their potential to improve system performance by exploiting acoustic information and avoiding the cascading errors of traditional architectures. However, these systems require large amounts of specific, well-labelled speech data for training, which is expensive to obtain even in English, where the number of public audio datasets for SLU is limited. In this paper, we release the FalAI dataset, the largest public SLU dataset in terms of hours (250 hours), recordings (260,000) and participants (over 10,000), which is also the first SLU dataset in Galician and the first to be obtained in a low-resource scenario. Furthermore, we present new measures of complexity for the text corpora, the strategies followed for the design, collection and validation of the dataset, and we define splits for noisy audio, hesitant audio and audio where the sentence has changed but the structured information is preserved. These novel splits provide a unique resource for testing SLU systems in challenging, real-world scenarios.

Fast Adaptation via Prompted Data: An Efficient Cross-Domain Fine-tuning Method for Large Language Models
Yiming Zhang | Hantao Yang | Haobo Wang | Jake Zhao

Large language models (LLMs) have achieved great success in a variety of natural language understanding tasks. However, domain discrepancies between the downstream task and the pre-training corpora may have hindered LLMs from excelling further in vertical applications. In contrast to prior computationally heavy methods, we propose a lightweight solution to further bridge the gap in applying LLMs to diverse downstream tasks: a Fast Adaptation method for LLMs via Prompted Data, FAvPD for short. Notably, with FAvPD, we establish an additional adaptive tuning procedure, wherein we integrate downstream text corpora, gold labels, and external knowledge sources and then envelop them into a form of highly controllable prompt. As a simple, easy-to-use, and versatile solution, FAvPD lies at the intersection of regimes like knowledge-augmented LLMs, fine-tuning, and adaptation techniques. With extensive experiments, we show that FAvPD excels in both performance efficacy and training efficiency over related prior works. FAvPD is publicly available at https://github.com/Hyatio/FAvPD.

FastSpell: The LangId Magic Spell
Marta Bañón | Gema Ramírez-Sánchez | Jaume Zaragoza-Bernabeu | Sergio Ortiz Rojas

Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts. However, commonly used language identifiers struggle to differentiate between similar or closely related languages. This paper introduces FastSpell, a language identifier that combines fastText (a pre-trained language identifier tool) and Hunspell (a spell checker) with the aim of having a refined second opinion before deciding which language should be assigned to a text. We provide a description of the FastSpell algorithm along with an explanation of how to use and configure it. To that end, we motivate the need for such a tool and present a benchmark including some popular language identifiers evaluated during the development of FastSpell. We show how FastSpell is useful not only to improve identification of similar languages, but also to identify new ones ignored by other tools.

FCDS: Fusing Constituency and Dependency Syntax into Document-Level Relation Extraction
Xudong Zhu | Zhao Kang | Bei Hui

Document-level Relation Extraction (DocRE) aims to identify relation labels between entities within a single document. It requires handling several sentences and reasoning over them. State-of-the-art DocRE methods use a graph structure to connect entities across the document to capture dependency syntax information. However, this is insufficient to fully exploit the rich syntax information in the document. In this work, we propose to fuse constituency and dependency syntax into DocRE. It uses constituency syntax to aggregate the whole sentence information and select the instructive sentences for the pairs of targets. It exploits dependency syntax in a graph structure with constituency syntax enhancement and chooses the path between entity pairs based on the dependency graph. The experimental results on datasets from various domains demonstrate the effectiveness of the proposed method.

Feature Structure Matching for Multi-source Sentiment Analysis with Efficient Adaptive Tuning
Rui Li | Cheng Liu | Yu Tong | Jiang Dazhi

Recently, fine-tuning large pre-trained language models on labeled sentiment datasets has achieved appealing performance. However, the obtained model may not generalize well to other domains due to domain shift, and it is expensive to update all the parameters of large models. Although some existing domain matching methods have been proposed to alleviate the above issues, there are multiple relevant source domains in practice, which makes the whole training more costly and complicated. To this end, we focus on the efficient unsupervised multi-source sentiment adaptation task, which is more challenging and beneficial for real-world applications. Specifically, we propose to extract multi-layer features from the large pre-trained model, and design a dynamic parameters fusion module to exploit these features for both efficient and adaptive tuning. Furthermore, we propose a novel feature structure matching constraint, which enforces similar feature-wise correlations across different domains. Compared with the traditional domain matching methods, which tend to pull all feature instances close, we show that the proposed feature structure matching is more robust and generalizable in the multi-source scenario. Extensive experiments on several multi-source sentiment analysis benchmarks demonstrate the effectiveness and superiority of our proposed framework.

Federated Document-Level Biomedical Relation Extraction with Localized Context Contrast
Yan Xiao | Yaochu Jin | Kuangrong Hao

Existing studies on relation extraction focus at the document level in a centralized training environment, requiring the collection of documents from various sources. However, this raises concerns about privacy protection, especially in sensitive domains such as finance and healthcare. For the first time, this work extends document-level relation extraction to a federated environment. The proposed federated framework, called FedLCC, is tailored for biomedical relation extraction that enables collaborative training without sharing raw medical texts. To fully exploit the models of all participating clients and improve the local training on individual clients, we propose a novel concept of localized context contrast on the basis of contrastive learning. By comparing and rectifying the similarity of localized context in documents between clients and the central server, the global model can better represent the documents on individual clients. Due to the lack of a widely accepted measure of non-IID text data, we introduce a novel non-IID scenario based on graph structural entropy. Experimental results on three document-level biomedical relation extraction datasets demonstrate the effectiveness of our method. Our code is available at https://github.com/xxxxyan/FedLCC.

Federated Foundation Models: Privacy-Preserving and Collaborative Learning for Large Models
Sixing Yu | Juan Pablo Munoz | Ali Jannesari

Foundation Models (FMs), such as LLaMA, BERT, GPT, ViT, and CLIP, have demonstrated remarkable success in a wide range of applications, driven by their ability to leverage vast amounts of data for pre-training. However, optimizing FMs often requires access to sensitive data, raising privacy concerns and limiting their applicability in many domains. In this paper, we propose the Federated Foundation Models (FFMs) paradigm, which combines the benefits of FMs and Federated Learning (FL) to enable privacy-preserving and collaborative learning across multiple end-users. We discuss the potential benefits and challenges of integrating FL into the lifespan of FMs, covering pre-training, fine-tuning, and application. We further outline potential future research avenues in FFM, including FFM pre-training, FFM fine-tuning, and federated prompt tuning, which allow the development of more personalized and context-aware models while ensuring data privacy. Moreover, we explore the possibility of continual/lifelong learning in FFMs, as increased computational power at the edge may unlock the potential for optimizing FMs using newly generated private data close to the data source. The proposed FFM concepts offer a flexible and scalable framework for training large language models in a privacy-preserving manner, setting the stage for subsequent advancements in both FM training and federated learning.

Few-Shot Learning for Cold-Start Recommendation
Mingming Li | Songlin Hu | Fuqing Zhu | Qiannan Zhu

Cold-start is a significant problem in recommender systems. Recently, with the development of few-shot learning and meta-learning techniques, many researchers have devoted themselves to adopting meta-learning in recommendation as a natural few-shot scenario. Nevertheless, we argue that recent work leaves a huge gap between few-shot learning and recommendation. In particular, users are locally dependent, not globally independent, in recommendation. Therefore, it is necessary to formulate the local relationships between users. To accomplish this, we present a novel Few-shot learning method for Cold-Start (FCS) recommendation that consists of three hierarchical structures. More concretely, the first hierarchy is the global-meta parameters for learning the global information of all users; the second hierarchy is the local-meta parameters, whose goal is to learn the adaptive cluster of local users; the third hierarchy is the specific parameters of the target user. Both the global and local information are formulated, rapidly addressing the new user’s problem in accordance with the few-shot records. Experimental results on two public real-world datasets show that the FCS method could produce stable improvements compared with the state-of-the-art.

Few-shot Link Prediction on Hyper-relational Facts
Jiyao Wei | Saiping Guan | Xiaolong Jin | Jiafeng Guo | Xueqi Cheng

Hyper-relational facts, which consist of a primary triple (head entity, relation, tail entity) and auxiliary attribute-value pairs, are widely present in real-world Knowledge Graphs (KGs). Link Prediction on Hyper-relational Facts (LPHFs) is to predict a missing element in a hyper-relational fact, which helps populate and enrich KGs. However, existing LPHFs studies usually require a large amount of high-quality data. They overlook few-shot relations, which have limited instances, yet are common in real-world scenarios. Thus, we introduce a new task, Few-Shot Link Prediction on Hyper-relational Facts (FSLPHFs). It aims to predict a missing entity in a hyper-relational fact with limited support instances. To tackle FSLPHFs, we propose MetaRH, a model that learns Meta Relational information in Hyper-relational facts. MetaRH comprises three modules: relation learning, support-specific adjustment, and query inference. By capturing meta relational information from limited support instances, MetaRH can accurately predict the missing entity in a query. As there is no existing dataset available for this new task, we construct three datasets to validate the effectiveness of MetaRH. Experimental results on these datasets demonstrate that MetaRH significantly outperforms existing representative models.
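
As a rough illustration of the data structure this abstract describes, a hyper-relational fact can be modelled as a primary triple plus a set of attribute-value pairs, with a link-prediction query leaving one entity slot empty. The class name, field names, and example entities below are assumptions chosen for illustration only; they are not taken from the paper or its datasets.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HyperRelationalFact:
    """A primary triple plus auxiliary attribute-value pairs (hypothetical layout)."""
    head: Optional[str]          # None marks a missing head entity in a query
    relation: str
    tail: Optional[str]          # None marks a missing tail entity in a query
    qualifiers: Dict[str, str] = field(default_factory=dict)

    def missing_slots(self) -> List[str]:
        """Return the names of the entity slots that a model would need to predict."""
        slots = (("head", self.head), ("tail", self.tail))
        return [name for name, value in slots if value is None]


# A complete fact: the primary triple is qualified by one attribute-value pair.
support = HyperRelationalFact(
    head="Marie Curie",
    relation="educated_at",
    tail="University of Paris",
    qualifiers={"academic_degree": "Doctorate"},
)

# A query over the same relation with the tail entity left to be predicted.
query = HyperRelationalFact(
    head="Marie Curie",
    relation="educated_at",
    tail=None,
    qualifiers={"academic_degree": "Doctorate"},
)

print(support.missing_slots())  # no slots missing in the support instance
print(query.missing_slots())    # the tail slot is missing in the query
```

In a few-shot setting, a handful of complete facts like `support` would be available per relation, and the model would fill the empty slot in facts like `query`.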

Multimodal Named Entity Recognition (MNER) models typically require a significant volume of labeled data for effective training to extract relations between entities. In real-world scenarios, we frequently encounter unseen relation types. Nevertheless, existing methods are predominantly tailored for complete datasets and are not equipped to handle these new relation types. In this paper, we introduce the Few-shot Multimodal Named Entity Recognition (FMNER) task to address these novel relation types. FMNER trains in the source domain (seen types) and tests in the target domain (unseen types) with different distributions. Due to limited available resources for sampling, each sampling instance yields different content, resulting in data bias and alignment problems of multimodal units (image patches and words). To alleviate the above challenge, we propose a novel Multimodal causal Intervention graphs (MOUSING) model for FMNER. Specifically, we begin by constructing a multimodal graph that incorporates fine-grained information from multiple modalities. Subsequently, we introduce the Multimodal Causal Intervention Strategy to update the multimodal graph. It aims to decrease spurious correlations and emphasize accurate correlations between multimodal units, resulting in effectively aligned multimodal representations. Extensive experiments on two multimodal named entity recognition datasets demonstrate the superior performance of our model in the few-shot setting.

Few-shot NER aims to identify entities of target types with only a limited number of illustrative instances. Unfortunately, few-shot NER is severely challenged by the intrinsic precise generalization problem, i.e., it is hard to accurately determine the desired target type due to the ambiguity stemming from information deficiency. In this paper, we propose Superposition Concept Discriminator (SuperCD), which resolves the above challenge via an active learning paradigm. Specifically, a concept extractor is first introduced to identify superposition concepts from illustrative instances, with each concept corresponding to a possible generalization boundary. Then a superposition instance retriever is applied to retrieve corresponding instances of these superposition concepts from a large-scale text corpus. Finally, annotators are asked to annotate the retrieved instances, and these annotated instances together with the original illustrative instances are used to learn FS-NER models. To this end, we learn a universal concept extractor and superposition instance retriever using large-scale openly available knowledge bases. Experiments show that SuperCD can effectively identify superposition concepts from illustrative instances, retrieve superposition instances from a large-scale corpus, and significantly improve the few-shot NER performance with minimal additional effort.

Few-Shot Relation Extraction with Hybrid Visual Evidence
Jiaying Gong | Hoda Eldardiry

The goal of few-shot relation extraction is to predict relations between named entities in a sentence when only a few labeled instances are available for training. Existing few-shot relation extraction methods focus on uni-modal information such as text only. This reduces performance when there is no clear context between the named entities described in text. We propose a multi-modal few-shot relation extraction model (MFS-HVE) that leverages both textual and visual semantic information to learn a multi-modal representation jointly. The MFS-HVE includes semantic feature extractors and multi-modal fusion components. The MFS-HVE semantic feature extractors are developed to extract both textual and visual features. The visual features include global image features and local object features within the image. The MFS-HVE multi-modal fusion unit integrates information from various modalities using image-guided attention, object-guided attention, and hybrid feature attention to fully capture the semantic interaction between visual regions of images and relevant texts. Extensive experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.

Graph neural networks (GNNs) have achieved promising performance on semantic dependency parsing (SDP), owing to their powerful graph representation learning ability. However, training a high-performing GNN-based model requires a large amount of labeled data and it is prone to over-fitting in the absence of sufficient labeled data. To address this drawback, we propose a syntax-guided graph contrastive learning framework to pre-train GNNs with plenty of unlabeled data and fine-tune pre-trained GNNs with few-shot labeled SDP data. Through extensive experiments conducted on the SemEval-2015 Task 18 English dataset in three formalisms (DM, PAS, and PSD), we demonstrate that our framework achieves promising results when few-shot training samples are available. Furthermore, benefiting from the pre-training process, our framework exhibits notable advantages in the out-of-domain test sets.

Diffusion models have achieved significant success in computer vision and shown immense potential in natural language processing applications, particularly for text generation tasks. However, generating high-quality text using these models often necessitates thousands of iterations, leading to slow sampling rates. Existing acceleration methods either neglect the importance of the distribution of sampling steps, resulting in compromised performance with a smaller number of iterations, or require additional training, introducing considerable computational overhead. In this paper, we present Few-shot Temporal Pruning, a novel technique designed to accelerate diffusion models for text generation without supplementary training while effectively leveraging limited data. Employing a Bayesian optimization approach, our method effectively eliminates redundant sampling steps during the sampling process, thereby enhancing the generation speed. A comprehensive evaluation of discrete and continuous diffusion models across various tasks, including machine translation, question generation, and paraphrasing, reveals that our approach achieves competitive performance even with minimal sampling steps after as little as less than 1 minute of optimization, yielding a significant acceleration of up to 400x in text generation tasks.

FFSTC: Fongbe to French Speech Translation Corpus
D. Fortuné Kponou | Fréjus A. A. Laleye | Eugène Cokou Ezin

In this paper, we introduce the Fongbe to French Speech Translation Corpus (FFSTC). This corpus encompasses approximately 31 hours of collected Fongbe language content, featuring both French transcriptions and corresponding Fongbe voice recordings. FFSTC represents a comprehensive dataset compiled through various collection methods and the efforts of dedicated individuals. Furthermore, we conduct baseline experiments using Fairseq’s transformer_s and conformer models to evaluate data quality and validity. Our results indicate a BLEU score of 8.96 for the transformer_s model and 8.14 for the conformer model, establishing a baseline for the FFSTC corpus.

FinCorpus-DE10k: A Corpus for the German Financial Domain
Serhii Hamotskyi | Nata Kozaeva | Christian Hänig

We introduce a predominantly German corpus comprising 12.5k PDF documents sourced from the financial domain. The corresponding extracted textual data encompasses more than 165 million tokens derived predominantly from German, and to a lesser extent, bilingual documents. We provide detailed information about the document types included in the corpus, such as final terms, base prospectuses, annual reports, information materials, law documents, international financial reporting standards, and monthly reports from the Bundesbank, accompanied by comprehensive statistical analysis. To our knowledge, it is the first non-email German financial corpus available, and we hope it will fill this gap and foster further research in the financial domain both in the German language and in multilingual contexts.

Finding Educationally Supportive Contexts for Vocabulary Learning with Attention-Based Models
Sungjin Nam | Kevyn Collins-Thompson | David Jurgens | Xin Tong

When learning new vocabulary, both humans and machines acquire critical information about the meaning of an unfamiliar word through contextual information in a sentence or passage. However, not all contexts are equally helpful for learning an unfamiliar ‘target’ word. Some contexts provide a rich set of semantic clues to the target word’s meaning, while others are less supportive. We explore the task of finding educationally supportive contexts with respect to a given target word for vocabulary learning scenarios, particularly for improving student literacy skills. Because of their inherent context-based nature, attention-based deep learning methods provide an ideal starting point. We evaluate attention-based approaches for predicting the amount of educational support from contexts, ranging from a simple custom model using pre-trained embeddings with an additional attention layer, to a commercial Large Language Model (LLM). Using an existing major benchmark dataset for educational context support prediction, we found that a sophisticated but generic LLM had poor performance, while a simpler model using a custom attention-based approach achieved the best-known performance to date on this dataset.

The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI’s GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from the irrelevant ones (filtering), and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4’s performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4’s tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4’s performance.

Find-the-Common: A Benchmark for Explaining Visual Patterns from Images
Yuting Shi | Naoya Inoue | Houjing Wei | Yufeng Zhao | Tao Jin

Recent advances in Instruction-fine-tuned Vision and Language Models (IVLMs), such as GPT-4V and InstructBLIP, have prompted in-depth analyses of the reasoning capabilities of IVLMs. However, Inductive Visual Reasoning, a vital skill for text-image understanding, remains underexplored due to the absence of benchmarks. In this paper, we introduce Find-the-Common (FTC): a new vision and language task for Inductive Visual Reasoning. In this task, models are required to identify an answer that explains the common attributes across visual scenes. We create a new dataset for the FTC and assess the performance of several contemporary approaches including Image-Based Reasoning, Text-Based Reasoning, and Image-Text-Based Reasoning with various models. Extensive experiments show that even state-of-the-art models like GPT-4V can only achieve 48% accuracy on the FTC, which makes the FTC a new challenge for the visual reasoning research community. Our dataset has been released and is available online: https://github.com/SSSSSeki/Find-the-common.

Fine-grained Classification of Circumstantial Meanings within the Prague Dependency Treebank Annotation Scheme
Marie Mikulova

In the contribution, we propose a formally and semantically based fine-grained classification of circumstantial meanings based on the analysis of a large number of valuable examples from the Prague Dependency Treebanks. The methodology and principles of the presented approach are elaborated in detail and demonstrated on two case studies. The classification of circumstantial meanings is carried out for the Czech language, but the methodology and principles used are language independent. The contribution also addresses the question of language universality and specificity through a comparison with English. The aim of this work is to enrich the annotation in the Prague Dependency Treebanks with detailed information on circumstantial meanings but it may also be useful for other semantically oriented projects. To the best of our knowledge, a similar corpus-based and corpus-verified elaborate classification of circumstantial meanings has not yet been proposed in any annotation project. The contribution presents the results of an ongoing work.

Legal Argument-Pair Extraction (LAE) is dedicated to the identification of interactive arguments targeting the same subject matter within legal complaints and corresponding defenses. This process serves as a foundation for automatically recognizing the focal points of disputes. Current methodologies predominantly conceptualize LAE as a supervised sentence-pair classification problem and usually necessitate extensive manual annotations, thereby constraining their scalability and general applicability. To this end, we present an innovative approach to LAE that focuses on fine-grained alignment of argument pairs, building upon coarse-grained complaint-defense pairs. This strategy stems from two key observations: 1) In general, every argument presented in a legal complaint is likely to be addressed by at least one corresponding argument in the defense. 2) It’s rare for multiple complaint arguments to be addressed by a single defense argument; rather, each complaint argument usually corresponds to a unique defense argument. Motivated by these insights, we develop a specialized pre-training framework. Our model employs pre-training objectives designed to exploit the coarse-grained supervision signals. This enables expressive representations of legal arguments for LAE, even when working with a limited amount of labeled data. To verify the effectiveness of our model, we construct the largest LAE datasets from two representative causes, private lending and contract dispute. The experimental results demonstrate that our model can effectively capture informative argument knowledge from unlabeled complaint-defense pairs and outperform the unsupervised and supervised baselines by 3.7 and 2.4 points on average, respectively. Besides, our model can reach superior accuracy with only half of the manually annotated data. The datasets and code can be found at https://github.com/thunlp/LAE.

Fine-Tuning a Pre-Trained Wav2Vec2 Model for Automatic Speech Recognition- Experiments with De Zahrar Sproche
Andrea Gulli | Francesco Costantini | Diego Sidraschi | Emanuela Li Destri

We present the results of an Automatic Speech Recognition system developed to support linguistic documentation efforts. The test case is the zahrar sproche language, a Southern Bavarian variety spoken in the language island of Sauris/Zahre in Italy. We collected a dataset of 9,000 words and approximately 80 minutes of speech. The goal is to reduce the transcription workload of field linguists. The method used is a deep learning approach based on the language-specific tuning of a generic pre-trained representation model, XLS-R. The transcription quality of the experiments on the collected dataset is promising. We test the model’s performance on some fieldwork historical recordings, report the results, and evaluate them qualitatively. Finally, we indicate possibilities for improvement in this challenging task.

First Steps Towards the Integration of Resources on Historical Glossing Traditions in the History of Chinese: A Collection of Standardized Fǎnqiè Spellings from the Guǎngyùn
Michele Pulini | Johann-Mattis List

Due to the peculiar nature of the Chinese writing system, it is difficult to assess the pronunciation of historical varieties of Chinese. In order to reconstruct ancient pronunciations, historical glossing practices play a crucial role. However, although studied thoroughly by numerous scholars, most research has been carried out in a qualitative manner, and no attempt at providing integrated resources of historical glossing practices has been made so far. Here, we present a first step towards the integration of resources on historical glossing traditions in the history of Chinese. Our starting point is the so-called fǎnqiè spellings in the Guǎngyùn, one of the early rhyme books in the history of Chinese, providing pronunciations for more than 20000 Chinese characters. By standardizing digital versions of the resource using tools from computational historical linguistics, we show that we can predict historical spellings with high precision and at the same time shed light on the precision of ancient glossing practices. Although a considerably small first step, our resource could be the starting point for an integrated, standardized collection that could ultimately shed new light on the history of Chinese.

Fisher Mask Nodes for Language Model Merging
Thennal D K | Ganesh Nathan | Suchithra M S

Fine-tuning pre-trained models provides significant advantages in downstream performance. The ubiquitous nature of pre-trained models such as BERT and its derivatives in natural language processing has also led to a proliferation of task-specific fine-tuned models. As these models typically only perform one task well, additional training or ensembling is required in multi-task scenarios. The growing field of model merging provides a solution, dealing with the challenge of combining multiple task-specific models into a single multi-task model. In this study, we introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning. Utilizing the Fisher information of mask nodes within the Transformer architecture, we devise a computationally efficient weighted-averaging scheme. Our method exhibits a regular and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging at a fraction of the computational cost, with baseline performance improvements of up to +6.5 and a speedup between 57.4x and 321.7x across models. Our results prove the potential of our method in current multi-task learning environments and suggest its scalability and adaptability to new model architectures and learning scenarios.

FlattenQuant: Breaking through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
Yi Zhang | Fei Yang | Shuang Peng | Fangyu Wang | Aimin Pan

Large language models (LLMs) have demonstrated state-of-the-art accuracies across various tasks. However, the latency of inference and the large GPU memory consumption of LLMs restrict their deployment performance. Recently, there have been some efficient attempts to quantize LLMs, yet inference with large batch size or long sequence still has the issue of being compute-bound. Fine-grained quantization methods have showcased their proficiency in achieving low-bit quantization for LLMs, while requiring FP16 data type for linear layer computations, which is time-consuming when dealing with large batch size or long sequence. In this paper, we introduce a method called FlattenQuant, which significantly reduces the maximum value of the tensor by flattening the larger channels in the tensor, to achieve low bit per-tensor quantization with minimal accuracy loss. Our experiments show that FlattenQuant can directly use 4 bits to achieve 48.29% of the linear layer calculation in LLMs, with the remaining layer using 8 bits. The 4-bit matrix multiplication introduced in the FlattenQuant method can effectively address the compute-bound caused by large matrix calculation. Our work achieves up to 2× speedup and 2.3× memory reduction for LLMs with negligible loss in accuracy.

Flexible Lexicalization in Rule-based Text Realization
Avril Gazeau | Francois Lareau

GenDR is a text realizer that takes as input a graph-based semantic representation and outputs the corresponding syntactic dependency trees. One of the tasks in this transduction is lexicalization, i.e., choosing the right lexical units to express a given semanteme. To do so, GenDR uses a semantic dictionary that maps semantemes to corresponding lexical units in a given language. This study aims to develop a flexible lexicalization module to automatically build a rich semantic dictionary for French. To achieve this, we tried two methods. The first one consisted in extracting information from the French Lexical Network, a large-scale French lexical resource, and adapting it to GenDR. The second one was to test a contextual neural language model’s ability to generate potential additional lexicalizations. The first method significantly broadened the coverage of GenDR, while the additional lexicalizations produced by the language model turned out to be of limited use, which brings us to the conclusion that it is not suited to perform the task we asked of it.

Large language models have amply proven their great capabilities, both in downstream tasks and real-life settings. However, low- and mid-resource languages do not have access to the necessary means to train such models from scratch, and often have to rely on multilingual models despite being underrepresented in the training data. For the particular case of the Catalan language, we prove that continued pre-training with vocabulary adaptation is a better alternative to make the most of already pre-trained models, even if these have not seen any Catalan data during their pre-training phase. We curate a 26B-token corpus and use it to further pre-train BLOOM, giving rise to the FLOR models. We perform an extensive evaluation to assess the effectiveness of our method, obtaining consistent gains across Catalan and Spanish tasks. The models, training data, and evaluation framework are made freely available under permissive licenses.

FoRC4CL: A Fine-grained Field of Research Classification and Annotated Dataset of NLP Articles
Raia Abu Ahmad | Ekaterina Borisova | Georg Rehm

The steep increase in the number of scholarly publications has given rise to various digital repositories, libraries and knowledge graphs that aim to capture, manage, and preserve scientific data. Efficiently navigating such databases requires a system able to classify scholarly documents according to the respective research (sub-)field. However, not every digital repository possesses a relevant classification schema for categorising publications. For instance, one of the largest digital archives in Computational Linguistics (CL) and Natural Language Processing (NLP), the ACL Anthology, lacks a system for classifying papers into topics and sub-topics. This paper addresses this gap by constructing a corpus of 1,500 ACL Anthology publications annotated with their main contributions using a novel hierarchical taxonomy of core CL/NLP topics and sub-topics. The corpus is used in a shared task with the goal of classifying CL/NLP papers into their respective sub-topics.

FORECAST2023: A Forecast and Reasoning Corpus of Argumentation Structures
Kamila Górska | John Lawrence | Chris Reed

It is known from large-scale crowd experimentation that some people are innately better at analysing complex situations and making justified predictions – the so-called ‘superforecasters’. Surprisingly, however, there has to date been no work exploring the role played by the reasoning in those justifications. Bag-of-words analyses might tell us something, but the real value lies in understanding what features of reasoning and argumentation lead to better forecasts – both in providing an objective measure for argument quality, and even more importantly, in providing guidance on how to improve forecasting performance. The work presented here covers the creation of a unique dataset of such prediction rationales, the structure of which naturally lends itself to partially automated annotation which in turn is used as the basis for subsequent manual enhancement that provides a uniquely fine-grained and close characterisation of the structure of argumentation, with potential impact on forecasting domains from intelligence analysis to investment decision-making.

FoTo: Targeted Visual Topic Modeling for Focused Analysis of Short Texts
Sanuj Kumar | Tuan Le

Given a corpus of documents, focused analysis aims to find topics relevant to aspects that a user is interested in. The aspects are often expressed by a set of keywords provided by the user. Short texts such as microblogs and tweets pose several challenges to this task because the sparsity of word co-occurrences may hinder the extraction of meaningful and relevant topics. Moreover, most of the existing topic models perform a full corpus analysis that treats all topics equally, which may cause the learned topics to be off target. In this paper, we propose a novel targeted topic model for semantic short-text embedding which aims to learn all topics and low-dimensional visual representations of documents, while preserving relevant topics for focused analysis of short texts. To preserve the relevant topics in the visualization space, we propose jointly modeling topics and the pairwise document ranking based on document-keyword distances in the visualization space. The extensive experiments on several real-world datasets demonstrate the effectiveness of our proposed model in terms of targeted topic modeling and visualization.

FRACAS: a FRench Annotated Corpus of Attribution relations in newS
Ange Richard | Laura Cristina Alonzo Canul | François Portet

Quotation extraction is a widely useful task both from a sociological and from a Natural Language Processing perspective. However, very little data is available to study this task in languages other than English. In this paper, we present FRACAS, a manually annotated corpus of 1,676 newswire texts in French for quotation extraction and source attribution. We first describe the composition of our corpus and the choices that were made in selecting the data. We then detail the annotation guidelines, the annotation process and give relevant statistics about our corpus. We give results for the inter-annotator agreement, which is substantially high for such a difficult linguistic phenomenon. We use this new resource to test the ability of a neural state-of-the-art relation extraction system to extract quotes and their source and we compare this model to the latest available system for quotation extraction for the French language, which is rule-based. Experiments using our dataset on the state-of-the-art system show very promising results considering the difficulty of the task at hand.

This paper presents the Frame2 dataset, a multimodal dataset built from a corpus of a Brazilian travel TV show annotated for FrameNet categories for both the text and image communicative modes. Frame2 comprises 230 minutes of video, which are correlated with 2,915 sentences either transcribing the audio spoken during the episodes or the subtitling segments of the show where the host conducts interviews in English. For this first release of the dataset, a total of 11,796 annotation sets for the sentences and 6,841 for the video are included. Each of the former includes a target lexical unit evoking a frame or one or more frame elements. For each video annotation, a bounding box in the image is correlated with a frame, a frame element and lexical unit evoking a frame in FrameNet.

This paper presents Framed Multi30K (FM30K), a novel frame-based Brazilian Portuguese multimodal-multilingual dataset which (i) extends the Multi30K dataset (Elliot et al., 2016) with 158,915 original Brazilian Portuguese descriptions and 30,104 Brazilian Portuguese translations from original English descriptions; (ii) adds 2,677,613 frame evocation labels to the 158,915 English descriptions and to the ones created for Brazilian Portuguese; and (iii) extends the Flickr30k Entities dataset (Plummer et al., 2015) with 190,608 frames and Frame Elements correlations with the existing phrase-to-region correlations.

Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs) where there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection, which is freely available on GitHub (https://github.com/JamilProg/crosslingual_bert_annotation_projection). Leveraging a language agnostic BERT-based approach, it is an efficient solution to increase low-resource corpora with little human effort and by only using already available open data resources. Quantitative and qualitative evaluations are often lacking when it comes to evaluating the quality and effectiveness of semi-automatic data generation strategies. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the creation of French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2’051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French natural language processing (NLP) applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.

FReND: A French Resource of Negation Data
Hafida Le Cloirec - Ait Yahya | Olga Seminck | Pascal Amsili

FReND is a freely available corpus of French language in which negations are hand-annotated. Negations are annotated by their cues and scopes. Comprising 590K tokens and over 8.9K negations, it is the largest dataset available for French. A variety of types of textual genres are covered: literature, blog posts, Wikipedia articles, political debates, clinical reports and newspaper articles. As the understanding of negation is not yet mastered by current state of the art AI-models, FReND is not only a valuable resource for linguistic research into negation, but also useful as training data for AI tasks such as negation detection.

Confusing charge prediction is a challenging task in legal AI, which involves predicting confusing charges based on fact descriptions. While existing charge prediction methods have shown impressive performance, they face significant challenges when dealing with confusing charges, such as Snatch and Robbery. In the legal domain, constituent elements play a pivotal role in distinguishing confusing charges. Constituent elements are fundamental behaviors underlying criminal punishment and have subtle distinctions among charges. In this paper, we introduce a novel From Graph to Word Bag (FWGB) approach, which introduces domain knowledge regarding constituent elements to guide the model in making judgments on confusing charges, much like a judge’s reasoning process. Specifically, we first construct a legal knowledge graph containing constituent elements to help select keywords for each charge, forming a word bag. Subsequently, to guide the model’s attention towards the differentiating information for each charge within the context, we expand the attention mechanism and introduce a new loss function with attention supervision through words in the word bag. We construct the confusing charges dataset from real-world judicial documents. Experiments demonstrate the effectiveness of our method, especially in maintaining exceptional performance in imbalanced label distributions.

In this digital era, memes have become a prevalent form of online expression, humor, sarcasm, and social commentary. However, beneath their surface lie concerning issues such as the propagation of misogyny, gender-based bias, and harmful stereotypes. To overcome these issues, we introduce MDMD (Misogyny Detection Meme Dataset) in this paper. This article focuses on creating an annotated dataset with detailed annotation guidelines to delve into online misogyny within the Tamil and Malayalam-speaking communities. Through analyzing memes, we uncover the intricate world of gender bias and stereotypes in these communities, shedding light on their manifestations and impact. This dataset, along with its comprehensive annotation guidelines, is a valuable resource for understanding the prevalence, origins, and manifestations of misogyny in various contexts, aiding researchers, policymakers, and organizations in developing effective strategies to combat gender-based discrimination and promote equality and inclusivity. It enables a deeper understanding of the issue and provides insights that can inform strategies for cultivating a more equitable and secure online environment. This work represents a crucial step in raising awareness and addressing gender-based discrimination in the digital space.

With advances in the field of Linked (Open) Data (LOD), language data on the LOD cloud has grown in number, size, and variety. With an increased volume and variety of language data, optimizations of methods for distributing, storing, and querying these data become more central. To this end, this position paper investigates use cases at the intersection of LLOD and Big Data as well as existing approaches to utilizing Big Data techniques within the context of linked data, and discusses the challenges and benefits of this union.

From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization
Botond Barta | Dorina Lakatos | Attila Nagy | Milán Konor Nyist | Judit Ács

Training summarization models requires substantial amounts of training data. However, for less-resourced languages like Hungarian, openly available models and datasets are notably scarce. To address this gap, our paper introduces an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is assembled from segments of the Common Crawl corpus undergoing thorough cleaning, preprocessing and deduplication. In addition to abstractive summarization, we generate sentence-level labels for extractive summarization using sentence similarity. We train baseline models for both extractive and abstractive summarization using the collected dataset. To demonstrate the effectiveness of the trained models, we perform both quantitative and qualitative evaluation. Our models and dataset will be made publicly available, encouraging replication, further research, and real-world applications across various domains.

From Technology to Market. Bilingual Corpus on the Evaluation of Technology Opportunity Discovery
Amir Hazem | Kazuyuki Motohashi | Chen Zhu

As companies aim to enhance and expand their product portfolios, Technology Opportunity Discovery (TOD) has gained increasing interest. To comprehend the role of emerging technologies in innovation, we introduce a novel technology-market corpus in English and Japanese languages, and conduct a comprehensive empirical evaluation of the linkage between technology and the market. Our dataset comprises English patents extracted from the USPTO database and Japanese patents from the Japanese Patent Office (JPO), along with their associated products for each stock market company. We compare several static and contextualized word embedding methods to construct a technology-market space and propose an effective methodology based on a fine-tuned BERT model for linking technology to the market.

From Text to Historical Ecological Knowledge: The Construction and Application of the Shan Jing Knowledge Base
Ke Liang | Chu-Ren Huang | Xin-Lan Jiang

Traditional Ecological Knowledge (TEK) has been recognized as a shared cultural heritage and a crucial instrument to tackle today’s environmental challenges. In this paper, we deal with historical ecological knowledge, a special type of TEK that is based on ancient language texts. In particular, we aim to build a language resource based on Shanhai Jing (The Classic of Mountains and Seas). Written 2000 years ago, Shanhai Jing is a record of flora and fauna in ancient China, anchored by mountains (shan) and seas (hai). This study focuses on the entities in the Shan Jing part and builds a knowledge base for them. We adopt a pattern-driven and bottom-up strategy to accommodate two features of the source: highly stylized narrative and juxtaposition of knowledge from multiple domains. The PRF values of both entity and relationship extraction are above 96%. Quality assurance measures like entity disambiguation and resolution were performed by domain experts. The Neo4j graph database is used to visualize the result. We think the knowledge base, containing 1432 systematically classified entities and 3294 relationships, can provide the foundation for the construction of a historical ecological knowledge base of China. Additionally, the rule-based text-matching method can be helpful in ancient language processing.

From Text to Source: Results in Detecting Large Language Model-Generated Content
Wissam Antoun | Benoît Sagot | Djamé Seddah

The widespread use of Large Language Models (LLMs), celebrated for their ability to generate human-like text, has raised concerns about misinformation and ethical implications. Addressing these concerns necessitates the development of robust methods to detect and attribute text generated by LLMs. This paper investigates “Cross-Model Detection,” by evaluating whether a classifier trained to distinguish between source LLM-generated and human-written text can also detect text from a target LLM without further training. The study comprehensively explores various LLM sizes and families and assesses the impact of conversational fine-tuning techniques, quantization, and watermarking on classifier generalization. The research also explores Model Attribution, encompassing source model identification, model family, and model size classification, in addition to quantization and watermarking detection. Our results reveal several key findings: a clear inverse relationship between classifier effectiveness and model size, with larger LLMs being more challenging to detect, especially when the classifier is trained on data from smaller models. Training on data from similarly sized LLMs can improve detection performance from larger models but may lead to decreased performance when dealing with smaller models. Additionally, model attribution experiments show promising results in identifying source models and model families, highlighting detectable signatures in LLM-generated text, with particularly remarkable outcomes in watermarking detection, while no detectable signatures of quantization were observed. Overall, our study contributes valuable insights into the interplay of model size, family, and training data in LLM detection and attribution.

FUSE - FrUstration and Surprise Expressions: A Subtle Emotional Multimodal Language Corpus
Rajesh Titung | Cecilia Ovesdotter Alm

This study introduces a novel multimodal corpus for expressive task-based spoken language and dialogue, focused on language use under frustration and surprise, elicited from three tasks motivated by prior research and collected in an IRB-approved experiment. The resource is unique both because these are understudied affect states for emotion modeling in language, and also because it provides both individual and dyadic multimodally grounded language. The study includes a detailed analysis of annotations and performance results for multimodal emotion inference in language use.

Common document ranking pipelines in search systems are cascade systems that involve multiple ranking layers to integrate different information step-by-step. In this paper, we propose a novel re-ranker Fusion-in-T5 (FiT5), which integrates text matching information, ranking features, and global document information into one single unified model via template-based input and global attention. Experiments on passage ranking benchmarks MS MARCO and TREC DL show that FiT5, as one single model, significantly improves ranking performance over complex cascade pipelines. Analysis finds that through attention fusion, FiT5 jointly utilizes various forms of ranking information via gradually attending to related documents and ranking features, and improves the detection of subtle nuances. Our code is open-sourced at https://github.com/OpenMatch/FiT5. Keywords: document ranking, attention, fusion

GAATME: A Genetic Algorithm for Adversarial Translation Metrics Evaluation
Josef Jon | Ondřej Bojar

Building on a recent method for decoding translation candidates from a Machine Translation (MT) model via a genetic algorithm, we modify it to generate adversarial translations to test and challenge MT evaluation metrics. The produced translations score very well in an arbitrary MT evaluation metric selected beforehand, despite containing serious, deliberately introduced errors. The method can be used to create adversarial test sets to analyze the biases and shortcomings of the metrics. We publish various such test sets for the Czech to English language pair, as well as the code to convert any parallel data into a similar adversarial test set.

GCNet: Global-and-Context Collaborative Learning for Aspect-Based Sentiment Analysis
Ting Zhou | Ying Shen | Yinghui Li

Aspect-Based Sentiment Analysis (ABSA) aims to determine the sentiment polarities of specified aspect terms in a sentence. Most previous approaches mainly use an attention mechanism or graph neural networks based on dependency trees to explicitly model the connections between aspect terms and opinion words. However, these methods may not effectively address cases where the sentiment of an aspect term is implicitly described, as the corresponding opinion words may not directly appear in the sentence. To alleviate this issue, in this paper, we propose a GCNet that explicitly leverages global semantic information to guide context encoding. Particularly, we design a semantics encoding module that incorporates global semantic features into sequential modeling process to enable the consideration of the overall sentiment tendency of a sentence, while the global semantic features are also refined by adaptively focusing on different parts of the sentence. Moreover, for a comprehensive sentence analysis, we also include a syntactic feature encoding module along with a pre-fusion module to integrate the refined global features with the syntactic representations. Extensive experiments on three public datasets demonstrate that our model outperforms state-of-the-art methods, indicating the robustness and effectiveness of our approach.

GECSum: Generative Evaluation-Driven Sequence Level Contrastive Learning for Abstractive Summarization
Jiawen Xie | Shaoting Zhang | Xiaofan Zhang

While dominant in abstractive summarization, transformer-based language models with the standard maximum likelihood estimation (MLE) training remain challenged by two discrepancies: the misalignment between token-level training and sequence-level evaluation, and the divergence between teacher-forcing training manner and auto-regressive generation behavior. Recent studies have shown that sequence-level contrastive learning, which utilizes the quality differences between multiple summaries as prior information, can effectively mitigate these issues. However, because certain evaluation metrics often determine the contrastive signals in existing methods, model performance aligns with the preferences of these metrics and is limited by their evaluation capabilities. Inspired by prior works that treat the evaluation of generated text as a text generation problem, we propose a generative evaluation-driven contrastive learning framework, which leverages the semantic understanding capabilities of the abstractive model itself to evaluate summaries in reference-based settings. In this way, our method establishes a connection between the model’s reference-based evaluation and reference-free generation scenarios, allowing them to share the benefits of model capability enhancements. Extensive experiments on four summarization datasets demonstrate that our method outperforms the previous state-of-the-art regarding comprehensive performance. Various empirical analyses further substantiate the effectiveness of our method.

Gendered Grammar or Ingrained Bias? Exploring Gender Bias in Icelandic Language Models
Steinunn Rut Friðriksdóttir | Hafsteinn Einarsson

Large language models, trained on vast datasets, exhibit increased output quality in proportion to the amount of data that is used to train them. This data-driven learning process has brought forth a pressing issue where these models may not only reflect but also amplify gender bias, racism, religious prejudice, and queerphobia present in their training data that may not always be recent. This study explores gender bias in language models trained on Icelandic, focusing on occupation-related terms. Icelandic is a highly grammatically gendered language that favors the masculine when referring to groups of people with indeterminable genders. Our aim is to explore whether language models merely mirror gender distributions within the corresponding professions or if they exhibit biases tied to their grammatical genders. Results indicate a significant overall predisposition towards the masculine but specific occupation terms consistently lean toward a particular gender, indicating complex interplays of societal and linguistic influences.

Generating Clarification Questions for Disambiguating Contracts
Anmol Singhal | Chirag Jain | Preethu Rose Anish | Arkajyoti Chakraborty | Smita Ghaisas

Enterprises frequently enter into commercial contracts that can serve as vital sources of project-specific requirements. Contractual clauses are obligatory, and the requirements derived from contracts can detail the downstream implementation activities that non-legal stakeholders, including requirement analysts, engineers, and delivery personnel, need to conduct. However, comprehending contracts is cognitively demanding and error-prone for such stakeholders due to the extensive use of Legalese and the inherent complexity of contract language. Furthermore, contracts often contain ambiguously worded clauses to ensure comprehensive coverage. In contrast, non-legal stakeholders require a detailed and unambiguous comprehension of contractual clauses to craft actionable requirements. In this work, we introduce a novel legal NLP task that involves generating clarification questions for contracts. These questions aim to identify contract ambiguities on a document level, thereby assisting non-legal stakeholders in obtaining the necessary details for eliciting requirements. This task is challenged by three core issues: (1) data availability, (2) the length and unstructured nature of contracts, and (3) the complexity of legal text. To address these issues, we propose ConRAP, a retrieval-augmented prompting framework for generating clarification questions to disambiguate contractual text. Experiments conducted on contracts sourced from the publicly available CUAD dataset show that ConRAP with ChatGPT can detect ambiguities with an F2 score of 0.87. 70% of the generated clarification questions are deemed useful by human evaluators.

We investigate the problem of synthesizing relevant visual imagery from generic long-form text, leveraging Large Language Models (LLMs) and Text-to-Image Models (TIMs). Current Text-to-Image models require short prompts that describe the image content and style explicitly. Unlike image prompts, generation of images from general long-form text requires the image synthesis system to derive the visual content and style elements from the text. In this paper, we study zero-shot prompting and supervised fine-tuning approaches that use LLMs and TIMs jointly for synthesizing images. We present an empirical study on generating images for Wikipedia articles covering a broad spectrum of topics and image styles. We compare these systems using a suite of metrics, including a novel metric specifically designed to evaluate the semantic correctness of generated images. Our study offers a preliminary understanding of existing models’ strengths and limitations for the task of image generation from long-form text, sets up an evaluation framework, and establishes baselines for future research.

Generating Hard-Negative Out-of-Scope Data with ChatGPT for Intent Classification
Zhijian Li | Stefan Larson | Kevin Leach

Intent classifiers must be able to distinguish when a user’s utterance does not belong to any supported intent to avoid producing incorrect and unrelated system responses. Although out-of-scope (OOS) detection for intent classifiers has been studied, previous work has not yet studied changes in classifier performance against hard-negative out-of-scope utterances (i.e., inputs that share common features with in-scope data, but are actually out-of-scope). We present an automated technique to generate hard-negative OOS data using ChatGPT. We use our technique to build five new hard-negative OOS datasets, and evaluate each against three benchmark intent classifiers. We show that classifiers struggle to correctly identify hard-negative OOS utterances more than general OOS utterances. Finally, we show that incorporating hard-negative OOS data for training improves model robustness when detecting hard-negative OOS data and general OOS data. Our technique, datasets, and evaluation address an important void in the field, offering a straightforward and inexpensive way to collect hard-negative OOS data and improve intent classifiers’ robustness.

Generating Multiple-choice Questions for Medical Question Answering with Distractors and Cue-masking
Damien Sileo | Kanimozhi Uma | Marie-Francine Moens

Medical multiple-choice question answering (MCQA) is a challenging evaluation for medical natural language processing and a helpful task in itself. Medical questions may describe patient symptoms and ask for the correct diagnosis, which requires domain knowledge and complex reasoning. Standard language modeling pretraining alone is not sufficient to achieve the best results with BERT-base size (Devlin et al., 2019) encoders. Jin et al. (2020) showed that focusing masked language modeling on disease name prediction when using medical encyclopedic paragraphs as input leads to considerable MCQA accuracy improvement. In this work, we show that (1) fine-tuning on generated MCQA dataset outperforms the masked language modeling based objective and (2) correctly masking the cues to the answers is critical for good performance. We release new pretraining datasets and achieve state-of-the-art results on 4 MCQA datasets, notably +5.7% with base-size model on MedQA-USMLE.

Generative Multimodal Entity Linking
Senbao Shi | Zhenran Xu | Baotian Hu | Min Zhang

Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to the referent entities from a knowledge base. Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters, which can be prohibitively costly and difficult to scale in the era of Large Language Models (LLMs). In this work, we propose GEMEL, a Generative Multimodal Entity Linking framework based on LLMs, which directly generates target entity names. We keep the vision and language model frozen and only train a feature mapper to enable cross-modality interactions. To adapt LLMs to the MEL task, we leverage the in-context learning capability of LLMs by retrieving multimodal instances as demonstrations. Extensive experiments show that, with only ∼0.3% of the model parameters fine-tuned, GEMEL achieves state-of-the-art results on two well-established MEL datasets (7.7% accuracy gains on WikiDiverse and 8.8% accuracy gains on WikiMEL). The performance gain stems from mitigating the popularity bias of LLM predictions and disambiguating less common entities effectively. Further analysis verifies the generality and scalability of GEMEL. Our framework is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution for utilizing LLMs in the MEL task. Our code is available at https://github.com/HITsz-TMG/GEMEL.

GENTRAC: A Tool for Tracing Trauma in Genocide and Mass Atrocity Court Transcripts
Miriam Schirmer | Christian Brechenmacher | Endrit Jashari | Juergen Pfeffer

This paper introduces GENTRAC, an open-access web-based tool built to interactively detect and analyze potentially traumatic content in witness statements of genocide and mass atrocity trials. Harnessing recent developments in natural language processing (NLP) to detect trauma, GENTRAC processes and formats court transcripts for NLP analysis through a sophisticated parsing algorithm and detects the likelihood of traumatic content for each speaker segment. The tool visualizes the density of such content throughout a trial day and provides statistics on the overall amount of traumatic content and speaker distribution. Capable of processing transcripts from four prominent international criminal courts, including the International Criminal Court (ICC), GENTRAC’s reach is vast, tailored to handle millions of pages of documents from past and future trials. Detecting potentially re-traumatizing examination methods can enhance the development of trauma-informed legal procedures. GENTRAC also serves as a reliable resource for legal, human rights, and other professionals, aiding their comprehension of mass atrocities’ emotional toll on survivors.

Geographically-Informed Language Identification
Jonathan Dunn | Lane Edwards-Brown

This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic origin of the text in question. Given that many digital corpora can be geo-referenced at the country level, this paper formulates 16 region-specific models, each of which contains the languages expected to appear in countries within that region. These regional models also each include 31 widely-spoken international languages in order to ensure coverage of these linguae francae regardless of location. An upstream evaluation using traditional language identification testing data shows an improvement in f-score ranging from 1.7 points (Southeast Asia) to as much as 10.4 points (North Africa). A downstream evaluation on social media data shows that this improved performance has a significant impact on the language labels which are applied to large real-world corpora. The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.
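The idea of letting geography constrain the label set can be illustrated with a small sketch. Note that the paper builds separate region-specific models; the code below only imitates the effect by masking the scores of a hypothetical general-purpose scorer. The region names, language codes, the three-language stand-in for the 31 international languages, and the `identify` function are all assumptions made for illustration.

```python
# Stand-in for the international languages included in every regional model.
INTERNATIONAL = {"eng", "fra", "spa"}

# Hypothetical regional inventories (tiny, for illustration only).
REGIONAL = {
    "north_africa": {"ara", "kab"},
    "southeast_asia": {"ind", "tha", "vie"},
}


def candidate_languages(region: str) -> set:
    """Languages a regional model would consider: regional plus international."""
    return REGIONAL[region] | INTERNATIONAL


def identify(scores: dict, region: str) -> str:
    """Pick the best-scoring language among those allowed for the region."""
    allowed = candidate_languages(region)
    restricted = {lang: s for lang, s in scores.items() if lang in allowed}
    if not restricted:
        raise ValueError("no candidate language has a score for this region")
    return max(restricted, key=restricted.get)


# Hypothetical scores from some upstream classifier for one short text.
scores = {"kab": 0.35, "tha": 0.40, "eng": 0.25}
label_in_north_africa = identify(scores, "north_africa")
label_in_southeast_asia = identify(scores, "southeast_asia")
```

The same scores yield different labels depending on the region, which is the behaviour the regional models are designed to exploit.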

GerDISDETECT: A German Multilabel Dataset for Disinformation Detection
Mina Schütz | Daniela Pisoiu | Daria Liakhovets | Alexander Schindler | Melanie Siegel

Disinformation has become increasingly relevant in recent years both as a political issue and as an object of research. Datasets for training machine learning models, especially for languages other than English, are scarce and costly to create. Annotated datasets often have only binary or multiclass labels, which provide little information about the grounds and system of such classifications. We propose a novel textual dataset, GerDISDETECT, for German disinformation. To provide comprehensive analytical insights, a fine-grained taxonomy guided annotation scheme is required. The goal of this dataset, instead of providing a direct assessment regarding true or false, is to provide wide-ranging semantic descriptors that allow for complex interpretation as well as inferred decision-making regarding information and trustworthiness of potentially critical articles. This also allows the dataset to be used for other tasks. The dataset was collected in the first three months of 2022 and contains 39 multilabel classes with 5 top-level categories for a total of 1,890 articles: General View (3 labels), Offensive Language (11 labels), Reporting Style (15 labels), Writing Style (6 labels), and Extremism (4 labels). As a baseline, we further pre-trained a multilingual XLM-R model on around 200,000 unlabeled news articles and fine-tuned it for each category.

German Also Hallucinates! Inconsistency Detection in News Summaries with the Absinth Dataset
Laura Mascarell | Ribin Chalumattu | Annette Rios

The advent of Large Language Models (LLMs) has led to remarkable progress on a wide range of natural language processing tasks. Despite the advances, these large-sized models still suffer from hallucinating information in their output, which poses a major issue in automatic text summarization, as we must guarantee that the generated summary is consistent with the content of the source document. Previous research addresses the challenging task of detecting hallucinations in the output (i.e. inconsistency detection) in order to evaluate the faithfulness of the generated summaries. However, these works primarily focus on English and recent multilingual approaches lack German data. This work presents Absinth, a manually annotated dataset for hallucination detection in German news summarization and explores the capabilities of novel open-source LLMs on this task in both fine-tuning and in-context learning settings. We open-source and release the Absinth dataset to foster further research on hallucination detection in German.

German Parliamentary Corpus (GerParCor) Reloaded
Giuseppe Abrami | Mevlüt Bagci | Alexander Mehler

In 2022, GerParCor, the largest German-language corpus of parliamentary protocols from three different centuries, on a national and federal level from the countries of Germany, Austria, Switzerland and Liechtenstein, was collected and published. Through GerParCor, it became possible to provide for the first time various parliamentary protocols which were not available digitally and, moreover, could not be retrieved and processed in a uniform manner. Furthermore, GerParCor was additionally preprocessed using NLP methods and made available in XMI format. In this paper, GerParCor is significantly updated by including all new parliamentary protocols in the corpus, as well as adding and preprocessing further parliamentary protocols previously not covered, so that a period up to 1797 is now covered. Besides the integration of a new, state-of-the-art and appropriate NLP preprocessing for the handling of large text corpora, this update also provides an overview of the further reuse of GerParCor by presenting various provisioning capabilities such as APIs, among others.

German SRL: Corpus Construction and Model Training
Maxim Konca | Andy Luecking | Alexander Mehler

A useful semantic role-annotated resource for training semantic role models for the German language is missing. We point out some problems of previous resources and provide a new one by means of a combined translation and alignment process: the gold standard CoNLL-2012 semantic role annotations are translated into German, and semantic role labels are transferred using alignment models. The resulting dataset is used to train a German semantic role model. With F1-scores around 0.7, the major roles achieve competitive evaluation scores while avoiding limitations of previous approaches. The described procedure can be applied to other languages as well.

GERMS-AT: A Sexism/Misogyny Dataset of Forum Comments from an Austrian Online Newspaper
Brigitte Krenn | Johann Petrak | Marina Kubina | Christian Burger

This paper presents a sexism/misogyny dataset extracted from comments of a large online forum of an Austrian newspaper. The comments are in Austrian German, in some cases interspersed with dialectal or English elements. We describe the data collection, the annotation guidelines and the annotation process resulting in a corpus of approximately 8,000 comments which were annotated with 5 levels of sexism/misogyny, ranging from 0 (not sexist/misogynist) to 4 (highly sexist/misogynist). The professional forum moderators (self-identified females and males) of the online newspaper were involved as experts in the creation of the annotation guidelines and the annotation of the user comments. In addition, we describe first results of training transformer-based classification models for both binarized and original label classification of the corpus.

GIL-GALaD: Gender Inclusive Language - German Auto-Assembled Large Database
Anna-Katharina Dick | Matthias Drews | Valentin Pickard | Victoria Pierz

As the need for gender-inclusive language has become a highly debated topic over the years, gendered biases in speech are unfortunately often picked up and propagated by modern language models trained on large amounts of text. While remedial efforts are underway, grammatically gendered languages such as German pose some unique challenges in generating gender-inclusive language for corrective model training or fine-tuning. We assembled GIL-GALaD, a corpus of German gender-inclusive language from different sources such as social media, news articles, public speeches and academic publications. Our corpus includes the most common types of modifications of generic masculine forms of nouns and spans 30 years (1993-2023), containing over 800,000 instances of gender-inclusive language. Tools for corpus usage and extension are to be included in the release. During corpus assembly, we were also able to gain some insights into which types of gender-inclusive language were used in practice throughout the years and across different domains.

This paper introduces GLAMR, an Abstract Meaning Representation (AMR) interpretation of Generative Lexicon (GL) semantic components. It includes a structured subeventual interpretation of linguistic predicates, and encoding of the opposition structure of property changes of event arguments. Both of these features are recently encoded in VerbNet (VN), and form the scaffolding for the semantic form associated with VN frame files. We develop a new syntax, concepts, and roles for subevent structure based on VN for connecting subevents to atomic predicates. Our proposed extension is compatible with current AMR specification. We also present an approach to automatically augment AMR graphs by inserting subevent structure of the predicates and identifying the subevent arguments from the semantic roles. A pilot annotation of GLAMR graphs of 65 documents (486 sentences), based on procedural texts as a source, is presented as a public dataset. The annotation includes subevents, argument property change, and document-level anaphoric links. Finally, we provide baseline models for converting text to GLAMR and vice versa, along with the application of GLAMR for generating enriched paraphrases with details on subevent transformation and arguments that are not present in the surface form of the texts.

Multi-level implicit discourse relation recognition (MIDRR) is a challenging task to recognize the hierarchical discourse relations between the arguments with the absence of connectives. Recent methods tend to incorporate the static hierarchical structure containing all senses (defined as global hierarchy) into prompt tuning through a path prompt template or hierarchical label refining. However, hierarchical modeling is independent of the verbalizer, resulting in a failure to effectively utilize the output probability distribution information of the verbalizer. Besides, they ignore the utilization of the dynamic hierarchical label sequence for each instance (defined as local hierarchy) in prompt tuning. In this paper, we propose a global and local hierarchical prompt tuning (GLHPT) framework, which utilizes prior knowledge of PLMs while better incorporating hierarchical information from two aspects. We leverage bottom-up propagated probability as the global hierarchy to inject it into a multi-level verbalizer (MLV). Furthermore, we design a local hierarchy-driven contrastive learning (LHCL) to improve the probability distribution of MLV. Finally, our model achieves competitive results on two benchmarks.

GlotScript: A Resource and Tool for Low Resource Writing System Identification
Amir Hossein Kargaran | François Yvon | Hinrich Schütze

We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.
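To make the notion of a script distribution keyed by ISO 15924 codes concrete, here is a rough standard-library sketch. It is not GlotScript-T and does not use its API: it approximates a character's script from the first word of its Unicode character name, and the ten-entry mapping table is a hypothetical, deliberately incomplete stand-in for the full Unicode Script property.

```python
import unicodedata
from collections import Counter

# Hypothetical, tiny mapping from the first word of a Unicode character name
# to an ISO 15924 code. A real tool would rely on the Unicode Script property.
NAME_PREFIX_TO_ISO15924 = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "ARABIC": "Arab",
    "HEBREW": "Hebr",
    "DEVANAGARI": "Deva",
    "CJK": "Hani",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
    "HANGUL": "Hang",
}


def script_distribution(text: str) -> dict:
    """Return the share of alphabetic characters per (approximate) script."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        prefix = unicodedata.name(ch, "").split(" ")[0]
        # "Zzzz" marks anything this sketch's small table does not cover.
        counts[NAME_PREFIX_TO_ISO15924.get(prefix, "Zzzz")] += 1
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


mixed = script_distribution("abc где")
```

A distribution like this is what makes corpus cleaning possible: a line labelled as one language but dominated by an unexpected script can be flagged for removal.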

GMEG-EXP: A Dataset of Human- and LLM-Generated Explanations of Grammatical and Fluency Edits
S. Magalí López Cortez | Mark Josef Norris | Steve Duman

Recent work has explored the ability of large language models (LLMs) to generate explanations of existing labeled data. In this work, we investigate the ability of LLMs to explain revisions in sentences. We introduce a new dataset demonstrating a novel task, which we call explaining text revisions. We collected human- and LLM-generated explanations of grammatical and fluency edits and defined criteria for the human evaluation of the explanations along three dimensions: Coverage, Informativeness, and Correctness. The results of a side-by-side evaluation show an Overall preference for human explanations, but there are many instances in which annotators show no preference. Annotators prefer human-generated explanations for Informativeness and Correctness, but they show no preference for Coverage. We also examined the extent to which the number of revisions in a sentence influences annotators’ Overall preference for the explanations. We found that the preference for human explanations increases as the number of revisions in the sentence increases. Additionally, we show that the Overall preference for human explanations depends on the type of error being explained. We discuss explanation styles based on a qualitative analysis of 300 explanations. We release our dataset and annotation guidelines to encourage future research.

GOLEM: GOld Standard for Learning and Evaluation of Motifs
W. Victor Yarlott | Anurag Acharya | Diego Castro Estrada | Diana Gomez | Mark Finlayson

Motifs are distinctive, recurring, widely used idiom-like words or phrases, often originating from folklore, whose meanings are anchored in a narrative. Motifs have significance as communicative devices because they concisely imply a constellation of culturally relevant information. Their broad usage suggests their cognitive importance as touchstones of cultural knowledge. We present GOLEM, the first dataset annotated for motific information. The dataset comprises 7,955 English articles (2,039,424 words). The corpus identifies 26,078 motif candidates across 34 motif types from three cultural or national groups: Jewish, Irish, and Puerto Rican. Each motif candidate is labeled with the type of usage (Motific, Referential, Eponymic, or Unrelated), resulting in 1,723 actual motific instances. Annotation was performed by individuals identifying as members of each group and achieved a Fleiss’ kappa of >0.55. We demonstrate that classification of candidate type is a challenging task for LLMs using a few-shot approach; recent models such as T5, FLAN-T5, GPT-2, and Llama 2 (7B) achieved a performance of 41% accuracy at best. These data will support development of new models and approaches for detecting (and reasoning about) motific information in text. We release the corpus, the annotation guide, and the code to support other researchers building on this work.
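For readers unfamiliar with the agreement statistic mentioned above, the following sketch computes Fleiss' kappa from a table of per-item category counts using the standard textbook formula. The toy ratings are invented for illustration and have nothing to do with the GOLEM annotations; the four columns merely mirror the four usage labels named in the abstract.

```python
def fleiss_kappa(counts: list) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same rater count.
    Undefined (division by zero) if all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Invented example: 4 items, 3 raters, columns in the order
# Motific, Referential, Eponymic, Unrelated.
toy_ratings = [
    [3, 0, 0, 0],
    [0, 3, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
]
toy_kappa = fleiss_kappa(toy_ratings)
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.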

Good or Bad News? Exploring GPT-4 for Sentiment Analysis for Faroese on a Public News Corpora
Iben Nyholm Debess | Annika Simonsen | Hafsteinn Einarsson

Sentiment analysis in low-resource languages presents unique challenges that Large Language Models may help address. This study explores the efficacy of GPT-4 for sentiment analysis on Faroese news texts, an uncharted task for this language. On the basis of guidelines presented, the sentiment analysis was performed with a multi-class approach at the sentence and document level with 225 sentences analysed in 170 articles. When comparing GPT-4 to human annotators, we observe that GPT-4 performs remarkably well. We explored two prompt configurations and observed a benefit from having clear instructions for the sentiment analysis task, but no benefit from translating the articles to English before the sentiment analysis task. Our results indicate that GPT-4 can be considered as a valuable tool for generating Faroese test data. Furthermore, our investigation reveals the intricacy of news sentiment. This motivates a more nuanced approach going forward, and we suggest a multi-label approach for future research in this domain. We further explored the efficacy of GPT-4 in topic classification on news texts and observed more negative sentiments expressed in international than national news. Overall, this work demonstrates GPT-4’s proficiency on a novel task and its utility for augmenting resources in low-data languages.

Gos 2: A New Reference Corpus of Spoken Slovenian
Darinka Verdonik | Kaja Dobrovoljc | Tomaž Erjavec | Nikola Ljubešić

This paper introduces a new version of the Gos reference corpus of spoken Slovenian, which was recently extended to more than double the original size (300 hours, 2.4 million words) by adding speech recordings and transcriptions from two related initiatives, the Gos VideoLectures corpus of public academic speech, and the Artur speech recognition database. We describe this process by first presenting the criteria guiding the balanced selection of the newly added data and the challenges encountered when merging language resources with divergent designs, followed by the presentation of other major enhancements of the new Gos corpus, such as improvements in lemmatization and morphosyntactic annotation, word-level speech alignment, a new XML schema and the development of a specialized online concordancer.

GPT-3.5 for Grammatical Error Correction
Anisia Katinskaia | Roman Yangarber

This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages in several settings: zero-shot GEC, fine-tuning for GEC, and using GPT-3.5 to re-rank correction hypotheses generated by other GEC models. In the zero-shot setting, we conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods: estimating grammaticality with language models (LMs), the Scribendi test, and comparing the semantic embeddings of sentences. GPT-3.5 has a known tendency to over-correct erroneous sentences and propose alternative corrections. For several languages, such as Czech, German, Russian, Spanish, and Ukrainian, GPT-3.5 substantially alters the source sentences, including their semantics, which presents significant challenges for evaluation with reference-based metrics. For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics. However, human evaluation for both English and Russian reveals that, despite its strong error-detection capabilities, GPT-3.5 struggles with several error types, including punctuation mistakes, tense errors, syntactic dependencies between words, and lexical compatibility at the sentence level.

GPTEval: A Survey on Assessments of ChatGPT and GPT-4
Rui Mao | Guanyi Chen | Xulang Zhang | Frank Guerin | Erik Cambria

The emergence of ChatGPT has generated much speculation in the press about its potential to disrupt social and economic systems. Its astonishing language ability has aroused strong curiosity among scholars about its performance in different domains. There have been many studies evaluating the ability of ChatGPT and GPT-4 in different tasks and disciplines. However, a comprehensive review summarizing the collective assessment findings is lacking. The objective of this survey is to thoroughly analyze prior assessments of ChatGPT and GPT-4, focusing on their language and reasoning abilities, scientific knowledge, and ethical considerations. Furthermore, an examination of the existing evaluation methods is conducted, offering several recommendations for future research.

GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?
Yiping Jin | Leo Wanner | Alexander Shvets

Online hate detection suffers from biases incurred in data sampling, annotation, and model pre-training. Therefore, measuring the averaged performance over all examples in held-out test data is inadequate. Instead, we must identify specific model weaknesses and be informed when a model is more likely to fail. A recent proposal in this direction is HateCheck, a suite for testing fine-grained model functionalities on synthesized data generated using templates of the kind “You are just a [slur] to me.” However, despite enabling more detailed diagnostic insights, the HateCheck test cases are often generic and have simplistic sentence structures that do not match the real-world data. To address this limitation, we propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch by instructing large language models (LLMs). We employ an additional natural language inference (NLI) model to verify the generations. Crowd-sourced annotation demonstrates that the generated test cases are of high quality. Using the new functional tests, we can uncover model weaknesses that would be overlooked using the original HateCheck dataset.

This paper details the process of developing the first native large generative language model for the North Germanic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instruction finetuning, to evaluation, applications, and considerations for release strategies. We discuss pros and cons of developing large language models for smaller languages and in relatively peripheral regions of the globe, and we hope that this paper can serve as a guide and reference for other researchers that undertake the development of large generative models for smaller languages.

Multilingual neural machine translation handles the translation of multiple languages with one unified model. However, this joint-training paradigm incurs the notorious issue of parameter interference, where the model compromises with the language diversity to find a common solution. Recent research has explored avoiding this problem by selecting certain parameters for each language direction from the original model to form language-specific sub-networks. However, determining how many parameters to choose and which parameters to select is still a serious challenge. In this work, we propose an approach called CaPA (Consistency-based Parameter Allocation), which dynamically allocates parameters of appropriate scale to each language direction based on the consistency between the gradient of the individual language and the average gradient. Specifically, CaPA allocates more parameters to languages with higher gradient consistency as these languages tend to have a more positive impact on other languages. Furthermore, considering the varying levels of interference across different parts of the model, we propose an adaptive parameter allocation based on module-level gradient consistency. Experimental results show the correlation between gradient consistency and parameter interference, as well as the effectiveness of our proposed method.
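The allocation idea described above can be sketched in a few lines. The abstract does not specify how gradient consistency is measured or how it is turned into a parameter budget, so this sketch makes two loud assumptions: consistency is taken to be the cosine similarity between a language direction's gradient and the average gradient, and the kept-parameter ratio is a linear rescaling of that similarity into a hypothetical range. The function names, language directions, and toy gradients are illustrative only and are not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gradient_consistency(grads: dict) -> dict:
    """Assumed measure: cosine between each gradient and the average gradient."""
    dim = len(next(iter(grads.values())))
    avg = [sum(g[i] for g in grads.values()) / len(grads) for i in range(dim)]
    return {direction: cosine(g, avg) for direction, g in grads.items()}


def allocate_ratios(consistency: dict, low: float = 0.3, high: float = 0.7) -> dict:
    """Assumed rule: higher consistency maps linearly to a larger kept ratio."""
    lo, hi = min(consistency.values()), max(consistency.values())
    span = hi - lo
    if span == 0:
        return {d: (low + high) / 2 for d in consistency}
    return {d: low + (high - low) * (c - lo) / span
            for d, c in consistency.items()}


# Toy two-dimensional "gradients" for three hypothetical language directions.
toy_grads = {"en-de": [1.0, 0.0], "en-fr": [1.0, 0.2], "en-zh": [0.0, 1.0]}
toy_consistency = gradient_consistency(toy_grads)
toy_ratios = allocate_ratios(toy_consistency)
```

In this toy setup the direction whose gradient points closest to the average receives the largest share, matching the abstract's statement that more parameters go to languages with higher gradient consistency.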

Gramble: A Tabular Programming Language for Collaborative Linguistic Modeling
Patrick Littell | Darlene Stewart | Fineen Davis | Aidan Pine | Roland Kuhn

We introduce Gramble, a domain-specific programming language for linguistic parsing and generation, in the tradition of XFST, TWOLC, and Kleene. Gramble features an intuitive tabular syntax and supports live group programming, allowing community experts to participate more directly in system development without having to be programmers themselves. A cross-platform interpreter is available for Windows, MacOS, and UNIX, supports collaborative programming on the web via Google Sheets, and is released open-source under the MIT license.

Grammatical Error Correction for Code-Switched Sentences by Learners of English
Kelvin Wey Han Chan | Christopher Bryant | Li Nguyen | Andrew Caines | Zheng Yuan

Code-switching (CSW) is a common phenomenon among multilingual speakers where multiple languages are used in a single discourse or utterance. Mixed-language utterances may still contain grammatical errors, yet most existing Grammar Error Correction (GEC) systems have been trained on monolingual data and were not developed with CSW in mind. In this work, we conduct the first exploration into the use of GEC systems on CSW text. Through this exploration, we propose a novel method of generating synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora. We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints, and identify how they affect the performance of GEC systems on CSW text. Our best model achieves an average increase of 1.57 F0.5 across 3 CSW test sets (English-Chinese, English-Korean and English-Japanese) without affecting the model’s performance on a monolingual dataset. We furthermore discovered that models trained on one CSW language generalise relatively well to other typologically similar CSW languages.

Granular Change Accuracy: A More Accurate Performance Metric for Dialogue State Tracking
Taha Aksu | Nancy Chen

Current metrics for evaluating Dialogue State Tracking (DST) systems exhibit three primary limitations. They: i) erroneously presume a uniform distribution of slots throughout the dialog, ii) neglect to assign partial scores for individual turns, iii) frequently overestimate or underestimate performance by repeatedly counting the models’ successful or failed predictions. To address these shortcomings, we introduce a novel metric: Granular Change Accuracy (GCA). GCA focuses on evaluating the predicted changes in dialogue state over the entire dialogue history. Benchmarking reveals that GCA effectively reduces biases arising from distribution uniformity and the positioning of errors across turns, resulting in a more precise evaluation. Notably, we find that these biases are particularly pronounced when evaluating few-shot or zero-shot trained models, becoming even more evident as the model’s error rate increases. Hence, GCA offers significant promise, particularly for assessing models trained with limited resources. Our GCA implementation is a useful addition to the pool of DST metrics.

The era of transfer learning has revolutionized the fields of Computer Vision and Natural Language Processing, bringing powerful pretrained models with exceptional performance across a variety of tasks. Specifically, Natural Language Processing tasks have been dominated by transformer-based language models. In Natural Language Inference and Natural Language Generation tasks, the BERT model and its variants, as well as the GPT model and its successors, demonstrated exemplary performance. However, the majority of these models are pretrained and assessed primarily for the English language or on a multilingual corpus. In this paper, we introduce GreekBART, the first Seq2Seq model based on BART-base architecture and pretrained on a large-scale Greek corpus. We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks. In addition, we examine its performance on two NLG tasks from GreekSUM, a newly introduced summarization dataset for the Greek language. The model, the code, and the new summarization dataset will be publicly available.

GRIT: A Dataset of Group Reference Recognition in Italian
Sergio E. Zanotto | Qi Yu | Miriam Butt | Diego Frassinelli

For the analysis of political discourse a reliable identification of group references, i.e., linguistic components that refer to individuals or groups of people, is useful. However, the task of automatically recognizing group references has not yet gained much attention within NLP. To address this gap, we introduce GRIT (Group Reference for Italian), a large-scale, multi-domain manually annotated dataset for group reference recognition in Italian. GRIT represents a new resource for automatic and generalizable recognition of group references. With this dataset, we aim to establish group reference recognition as a valid classification task, which extends the domain of Named Entity Recognition by expanding its focus to literal and figurative mentions of social groups. We verify the potential of achieving automated group reference recognition for Italian through an experiment employing a fine-tuned BERT model. Our experimental results substantiate the validity of the task, implying a huge potential for applying automated systems to multiple fields of analysis, such as political text or social media analysis.

Much commonsense knowledge in the real world takes the form of procedures, or sequences of steps to achieve particular goals. In recent years, knowledge extraction from procedural documents has attracted considerable attention. However, existing work often focuses on procedural text and ignores a common multimodal scenario in the real world. Images and text can complement each other semantically, alleviating the semantic ambiguity of the text-only modality. Motivated by this, we explore the problem of grounded multimodal procedural entity recognition (GMPER), which aims to detect entities and their corresponding bounding-box groundings in images (i.e., visual entities). A new dataset (Wiki-GMPER) is built, and extensive experiments are conducted to evaluate the effectiveness of our proposed model.

Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language
Alistair Plum | Tharindu Ranasinghe | Christoph Purschke

Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is a growing interest in the community to build datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, in addition to being limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we experiment with multilingual and cross-lingual zero-shot experiments that could benefit many low-resource languages.

Large language models (LLMs) trained on massive corpora demonstrate impressive capabilities in a wide range of tasks. While there are ongoing efforts to adapt these models to languages beyond English, the attention given to their evaluation methodologies remains limited. Current multilingual benchmarks often rely on back translations or re-implementations of English tests, limiting their capacity to capture unique cultural and linguistic nuances. To bridge this gap for the Korean language, we introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth. The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension. Unlike traditional evaluation suites focused on token and sequence classification or mathematical and logical reasoning, the HAE-RAE Bench emphasizes a model’s aptitude for recalling Korean-specific knowledge and cultural contexts. Comparative analysis with prior Korean benchmarks indicates that the HAE-RAE Bench presents a greater challenge to non-Korean models by hindering the transfer of abilities and knowledge learned from English.

Halwasa: Quantify and Analyze Hallucinations in Large Language Models: Arabic as a Case Study
Hamdy Mubarak | Hend Al-Khalifa | Khaloud Suliman Alkhalefah

Large Language Models (LLMs) have shown superb abilities to generate texts that are indistinguishable from human-generated texts in many cases. However, sometimes they generate false, incorrect, or misleading content, which is often described as “hallucinations”. Quantifying and analyzing hallucination in LLMs can increase their reliability and usage. While hallucination is being actively studied for English and other languages, and different benchmarking datasets have been created, this area is not studied at all for Arabic. In our paper, we create the first Arabic dataset that contains 10K sentences generated by LLMs and annotate it for factuality and correctness. We provide a detailed analysis of the dataset to examine factual and linguistic errors. We found that 25% of the generated sentences are factually incorrect. We share the dataset with the research community.

HarmPot: An Annotation Framework for Evaluating Offline Harm Potential of Social Media Text
Ritesh Kumar | Ojaswee Bhalla | Madhu Vanthi | Shehlat Maknoon Wani | Siddharth Singh

In this paper, we discuss the development of an annotation schema to build datasets for evaluating the offline harm potential of social media texts. We define “harm potential” as the potential for an online public post to cause real-world physical harm (i.e., violence). Understanding that real-world violence is often spurred by a web of triggers, often combining several online tactics and pre-existing intersectional fissures in the social milieu, to result in targeted physical violence, we do not focus on any single divisive aspect (i.e., caste, gender, religion, or other identities of the victim and perpetrators) nor do we focus on just hate speech or mis/dis-information. Rather, our understanding of the intersectional causes of such triggers focuses our attempt at measuring the harm potential of online content, irrespective of whether it is hateful or not. In this paper, we discuss the development of a framework/annotation schema that allows annotating the data with different aspects of the text including its socio-political grounding and intent of the speaker (as expressed through mood and modality) that together contribute to it being a trigger for offline harm. We also give a comparative analysis and mapping of our framework with some of the existing frameworks.

Handling graph data is one of the most difficult tasks. Traditional techniques, such as those based on geometry and matrix factorization, rely on assumptions about the data relations that become inadequate when handling large and complex graph data. On the other hand, deep learning approaches demonstrate promising results in handling large graph data, but they often fall short of providing interpretable explanations. To equip the graph processing with both high accuracy and explainability, we introduce a novel approach that harnesses the power of a large language model (LLM), enhanced by an uncertainty-aware module to provide a confidence score on the generated answer. We experiment with our approach on two graph processing tasks: few-shot knowledge graph completion and graph classification. Our results demonstrate that through parameter efficient fine-tuning, the LLM surpasses state-of-the-art algorithms by a substantial margin across ten diverse benchmark datasets. Moreover, to address the challenge of explainability, we propose an uncertainty estimation based on perturbation, along with a calibration scheme to quantify the confidence scores of the generated answers. Our confidence measure achieves an AUC of 0.8 or higher on seven out of the ten datasets in predicting the correctness of the answer generated by LLM.

Recent progress in large language models (LLMs) has enabled the deployment of many generative NLP applications. At the same time, it has also led to a misleading public discourse that “it’s all been solved.” Not surprisingly, this has, in turn, made many NLP researchers – especially those at the beginning of their careers – worry about what NLP research area they should focus on. Has it all been solved, or what remaining questions can we work on regardless of LLMs? To address this question, this paper compiles NLP research directions rich for exploration. We identify fourteen different research areas encompassing 45 research directions that require new research and are not directly solvable by LLMs. While we identify many research areas, many others exist; we do not cover areas currently addressed by LLMs, but where LLMs lag behind in performance or those focused on LLM development. We welcome suggestions for other research directions to include: https://bit.ly/nlp-era-llm.

HealthFC: Verifying Health Claims with Evidence-Based Medical Fact-Checking
Juraj Vladika | Phillip Schneider | Florian Matthes

In the digital age, seeking health advice on the Internet has become a common practice. At the same time, determining the trustworthiness of online medical content is increasingly challenging. Fact-checking has emerged as an approach to assess the veracity of factual claims using evidence from credible knowledge sources. To help advance automated Natural Language Processing (NLP) solutions for this task, in this paper we introduce a novel dataset HealthFC. It consists of 750 health-related claims in German and English, labeled for veracity by medical experts and backed with evidence from systematic reviews and clinical trials. We provide an analysis of the dataset, highlighting its characteristics and challenges. The dataset can be used for NLP tasks related to automated fact-checking, such as evidence retrieval, claim verification, or explanation generation. For testing purposes, we provide baseline systems based on different approaches, examine their performance, and discuss the findings. We show that the dataset is a challenging test bed with a high potential for future use.

Hierarchical Graph Convolutional Network Approach for Detecting Low-Quality Documents
Jaeyoung Lee | Joonwon Jang | Misuk Kim

Consistency within a document is a crucial feature indicative of its quality. Recently, within the vast amount of information produced across various media, there exists a significant number of low-quality documents that either lack internal consistency or contain content utterly unrelated to their headlines. Such low-quality documents induce fatigue in readers and undermine the credibility of the media source that provided them. Consequently, research to automatically detect these low-quality documents based on natural language processing is imperative. In this study, we introduce a hierarchical graph convolutional network (HGCN) that can detect internal inconsistencies within a document and incongruences between the title and body. Moreover, we constructed the Inconsistency Dataset, leveraging published news data and its meta-data, to train our model to detect document inconsistencies. Experimental results demonstrated that the HGCN achieved superior performance with an accuracy of 91.20% on our constructed Inconsistency Dataset, outperforming other comparative models. Additionally, on the publicly available incongruent-related dataset, the proposed methodology demonstrated a performance of 92.00%, validating its general applicability. Finally, an ablation study further confirmed the significant impact of meta-data utilization on performance enhancement. We anticipate that our model can be universally applied to detect and filter low-quality documents in the real world.

We study the problem of Event Causality Identification (ECI) that seeks to predict causal relation between event mentions in the text. In contrast to previous classification-based models, a few recent ECI methods have explored generative models to deliver state-of-the-art performance. However, such generative models cannot handle document-level ECI where long context between event mentions must be encoded to secure correct predictions. In addition, previous generative ECI methods tend to rely on external toolkits or human annotation to obtain necessary training signals. To address these limitations, we propose a novel generative framework that leverages Optimal Transport (OT) to automatically select the most important sentences and words from full documents. Specifically, we introduce hierarchical OT alignments between event pairs and the document to extract pertinent contexts. The selected sentences and words are provided as input and output to a T5 encoder-decoder model which is trained to generate both the causal relation label and salient contexts. This allows richer supervision without external tools. We conduct extensive evaluations on different datasets with multiple languages to demonstrate the benefits and state-of-the-art performance of ECI.

Hierarchical topic modeling, which can mine implicit semantics in the corpus and automatically construct topic hierarchical relationships, has received considerable attention recently. However, the current hierarchical topic models are mainly based on Euclidean space, which cannot well retain the implicit hierarchical semantic information in the corpus, leading to irrational structure of the generated topics. On the other hand, the existing Generative Adversarial Network (GAN) based neural topic models perform satisfactorily, but they remain constrained by pattern collapse due to the discontinuity of latent space. To solve the above problems, with the hypothesis of hyperbolic space, we propose a novel GAN-based hierarchical topic model to mine high-quality topics by introducing contrastive learning to capture information from documents. Furthermore, the distinct tree-like property of hyperbolic space preserves the implicit hierarchical semantics of documents in topic embeddings, which are projected into the hyperbolic space. Finally, we use a multi-head self-attention mechanism to learn implicit hierarchical semantics of topics and mine topic structure information. Experiments on real-world corpora demonstrate the remarkable performance of our model on topic coherence and topic diversity, as well as the rationality of the topic hierarchy.

High-order Joint Constituency and Dependency Parsing
Yanggan Gu | Yang Hou | Zhefeng Wang | Xinyu Duan | Zhenghua Li

This work revisits the topic of jointly parsing constituency and dependency trees, i.e., to produce compatible constituency and dependency trees simultaneously for input sentences, which is attractive considering that the two types of trees are complementary in representing syntax. The original work of Zhou and Zhao (2019) performs joint parsing only at the inference phase. They train two separate parsers under the multi-task learning framework (i.e., one shared encoder and two independent decoders). They design an ad-hoc dynamic programming-based decoding algorithm of O(n⁵) time complexity for finding optimal compatible tree pairs. Compared to their work, we make progress in three aspects: (1) adopting a much more efficient decoding algorithm of O(n⁴) time complexity, (2) exploring joint modeling at the training phase, instead of only at the inference phase, (3) proposing high-order scoring components to promote constituent-dependency interaction. We conduct experiments and analysis on seven languages, covering both rich-resource and low-resource scenarios. Results and analysis show that joint modeling leads to a modest overall performance boost over separate modeling, but substantially improves the complete matching ratio of whole trees, thanks to the explicit modeling of tree compatibility.

High-Order Semantic Alignment for Unsupervised Fine-Grained Image-Text Retrieval
Rui Gao | Miaomiao Cheng | Xu Han | Wei Song

Cross-modal retrieval is an important yet challenging task due to the semantic discrepancy between visual content and language. To measure the correlation between images and text, most existing research mainly focuses on learning global or local correspondence, failing to explore fine-grained local-global alignment. To infer more accurate similarity scores, we introduce a novel High Order Semantic Alignment (HOSA) model that can provide complementary and comprehensive semantic clues. Specifically, to jointly learn global and local alignment and emphasize local-global interaction, we employ tensor-product (t-product) operation to reconstruct one modal’s representation based on another modal’s information in a common semantic space. Such a cross-modal reconstruction strategy would significantly enhance inter-modal correlation learning in a fine-grained manner. Extensive experiments on two benchmark datasets validate that our model significantly outperforms several state-of-the-art baselines, especially in retrieving the most relevant results.

HoLM: Analyzing the Linguistic Unexpectedness in Homeric Poetry
John Pavlopoulos | Ryan Sandell | Maria Konstantinidou | Chiara Bozzone

The authorship of the Homeric poems has been a matter of debate for centuries. Computational approaches such as language modeling exist that can aid experts in making crucial headway. We observe, however, that such work has, thus far, only been carried out at the level of lengthier excerpts, but not individual verses, the level at which most suspected interpolations occur. We address this weakness by presenting a corpus of Homeric verses, each complemented with a score quantifying linguistic unexpectedness based on Perplexity. We assess the nature of these scores by exploring their correlation with named entities, the frequency of character n-grams, and (inverse) word frequency, revealing robust correlations with the latter two. This apparent bias can be partly overcome by simply dividing scores for unexpectedness by the maximum term frequency per verse.
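The scoring idea in this abstract can be illustrated with a small sketch: perplexity is the exponential of the negative mean token log-probability, and the abstract's normalisation divides that score by the maximum term frequency in the verse. The abstract does not say how term frequency is counted, so the sketch assumes corpus-level word counts; the function names and the toy numbers are hypothetical.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity of one verse from its per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def normalised_unexpectedness(token_log_probs, verse_tokens, corpus_counts):
    """Divide perplexity by the highest corpus frequency among the verse's words.

    Assumption: "term frequency" means corpus-level word counts; the paper
    may define it differently.
    """
    max_tf = max(corpus_counts[token] for token in verse_tokens)
    return perplexity(token_log_probs) / max(max_tf, 1)


# Toy data: two tokens, each assigned probability 0.5 by a language model.
toy_counts = Counter({"a": 4, "b": 1})
toy_score = normalised_unexpectedness([math.log(0.5)] * 2, ["a", "b"], toy_counts)
```

In this toy case the perplexity is 2 and the most frequent word occurs four times, so the normalised score is 0.5.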

How Diplomats Dispute: The UN Security Council Conflict Corpus
Karolina Zaczynska | Peter Bourgonje | Manfred Stede

We investigate disputes in the United Nations Security Council (UNSC) by studying the linguistic means of expressing conflicts. As a result, we present the UNSC Conflict Corpus (UNSCon), a collection of 87 UNSC speeches that are annotated for conflicts. We explain and motivate our annotation scheme and report on a series of experiments for automatic conflict classification. Further, we demonstrate the difficulty when dealing with diplomatic language - which is highly complex and often implicit along various dimensions - by providing corpus examples, readability scores, and classification results.

How Do Hyenas Deal with Human Speech? Speech Recognition and Translation with ConfHyena
Marco Gaido | Sara Papi | Matteo Negri | Luisa Bentivogli

The attention mechanism, a cornerstone of state-of-the-art neural models, faces computational hurdles in processing long sequences due to its quadratic complexity. Consequently, research efforts in the last few years focused on finding more efficient alternatives. Among them, Hyena (Poli et al., 2023) stands out for achieving competitive results in both language modeling and image classification, while offering sub-quadratic memory and computational complexity. Building on these promising results, we propose ConfHyena, a Conformer whose encoder self-attentions are replaced with an adaptation of Hyena for speech processing, where the long input sequences cause high computational costs. Through experiments in automatic speech recognition (for English) and translation (from English into 8 target languages), we show that our best ConfHyena model significantly reduces the training time by 27%, at the cost of minimal quality degradation (∼1%), which, in most cases, is not statistically significant.

How Far Is Too Far? Studying the Effects of Domain Discrepancy on Masked Language Models
Subhradeep Kayal | Alexander Rakhlin | Ali Dashti | Serguei Stepaniants

Pre-trained masked language models, such as BERT, perform strongly on a wide variety of NLP tasks and have become ubiquitous in recent years. The typical way to use such models is to fine-tune them on downstream data. In this work, we aim to study how the difference in domains between the pre-trained model and the task affects its final performance. We first devise a simple mechanism to quantify the domain difference (using a cloze task) and use it to partition our dataset. Using these partitions of varying domain discrepancy, we focus on answering key questions around the impact of discrepancy on final performance, robustness to out-of-domain test-time examples and the effect of domain-adaptive pre-training. We base our experiments on a large-scale openly available e-commerce dataset, and our findings suggest that, in spite of pre-training, the performance of BERT degrades on datasets with high domain discrepancy, especially in low-resource cases. This effect is somewhat mitigated by continued pre-training for domain adaptation. Furthermore, the domain gap also makes BERT sensitive to out-of-domain examples during inference, even in high-resource tasks, and it is prudent to use as diverse a dataset as possible during fine-tuning to make it robust to domain shift.

How Gender Interacts with Political Values: A Case Study on Czech BERT Models
Adnan Al Ali | Jindřich Libovický

Neural language models, which reach state-of-the-art results on most natural language processing tasks, are trained on large text corpora that inevitably contain value-burdened content and often capture undesirable biases, which the models reflect. This case study focuses on the political biases of pre-trained encoders in Czech and compares them with a representative value survey. Because Czech is a gendered language, we also measure how the grammatical gender coincides with responses to men and women in the survey. We introduce a novel method for measuring the model’s perceived political values. We find that the models do not assign statement probability following value-driven reasoning, and there is no systematic difference between feminine and masculine sentences. We conclude that BERT-sized models do not manifest systematic alignment with political values and that the biases observed in the models are rather due to superficial imitation of training data patterns than systematic value beliefs encoded in the models.

Out-of-distribution (OOD) detection plays a vital role in enhancing the reliability of machine learning models. As large language models (LLMs) become more prevalent, the applicability of prior research on OOD detection that utilized smaller-scale Transformers such as BERT, RoBERTa, and GPT-2 may be challenged, due to the significant differences in the scale of these models, their pre-training objectives, and the paradigms used for inference. This paper initiates a pioneering empirical investigation into the OOD detection capabilities of LLMs, focusing on the LLaMA series ranging from 7B to 65B in size. We thoroughly evaluate commonly used OOD detectors, examining their performance in both zero-grad and fine-tuning scenarios. Notably, we alter previous discriminative in-distribution fine-tuning into generative fine-tuning, aligning the pre-training objective of LLMs with downstream tasks. Our findings unveil that a simple cosine distance OOD detector demonstrates superior efficacy, outperforming other OOD detectors. We provide an intriguing explanation for this phenomenon by highlighting the isotropic nature of the embedding spaces of LLMs, which distinctly contrasts with the anisotropic property observed in smaller BERT family models. The new insight enhances our understanding of how LLMs detect OOD data, thereby enhancing their adaptability and reliability in dynamic environments. We have released the source code at https://github.com/Awenbocc/LLM-OOD for other researchers to reproduce our results.
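As a rough illustration of what a cosine-distance OOD detector computes, the sketch below scores a query embedding by its cosine distance to the nearest in-distribution embedding. This is a generic formulation, not the authors' released code: the function names, the nearest-neighbour scoring rule, and the two-dimensional toy vectors are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ood_score(query, in_dist_embeddings):
    """Cosine distance to the closest in-distribution embedding.

    A higher score means the query looks less like the in-distribution data.
    Assumption: nearest-neighbour scoring; other variants exist.
    """
    return 1.0 - max(cosine_similarity(query, e) for e in in_dist_embeddings)


# Toy in-distribution embeddings, hypothetical and two-dimensional for readability.
toy_in_dist = [[1.0, 0.0], [0.9, 0.1]]
near_score = ood_score([1.0, 0.05], toy_in_dist)
far_score = ood_score([0.0, 1.0], toy_in_dist)
```

In practice the embeddings would come from a model's hidden states, and a threshold on the score would be chosen on held-out data.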

How Important Is Tokenization in French Medical Masked Language Models?
Yanis Labrak | Adrien Bazoge | Béatrice Daille | Mickael Rouvier | Richard Dufour

Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing language models do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we seek to delve into the complexities of subword tokenization in French biomedical domain across a variety of NLP tasks and pinpoint areas where further enhancements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.

Previous work has showcased the intriguing capability of large language models (LLMs) in retrieving facts and processing context knowledge. However, only limited research exists on the layer-wise capability of LLMs to encode knowledge, which challenges our understanding of their internal mechanisms. In this paper, we make a first attempt to investigate the layer-wise capability of LLMs through probing tasks. We leverage the powerful generative capability of ChatGPT to construct probing datasets, providing diverse and coherent evidence corresponding to various facts. We employ 𝒱-usable information as the validation metric to better reflect the capability in encoding context knowledge across different layers. Our experiments on conflicting and newly acquired knowledge show that LLMs: (1) prefer to encode more context knowledge in the upper layers; (2) primarily encode context knowledge within knowledge-related entity tokens at lower layers while progressively expanding more knowledge within other tokens at upper layers; and (3) gradually forget the earlier context knowledge retained within the intermediate layers when provided with irrelevant evidence. Code is publicly available at https://github.com/Jometeorie/probing_llama.

How Much Do Robots Understand Rudeness? Challenges in Human-Robot Interaction
Michael Andrew Orme | Yanchao Yu | Zhiyuan Tan

This paper concerns the pressing need to understand and manage inappropriate language within the evolving human-robot interaction (HRI) landscape. As intelligent systems and robots transition from controlled laboratory settings to everyday households, the demand for polite and culturally sensitive conversational abilities becomes paramount, especially for younger individuals. This study explores data cleaning methods, focussing on rudeness and contextual similarity, to identify and mitigate inappropriate language in real-time interactions. State-of-the-art natural language models are also evaluated for their proficiency in discerning rudeness. This multifaceted investigation highlights the challenges of handling inappropriate language, including its tendency to hide within idiomatic expressions and its context-dependent nature. This study will further contribute to the future development of AI systems capable of engaging in intelligent conversations and upholding the values of courtesy and respect across diverse cultural and generational boundaries.

How Robust Are the QA Models for Hybrid Scientific Tabular Data? A Study Using Customized Dataset
Akash Ghosh | Venkata Sahith Bathini | Niloy Ganguly | Pawan Goyal | Mayank Singh

Question-answering (QA) on hybrid scientific tabular and textual data deals with scientific information, and relies on complex numerical reasoning. In recent years, while tabular QA has seen rapid progress, understanding their robustness on scientific information is lacking due to absence of any benchmark dataset. To investigate the robustness of the existing state-of-the-art QA models on scientific hybrid tabular data, we propose a new dataset, “SciTabQA”, consisting of 822 question-answer pairs from scientific tables and their descriptions. With the help of this dataset, we assess the state-of-the-art Tabular QA models based on their ability (i) to use heterogeneous information requiring both structured data (table) and unstructured data (text) and (ii) to perform complex scientific reasoning tasks. In essence, we check the capability of the models to interpret scientific tables and text. Our experiments show that “SciTabQA” is an innovative dataset to study question-answering over scientific heterogeneous data. We benchmark three state-of-the-art Tabular QA models, and find that the best F1 score is only 0.462.

How Speculative Can Speculative Decoding Be?
Zhuorui Liu | Chen Zhang | Dawei Song

Large language models (LLMs) have drawn great attention from the field of natural language processing and beyond, due to their impressive capability of autoregressive modeling, yet this brings an obvious problem: greatly increased latency. An emerging idea to alleviate this problem is speculative decoding, which first uses a draft model to draft tokens autoregressively and then makes the target model verify these tokens in parallel. The draft model is typically smaller than the target model, and it essentially trades generation quality for speed. Thereby, speculative decoding can be viewed as a speculative game for the target model in terms of verification failures. That is, the lengthy draft tokens proposed by the small draft models could fail in the verification stage. Naturally, a critical question arises: how speculative can speculative decoding be, or in other words, how small can an adequate draft model be and how large can an appropriate number of draft tokens be? This work aims to investigate these questions and demonstrate how the scale of the draft model and the number of draft tokens impact the overall latency of speculative decoding. We theoretically show that neither of the above two factors can be infinitely speculative. Namely, there is a certain turning point for each of them. We then empirically show that the scale of the draft model could be 10-20× smaller than the target model and that the optimal number of draft tokens should lie between 3 and 5.
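The draft-then-verify loop described in this abstract can be sketched for the greedy case as follows. This is a toy illustration under stated assumptions rather than the paper's implementation: the "models" are hypothetical functions from a token list to the next token, and verification is written as a loop for readability, whereas a real system would score all draft positions in a single parallel pass of the target model.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative-decoding step; returns the tokens to append."""
    # 1) The small draft model proposes k tokens autoregressively.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2) The target model verifies the proposals. Looped here for clarity;
    #    real systems do this in one parallel forward pass.
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3) Every draft token was accepted, so the target adds one more token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Hypothetical toy models over digit tokens: the target counts upwards,
# and the draft agrees with it except right after the token 3.
def toy_target(tokens):
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


new_tokens = speculative_step([1], toy_draft, toy_target, k=4)
```

In this toy run the draft proposes 2, 3, 0, 1; the target accepts 2 and 3, rejects 0 in favour of 4, and the step yields three tokens for one verification round. A larger k only helps while the draft keeps agreeing with the target, which is the trade-off the abstract examines.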

pdf abs
How Susceptible Are LLMs to Logical Fallacies?
Amirreza Payandeh | Dan Pluth | Jordan Hosier | Xuesu Xiao | Vijay K. Gurbani

This paper investigates the rational thinking capability of Large Language Models (LLMs) in multi-round argumentative debates by exploring the impact of fallacious arguments on their logical reasoning performance. More specifically, we present Logic Competence Measurement Benchmark (LOGICOM), a diagnostic benchmark to assess the robustness of LLMs against logical fallacies. LOGICOM involves two agents: a persuader and a debater engaging in a multi-round debate on a controversial topic, where the persuader tries to convince the debater of the correctness of its claim. First, LOGICOM assesses the potential of LLMs to change their opinions through reasoning. Then, it evaluates the debater’s performance in logical reasoning by contrasting the scenario where the persuader employs logical fallacies against one where logical reasoning is used. We use this benchmark to evaluate the performance of GPT-3.5 and GPT-4 using a dataset containing controversial topics, claims, and reasons supporting them. Our findings indicate that both GPT-3.5 and GPT-4 can adjust their opinion through reasoning. However, when presented with logical fallacies, GPT-3.5 and GPT-4 are erroneously convinced 41% and 69% more often, respectively, compared to when logical reasoning is used. Finally, we introduce a new dataset containing over 5k pairs of logical vs. fallacious arguments.

pdf abs
How to Do Politics with Words: Investigating Speech Acts in Parliamentary Debates
Ines Reinig | Ines Rehbein | Simone Paolo Ponzetto

This paper presents a new perspective on framing through the lens of speech acts and investigates how politicians make use of different pragmatic speech act functions in political debates. To that end, we created a new resource of German parliamentary debates, annotated with fine-grained speech act types. Our hierarchical annotation scheme distinguishes between cooperation and conflict communication, further structured into six subtypes, such as informative, declarative or argumentative-critical speech acts, with 14 fine-grained classes at the lowest level. We present classification baselines on our new data and show that the fine-grained classes in our schema can be predicted with an average F1 of around 82.0%. We then use our classifier to analyse the use of speech acts in a large corpus of parliamentary debates over a time span from 2003 to 2023.

Current language models require a lot of training data to obtain high performance. For Relation Classification (RC), many datasets are domain-specific, so combining datasets to obtain better performance is non-trivial. We explore a multi-domain training setup for RC, and attempt to improve performance by encoding domain information. Our proposed models improve > 2 Macro-F1 against the baseline setup, and our analysis reveals that not all the labels benefit the same: The classes which occupy a similar space across domains (i.e., their interpretation is close across them, for example “physical”) benefit the least, while domain-dependent relations (e.g., “part-of”) improve the most when encoding domain information.

pdf abs
How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have
Viktor Hangya | Alexander Fraser

Due to the broad range of social media platforms, the requirements of abusive language detection systems are varied and ever-changing. A large set of annotated corpora with different properties and label sets has already been created, for tasks such as hate or misogyny detection, but the form and targets of abusive speech are constantly evolving. Since the annotation of new corpora is expensive, in this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection. Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain. We propose a two-step approach: first, we train our model in a multitask fashion; we then carry out few-shot adaptation to the target requirements. Our experiments show that, using already existing datasets and only a few shots of the target task, the performance of models improves both monolingually and across languages. Our analysis also shows that our models acquire a general understanding of abusive language, since they improve the prediction of labels which are present only in the target dataset and can benefit from knowledge about labels which are not directly used for the target task.

pdf abs
How to Understand “Support”? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding
Jiamin Luo | Jianing Zhao | Jingjing Wang | Guodong Zhou

Weakly-supervised Phrase Grounding (WPG) is an emerging task of inferring the fine-grained phrase-region matching, while merely leveraging the coarse-grained sentence-image pairs for training. However, existing studies on WPG largely ignore the implicit phrase-region matching relations, which are crucial for evaluating the capability of models in understanding the deep multimodal semantics. To this end, this paper proposes an Implicit-Enhanced Causal Inference (IECI) approach to address the challenges of modeling the implicit relations and highlighting them beyond the explicit. Specifically, this approach leverages both the intervention and counterfactual techniques to tackle the above two challenges respectively. Furthermore, a high-quality implicit-enhanced dataset is annotated to evaluate IECI and detailed evaluations show the great advantages of IECI over the state-of-the-art baselines. Particularly, we observe an interesting finding that IECI outperforms the advanced multimodal LLMs by a large margin on this implicit-enhanced dataset, which may facilitate more research to evaluate the multimodal LLMs in this direction.

pdf abs
How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque.
Gorka Urbizu | Muitze Zulaika | Xabier Saralegi | Ander Corral

This work investigates the acquisition of formal linguistic competence by neural language models, hypothesizing that languages with complex grammar, such as Basque, present substantial challenges during the pre-training phase. Basque is distinguished by its complex morphology and flexible word order, potentially complicating grammar extraction. In our analysis, we evaluated the grammatical knowledge of BERT models trained under various pre-training configurations, considering factors such as corpus size, model size, number of epochs, and the use of lemmatization. To assess this grammatical knowledge, we constructed the BL2MP (Basque L2 student-based Minimal Pairs) test set. This test set consists of minimal pairs, each containing both a grammatically correct and an incorrect sentence, sourced from essays authored by students at different proficiency levels in the Basque language. Additionally, our analysis explores the difficulties in learning various grammatical phenomena, the challenges posed by flexible word order, and the influence of the student’s proficiency level on the difficulty of correcting grammar errors.

pdf abs
HS-GC: Holistic Semantic Embedding and Global Contrast for Effective Text Clustering
Chen Yang | Bin Cao | Jing Fan

In this paper, we introduce Holistic Semantic Embedding and Global Contrast (HS-GC), an end-to-end approach to learn instance- and cluster-level representations. Specifically, for instance-level representation learning, we introduce a new loss function that exploits different layers of semantic information in a deep neural network to provide a more holistic semantic text representation. Contrastive learning is applied to these representations to improve the model’s ability to represent text instances. Additionally, for cluster-level representation learning we propose two strategies that utilize global update to construct cluster centers from a global view. The extensive experimental evaluation on five text datasets shows that our method outperforms the state-of-the-art model. Particularly on the SearchSnippets dataset, our method leads by 4.4% in normalized mutual information against the latest comparison method. On the StackOverflow and TREC datasets, our method improves the clustering accuracy by 5.9% and 3.2%, respectively.

The paper introduces the Hungarian Language Understanding (HuLU) benchmark, a comprehensive assessment framework designed to evaluate the performance of neural language models on Hungarian language tasks. Inspired by the renowned GLUE and SuperGLUE benchmarks, HuLU aims to address the challenges specific to Hungarian language processing. The benchmark consists of various datasets, each representing different linguistic phenomena and task complexities. Moreover, the paper presents a web service developed for HuLU, offering a user-friendly interface for model evaluation. This platform not only ensures consistent assessment but also fosters transparency by maintaining a leaderboard showcasing model performances. Preliminary evaluations of various LLMs on HuLU datasets indicate that while Hungarian models show promise, there’s room for improvement to match the proficiency of English-centric models in their native language.

pdf abs
Human and System Perspectives on the Expression of Irony: An Analysis of Likelihood Labels and Rationales
Aaron Maladry | Alessandra Teresa Cignarella | Els Lefever | Cynthia van Hee | Veronique Hoste

In this paper, we examine the recognition of irony by both humans and automatic systems. We achieve this by enhancing the annotations of an English benchmark data set for irony detection. This enhancement involves a layer of human-annotated irony likelihood using a 7-point Likert scale that combines binary annotation with a confidence measure. Additionally, the annotators indicated the trigger words that led them to perceive the text as ironic, which leveraged necessary theoretical insights into the definition of irony and its various forms. By comparing these trigger word spans across annotators, we determine the extent to which humans agree on the source of irony in a text. Finally, we compare the human-annotated spans with sub-token importance attributions for fine-tuned transformers using Layer Integrated Gradients, a state-of-the-art interpretability metric. Our results indicate that our model achieves better performance on tweets that were annotated with high confidence and high agreement. Although automatic systems can identify trigger words with relative success, they still attribute a significant amount of their importance to the wrong tokens.

pdf abs
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
Qiwei Peng | Yekun Chai | Xuhong Li

Large language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual code or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.

pdf abs
Human in the Loop: How to Effectively Create Coherent Topics by Manually Labeling Only a Few Documents per Class
Anton F. Thielmann | Christoph Weisser | Benjamin Säfken

Few-shot methods for accurate modeling under sparse-label settings have improved significantly. However, the applications of few-shot modeling in natural language processing remain solely in the field of document classification. With recent performance improvements, supervised few-shot methods, combined with a simple topic extraction method, pose a significant challenge to unsupervised topic modeling methods. Our research shows that supervised few-shot learning, combined with a simple topic extraction method, can outperform unsupervised topic modeling techniques in terms of generating coherent topics, even when only a few labeled documents per class are used. The code is available at the following link: https://github.com/AnFreTh/STREAM

pdf abs
Humanistic Buddhism Corpus: A Challenging Domain-Specific Dataset of English Translations for Classical and Modern Chinese
Youheng W. Wong | Natalie Parde | Erdem Koyuncu

We introduce the Humanistic Buddhism Corpus (HBC), a dataset containing over 80,000 Chinese-English parallel phrases extracted and translated from publications in the domain of Buddhism. HBC is one of the largest free domain-specific datasets that is publicly available for research, containing text from both classical and modern Chinese. Moreover, since HBC originates from religious texts, many phrases in the dataset contain metaphors and symbolism, and are subject to multiple interpretations. Compared to existing machine translation datasets, HBC presents uniquely difficult challenges. In this paper, we describe HBC in detail. We evaluate HBC within a machine translation setting, validating its use by establishing performance benchmarks using a Transformer model with different transfer learning setups.

pdf abs
Humanitarian Corpora for English, French and Spanish
Loryn Isaacs | Santiago Chambó | Pilar León-Araúz

This paper presents three corpora of English, French and Spanish humanitarian documents compiled with reports obtained from ReliefWeb through its API. ReliefWeb is a leading database of humanitarian documents operated by the UN Office for the Coordination of Humanitarian Affairs (OCHA). To compile these corpora, documents were selected with language identification and noise reduction techniques. They were subsequently tokenized, lemmatized, tagged by part of speech, and enriched with metadata for use by linguists in corpus query software. These corpora were compiled to satisfy the research needs of the Humanitarian Encyclopedia, a project with a focus on conceptual variation. However, they can also be useful for other humanitarian endeavors, whether they are research- or practitioner-oriented; the source code for generating the corpora is available on GitHub. To compare materials, an exploratory analysis of definitional and generic-specific information was conducted for the concept of ARMED ACTOR with lexical data extracted from an English legacy corpus (where the concept is underrepresented) as well as on the new English and Spanish corpora. Lexical data were compared among corpora and presented by means of online data visualization to illustrate its potential to inform conceptual modelling.

pdf abs
Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack
Ying Zhou | Ben He | Le Sun

With the development of large language models (LLMs), detecting whether text is generated by a machine becomes increasingly challenging in the face of malicious use cases like the spread of false information, protection of intellectual property, and prevention of academic plagiarism. While well-trained text detectors have demonstrated promising performance on unseen test data, recent research suggests that these detectors have vulnerabilities when dealing with adversarial attacks, such as paraphrasing. In this paper, we propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection. We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model’s robustness against such attacks. The empirical results reveal that the current detection model can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content. Furthermore, we explore the prospect of improving the model’s robustness over iterative adversarial learning. Although some improvements in model robustness are observed, practical applications still face significant challenges. These findings shed light on the future development of AI-text detectors, emphasizing the need for more accurate and robust detection methods.

A crucial aspect in abusive language on social media platforms (toxicity, hate speech, harmful stereotypes, etc.) is its inherent contextual nature. In this paper, we focus on the role of conversational context in abusive language detection, one of the most “direct” forms of context in this domain, as given by the conversation threads (e.g., directly preceding message, original post). The incorporation of surrounding messages has proven vital for the accurate human annotation of harmful content. However, many prior works have either ignored this aspect, collecting and processing messages in isolation, or have obtained inconsistent results when attempting to embed such contextual information into traditional classification methods. The reasons behind these findings have not yet been properly addressed. To this end, we propose an analysis of the impact of conversational context in abusive language detection, through: (1) an analysis of prior works and the limitations of the most common concatenation-based approach, which we attempt to address with two alternative architectures; (2) an evaluation of these methods on existing datasets in English, and a new dataset of French tweets annotated for hate speech and stereotypes; and (3) a qualitative analysis showcasing the necessity for context-awareness in ALD, but also its difficulties.

pdf abs
Human vs. Machine Perceptions on Immigration Stereotypes
Wolfgang S. Schmeisser-Nieto | Pol Pastells | Simona Frenda | Mariona Taule

The increasing popularity of natural language processing has led to a race to improve machine learning models that often leaves aside the core study object, the language itself. In this study, we present classification models designed to detect stereotypes related to immigrants, along with both quantitative and qualitative analyses, shedding light on linguistic distinctions in how humans and various models perceive stereotypes. Given the subjective nature of this task, one of the models incorporates the judgments of all annotators by utilizing soft labels. Through a comparative analysis of BERT-based models using both hard and soft labels, along with predictions from GPT-4, we gain a clearer understanding of the linguistic challenges posed by texts containing stereotypes. Our dataset comprises Spanish Twitter posts collected as responses to immigrant-related hoaxes, annotated with binary values indicating the presence of stereotypes, implicitness, and the requirement for conversational context to understand the stereotype. Our findings suggest that both model prediction confidence and inter-annotator agreement are higher for explicit stereotypes, while stereotypes conveyed through irony and other figures of speech prove more challenging to detect than other implicit stereotypes.
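The difference between hard and soft labels mentioned above can be made concrete with a short sketch. It assumes binary annotations per annotator and a model that outputs a single probability; the function names are hypothetical, and the abstract does not state which loss the authors use.

```python
import math


def soft_label(votes: list[int]) -> float:
    """Fraction of annotators who marked the text as containing a stereotype."""
    return sum(votes) / len(votes)


def hard_label(votes: list[int]) -> int:
    """Majority vote; ties are resolved towards the positive class (an assumption)."""
    return 1 if soft_label(votes) >= 0.5 else 0


def binary_cross_entropy(target: float, predicted: float, eps: float = 1e-12) -> float:
    """Cross-entropy between a target in [0, 1] and a predicted probability.

    With a hard target this is the usual binary loss; with a soft target
    the model is rewarded for reproducing the annotators' disagreement.
    """
    p = min(max(predicted, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


if __name__ == "__main__":
    votes = [1, 1, 0]  # three annotators, illustrative values only
    print(hard_label(votes), soft_label(votes))
    print(binary_cross_entropy(soft_label(votes), 0.7))
```

With a soft target, a confident prediction on a text the annotators disagreed about is penalized more than a prediction that mirrors their split.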

pdf abs
Hybrid of Spans and Table-Filling for Aspect-Level Sentiment Triplet Extraction
Minghua Nuo | Chaofan Guo

Aspect Sentiment Triplet Extraction (ASTE) has become an emerging task in sentiment analysis research. Recently, researchers have proposed different tagging schemes, including tagging of words, tagging of word pairs, and tagging of spans. However, the first two of these methods are often insufficient for the identification of multi-word terms, while span tagging can label the entire phrase span but lacks the interactive information between words. In this paper, we propose the Span in Table (S&T) model, which combines spans with table-filling. Specifically, the S&T model achieves full fusion of syntactic and contextual features through cross-attention and generates the structure of the word-pair table through Biaffine. Then, our model converts it to a span table by computing semantic distance based on the syntactic dependency tree, which can enrich each unit of the span table with semantic and interactive information. Meanwhile, the initial sentence features are constructed as simple phrase tables to enhance the textual information of the phrase itself. In decoding, we define 8 types of labels for identifying three dimensions: aspect, opinion, and sentiment. Finally, extensive experiments on the D2 dataset show that the S&T model achieves competitive results on the ASTE task, certifying the effectiveness and robustness of our S&T model.

pdf abs
Hyperbolic Graph Neural Network for Temporal Knowledge Graph Completion
Yancong Li | Xiaoming Zhang | Ying Cui | Shuai Ma

Temporal Knowledge Graphs (TKGs) represent a crucial source of structured temporal information and exhibit significant utility in various real-world applications. However, TKGs are susceptible to incompleteness, necessitating Temporal Knowledge Graph Completion (TKGC) to predict missing facts. Existing models have encountered limitations in effectively capturing the intricate temporal dynamics and hierarchical relations within TKGs. To address these challenges, HyGNet is proposed, leveraging hyperbolic geometry to effectively model temporal knowledge graphs. The model comprises two components: the Hyperbolic Gated Graph Neural Network (HGGNN) and the Hyperbolic Convolutional Neural Network (HCNN). HGGNN aggregates neighborhood information in hyperbolic space, effectively capturing the contextual information and dependencies between entities. HCNN interacts with embeddings in hyperbolic space, effectively modeling the complex interactions between entities, relations, and timestamps. Additionally, a consistency loss is introduced to ensure smooth transitions in temporal embeddings. The extensive experimental results conducted on four benchmark datasets for TKGC highlight the effectiveness of HyGNet. It achieves state-of-the-art performance in comparison to previous models, showcasing its potential for real-world applications that involve temporal reasoning and knowledge prediction.

pdf abs
Hyperbolic Representations for Prompt Learning
Nan Chen | Xiangdong Su | Feilong Bao

Continuous prompt tuning has gained significant attention for its ability to train only continuous prompts while freezing the language model. This approach greatly reduces the training time and storage for downstream tasks. In this work, we delve into the hierarchical relationship between the prompts and downstream text inputs. In prompt learning, the prefix prompt acts as a module to guide the downstream language model, establishing a hierarchical relationship between the prefix prompt and subsequent inputs. Furthermore, we explore the benefits of leveraging hyperbolic space for modeling hierarchical structures. We project representations of pre-trained models from Euclidean space into hyperbolic space using the Poincaré disk which effectively captures the hierarchical relationship between the prompt and input text. The experiments on natural language understanding (NLU) tasks illustrate that hyperbolic space can model the hierarchical relationship between prompt and text input. We release our code at https://github.com/myaxxxxx/Hyperbolic-Prompt-Learning.
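For readers unfamiliar with the projection step, the following is a minimal sketch of mapping a Euclidean vector into the Poincaré ball and measuring hyperbolic distance there. It uses the standard exponential map at the origin and assumes curvature -1 for the distance; the abstract does not specify the exact projection the authors use, so this should be read as an illustration rather than their implementation. Function names are hypothetical.

```python
import math


def expmap0(v: list[float], c: float = 1.0, eps: float = 1e-9) -> list[float]:
    """Exponential map at the origin of the Poincare ball with curvature -c.

    Maps a Euclidean (tangent) vector to a point strictly inside the ball
    of radius 1 / sqrt(c), up to floating-point saturation for huge norms.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x: list[float], y: list[float]) -> float:
    """Geodesic distance in the unit Poincare ball (curvature -1)."""
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))


if __name__ == "__main__":
    prompt_vec = expmap0([0.1, 0.0])  # near the origin: "higher" in a hierarchy
    input_vec = expmap0([1.5, 0.5])   # near the boundary: "lower" in a hierarchy
    print(poincare_distance(prompt_vec, input_vec))
```

Distances grow rapidly towards the boundary of the ball, which is the property that makes hyperbolic space attractive for tree-like, hierarchical structures.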

pdf abs
Hypergraph-Based Session Modeling: A Multi-Collaborative Self-Supervised Approach for Enhanced Recommender Systems
Xiangping Zheng | Bo Wu | Alex X. Zhang | Wei Li

Session-based recommendation (SBR) is a challenging task that involves predicting a user’s next item click based on their recent session history. Presently, many state-of-the-art methodologies employ graph neural networks to model item transitions. Notwithstanding their impressive performance, graph-based models encounter significant challenges when confronted with intricate session dependencies and data sparsity in real-world scenarios, ultimately constraining their capacity to enhance recommendation accuracy. In recognition of these challenges, we introduce an innovative methodology known as ‘Mssen,’ which stands for Multi-collaborative self-supervised learning in hypergraph neural networks. Mssen is meticulously crafted to adeptly discern user intent. Our approach initiates by representing session-based data as a hypergraph, adeptly capturing intricate, high-order relationships. Subsequently, we employ self-supervised learning on item-session hypergraphs to mitigate the challenges of data sparsity, all without necessitating manual fine-tuning, extensive search, or domain-specific expertise in augmentation selection. Comprehensive experimental analyses conducted across multiple datasets consistently underscore the superior performance of our approach when compared to existing methodologies.

pdf abs
HyperMR: Hyperbolic Hypergraph Multi-hop Reasoning for Knowledge-based Visual Question Answering
Bin Wang | Fuyong Xu | Peiyu Liu | Zhenfang Zhu

Knowledge-based Visual Question Answering (KBVQA) is a challenging task, which aims to answer an image-related question based on external knowledge. Most existing works describe the semantic distance using the actual Euclidean distance between two nodes, which leads to distortion in modeling knowledge graphs with hierarchical and scale-free structure in KBVQA, and limits the multi-hop reasoning capability of the model. In contrast, the hyperbolic space shows exciting prospects for low-distortion embedding of graphs with hierarchical and scale-free structure. In addition, we map the different stages of reasoning into multiple adjustable hyperbolic spaces, achieving low-distortion, fine-grained reasoning. Extensive experiments on the KVQA, PQ and PQL datasets demonstrate the effectiveness of HyperMR for strong-hierarchy knowledge graphs.

pdf abs
HYPERTTS: Parameter Efficient Adaptation in Text to Speech Using Hypernetworks
Yingting Li | Rishabh Bhardwaj | Ambuj Mehrish | Bo Cheng | Soujanya Poria

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While developing TTS architectures that train and test on the same set of speakers has seen significant improvements, out-of-domain speaker performance still faces enormous limitations. Domain adaptation on a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. This problem can be solved by Adapters, which provide a parameter-efficient alternative for domain adaptation. Although Adapters are popular in NLP, speech synthesis has not seen much improvement from them. In this work, we present HyperTTS, which comprises a small learnable network, “hypernetwork”, that generates parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and making them dynamic. Extensive evaluations of two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS with baselines in different studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.

pdf abs
HYRR: Hybrid Infused Reranking for Passage Retrieval
Jing Lu | Keith Hall | Ji Ma | Jianmo Ni

Existing passage retrieval systems typically adopt a two-stage retrieve-then-rerank pipeline. To obtain an effective reranking model, many prior works have focused on improving the model architectures, such as leveraging powerful pretrained large language models (LLM) and designing better objective functions. However, less attention has been paid to the issue of collecting high-quality training data. In this paper, we propose HYRR, a framework for training robust reranking models. Specifically, we propose a simple but effective approach to select training data using hybrid retrievers. Our experiments show that the rerankers trained with HYRR are robust to different first-stage retrievers. Moreover, evaluations using MS MARCO and BEIR data sets demonstrate our proposed framework effectively generalizes to both supervised and zero-shot retrieval settings.

pdf abs
IAD: In-Context Learning Ability Decoupler of Large Language Models in Meta-Training
Yuhan Liu | Xiuying Chen | Gao Xing | Ji Zhang | Rui Yan

Large Language Models (LLMs) exhibit remarkable In-Context Learning (ICL) ability, where the model learns tasks from prompts consisting of input-output examples. However, the pre-training objectives of LLMs often misalign with ICL objectives. They’re mainly pre-trained with methods like masked language modeling and next-sentence prediction. On the other hand, ICL leverages example pairs to guide the model in generating task-aware responses such as text classification and question-answering tasks. The basic pre-training task-related capabilities can sometimes overshadow or conflict with task-specific subtleties required in ICL. To address this, we propose an In-context learning Ability Decoupler (IAD). The model aims to separate the ICL ability from the general ability of LLMs in the meta-training phase, where the ICL-related parameters are separately tuned to adapt for ICL tasks. Concretely, we first identify the parameters that are suitable for ICL by transference-driven gradient importance. We then propose a new max-margin loss to emphasize the separation of the general and ICL abilities. The loss is defined as the difference between the output of ICL and the original LLM, aiming to prevent the overconfidence of the LLM. By meta-training these ICL-related parameters with max-margin loss, we enable the model to learn and adapt to new tasks with limited data effectively. Experimental results show that IAD’s capability yields state-of-the-art performance on benchmark datasets by utilizing only 30% of the model’s parameters. Ablation study and detailed analysis prove the separation of the two abilities.

pdf abs
IDC: Boost Text-to-image Retrieval via Indirect and Direct Connections
Guowei Ge | Kuangrong Hao | Lingguang Hao

The Dual Encoders (DE) framework maps image and text inputs into a coordinated representation space, and calculates their similarity directly. On the other hand, the Cross Attention (CA) framework performs cross-modality interactions after completing the feature embedding of images and text, and then outputs a similarity score. For scenarios with bulk query requests or large query sets, the latter is more accurate, but the former is faster. Therefore, this work finds a new way to improve the retrieval accuracy of the DE framework by borrowing the advantages of the CA framework. Drawing inspiration from image captioning, we introduce a text decoder in the model training stage to simulate the cross-modal interaction function, like the CA framework. The text decoder is eventually discarded, aligning our model with the DE framework. Finally, to ensure training stability and prevent overfitting, we modify the Self-Distillation from Last Mini-Batch and apply it to the retrieval areas. Extensive experiments conducted on the MSCOCO and Flickr30K datasets validate the effectiveness of our proposed methods. Notably, our model achieves competitive results compared to state-of-the-art approaches on the Flickr30K dataset.

IDEATE: Detecting AI-Generated Text Using Internal and External Factual Structures
Quan Wang | Licheng Zhang | Zikang Guo | Zhendong Mao

The effective detection of AI-generated text is a vital principle to ensure responsible use of large language models (LLMs). Previous studies mainly focused on discovering and utilizing internal evidences contained in the text itself to perform the detection, while ignoring external evidences implicated in an established knowledge graph (KG) which may also be key discriminative factors between AI-generated and human-written text. To address this deficiency, we propose IDEATE, a novel hierarchical graph network that utilizes both internal and external factual structures to detect AI-generated text. IDEATE consists of a mention-level subgraph at the bottom to describe internal factual structures of mentioned entities reflected in the input text, and an entity-level subgraph at the top to describe external factual structures of mentioned entities reflected in an external KG. Hierarchical graph convolution is then applied successively on the two subgraphs, through which the two types of factual structures will be embedded into the output and used for the final detection. Extensive experiments on four benchmarking datasets show that IDEATE consistently outperforms current state-of-the-art methods in detecting text generated by various LLMs, ranging from GPT-2 to the more powerful ChatGPT, verifying the necessity and superiority of introducing external evidences for AI-generated text detection.

Idiomatic expressions are used in everyday language and typically convey affect, i.e., emotion. However, very little work investigating the extent to which automated methods can recognise emotions expressed in idiom-containing text has been undertaken. This can be attributed to the lack of emotion-labelled datasets that support the development and evaluation of such methods. In this paper, we present the IDioms with EMotions (IDEM) dataset consisting of a total of 9685 idiom-containing sentences that were generated and labelled with any one of 36 emotion types, with the help of the GPT-4 generative language model. Human validation by two independent annotators showed that more than 51% of the generated sentences are ideal examples, with the annotators reaching an agreement rate of 62% measured in terms of Cohen’s Kappa coefficient. To establish baseline performance on IDEM, various transformer-based emotion recognition approaches were implemented and evaluated. Results show that a RoBERTa model fine-tuned as a sequence classifier obtains a weighted F1-score of 58.73%, when the sequence provided as input specifies the idiom contained in a given sentence, together with its definition. Since this input configuration is based on the assumption that the idiom contained in the given sentence is already known, we also sought to assess the feasibility of automatically identifying the idioms contained in IDEM sentences. To this end, a hybrid idiom identification approach combining a rule-based method and a deep learning-based model was developed, whose performance on IDEM was determined to be 84.99% in terms of F1-score.
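The agreement figure above is reported as Cohen's Kappa, which corrects raw agreement for the agreement expected by chance. The sketch below shows the standard two-annotator computation; the function name and the toy labels are illustrative assumptions and are not taken from the IDEM dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical labels, not real annotations).
annotator_1 = ["ideal", "ideal", "bad", "bad"]
annotator_2 = ["ideal", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In the toy example the annotators agree on three of four items (0.75) while chance agreement is 0.5, so kappa is 0.5.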

Identifying and Aligning Medical Claims Made on Social Media with Medical Evidence
Anthony James Hughes | Xingyi Song

Evidence-based medicine is the practise of making medical decisions that adhere to the latest, and best known evidence at that time. Currently, the best evidence is often found in the form of documents, such as randomized control trials, meta-analyses and systematic reviews. This research focuses on aligning medical claims made on social media platforms with this medical evidence. By doing so, individuals without medical expertise can more effectively assess the veracity of such medical claims. We study three core tasks: identifying medical claims, extracting medical vocabulary from these claims, and retrieving evidence relevant to those identified medical claims. We propose a novel system that can generate synthetic medical claims to aid each of these core tasks. We additionally introduce a novel dataset produced by our synthetic generator that, when applied to these tasks, demonstrates not only a more flexible and holistic approach, but also an improvement in all comparable metrics. We make our dataset, the Expansive Medical Claim Corpus (EMCC), available at https://zenodo.org/records/8321460.

Identifying Fine-grained Depression Signs in Social Media Posts
Augusto R. Mendes | Helena Caseli

Natural Language Processing has already proven to be an effective tool for helping in the identification of mental health disorders in text. However, most studies limit themselves to a binary classification setup or base their label set on pre-established resources. By doing so, they don’t explicitly model many common ways users can express their depression online, limiting our understanding of what kind of depression signs such models can accurately classify. This study evaluates how machine learning techniques deal with the classification of a fine-grained set of 21 depression signs in social media posts from Brazilian undergraduate students. We found that model performance is not necessarily driven by a depression sign’s frequency in social media posts, since the evaluated machine learning techniques struggle to classify the majority of signs of depression typically present in posts. Thus, model performance seems to be more related to the inherent difficulty of identifying a given sign than to its occurrence frequency.

Identifying Source Language Expressions for Pre-editing in Machine Translation
Norizo Sakaguchi | Yugo Murawaki | Chenhui Chu | Sadao Kurohashi

Machine translation-mediated communication can benefit from pre-editing source language texts to ensure accurate transmission of intended meaning in the target language. The primary challenge lies in identifying source language expressions that pose difficulties in translation. In this paper, we hypothesize that such expressions tend to be distinctive features of texts originally written in the source language (native language) rather than translations generated from the target language into the source language (machine translation). To identify such expressions, we train a neural classifier to distinguish native language from machine translation, and subsequently isolate the expressions that contribute to the model’s prediction of native language. Our manual evaluation revealed that our method successfully identified characteristic expressions of the native language, despite the noise and the inherent nuances of the task. We also present case studies where we edit the identified expressions to improve translation quality.

Ideological Knowledge Representation: Framing Climate Change in EcoLexicon
Arianne Reimerink | Melania Cabezas-García | Pilar León-Araúz | Pamela Faber

Culture is underrepresented in terminological resources and ideology is an especially complicated cultural aspect to convey. This complexity stems from the intertwined relationships among the discourse community of politicians, the media and the general public, as well as their interactions with scientific knowledge. Nevertheless, terminological resources should provide the necessary information to understand the political perspective taken in discourse on scientific issues with a high political profile. As in all specialized domains, environmental concepts and terms are subject to dynamism and variation (León-Araúz, 2017). Cognitive term variants (e.g., climate change, climate crisis) are of particular interest because of their presence in political discourse and their potential to influence climate actions. They can be used to reflect multidimensionality, imprecision or ideological attachment. This paper describes a method based on framing in Communication Studies to extract ideological knowledge from corpora. We used Spanish and English parliamentary debates (ParlaMint 2.1) and annotated the interventions that included a term variant of climate change according to an adapted version of the frames proposed by Bolsen and Shapiro (2018). The results show how climate change discourse changes across the ideological spectrum, and we propose a way to represent that knowledge in an environmental TKB.

ILCiteR: Evidence-grounded Interpretable Local Citation Recommendation
Sayar Ghosh Roy | Jiawei Han

Existing Machine Learning approaches for local citation recommendation directly map or translate a query, which is typically a claim or an entity mention, to citation-worthy research papers. Within such a formulation, it is challenging to pinpoint why one should cite a specific research paper for a particular query, leading to limited recommendation interpretability. To alleviate this, we introduce the evidence-grounded local citation recommendation task, where the target latent space comprises evidence spans for recommending specific papers. Using a distantly-supervised evidence retrieval and multi-step re-ranking framework, our proposed system, ILCiteR, recommends papers to cite for a query grounded on similar evidence spans extracted from the existing research literature. Unlike past formulations that simply output recommendations, ILCiteR retrieves ranked lists of evidence span and recommended paper pairs. Secondly, previously proposed neural models for citation recommendation require expensive training on massive labeled data, ideally after every significant update to the pool of candidate papers. In contrast, ILCiteR relies solely on distant supervision from a dynamic evidence database and pre-trained Transformer-based Language Models without any model training. We contribute a novel dataset for the evidence-grounded local citation recommendation task and demonstrate the efficacy of our proposed conditional neural rank-ensembling approach for re-ranking evidence spans.

ILLUMINER: Instruction-tuned Large Language Models as Few-shot Intent Classifier and Slot Filler
Paramita Mirza | Viju Sudhi | Soumya Ranjan Sahoo | Sinchana Ramakanth Bhat

State-of-the-art intent classification (IC) and slot filling (SF) methods often rely on data-intensive deep learning models, limiting their practicality for industry applications. Large language models on the other hand, particularly instruction-tuned models (Instruct-LLMs), exhibit remarkable zero-shot performance across various natural language tasks. This study evaluates Instruct-LLMs on popular benchmark datasets for IC and SF, emphasizing their capacity to learn from fewer examples. We introduce ILLUMINER, an approach framing IC and SF as language generation tasks for Instruct-LLMs, with a more efficient SF-prompting method compared to prior work. A comprehensive comparison with multiple baselines shows that our approach, using the FLAN-T5 11B model, outperforms the state-of-the-art joint IC+SF method and in-context learning with GPT3.5 (175B), particularly in slot filling by 11.1–32.2 percentage points. Additionally, our in-depth ablation study demonstrates that parameter-efficient fine-tuning requires less than 6% of training data to yield comparable performance with traditional full-weight fine-tuning.

Image Matters: A New Dataset and Empirical Study for Multimodal Hyperbole Detection
Huixuan Zhang | Xiaojun Wan

Hyperbole, or exaggeration, is a common linguistic phenomenon. The detection of hyperbole is an important part of understanding human expression. There have been several studies on hyperbole detection, but most of them focus on the text modality only. However, with the development of social media, people can create hyperbolic expressions with various modalities, including text, images, videos, etc. In this paper, we focus on multimodal hyperbole detection. We create a multimodal detection dataset from Weibo (a Chinese social media platform) and carry out some studies on it. We treat the text and image from a Weibo post as two modalities and explore the role of text and image for hyperbole detection. Different pre-trained multimodal encoders are also evaluated on this downstream task to show their performance. Besides, since this dataset is constructed from five different keywords, we also evaluate the cross-domain performance of different models. These studies can serve as a benchmark and point out the direction of further study on multimodal hyperbole detection.

Impact of Task Adapting on Transformer Models for Targeted Sentiment Analysis in Croatian Headlines
Sofia Lee | Jelke Bloem

Transformer models, such as BERT, are often taken off-the-shelf and then fine-tuned on a downstream task. Although this is sufficient for many tasks, low-resource settings require special attention. We demonstrate an approach of performing an extra stage of self-supervised task-adaptive pre-training to a number of Croatian-supporting Transformer models. In particular, we focus on approaches to language, domain, and task adaptation. The task in question is targeted sentiment analysis for Croatian news headlines. We produce new state-of-the-art results (F1 = 0.781), but the highest performing model still struggles with irony and implicature. Overall, we find that task-adaptive pre-training benefits massively multilingual models but not Croatian-dominant models.

Impoverished Language Technology: The Lack of (Social) Class in NLP
Amanda Cercas Curry | Zeerak Talat | Dirk Hovy

Since Labov’s foundational 1964 work on the social stratification of language, linguistics has dedicated concerted efforts towards understanding the relationships between socio-demographic factors and language production and perception. Despite the large body of evidence identifying significant relationships between socio-demographic factors and language production, relatively few of these factors have been investigated in the context of NLP technology. While age and gender are well covered, Labov’s initial target, socio-economic class, is largely absent. We survey the existing Natural Language Processing (NLP) literature and find that only 20 papers even mention socio-economic status. However, the majority of those papers do not engage with class beyond collecting information of annotator-demographics. Given this research lacuna, we provide a definition of class that can be operationalised by NLP researchers, and argue for including socio-economic class in future language technologies.

Improved Neural Protoform Reconstruction via Reflex Prediction
Liang Lu | Jingzhi Wang | David R. Mortensen

Protolanguage reconstruction is central to historical linguistics. The comparative method, one of the most influential theoretical and methodological frameworks in the history of the language sciences, allows linguists to infer protoforms (reconstructed ancestral words) from their reflexes (related modern words) based on the assumption of regular sound change. Not surprisingly, numerous computational linguists have attempted to operationalize comparative reconstruction through various computational models, the most successful of which have been supervised encoder-decoder models, which treat the problem of predicting protoforms given sets of reflexes as a sequence-to-sequence problem. We argue that this framework ignores one of the most important aspects of the comparative method: not only should protoforms be inferable from cognate sets (sets of related reflexes) but the reflexes should also be inferable from the protoforms. Leveraging another line of research—reflex prediction—we propose a system in which candidate protoforms from a reconstruction model are reranked by a reflex prediction model. We show that this more complete implementation of the comparative method allows us to surpass state-of-the-art protoform reconstruction methods on three of four Chinese and Romance datasets.
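The reranking idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: `rerank_protoforms`, the combined score (reconstruction score plus a weighted fraction of correctly predicted reflexes), the toy sound-change rules and all example forms are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def rerank_protoforms(
    candidates: List[Tuple[str, float]],
    reflexes: Dict[str, str],
    predict_reflex: Callable[[str, str], str],
    weight: float = 1.0,
) -> List[Tuple[str, float]]:
    """Rerank (protoform, reconstruction_score) pairs by how well each
    candidate lets a reflex predictor recover the attested reflexes."""
    reranked = []
    for protoform, recon_score in candidates:
        # Count daughter languages whose attested reflex is predicted exactly.
        hits = sum(
            predict_reflex(protoform, language) == form
            for language, form in reflexes.items()
        )
        agreement = hits / len(reflexes) if reflexes else 0.0
        reranked.append((protoform, recon_score + weight * agreement))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


def toy_predict(protoform: str, language: str) -> str:
    """Hypothetical reflex predictor: one regular sound change per language."""
    rules = {"A": ("k", "c"), "B": ("k", "g")}
    old, new = rules[language]
    return protoform.replace(old, new)


# The reconstruction model prefers "pata", but only "kata" explains both reflexes.
candidates = [("pata", 0.6), ("kata", 0.5)]
reflexes = {"A": "cata", "B": "gata"}
best = rerank_protoforms(candidates, reflexes, toy_predict)[0][0]
```

A real system would replace exact-match counting with the reflex prediction model's likelihoods; exact match is used here only to keep the sketch self-contained.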

Improved Out-of-Scope Intent Classification with Dual Encoding and Threshold-based Re-Classification
Hossam Zawbaa | Wael Rashwan | Sourav Dutta | Haytham Assem

Detecting out-of-scope user utterances is essential for task-oriented dialogues and intent classification. Current methodologies face difficulties with the unpredictable distribution of outliers and often rely on assumptions about data distributions. We present the Dual Encoder for Threshold-Based Re-Classification (DETER) to address these challenges. This end-to-end framework efficiently detects out-of-scope intents without requiring assumptions on data distributions or additional post-processing steps. The core of DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and the Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance embeddings, which are classified through a branched neural architecture. Further, DETER generates synthetic outliers using self-supervision and incorporates out-of-scope phrases from open-domain datasets. This approach ensures a comprehensive training set for out-of-scope detection. Additionally, a threshold-based re-classification mechanism refines the model’s initial predictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77 datasets demonstrate DETER’s efficacy. Our model outperforms previous benchmarks, achieving an increase of up to 13% and 5% in F1 score for known and unknown intents on CLINC-150 and Stackoverflow, and 16% for known and 24% for unknown intents on Banking77. The source code has been released at https://github.com/Hossam-Mohammed-tech/Intent_Classification_OOS.
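A threshold-based re-classification step of the kind mentioned above might look like the sketch below. The rule shown (relabel a prediction as out-of-scope when its top confidence falls below a threshold), the threshold value and the label name are assumptions for illustration; the abstract does not specify DETER's exact mechanism.

```python
from typing import Dict

OOS_LABEL = "out_of_scope"  # hypothetical label name


def reclassify(probabilities: Dict[str, float], threshold: float = 0.7) -> str:
    """Keep the top intent only if the classifier is confident enough;
    otherwise fall back to the out-of-scope label."""
    if not probabilities:
        return OOS_LABEL
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    return label if confidence >= threshold else OOS_LABEL


confident = reclassify({"book_flight": 0.91, "check_balance": 0.09})
uncertain = reclassify({"book_flight": 0.48, "check_balance": 0.52})
```

In practice the threshold would be tuned on held-out data rather than fixed in advance.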

Improving Bengali and Hindi Large Language Models
Arif Shahriar | Denilson Barbosa

Despite being widely spoken worldwide, Bengali and Hindi are low-resource languages. The state-of-the-art in modeling such languages uses BERT and the Wordpiece tokenizer. We observed that the Wordpiece tokenizer often breaks words into meaningless tokens, failing to separate roots from affixes. Moreover, Wordpiece does not take into account fine-grained character-level information. We hypothesize that modeling fine-grained character-level information or interactions between roots and affixes helps with modeling highly inflected and morphologically complex languages such as Bengali and Hindi. We used BERT with two different tokenizers - a Unigram tokenizer and a character-level tokenizer and observed better performance. Then, we pretrained four language models accordingly - Bengali Unigram BERT, Hindi Unigram BERT, Bengali Character BERT, and Hindi Character BERT, and evaluated them for masked token detection, both in correct and erroneous settings, across many NLU tasks. We provide experimental evidence that Unigram and character-level tokenizers lead to better pretrained models for Bengali and Hindi, outperforming the previous state-of-the-art and BERT with Wordpiece vocabulary. We conduct the first study investigating the efficacy of different tokenization methods in modeling Bengali and Hindi.

Nowadays, character-based sequence labeling has become the mainstream Chinese named entity recognition (CNER) approach, instead of word-based methods, since the latter degrade performance due to the propagation of word segmentation (WS) errors. To make use of WS information, previous studies usually learn CNER and WS simultaneously with a multi-task learning (MTL) framework, or treat WS information as extra guide features for the CNER model, in which the utilization of WS information is indirect and shallow. In light of the complementary information inside multi-grained words, and the close connection between named entities and part-of-speech (POS) tags, this work proposes a tree parsing approach for jointly modeling CNER, multi-grained word segmentation (MWS) and POS tagging. Specifically, we first propose a unified tree representation for MWS, POS tagging, and CNER. Then, we automatically construct the MWS-POS-NER data based on the unified tree representation for model training. Finally, we present a two-stage joint tree parsing framework. Experimental results on OntoNotes4 and OntoNotes5 show that our proposed approach of jointly modeling CNER with MWS and POS tagging achieves better or comparable performance compared with the latest methods.

Addressing the challenges related to data sparsity, cold-start problems, and diversity in recommendation systems is both crucial and demanding. Many current solutions leverage knowledge graphs to tackle these issues by combining both item-based and user-item collaborative signals. A common trend in these approaches focuses on improving ranking performance at the cost of escalating model complexity, reducing diversity, and complicating the task. It is essential to provide recommendations that are both personalized and diverse, rather than solely relying on achieving high rank-based performance, such as Click-through rate, Recall, etc. In this paper, we propose a hybrid multi-task learning approach, training on user-item and item-item interactions. We apply item-based contrastive learning on descriptive text, sampling positive and negative pairs based on item metadata. Our approach allows the model to better understand the relationships between entities within the knowledge graph by utilizing semantic information from text. It leads to more accurate, relevant, and diverse user recommendations and a benefit that extends even to cold-start users who have few interactions with items. We perform extensive experiments on two widely used datasets to validate the effectiveness of our approach. Our findings demonstrate that jointly training user-item interactions and item-based signals using synopsis text is highly effective. Furthermore, our results provide evidence that item-based contrastive learning enhances the quality of entity embeddings, as indicated by metrics such as uniformity and alignment.

Improving Continual Few-shot Relation Extraction through Relational Knowledge Distillation and Prototype Augmentation
Zhiheng Zhang | Daojian Zeng | Xue Bai

In this paper, we focus on the challenging yet practical problem of Continual Few-shot Relation Extraction (CFRE), which involves extracting relations in the continuous and iterative arrival of new data with only a few labeled examples. The main challenges in CFRE are overfitting due to few-shot learning and catastrophic forgetting caused by continual learning. To address these problems, we propose a novel framework called RK2DA, which seamlessly integrates prototype-based data augmentation and relational knowledge distillation. Specifically, RK2DA generates pseudo data by introducing Gaussian noise to the prototype embeddings and utilizes a novel two-phase multi-teacher relational knowledge distillation method to transfer various knowledge from different embedding spaces. Experimental results on the FewRel and TACRED datasets demonstrate that our method outperforms the state-of-the-art baselines.
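The prototype-based augmentation step described above (adding Gaussian noise to prototype embeddings to create pseudo data) can be sketched with the standard library alone. The function name, the noise scale `sigma` and the example vector are illustrative assumptions, not values from the paper.

```python
import random
from typing import List, Optional


def augment_prototype(
    prototype: List[float],
    num_samples: int,
    sigma: float = 0.1,
    seed: Optional[int] = None,
) -> List[List[float]]:
    """Generate pseudo embeddings by adding independent zero-mean Gaussian
    noise with standard deviation `sigma` to each prototype dimension."""
    rng = random.Random(seed)
    return [
        [value + rng.gauss(0.0, sigma) for value in prototype]
        for _ in range(num_samples)
    ]


# Hypothetical 4-dimensional relation prototype.
prototype = [0.2, -0.5, 1.0, 0.0]
pseudo_embeddings = augment_prototype(prototype, num_samples=3, sigma=0.05, seed=0)
```

Each pseudo embedding stays close to the prototype, so it can stand in for extra examples of a relation when only a few labeled instances are available.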

Improving Copy-oriented Text Generation via EDU Copy Mechanism
Tianxiang Wu | Han Chen | Luozheng Qin | Ziqiang Cao | Chunhui Ai

Many text generation tasks are copy-oriented. For instance, nearly 30% of the content of news summaries is copied. The copy rate is even higher in Grammatical Error Correction (GEC). However, existing generative models generate texts through word-by-word decoding, which may lead to factual inconsistencies and slow inference. Since Elementary Discourse Units (EDUs) are outstanding extraction units, EDU-based extractive methods can alleviate the aforementioned problems. We therefore propose EDUCopy, a framework that integrates the behavior of copying EDUs into generative models. The main idea of EDUCopy is to use special index tags to represent the copied EDUs during generation. Specifically, we extract important EDUs from input sequences, finetune generative models to generate sequences with special index tags, and restore the generated special index tags into corresponding text spans. By doing so, EDUCopy reduces the number of generated tokens significantly. To verify the effectiveness of EDUCopy, we conduct experiments on the news summarization datasets CNNDM and NYT and the GEC datasets FCE and WI-LOCNESS. Our models achieve notable ROUGE and M2 scores, and GPT-4 evaluation validates their strength in terms of factual consistency, fluency, and overall performance. Moreover, compared to baseline models, EDUCopy achieves a significant acceleration of 1.65x.
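The final restoration step described above, where generated index tags are replaced by the EDUs they point to, can be sketched as follows. The tag format `<edu_i>`, the function name and the example sentences are hypothetical; the abstract does not state how EDUCopy actually encodes its index tags.

```python
import re
from typing import List

# Hypothetical tag format: <edu_0>, <edu_1>, ...
TAG_PATTERN = re.compile(r"<edu_(\d+)>")


def restore_edus(generated: str, edus: List[str]) -> str:
    """Replace each index tag in the generated sequence with the EDU
    extracted from the input; tags with unknown indices are left as-is."""

    def substitute(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        return edus[index] if index < len(edus) else match.group(0)

    return TAG_PATTERN.sub(substitute, generated)


edus = ["the council approved the budget", "after a two-hour debate"]
restored = restore_edus("On Monday, <edu_0> <edu_1>.", edus)
```

Because each tag stands for a whole span, the model emits far fewer tokens than the restored text contains, which is the source of the speed-up the abstract reports.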

Recent studies improve the cross-lingual transfer learning by better aligning the internal representations within the multilingual model or exploring the information of the target language using self-training. However, the alignment-based methods exhibit intrinsic limitations such as non-transferable linguistic elements, while most of the self-training based methods ignore the useful information hidden in the low-confidence samples. To address this issue, we propose CoNLST (Contrastive Negative Learning and Self-Training) to leverage the information of low-confidence samples. Specifically, we extend the negative learning to the metric space by selecting negative pairs based on the complementary labels and then employ self-training to iteratively train the model to converge on the obtained clean pseudo-labels. We evaluate our approach on the widely-adopted cross-lingual benchmark XNLI. The experiment results show that our method improves upon the baseline models and can serve as a beneficial complement to the alignment-based methods.

State-of-the-art abstractive summarization models still suffer from the content contradiction between the summaries and the input text, which is referred to as the factual inconsistency problem. Recently, a large number of works have also been proposed to evaluate factual consistency or improve it by post-editing methods. However, these post-editing methods typically focus on replacing suspicious entities, failing to identify and modify incorrect content hidden in sentence structures. In this paper, we first verify that the correctable errors can be enriched by leveraging sentence structure pruning operation, and then we propose a post-editing method based on that. In the correction process, the pruning operation on possible errors is performed on the syntactic dependency tree with the guidance of multiple factual evaluation metrics. Experimenting on the FRANK dataset shows a great improvement in factual consistency compared with strong baselines and, when combined with them, can achieve even better performance. All the codes and data will be released on paper acceptance.

Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency
Taiji Li | Zhi Li | Yin Zhang

Although large language models (LLMs) have demonstrated impressive performance in various tasks, they still suffer from the factual inconsistency problem known as hallucination. For instance, LLMs occasionally generate content that diverges from the source article, and prefer to extract information that appears at the beginning and end of the context, especially in long document summarization. Inspired by these findings, we propose to improve the faithfulness of LLMs in summarization by impelling them to process the entire article more fairly and faithfully. We present a novel summary generation strategy, namely SliSum, which exploits the ideas of sliding windows and self-consistency. Specifically, SliSum divides the source article into overlapping windows, and utilizes an LLM to generate local summaries for the content in the windows. Finally, SliSum aggregates all local summaries using clustering and a majority voting algorithm to produce a more faithful summary of the entire article. Extensive experiments demonstrate that SliSum significantly improves the faithfulness of diverse LLMs including LLaMA-2, Claude-2 and GPT-3.5 in both short and long text summarization, while maintaining their fluency and informativeness and without additional fine-tuning and resources. We further conduct qualitative and quantitative studies to investigate why SliSum works and how its hyperparameters affect performance.
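The first step of the strategy described above, dividing the source article into overlapping windows, can be sketched as below. Using sentences as the unit, and the particular window size and stride, are assumptions for illustration; the local summarization, clustering and majority voting stages are not shown.

```python
from typing import List


def sliding_windows(sentences: List[str], window_size: int, stride: int) -> List[List[str]]:
    """Split a list of sentences into windows of `window_size` sentences,
    advancing by `stride`; windows overlap whenever stride < window_size."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not sentences:
        return []
    windows = []
    start = 0
    while True:
        windows.append(sentences[start:start + window_size])
        if start + window_size >= len(sentences):
            break  # the last window already reaches the end of the article
        start += stride
    return windows


article = ["s1", "s2", "s3", "s4", "s5"]
windows = sliding_windows(article, window_size=3, stride=2)
```

With these settings the sentence `s3` falls into both windows, so it is summarized twice; that repeated coverage is what a later self-consistency vote over local summaries can exploit.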

Improving Grammatical Error Correction by Correction Acceptability Discrimination
Bin Cao | Kai Jiang | Fayu Pan | Chenlei Bao | Jing Fan

Existing Grammatical Error Correction (GEC) methods often overlook the assessment of sentence-level syntax and semantics in the corrected sentence. This oversight results in final corrections that may not be acceptable in the context of the original sentence. In this paper, to improve the performance of Grammatical Error Correction methods, we propose the post-processing task of Correction Acceptability Discrimination (CAD) which aims to remove invalid corrections by comparing the source sentence and its corrected version from the perspective of “sentence-level correctness”. To solve the CAD task, we propose a pipeline method where the acceptability of each possible correction combination based on the predicted corrections for a source sentence will be judged by a discriminator. Within the discriminator, we design a symmetrical comparison operator to overcome the conflicting results that might be caused by the sentence concatenation order. Experiments show that our method improves the F_0.5 score by 1.01% on average over 13 GEC systems on the BEA-2019 test set.

Improving Implicit Discourse Relation Recognition with Semantics Confrontation
Mingyang Cai | Zhen Yang | Ping Jian

Implicit Discourse Relation Recognition (IDRR), which infers discourse logical relations without explicit connectives, is one of the most challenging tasks in natural language processing (NLP). Recently, pre-trained language models (PLMs) have yielded impressive results across numerous NLP tasks, but their performance still remains unsatisfactory in IDRR. We argue that prior studies have not fully harnessed the potential of PLMs, thereby resulting in a mixture of logical semantics, which determine the logical relations between discourse arguments, and general semantics, which encapsulate the non-logical contextual aspects (detailed in Sec. 1). Such a mixture would inevitably compromise the logical reasoning ability of PLMs. Therefore, we propose a novel method that trains the PLMs through two semantics enhancers to implicitly differentiate logical and general semantics, ultimately achieving logical semantics enhancement. Due to the characteristics of PLMs in word representation learning, these two semantics enhancers will inherently confront each other, facilitating an augmentation of logical semantics by disentangling them from general semantics. The experimental results on the PDTB 2.0 dataset show that the confrontation approach exceeds our baseline by 3.81% F1 score, and the effectiveness of the semantics confrontation method is validated by comprehensive ablation experiments.

Improving Language Model Reasoning with Self-motivated Learning
Yunlong Feng | Yang Xu | Libo Qin | Yasheng Wang | Wanxiang Che

Large-scale high-quality training data is important for improving the performance of models. After being trained on data that has rationales (reasoning steps), models gain reasoning capability. However, datasets with high-quality rationales are relatively scarce due to the high annotation cost. To address this issue, we propose the Self-motivated Learning framework. The framework motivates the model itself to automatically generate rationales on existing datasets. Based on the inherent rank from correctness across multiple rationales, the model learns to generate better rationales, leading to higher reasoning capability. Specifically, we train a reward model with the rank to evaluate the quality of rationales, and improve the performance of reasoning through reinforcement learning. Experimental results of Llama2 7B on multiple reasoning datasets show that our method significantly improves the reasoning ability of models, even outperforming InstructGPT on some datasets.

pdf abs
Improving Low-Resource Keyphrase Generation through Unsupervised Title Phrase Generation
Byungha Kang | Youhyun Shin

This paper introduces a novel approach called title phrase generation (TPG) for unsupervised keyphrase generation (UKG), leveraging a pseudo label generated from a document title. The previous UKG method extracts all phrases from a corpus to build a phrase bank, then draws candidate absent keyphrases related to a document from the phrase bank to generate a pseudo label. However, we observed that when separating the document title from the document body, a significant number of phrases absent from the document body are included in the title. Based on this observation, we propose an effective method for generating pseudo labels using phrases mined from the document title. We initially train BART using these pseudo labels (TPG) and then perform supervised fine-tuning on a small amount of human-annotated data, which we term low-resource fine-tuning (LRFT). Experimental results on five benchmark datasets demonstrate that our method outperforms existing low-resource keyphrase generation approaches even with less labeled data, showing strength in generating absent keyphrases. Moreover, our model trained solely with TPG, without any labeled data, surpasses the previous UKG method, highlighting the effectiveness of utilizing titles over a phrase bank. The code is available at https://github.com/kangnlp/low-resource-kpgen-through-TPG.

pdf abs
Improving Multi-view Document Clustering: Leveraging Multi-structure Processor and Hybrid Ensemble Clustering Module
Ruina Bai | Qi Bai

We introduce a multi-view document clustering model called DMsECN (Deep Multi-structure Ensemble Clustering Network), comprising a multi-structure processor and a hybrid ensemble clustering module. Unlike existing models, DMsECN distinguishes itself by creating a consensus structure from multiple clustering structures. The multi-structure processor comprises two stages, each contributing to the extraction of clustering structures that preserve both consistency and complementarity across multiple views. Representation learning extracts both view and view-fused representations from multi-views through the use of contrastive learning. Subsequently, multi-structure learning employs distinct view clustering guidance to generate the corresponding clustering structures. The hybrid ensemble clustering module merges two ensemble methods to amalgamate multiple structures, producing a consensus structure that guarantees both the separability and compactness of clusters within the clustering results. The attention-based ensemble primarily concentrates on learning the contribution weights of diverse clustering structures, while the similarity-based ensemble employs cluster assignment similarity and cluster classification dissimilarity to guide the refinement of the consensus structure. Experimental results demonstrate that DMsECN outperforms other models, achieving new state-of-the-art results on four multi-view document clustering datasets.

pdf abs
Improving Personalized Sentiment Representation with Knowledge-enhanced and Parameter-efficient Layer Normalization
You Zhang | Jin Wang | Liang-Chih Yu | Dan Xu | Xuejie Zhang

Existing studies on personalized sentiment classification consider a document review as an overall text unit and incorporate backgrounds (i.e., user and product information) to learn sentiment representation. However, these methods are difficult to combine with current pretrained language models (PLMs), owing to quadratic costs that increase with text length and to the heterogeneous mix of randomly initialized background information and textual information initialized from well-pretrained checkpoints during information incorporation. To address these problems, we propose a knowledge-enhanced and parameter-efficient layer normalization (E2LN) for efficient and effective review modeling by leveraging LN in transformer structures. Initially, a knowledge base is introduced that stores well-pretrained checkpoints, structured text information, and background information. Based on such a knowledge base, the ability of LN, a crucial component of the transformer structure, can be magnified to improve the performance of PLMs in downstream tasks. Moreover, the proposed E2LN can make PLMs capable of modeling long document reviews and incorporating background information with parameter-efficient fine-tuning and knowledge injection. Extensive experimental results were obtained for three document-level sentiment classification benchmark datasets. By comparing the results, the effectiveness and efficiency of the proposed model were demonstrated. Code and data are released at https://github.com/yoyo-yun/E2LN.

pdf abs
Improving Recall of Large Language Models: A Model Collaboration Approach for Relational Triple Extraction
Zepeng Ding | Wenhao Huang | Jiaqing Liang | Yanghua Xiao | Deqing Yang

Relational triple extraction, which outputs a set of triples from long sentences, plays a vital role in knowledge acquisition. Large language models can accurately extract triples from simple sentences through few-shot learning or fine-tuning when given appropriate instructions. However, they often miss triples when extracting from complex sentences. In this paper, we design an evaluation-filtering framework that integrates large language models with small models for relational triple extraction tasks. The framework includes an evaluation model that can extract related entity pairs with high precision. We propose a simple labeling principle and a deep neural network to build the model, embedding the outputs as prompts into the extraction process of the large model. We conduct extensive experiments to demonstrate that the proposed method can assist large language models in obtaining more accurate extraction results, especially from complex sentences containing multiple relational triples. Our evaluation model can also be embedded into traditional extraction models to enhance their extraction precision from complex sentences.

pdf abs
Improving Robustness of GNN-based Anomaly Detection by Graph Adversarial Training
Xiangping Zheng | Bo Wu | Alex X. Zhang | Wei Li

Graph neural networks (GNNs) play a fundamental role in anomaly detection, excelling at the identification of node anomalies by aggregating information from neighboring nodes. Nonetheless, they exhibit vulnerability to attacks, with even minor alterations in the graph structure or node attributes resulting in substantial performance degradation. To address this critical challenge, we introduce an innovative mechanism for graph adversarial training, meticulously designed to bolster GNN-based anomaly detection systems against potential poisoning attacks. This novel approach follows a two-step framework. (1) In the initial phase, we employ a Multiple-Objective Generative Adversarial Attack (MO-GAA), which focuses on generating feature modifications and inducing structural disruptions within the graph. Its primary objective is to mimic the adversarial behavior of potential attackers on the anomaly detection graph, with the explicit intention of confounding the anomaly detector. (2) In the subsequent stage, we introduce Purification-Based Adversarial Attack Defense (PB-AAD), a method specifically designed to rectify any contamination and restore the integrity of the graph. The central aim of PB-AAD is to counteract the destructive actions carried out by potential attackers. Our empirical findings, derived from extensive experiments conducted on four real-world anomaly detection datasets, serve to demonstrate how MO-GAA systematically disrupts the graph, compromising the effectiveness of GNN-based detectors, while PB-AAD effectively mitigates these adversarial actions, thereby enhancing the overall robustness of GNN-based anomaly detectors.

Role-oriented dialogue summarization aims at generating summaries for different roles in dialogue, e.g., user and agent. Interaction between different roles is vital for the task. Existing methods cannot fully capture interaction patterns between roles when encoding dialogue, and are thus prone to ignoring interaction-related key information. In this paper, we propose a contrastive learning based interaction-aware model for role-oriented dialogue summarization, namely CIAM. An interaction-aware contrastive objective is constructed to guide the encoded dialogue representation to learn role-level interaction. The representation is then used by the decoder to generate role-oriented summaries. The contrastive objective is trained jointly with the primary dialogue summarization task. Additionally, we utilize different decoder start tokens to control what kind of summary to generate, and can thus generate different role-oriented summaries with a unified model. Experimental results show that our method achieves new state-of-the-art results on two public datasets. Extensive analyses further demonstrate that our method excels at capturing interaction information between different roles and producing informative summaries.

pdf abs
Improving Text Readability through Segmentation into Rheses
Antoine Jamelot | Solen Quiniou | Sophie Hamon

Enhancing text readability is crucial for readers with challenges like dyslexia. This paper delves into the segmentation of sentences into rheses, i.e., rhythmic and semantic units. Their aim is to clarify sentence structures for improved comprehension, through a harmonious balance between syntactic accuracy, the natural rhythm of reading aloud, and the delineation of meaningful units. This study relates and compares our various attempts to improve a pre-existing rhesis segmentation tool, which is based on the selection of candidate segmentations. We also release TeRheSe (Texts with Rhesis Segmentation), a bilingual dataset, segmented into rheses, comprising 12 books from classic literature in French and English. We evaluated our approaches on this dataset, showing the efficiency of a novel approach based on token classification, reaching an F1-score of 90.0% in English (previously 85.3%) and 91.3% in French (previously 88.0%). We also study the potential of leveraging prosodic elements, though their definitive impact remains inconclusive.

Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses due to minor changes in the verbalized instructions. Recent literature has explored this inconsistency issue, highlighting the importance of continued improvement in the robustness of response generation. However, systematic analysis and solutions are still lacking. In this paper, we quantitatively define the inconsistency problem and propose a two-stage training framework consisting of instruction-augmented supervised fine-tuning and consistency alignment training. The first stage helps a model generalize on following instructions via similar instruction augmentations. In the second stage, we improve the diversity and help the model understand which responses are more aligned with human expectations by differentiating subtle differences in similar responses. The training process is accomplished by self-rewards inferred from the trained model at the first stage without referring to external human preference resources. We conduct extensive experiments on recent publicly available LLMs on instruction-following tasks and demonstrate the effectiveness of our training framework.

pdf abs
Improving Unsupervised Neural Machine Translation via Training Data Self-Correction
Jinliang Lu | Jiajun Zhang

Unsupervised neural machine translation (UNMT) models are trained with pseudo-parallel sentences constructed by on-the-fly back-translation using monolingual corpora. However, the quality of pseudo-parallel sentences cannot be guaranteed, which hinders the final performance of UNMT. This paper demonstrates that although UNMT usually generates mistakes during pseudo-parallel data construction, some of them can be corrected by the token-level translations that exist in the embedding table. Therefore, we propose a self-correction method to automatically improve the quality of pseudo-parallel sentences during training, thereby enhancing translation performance. Specifically, for a pseudo sentence pair, our self-correction method first estimates the alignment relations between tokens by treating and solving it as an optimal transport problem. Then, we measure the translation reliability for each token and detect the mis-translated ones. Finally, the mis-translated tokens are corrected with real-time computed token-by-token translations based on the embedding table, yielding a better training example. Considering that the modified examples are semantically equivalent to the original ones when UNMT converges, we introduce second-phase training to strengthen the output consistency between them, further improving the generalization capability and translation performance. Empirical results on widely used UNMT datasets demonstrate the effectiveness of our method and it significantly outperforms several strong baselines.

pdf abs
Improving Vietnamese-English Medical Machine Translation
Nhu Vo | Dat Quoc Nguyen | Dung D. Le | Massimo Piccardi | Wray Buntine

Machine translation for Vietnamese-English in the medical domain is still an under-explored research area. In this paper, we introduce MedEV—a high-quality Vietnamese-English parallel dataset constructed specifically for the medical domain, comprising approximately 360K sentence pairs. We conduct extensive experiments comparing Google Translate, ChatGPT (gpt-3.5-turbo), state-of-the-art Vietnamese-English neural machine translation models and pre-trained bilingual/multilingual sequence-to-sequence models on our new MedEV dataset. Experimental results show that the best performance is achieved by fine-tuning “vinai-translate” for each translation direction. We publicly release our dataset to promote further research.

pdf abs
InaGVAD : A Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation
David Doukhan | Christine Maertens | William Le Personnic | Ludovic Speroni | Reda Dehak

InaGVAD is an audio corpus collected from 10 French radio and 18 TV channels categorized into 4 groups: generalist radio, music radio, news TV, and generalist TV. It contains 277 1-minute-long annotated recordings aimed at representing the acoustic diversity of French audiovisual programs and was primarily designed to build systems able to monitor men’s and women’s speaking time in media. inaGVAD is provided with Voice Activity Detection (VAD) and Speaker Gender Segmentation (SGS) annotations extended with overlap, speaker traits (gender, age, voice quality), and 10 non-speech event categories. Annotation distributions are detailed for each channel category. This dataset is partitioned into a 1h development and a 3h37 test subset, allowing fair and reproducible system evaluation. A benchmark of 6 freely available VAD systems is presented, showing diverse abilities based on channel and non-speech event categories. Two existing SGS systems are evaluated on the corpus and compared against a baseline X-vector transfer learning strategy, trained on the development subset. Results demonstrate that our proposal, trained on a single (but diverse) hour of data, achieved competitive SGS results. The entire inaGVAD package, including corpus, annotations, evaluation scripts, and baseline training code, is made freely accessible, fostering future advancement in the domain.

pdf abs
In-Context Example Retrieval from Multi-Perspectives for Few-Shot Aspect-Based Sentiment Analysis
Qianlong Wang | Hongling Xu | Keyang Ding | Bin Liang | Ruifeng Xu

In this paper, we focus on few-shot aspect-based sentiment analysis (ABSA) and try to solve it with the in-context learning (ICL) paradigm. However, the effectiveness of ICL is highly affected by the retrieved in-context examples. Previous works generally leverage the semantic similarity between the candidate examples and the test input to retrieve examples. However, they may yield sub-optimal results for this task. This is because considering only the overall semantic perspective may leave unretrievable some useful examples that have syntactic structural relevance to the test input or share identical sentiments and similar aspects with it. To address this shortcoming, we advocate retrieving in-context examples for few-shot ABSA by simultaneously considering three perspectives: overall semantics, syntactic structure relevance, and aspect-sentiment semantics. To achieve this, we construct positive and negative pairs from these three perspectives and train the demonstration retriever using contrastive learning. Experimental results on four ABSA datasets show that our retrieval framework can significantly outperform baselines across the board. Moreover, to understand factors influencing ICL performance on few-shot ABSA, we conduct extensive analysis in various scenarios, which can inspire and advance future research.

pdf abs
Incorporating Lexical and Syntactic Knowledge for Unsupervised Cross-Lingual Transfer
Jianyu Zheng | Fengfei Fan | Jianquan Li

Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have been conducted to improve performance in such tasks by focusing on cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited as they only incorporate syntactic or lexical information. Since each type of information offers unique advantages and no previous attempts have combined both, we attempt to explore the potential of this approach. In this paper, we present a novel framework called “Lexicon-Syntax Enhanced Multilingual BERT” that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities. The code-switching technique is used to implicitly teach the model lexical alignment information, while a syntactic-based graph attention network is designed to help the model encode syntactic structure. To integrate both types of knowledge, we input code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate that this framework can consistently outperform all baselines of zero-shot cross-lingual transfer, with gains of 1.0 to 3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks.

pdf abs
Incorporating Word-level Phonemic Decoding into Readability Assessment
Christine Pinney | Casey Kennington | Maria Soledad Pera | Katherine Landau Wright | Jerry Alan Fails

Current approaches in automatic readability assessment have found success with the use of large language models and transformer architectures. These techniques lead to accuracy improvement, but they do not offer the interpretability that is uniquely required by the audience most often employing readability assessment tools: teachers and educators. Recent work that employs more traditional machine learning methods has highlighted the linguistic importance of considering semantic and syntactic characteristics of text in readability assessment by utilizing handcrafted feature sets. Research in Education suggests that, in addition to semantics and syntax, phonetic and orthographic instruction are necessary for children to progress through the stages of reading and spelling development; children must first learn to decode the letters and symbols on a page to recognize words and phonemes and their connection to speech sounds. Here, we incorporate this word-level phonemic decoding process into readability assessment by crafting a phonetically-based feature set for grade-level classification for English. Our resulting feature set shows comparable performance to much larger, semantically- and syntactically-based feature sets, supporting the linguistic value of orthographic and phonetic considerations in readability assessment.

pdf abs
IndicFinNLP: Financial Natural Language Processing for Indian Languages
Sohom Ghosh | Arnab Maji | Aswartha Narayana | Sudip Kumar Naskar

Applications of Natural Language Processing (NLP) in the finance domain have been very popular of late. For financial NLP (FinNLP), while various datasets exist for widely spoken languages like English and Chinese, datasets are scarce for low-resource languages, particularly for Indian languages. In this paper, we address this challenge by presenting IndicFinNLP, a collection of 9 datasets consisting of three tasks relating to FinNLP for three Indian languages. These tasks are Exaggerated Numeral Detection, Sustainability Classification, and ESG Theme Determination of financial texts in Hindi, Bengali, and Telugu. Moreover, we release the datasets under the CC BY-NC-SA 4.0 license for the benefit of the research community.

pdf abs
Indic-TEDST: Datasets and Baselines for Low-Resource Speech to Text Translation
Nivedita Sethiya | Saanvi Nair | Chandresh Maurya

The speech-to-text (ST) task is the translation of speech in one language to text in a different language. It has use cases in subtitling, dubbing, etc. Traditionally, the ST task has been solved by cascading automatic speech recognition (ASR) and machine translation (MT) models, which leads to error propagation, high latency, and long training time. To minimize such issues, end-to-end models have been proposed recently. However, we find that only a few works have reported results of ST models on a limited number of low-resource languages. To take a step further in this direction, we release datasets and baselines for low-resource ST tasks. Concretely, our dataset has 9 language pairs and benchmarking has been done against SOTA ST models. The low performance of SOTA ST models on Indic-TEDST data indicates the necessity of developing ST models specifically designed for low-resource languages.

pdf abs
IndirectQA: Understanding Indirect Answers to Implicit Polar Questions in French and Spanish
Christin Müller | Barbara Plank

Polar questions are common in dialogue and expect exactly one of two answers (yes/no). It is however not uncommon for speakers to bypass these expected choices and answer, for example, “Islands are generally by the sea” to the question: “An island? By the sea?”. While such answers are natural in spoken dialogues, conversational systems still struggle to interpret them. Seminal work on interpreting indirect answers has been done in recent years, but only for English and with strict question formulations. In this work, we present a new corpus for French and Spanish, IndirectQA, in which we mine subtitle data for indirect answers to study the labeling task with six different labels, while broadening polar questions to also include implicit polar questions (statements that trigger a yes/no answer and are not necessarily formulated as a question). We opted for subtitles since they are a readily available source of conversation in various languages, but they also come with peculiarities and challenges, which we discuss. Overall, we provide the first results on French and Spanish. They show that the task is challenging: the baseline accuracy scores drop from 61.43 on English to 44.06 for French and Spanish.

pdf abs
Inductive Knowledge Graph Completion with GNNs and Rules: An Analysis
Akash Anil | Victor Gutierrez-Basulto | Yazmin Ibanez-Garcia | Steven Schockaert

The task of inductive knowledge graph completion requires models to learn inference patterns from a training graph, which can then be used to make predictions on a disjoint test graph. Rule-based methods seem like a natural fit for this task, but in practice they significantly underperform state-of-the-art methods based on Graph Neural Networks (GNNs), such as NBFNet. We hypothesise that the underperformance of rule-based methods is due to two factors: (i) implausible entities are not ranked at all and (ii) only the most informative path is taken into account when determining the confidence in a given link prediction answer. To analyse the impact of these factors, we study a number of variants of a rule-based approach, which are specifically aimed at addressing the aforementioned issues. We find that the resulting models can achieve a performance which is close to that of NBFNet. Crucially, the considered variants only use a small fraction of the evidence that NBFNet relies on, which means that they largely keep the interpretability advantage of rule-based methods. Moreover, we show that a further variant, which does look at the full KG, consistently outperforms NBFNet.

pdf abs
InferBR: A Natural Language Inference Dataset in Portuguese
Luciana Bencke | Francielle Vasconcellos Pereira | Moniele Kunrath Santos | Viviane Moreira

Natural Language Inference semantic concepts are central to all aspects of natural language meaning. Portuguese has few NLI-annotated datasets created through automatic translation followed by manual checking. The manual creation of NLI datasets is complex and requires considerable effort that is not always available. Thus, investments to produce good-quality synthetic instances that could be used to train machine learning models for NLI are welcome. This work produced InferBR, an NLI dataset for Portuguese. We relied on a semiautomatic process to generate premises and an automatic process to generate hypotheses. The dataset was manually revised, showing that 97.4% of the sentence pairs had good quality, and nearly 100% of the instances had the correct label assigned. The model trained with InferBR is better at recognizing entailment classes in the other Portuguese datasets than the reverse. Because of its diversity and many unique sentences, InferBR can potentially be further augmented. In addition to the dataset, a key contribution is our proposed generation processes for premises and hypotheses that can easily be adapted to other languages and tasks.

pdf abs
InfFeed: Influence Functions as a Feedback to Improve the Performance of Subjective Tasks
Somnath Banerjee | Maulindu Sarkar | Punyajoy Saha | Binny Mathew | Animesh Mukherjee

Influence functions have recently been presented as an apparatus for achieving explainability for deep neural models by quantifying how the perturbation of individual training instances might impact a test prediction. Our objectives in this paper are twofold. First, we incorporate influence functions as feedback into the model to improve its performance. Second, in a dataset extension exercise, we use influence functions to automatically identify data points that have been initially ‘silver’ annotated by some existing method and need to be cross-checked (and corrected) by annotators to improve the model performance. To meet these objectives, we introduce InfFeed, which uses influence functions to compute the influential instances for a target instance. Toward the first objective, we adjust the label of the target instance based on the label(s) of its influencer(s). In doing this, InfFeed outperforms the state-of-the-art baselines (including LLMs) by a maximum macro F1-score margin of almost 4% for hate speech classification, 3.5% for stance classification, 3% for irony detection, and 2% for sarcasm detection. Toward the second objective, we show that manually re-annotating only those silver-annotated data points in the extension set that have a negative influence can immensely improve the model performance, bringing it very close to the scenario where all the data points in the extension set have gold labels. This allows for a huge reduction in the number of data points that need to be manually annotated, since out of the silver-annotated extension dataset, the influence function scheme picks up ~1/1000 points that need manual correction.

pdf abs
InfoEnh: Towards Multimodal Sentiment Analysis via Information Bottleneck Filter and Optimal Transport Alignment
Yifeng Xie | Zhihong Zhu | Xuan Lu | Zhiqi Huang | Haoran Xiong

In recent years, Multimodal Sentiment Analysis (MSA) leveraging deep learning has demonstrated exceptional performance in a wide range of domains. Its success lies in effectively utilizing information from multiple modalities to analyze sentiments. Despite these advancements, MSA is confronted with two significant challenges. Firstly, each modality often has a surplus of unimportant data, which can overshadow the essential information. Secondly, the crucial cues for sentiment analysis may conflict across different modalities, thereby complicating the analysis process. These issues reduce the model’s effectiveness in MSA tasks. To address these challenges, this paper introduces a novel method tailored for MSA, termed InfoEnh. This approach utilizes a masking technique as the bottleneck for information filtering, simultaneously maximizing mutual information to retain crucial data. Furthermore, the method integrates all modalities into a common feature space via domain adaptation, which is enhanced by the application of optimal transport. Extensive experiments conducted on two benchmark MSA datasets demonstrate the effectiveness of our proposed approach. Further analyses indicate significant improvements over the baselines.

pdf abs
Information Extraction with Differentiable Beam Search on Graph RNNs
Niama El Khbir | Nadi Tomeh | Thierry Charnois

Information extraction (IE) from text documents is an important NLP task that includes entity, relation, and event extraction. These tasks are often addressed jointly as a graph generation problem, where entities and event triggers represent nodes and where relations and event arguments represent edges. Most existing systems use local classifiers for nodes and edges, trained using cross-entropy loss, and employ inference strategies such as beam search to approximate the optimal graph structure. These approaches typically suffer from exposure bias due to the discrepancy between training and decoding. In this paper, we tackle this problem by casting graph generation as auto-regressive sequence labeling and making its training aware of the decoding procedure by using a differentiable version of beam search. We evaluate the effectiveness of our approach through extensive experiments conducted on the ACE05 and CoNLL04 datasets across diverse languages. Our experimental findings affirm that our model outperforms its non-decoding-aware version for all datasets employed. Furthermore, we conduct ablation studies that emphasize the effectiveness of aligning training and inference. Additionally, we introduce a novel quantification of exposure bias within this context, providing valuable insights into the functioning of our model.

A steady increase in the performance of Massively Multilingual Models (MMLMs) has contributed to their rapidly increasing use in data collection pipelines. Interactive Neural Machine Translation (INMT) systems are one class of tools that can utilize MMLMs to promote such data collection in several under-resourced languages. However, these tools are often not adapted to the deployment constraints that native language speakers operate in, as bloated, online inference-oriented MMLMs trained for data-rich languages drive them. INMT-Lite addresses these challenges through its support of (1) three different modes of Internet-independent deployment and (2) a suite of four assistive interfaces suitable for (3) data-sparse languages. We perform an extensive user study for INMT-Lite with an under-resourced language community, Gondi, and find that INMT-Lite improves the data generation experience of community members along multiple axes, such as cognitive load, task productivity, and interface interaction time and effort, without compromising on the quality of the generated translations. INMT-Lite’s code is open-sourced to further research in this domain.

pdf abs
Integrating Headedness Information into an Auto-generated Multilingual CCGbank for Improved Semantic Interpretation
Tu-Anh Tran | Yusuke Miyao

Previously, we introduced a method to generate a multilingual Combinatory Categorial Grammar (CCG) treebank by converting from the Universal Dependencies (UD). However, the method only produces bare CCG derivations without any accompanying semantic representations, which makes it difficult to obtain satisfactory analyses for constructions that involve non-local dependencies, such as control/raising or relative clauses, and limits the general applicability of the treebank. In this work, we present an algorithm that adds semantic representations to existing CCG derivations, in the form of predicate-argument structures. Through hand-crafted rules, we enhance each CCG category with headedness information, with which both local and non-local dependencies can be properly projected. This information is extracted from various sources, including UD, Enhanced UD, and proposition banks. Evaluation of our projected dependencies on the English PropBank and the Universal PropBank 2.0 shows that they can capture most of the semantic dependencies in the target corpora. Further error analysis measures the effectiveness of our algorithm for each language tested, and reveals several issues with the previous method and source data.

Multimodal emotion recognition (MER) aims to identify emotions by utilizing affective information from multiple modalities. Due to the inherent disparities among these heterogeneous modalities, there is a large modality gap in their representations, leading to the challenge of fusing multiple modalities for MER. To address this issue, this work proposes a novel attention-based MER framework by integrating representation subspace mapping with unimodal auxiliary loss for enhancing multimodal fusion capabilities. Initially, a representation subspace mapping module is proposed to map each modality into two distinct subspaces. One is modality-public, enabling the acquisition of common representations and reducing the discrepancies across modalities. The other is modality-unique, retaining the unique characteristics of each modality while eliminating redundant inter-modal attributes. Then, cross-modality attention is leveraged to bridge the modality gap in unique representations and facilitate modality adaptation. Additionally, our method designs a unimodal auxiliary loss to remove the noise unrelated to emotion classification, resulting in robust and meaningful representations for MER. Comprehensive experiments are conducted on the IEMOCAP and MSP-Improv datasets, and experiment results show that our method achieves superior performance to state-of-the-art MER methods. Keywords: Multimodal emotion recognition, representation subspace mapping, cross-modality attention, unimodal auxiliary loss, fusion

Counterspeech is an effective way to combat online hate speech. Considering the multifaceted nature of online hate speech, counterspeech with varying intents (e.g., denouncing or empathy) has significant potential to mitigate hate speech effectively. Recently, controlled approaches based on large language models (LLMs) have been explored to generate intent-specific counterspeech. Due to the lack of attention to intent-specific information by LLMs during the decoding process, those methods cater more to the semantic information than to matching the desired intents. Further, there are still limitations in quantitatively evaluating the effectiveness of counterspeech with different intents in mitigating hate speech. In this paper, to address the above issues, we propose DART, an LLM-based DuAl-discRiminaTor guided framework for counterspeech generation. We employ an intent-aware discriminator and a hate-mitigating discriminator to jointly guide the decoding preferences of LLMs, which steers the model towards generating counterspeech catering to specific intent and hate mitigation. We apply a maximum-margin relative objective for training discriminators. This objective leverages the distance between counterspeech aligned with the desired target (such as specific intent or effectiveness in hate mitigation) and undesired counterspeech as an effective learning signal. Extensive experiments show that DART achieves excellent performance in matching the desired intent and mitigating hate.
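The maximum-margin relative objective mentioned in this abstract can be illustrated with a generic hinge-style ranking loss. The sketch below is an assumption about the general family of such objectives, not the DART implementation: the function name `max_margin_loss`, the `margin` value, and the use of plain Python floats as discriminator scores are all hypothetical.

```python
from typing import Sequence


def max_margin_loss(
    desired_scores: Sequence[float],
    undesired_scores: Sequence[float],
    margin: float = 1.0,  # hypothetical default, not taken from the paper
) -> float:
    """Mean hinge loss over (desired, undesired) score pairs.

    Each pair contributes max(0, margin - (desired - undesired)), so the
    loss is zero once the desired sample outscores the undesired one by
    at least `margin`.
    """
    if len(desired_scores) != len(undesired_scores):
        raise ValueError("score lists must have the same length")
    if not desired_scores:
        raise ValueError("at least one score pair is required")
    losses = [
        max(0.0, margin - (d - u))
        for d, u in zip(desired_scores, undesired_scores)
    ]
    return sum(losses) / len(losses)
```

With `desired_scores=[2.0, 0.5]` and `undesired_scores=[0.0, 0.4]`, the first pair already satisfies the margin and contributes nothing, while the second pair is penalised for separating the two samples by only 0.1.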

pdf abs
Intention and Face in Dialog
Adil Soubki | Owen Rambow

The notion of face described by Brown and Levinson (1987) has been studied in great detail, but a critical aspect of the framework, that which focuses on how intentions mediate the planning of turns which impose upon face, has received far less attention. We present an analysis of three computational systems trained for classifying both intention and politeness, focusing on how the former influences the latter. In politeness theory, agents attend to the desire to have their wants appreciated (positive face), and a complementary desire to act unimpeded and maintain freedom (negative face). Similar to speech acts, utterances can perform so-called face acts which can either raise or threaten the positive or negative face of the speaker or hearer. We begin by using an existing corpus to train a model which classifies face acts, achieving a new SoTA in the process. We then observe that every face act has an underlying intention that motivates it and perform additional experiments integrating dialog act annotations to provide these intentions by proxy. Our analysis finds that dialog acts improve performance on face act detection for minority classes and points to a close relationship between aspects of face and intent.

Eye movements during reading offer a window into cognitive processes and language comprehension, but the scarcity of reading data with interruptions – which learners frequently encounter in their everyday learning environments – hampers advances in the development of intelligent learning technologies. We introduce InteRead – a novel 50-participant dataset of gaze data recorded during self-paced reading of real-world text. InteRead further offers fine-grained annotations of interruptions interspersed throughout the text as well as resumption lags incurred by these interruptions. Interruptions were triggered automatically once readers reached predefined target words. We validate our dataset by reporting interdisciplinary analyses on different measures of gaze behavior. In line with prior research, our analyses show that the interruptions as well as word length and word frequency effects significantly impact eye movements during reading. We also explore individual differences within our dataset, shedding light on the potential for tailored educational solutions. InteRead is accessible from our datasets web-page: https://www.ife.uni-stuttgart.de/en/llis/research/datasets/.

This paper sheds light on a relatively unexplored area which is deep learning interpretability for speech disorder assessment and characterization. Building upon a state-of-the-art methodology for the explainability and interpretability of hidden representation inside a deep-learning speech model, we provide a deeper understanding and interpretation of the final intelligibility assessment of patients experiencing speech disorders due to Head and Neck Cancers (HNC). Promising results have been obtained regarding the prediction of speech intelligibility and severity of HNC patients while giving relevant interpretations of the final assessment both at the phonemes and phonetic feature levels. The potential of this approach becomes evident as clinicians can acquire more valuable insights for speech therapy. Indeed, this can help identify the specific linguistic units that affect intelligibility from an acoustic point of view and enable the development of tailored rehabilitation protocols to improve the patient’s ability to communicate effectively, and thus, the patient’s quality of life.

pdf abs
Interpretable Short Video Rumor Detection Based on Modality Tampering
Kaixuan Wu | Yanghao Lin | Donglin Cao | Dazhen Lin

With the rapid development of social media and short video applications in recent years, browsing short videos has become the norm. Due to its large user base and unique appeal, spreading rumors via short videos has become a severe social problem. Many methods simply fuse multimodal features for rumor detection, which lack interpretability. For short video rumors, rumor makers create rumors by modifying and/or splicing different modal information, so we should consider how to detect rumors from the perspective of modality tampering. Inspired by cross-modal contrastive learning, we propose a novel short video rumor detection framework by designing two pretraining tasks: modality tampering detection and inter-modal matching, imbuing the model with the ability to detect modality tampering and employing it for downstream rumor detection tasks. In addition, we design an interpretability mechanism to make the rumor detection results more reasonable by backtracking the model’s decision-making process. The experimental results show that the method on the short video rumor dataset has an improvement of about 4.6%-12% in macro-F1 compared with other models and can explain whether the short video is a rumor or not through the perspective of modality tampering.
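The improvement reported above is stated in macro-F1. As a reminder of what that metric measures, here is a minimal sketch of the standard definition (the unweighted mean of per-class F1 scores). The function name `macro_f1` and the example labels are illustrative assumptions; this is not the authors' evaluation code.

```python
from typing import Hashable, Sequence


def macro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    labels = set(gold) | set(pred)
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero because
        # every label in the union occurs in gold or pred at least once.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)
```

Because every class counts equally, macro-F1 is sensitive to performance on the minority class, which matters when rumor and non-rumor videos are unevenly distributed.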

pdf abs
Interpreting Themes from Educational Stories
Yigeng Zhang | Fabio Gonzalez | Thamar Solorio

Reading comprehension continues to be a crucial research focus in the NLP community. Recent advances in Machine Reading Comprehension (MRC) have mostly centered on literal comprehension, referring to the surface-level understanding of content. In this work, we focus on the next level - interpretive comprehension, with a particular emphasis on inferring the themes of a narrative text. We introduce the first dataset specifically designed for interpretive comprehension of educational narratives, providing corresponding well-edited theme texts. The dataset spans a variety of genres and cultural origins and includes human-annotated theme keywords with varying levels of granularity. We further formulate NLP tasks under different abstractions of interpretive comprehension toward the main idea of a story. After conducting extensive experiments with state-of-the-art methods, we found the task to be both challenging and significant for NLP research. The dataset and source code have been made publicly available to the research community at https://github.com/RiTUAL-UH/EduStory.

pdf abs
Intrinsic Subgraph Generation for Interpretable Graph Based Visual Question Answering
Pascal Tilli | Ngoc Thang Vu

The large success of deep learning based methods in Visual Question Answering (VQA) has concurrently increased the demand for explainable methods. Most methods in Explainable Artificial Intelligence (XAI) focus on generating post-hoc explanations rather than taking an intrinsic approach, the latter characterizing an interpretable model. In this work, we introduce an interpretable approach for graph-based VQA and demonstrate competitive performance on the GQA dataset. This approach bridges the gap between interpretability and performance. Our model is designed to intrinsically produce a subgraph during the question-answering process as its explanation, providing insight into the decision making. To evaluate the quality of these generated subgraphs, we compare them against established post-hoc explainability methods for graph neural networks, and perform a human evaluation. Moreover, we present quantitative metrics that correlate with the evaluations of human assessors, acting as automatic metrics for the generated explanatory subgraphs. Our code will be made publicly available (link removed due to the anonymity period).

pdf abs
Introducing a Parsed Corpus of Historical High German
Christopher D. Sapp | Elliott Evans | Rex Sprouse | Daniel Dakota

We outline the ongoing development of the Indiana Parsed Corpus of (Historical) High German. Once completed, this corpus will fill the gap in Penn-style treebanks for Germanic languages by spanning High German from 1050 to 1950. This paper describes the process of building the corpus: selection of texts, decisions on part-of-speech tags and other labels, the process of annotation, and illustrative annotation issues unique to historical High German. The construction of the corpus has led to a refinement of the Penn labels, tailored to the particulars of this language.

pdf abs
Introducing CQuAE : A New French Contextualised Question-Answering Corpus for the Education Domain
Thomas Gerald | Anne Vilnat | Sofiane Ettayeb | Louis Tamames | Patrick Paroubek

We present a new question answering corpus in French designed for the educational domain. To be useful in such a domain, we have to propose more complex questions and be able to justify the answers on validated material. We analyze some properties of this corpus. The last part of this paper is devoted to presenting the first experiments we have carried out to demonstrate the value of this dataset for learning a Retrieval Augmented Generation framework. Different experiments are proposed, with an automatic evaluation. A human evaluation is proposed to confirm or refute this automatic evaluation.

pdf abs
Investigating the Robustness of Modelling Decisions for Few-Shot Cross-Topic Stance Detection: A Preregistered Study
Myrthe Reuver | Suzan Verberne | Antske Fokkens

For a viewpoint-diverse news recommender, identifying whether two news articles express the same viewpoint is essential. One way to determine “same or different” viewpoint is stance detection. In this paper, we investigate the robustness of operationalization choices for few-shot stance detection, with special attention to modelling stance across different topics. Our experiments test pre-registered hypotheses on stance detection. Specifically, we compare two stance task definitions (Pro/Con versus Same Side Stance), two LLM architectures (bi-encoding versus cross-encoding), and adding Natural Language Inference knowledge, with pre-trained RoBERTa models trained with shots of 100 examples from 7 different stance detection datasets. Some of our hypotheses and claims from earlier work can be confirmed, while others give more inconsistent results. The effect of the Same Side Stance definition on performance differs per dataset and is influenced by other modelling choices. We found no relationship between the number of training topics in the training shots and performance. In general, cross-encoding out-performs bi-encoding, and adding NLI training to our models gives considerable improvement, but these results are not consistent across all datasets. Our results indicate that it is essential to include multiple datasets and systematic modelling experiments when aiming to find robust modelling choices for the concept ‘stance’.

Effective information retrieval (IR) in settings with limited training data, particularly for complex queries, remains a challenging task. This paper introduces IR2, Information Regularization for Information Retrieval, a technique for reducing overfitting during synthetic data generation. This approach, representing a novel application of regularization techniques in synthetic data creation for IR, is tested on three recent IR tasks characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook. Experimental results indicate that our regularization techniques not only outperform previous synthetic query generation methods on the tasks considered but also reduce cost by up to 50%. Furthermore, this paper categorizes and explores three regularization methods at different stages of the query synthesis pipeline—input, prompt, and output—each offering varying degrees of performance improvement compared to models where no regularization is applied. This provides a systematic approach for optimizing synthetic data generation in data-limited, complex-query IR scenarios. All code, prompts and synthetic data are available at https://github.com/Info-Regularization/Information-Regularization.

pdf abs
I Remember You!: SUI Corpus for Remembering and Utilizing Users’ Information in Chat-oriented Dialogue Systems
Yuiko Tsunomori | Ryuichiro Higashinaka

To construct a chat-oriented dialogue system that will be used for a long time by users, it is important to build a good relationship between the user and the system. To achieve a good relationship, several methods for remembering and utilizing information on users (preferences, experiences, jobs, etc.) in system utterances have been investigated. One way to do this is to utilize user information to fill in utterance templates for use in response generation, but the utterances do not always fit the context. Another way is to use neural-based generation, but in current methods, user information can be incorporated only when the current dialogue topic is similar to that of the user information. This paper tackled these problems by constructing a novel corpus to incorporate arbitrary user information into system utterances regardless of the current dialogue topic while retaining appropriateness for the context. We then fine-tuned a model for generating system utterances using the constructed corpus. The result of a subjective evaluation demonstrated the effectiveness of our model. Furthermore, we incorporated our fine-tuned model into a dialogue system and confirmed the effectiveness of the system through interactive dialogues with users.

pdf abs
ÌròyìnSpeech: A Multi-purpose Yorùbá Speech Corpus
Tolulope Ogunremi | Kola Tubosun | Anuoluwapo Aremu | Iroro Orife | David Ifeoluwa Adelani

We introduce the ÌròyìnSpeech corpus, a new dataset influenced by a desire to increase the amount of high quality, freely available, contemporary Yorùbá speech data that can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. We curated about 23,000 text sentences from the news and creative writing domains with an open license, i.e., CC-BY-4.0, and asked multiple speakers to record each sentence. To encourage a more participatory approach to data creation, we provide 5,000 utterances from the curated sentences to the Mozilla Common Voice platform to crowd-source the recording and validation of Yorùbá speech data. In total, we created about 42 hours of speech data recorded by 80 volunteers in-house, and 6 hours of validated recordings on the Mozilla Common Voice platform. Our evaluation on TTS shows that we can create a good quality general domain single-speaker TTS model for Yorùbá with as little as 5 hours of speech by leveraging an end-to-end VITS architecture. Similarly, for ASR, we obtained a WER of 21.5.
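For readers unfamiliar with the ASR metric cited above, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words, and is usually reported as a percentage. The sketch below shows that standard definition. The function name `word_error_rate` and the whitespace tokenisation are simplifying assumptions, and this is not the evaluation script used for the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For the reference `"a b c d"` and the hypothesis `"a x c"`, one substitution and one deletion give two errors over four reference words, i.e. a WER of 0.5 (50%).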

pdf abs
Is Crowdsourcing Breaking Your Bank? Cost-Effective Fine-Tuning of Pre-trained Language Models with Proximal Policy Optimization
Shuo Yang | Gjergji Kasneci

Wide usage of ChatGPT has highlighted the potential of reinforcement learning from human feedback. However, its training pipeline relies on manual ranking, a resource-intensive process. To reduce labor costs, we propose a self-supervised text ranking approach for applying Proximal-Policy-Optimization to fine-tune language models while eliminating the need for human annotators. Our method begins with probabilistic sampling to encourage a language model to generate diverse responses for each input. We then employ TextRank and ISODATA algorithms to rank and cluster these responses based on their semantics. Subsequently, we construct a reward model to learn the rank and optimize our generative policy. Our experimental results, conducted using two language models on three tasks, demonstrate that the models trained by our method considerably outperform baselines regarding BLEU, GLEU, and METEOR scores. Furthermore, our manual evaluation shows that our ranking results exhibit a remarkably high consistency with that of humans. This research significantly reduces training costs of proximal policy-guided models and demonstrates the potential for self-correction of language models.

pdf abs
Is Gender Reference Gender-specific? Studies in a Polar Domain
Manfred Klenner | Dylan Massey

We investigate how gender authorship influences polar, i.e. positive and negative, gender reference. We work with German-language newspaper texts where the full names of the authors are known and their gender can be inferred from the first names. Nouns in the text have gender reference, i.e. they are labeled by a gender classifier as female- or male-denoting nouns. If these nouns carry a polar load, they count towards the gender-specific statistics we are interested in. A polar load is given either via phrase-level sentiment composition, or by a verb-based analysis of the polar role a noun (phrase) plays: is it framed by the verb as a positive or negative actor, or as receiving a positive or negative effect? Also, reported gender-gender relations (in favor, against) might be gender-specific. Statistical hypothesis testing is carried out in order to find out whether significant gender-wise correlations exist. We found that, in fact, gender reference is gender-specific: each gender significantly more often focuses on their own gender than the other one and e.g. positive actorship supremacy is claimed (intra-) gender-wise.

pdf abs
Is It Possible to Modify Text to a Target Readability Level? An Initial Investigation Using Zero-Shot Large Language Models
Asma Farajidizaji | Vatsal Raina | Mark Gales

Text simplification is a common task where the text is adapted to make it easier to understand. Similarly, text elaboration can make a passage more sophisticated, offering a method to control the complexity of reading comprehension tests. However, text simplification and elaboration tasks are limited to altering the readability of texts only in relative terms. It is useful to directly modify the readability of any text to an absolute target readability level to cater to a diverse audience. Ideally, the readability of readability-controlled generated text should be independent of the source text. Therefore, we propose a novel readability-controlled text modification task. The task requires the generation of 8 versions at various target readability levels for each input text. We introduce novel readability-controlled text modification metrics. The baselines for this task use ChatGPT and Llama-2, with an extension approach introducing a two-step process (generating paraphrases by passing through the language model twice). The zero-shot approaches are able to push the readability of the paraphrases in the desired direction but the final readability remains correlated with the original text’s readability. We also find greater drops in semantic and lexical similarity between the source and target texts with greater shifts in the readability.

pdf abs
Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks
Ruiyang Zhou | Lu Chen | Kai Yu

The use of large language models (LLMs), especially ChatGPT, to help with research has come into practice. Researchers use them for timely advice and hope to obtain in-depth feedback. However, can an LLM be a qualified and reliable reviewer? Although there already exist several review-related datasets, few works have carefully and thoroughly inspected a model’s capability as a reviewer, especially the correctness of generated reviews. In this paper, we first evaluate GPT-3.5 and GPT-4 (the current top-performing LLM) on 2 types of tasks under different settings: the score prediction task and the review generation task. In addition, we propose a dataset containing 197 review-revision multiple-choice questions (RR-MCQ) with detailed labels from the review-rebuttal forum in ICLR-2023. By asking questions from technical details to the overall presentation and quality, our RR-MCQ data provides a more complete model ability assessment. The results show that LLM is generally helpful, but great caution is needed as it always makes mistakes. Although it can give passable decisions (> 60% accuracy) on single options, completely correct answers are still rare (about 20%); models are still weak on long paper processing, zero-shot scoring, and giving critical feedback like human reviewers.

pdf abs
Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation
Mateusz Klimaszewski | Piotr Andruszkiewicz | Alexandra Birch

The rise of Modular Deep Learning showcases its potential in various Natural Language Processing applications. Parameter-efficient fine-tuning (PEFT) modularity has been shown to work for various use cases, from domain adaptation to multilingual setups. However, all this work covers the case where the modular components are trained and deployed within one single Pre-trained Language Model (PLM). This model-specific setup is a substantial limitation on the very modularity that modular architectures are trying to achieve. We ask whether current modular approaches are transferable between models and whether we can transfer the modules from more robust and larger PLMs to smaller ones. In this work, we aim to fill this gap via a lens of Knowledge Distillation, commonly used for model compression, and present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs. Moreover, we propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity. The experiments on Named Entity Recognition, Natural Language Inference, and Paraphrase Identification tasks over multiple languages and PEFT methods showcase the initial potential of transferable modularity.

pdf abs
ISO 24617-12: A New Standard for Semantic Annotation
Harry Bunt

This paper presents ISO 24617-12, an annotation scheme for quantification phenomena in natural language, as part of the ISO Semantic Annotation Framework (ISO 24617). This scheme combines ideas from the theory of generalised quantifiers, from neo-Davidsonian event semantics, and from Discourse Representation Theory. The scheme consists of (1) an abstract syntax which defines ‘annotation structures’ as triples and other set-theoretic constructs of quantification-related concepts; (2) a reference representation of annotation structures (‘concrete syntax’); and (3) a compositional semantics of annotation structures. Together, these components define the markup language QuantML. This paper focuses on the identification and structuring of the semantic information useful for the characterisation of quantification in natural language and the interoperable representation of these information structures in QuantML.

pdf abs
IsraParlTweet: The Israeli Parliamentary and Twitter Resource
Guy Mor-Lan | Effi Levi | Tamir Sheafer | Shaul R. Shenhav

We introduce IsraParlTweet, a new linked corpus of Hebrew-language parliamentary discussions from the Knesset (Israeli Parliament) between the years 1992-2023 and Twitter posts made by Members of the Knesset between the years 2008-2023, containing a total of 294.5 million Hebrew tokens. In addition to raw text, the corpus contains comprehensive metadata on speakers and Knesset sessions as well as several linguistic annotations. As a result, IsraParlTweet can be used to conduct a wide variety of quantitative and qualitative analyses and provide valuable insights into political discourse in Israel.

Even though various speech data sets are available in Hungarian, there is a lack of a general overview of their types and sizes. To fill this gap, we provide a survey of available data sets in spoken Hungarian in five categories (monolingual, Hungarian part of multilingual, pathological, child-related and dialectal collections). In total, the estimated size of available data is about 2800 hours (across 7500 speakers) and it represents a rich spoken language diversity. However, the distribution of the data and its alignment to real-life (e.g. speech recognition) tasks is far from optimal, indicating the need for additional larger-scale natural language speech data sets. Our survey presents an overview of available data sets for Hungarian, explaining their strengths and weaknesses, which is useful for researchers working on Hungarian across disciplines. In addition, our survey serves as a starting point towards a unified foundational speech model specific to Hungarian.

pdf abs
Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks
Xiao Pu | Mingqi Gao | Xiaojun Wan

Research on automated text summarization typically uses human and automatic evaluation methods. While most recent studies focus on intrinsic evaluation, which assesses the general quality of summaries, e.g. coherence and informativeness, we concentrate on task-based extrinsic evaluation to determine the usefulness of summaries. We incorporate three downstream tasks, namely question answering, text classification, and text similarity assessment, and measure the usefulness of summaries for these tasks by several metrics. Our findings reveal that summaries are generally useful in tasks that require a comprehensive grasp of the text but are less useful in tasks requiring a more specific understanding of the text. We also analyze the usefulness and inherent properties of summaries from different models, and find that fine-tuned models consistently produce more useful summaries across all three tasks. In contrast, zero-shot models tend to lean towards text classification and similarity assessment, providing more general and less detailed summaries. Additionally, we assess the correlation between 14 intrinsic automatic metrics and human judgments. Intrinsic metrics perform well in evaluating summaries for question answering but are less effective in the other two tasks. This highlights the limitations of relying solely on intrinsic metrics for assessing summary performance and usefulness.

pdf abs
IT2ACL Learning Easy-to-Hard Instructions via 2-Phase Automated Curriculum Learning for Large Language Models
Yufei Huang | Deyi Xiong

Instruction tuning has demonstrated its superiority in unlocking the abilities of pre-trained large language models (LLMs), including their capability to respond to diverse human instructions and conduct complex reasoning. In order to further enhance the continuous learning capabilities of pre-trained LLMs, we explore the training process of instruction tuning through the lens of task sequences. We propose IT2ACL, a 2-phase automated curriculum learning guided instruction tuning framework that learns easy-to-hard instructions for LLMs in a self-adjusting dynamic manner. To facilitate curriculum learning from instructions, we propose a loss-driven progress signal for two-phase strategies: instruction prediction gain that decides the instruction level syllabus. Through comprehensive experiments on 70 Chinese datasets which have been grouped into 16 distinct task clusters, we demonstrate the effectiveness of our approach in eliciting latent ability in pre-trained LLMs and achieving superior performance across diverse tasks.

pdf abs
IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
Gabriele Sarti | Malvina Nissim

We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.

Italian Word Embeddings for the Medical Domain
Franco Alberto Cardillo | Franca Debole

Neural word embeddings have proven valuable in the development of medical applications. However, for the Italian language, there are no publicly available corpora, embeddings, or evaluation resources tailored to this domain. In this paper, we introduce an Italian corpus for the medical domain that includes texts from Wikipedia, medical journals, drug leaflets, and specialized websites. Using this corpus, we generate neural word embeddings from scratch. These embeddings are then evaluated using standard evaluation resources that we translated into Italian by exploiting the concept graph in the UMLS Metathesaurus. Despite the relatively small size of the corpus, our experimental results indicate that the new embeddings correlate well with human judgments regarding the similarity and the relatedness of medical concepts. Moreover, these medical-specific embeddings outperform a baseline model trained on the full Wikipedia corpus, which includes the medical pages we used. We believe that our embeddings and the newly introduced textual resources will foster further advancements in the field of Italian medical Natural Language Processing.

It’s Not under the Lamppost: Expanding the Reach of Conversational AI
Christy Doran | Deborah A. Dahl

Generic commercial language-based assistants have become ubiquitously available, originally in the form of smart speakers and mobile apps, and more recently in the form of systems based on generative AI. At first glance, their capabilities seem remarkable. Speech recognition works well, NLU mostly works, and access to back-end information sources is usually quite good. However, there is still a lot of work to be done. In the area of NLU in particular, focused probes into the capabilities of language-based assistants easily reveal significant areas of brittleness that demonstrate large gaps in their coverage. For example, the straightforward disjunctive query “is this monday or tuesday” elicited the nonsensical response “it’s 2:50 p.m. many consider it to be the afternoon”. These gaps are difficult to identify if the development process relies on training the system with an ongoing supply of natural user data, because this natural data can become distorted by a self-reinforcing feedback loop where the system ‘trains’ the user to produce data that works. This paper describes a process for collecting specific kinds of data to uncover these gaps and an annotation scheme for system responses, and includes examples of simple utterances that nonetheless fail to be correctly processed. The systems tested include both conventional assistants, such as Amazon Alexa and Google Assistant, and GenAI systems, including ChatGPT and Bard/Gemini. We claim that these failures are due to a lack of attention to the full spectrum of input possibilities, and argue that systems would benefit from the inclusion of focused manual assessment to directly target likely gaps.

JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus
Masaaki Nagata | Makoto Morishita | Katsuki Chousa | Norihito Yasuda

We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publications of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from DOCDB, a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment. We experimentally improved the accuracy of the patent translations by 20 BLEU points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.

Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. We use a transformer architecture based on efficient methods (LinFormer) to maximise their utility, since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, as well as single- and multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pretraining with the approximate LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as contributed datasets.

JCoLA: Japanese Corpus of Linguistic Acceptability
Taiga Someya | Yushi Sugimoto | Yohei Oseki

Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, and various datasets have therefore recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86%; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14%; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese and multilingual language models on JCoLA. The results demonstrated that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.

Understanding expressions that refer to the physical world is crucial for human-assisting systems in the real world, such as robots that must perform the actions users expect. In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views. To this end, we propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home. The dataset is annotated with crossmodal tags between phrases in the utterances and the object bounding boxes in the video frames. These tags include indirect reference relations, such as predicate-argument structures and bridging references, as well as direct reference relations. We also constructed an experimental model and clarified the challenges in multimodal reference resolution tasks.

JDocQA: Japanese Document Question Answering Dataset for Generative Language Models
Eri Onami | Shuhei Kurita | Taiki Miyanishi | Taro Watanabe

Document question answering is the task of answering questions about given documents such as reports, slides, pamphlets, and websites, and it is much in demand because paper and electronic documents are so common in our society. It is known to be quite challenging because it requires not only text understanding but also understanding of figures and tables, and hence visual question answering (VQA) methods are often examined in addition to textual approaches. We introduce Japanese Document Question Answering (JDocQA), a large-scale document-based QA dataset that essentially requires both visual and textual information to answer questions, and which comprises 5,504 documents in PDF format and 11,600 annotated question-and-answer instances in Japanese. Each QA instance includes references to the document pages and bounding boxes for the answer clues. We incorporate multiple categories of questions, as well as questions that are unanswerable from the document, for realistic question-answering applications. We empirically evaluate the effectiveness of our dataset with text-based large language models (LLMs) and multimodal models. Incorporating unanswerable questions in finetuning may help to rein in so-called hallucination.

JEMHopQA: Dataset for Japanese Explainable Multi-Hop Question Answering
Ai Ishii | Naoya Inoue | Hisami Suzuki | Satoshi Sekine

We present JEMHopQA, a multi-hop QA dataset for the development of explainable QA systems. The dataset consists not only of question-answer pairs, but also of supporting evidence in the form of derivation triples, which contributes to making the QA task more realistic and difficult. It is created based on Japanese Wikipedia using both crowd-sourced human annotation as well as prompting a large language model (LLM), and contains a diverse set of question, answer and topic categories as compared with similar datasets released previously. We describe the details of how we built the dataset as well as the evaluation of the QA task presented by this dataset using GPT-4, and show that the dataset is sufficiently challenging for the state-of-the-art LLM while showing promise for combining such a model with existing knowledge resources to achieve better performance.

Large language models (LLMs) have proficiently solved a broad range of tasks with their rich knowledge but often struggle with logical reasoning. To foster research on logical reasoning, many benchmarks have been proposed so far. However, most of these benchmarks are limited to English, hindering the evaluation of LLMs specialized for each language. To address this, we propose JFLD (Japanese Formal Logic Deduction), a deductive reasoning benchmark for Japanese. JFLD assesses whether LLMs can generate logical steps to (dis-)prove a given hypothesis based on a given set of facts. Its key features are assessing pure logical reasoning abilities isolated from knowledge and assessing various reasoning rules. We evaluate various Japanese LLMs and see that they are still poor at logical reasoning, thus highlighting a substantial need for future research.

JLBert: Japanese Light BERT for Cross-Domain Short Text Classification
Chandrai Kayal | Sayantan Chattopadhyay | Aryan Gupta | Satyen Abrol | Archie Gugol

Models such as BERT have made a significant breakthrough in the Natural Language Processing (NLP) domain, solving 11+ tasks. This is achieved by training on large-scale unlabelled text resources and leveraging the Transformer architecture, making BERT the “Jack of all NLP trades”. However, one of the popular and challenging tasks in Sequence Classification is Short Text Classification (STC). Short texts face the problem of being short, equivocal, and non-standard. In this paper, we address two major problems: 1. Improving STC task performance in the Japanese language, which consists of many varieties and dialects. 2. Building a light-weight Japanese BERT model with cross-domain functionality and accuracy comparable to State of the Art (SOTA) BERT models. To solve this, we propose a novel cross-domain scalable model called JLBert, which is pre-trained on a rich, diverse and less explored Japanese e-commerce corpus. We present results from extensive experiments to show that JLBert outperforms SOTA multilingual and Japanese-specialized BERT models on three short text datasets by approximately 1.5% across various domains.

JL-Hate: An Annotated Dataset for Joint Learning of Hate Speech and Target Detection
Kaan Büyükdemirci | Izzet Emre Kucukkaya | Eren Ölmez | Cagri Toraman

The detection of hate speech is a subject extensively explored by researchers, and machine learning algorithms play a crucial role in this domain. The existing resources mostly focus on text sequence classification for the task of hate speech detection. However, the target of hateful content is another dimension that has not been studied in detail due to the lack of data resources. In this study, we address this gap by introducing a novel tweet dataset, called JL-Hate, for the joint learning of hate speech detection and target detection, framed as sequential text classification and token classification, respectively. The JL-Hate dataset consists of 1,530 tweets divided equally between English and Turkish. Leveraging this dataset, we conduct a series of benchmark experiments. We utilize a joint learning model to concurrently perform sequence and token classification tasks on our data. Our experimental results demonstrate performance consistent with prevalent studies in both sequence and token classification tasks.

JMultiWOZ: A Large-Scale Japanese Multi-Domain Task-Oriented Dialogue Dataset
Atsumoto Ohashi | Ryu Hirai | Shinya Iizuka | Ryuichiro Higashinaka

Dialogue datasets are crucial for deep learning-based task-oriented dialogue system research. While numerous English language multi-domain task-oriented dialogue datasets have been developed and contributed to significant advancements in task-oriented dialogue systems, such a dataset does not exist in Japanese, and research in this area is limited compared to that in English. In this study, towards the advancement of research and development of task-oriented dialogue systems in Japanese, we constructed JMultiWOZ, the first Japanese language large-scale multi-domain task-oriented dialogue dataset. Using JMultiWOZ, we evaluated the dialogue state tracking and response generation capabilities of the state-of-the-art methods on the existing major English benchmark dataset MultiWOZ2.2 and the latest large language model (LLM)-based methods. Our evaluation results demonstrated that JMultiWOZ provides a benchmark that is on par with MultiWOZ2.2. In addition, through evaluation experiments of interactive dialogues with the models and human participants, we identified limitations in the task completion capabilities of LLMs in Japanese.

Joint Annotation of Morphology and Syntax in Dependency Treebanks
Bruno Guillaume | Kim Gerdes | Kirian Guiller | Sylvain Kahane | Yixuan Li

In this paper, we compare different ways to annotate both syntactic and morphological relations in a dependency treebank and we propose new formats we call mSUD and mUD, compatible with the Universal Dependencies (UD) schema for syntactic treebanks. We emphasize mSUD rather than mUD, the former being based on distributional criteria for the choice of the head of any combination, which allows us to clearly encode the internal structure of a word, that is, the derivational path. We investigate different problems posed by a morph-based annotation, concerning tokenization, the choice of the head of a morph combination, relations between morphs, and the additional features needed, such as the token type differentiating roots from derivational and inflectional affixes. We show how our annotation schema can be applied to different languages, from polysynthetic languages such as Yupik to isolating languages such as Chinese.

Dialogue policy learning (DPL) aims to determine an abstract representation (also known as an action) to guide what the response should be. Typically, DPL is cast as a sequential decision problem across a series of predefined action candidates. However, such static and narrow actions can limit response diversity and impede the dialogue agent’s adaptability to new scenarios and edge cases. To overcome these challenges, we introduce a novel Joint Transformer Reinforcement Learning framework, called JoTR, where a text-to-text Transformer-based model is employed to directly generate dialogue actions. More concretely, JoTR formulates a token-grained policy, facilitating more dynamic and adaptable dialogue action generation without the need for predefined action candidates. This method not only enhances the diversity of responses but also significantly improves the system’s capability to manage unfamiliar scenarios. Furthermore, JoTR utilizes Reinforcement Learning with a reward-shaping mechanism to efficiently fine-tune the token-grained policy. This allows the model to evolve through interactions, thereby enhancing its performance over time. Our extensive evaluation demonstrates that JoTR surpasses previous state-of-the-art models, showing improvements of 9% and 13% in success rate, and 34% and 37% in the diversity of dialogue actions across two benchmark dialogue modeling tasks respectively. These results have been validated by both user simulators and human evaluators. Code and data are available at github.com/KwanWaiChung/JoTR.

JRC-Names-Retrieval: A Standardized Benchmark for Name Search
Philip Blair | Kfir Bar

Many systems rely on the ability to effectively search through databases of personal and organization entity names in multiple writing scripts. Despite this, there is a relative lack of research studying this problem in isolation. In this work, we discuss this problem in detail and support future research by publishing what we believe is the first comprehensive dataset designed for this task. Additionally, we present a number of baselines against which future work can be compared; among these, we describe a neural solution based on ByT5 (Xue et al., 2022) which demonstrates up to a 12% performance gain over preexisting baselines, indicating that there remains much room for improvement in this space.

J-SNACS: Adposition and Case Supersenses for Japanese Joshi
Tatsuya Aoyama | Chihiro Taguchi | Nathan Schneider

Many languages use adpositions (prepositions or postpositions) to mark a variety of semantic relations, with different languages exhibiting both commonalities and idiosyncrasies in the relations grouped under the same lexeme. We present the first Japanese extension of the SNACS framework (Schneider et al., 2018), which has served as the basis for annotating adpositions in corpora from several languages. After establishing which of the set of particles (joshi) in Japanese qualify as case markers and adpositions as defined in SNACS, we annotate 10 chapters (≈10k tokens) of the Japanese translation of Le Petit Prince (The Little Prince), achieving high inter-annotator agreement. We find that, while a majority of the particles and their uses are captured by the existing and extended SNACS annotation guidelines from the previous work, some unique cases were observed. We also conduct experiments investigating the cross-lingual similarity of adposition and case marker supersenses, showing that the language-agnostic SNACS framework captures similarities not clearly observed in multilingual embedding space.

Jump to Conclusions: Short-Cutting Transformers with Linear Transformations
Alexander Yom Din | Taelin Karidi | Leshem Choshen | Mor Geva

Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in-between. In this work, we suggest a simple method for such casting, using linear transformations. This approximation far exceeds the prevailing practice of inspecting hidden representations from all layers in the space of the final layer. Moreover, in the context of language modeling, our method produces more accurate predictions from hidden layers, across various model scales, architectures, and data distributions. This allows “peeking” into intermediate representations, showing that GPT-2 and BERT often predict the final output already in early layers. We then demonstrate the practicality of our method for recent early exit strategies, showing that when aiming, for example, at retention of 95% accuracy, our approach saves an additional 7.9% of layers for GPT-2 and 5.4% of layers for BERT. Last, we extend our method to linearly approximate sub-modules, finding that attention is most tolerant to this change. Our code and learned mappings are publicly available at https://github.com/sashayd/mat.
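
The central idea of this abstract, mapping an intermediate hidden representation to an approximation of the final-layer representation with a single linear transformation, can be illustrated with an ordinary least-squares fit. The sketch below is a toy illustration and not the authors' implementation: the helper names (`fit_linear_shortcut`, `solve`), the pure-Python solver, and the two-dimensional data are all assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; add samples or regularize")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_shortcut(hidden: Matrix, final: Matrix) -> Matrix:
    """Least-squares A minimizing ||hidden @ A - final||^2 (normal equations)."""
    ht = transpose(hidden)
    return solve(matmul(ht, hidden), matmul(ht, final))


# Hypothetical toy data: each row stands in for one token's hidden state.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
true_map = [[2.0, 0.0], [1.0, 1.0]]
final = matmul(hidden, true_map)  # pretend these are final-layer states

learned = fit_linear_shortcut(hidden, final)
approx_final = matmul(hidden, learned)  # "short-cut" prediction of final states
```

In a real setting the rows would be hidden states collected from a model at a chosen layer and at the last layer, and a numerical library would replace the hand-written solver.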

KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis
Adal Abilbekov | Saida Mussakhojayeva | Rustem Yeshpanov | Huseyin Atakan Varol

This study focuses on the creation of the KazEmoTTS dataset, designed for emotional Kazakh text-to-speech (TTS) applications. KazEmoTTS is a collection of 54,760 audio-text pairs, with a total duration of 74.85 hours, featuring 34.23 hours delivered by a female narrator and 40.62 hours by two male narrators. The emotions considered include “neutral”, “angry”, “happy”, “sad”, “scared”, and “surprised”. We also developed a TTS model trained on the KazEmoTTS dataset. Objective and subjective evaluations were employed to assess the quality of synthesized speech, yielding an MCD score within the range of 6.02 to 7.67, alongside a MOS that spanned from 3.51 to 3.57. To facilitate reproducibility and inspire further research, we have made our code, pre-trained model, and dataset accessible in our GitHub repository.

KazParC: Kazakh Parallel Corpus for Machine Translation
Rustem Yeshpanov | Alina Polonskaya | Huseyin Atakan Varol

We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish. The first and largest publicly available corpus of its kind, KazParC contains a collection of 371,902 parallel sentences covering different domains and developed with the assistance of human translators. Our research efforts also extend to the development of a neural machine translation model nicknamed Tilmash. Remarkably, the performance of Tilmash is on par with, and in certain instances, surpasses that of industry giants, such as Google Translate and Yandex Translate, as measured by standard evaluation metrics such as BLEU and chrF. Both KazParC and Tilmash are openly available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.

KazQAD: Kazakh Open-Domain Question Answering Dataset
Rustem Yeshpanov | Pavel Efimov | Leonid Boytsov | Ardak Shalkarbayuli | Pavel Braslavski

We introduce KazQAD, a Kazakh open-domain question answering (ODQA) dataset that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we think that there should still be ample room for improvement. We also show that OpenAI’s current ChatGPT v3.5 is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.
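
For readers unfamiliar with the scores quoted above, the following is a minimal sketch of how exact match (EM), token-level F1, reciprocal rank, and NDCG@k are commonly computed. It is not the KazQAD evaluation code: the whitespace tokenization, the lowercasing, and the choice to compute the ideal ranking from the retrieved list only are simplifying assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked_relevance: List[bool]) -> float:
    """1 / rank of the first relevant passage; 0.0 if none is relevant."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(gains: List[float], k: int = 10) -> float:
    """NDCG@k; the ideal order is taken from the given gains (assumption)."""
    def dcg(values: List[float]) -> float:
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical example values, not taken from the dataset.
em = exact_match("Almaty ", "almaty")
f1 = token_f1("the city of almaty", "almaty")
rr = reciprocal_rank([False, True, False])
ndcg = ndcg_at_k([0.0, 1.0])
```

MRR is the mean of `reciprocal_rank` over all questions, and corpus-level EM and F1 are likewise averages over questions, usually reported as percentages.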

KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes
Rustem Yeshpanov | Huseyin Atakan Varol

This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes numerical ratings ranging from 1 to 5, providing a quantitative representation of customer attitudes. The study also pursued the automation of Kazakh sentiment classification through the development and evaluation of four machine learning models trained for both polarity classification and score classification. Experimental analysis included evaluation of the results considering both balanced and imbalanced scenarios. The most successful model attained an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets. The dataset and fine-tuned models are open access and available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.

The goal of knowledge graph completion (KGC) is to predict missing facts among entities. Previous methods for KGC re-ranking are mostly built on non-generative language models to obtain the probability of each candidate. Recently, generative large language models (LLMs) have shown outstanding performance on several tasks such as information extraction and dialog systems. Using them for KGC re-ranking makes it possible to exploit their extensive pre-trained knowledge and powerful generative capabilities. However, it may introduce new problems, namely mismatch, misordering and omission. To this end, we introduce KC-GenRe, a knowledge-constrained generative re-ranking method based on LLMs for KGC. To overcome the mismatch issue, we formulate the KGC re-ranking task as a candidate identifier sorting generation problem implemented by generative LLMs. To tackle the misordering issue, we develop a knowledge-guided interactive training method that enhances the identification and ranking of candidates. To address the omission issue, we design a knowledge-augmented constrained inference method that enables contextual prompting and controlled generation, so as to obtain valid rankings. Experimental results show that KC-GenRe achieves state-of-the-art performance on four datasets, with gains of up to 6.7% and 7.7% in the MRR and Hits@1 metrics compared to previous methods, and 9.0% and 11.1% compared to that without re-ranking. Extensive analysis demonstrates the effectiveness of the components in KC-GenRe.

KCL: Few-shot Named Entity Recognition with Knowledge Graph and Contrastive Learning
Shan Zhang | Bin Cao | Jing Fan

Named Entity Recognition (NER), a crucial subtask in natural language processing (NLP), is limited to a few labeled samples (a.k.a. few-shot). Metric-based meta-learning methods aim to learn the semantic space and assign the entity to its nearest label based on the similarity of their representations. However, these methods have trouble with semantic space learning, which results in suboptimal performance. Specifically, the label name or its description is widely used for label semantic representation learning, but the label information extracted from the existing label description is limited. In addition, these methods focus on reducing the distance between the entity and the corresponding label, which may also reduce the distance between the labels and thus cause misclassification. In this paper, we propose a few-shot NER method that harnesses the power of Knowledge Graph and Contrastive Learning to improve the prototypical semantic space learning. First, KCL leverages knowledge graphs to provide rich and structured label information for label semantic representation learning. Then, KCL introduces the idea of contrastive learning to learn the label semantic representation. The label semantic representation is used to help distance the label clusters in the prototypical semantic space to reduce misclassification. Extensive experiments show that KCL achieves significant improvement over the state-of-the-art methods.
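
The metric-based step described in this abstract, assigning an entity to its nearest label by representation similarity, can be sketched as follows. This is only an illustration of that generic step, not the KCL method: it does not use a knowledge graph or contrastive learning, cosine similarity is an assumed choice of metric, and the vectors and function names are hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_label(entity: List[float], labels: Dict[str, List[float]]) -> str:
    """Assign the entity to the label whose representation is most similar."""
    return max(labels, key=lambda name: cosine(entity, labels[name]))


def min_label_distance(labels: Dict[str, List[float]]) -> float:
    """Smallest pairwise cosine distance between labels (needs >= 2 labels)."""
    names = list(labels)
    return min(
        1.0 - cosine(labels[a], labels[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )


# Hypothetical 2-d label representations and one entity representation.
label_vectors = {"PERSON": [1.0, 0.1], "LOCATION": [0.1, 1.0]}
predicted = nearest_label([0.9, 0.2], label_vectors)
separation = min_label_distance(label_vectors)
```

A small `separation` value corresponds to the failure mode the abstract mentions: when label representations drift close together, the nearest-label decision becomes easy to get wrong.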

Knowledge-enhanced pre-trained language models (KEPLMs) leverage relation triples from knowledge graphs (KGs) and integrate these external data sources into language models via self-supervised learning. Previous works treat knowledge enhancement as two independent operations, i.e., knowledge injection and knowledge integration. In this paper, we propose to learn Knowledge-Enhanced language representations with Hierarchical Reinforcement Learning (KEHRL), which jointly addresses the problems of detecting positions for knowledge injection and integrating external knowledge into the model in order to avoid injecting inaccurate or irrelevant knowledge. Specifically, a high-level reinforcement learning (RL) agent utilizes both internal and prior knowledge to iteratively detect essential positions in texts for knowledge injection, which filters out less meaningful entities to avoid diverting the knowledge learning direction. Once the entity positions are selected, a relevant triple filtration module is triggered to perform low-level RL to dynamically refine the triples associated with polysemic entities through binary-valued actions. Experiments validate KEHRL’s effectiveness in probing factual knowledge and enhancing the model’s performance on various natural language understanding tasks.

KET-QA: A Dataset for Knowledge Enhanced Table Question Answering
Mengkang Hu | Haoyu Dong | Ping Luo | Shi Han | Dongmei Zhang

Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) systems. However, most existing datasets either overlook the challenge of missing knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset KET-QA with fine-grained gold evidence annotation. Each table in the dataset corresponds to a sub-graph of the entire KB, and every question requires the integration of information from both the table and the sub-graph to be answered. To extract pertinent information from the vast knowledge sub-graph and apply it to TableQA, we design a retriever-reasoner structured pipeline model. Experimental results demonstrate that our model consistently achieves remarkable relative performance improvements ranging from 1.9 to 6.5 times on EM scores across three distinct settings (fine-tuning, zero-shot, and few-shot), in comparison with solely relying on table information. However, even the best model achieves a 60.23% EM score, which still lags behind the human-level performance, highlighting the challenging nature of KET-QA for the question-answering community.

Keyphrase Generation: Lessons from a Reproducibility Study
Edwin Thomas | Sowmya Vajjala

Reproducibility studies are treated as means to verify the validity of a scientific method, but what else can we learn from such experiments? We addressed this question taking Keyphrase Generation (KPG) as the use case in this paper, by studying three state-of-the-art KPG models in terms of reproducibility under either the same (same data/model/code) or varied (different training data/model, but same code) conditions, and exploring different ways of comparing KPG models beyond the most commonly used evaluation measures. We drew some conclusions on the state of the art in KPG based on these experiments, and provided guidelines for researchers working on the topic about reporting experimental results in a more comprehensive manner.

KGConv, a Conversational Corpus Grounded in Wikidata
Quentin Brabant | Lina M. Rojas Barahona | Gwénolé Lecorvé | Claire Gardent

We present KGConv, a large corpus of 71k English conversations where each question-answer pair is grounded in a Wikidata fact. Conversations contain on average 8.6 questions and for each Wikidata fact, we provide multiple variants (12 on average) of the corresponding question using templates, human annotations, hand-crafted rules and a question rewriting neural model. We provide baselines for the task of Knowledge-Based, Conversational Question Generation. KGConv can further be used for other generation and analysis tasks such as single-turn question generation from Wikidata triples, question rewriting, question answering from conversation or from knowledge graphs and quiz generation.

Khan Academy Corpus: A Multilingual Corpus of Khan Academy Lectures
Dominika Ďurišková | Daniela Jurášová | Matúš Žilinec | Eduard Šubert | Ondřej Bojar

We present the Khan Academy Corpus totalling 10122 hours in 87394 recordings across 29 languages, where 43% of recordings (4252 hours) are equipped with human-written subtitles. The subtitle texts cover a total of 137 languages. The dataset was collected from open access Khan Academy lectures, benefiting from their manual transcripts and manual translations of the transcripts. The dataset can serve in creation or evaluation of multilingual speech recognition or translation systems, featuring a diverse set of subject domains.

Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information
Chihiro Taguchi | Jefferson Saransig | Dayana Velásquez | David Chiang

This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies, all done in ELAN, the annotation software. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small dataset size. This dataset, the ASR model, and the code used to develop them will be publicly available. Thus, our study positively showcases resource building and its applications for low-resource languages and their community.

KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models
Dongjun Jang | Sungjoo Byun | Hyemi Jo | Hyopil Shin

Instruction tuning of large language models (LLMs) is an essential process for a model to function well and achieve high performance on specific tasks. Accordingly, in mainstream languages such as English, instruction-based datasets are being constructed and made publicly available. In the case of Korean, publicly available models and datasets all rely on using the output of ChatGPT or translating datasets built in English. In this paper, we introduce KIT-19 as an instruction dataset for the development of LLMs in Korean. KIT-19 is a dataset created in an instruction format, comprising 19 existing open-source datasets for Korean NLP tasks. We train a Korean pretrained LLM using KIT-19 to demonstrate its effectiveness. The experimental results show that the model trained on KIT-19 significantly outperforms existing Korean LLMs. Based on its quality and empirical results, this paper proposes that KIT-19 has the potential to make a substantial contribution to the future improvement of Korean LLMs’ performance.

Know-Adapter: Towards Knowledge-Aware Parameter-Efficient Transfer Learning for Few-shot Named Entity Recognition
Binling Nie | Yiming Shao | Yigang Wang

Parameter-Efficient Fine-Tuning (PEFT) is a promising approach to mitigate the challenges of adapting pretrained language models (PLMs) to the named entity recognition (NER) task. Recent studies have highlighted the improvements that can be made to the quality of information retrieved from PLMs by adding explicit knowledge from external sources such as knowledge graphs (KGs) to otherwise naive PEFTs. In this paper, we propose a novel knowledgeable adapter, Know-Adapter, to incorporate structural and semantic knowledge from KGs into PLMs for few-shot NER. First, we construct a related KG entity type sequence for each sentence using a knowledge retriever. However, the type system of a domain-specific NER task is typically independent of that of current KGs and thus inevitably exhibits heterogeneity, which makes matching between the original NER and KG types (e.g., Person in NER potentially matches President in KBs) less likely, or introduces unintended noise. We therefore design a unified taxonomy based on the KG ontology for KG entity types and NER labels. This taxonomy is used to build a learnable shared representation module, which provides shared representations for both KG entity type sequences and NER labels. Based on these shared representations, our Know-Adapter introduces highly semantically relevant knowledge and structural knowledge from KGs as an inductive bias to guide the updating process of the adapter. Additionally, the shared representations guide the learnable representation module to reduce noise in the unsupervised expansion of label words. Extensive experiments on multiple NER datasets show the superiority of Know-Adapter over other state-of-the-art methods in both full-resource and low-resource settings.

Knowledge-augmented Graph Neural Networks with Concept-aware Attention for Adverse Drug Event Detection
Ya Gao | Shaoxiong Ji | Pekka Marttinen

Adverse drug events (ADEs) are an important aspect of drug safety. Various texts such as biomedical literature, drug reviews, and user posts on social media and medical forums contain a wealth of information about ADEs. Recent studies have applied word embedding and deep learning-based natural language processing to automate ADE detection from text. However, they did not explore incorporating explicit medical knowledge about drugs and adverse reactions or the corresponding feature learning. This paper adopts the heterogeneous text graph, which describes relationships between documents, words, and concepts, augments it with medical knowledge from the Unified Medical Language System, and proposes a concept-aware attention mechanism that learns features differently for the different types of nodes in the graph. We further utilize contextualized embeddings from pretrained language models and convolutional graph neural networks for effective feature representation and relational learning. Experiments on four public datasets show that our model performs competitively with recent advances, and that the concept-aware attention consistently outperforms other attention mechanisms.

Knowledge-aware Attention Network for Medication Effectiveness Prediction
Yingying Zhang | Xian Wu | Yu Zhang | Yefeng Zheng

The first 24 hours’ medication plan is critical to patients with serious or life-threatening illnesses and injuries. An appropriate medication can result in a lower mortality, a shorter length of stay and a higher APACHE score. However, in clinical practice, the medication plan is often error-prone, especially when a decision must be made quickly for life-threatening situations in the Intensive Care Unit (ICU). Therefore, predicting the effectiveness of the first 24 hours’ medication plan is of great importance in assisting doctors to make proper decisions. Existing effectiveness prediction works usually focus on one specific medicine, one specific disease, or one specific lab test, making it hard to extend to general medicines and diseases in hospital/ICU scenarios. In this paper, we propose to predict medication effectiveness of the first 24 hours in hospital/ICU based on patients’ information. Specifically, we use a knowledge enhanced module to incorporate external knowledge about medications and a medical feature learning module to determine the interaction between diagnosis and medications. To handle the data imbalance problem, we further optimize the proposed model with a contrastive loss. Extensive experimental results on a public dataset show that our model can significantly outperform state-of-the-art methods.

In recent years, multilingual pre-trained language models (mPLMs) have achieved significant progress in cross-lingual dense retrieval. However, most mPLMs neglect the importance of knowledge. Knowledge always conveys similar semantic concepts in a language-agnostic manner, while query-passage pairs in cross-lingual retrieval also share common factual information. Motivated by this observation, we introduce KEPT, a novel mPLM that effectively leverages knowledge to learn language-agnostic semantic representations. To achieve this, we construct a multilingual knowledge base using hyperlinks and cross-language page alignment data annotated by Wiki. From this knowledge base, we mine intra- and cross-language pairs by extracting symmetrically linked segments and multilingual entity descriptions. Subsequently, we adopt contrastive learning with the mined pairs to pre-train KEPT. We evaluate KEPT on three widely-used benchmarks, considering both zero-shot cross-lingual transfer and supervised multilingual fine-tuning scenarios. Extensive experimental results demonstrate that KEPT achieves strong multilingual and cross-lingual retrieval performance with significant improvements over existing mPLMs.
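
The contrastive pre-training step mentioned above can be illustrated with a minimal sketch. It assumes an InfoNCE-style objective with in-batch negatives, which the abstract does not specify; the function name `info_nce`, the temperature value, and the toy vectors are hypothetical and are not taken from KEPT.

```python
import math


def dot(u, v):
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, passages, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (an assumed objective).

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch serves as a negative.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


# Toy embeddings standing in for a mined cross-language pair set.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = aligned[::-1]
loss_aligned = info_nce(queries, aligned)
loss_swapped = info_nce(queries, swapped)
```

In this toy setup the loss is lower when each query is paired with its matching passage than when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.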

Knowledge-enhanced Prompt Tuning for Dialogue-based Relation Extraction with Trigger and Label Semantic
Hao An | Zhihong Zhu | Xuxin Cheng | Zhiqi Huang | Yuexian Zou

Dialogue-based relation extraction (DRE), which aims to determine the semantic relation of a given pair of arguments from a piece of dialogue, has received increasing attention. Due to the low information density of dialogue text, it is difficult for the model to focus on key information. To this end, we propose a Knowledge-Enhanced Prompt-Tuning (KEPT) method to effectively enhance the DRE model by exploiting trigger and label semantics. Specifically, we propose two beneficial tasks, masked trigger prediction and verbalizer representation learning, to inject trigger knowledge and label semantic knowledge, respectively. Furthermore, we convert the DRE task to a masked language modeling task to unify the format of knowledge injection and utilization, aiming to better promote DRE performance. Experimental results on the DialogRE dataset show that our KEPT achieves state-of-the-art performance in F1 and F1c scores. Detailed analyses demonstrate the effectiveness and efficiency of our proposed approach. Code is available at https://github.com/blackbookay/KEPT.

Knowledge GeoGebra: Leveraging Geometry of Relation Embeddings in Knowledge Graph Completion
Kossi Amouzouvi | Bowen Song | Sahar Vahdati | Jens Lehmann

Knowledge graph embedding (KGE) models provide a low-dimensional representation of knowledge graphs in continuous vector spaces. This representation learning enables different downstream AI tasks such as link prediction for graph completion. However, most embedding models are only designed considering the algebra and geometry of the entity embedding space, the algebra of the relation embedding space, and the interaction between relation and entity embeddings. Neglecting the geometry of relation embedding limits the optimization of entity and relation distribution leading to suboptimal performance of knowledge graph completion. To address this issue, we propose a new perspective in the design of KGEs by looking into the geometry of relation embedding space. The proposed method and its variants are developed on top of an existing framework, RotatE, from which we leverage the geometry of the relation embeddings by mutating the unit circle to an ellipse, and further generalize it with the concept of a butterfly curve, consecutively. Besides the theoretical abilities of the model in preserving topological and relational patterns, the experiments on the WN18RR, FB15K-237 and YouTube benchmarks showed that this new family of KGEs can challenge or outperform state-of-the-art models.
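
The idea of moving relation embeddings from the unit circle to an ellipse can be sketched as follows. The first function follows the standard RotatE distance, where each relation coordinate is a unit-modulus complex number. The second is only a hypothetical illustration: the semi-axes `a` and `b` and the parameterization `a*cos(theta) + i*b*sin(theta)` are assumptions, not the formulation used in the paper.

```python
import cmath
import math


def rotate_score(head, theta, tail):
    """RotatE-style score: rotate each head coordinate by a phase and
    take the negative L1 distance to the tail (higher is better)."""
    return -sum(
        abs(h * cmath.exp(1j * t) - x) for h, t, x in zip(head, theta, tail)
    )


def ellipse_score(head, theta, tail, a=1.0, b=1.0):
    """Hypothetical variant: each relation coordinate lies on an ellipse
    with semi-axes a and b instead of on the unit circle."""
    return -sum(
        abs(h * complex(a * math.cos(t), b * math.sin(t)) - x)
        for h, t, x in zip(head, theta, tail)
    )


# Toy two-dimensional complex embeddings (illustrative values only).
head = [1 + 0j, 0 + 1j]
theta = [math.pi / 2, math.pi]
tail = [0 + 1j, 0 - 1j]
exact = rotate_score(head, theta, tail)
circle = ellipse_score(head, theta, tail, a=1.0, b=1.0)
stretched = ellipse_score(head, theta, tail, a=2.0, b=0.5)
```

With `a = b = 1` the ellipse reduces to the unit circle and both scores coincide. Other semi-axes change both the phase and the modulus applied to the head embedding, which is the extra geometric freedom the abstract refers to.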

Knowledge Graphs for Real-World Rumour Verification
John Dougrez-Lewis | Elena Kochkina | Maria Liakata | Yulan He

Despite recent progress in automated rumour verification, little has been done on evaluating rumours in a real-world setting. We advance the state-of-the-art on the PHEME dataset, which consists of Twitter response threads collected as a rumour was unfolding. We automatically collect evidence relevant to PHEME and use it to construct knowledge graphs in a time-sensitive manner, excluding information post-dating rumour emergence. We identify discrepancies between the evidence retrieved and PHEME’s labels, which are discussed in detail and amended to release an updated dataset. We develop a novel knowledge graph approach which finds paths linking disjoint fragments of evidence. Our rumour verification model, which combines evidence from the graph, outperforms the state-of-the-art on PHEME and has superior generalisability when evaluated on a temporally distant rumour verification dataset.

Visual question generation (VQG) task aims to generate high-quality questions based on the input image. Current methods primarily focus on generating questions containing specified content utilizing answers or question types as constraints. However, these constraints make it challenging to control the topic of generated questions (e.g., conversation or test subject topics) for various applications. Thus, it is necessary to utilize topics as constraints to guide question generation. Considering that there are many topics and it is almost impossible for human annotations to cover them, we propose the cross-topic learning VQG (CTL-VQG) task, which aims to generate questions related to unseen topics in cross-topic scenarios. In this paper, we propose a knowledge-guided cross-topic visual question generation (KC-VQG) model to extract unseen topic-related information for question generation. Specifically, an image-topic feature extractor is introduced in our model to extract topic-related intuitive visual features; an image-topic knowledge extractor is used to extract and select the most appropriate topic-related implicit knowledge from large language models for generating questions. Extensive experiments show that our model outperforms baselines and can effectively generate unseen topic-related questions in cross-topic scenarios.

Scientific Information Extraction (SciIE) is a vital task and is increasingly being adopted in biomedical data mining to conceptualize and epitomize knowledge triplets from the scientific literature. Existing relation extraction methods aim to extract explicit triplet knowledge from documents; however, they can hardly perceive unobserved factual relations. Recent generative methods have more flexibility, but their generated relations encounter trustworthiness problems. In this paper, we first propose a novel Extraction-Contextualization-Derivation (ECD) strategy to generate a document-specific and entity-expanded dynamic graph from a shared static knowledge graph. Then, we propose a novel Dual-Graph Resonance Network (DGRN) which can generate richer explicit and implicit relations under the guidance of static and dynamic knowledge topologies. Experiments conducted on a public PubMed corpus validate the superiority of our method against several state-of-the-art baselines.

In Visually-rich Document Understanding (VrDU), recent advances in incorporating layout and image features into pre-trained language models have achieved significant progress. Existing methods usually develop complicated dedicated architectures based on pre-trained models and fine-tune them with costly high-quality data to eliminate the inconsistency of knowledge distribution between the pre-training task and specialized downstream tasks. However, due to their huge data demands, these methods are not suitable for few-shot settings, which are essential for quick applications with limited resources but have been addressed by few previous works. To solve these problems, we propose a unified Knowledge-aware prompt-tuning framework for Visually-rich Document Understanding (KnowVrDU) to enable broad utilization for diverse concrete applications and reduce data requirements. To model heterogeneous VrDU structures without designing task-specific architectures, we propose to reformulate various VrDU tasks into a single question-answering format with task-specific prompts and train the pre-trained model with the parameter-efficient prompt tuning method. To bridge the knowledge gap between the pre-training task and specialized VrDU tasks without additional annotations, we propose a prompt knowledge integration mechanism to leverage external open-source knowledge bases. We conduct experiments on several benchmark datasets in few-shot settings and the results validate the effectiveness of our method.

KoCoSa: Korean Context-aware Sarcasm Detection Dataset
Yumin Kim | Heejae Suh | Mingi Kim | Dongyeon Won | Hwanhee Lee

Sarcasm is a form of verbal irony where someone says the opposite of what they mean, often to ridicule a person, situation, or idea. It is often difficult to detect sarcasm in dialogue, since detection must take the context (i.e., dialogue history) into account. In this paper, we introduce a new dataset for the Korean dialogue sarcasm detection task, KoCoSa (Korean Context-aware Sarcasm Detection Dataset), which consists of 12.8K daily Korean dialogues and the labels for this task on the last response. To build the dataset, we propose an efficient sarcasm detection dataset generation pipeline: 1) generating new sarcastic dialogues from source dialogues with large language models, 2) automatic and manual filtering of abnormal and toxic dialogues, and 3) human annotation for the sarcasm detection task. We also provide a simple but effective baseline for the Korean sarcasm detection task trained on our dataset. Experimental results on the dataset show that our baseline system outperforms strong baselines like large language models, such as GPT-3.5, in the Korean sarcasm detection task. We show that the sarcasm detection task relies deeply on the existence of sufficient context. We will release the dataset at https://github.com/Yu-billie/KoCoSa_sarcasm_detection.

KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark
Seongbo Jang | Seonghyeon Lee | Hwanjo Yu

As language models are often deployed as chatbot assistants, it becomes a virtue for models to engage in conversations in a user’s first language. While these models are trained on a wide range of languages, a comprehensive evaluation of their proficiency in low-resource languages such as Korean has been lacking. In this work, we introduce KoDialogBench, a benchmark designed to assess language models’ conversational capabilities in Korean. To this end, we collect native Korean dialogues on daily topics from public sources, or translate dialogues from other languages. We then structure these conversations into diverse test datasets, spanning from dialogue comprehension to response selection tasks. Leveraging the proposed benchmark, we conduct extensive evaluations and analyses of various language models to measure a foundational understanding of Korean dialogues. Experimental results indicate that there exists significant room for improvement in models’ conversation skills. Furthermore, our in-depth comparisons across different language models highlight the effectiveness of recent training techniques in enhancing conversational proficiency. We anticipate that KoDialogBench will promote the progress towards conversation-aware Korean language models.

KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora
Jin-seo Kim | Anna Seo Gyeong Choi | Sunghye Cho

Word frequencies are integral in linguistic studies, showing strong correlations with speakers’ cognitive abilities and other important linguistic parameters including the Age of Acquisition (AoA). However, the formulation of credible Korean word frequency norms has been obstructed by the lack of expansive speech data and a reliable part-of-speech (POS) tagger. In this study, we unveil Korean word frequency norms (KoFREN), derived from large-scale spontaneous speech corpora (41 million words) that include a balanced representation of gender and age. We employed a machine learning-powered POS tagger, showcasing accuracy on par with human annotators. Our frequency norms correlate significantly with external studies’ lexical decision time (LDT) and AoA measures. KoFREN also aligns with English counterparts sourced from SUBTLEX_US, an English word frequency measure that has been frequently used in the literature. KoFREN is poised to facilitate research in spontaneous Contemporary Korean and can be utilized in many fields, including clinical studies of Korean patients.

Konidioms Corpus: A Dataset of Idioms in Konkani Language
Naziya Mahamdul Shaikh | Jyoti D. Pawar | Mubarak Banu Sayed

Konkani is a language spoken by a large number of people from the states located on the west coast of India. It is the official language of the Indian state of Goa. Currently, there is no idiom corpus for the low-resource Konkani language. This paper aims to advance idiomatic sentence identification, and thereby enhance linguistic processing, by creating the first corpus of idioms in the Konkani language. We select a unique list of 1597 idioms from multiple sources and proceed with a strictly controlled sentence creation procedure through crowdsourcing. This is followed by a quality check of the sentences and an annotation procedure carried out by experts in the Konkani language. We were able to build a good-quality corpus comprising 6520 sentences written in the Devanagari script of the Konkani language. Analysis of the collected idioms and their usage in the created sentences revealed the dominance of selected domains like ‘human body’ in the creation and occurrence of idiomatic expressions in the Konkani language. This corpus is made publicly available.

Named Entity Recognition (NER) plays a pivotal role in medical Natural Language Processing (NLP). Yet, there has not been an open-source medical NER dataset specifically for the Korean language. To address this, we utilized ChatGPT to assist in constructing the KBMC (Korean Bio-Medical Corpus), which we are now presenting to the public. With the KBMC dataset, we noticed an impressive 20% increase in medical NER performance compared to models trained on general Korean NER datasets. This research underscores the significant benefits and importance of using specialized tools and datasets, like ChatGPT, to enhance language processing in specialized fields such as healthcare.

Sign language is a crucial means of communication for deaf communities. However, those outside deaf communities often lack understanding of sign language, leading to inadequate communication accessibility for the deaf. Therefore, sign language translation is a significantly important research area. In this context, we present a new benchmark dataset for Korean sign language translation named SSL:korean disaster Safety information Sign Language translation benchmark dataset. Korean sign language translation datasets provided by the National Information Society Agency in South Korea have faced challenges related to computational resources, heterogeneity between train and test sets, and unrefined data. To alleviate the aforementioned issues, we refine the original data and release them. Additionally, we report experimental results of a baseline using a transformer architecture. We empirically demonstrate that the baseline performance varies depending on the tokenization method applied to gloss sequences. In particular, tokenization based on characteristics of sign language outperforms tokenization considering characteristics of spoken language and tokenization utilizing statistical techniques. We release materials at https://github.com/SSL-Sign-Language/Korean-Disaster-Safety-Information-Sign-Language-Translation-Benchmark-Dataset

Existing English-based text similarity measurements primarily focus on the semantic dimension, neglecting the unique linguistic attributes found in languages like Korean, where honorific expressions are explicitly integrated. To address this limitation, this study proposes Kosmic, a novel Korean text-similarity metric that encompasses the semantic and tonal facets of a given text pair. For the evaluation, we introduce a novel benchmark annotated by human experts, empirically showing that Kosmic outperforms the existing method. Moreover, by leveraging Kosmic, we assess various Korean paraphrasing methods to determine which techniques are most effective in preserving semantics and tone.

Zero-shot stance detection on social media (ZSSD-SM) aims to distinguish the attitude in tweets towards an unseen target. Previous work captures latent variables between source and target domains to perform this task, but the lack of context knowledge hinders the detection performance. Recent studies have been devoted to obtaining an accurate representation of tweets by bringing in additional facts from knowledge graphs (KGs), showing promising performance. However, these knowledge injection methods still suffer from two challenges: (i) the pipeline of knowledge injection causes error accumulation and (ii) irrelevant knowledge makes them fail to understand the semantics. In this paper, we propose a novel knowledge injection method for ZSSD-SM, which adopts two training stages, namely knowledge compression and task guidance, to flexibly inject knowledge into the pre-trained language model (PLM) and adaptively expand the tweet context. Specifically, in the knowledge compression stage, the latent representation of the KG is reconstructed by the triplet denoising task and compressed into external matrices, while in the task guidance stage, the frozen matrices are employed to guide the PLM to adaptively extract its own context-related knowledge and then complete the fine-tuning of the ZSSD-SM task. Extensive experiments on multiple datasets show the effectiveness of our proposed method. The code is available at: https://github.com/ShuohaoLin/KPatch.

K-pop Lyric Translation: Dataset, Analysis, and Neural-Modelling
Haven Kim | Jongmin Jung | Dasaem Jeong | Juhan Nam

Lyric translation, a field studied for over a century, is now attracting computational linguistics researchers. We identified two limitations in previous studies. First, lyric translation studies have predominantly focused on Western genres and languages, with no previous study centering on K-pop despite its popularity. Second, the field of lyric translation suffers from a lack of publicly available datasets; to the best of our knowledge, no such dataset exists. To broaden the scope of genres and languages in lyric translation studies, we introduce a novel singable lyric translation dataset, approximately 89% of which consists of K-pop song lyrics. This dataset aligns Korean and English lyrics line-by-line and section-by-section. We leveraged this dataset to unveil unique characteristics of K-pop lyric translation, distinguishing it from other extensively studied genres, and to construct a neural lyric translation model, thereby underscoring the importance of a dedicated dataset for singable lyric translations.

L^2GC: Lorentzian Linear Graph Convolutional Networks for Node Classification
Qiuyu Liang | Weihua Wang | Feilong Bao | Guanglai Gao

Linear Graph Convolutional Networks (GCNs) are used to classify nodes in graph data. However, we note that most existing linear GCN models perform neural network operations in Euclidean space, which does not explicitly capture the tree-like hierarchical structure exhibited in real-world datasets that are modeled as graphs. In this paper, we attempt to introduce hyperbolic space into linear GCN and propose a novel framework for Lorentzian linear GCN. Specifically, we map the learned features of graph nodes into hyperbolic space, and then perform a Lorentzian linear feature transformation to capture the underlying tree-like structure of data. Experimental results on standard citation network datasets with semi-supervised learning show that our approach yields new state-of-the-art accuracy of 74.7% on the Citeseer dataset and 81.3% on the PubMed dataset. Furthermore, we observe that our approach can be trained up to two orders of magnitude faster than other nonlinear GCN models on the PubMed dataset. Our code is publicly available at https://github.com/llqy123/LLGC-master.
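
The step of mapping Euclidean node features into hyperbolic space can be sketched with the standard exponential map at the origin of the Lorentz (hyperboloid) model. This is a minimal illustration assuming curvature -1; the function names are hypothetical, and the abstract does not state which map or curvature the authors use.

```python
import math


def exp_map_origin(v):
    """Map a Euclidean vector v in R^n onto the hyperboloid in R^(n+1).

    Uses the exponential map at the origin of the Lorentz model with
    curvature -1 (an assumption made for this sketch).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # The zero vector maps to the hyperboloid origin (1, 0, ..., 0).
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * x for x in v]


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


# Toy two-dimensional node feature (illustrative values only).
point = exp_map_origin([0.3, -0.4])
self_product = lorentz_inner(point, point)
```

Every mapped point satisfies the hyperboloid constraint, so its Lorentzian inner product with itself is -1 up to floating-point error. Any subsequent linear transformation has to preserve that constraint to stay in hyperbolic space.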

Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei | Mahsa Shafaei | Yigeng Zhang | Hugo Jair Escalante | Thamar Solorio

We address the challenge of detecting questionable content in online media, specifically the subcategory of comic mischief. This type of content combines elements such as violence, adult content, or sarcasm with humor, making it difficult to detect. Employing a multimodal approach is vital to capture the subtle details inherent in comic mischief content. To tackle this problem, we propose a novel end-to-end multimodal system for the task of comic mischief detection. As part of this contribution, we release a novel dataset for the targeted task consisting of three modalities: video, text (video captions and subtitles), and audio. We also design a HIerarchical Cross-attention model with CAPtions (HICCAP) to capture the intricate relationships among these modalities. The results show that the proposed approach makes a significant improvement over robust baselines and state-of-the-art models for comic mischief detection and its type classification. This emphasizes the potential of our system to empower users to make informed decisions about the online content they choose to see.

Labeling Results of Topic Models: Word Sense Disambiguation as Key Method for Automatic Topic Labeling with GermaNet
Jennifer Ecker

The combination of topic modeling and automatic topic labeling sheds light on understanding large corpora of text. It can be used to add semantic information to existing metadata. In addition, one can use the documents and the corresponding topic labels for topic classification. While existing algorithms for topic modeling are readily accessible for processing texts, there is a need to postprocess the result to make the topics more interpretable and self-explanatory. The topic words from the topic model are ranked, and the top word could easily be considered as a label. However, it is imperative to use automatic topic labeling, because the highest scored word is not the word that sums up the topic in the best way. Using the lexical-semantic word net GermaNet, the first step is to disambiguate words that are represented in GermaNet with more than one sense. We show how to find the correct sense in the context of a topic with the method of word sense disambiguation. To enhance accuracy, we present a similarity measure based on vectors of topic words that considers semantic relations of the senses, demonstrating superior performance in the investigated cases compared to existing methods.

Lip reading, the process of interpreting silent speech from visual lip movements, has gained rising attention for its wide range of realistic applications. Deep learning approaches greatly improve current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, poses a challenging problem due to inter-speaker variability. A well-trained lip reading system may perform poorly when handling a brand new speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variations across speakers, preventing the model from overfitting to specific speakers. In this work, in view of both input visual clues and latent representations based on a hybrid CTC/attention architecture, we propose to exploit lip landmark-guided fine-grained visual clues instead of the frequently used mouth-cropped images as input features, diminishing speaker-specific appearance characteristics. Furthermore, a max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under the intra-speaker and inter-speaker conditions.

Language and Speech Technology for Central Kurdish Varieties
Sina Ahmadi | Daban Jaff | Md Mahfuz Ibn Alam | Antonios Anastasopoulos

Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties. Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language, resulting in disparities for dialects and varieties for which there are few resources and tools available. In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish, creating a corpus by transcribing movies and TV series as an alternative to fieldwork. Additionally, we report the performance of machine translation, automatic speech recognition, and language identification as downstream tasks evaluated on Central Kurdish subdialects. Data and models are publicly available under an open license at https://github.com/sinaahmadi/CORDI.

Language Models and Semantic Relations: A Dual Relationship
Olivier Ferret

Since they rely on the distributional hypothesis, static and contextual language models are closely linked to lexical semantic relations. In this paper, we exploit this link for enhancing a BERT model. More precisely, we propose to extract lexical semantic relations with two unsupervised methods, one based on a static language model, the other on a contextual model, and to inject the extracted relations into a BERT model for improving its semantic capabilities. Through various evaluations performed for English and focusing on semantic similarity at the word and sentence levels, we show the value of this approach, which allows us to semantically enrich a BERT model without using any external semantic resource.

Language Models for Text Classification: Is In-Context Learning Enough?
Aleksandra Edwards | Jose Camacho-Collados

Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches based on fine-tuning is the ability to understand instructions written in natural language (prompts), which helps them generalise better to different tasks and domains without the need for specific training data. This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances. However, existing research is limited in scale and lacks understanding of how text generation models combined with prompting techniques compare to more established methods for text classification such as fine-tuning masked language models. In this paper, we address this research gap by performing a large-scale evaluation study for 16 text classification datasets covering binary, multiclass, and multilabel problems. In particular, we compare zero- and few-shot approaches of large language models to fine-tuning smaller language models. We also analyse the results by prompt, classification type, domain, and number of labels. In general, the results show how fine-tuning smaller and more efficient language models can still outperform few-shot approaches of larger language models, which have room for improvement when it comes to text classification.

Language Pivoting from Parallel Corpora for Word Sense Disambiguation of Historical Languages: A Case Study on Latin
Iacopo Ghinassi | Simone Tedeschi | Paola Marongiu | Roberto Navigli | Barbara McGillivray

Word Sense Disambiguation (WSD) is an important task in NLP, which serves the purpose of automatically disambiguating a polysemous word with its most likely sense in context. Recent studies have advanced the state of the art in this task, but most of the work has been carried out on contemporary English or other modern languages, leaving challenges posed by low-resource languages and diachronic change open. Although the problem with low-resource languages has recently been mitigated by using existing multilingual resources to propagate otherwise expensive annotations from English to other languages, such techniques have hitherto not been applied to historical languages such as Latin. In this work, we make the following two major contributions. First, we test such a strategy on a historical language and propose a new approach in this framework which makes use of existing bilingual corpora instead of native English datasets. Second, we fine-tune a Latin WSD model on the data produced and achieve state-of-the-art results on a standard benchmark for the task. Finally, we release the dataset generated with our approach, which is the largest dataset for Latin WSD to date. This work opens the door to further research, as our approach can be used for different historical and, generally, under-resourced languages.

Language Technologies as If People Mattered: Centering Communities in Language Technology Development
Nina Markl | Lauren Hall-Lew | Catherine Lai

In this position paper we argue that researchers interested in language and/or language technologies should attend to challenges of linguistic and algorithmic injustice together with language communities. We put forward that this can be done by drawing together diverse scholarly and experiential insights, building strong interdisciplinary teams, and paying close attention to the wider social, cultural and historical contexts of both language communities and the technologies we aim to develop.

Language identification is an important first step in many NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by where texts are retrieved from. Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety. To overcome this important limitation, this paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification. DSL-TL contains a total of 12,900 instances in Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English. We trained multiple models to discriminate between these language varieties, and we present the results in detail. The data and models presented in this paper provide a reliable benchmark toward the development of robust and fairer language variety identification systems. We make DSL-TL freely available to the research community.

LANID: LLM-assisted New Intent Discovery
Lu Fan | Jiashu Pu | Rongsheng Zhang | Xiao-Ming Wu

Data annotation is expensive in Task-Oriented Dialogue (TOD) systems. New Intent Discovery (NID) is a task that aims to identify novel intents while retaining the ability to recognize known intents. It is essential for expanding the intent base of task-based dialogue systems. Previous works relying on external datasets are hardly extendable. Meanwhile, the effective ones generally depend on the power of Large Language Models (LLMs). To address the limitation of model extensibility and take advantage of LLMs for the NID task, we propose LANID, a framework that leverages an LLM’s zero-shot capability to enhance the performance of a smaller text encoder on the NID task. LANID employs KNN and DBSCAN algorithms to select appropriate pairs of utterances from the training set. The LLM is then asked to determine the relationships between them. The collected data are then used to construct a fine-tuning task, and the small text encoder is optimized with a triplet loss. Our experimental results demonstrate the efficacy of the proposed method on three distinct NID datasets, surpassing all strong baselines in both unsupervised and semi-supervised settings. Our code can be found at https://github.com/floatSDSDS/LANID.
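
The triplet objective named in the abstract above can be illustrated with a minimal sketch. The Euclidean distance, the margin of 1.0, and the toy vectors are assumptions chosen for illustration; they are not taken from the LANID paper or its repository.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: the anchor should be closer to the
    positive than to the negative by at least `margin` (assumed value)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy utterance embeddings (hypothetical values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss_satisfied = triplet_loss(anchor, positive, negative)  # margin already met
loss_violated = triplet_loss(anchor, negative, positive)   # roles swapped
```

When the positive is already closer than the negative by more than the margin, the loss is zero and the triplet contributes no gradient; swapping the roles yields a positive loss.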

Modern large language models and chatbots based on them show impressive results in text generation and dialog tasks. At the same time, these models are subject to criticism in many aspects, e.g., they can generate hate speech and untrue and biased content. In this work, we show another problematic feature of such chatbots: they are echo chambers in the sense that they tend to agree with the opinions of their users. Social media platforms, such as Facebook, were criticized for a similar problem and called echo chambers. We experimentally test five LLM-based chatbots, which we feed with opinionated inputs. We annotate whether the chatbot answers agree or disagree with the input. All chatbots tend to agree. However, the echo chamber effect is not equally strong. We discuss the differences between the chatbots and make the dataset publicly available.

Collecting labeled datasets in finance is challenging due to the scarcity of domain experts and the high cost of employing them. While Large Language Models (LLMs) have demonstrated remarkable performance in data annotation tasks on general domain datasets, their effectiveness on domain-specific datasets remains under-explored. To address this gap, we investigate the potential of LLMs as efficient data annotators for extracting relations in financial documents. We compare the annotations produced by three LLMs (GPT-4, PaLM 2, and MPT Instruct) against expert annotators and crowdworkers. We demonstrate that the current state-of-the-art LLMs can be sufficient alternatives to non-expert crowdworkers. We analyze models using various prompts and parameter settings and find that customizing the prompts for each relation group by providing specific examples belonging to those groups is paramount. Furthermore, we introduce a reliability index (LLM-RelIndex) used to identify outputs that may require expert attention. Finally, we perform an extensive time, cost and error analysis and provide recommendations for the collection and usage of automated annotations in domain-specific settings.

Large Language Models for Generative Recommendation: A Survey and Visionary Discussions
Lei Li | Yongfeng Zhang | Dugang Liu | Li Chen

Large language models (LLMs) have not only revolutionized the field of natural language processing (NLP) but also have the potential to reshape many other fields, e.g., recommender systems (RS). However, most of the related work treats an LLM as a component of the conventional recommendation pipeline (e.g., as a feature extractor), which may not be able to fully leverage the generative power of LLMs. Instead of separating the recommendation process into multiple stages, such as score computation and re-ranking, this process can be simplified to one stage with an LLM: directly generating recommendations from the complete pool of items. This survey reviews the progress, methods, and future directions of LLM-based generative recommendation by examining three questions: 1) What generative recommendation is, 2) Why RS should advance to generative recommendation, and 3) How to implement LLM-based generative recommendation for various RS tasks. We hope that this survey can provide the context and guidance needed to explore this interesting and emerging topic.

Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling
Yida Mu | Chun Dong | Kalina Bontcheva | Xingyi Song

Topic modelling, as a well-established unsupervised technique, has found extensive use in automatically detecting significant topics within a corpus of documents. However, classic topic modelling approaches (e.g., LDA) have certain drawbacks, such as the lack of semantic understanding and the presence of overlapping topics. In this work, we investigate the untapped potential of large language models (LLMs) as an alternative for uncovering the underlying topics within extensive text corpora. To this end, we introduce a framework that prompts LLMs to generate topics from a given set of documents and establish evaluation protocols to assess the clustering efficacy of LLMs. Our findings indicate that LLMs with appropriate prompts can stand out as a viable alternative, capable of generating relevant topic titles and adhering to human guidelines to refine and merge topics. Through in-depth experiments and evaluation, we summarise the advantages and constraints of employing LLMs in topic extraction.

Latent vs Explicit Knowledge Representation: How ChatGPT Answers Questions about Low-Frequency Entities
Arianna Graciotti | Valentina Presutti | Rocco Tripodi

In this paper, we present an evaluation of two different approaches to the free-form Question Answering (QA) task. The main difference between the two approaches is that one is based on latent representations of knowledge, and the other uses explicit knowledge representation. For the evaluation, we developed DynaKnowledge, a new benchmark composed of questions concerning Wikipedia low-frequency entities. We wanted to ensure, on the one hand, that the questions are answerable and, on the other, that the models can provide information about very specific facts. The evaluation that we conducted highlights that the proposed benchmark is particularly challenging. The best model answers only 50% of the questions correctly. Analysing the results, we also found that ChatGPT shows low reliability on low-frequency entity questions, manifesting a popularity bias. On the other hand, a simpler model based on explicit knowledge is less affected by this bias. With this paper, we want to provide a living benchmark for open-form QA to test knowledge and latent representation models on a dynamic benchmark.

As LLMs have evolved, they have been endowed with impressive logical reasoning, or vertical thinking, capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses the model’s lateral thinking within an interactive framework. In our benchmark, we challenge LLMs on two aspects: (1) posing high-quality questions that break out of conventional norms but are beneficial for puzzle-solving, and (2) integrating existing information to gradually deduce the truth through reasoning. We observe that it is hard for most LLMs to accomplish lateral thinking during interactions. Even the most powerful LLM, GPT-4, faces challenges in achieving satisfactory performance, and for most open-source models, simply completing this task is quite difficult. This evaluation benchmark provides LLMs with a highly challenging and differentiating task that is crucial to an effective AI assistant. Our dataset and source code are available at https://github.com/THUKElab/LatEval.

Few-shot tasks require the model to have the ability to generalize from a few samples. However, due to the lack of cognitive ability, current works cannot fully utilize limited samples to expand the sample space and still suffer from overfitting issues. To address these problems, we propose an LLM-Augmented Unsupervised Contrastive Learning Framework (LA-UCL), which introduces a cognition-enabled Large Language Model (LLM) for efficient data augmentation, and presents corresponding contrastive learning strategies. Specifically, in the self-augmented contrastive learning module, we construct a retrieval-based in-context prompt scheme by retrieving similar but different category data from the original samples, guiding the LLM to generate more discriminative augmented data. Then, we design a group-level contrastive loss to enhance the model’s discriminative ability. In the external-augmented contrastive learning module, we utilize web knowledge retrieval to expand the sample space and leverage the LLM to generate more diverse data, and introduce a sample-level contrastive loss for unlabeled data to improve the model’s generalization. Experimental results on six datasets show that our model exceeds the baseline models.

Layer-wise Regularized Dropout for Neural Language Models
Shiwen Ni | Min Yang | Ruifeng Xu | Chengming Li | Xiping Hu

Among the various pre-trained neural language models that are popular today, dropout is already an indispensable regularization technique. To solve the inconsistency between training and inference caused by the randomness of dropout, some studies use consistency training to regularize dropout at the output layer. In this paper, we propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed for Transformer-based language models. Specifically, LR-Drop regularizes each Transformer layer using the consistency training strategy. Each training sample passes through the two siamese sub-models sampled by dropout, and then LR-Drop forces the hidden states, multi-head attention matrices, and output distribution of the two siamese sub-models to be consistent. The proposed LR-Drop can be regarded as a “self-distillation” framework, in which each sub-model generated by dropout is the other’s “teacher” model and “student” model. Through extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (a total of 15 datasets), we show that LR-Drop achieves superior performance, including state-of-the-art results.
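
A consistency term of the kind described above can be sketched as follows. The abstract does not state which distance functions LR-Drop uses, so the symmetric KL divergence for distributions and the mean squared error for hidden states are assumptions (both are common choices in consistency training), and all values are toy data.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Bidirectional KL, so each sub-model acts as both teacher and student."""
    return 0.5 * (kl(p, q) + kl(q, p))

def mse(h1, h2):
    """Mean squared error between two hidden-state vectors."""
    return sum((a - b) ** 2 for a, b in zip(h1, h2)) / len(h1)

# Output distributions of two dropout-sampled sub-models (hypothetical values).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

output_consistency = symmetric_kl(p, q)
hidden_consistency = mse([1.0, 2.0], [1.0, 4.0])
```

Both terms are zero when the two sub-models agree exactly and grow as their predictions or hidden states diverge.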

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding
Masato Fujitake

This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained significant attention due to their importance. Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure. However, these methods require fine-tuning for each task and dataset, and the models are expensive to train and operate. To overcome this limitation, we propose LayoutLLM, which integrates these methods with large language models (LLMs). By leveraging the strengths of existing research in document image understanding and LLMs’ superior language understanding capabilities, the proposed model, fine-tuned with multimodal instruction datasets, handles document image understanding tasks in a single model. Our experiments demonstrate improvement over the baseline model in various document analysis tasks.

LCGbank: A Corpus of Syntactic Analyses Based on Proof Nets
Aditya Bhargava | Timothy A. D. Fowler | Gerald Penn

In syntactic parsing, *proof nets* are graphical structures that have the advantageous property of invariance to spurious ambiguities. Semantically-equivalent derivations correspond to a single proof net. Recent years have seen fresh interest in statistical syntactic parsing with proof nets, including the development of methods based on neural networks. However, training of statistical parsers requires corpora that provide ground-truth syntactic analyses. Unfortunately, there has been a paucity of corpora in formalisms for which proof nets are applicable, such as Lambek categorial grammar (LCG), a formalism related to combinatory categorial grammar (CCG). To address this, we leverage CCGbank and the relationship between LCG and CCG to develop LCGbank, an English-language corpus of syntactic analyses based on LCG proof nets. In contrast to CCGbank, LCGbank eschews type-changing and uses only categorial rules; the syntactic analyses thus provide fully compositional semantics, exploiting the transparency between syntax and semantics that so characterizes categorial grammars.

LeadEmpathy: An Expert Annotated German Dataset of Empathy in Written Leadership Communication
Didem Sedefoglu | Allison Claire Lahnala | Jasmin Wagner | Lucie Flek | Sandra Ohly

Empathetic leadership communication plays a pivotal role in modern workplaces as it is associated with a wide range of positive individual and organizational outcomes. This paper introduces LeadEmpathy, an innovative expert-annotated German dataset for modeling empathy in written leadership communication. It features a novel theory-based coding scheme to model cognitive and affective empathy in asynchronous communication. The final dataset comprises 770 annotated emails from 385 participants who were allowed to rewrite their emails after receiving recommendations for increasing empathy in an online experiment. Two independent annotators achieved substantial inter-annotator agreement of >= .79 for all categories, indicating that the annotation scheme can be applied to produce high-quality, multidimensional empathy ratings in current and future applications. Beyond outlining the dataset’s development procedures, we present a case study on automatic empathy detection, establishing baseline models for predicting empathy scores in a range of ten possible scores that achieve a Pearson correlation of 0.816 and a mean squared error of 0.883. Our dataset is available at https://github.com/caisa-lab/LEAD-empathy-dataset.
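
The two evaluation metrics reported above, Pearson correlation and mean squared error, can be computed as in the following sketch. The gold and predicted scores are invented toy values, not data from the LeadEmpathy dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def mean_squared_error(xs, ys):
    """Average squared difference between gold and predicted scores."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical gold and predicted empathy scores.
gold = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 3.0, 4.0, 5.0]

r = pearson(gold, pred)
err = mean_squared_error(gold, pred)
```

In this toy case the predictions are perfectly correlated with the gold scores yet consistently off by one point, which is why reporting both metrics is informative.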

Learning Bidirectional Morphological Inflection like Humans
Akiyo Fukatsu | Yuto Harada | Yohei Oseki

For nearly forty years, there has been discussion regarding whether symbolic representations are involved in morphological inflection, a debate commonly known as the Past Tense Debate. The previous literature has extensively explored whether neural models, which do not use symbolic representations, can process morphological inflection like humans. However, current research interest has shifted towards whether neural models can acquire morphological inflection like humans. In this paper, we trained neural models, the recurrent neural network (RNN) with attention and the transformer, and a symbolic model, the Minimal Generalization Learner (MGL), under a human-like learning environment. Evaluating the models from the perspective of language acquisition, we found that while the transformer and the MGL exhibited some human-like characteristics, the RNN with attention did not demonstrate human-like behavior across all the evaluation metrics considered in this study. Furthermore, none of the models accurately inflected verbs in the same manner as humans in terms of morphological inflection direction. These results suggest that these models fall short as cognitive models of morphological inflection.

Learning from Wrong Predictions in Low-Resource Neural Machine Translation
Jia Cheng Hu | Roberto Cavicchioli | Giulia Berardinelli | Alessandro Capotondi

Resource scarcity in Neural Machine Translation is a challenging problem both in industry applications and in the support of less-spoken languages represented, in the worst case, by endangered and low-resource languages. Many Data Augmentation methods rely on additional linguistic sources and software tools, but these are often not available for less favoured languages. For this reason, we present USKI (Unaligned Sentences Keytokens pre-traIning), a pre-training strategy that leverages the relationships and similarities that exist between unaligned sentences. By doing so, we increase the dataset size of endangered and low-resource languages by the square of the initial quantity, matching the typical size of high-resource language datasets such as WMT14 En-Fr. Results showcase the effectiveness of our approach with an average increase of 0.9 BLEU across the benchmarks using a small fraction of the entire unaligned corpus, suggesting the importance of the research topic and the potential of a currently under-utilized resource and under-explored approach.
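
The quadratic growth mentioned above can be made concrete with a small sketch. It assumes that every source sentence may be paired with every target sentence, which is one reading of "unaligned sentences"; how USKI actually builds and uses such pairs is not specified in the abstract, and the sentence identifiers are placeholders.

```python
from itertools import product

# A tiny parallel corpus of n aligned sentence pairs (placeholder identifiers).
sources = ["src_1", "src_2", "src_3"]
targets = ["tgt_1", "tgt_2", "tgt_3"]

aligned_pairs = list(zip(sources, targets))     # n pairs
all_pairs = list(product(sources, targets))     # n * n pairs, aligned or not
unaligned_pairs = [pair for pair in all_pairs if pair not in aligned_pairs]
```

With n aligned pairs, the Cartesian product contains n² pairs, of which n² - n are unaligned; for a corpus of a few thousand sentences this already reaches millions of candidate pairs.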

Learning Intrinsic Dimension via Information Bottleneck for Explainable Aspect-based Sentiment Analysis
Zhenxiao Cheng | Jie Zhou | Wen Wu | Qin Chen | Liang He

Gradient-based explanation methods are increasingly used to interpret neural models in natural language processing (NLP) due to their high fidelity. Such methods determine word-level importance using dimension-level gradient values through a norm function, often presuming equal significance for all gradient dimensions. However, in the context of Aspect-based Sentiment Analysis (ABSA), our preliminary research suggests that only specific dimensions are pertinent. To address this, we propose the Information Bottleneck-based Gradient (IBG) explanation framework for ABSA. This framework leverages an information bottleneck to refine word embeddings into a concise intrinsic dimension, maintaining essential features and omitting unrelated information. Comprehensive tests show that our IBG approach considerably improves both the models’ performance and the explanations’ clarity by identifying sentiment-aware features.
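
The contrast between treating all gradient dimensions equally and keeping only some of them can be sketched as below. The binary `mask` is a hypothetical stand-in for the dimensions selected by an information bottleneck; it is not the IBG method itself, and the gradient values are toy numbers.

```python
import math

def word_importance(gradient, mask=None):
    """L2 norm of a word's embedding gradient.

    With mask=None every dimension counts equally (the usual assumption).
    A mask (hypothetical here) keeps only the dimensions deemed relevant.
    """
    if mask is None:
        mask = [1.0] * len(gradient)
    return math.sqrt(sum((m * g) ** 2 for g, m in zip(gradient, mask)))

# Toy gradient of the loss with respect to one word embedding.
gradient = [3.0, 4.0, 12.0]

full_score = word_importance(gradient)                     # all dimensions
masked_score = word_importance(gradient, [1.0, 1.0, 0.0])  # last dimension dropped
```

Here a single large but (by assumption) irrelevant dimension dominates the unmasked score, so masking it changes the word's importance substantially.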

Learning Strategies for Robust Argument Mining: An Analysis of Variations in Language and Domain
Ramon Ruiz-Dolz | Chr-Jr Chiu | Chung-Chi Chen | Noriko Kando | Hsin-Hsi Chen

Argument mining has typically been researched for specific corpora belonging to concrete languages and domains independently in each research work. Human argumentation, however, has domain- and language-dependent linguistic features that determine the content and structure of arguments. Also, when deploying argument mining systems in the wild, we might not be able to control some of these features. Therefore, an important aspect that has not been thoroughly investigated in the argument mining literature is the robustness of such systems to variations in language and domain. In this paper, we present a complete analysis across three different languages and three different domains that allows us to better understand how to leverage the scarce available corpora to design argument mining systems that are more robust to natural language variations.

Lemmatisation of Medieval Greek: Against the Limits of Transformer’s Capabilities?
Colin Swaelens | Pranaydeep Singh | Ilse de Vos | Els Lefever

This paper presents preliminary experiments for the lemmatisation of unedited, Byzantine Greek epigrams. This type of Greek is quite different from its classical ancestor, mostly because of its orthographic inconsistencies. Existing lemmatisation algorithms display an accuracy drop of around 30pp when tested on these Byzantine book epigrams. We conducted seven different lemmatisation experiments, which were either transformer-based or based on neural edit-trees. The best performing lemmatiser was a hybrid method combining transformer-based embeddings with a dictionary look-up. We compare our results with existing lemmatisers, and provide a detailed error analysis revealing why unedited, Byzantine Greek is so challenging for lemmatisation.

Recent work shows large language models can be prompted to generate useful rationales for commonsense question answering (CQA), which can improve the performance of both themselves and other models. However, deploying and further tuning the large models is relatively expensive. Some work explores distilling the rationale-generation ability into convenient small-sized models, yet it typically requires human-authored QA instances during the distillation. In this paper, we propose a novel framework that leverages both knowledge graphs and large language models to synthesize rationale-augmented CQA data. Based on it, we train Leros, a model that can generate helpful rationales to assist generic QA models to accomplish unseen CQA tasks. Empirical results demonstrate Leros can substantially enhance the performance of QA models on five unseen CQA benchmarks, providing better gains than both same-sized counterpart models trained with downstream data and 10x larger language models. Our work reveals a novel way to integrate knowledge from both knowledge graphs and large language models into smaller models. The code and synthesized resources are publicly available at https://github.com/wchrepo/leros.

Bilingual dictionaries present several challenges, especially for sign languages and oral languages, where multimodality plays a role. We deployed and tested the first bilingual Peruvian Sign Language (LSP) - Spanish Online Dictionary. The first feature allows the user to enter a text and receive as a result a list of videos whose glosses are related to the input text or Spanish word. The second feature allows the user to sign in front of the camera and shows the five most probable Spanish translations based on the similarity between the input sign and gloss-labeled sign videos used to train a machine learning model. These features are built on a design and architecture that distinguishes among matches on the searched Spanish text, the sign gloss, and the Spanish translation. We explain in depth how these concepts or database columns impact the search. Similarly, we share the challenges of deploying a real-world machine learning model for isolated sign language recognition through Amazon Web Services (AWS).

Aspect-Based Sentiment Analysis (ABSA) stands as a crucial task in predicting the sentiment polarity associated with identified aspects within text. However, a notable challenge in ABSA lies in precisely determining the aspects’ boundaries (start and end indices), especially for long ones, due to users’ colloquial expressions. We propose DiffusionABSA, a novel diffusion model tailored for ABSA, which extracts the aspects progressively step by step. Particularly, DiffusionABSA gradually adds noise to the aspect terms in the training process, subsequently learning a denoising process that progressively restores these terms in a reverse manner. To estimate the boundaries, we design a denoising neural network enhanced by a syntax-aware temporal attention mechanism to chronologically capture the interplay between aspects and surrounding text. Empirical evaluations conducted on eight benchmark datasets underscore the compelling advantages offered by DiffusionABSA when compared against robust baseline models. Our code is publicly available at https://github.com/Qlb6x/DiffusionABSA.

Thanks to the development of pre-trained sequence-to-sequence (seq2seq) models (e.g., BART), recent studies on AMR parsing often regard this task as a seq2seq translation problem by linearizing AMR graphs into AMR token sequences in pre-processing and recovering AMR graphs from sequences in post-processing. Seq2seq AMR parsing is a relatively simple paradigm, but it unavoidably loses structural information among AMR tokens. To compensate for the loss of structural information, in this paper we explicitly leverage AMR structure in the decoding phase. Given an AMR graph, we first project the structure in the graph into an AMR token graph, i.e., structure among AMR tokens in the linearized sequence. The structures for an AMR token can be divided into two parts: structure in the prediction history and structure in the future. Then we propose to model structure in the prediction history via a graph attention network (GAT) and learn structure in the future via a multi-task scheme. Experimental results show that our approach significantly outperforms a strong baseline and achieves Smatch scores of 85.5 ± 0.1 and 84.2 ± 0.1 on AMR 2.0 and AMR 3.0, respectively.

pdf abs
Leveraging Domain Corpora for Enhanced Terminology: The Case of Estonian-English Remote Sensing Termbase
Liisi Jakobson | Jelena Kallas | Erko Jakobson

This article addresses methodological issues related to developing domain corpora and a terminological database from scratch. We present an ongoing project focused on creating an Estonian-English Remote Sensing Termbase. First, we describe the compilation process of the Estonian Remote Sensing Corpus 2022, which served as the primary data source for the termbase. The corpus was compiled by crawling the web and adding files using the Corpus Query System Sketch Engine (Kilgarriff et al., 2004). In the next step, we employed the Term Extraction module (Kilgarriff et al., 2014; Fišer et al., 2016; Blahuš et al., 2023) to identify terms, which were subsequently registered in the Estonian Remote Sensing Termbase using the Dictionary Writing System Ekilex (Tavast et al., 2018). For each term, we provided definitions, variants, and usage contexts. In the final stage, remote sensing experts reviewed and edited the terms, their variants, and usage contexts. Finally, we provide insights and outline directions for future work in this area.

pdf abs
Leveraging Information Redundancy of Real-World Data through Distant Supervision
Ariel Cohen | Alexandrine Lanson | Emmanuelle Kempf | Xavier Tannier

We explore the task of event extraction and classification by harnessing the power of distant supervision. We present a novel text labeling method that leverages the redundancy of temporal information in a data lake. This method enables the creation of a large programmatically annotated corpus, allowing the training of transformer models using distant supervision. This aims to reduce expert annotation time, a scarce and expensive resource. Our approach utilizes temporal redundancy between structured sources and text, enabling the design of a replicable framework applicable to diverse real-world databases and use cases. We employ this method to create multiple silver datasets to reconstruct key events in cancer patients’ pathways, using clinical notes from a cohort of 380,000 oncological patients. By employing various noise label management techniques, we validate our end-to-end approach and compare it with a baseline classifier built on expert-annotated data. The implications of our work extend to accelerating downstream applications, such as patient recruitment for clinical trials, treatment effectiveness studies, survival analysis, and epidemiology research. While our study showcases the potential of the method, there remain avenues for further exploration, including advanced noise management techniques, semi-supervised approaches, and a deeper understanding of biases in the generated datasets and models.

pdf abs
Leveraging Linguistically Enhanced Embeddings for Open Information Extraction
Fauzan Nayeem Farooqui | Thanmay Jayakumar | Pulkit Mathur | Mansi A. Radke

Open Information Extraction (OIE) is a structure prediction (SP) task in Natural Language Processing (NLP) that aims to extract structured n-ary tuples (usually subject-relation-object triples) from free text. The word embeddings in the input text can be enhanced with linguistic features, usually Part-of-Speech (PoS) and Syntactic Dependency Parse (SynDP) labels. However, past enhancement techniques cannot leverage the power of pre-trained language models (PLMs), which themselves have been hardly used for OIE. To bridge this gap, we are the first to leverage linguistic features with a Seq2Seq PLM for OIE. We do so by introducing two methods: Weighted Addition and Linearized Concatenation. Our work gives any neural OIE architecture the key performance boost from both PLMs and linguistic features in one go. In our settings, this shows wide improvements of up to 24.9%, 27.3% and 14.9% on Precision, Recall and F1 scores respectively over the baseline. Beyond this, we address other important challenges in the field: to reduce compute overheads with the features, we are the first to exploit Semantic Dependency Parse (SemDP) tags; to address flaws in current datasets, we create a clean synthetic dataset; finally, we contribute the first known study of OIE behaviour in SP models.

Counter-narrative generation, i.e., the generation of fact-based responses to hate speech with the aim of correcting discriminatory beliefs, has been demonstrated to be an effective method to combat hate speech. However, its effectiveness is limited by the resource-intensive nature of dataset construction, and existing work focuses only on the primary language. To alleviate this problem, we propose Korean Hate Speech Counter Punch (KHSCP), a cost-effective counter-narrative generation method for the Korean language. To this end, we release the first counter-narrative generation dataset in Korean and pose two research questions. Under these questions, we propose an effective augmentation method and investigate whether a large language model can overcome data scarcity in low-resource environments by leveraging existing resources. We conduct several experiments to verify the effectiveness of the proposed method. Our results reveal that applying pre-existing resources can improve generation performance by a significant margin. Through in-depth analysis of these experiments, this work suggests that the challenges of generating counter-narratives in low-resource environments can be overcome.

pdf abs
Leveraging Social Context for Humor Recognition and Sense of Humor Evaluation in Social Media with a New Chinese Humor Corpus - HumorWB
Zeyuan Zeng | Zefeng Li | Liang Yang | Hongfei Lin

With the development of the Internet, social media has produced a large amount of user-generated data, which brings new challenges for humor computing. Traditional humor computing research mainly focuses on the content, while neglecting the information of interaction relationships in social media. In addition, both content and users are important in social media, while existing humor computing research mainly focuses on content rather than people. To address these problems, we model the information transfer and entity interactions in social media as a heterogeneous graph, and create HumorWB, the first dataset to introduce social context information, collected from the Chinese social media platform Weibo. Two humor-related tasks are designed in the dataset. One is a content-oriented humor recognition task, and the other is a novel humor evaluation task. For the above tasks, we propose a graph-based model called SCOG, which uses heterogeneous graph neural networks to optimize node representation for downstream tasks. Experimental results demonstrate the effectiveness of feature extraction and graph representation learning methods in the model, as well as the necessity of introducing social context information.

African American English (AAE) has received recent attention in the field of natural language processing (NLP). Efforts to address bias against AAE in NLP systems tend to focus on lexical differences. When the unique structures of AAE are considered, the solution is often to remove or neutralize the differences. This work leverages knowledge about the unique linguistic structures to improve automatic disambiguation of habitual and non-habitual meanings of “be” in naturally produced AAE transcribed speech. Both meanings are employed in AAE but examples of Habitual be are rare in already limited AAE data. Generally, representing additional syntactic information improves semantic disambiguation of habituality. Using an ensemble of classical machine learning models with a representation of the unique POS and dependency patterns of Habitual be, we show that integrating syntactic information improves the identification of habitual uses of “be” by about 65 F1 points over a simple baseline model of n-grams, and as much as 74 points. The success of this approach demonstrates the potential impact when we embrace, rather than neutralize, the structural uniqueness of African American English.

pdf abs
Leveraging the Interplay between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation
Yejin Jeon | Yunsu Kim | Gary Geunbae Lee

Contemporary neural speech synthesis models have indeed demonstrated remarkable proficiency in synthetic speech generation as they have attained a level of quality comparable to that of human-produced speech. Nevertheless, it is important to note that these achievements have predominantly been verified within the context of high-resource languages such as English. Furthermore, the Tacotron and FastSpeech variants show substantial pausing errors when applied to the Korean language, which affects speech perception and naturalness. In order to address the aforementioned issues, we propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns. Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences, despite its training on short audio clips. Architectural design choices are validated through comparisons with baseline models and ablation studies using subjective and objective metrics, thus confirming model performance.

pdf abs
LexAbSumm: Aspect-based Summarization of Legal Decisions
Santosh T.y.s.s. | Mahmoud Aly | Matthias Grabmair

Legal professionals frequently encounter long legal judgments that hold critical insights for their work. While recent advances have led to automated summarization solutions for legal documents, they typically provide generic summaries, which may not meet the diverse information needs of users. To address this gap, we introduce LexAbSumm, a novel dataset designed for aspect-based summarization of legal case decisions, sourced from the European Court of Human Rights jurisdiction. We evaluate several abstractive summarization models tailored for longer documents on LexAbSumm, revealing a challenge in conditioning these models to produce aspect-specific summaries. We release LexAbSumm to facilitate research in aspect-based summarization for the legal domain.

pdf abs
LexComSpaL2: A Lexical Complexity Corpus for Spanish as a Foreign Language
Jasper Degraeuwe | Patrick Goethals

We present LexComSpaL2, a novel corpus which can be employed to train personalised word-level difficulty classifiers for learners of Spanish as a foreign/second language (L2). The dataset contains 2,240 in-context target words with the corresponding difficulty judgements of 26 Dutch-speaking students who are learning Spanish as an L2, resulting in a total of 58,240 annotations. The target words are divided over 200 sentences from 4 different domains (economics, health, law, and migration) and have been selected based on their suitability to be included in L2 learning materials. As our annotation scheme, we use a customised version of the 5-point lexical complexity prediction scale (Shardlow et al., 2020), tailored to the vocabulary knowledge continuum (which ranges from no knowledge over receptive mastery to productive mastery; Schmitt, 2019). With LexComSpaL2, we aim to address the lack of relevant data for multi-category difficulty prediction at word level for L2 learners of languages other than English.
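The corpus figures quoted in the abstract above can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers stated in the abstract; the variable names are illustrative, and the per-sentence average is a derived figure that the abstract does not itself report.

```python
# Figures taken directly from the abstract above.
target_words = 2240   # in-context target words
annotators = 26       # Dutch-speaking students
sentences = 200       # sentences the target words are divided over

# Every annotator judges every target word once.
total_annotations = target_words * annotators

# Derived figure (not stated in the abstract): average target words per sentence.
words_per_sentence = target_words / sentences

print(total_annotations)   # 58240, matching the reported 58,240 annotations
print(words_per_sentence)  # 11.2
```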

pdf abs
LexDrafter: Terminology Drafting for Legislative Documents Using Retrieval Augmented Generation
Ashish Chouhan | Michael Gertz

With the increase in legislative documents at the EU, the number of new terms and their definitions is increasing as well. As per the Joint Practical Guide of the European Parliament, the Council and the Commission, terms used in legal documents shall be consistent, and identical concepts shall be expressed without departing from their meaning in ordinary, legal, or technical language. Thus, while drafting a new legislative document, having a framework that provides insights about existing definitions and helps define new terms based on a document’s context will support such harmonized legal definitions across different regulations and thus avoid ambiguities. In this paper, we present LexDrafter, a framework that assists in drafting Definitions articles for legislative documents using retrieval augmented generation (RAG) and existing term definitions present in different legislative documents. For this, definition elements are built by extracting definitions from existing documents. Using definition elements and RAG, a Definitions article can be suggested on demand for a legislative document that is being drafted. We demonstrate and evaluate the functionality of LexDrafter using a collection of EU documents from the energy domain. The code for the LexDrafter framework is available at https://github.com/achouhan93/LexDrafter.

pdf abs
LexiVault: A Repository for Psycholinguistic Lexicons of Lesser-studied Languages
Hind Saddiki | Samantha Wray | Daisy Li

This paper presents LexiVault, an open-source web tool with annotated lexicons and rich retrieval capabilities primarily developed for, but not restricted to, the support of psycholinguistic research with key measures to design stimuli for low-resource languages. Psycholinguistic research relies on human responses to carefully crafted stimuli for a better understanding of the mechanisms by which we learn, store and process language. Stimuli design captures specific language properties such as frequency, morphological complexity, or stem likelihood in a part of speech, typically derived from a corpus that is representative of the average speaker’s linguistic experience. These measures are more readily available for well-resourced languages, whereas efforts for lesser-studied languages come with substantial overhead for the researcher to build corpora and calculate these measures from scratch. This stumbling block widens the gap, further skewing our modeling of the mental architecture of linguistic processing towards a small, over-represented set of the world’s languages. To lessen this burden, we designed LexiVault to be user friendly and accommodate incremental growth of new and existing low-resource language lexicons in the system through moderated community contributions while abstracting programming complexity to foster more interest from the psycholinguistics community in exploring low-resource languages.

pdf abs
LFED: A Literary Fiction Evaluation Dataset for Large Language Models
Linhao Yu | Qun Liu | Deyi Xiong

The rapid evolution of large language models (LLMs) has ushered in the need for comprehensive assessments of their performance across various dimensions. In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which aims to evaluate the capability of LLMs on long fiction comprehension and reasoning. We collect 95 works of literary fiction that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. Additionally, we conduct an in-depth analysis to ascertain how specific attributes of literary fiction (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations. Through a series of experiments involving various state-of-the-art LLMs, our findings reveal that these models face considerable challenges in effectively addressing questions related to literary fiction, with ChatGPT reaching only 57.08% under the zero-shot setting. The dataset will be publicly available at https://github.com/tjunlp-lab/LFED.git.

pdf abs
LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models
Chuang Liu | Renren Jin | Yuqi Ren | Deyi Xiong

Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications. However, the existing benchmarks for comprehensively evaluating these LLMs are still insufficient, particularly in terms of measuring knowledge that LLMs capture. Current datasets collect questions from Chinese examinations across different subjects and educational levels to address this issue. Yet, these benchmarks primarily focus on objective questions such as multiple-choice questions, leading to a lack of diversity in question types. To tackle this problem, we propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark in this paper. LHMKE is designed to provide a comprehensive evaluation of the knowledge acquisition capabilities of Chinese LLMs. It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams. Notably, LHMKE includes both objective and subjective questions, offering a more holistic evaluation of the knowledge level of LLMs. We have assessed 11 Chinese LLMs under the zero-shot setting, which aligns with real examinations, and compared their performance across different subjects. We also conduct an in-depth analysis to check whether GPT-4 can automatically score subjective predictions. Our findings suggest that LHMKE is a challenging and advanced testbed for Chinese LLMs.

pdf abs
LI4: Label-Infused Iterative Information Interacting Based Fact Verification in Question-answering Dialogue
Xiaocheng Zhang | Chang Wang | Guoping Zhao | Xiaohong Su

Fact verification constitutes a pivotal application in the effort to combat the dissemination of disinformation, a concern that has recently garnered considerable attention. However, previous studies in the field of fact verification, particularly those focused on question-answering dialogue, have exhibited limitations, such as failing to fully exploit the potential of question structures and ignoring relevant label information during the verification process. In this paper, we introduce Label-Infused Iterative Information Interacting (LI4), a novel approach designed for the task of question-answering dialogue based fact verification. LI4 consists of two meticulously designed components, namely the Iterative Information Refining and Filtering Module (IIRF) and the Fact Label Embedding Module (FLEM). The IIRF uses the Interactive Gating Mechanism to iteratively filter out the noise of question and evidence, concurrently refining the claim information. The FLEM is conceived to strengthen the understanding ability of the model towards labels by injecting label knowledge. We evaluate the performance of the proposed LI4 on HEALTHVER, FAVIQ, and COLLOQUIAL. The experimental results confirm that our LI4 model attains remarkable progress, manifesting as a new state-of-the-art performance.

This paper studies vision-language (V&L) pre-training for deep cross-modal representations. Recently, pre-trained V&L models have shown great success in V&L tasks. However, most existing models apply multi-modal encoders to encode the image and text, at the cost of high training complexity because of the input sequence length. In addition, they suffer from noisy training corpora caused by V&L mismatching. In this work, we propose a lightweight vision-language pre-training (LightVLP) for efficient and effective V&L pre-training. First, we design a new V&L framework with two autoencoders. Each autoencoder involves an encoder, which only takes in unmasked tokens (removes masked ones), as well as a lightweight decoder that reconstructs the masked tokens. Besides, we mask and remove large portions of input tokens to accelerate the training. Moreover, we propose a gated interaction mechanism to cope with noise in aligned image-text pairs. As for a matched image-text pair, the model tends to apply cross-modal representations for reconstructions. By contrast, for an unmatched pair, the model conducts reconstructions mainly using uni-modal representations. Benefiting from the above-mentioned designs, our base model shows competitive results compared to ALBEF while saving 44% FLOPs. Further, we compare our large model with ALBEF under the setting of similar FLOPs on six datasets and show the superiority of LightVLP. In particular, our model achieves 2.2% R@1 gains on COCO Text Retrieval and 1.1% on refCOCO+.
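The abstract above describes an encoder that sees only unmasked tokens while a lightweight decoder reconstructs the masked ones. The following is a minimal sketch of that general mask-and-remove idea in plain Python, not the LightVLP implementation: the function name `mask_and_remove`, the mask ratio of 0.5, and the toy token list are all assumptions made for illustration.

```python
import random


def mask_and_remove(tokens, mask_ratio, rng):
    """Split tokens into visible (encoder input) and masked (reconstruction targets).

    Positions are kept alongside tokens so a decoder could restore the order.
    This is an illustrative sketch; the mask ratio is a hypothetical parameter.
    """
    n_masked = int(len(tokens) * mask_ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    visible = [(i, t) for i, t in enumerate(tokens) if i not in masked_positions]
    targets = [(i, t) for i, t in enumerate(tokens) if i in masked_positions]
    return visible, targets


# Toy input; a real model would operate on image patches or subword tokens.
tokens = ["a", "dog", "runs", "on", "the", "beach", "at", "dusk"]
visible, targets = mask_and_remove(tokens, mask_ratio=0.5, rng=random.Random(0))

# The encoder would process only `visible`, so its input is half as long here.
print(len(visible), len(targets))
```

Because the masked tokens are dropped rather than replaced by a placeholder, the encoder input shrinks in proportion to the mask ratio, which is the source of the training speed-up the abstract refers to.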

Neural text generation is receiving broad attention with the publication of new tools such as ChatGPT. The main reason is that the generated text is of such quality that a human evaluator may attribute it to a human writer. In this paper, we propose a new corpus in French and English for the task of recognising automatically generated texts and we conduct a study of how humans perceive the text. Our results show, in line with work from before the ChatGPT era, that texts generated by tools such as ChatGPT share some common characteristics but are not clearly identifiable, which leads to different perceptions of these texts.

Event Coreference Resolution (ECR) as a pairwise mention classification task is expensive both for automated systems and manual annotations. The task’s quadratic difficulty is exacerbated when using Large Language Models (LLMs), making prompt engineering for ECR prohibitively costly. In this work, we propose a graphical representation of events, X-AMR, anchored around individual mentions using a cross-document version of Abstract Meaning Representation. We then linearize the ECR with a novel multi-hop coreference algorithm over the event graphs. The event graphs simplify ECR, making it a) LLM cost-effective, b) compositional and interpretable, and c) easily annotated. For a fair assessment, we first enrich an existing ECR benchmark dataset with these event graphs using an annotator-friendly tool we introduce. Then, we employ GPT-4, the newest LLM by OpenAI, for these annotations. Finally, using the ECR algorithm, we assess GPT-4 against humans and analyze its limitations. Through this research, we aim to advance the state-of-the-art for efficient ECR and shed light on the potential shortcomings of current LLMs at this task. Code and annotations: https://github.com/ahmeshaf/gpt_coref
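The quadratic cost of pairwise ECR mentioned in the abstract above follows from the number of unordered mention pairs, n(n-1)/2 for n mentions. The sketch below illustrates this count in Python; the function name and the toy mention identifiers are illustrative and do not come from the paper's code.

```python
from itertools import combinations


def count_mention_pairs(n_mentions: int) -> int:
    """Number of unordered mention pairs a pairwise classifier must score."""
    return n_mentions * (n_mentions - 1) // 2


# Sanity check of the closed form against explicit enumeration on toy mentions.
mentions = ["m1", "m2", "m3", "m4"]
pairs = list(combinations(mentions, 2))
assert len(pairs) == count_mention_pairs(len(mentions))

# The pair count grows quadratically with the number of mentions.
for n in (10, 100, 1000):
    print(n, count_mention_pairs(n))
# 10 45
# 100 4950
# 1000 499500
```

If each pair requires its own LLM prompt, the number of calls grows at the same rate, which is why the abstract describes prompt-based pairwise ECR as prohibitively costly.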

pdf abs
LinguaMeta: Unified Metadata for Thousands of Languages
Sandy Ritchie | Daan van Esch | Uche Okonkwo | Shikhar Vashishth | Emily Drummond

We introduce LinguaMeta, a unified resource for language metadata for thousands of languages, including language codes, names, number of speakers, writing systems, countries, official status, coordinates, and language varieties. The resources are drawn from various existing repositories and supplemented with our own research. Each data point is tagged for its origin, allowing us to easily trace back to and improve existing resources with more up-to-date and complete metadata. The resource is intended for use by researchers and organizations who aim to extend technology to thousands of languages.

pdf abs
Linguistic Knowledge Can Enhance Encoder-Decoder Models (If You Let It)
Alessio Miaschi | Felice Dell’Orletta | Giulia Venturi

In this paper, we explore the impact of augmenting pre-trained Encoder-Decoder models, specifically T5, with linguistic knowledge for the prediction of a target task. In particular, we investigate whether fine-tuning a T5 model on an intermediate task that predicts structural linguistic properties of sentences modifies its performance in the target task of predicting sentence-level complexity. Our study encompasses diverse experiments conducted on Italian and English datasets, employing both monolingual and multilingual T5 models at various sizes. Results obtained for both languages and in cross-lingual configurations show that linguistically motivated intermediate fine-tuning generally has a positive impact on target task performance, especially when applied to smaller models and in scenarios with limited data availability.

pdf abs
Linguistic Nudges and Verbal Interaction with Robots, Smart-Speakers, and Humans
Natalia Kalashnikova | Ioana Vasilescu | Laurence Devillers

This paper describes a data collection methodology and emotion annotation of dyadic interactions between a human and a Pepper robot, a Google Home smart-speaker, or another human. The collected 16 hours of audio recordings were used to analyze the propensity to change someone’s opinions about ecological behavior with respect to the type of conversational agent, the kind of nudges, and the speaker’s emotional state. We describe the statistics of data collection and annotation. We also report the first results, which showed that humans change their opinions on more questions with a human than with a device, even against mainstream ideas. We observe a correlation between a certain emotional state and the interlocutor and a human’s propensity to be influenced. We also report the results of the studies that investigated the effect of human likeness on speech using our data.

pdf abs
Linguistic Rule Induction Improves Adversarial and OOD Robustness in Large Language Models
Shuoran Jiang | Qingcai Chen | Yang Xiang | Youcheng Pan | Yukang Lin

Ensuring robustness is especially important when AI is deployed in responsible or safety-critical environments. ChatGPT can perform brilliantly in both adversarial and out-of-distribution (OOD) robustness, while other popular large language models (LLMs), like LLaMA-2, ERNIE and ChatGLM, do not perform satisfactorily in this regard. Therefore, it is valuable to study what efforts play essential roles in ChatGPT, and how to transfer these efforts to other LLMs. This paper experimentally finds that linguistic rule induction is the foundation for identifying the cause-effect relationships in LLMs. For LLMs, accurately processing the cause-effect relationships improves their adversarial and OOD robustness. Furthermore, we explore a low-cost way of aligning LLMs with linguistic rules. Specifically, we construct a linguistic rule instruction dataset to fine-tune LLMs. To further energize LLMs for reasoning step-by-step with the linguistic rule, we construct the task-relevant LingR-based chain-of-thoughts. Experiments show that LingR-induced LLaMA-13B achieves results comparable to or better than those of GPT-3.5 and GPT-4 on various adversarial and OOD robustness evaluations.

pdf abs
Linguistic Survey of India and Polyglotta Africana: Two Retrostandardized Digital Editions of Large Historical Collections of Multilingual Wordlists
Robert Forkel | Johann-Mattis List | Christoph Rzymski | Guillaume Segerer

The Linguistic Survey of India (LSI) and the Polyglotta Africana (PA) are two of the largest historical collections of multilingual wordlists. While the originally printed editions have long since been digitized and shared in various forms, no editions in which the original data is presented in standardized form, comparable with contemporary wordlist collections, have been produced so far. Here we present digital retro-standardized editions of both sources. For maximal interoperability with datasets such as Lexibank the two datasets have been converted to CLDF, the standard proposed by the Cross-Linguistic Data Formats initiative. In this way, an unambiguous identification of the three main constituents of wordlist data – language, concept and segments used for transcription – is ensured through links to the respective reference catalogs, Glottolog, Concepticon and CLTS. At this level of interoperability, legacy material such as LSI and PA may provide a reasonable complementary source for language documentation, filling in gaps where original documentation is not possible anymore.
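The abstract above names three constituents of wordlist data (language, concept, and transcription segments), each linked to a reference catalog. The Python sketch below models that linking with small data classes. The field names loosely follow common CLDF Wordlist conventions as an assumption, and every identifier and value shown is a hypothetical placeholder rather than real data from LSI, PA, Glottolog, Concepticon, or CLTS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    id: str
    name: str
    glottocode: str       # link to the Glottolog catalog


@dataclass(frozen=True)
class Concept:
    id: str
    gloss: str
    concepticon_id: str   # link to the Concepticon catalog


@dataclass(frozen=True)
class Form:
    id: str
    language_id: str      # points at Language.id
    concept_id: str       # points at Concept.id
    value: str            # form as given in the historical source
    segments: tuple       # standardized sound segments (CLTS-style)


# Hypothetical placeholder records, for illustration only.
languages = {"lang-1": Language("lang-1", "Example Language", "exam1234")}
concepts = {"concept-1": Concept("concept-1", "WATER", "0000")}
form = Form("form-1", "lang-1", "concept-1", "pani", ("p", "a", "n", "i"))


def resolve(form, languages, concepts):
    """Follow a form's links to its language and concept catalog identifiers."""
    lang = languages[form.language_id]
    concept = concepts[form.concept_id]
    return lang.glottocode, concept.concepticon_id, form.segments


print(resolve(form, languages, concepts))
```

Keeping the catalog identifiers on the language and concept records, rather than on every form, means a single correction to a language or concept link propagates to all forms that reference it.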

Recently, it has been discovered that incorporating structure information (e.g., dependency trees) can improve the performance of aspect-based sentiment analysis (ABSA). The structure information is often obtained from off-the-shelf parsers, which are sub-optimal and unwieldy. Therefore, adaptively inducing task-specific structures is helpful in resolving this issue. In this work, we concentrate on adaptive graph structure induction for ABSA and explore the impact of neuron-level manipulation from a spectral perspective on structure induction. Specifically, we consider word representations from PLMs (pre-trained language models) as node features and employ a graph learning module to adaptively generate adjacency matrices, followed by graph neural networks (GNNs) to capture both node features and structural information. Meanwhile, we propose the Neuron Filtering (NeuLT), a method to conduct neuron-level manipulations on word representations in the frequency domain. We conduct extensive experiments on three public datasets to observe the impact of NeuLT on structure induction and ABSA. The results and further analysis demonstrate that performing neuron-level manipulation through NeuLT can shorten the Aspects-sentiment Distance of induced structures and help improve the performance of ABSA. Our method achieves or comes close to SOTA (state-of-the-art) performance.

pdf abs
Linking Judgement Text to Court Hearing Videos: UK Supreme Court as a Case Study
Hadeel Saadany | Constantin Orasan | Sophie Walker | Catherine Breslin

One of the most important archived legal materials in the UK is the video recordings of Supreme Court hearings and their corresponding judgements. The impact of Supreme Court published material extends far beyond the parties involved in any given case as it provides landmark rulings on points of law of the greatest public and constitutional importance. Typically, transcripts of legal hearings are lengthy, making it time-consuming for legal professionals to analyse crucial arguments. This study focuses on summarising the second phase of a collaborative research-industrial project aimed at creating an automatic tool designed to connect sections of written judgements with relevant moments in Supreme Court hearing videos, streamlining access to critical information. Acting as a User-Interface (UI) platform, the tool enhances access to justice by pinpointing significant moments in the videos, aiding in comprehension of the final judgement. We make available the initial dataset of judgement-hearing pairs for legal Information Retrieval research, and elucidate our use of AI generative technology to enhance it. Additionally, we demonstrate how fine-tuning GPT text embeddings to our dataset optimises accuracy for an automated linking system tailored to the legal domain.

pdf abs
Linking Named Entities in Diderot’s Encyclopédie to Wikidata
Pierre Nugues

Diderot’s Encyclopédie is a reference work from the XVIIIth century in Europe that aimed at collecting the knowledge of its era. Wikipedia has the same ambition with a much greater scope. However, the lack of digital connection between the two encyclopedias may hinder their comparison and the study of how knowledge has evolved. A key element of Wikipedia is Wikidata, which backs the articles with a graph of structured data. In this paper, we describe the annotation of more than 9,100 of the Encyclopédie entries with Wikidata identifiers, enabling us to connect these entries to the graph. We considered geographic and human entities. The Encyclopédie does not contain biographic entries as they mostly appear as subentries of locations. We extracted all the geographic entries and we completely annotated all the entries containing a description of human entities. This represents more than 2,600 links referring to locations or human entities. In addition, we annotated more than 8,300 entries having a geographic content only. We describe the annotation process as well as application examples. This resource is available at https://github.com/pnugues/encyclopedie_1751.

pdf abs
Little Red Riding Hood Goes around the Globe: Crosslingual Story Planning and Generation with Large Language Models
Evgeniia Razumovskaia | Joshua Maynez | Annie Louis | Mirella Lapata | Shashi Narayan

Previous work has demonstrated the effectiveness of planning for story generation exclusively in a monolingual setting focusing primarily on English. We consider whether planning brings advantages to automatic story generation across languages. We propose a new task of crosslingual story generation with planning and present a new dataset for this task. We conduct a comprehensive study of different plans and generate stories in several languages, by leveraging the creative and reasoning capabilities of large pretrained language models. Our results demonstrate that plans which structure stories into three acts lead to more coherent and interesting narratives, while allowing explicit control over their content and structure.

LlamaCare: An Instruction Fine-Tuned Large Language Model for Clinical NLP
Rumeng Li | Xun Wang | Hong Yu

Large language models (LLMs) have shown remarkable abilities in generating natural texts for various tasks across different domains. However, applying LLMs to clinical settings still poses significant challenges, as it requires specialized knowledge and vocabulary, as well as reliability. In this work, we propose a novel method of instruction fine-tuning for adapting LLMs to the clinical domain, which leverages the instruction-following capabilities of LLMs and the availability of diverse real-world data sources. We generate instructions, inputs, and outputs covering a wide spectrum of clinical services, from primary care to nursing, radiology, physician, and social work, and use them to fine-tune LLMs. We evaluated the fine-tuned LLM, LlamaCare, on various clinical tasks, such as generating discharge summaries, predicting mortality and length of stay, and more. Using both automatic and human metrics, we demonstrated that LlamaCare surpasses other LLM baselines in predicting clinical outcomes and producing more accurate and coherent clinical texts. We also discuss the challenges and limitations of LLMs that need to be addressed before they can be widely adopted in clinical settings.

Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness
Xincan Feng | Akifumi Yoshimoto

Recent advancements in Natural Language Processing (NLP) have seen Large-scale Language Models (LLMs) excel at producing high-quality text for various purposes. Notably, in Text-To-Speech (TTS) systems, the integration of BERT for semantic token generation has underscored the importance of semantic content in producing coherent speech outputs. Despite this, the specific utility of LLMs in enhancing TTS synthesis remains considerably limited. This research introduces an innovative approach, Llama-VITS, which enhances TTS synthesis by enriching the semantic content of text using an LLM. Llama-VITS integrates semantic embeddings from Llama2 with the VITS model, a leading end-to-end TTS framework. By leveraging Llama2 for the primary speech synthesis process, our experiments demonstrate that Llama-VITS matches the naturalness of the original VITS (ORI-VITS) and of the variant that incorporates BERT (BERT-VITS) on the LJSpeech dataset, a substantial collection of neutral, clear speech. Moreover, our method significantly enhances emotive expressiveness on the EmoV_DB_bea_sem dataset, a curated selection of emotionally consistent speech from the EmoV_DB dataset, highlighting its potential to generate emotive speech.

LLMR: Knowledge Distillation with a Large Language Model-Induced Reward
Dongheng Li | Yongchang Hao | Lili Mou

Large language models have become increasingly popular and demonstrated remarkable performance in various natural language processing (NLP) tasks. However, these models are typically computationally expensive and difficult to deploy in resource-constrained environments. In this paper, we propose LLMR, a novel knowledge distillation (KD) method based on a reward function induced from large language models. We conducted experiments on multiple datasets in the dialogue generation and summarization tasks. Empirical results demonstrate that our LLMR approach consistently outperforms traditional KD methods across different tasks and datasets.

LLMSegm: Surface-level Morphological Segmentation Using Large Language Model
Marko Pranjić | Marko Robnik-Šikonja | Senja Pollak

Morphological word segmentation splits a given word into its morphemes (roots and affixes), the smallest meaning-bearing units of language. We introduce a novel approach, called LLMSegm, to surface-level morphological segmentation leveraging large language models (LLMs). The proposed approach is applicable in low-data settings as well as for low-resourced languages. We show how to transform the surface-level morphological segmentation task into a binary classification problem and train LLMs to solve it efficiently. For input, we leverage the information from the default LLM subword tokenisation and a custom morphological segmentation using a novel encoding. The evaluation of LLMSegm across seven morphologically diverse languages demonstrates substantial gains in minimally-supervised settings as well as for low-resourced languages, compared to several existing competitive approaches. In terms of F1-scores and accuracy, we achieve improved results compared to the competing methods in six out of seven datasets. Keywords: morphological segmentation, surface-level segmentation, large language models, low-resource settings

LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction
Yixuan Wang | Baoxin Wang | Yijun Liu | Dayong Wu | Wanxiang Che

Over-correction is a critical problem in the Chinese grammatical error correction (CGEC) task. Recent work using model ensemble methods based on voting can effectively mitigate over-correction and improve the precision of the GEC system. However, these methods still require the output of several GEC systems and inevitably lead to reduced error recall. In this light, we propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble. Specifically, we train the model on an over-correction dataset constructed through the proposed K-fold cross inference method, which allows it to directly generate filtered sentences by combining the original and the over-corrected text. In the inference stage, we directly take the original sentences and the output results of other systems as input and then obtain the filtered sentences through LM-Combiner. Experiments on the FCGEC dataset show that our proposed method effectively alleviates the over-correction of the original system (+18.2 Precision) while ensuring the error recall remains unchanged. Besides, we find that LM-Combiner still has good rewriting performance even with few parameters and little training data, and thus can cost-effectively mitigate the over-correction of black-box GEC systems (e.g., ChatGPT).

Large pretrained language models (LLMs) have shown surprising In-Context Learning (ICL) ability. An important application in deploying large language models is to augment LLMs with a private database for some specific task. The main problem with this promising commercial use is that LLMs have been shown to memorize their training data, and their prompt data are vulnerable to membership inference attacks (MIA) and prompt leaking attacks. To deal with this problem, we treat LLMs as untrusted in privacy and propose a locally differentially private framework of in-context learning (LDP-ICL) in the settings where labels are sensitive. Considering the mechanisms of in-context learning in Transformers by gradient descent, we provide an analysis of the trade-off between privacy and utility in such LDP-ICL for classification. Moreover, we apply LDP-ICL to the discrete distribution estimation problem. Finally, we perform several experiments to demonstrate our analysis results.

Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Centers for Disease Control and Prevention (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78%, respectively, a 59% improvement in F1-score over GPT3.5 in the zero-shot setting. We also utilize LocalHealth to extrapolate CDC’s estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.

Loflòc: A Morphological Lexicon for Occitan using Universal Dependencies
Marianne Vergez-Couret | Myriam Bras | Aleksandra Miletić | Clamença Poujade

This paper presents Loflòc (Lexic obèrt flechit Occitan – Open Inflected Lexicon of Occitan), a morphological lexicon for Occitan. Even though the lexicon no longer occupies the same place in the NLP pipeline since the advent of large language models, it remains a crucial resource for low-resourced languages. Occitan is a Romance language spoken in the south of France and in parts of Italy and Spain. It is not recognized as an official language in France and no standard variety is shared across the area. To the best of our knowledge, Loflòc is the first publicly available lexicon for Occitan. It contains 650 thousand entries for 57 thousand lemmas. Each entry is accompanied by the corresponding Universal Dependencies Part-of-Speech tag. We show that the lexicon has solid coverage on the existing freely available corpora of Occitan in four major dialects. Coverage gaps on multi-dialect corpora are overwhelmingly driven by dialectal variation, which affects both open and closed classes. Based on this analysis we propose directions for future improvements.

Essay writing is a skill commonly taught and practised in schools. The ability to write a fluent and persuasive essay is often a major component of formal assessment. In natural language processing and education technology we may work with essays in their final form, for example to carry out automated assessment or grammatical error correction. In this work we collect and analyse data representing the essay writing process from start to finish, by recording every keystroke from multiple writers participating in our study. We describe our data collection methodology, the characteristics of the resulting dataset, and the assignment of proficiency levels to the texts. We discuss the ways the keystroke data can be used, for instance to identify patterns in the keystrokes which might act as features in automated assessment or may enable further advancements in writing assistance. We also discuss the writing support technology which could be built with such information, if we can detect when writers are struggling to compose a section of their essay and offer appropriate intervention. We frame this work in the context of English language learning, but we note that keystroke logging is relevant more broadly to text authoring scenarios as well as cognitive or linguistic analyses of the writing process.

Logic Rules as Explanations for Legal Case Retrieval
ZhongXiang Sun | Kepu Zhang | Weijie Yu | Haoyu Wang | Jun Xu

In this paper, we address the issue of using logic rules to explain the results from legal case retrieval. The task is critical to legal case retrieval because the users (e.g., lawyers or judges) are highly specialized and require the system to provide logical, faithful, and interpretable explanations before making legal decisions. Recently, research efforts have been made to learn explainable legal case retrieval models. However, these methods usually select rationales (key sentences) from the legal cases as explanations, failing to provide faithful and logically correct explanations. In this paper, we propose Neural-Symbolic enhanced Legal Case Retrieval (NS-LCR), a framework that explicitly conducts reasoning on the matching of legal cases through learning case-level and law-level logic rules. The learned rules are then integrated into the retrieval process in a neuro-symbolic manner. Benefiting from the logical and interpretable nature of the logic rules, NS-LCR is equipped with built-in faithful explainability. We also show that NS-LCR is a model-agnostic framework that can be plugged into multiple legal retrieval models. To demonstrate the superiority of NS-LCR, we extend the benchmarks of LeCaRD and ELAM with manually annotated logic rules and propose a new explainability measure based on Large Language Models (LLMs). Extensive experiments show that NS-LCR can achieve state-of-the-art ranking performance, and the empirical analysis also shows that NS-LCR is capable of providing faithful explanations for legal case retrieval.

LoNAS: Elastic Low-Rank Adapters for Efficient Large Language Models
Juan Pablo Munoz | Jinjie Yuan | Yi Zheng | Nilesh Jain

Large Language Models (LLMs) continue to grow, reaching hundreds of billions of parameters and making it challenging for Deep Learning practitioners with resource-constrained systems to use them, e.g., fine-tuning these models for a downstream task of their interest. Adapters, such as low-rank adapters (LoRA), have been proposed to reduce the number of trainable parameters in a model, reducing memory requirements and enabling smaller systems to fine-tune these models. Orthogonal to this work, Neural Architecture Search (NAS) has been used to discover compressed and more efficient architectures without sacrificing performance compared to similar base models. This paper introduces a novel approach, LoNAS, to use NAS on language models by exploring a search space of elastic low-rank adapters while reducing memory and compute requirements of full-scale NAS, resulting in high-performing compressed models obtained from weight-sharing super-networks. Compared to models fine-tuned with LoRA, these models contain fewer total parameters, reducing the inference time with only minor decreases in accuracy and, in some cases, even improving accuracy. We discuss the limitations of LoNAS and share observations for the research community regarding its generalization capabilities, which have motivated our follow-up work.

LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation
Jennifer A. Bishop | Sophia Ananiadou | Qianqian Xie

Maintaining factual consistency is a critical issue in abstractive text summarisation; however, it cannot be assessed by traditional automatic metrics used for evaluating text summarisation, such as ROUGE scoring. Recent efforts have been devoted to developing improved metrics for measuring factual consistency using pre-trained language models, but these metrics have restrictive token limits, and are therefore not suitable for evaluating long document text summarisation. Moreover, there are limited research and resources available for evaluating whether existing automatic evaluation metrics are fit for purpose when applied in long document settings. In this work, we evaluate the efficacy of automatic metrics for assessing the factual consistency of long document text summarisation. We create a human-annotated data set for evaluating automatic factuality metrics, LongSciVerify, which contains fine-grained factual consistency annotations for long document summaries from the scientific domain. We also propose a new evaluation framework, LongDocFACTScore, which is suitable for evaluating long document summarisation. This framework allows metrics to be efficiently extended to documents of any length and outperforms existing state-of-the-art metrics in its ability to correlate with human measures of factuality when used to evaluate long document summarisation data sets. We make our code and LongSciVerify data set publicly available: https://github.com/jbshp/LongDocFACTScore.

Longform Multimodal Lay Summarization of Scientific Papers: Towards Automatically Generating Science Blogs from Research Articles
Sandeep Kumar | Guneet Singh Kohli | Tirthankar Ghosal | Asif Ekbal

Science communication, in layperson’s terms, is essential to reach the general population and also maximize the impact of underlying scientific research. Hence, good science blogs and journalistic reviews of research articles are well-read and critical to conveying science. Scientific blogging goes beyond traditional research summaries, offering experts a platform to articulate findings in layperson’s terms. It bridges the gap between intricate research and its comprehension by the general public, policymakers, and other researchers. Amid the rapid expansion of scientific data and the accelerating pace of research, credible science blogs serve as vital artifacts for evidence-based information to the general non-expert audience. However, writing a scientific blog or even a short lay summary requires significant time and effort. Here, we ask whether the process of writing a scientific blog based on a given paper could be semi-automated to produce the first draft. In this paper, we introduce a novel task of Artificial Intelligence (AI)-based science blog generation from a research article. We leverage the idea that presentations and science blogs share a symbiotic relationship in their aim to clarify and elucidate complex scientific concepts. Both rely on visuals, such as figures, to aid comprehension. With this motivation, we create a new dataset of science blogs using the presentation transcript and the corresponding slides. We create a dataset containing a paper’s presentation transcript and figures annotated from nearly 3000 papers. We then propose a multimodal attention model to generate a blog text and select the most relevant figures to explain a research article in layperson’s terms, essentially a science blog. Our experimental results with respect to both automatic and human evaluation metrics show the effectiveness of our proposed approach and the usefulness of our proposed dataset.

Knowledge-based Visual Question Generation (KB-VQG) aims to generate visual questions that draw on outside knowledge beyond the image. Existing approaches are answer-aware, incorporating answers into the question-generation process. However, these methods focus only on leveraging the semantics of inputs to propose questions, ignoring the logical coherence among generated questions (Q), images (V), answers (A), and corresponding acquired outside knowledge (K). This results in many unexpected, low-quality questions that lack insight and diversity, some of which have no corresponding answer at all. To address this issue, we inject logical verification into the processes of knowledge acquisition and question generation, an approach we call LV^2-Net. By checking the logical structure among V, A, K, ground-truth and generated Q twice in the whole KB-VQG procedure, LV^2-Net can propose diverse and insightful knowledge-based visual questions. Experimental results on two commonly used datasets demonstrate the superiority of LV^2-Net. Our code will be released to the public soon.

LoSST-AD: A Longitudinal Corpus for Tracking Alzheimer’s Disease Related Changes in Spontaneous Speech
Ulla Petti | Anna Korhonen

Language-based biomarkers have shown promising results in differentiating those with Alzheimer’s disease (AD) diagnosis from healthy individuals, but the earliest changes in language are thought to start years or even decades before the diagnosis. Detecting these changes is critical to allow early interventions, but research into the earliest signs is challenging, as it requires large longitudinal datasets that are time-consuming and expensive to collect. There is a need for alternative methods for tracking longitudinal language change, including Natural Language Processing (NLP) and speech recognition technologies. We present a novel corpus that can enable this: a corpus of transcripts of public interviews with 20 famous figures, half of whom will eventually be diagnosed with AD, recorded over several decades. We evaluate the corpus by validating patterns of vocabulary richness changes known from literature, such as decline in noun frequency, word length, and several other features. We show that public data could be used to collect longitudinal datasets without causing extra stress for the participant, and that these data can adequately reflect longitudinal AD-related changes in vocabulary richness. Our corpus can provide a valuable starting point for the development of early detection tools and enhance our understanding of how AD affects language over time.

Low-Rank Prune-And-Factorize for Language Model Compression
Siyu Ren | Kenny Q. Zhu

The components underpinning pre-trained language models (PLMs), namely large weight matrices, were shown to bear considerable redundancy. Matrix factorization, a well-established technique from matrix theory, has been utilized to reduce the number of parameters in PLMs. However, it fails to retain satisfactory performance under moderate to high compression rates. In this paper, we identify the full-rankness of fine-tuned PLMs as the fundamental bottleneck for the failure of matrix factorization and explore the use of network pruning to extract a low-rank sparsity pattern desirable for matrix factorization. We find that such a low-rank sparsity pattern exists exclusively in models generated by first-order pruning, which motivates us to unite the two approaches and achieve more effective model compression. We further propose two techniques, sparsity-aware SVD and mixed-rank fine-tuning, which improve the initialization and training of the compression procedure, respectively. Experiments on GLUE and question-answering tasks show that the proposed method has a superior compression-performance trade-off compared to existing approaches.

M2SA: Multimodal and Multilingual Model for Sentiment Analysis of Tweets
Gaurish Thakkar | Sherzod Hakimov | Marko Tadić

In recent years, multimodal natural language processing, aimed at learning from diverse data types, has garnered significant attention. However, there needs to be more clarity when it comes to analysing multimodal tasks in multi-lingual contexts. While prior studies on sentiment analysis of tweets have predominantly focused on the English language, this paper addresses this gap by transforming an existing textual Twitter sentiment dataset into a multimodal format through a straightforward curation process. Our work opens up new avenues for sentiment-related research within the research community. Additionally, we conduct baseline experiments utilising this augmented dataset and report the findings. Notably, our evaluations reveal that when comparing unimodal and multimodal configurations, using a sentiment-tuned large language model as a text encoder performs exceptionally well.

M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval
Yang Bai | Anthony Colas | Christan Grant | Zhe Wang

In recent research, contrastive learning has proven to be a highly effective method for representation learning and is widely used for dense retrieval. However, we identify that relying solely on contrastive learning can lead to suboptimal retrieval performance. On the other hand, despite many retrieval datasets supporting various learning objectives beyond contrastive learning, combining them efficiently in multi-task learning scenarios can be challenging. In this paper, we introduce M3, an advanced recursive Multi-hop dense sentence retrieval system built upon a novel Multi-task Mixed-objective approach for dense text representation learning, addressing the aforementioned challenges. Our approach yields state-of-the-art performance on a large-scale open-domain fact verification benchmark dataset, FEVER.

Multilingual translation supports multiple translation directions by projecting all languages into a shared space, but the translation quality is undermined by the differences between languages in the text-only modality, especially when the number of languages is large. To bridge this gap, we introduce visual context as the universal language-independent representation to facilitate multilingual translation. In this paper, we propose a framework to leverage the multimodal prompt to guide the Multimodal Multilingual Neural Machine Translation (m3P), which aligns the representations of different languages with the same meaning and generates the conditional vision-language memory for translation. We construct a multilingual multimodal instruction dataset (InstrMulti102) to support 102 languages. Our method aims to minimize the representation distance of different languages by regarding the image as a central language. Experimental results show that m3P outperforms previous text-only baselines and multilingual multimodal methods by a large margin. Furthermore, the probing experiments validate the effectiveness of our method in enhancing translation under the low-resource and massively multilingual scenario.

M3TCM: Multi-modal Multi-task Context Model for Utterance Classification in Motivational Interviews
Sayed Muddashir Hossain | Jan Alexandersson | Philipp Müller

Accurate utterance classification in motivational interviews is crucial to automatically understand the quality and dynamics of client-therapist interaction, and it can serve as a key input for systems mediating such interactions. Motivational interviews exhibit three important characteristics. First, there are two distinct roles, namely client and therapist. Second, they are often highly emotionally charged, which can be expressed both in text and in prosody. Finally, context is of central importance to classify any given utterance. Previous works did not adequately incorporate all of these characteristics into utterance classification approaches for mental health dialogues. In contrast, we present M3TCM, a Multi-modal, Multi-task Context Model for utterance classification. Our approach for the first time employs multi-task learning to effectively model both joint and individual components of therapist and client behaviour. Furthermore, M3TCM integrates information from the text and speech modality as well as the conversation context. With our novel approach, we outperform the state of the art for utterance classification on the recently introduced AnnoMI dataset with a relative improvement of 20% for client and 15% for therapist utterance classification. In extensive ablation studies, we quantify the improvement resulting from each contribution.

MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis
Priya Rani | Theodorus Fransen | John P. McCrae | Gaurav Negi

The present paper introduces a new sentiment dataset, MaCmS, for Magahi-Hindi-English (MHE) code-mixed language, where Magahi is a less-resourced minority language. This dataset is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks. Further, we also provide a linguistic analysis of the dataset to understand the structure of code-mixing and a statistical study to understand the language preferences of speakers with different polarities. Alongside these analyses, we also train baseline models to evaluate the dataset’s quality.

MAGIC: Multi-Argument Generation with Self-Refinement for Domain Generalization in Automatic Fact-Checking
Wei-Yu Kao | An-Zi Yen

Numerous studies have been conducted on automatic fact-checking, driven by its importance in real-world applications. However, two challenges persist: (1) extracting pivotal evidence from extensive documents, and (2) verifying claims across diverse domains. On one hand, current retrieval methods are limited in their ability to concisely retrieve evidence, which results in poor performance. On the other hand, retrieved evidence derived from different sources strains the generalization capabilities of classifiers. This paper explores the task of cross-domain fact-checking and presents the XClaimCheck dataset, which consists of claims from multiple domains. We propose a framework featuring a multi-argument generation technique. We leverage multi-argument generation to reconstruct concise evidence from large amounts of evidence retrieved from different sources. In addition, a self-refinement mechanism is introduced to confirm that the generated arguments are consistent with the content of the evidence. Experimental results show that our proposed framework is effective in identifying the veracity of out-of-domain claims, particularly those that are partially true or false.

Media bias detection poses a complex, multifaceted problem traditionally tackled using single-task models and small in-domain datasets, consequently lacking generalizability. To address this, we introduce MAGPIE, a large-scale multi-task pre-training approach explicitly tailored for media bias detection. To enable large-scale pre-training, we construct Large Bias Mixture (LBM), a compilation of 59 bias-related tasks. MAGPIE outperforms previous approaches in media bias detection on the Bias Annotation By Experts (BABE) dataset, with a relative improvement of 3.3% in F1-score. Furthermore, using a RoBERTa encoder, we show that MAGPIE needs only 15% of fine-tuning steps compared to single-task approaches. We provide insight into task learning interference and show that sentiment analysis and emotion detection help learning of all other tasks, and scaling the number of tasks leads to the best results. MAGPIE confirms that multi-task learning (MTL) is a promising approach for addressing media bias detection, enhancing the accuracy and efficiency of existing models. Furthermore, LBM is the first available resource collection focused on media bias MTL.

MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank
Verena Blaschke | Barbara Kovačić | Siyao Peng | Hinrich Schütze | Barbara Plank

Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack of ‘within-language breadth’: most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers’ orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.

MaintIE: A Fine-Grained Annotation Schema and Benchmark for Information Extraction from Maintenance Short Texts
Tyler K. Bikaun | Tim French | Michael Stewart | Wei Liu | Melinda Hodkiewicz

Maintenance short texts (MSTs), derived from maintenance work order records, encapsulate crucial information in a concise yet information-rich format. These user-generated technical texts provide critical insights into the state and maintenance activities of machines, infrastructure, and other engineered assets, the pillars of the modern economy. Despite their importance for asset management decision-making, extracting and leveraging this information at scale remains a significant challenge. This paper presents MaintIE, a multi-level fine-grained annotation scheme for entity recognition and relation extraction, consisting of 5 top-level classes (PhysicalObject, State, Process, Activity and Property) and 224 leaf entities, along with 6 relations tailored to MSTs. Using MaintIE, we have curated a multi-annotator, high-quality, fine-grained corpus of 1,076 annotated texts. Additionally, we present a coarse-grained corpus of 7,000 texts and consider its performance for bootstrapping and enhancing fine-grained information extraction. Using these corpora, we provide model performance measures for benchmarking automated entity recognition and relation extraction. The MaintIE scheme, corpus, and model are publicly available at https://github.com/nlp-tlp/maintie under the MIT license, encouraging further community exploration and innovation in extracting valuable insights from MSTs.

Majority Rules Guided Aspect-Category Based Sentiment Analysis via Label Prior Knowledge
Lin Li | Shaopeng Tang | Renwei Wu

As an important fine-grained task of sentiment analysis, Aspect-Category based Sentiment Analysis (ACSA) aims to identify the sentiment polarities of pre-defined categories in text. However, due to subjectivity, highly semantically similar text can carry different sentiments for different people, leading to annotation differences. To this end, we propose a MAjority Rules Guided (MARG) approach for a deeper understanding of this difference. Specifically, we first design a rule-based prompt generation, and then a label word distribution is generated through an autoregressive model for token-wise semantic consistency. Finally, the impact of this commonly prevailing annotation difference on the model can be mitigated by majority rules: (1) our local majority rule is the ensemble of label word distributions, which alleviates the influence of the difference at the distribution generation stage; and (2) our global majority rule is the refinement based on the label prior knowledge of aspect categories, which further reduces the interference of the difference at the global data level. Evaluated on four benchmark datasets, our MARG outperforms the state-of-the-art models by 2.43% to 67.68% in terms of F1-score and by 1.16% to 10.22% in terms of Accuracy.
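
The local majority rule described above can be illustrated with a minimal sketch. It assumes that the ensemble is a plain average of label word distributions obtained from several prompts; the abstract does not specify the combination rule, and the function name `local_majority_rule` and the example labels are hypothetical.

```python
def local_majority_rule(distributions):
    """Average several label-word distributions and pick the top label.

    Assumption: each distribution is a dict mapping the same label words
    to probabilities, and the ensemble is an unweighted mean.
    """
    labels = list(distributions[0].keys())
    n = len(distributions)
    averaged = {
        label: sum(d[label] for d in distributions) / n for label in labels
    }
    # The label with the highest averaged probability wins.
    best = max(averaged, key=averaged.get)
    return best, averaged


# Three hypothetical prompts disagree; the ensemble settles the label.
prompt_outputs = [
    {"positive": 0.6, "negative": 0.4},
    {"positive": 0.2, "negative": 0.8},
    {"positive": 0.3, "negative": 0.7},
]
label, averaged = local_majority_rule(prompt_outputs)
print(label, averaged)
```

Averaging keeps a single outlier prompt from deciding the label, which is one plausible reading of how an ensemble could dampen annotation differences at the distribution generation stage.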

Large language models (LLMs) have shown increasing power on various natural language processing (NLP) tasks. However, tuning these models for downstream tasks usually incurs exorbitant costs or is unavailable due to commercial considerations. Recently, black-box tuning has been proposed to address this problem by optimizing task-specific prompts without accessing the gradients and hidden representations. However, most existing works have yet to fully exploit the potential of gradient-free optimization under the scenario of few-shot learning. In this paper, we describe BBT-RGB, a suite of straightforward and complementary techniques for enhancing the efficiency and performance of black-box optimization. Specifically, our method includes three plug-and-play components: (1) a two-stage derivative-free optimization strategy that facilitates fast convergence and mitigates overfitting; (2) automatic verbalizer construction with its novel usage under few-shot settings; (3) a better prompt initialization policy based on instruction search and auto-selected demonstration. Extensive experiments across various tasks on natural language understanding and inference demonstrate the effectiveness of our method. Our code is available at https://github.com/QiushiSun/BBT-RGB.

Making Pre-trained Language Models Better Continual Few-Shot Relation Extractors
Shengkun Ma | Jiale Han | Yi Liang | Bo Cheng

Continual Few-shot Relation Extraction (CFRE) is a practical problem that requires the model to continuously learn novel relations while avoiding forgetting old ones with few labeled training data. The primary challenges are catastrophic forgetting and overfitting. This paper harnesses prompt learning to explore the implicit capabilities of pre-trained language models to address the above two challenges, thereby making language models better continual few-shot relation extractors. Specifically, we propose a Contrastive Prompt Learning framework, which designs prompt representation to acquire more generalized knowledge that can be easily adapted to old and new categories, and margin-based contrastive learning to focus more on hard samples, therefore alleviating catastrophic forgetting and overfitting issues. To further remedy overfitting in low-resource scenarios, we introduce an effective memory augmentation strategy that employs well-crafted prompts to guide ChatGPT in generating diverse samples. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin and significantly mitigates catastrophic forgetting and overfitting in low-resource scenarios.

Making Sentence Embeddings Robust to User-Generated Content
Lydia Nishimwe | Benoît Sagot | Rachel Bawden

NLP models have been known to perform poorly on user-generated content (UGC), mainly because it presents a lot of lexical variations and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robustness of LASER, a sentence embedding model, to UGC data. We evaluate this robustness by LASER’s ability to represent non-standard sentences and their standard counterparts close to each other in the embedding space. Inspired by previous works extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of standard and UGC sentences. We show that with training only on standard and synthetic UGC-like data, RoLASER significantly improves LASER’s robustness to both natural and artificial UGC data by achieving up to 2x and 11x better scores. We also perform a fine-grained analysis on artificial UGC data and find that our model greatly outperforms LASER on its most challenging UGC phenomena such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
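
The teacher-student idea above can be sketched as a distance to minimise between the student's embedding of a UGC sentence and the teacher's embedding of its standard counterpart. This is a minimal sketch that assumes mean squared error as the distance, which the abstract does not state; the function name `distillation_loss` and the toy vectors are hypothetical.

```python
def distillation_loss(student_embeddings, teacher_embeddings):
    """Mean squared error between paired sentence embeddings.

    Assumption: row i of `student_embeddings` encodes a non-standard
    sentence and row i of `teacher_embeddings` encodes its standard
    counterpart, both as equal-length lists of floats.
    """
    total, count = 0.0, 0
    for student, teacher in zip(student_embeddings, teacher_embeddings):
        for s, t in zip(student, teacher):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy vectors standing in for encoder outputs.
teacher = [[0.5, 0.1, -0.3]]   # embedding of a standard sentence
student = [[0.4, 0.3, -0.3]]   # embedding of its UGC variant
print(distillation_loss(student, teacher))
```

Driving this value towards zero pulls the two representations together in the embedding space, which is the property the evaluation described above measures.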

Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction
MohanRaj Chanthran | Lay-Ki Soon | Huey Fang Ong | Bhawani Selvaretnam

Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions in Malaysian English news articles highlights that they cannot handle morphosyntactic variations in Malaysian English. Unfortunately, most of the existing datasets are mainly based on Standard English, which is not sufficient to enhance NLP tasks in Malaysian English. To the best of our knowledge, there is no annotated dataset that can be used to improve the model. To address this issue, we have constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that having a dataset tailor-made for Malaysian English could significantly improve the performance of NER in Malaysian English. This paper presents our efforts to acquire data, the annotation methodology, and a detailed analysis of the annotated dataset. To ensure the quality of the annotation, we have measured the Inter-Annotator Agreement (IAA), and any disagreements were resolved by a subject matter expert through adjudication. After a rigorous quality check, we have developed a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss spaCy fine-tuning setup and analysis of NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction.

mALBERT: Is a Compact Multilingual BERT Model Still Worth It?
Christophe Servan | Sahar Ghannay | Sophie Rosset

Within the current trend of Pretrained Language Models (PLMs), more and more criticisms are emerging about the ethical and ecological impact of such models. In this article, considering these critical remarks, we propose to focus on smaller models, such as compact models like ALBERT, which are more ecologically virtuous than these PLMs. However, PLMs enable huge breakthroughs in Natural Language Processing tasks, such as Spoken and Natural Language Understanding, classification, and Question–Answering tasks. PLMs also have the advantage of being multilingual, and, as far as we know, a multilingual version of compact ALBERT models does not exist. Considering these facts, we propose the free release of the first version of a multilingual compact ALBERT model, pre-trained using Wikipedia data, which complies with the ethical aspect of such a language model. We also evaluate the model against classical multilingual PLMs in classical NLP tasks. Finally, this paper proposes a rare study on the impact of subword tokenization on language performance.

ManNER & ManPOS: Pioneering NLP for Endangered Manchu Language
Sangah Lee | Sungjoo Byun | Jean Seo | Minha Kang

We present pioneering research in the realm of Natural Language Processing (NLP) for the endangered Manchu language. Recognizing the critical importance of linguistic preservation, we experiment with three language models – BiLSTM-CRF, BERT, and mBERT – for Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. Given the limited digitized Manchu text available, we augment the data using GloVe embeddings for the pre-training of BERT-based models. Remarkably, all models demonstrated outstanding performance, achieving over 90% F1 score in both NER and POS tagging tasks. Our research not only marks the first application of NLP on Manchu and the inaugural use of BERT-based models for the language but also stands as the first endeavor to employ Manchu for NER and POS tagging. To foster further exploration and applications in the field, we make our fine-tuning dataset and models available to the public. Through this research, we aim to underscore the significance of NLP in the protection and revitalization of low-resource languages.

Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata
Axel Ahlin | Alfred Myrne Blåder | Pierre Nugues

In this paper, we describe the extraction of all the location entries from a prominent Swedish encyclopedia from the early 20th century, the Nordisk Familjebok ‘Nordic Family Book’, focusing on the second edition called Uggleupplagan. This edition comprises 38 volumes and over 182,000 articles, making it one of the most extensive Swedish encyclopedia editions. Using a classifier, we first determined the category of the entities. We found that approximately 22 percent of the encyclopedia entries were locations. We applied a named entity recognition to these entries and we linked them to Wikidata. Wikidata enabled us to extract their precise geographic locations resulting in almost 18,000 valid coordinates. We then analyzed the distribution of these locations and the entry selection process. It showed a concentration within Sweden, Germany, and the United Kingdom. The paper sheds light on the selection and representation of geographic information in the Nordisk Familjebok, providing insights into historical and societal perspectives. It also paves the way for future investigations into entry selection in different time periods and comparative analyses among various encyclopedias.

Mapping Work Task Descriptions from German Job Ads on the O*NET Work Activities Ontology
Ann-Sophie Gnehm | Simon Clematide

This work addresses the challenge of extracting job tasks from German job postings and mapping them to the fine-grained work activities classification in the O*NET labor market ontology. By utilizing ontological data with a Multiple Negatives Ranking loss and integrating a modest volume of labeled job advertisement data into the training process, our top configuration achieved a notable precision of 70% for the best mapping on the test set, representing a substantial improvement compared to the 33% baseline delivered by a general-domain SBERT. In our experiments the following factors proved to be most effective for improving SBERT models: First, the incorporation of subspan markup, both during training and inference, supports accurate classification, by streamlining varied job ad task formats with structured, uniform ontological work activities. Second, the inclusion of additional occupational information from O*NET into training supported learning by contextualizing hierarchical ontological relationships. Third, the most significant performance improvement was achieved by updating SBERT models with labeled job ad data specifically addressing challenging cases encountered during pre-finetuning, effectively bridging the semantic gap between O*NET and job ad data.
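
The Multiple Negatives Ranking loss mentioned above can be sketched in plain Python as follows. For each anchor (for example, a job ad task), the paired text (for example, the matching ontology activity) is treated as the positive and every other positive in the batch as a negative. The scale factor of 20 and the use of cosine similarity are assumptions borrowed from common SBERT practice, not details given in the abstract, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch similarities.

    anchors[i] should rank positives[i] above all other positives[j].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(multiple_negatives_ranking_loss(anchors, aligned))  # close to 0
print(multiple_negatives_ranking_loss(anchors, swapped))  # close to 20
```

Because negatives come for free from the rest of the batch, this objective only needs positive pairs, which suits settings where labeled data is scarce.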

This paper introduces a cross-domain and multi-dialectal stance corpus for Arabic that includes four regions in the Arab World and covers the main Arabic dialect groups. Our corpus consists of 4657 sentences manually annotated with each sentence’s stance towards a specific topic. For each region, we collected sentences related to two controversial topics. We annotated each sentence by at least two annotators to indicate if its stance favors the topic, is against it, or is neutral. Our corpus is well-balanced concerning dialect and stance. Approximately half of the sentences are in Modern Standard Arabic (MSA) for each region, and the other half is in the region’s respective dialect. We conducted several machine-learning experiments for stance detection using our new corpus. Our most successful model is the Multi-Layer Perceptron (MLP), using Unigram or TF-IDF extracted features, which yielded an F1-score of 0.66 and an accuracy score of 0.66. Compared with the most similar state-of-the-art dataset, our dataset outperformed in specific stance classes, particularly “neutral” and “against”.

Massively Multilingual Token-Based Typology Using the Parallel Bible Corpus
Amanda Kann

The parallel Bible corpus is a uniquely broad multilingual resource, covering over 1400 languages. While this data is potentially highly useful for extending language coverage in both token-based typology research and various low-resource NLP applications, the restricted register and translational nature of the Bible texts has raised concerns as to whether they are sufficiently representative of language use outside of their specific context. In this paper, we analyze the reliability and generalisability of word order statistics extracted from the Bible corpus from two angles: stability across different translations in the same language, and comparability with Universal Dependencies corpora and typological database classifications from URIEL and Grambank. We find that variation between same-language translations is generally low and that agreement with other data sources and previous work is generally high, suggesting that the impact of issues specific to massively parallel texts is smaller than previously posited.

Mathematical Entities: Corpora and Benchmarks
Jacob Collard | Valeria de Paiva | Eswaran Subrahmanian

Mathematics is a highly specialized domain with its own unique set of challenges. Despite this, there has been relatively little research on natural language processing for mathematical texts, and there are few mathematical language resources aimed at NLP. In this paper, we aim to provide annotated corpora that can be used to study the language of mathematics in different contexts, ranging from fundamental concepts found in textbooks to advanced research mathematics. We preprocess the corpora with a neural parsing model and some manual intervention to provide part-of-speech tags, lemmas, and dependency trees. In total, we provide 182397 sentences across three corpora. We then aim to test and evaluate several noteworthy natural language processing models using these corpora, to show how well they can adapt to the domain of mathematics and provide useful tools for exploring mathematical language. We evaluate several neural and symbolic models against benchmarks that we extract from the corpus metadata to show that terminology extraction and definition extraction do not easily generalize to mathematics, and that additional work is needed to achieve good performance on these metrics. Finally, we provide a learning assistant that grants access to the content of these corpora in a context-sensitive manner, utilizing text search and entity linking. Though our corpora and benchmarks provide useful metrics for evaluating mathematical language processing, further work is necessary to adapt models to mathematics in order to provide more effective learning assistants and apply NLP methods to different mathematical domains.

MccSTN: Multi-Scale Contrast and Fine-Grained Feature Fusion Networks for Subject-driven Style Transfer
Honggang Zhao | Chunling Xiao | Jiayi Yang | Guozhu Jin | Mingyong Li

Stylistic transformation of artistic images is an important part of the current image processing field. In order to access the aesthetic artistic expression of style images, recent research has applied attention mechanisms to the field of style transfer. This approach transforms style images into tokens by calculating attention and then migrating the artistic style of the image through a decoder. Due to the very low semantic similarity between the original image and the style image, this results in many fine-grained style features being discarded. This can lead to discordant artifacts or obvious artifacts. To address this problem, we propose MccSTN, a novel style representation and transfer framework that can be adapted to existing arbitrary image style transfers. Specifically, we first introduce a feature fusion module (Mccformer) to fuse aesthetic features in style images with fine-grained features in content images. Feature maps are obtained through Mccformer. The feature map is then fed into the decoder to get the image we want. In order to lighten the model and train it quickly, we consider the relationship between specific styles and the overall style distribution. We introduce a multi-scale augmented contrast module that learns style representations from a large number of image pairs.

Multimodal information extraction (MIE) is a challenging task which aims to extract the structural information in free text coupled with the image for constructing the multimodal knowledge graph. The entity-based MIE tasks are based on the entity information to complete the specific tasks. However, the existing methods only investigated the entity-based MIE tasks under supervised learning with adequate labeled data. In the real-world scenario, collecting enough data and annotating the entity-based samples are time-consuming, and impractical. Therefore, we propose to investigate the entity-based MIE tasks under the low-resource settings. The conventional models are prone to overfitting on limited labeled data, which can result in poor performance. This is because the models tend to learn the bias existing in the limited samples, which can lead them to model the spurious correlations between multimodal features and task labels. To provide a more comprehensive understanding of the bias inherent in multimodal features of MIE samples, we decompose the features into image, entity, and context factors. Furthermore, we investigate the causal relationships between these factors and model performance, leveraging the structural causal model to delve into the correlations between the input features and output labels. Based on this, we propose the multimodal counterfactual instance learning framework to generate the counterfactual instances by the interventions on the limited observational samples. In the framework, we analyze the causal effect of the counterfactual instances and exploit it as a supervisory signal to maximize the effect for reducing the bias and improving the generalization of the model. Empirically, we evaluate the proposed method on the two public MIE benchmark datasets and the experimental results verify the effectiveness of it.

Text simplification aims to make the text easier to understand by applying rewriting transformations. There has been very little research on Chinese text simplification for a long time. The lack of generic evaluation data is an essential reason for this phenomenon. In this paper, we introduce MCTS, a multi-reference Chinese text simplification dataset. We describe the annotation process of the dataset and provide a detailed analysis. Furthermore, we evaluate the performance of several unsupervised methods and advanced large language models. We additionally provide Chinese text simplification parallel data that can be used for training, acquired by utilizing machine translation and English text simplification. We hope to build a basic understanding of Chinese text simplification through the foundational work and provide references for future research. All of the code and data are released at https://github.com/blcuicall/mcts/.

MDS: A Fine-Grained Dataset for Multi-Modal Dialogue Summarization
Zhipeng Liu | Xiaoming Zhang | Litian Zhang | Zelong Yu

Due to the explosion of various dialogue scenes, summarizing the dialogue into a short message has drawn much attention recently. In the multi-modal dialogue scene, people tend to use tone and body language to illustrate their intentions. While traditional dialogue summarization has predominantly focused on textual content, this approach may overlook vital visual and audio information essential for understanding multi-modal interactions. Recognizing the established field of multi-modal dialogue summarization, we develop a new multi-modal dialogue summarization dataset (MDS), which aims to enhance the variety and scope of data available for this research area. MDS provides a demanding testbed for multi-modal dialogue summarization. Subsequently, we conducted a comparative analysis of various summarization techniques on MDS and found that the existing methods tend to produce redundant and incoherent summaries. All of the models generate unfaithful facts to some degree, suggesting future research directions. MDS is available at https://github.com/R00kkie/MDS.

Measuring Cross-Text Cohesion for Segmentation Similarity Scoring
Gerardo Ocampo Diaz | Jessica Ouyang

Text segmentation is the task of dividing a sequence of text elements (e.g., words, sentences, or paragraphs) into meaningful chunks. Although exciting advances are being made in modern segmentation-based tasks, such as automatically generating podcast chapters, current segmentation similarity metrics share a critical weakness: they are content-agnostic. In this paper, we present a word-embedding-based metric of cross-textual cohesion based on the formal linguistic definition of cohesion and incorporate it into a new segmentation similarity metric, SED. Our similarity metric, SED, is capable of providing fine-grained segmentation similarity scoring for the 3 basic segmentation errors: transposition, insertion, and deletion, as well as mixtures of them, avoiding the limitations of traditional metrics. We discuss the benefits of SED and evaluate its alignment with human judgement for each of the 3 basic error types. We show that our metric aligns with human evaluations significantly more than traditional metrics. We briefly discuss future work, such as the integration of anaphora resolution into our cohesion-based metric, and make our code publicly available.

Medical Entity Disambiguation with Medical Mention Relation and Fine-grained Entity Knowledge
Wenpeng Lu | Guobiao Zhang | Xueping Peng | Hongjiao Guan | Shoujin Wang

Medical entity disambiguation (MED), the task of mapping ambiguous medical mentions to structured candidate medical entities from knowledge bases (KBs), plays a crucial role in natural language processing and biomedical domains. However, existing methods for MED often fail to fully utilize the knowledge within medical KBs and overlook essential interactions between medical mentions and candidate entities, resulting in knowledge- and interaction-inefficient modeling and suboptimal disambiguation performance. To address these limitations, this paper proposes a novel approach, MED with Medical Mention Relation and Fine-grained Entity Knowledge (MMR-FEK). Specifically, MMR-FEK incorporates a mention relation fusion module and an entity knowledge fusion module, followed by an interaction module. The former employs a relation graph convolutional network to fuse mention relation information between medical mentions to enhance mention representations, while the latter leverages an attention mechanism to fuse synonym and type information of candidate entities to enhance entity representations. Afterwards, an interaction module is designed to employ a bidirectional attention mechanism to capture interactions between mentions and entities to generate the matching representation. Extensive experiments on two publicly available real-world datasets demonstrate MMR-FEK’s superiority over state-of-the-art (SOTA) MED baselines across all metrics. Our source code is publicly available.

Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor | Zi-Yi Dou | Aichi Chien | Nanyun Peng | Kai-Wei Chang

Vision-language models have become increasingly powerful for tasks that require an understanding of both visual and linguistic elements, bridging the gap between these modalities. In the context of multimodal clinical AI, there is a growing need for models that possess domain-specific knowledge, as existing models often lack the expertise required for medical applications. In this paper, we take brain abnormalities as an example to demonstrate how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset from case reports and published journals and subsequently constructing a high-performance vision-language model tailored to specific medical tasks. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain. We evaluated the resulting model with quantitative and qualitative intrinsic evaluations. The resulting dataset will be released to the community.

Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of large language models (LLMs) have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts benchmarks, they have been pre-trained and evaluated with a focus on a single language (English mostly). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data, often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.

MedQA-SWE - a Clinical Question & Answer Dataset for Swedish
Niclas Hertzberg | Anna Lokrantz

Considering the rapid improvement of large generative language models, it is important to measure their ability to encode clinical domain knowledge in order to help determine their potential utility in a clinical setting. To this end we present MedQA-SWE – a novel multiple choice, clinical question & answering (Q&A) dataset in Swedish consisting of 3,180 questions. The dataset was created from a series of exams aimed at evaluating doctors’ clinical understanding and decision making and is the first open-source clinical Q&A dataset in Swedish. The exams – originally in PDF format – were parsed and each question manually checked and curated in order to limit errors in the dataset. We provide dataset statistics along with benchmark accuracy scores of seven large generative language models on a representative sample of questions in a zero-shot setting, with some models showing impressive performance given the difficulty of the exam the dataset is based on.

MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models
Nathanael Carraz Rakotonirina | Marco Baroni

Transformer-based language models (LMs) track contextual information through large, hard-coded input windows. We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors, akin to soft prompts, without requiring LM finetuning. Tested on a task designed to probe an LM’s ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history. We also test MemoryPrompt on a long-distance dialogue dataset, where its performance is comparable to that of a model conditioned on the entire conversation history. In both experiments we also observe that, unlike full-finetuning approaches, MemoryPrompt does not suffer from catastrophic forgetting when adapted to new tasks, thus not disrupting the generalist capabilities of the underlying LM.

Early detection of mental health disorders is an essential step in treating and preventing mental health conditions. Computational approaches have been applied to users’ social media profiles in an attempt to identify various mental health conditions such as depression, PTSD, schizophrenia, and eating disorders. The interest in this topic has motivated the creation of various depression detection datasets. However, annotating such datasets is expensive and time-consuming, limiting their size and scope. To overcome this limitation, we present MentalHelp, a large-scale semi-supervised mental disorder detection dataset containing 14 million instances. The corpus was collected from Reddit and labeled in a semi-supervised way using an ensemble of three separate models - flan-T5, Disor-BERT, and Mental-BERT.

With mental health issues on the rise on the Web, especially among young people, there is a growing need for effective identification and intervention. In this paper, we introduce a new open-sourced corpus for the early detection of mental disorders in Spanish, focusing on eating disorders, depression, and anxiety. It consists of user messages posted on groups within the Telegram message platform and contains over 1,300 subjects with more than 45,000 messages posted in different public Telegram groups. This corpus has been manually annotated via crowdsourcing and is prepared for its use in several Natural Language Processing tasks including text classification and regression tasks. The samples in the corpus include both text and time data. To provide a benchmark for future research, we conduct experiments on text classification and regression by using state-of-the-art transformer-based models.

Meta-Adapter for Self-Supervised Speech Models: A Solution to Low-Resource Speech Recognition Challenges
Yaqi Chen | Hao Zhang | Xukui Yang | Wenlin Zhang | Dan Qu

Self-supervised models have demonstrated remarkable performance in speech processing by learning latent representations from large amounts of unlabeled data. Although these models yield promising results on low-resource languages, the computational expense of fine-tuning all model parameters is prohibitively high. Adapters offer a solution by incorporating lightweight bottleneck structures into pre-trained models, enabling efficient parameter adaptation for downstream tasks. However, randomly initialized adapters often underperform in low-resource scenarios, limiting their applicability in low-resource languages. To address this issue, we develop the Meta-Adapter for self-supervised models to obtain meta-initialized parameters that facilitate quick adaptation to low-resource languages. Extensive experiments on the Common Voice and FLEURS datasets demonstrate the superior performance of Meta-Adapters on 12 low-resource languages spanning four different language families. Moreover, Meta-adapters show better generalization and extensibility than traditional pretraining methods.

Declarative knowledge and procedural knowledge are two key parts of meta-cognitive theory, and both are important in the pre-training and inference of LLMs. However, a comprehensive analysis comparing these two types of knowledge is lacking, primarily due to challenges in definition, probing and quantitative assessment. In this paper, we explore the question from a new perspective by providing ground-truth knowledge for LLMs and evaluating the effective score. Through extensive experiments with widely-used datasets and models, we draw the following conclusions: (1) In most tasks, the benefits from declarative knowledge are greater than those from procedural knowledge. (2) The benefits of procedural knowledge are larger than those of declarative knowledge only in reasoning tasks with simple logic. (3) As pre-training progresses and model size increases, the ability of models to utilize both kinds of knowledge improves significantly, but at different speeds. We provide a detailed analysis of these findings, which can offer primary guidance for the evaluation and enhancement of large language models.

Meta-Evaluation of Sentence Simplification Metrics
Noof Abdullah Alfear | Dimitar Kazakov | Hend Al-Khalifa

Automatic Text Simplification (ATS) is one of the major Natural Language Processing (NLP) tasks, which aims to help people understand text that is above their reading abilities and comprehension. ATS models reconstruct the text into a simpler format by deletion, substitution, addition or splitting, while preserving the original meaning and maintaining correct grammar. Simplified sentences are usually evaluated by human experts based on three main factors (simplicity, adequacy and fluency) or by calculating automatic evaluation metrics. In this paper, we conduct a meta-evaluation of reference-based automatic metrics for English sentence simplification using a high-quality, human-annotated dataset, NEWSELA-LIKERT. We study the behavior of several evaluation metrics at sentence level across four different sentence simplification models. All the models were trained on the NEWSELA-AUTO dataset. The correlation between the metrics’ scores and human judgements was analyzed and the results used to recommend the most appropriate metrics for this task.
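The core step of a metric meta-evaluation, correlating automatic scores with human ratings, can be sketched as follows. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the score lists in the usage note are made-up illustrative values rather than results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def rank_metrics(metric_scores, human_scores):
    """Sort metric names by their correlation with human judgements."""
    corr = {name: pearson(scores, human_scores)
            for name, scores in metric_scores.items()}
    return sorted(corr.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_metrics({"metric_a": [0.1, 0.5, 0.9], "metric_b": [0.9, 0.2, 0.4]}, [1, 3, 5])` places the hypothetical `metric_a` first, because its scores rise with the human ratings while those of `metric_b` do not.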

Metaphors in Online Religious Communication: A Detailed Dataset and Cross-Genre Metaphor Detection
Sebastian Reimann | Tatjana Scheffler

We present the first dataset of fine-grained metaphor annotations for texts from online religious communication, where figurative language plays a particularly important role. In addition to binary labels, metaphors are annotated for deliberateness, that is, whether they are communicated explicitly as metaphors, and we provide indicators for such deliberate use. We further show that cross-genre transfer metaphor detection (from the widely used VUA corpus to our Reddit data) leads to a drop in performance due to the shift in topic and metaphors from source domains that did not occur in the training data. We solve this issue by adding a small amount of in-genre data in fine-tuning, leading to notable performance increases of more than 5 points in F1. Moreover, religious communication has the tendency for extended metaphorical comparisons, which are problematic for current metaphor detection systems. Adding in-genre data had slightly positive effects but we argue that to solve this, architectures that consider larger spans of context are necessary.

MEVTR: A Multilingual Model Enhanced with Visual Text Representations
Xiaohua Wang | Wenlong Fei | Min Hu | Qingyu Zhang | Aoqiang Zhu

The goal of multilingual modelling is to generate multilingual text representations for various downstream tasks in different languages. However, some state-of-the-art pre-trained multilingual models perform poorly on many low-resource languages due to the lack of representation space and model capacity. To alleviate this issue, we propose a Multilingual model Enhanced with Visual Text Representations (MEVTR), which complements textual representations and extends the multilingual representation space with visual text representations. First, the visual encoder focuses on the glyphs and structure of the text to obtain visual text representations, and the textual encoder obtains textual representations. Then, multilingual representations are enhanced by aligning and fusing visual text representations and textual representations. Moreover, we propose a similarity constraint, a self-supervised task that prompts the visual encoder to focus on additional information. Prefix alignment and a multi-head bilinear module are designed to better integrate visual text representations and textual representations. Experimental results indicate that MEVTR benefits from visual text representations and achieves significant performance gains in downstream tasks. In particular, in the zero-shot cross-lingual transfer task, MEVTR achieves results that outperform the state-of-the-art adapter-based framework without the target language adapter.

mForms : Multimodal Form Filling with Question Answering
Larry Heck | Simon Heck | Anirudh S. Sundar

This paper presents a new approach to form-filling by reformulating the task as multimodal natural language Question Answering (QA). The reformulation is achieved by first translating the elements on the GUI form (text fields, buttons, icons, etc.) to natural language questions, where these questions capture the element’s multimodal semantics. After a match is determined between the form element (Question) and the user utterance (Answer), the form element is filled through a pre-trained extractive QA system. By leveraging pre-trained QA models and not requiring form-specific training, this approach to form-filling is zero-shot. The paper also presents an approach to further refine the form-filling by using multi-task training to incorporate a potentially large number of successive tasks. Finally, the paper introduces a multimodal natural language form-filling dataset Multimodal Forms (mForms), as well as a multimodal extension of the popular ATIS dataset to support future research and experimentation. Results show the new approach not only maintains robust accuracy for sparse training conditions but achieves state-of-the-art F1 of 0.97 on ATIS with approximately 1/10th the training data.

Electronic health records (EHRs) serve as a digital repository storing comprehensive medical information about patients. Representation learning for EHRs plays a crucial role in healthcare applications. In this paper, we propose a Multimodal Heterogeneous Graph-enhanced Representation Learning, denoted as MHGRL, aimed at learning effective EHR representations. To address the challenge posed by data insufficiency of EHRs, MHGRL utilizes a multimodal heterogeneous graph to model an EHR. Specifically, we construct a heterogeneous graph for each EHR and enrich it by incorporating multimodal information with medical ontology and textual notes. With the integration of pre-trained model, graph neural network, and attention mechanism, MHGRL effectively incorporates both node attributes and structural information across a multimodal heterogeneous graph. Moreover, we employ contrastive learning to ensure the consistency of representations for similar EHRs and improve the model robustness. The experimental results show that MHGRL outperforms all baselines on two real clinical datasets in downstream tasks, including EHR clustering and disease prediction. The code is available at https://github.com/emmali808/MHGRL.

MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection
Cagri Toraman | Oguzhan Ozcelik | Furkan Sahinuc | Fazli Can

The rapid dissemination of misinformation through online social networks poses a pressing issue with harmful consequences jeopardizing human health, public safety, democracy, and the economy; therefore, urgent action is required to address this problem. In this study, we construct a new human-annotated dataset, called MiDe22, having 5,284 English and 5,064 Turkish tweets with their misinformation labels for several recent events between 2020 and 2022, including the Russia-Ukraine war, COVID-19 pandemic, and Refugees. The dataset includes user engagements with the tweets in terms of likes, replies, retweets, and quotes. We also provide a detailed data analysis with descriptive statistics and the experimental results of a benchmark evaluation for misinformation detection.

Mind Your Neighbours: Leveraging Analogous Instances for Rhetorical Role Labeling for Legal Documents
Santosh T.y.s.s. | Hassan Sarwat | Ahmed Mohamed Abdelaal Abdou | Matthias Grabmair

Rhetorical Role Labeling (RRL) of legal judgments is essential for various tasks, such as case summarization, semantic search and argument mining. However, it presents challenges such as inferring sentence roles from context, interrelated roles, limited annotated data, and label imbalance. This study introduces novel techniques to enhance RRL performance by leveraging knowledge from semantically similar instances (neighbours). We explore inference-based and training-based approaches, achieving remarkable improvements in challenging macro-F1 scores. For inference-based methods, we explore interpolation techniques that bolster label predictions without re-training. For training-based methods, we integrate prototypical learning with our novel discourse-aware contrastive method that works directly on embedding spaces. Additionally, we assess the cross-domain applicability of our methods, demonstrating their effectiveness in transferring knowledge across diverse legal domains.
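The inference-time interpolation idea in the abstract above can be sketched as mixing a model's predicted label distribution with the label distribution of retrieved neighbours. This is a generic nearest-neighbour interpolation under stated assumptions, not the authors' exact formulation: the function name, the unweighted neighbour vote, and the mixing weight `lam` are all hypothetical.

```python
def knn_interpolate(model_probs, neighbour_labels, lam=0.5):
    """Blend model probabilities with a neighbour-label histogram.

    model_probs: probabilities over labels 0..K-1 from the trained model.
    neighbour_labels: label ids of the retrieved similar sentences.
    lam: assumed mixing weight in [0, 1]; 0 ignores the neighbours.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if not neighbour_labels:
        return list(model_probs)  # nothing retrieved: keep model output
    counts = [0] * len(model_probs)
    for label in neighbour_labels:
        counts[label] += 1
    total = len(neighbour_labels)
    knn_probs = [c / total for c in counts]
    return [(1.0 - lam) * p + lam * q
            for p, q in zip(model_probs, knn_probs)]
```

No re-training is involved: the classifier's weights stay fixed and only its output distribution is adjusted at prediction time.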

Reasoning in mathematical domains remains a significant challenge for relatively small language models (LMs). Many current methods focus on specializing LMs in mathematical reasoning and rely heavily on distilling knowledge from powerful yet inefficient large LMs (LLMs). In this work, we explore a new direction that avoids over-reliance on LLM teachers, introducing a multi-view fine-tuning method that efficiently exploits existing mathematical problem datasets with diverse annotation styles. Our approach uniquely considers the various annotation formats as different “views” that may help each other and leverage them in training the model. By postpending distinct instructions to input questions, models can learn to generate solutions in diverse formats in a flexible manner. Experimental results show that our strategy enables relatively small LMs to outperform prior approaches that heavily rely on knowledge distillation, as well as carefully established baselines. Additionally, the proposed method grants the models promising generalization ability across various views and datasets, and the capability to learn from inaccurate or incomplete noisy data. We hope our multi-view training paradigm could inspire future studies in other machine reasoning domains.
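The step of postpending a view-specific instruction to each question can be sketched as below. The view names and instruction strings are hypothetical placeholders; the abstract does not list the actual instructions used.

```python
# Hypothetical instruction per annotation "view"; not taken from the paper.
VIEW_INSTRUCTIONS = {
    "chain_of_thought": "Explain the solution step by step.",
    "equation": "Answer with a single equation.",
}


def build_input(question, view):
    """Append the instruction for the requested view after the question."""
    if view not in VIEW_INSTRUCTIONS:
        raise KeyError(f"unknown view: {view}")
    return f"{question.strip()} {VIEW_INSTRUCTIONS[view]}"


def build_training_pairs(examples):
    """Turn (question, view, solution) triples into (input, target) pairs."""
    return [(build_input(q, view), solution) for q, view, solution in examples]
```

Because the question text is unchanged and only the trailing instruction differs, the same model can be asked for either solution format at inference time by switching the instruction.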

Mitigating Linguistic Artifacts in Emotion Recognition for Conversations from TV Scripts to Daily Conversations
Donovan Ong | Shuo Sun | Jian Su | Bin Chen

Emotion Recognition in Conversations (ERC) is a well-studied task with numerous potential real-world applications. However, existing ERC models trained on the MELD dataset, which is derived from TV series, struggle when applied to daily conversation datasets. A closer examination of the datasets unveils the prevalence of linguistic artifacts such as repetitions and interjections in TV scripts, which ERC models may exploit when making predictions. To address this issue, we explore two techniques aimed at reducing the reliance of ERC models on these artifacts: 1) using contrastive learning to prioritize emotional features over dataset-specific linguistic style and 2) refining emotion predictions with a pseudo-emotion intensity score. Our experimental results show that reducing reliance on the linguistic style found in TV transcripts could enhance a model’s robustness and accuracy in diverse conversational contexts.

Mitigating Misleading Chain-of-Thought Reasoning with Selective Filtering
Yexin Wu | Zhuosheng Zhang | Hai Zhao

Large language models have manifested remarkable capabilities by leveraging chain-of-thought (CoT) reasoning techniques to solve intricate questions through step-by-step reasoning chains. Despite its success, the efficacy of such reasoning is inherently contingent upon the quality of CoT. However, flawless CoT reasoning cannot be guaranteed due to the presence of indecomposable questions and the potential for erroneous reasoning chains, particularly in the case of small-scale language models. To tackle this challenge, we propose a novel approach called the selective filtering reasoner (SelF-Reasoner) that assesses the entailment relationship between the question and the candidate reasoning chain. We proceed with CoT reasoning when the reasoning chain demonstrates confidence; otherwise, we opt to predict the answer directly. SelF-Reasoner improves the fine-tuned T5 baseline consistently over the ScienceQA, ECQA, and LastLetter tasks. Code is available at Anonymous.
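The control flow described in the abstract above (use the reasoning chain only when it passes a filter, otherwise answer directly) can be sketched as follows. The four callables are hypothetical stand-ins for model components; this shows only the gating logic, not SelF-Reasoner's actual models or its entailment scorer.

```python
def selective_answer(question, generate_chain, chain_is_trustworthy,
                     answer_from_chain, answer_directly):
    """Answer via chain-of-thought only if the chain passes the filter.

    All four callables are assumed interfaces for this sketch:
      generate_chain(question) -> str
      chain_is_trustworthy(question, chain) -> bool
      answer_from_chain(question, chain) -> str
      answer_directly(question) -> str
    """
    chain = generate_chain(question)
    if chain_is_trustworthy(question, chain):
        return answer_from_chain(question, chain)
    return answer_directly(question)
```

A rejected chain costs one extra generation, but it keeps a flawed rationale from steering the final answer.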

Mitigating Shortcuts in Language Models with Soft Label Encoding
Zirui He | Huiqi Deng | Haiyan Zhao | Ninghao Liu | Mengnan Du

Recent research has shown that large language models rely on spurious correlations in the data for natural language understanding (NLU) tasks. In this work, we aim to answer the following research question: Can we reduce spurious correlations by modifying the ground truth labels of the training data? Specifically, we propose a simple yet effective debiasing framework, named Soft Label Encoding (SoftLE). First, we train a teacher model to quantify each sample’s degree of relying on shortcuts. Then, we encode this shortcut degree into a dummy class and use it to smooth the original ground truth labels, generating soft labels. These soft labels are used to train a more robust student model that reduces spurious correlations between shortcut features and certain classes. Extensive experiments on two NLU benchmark tasks via two language models demonstrate that SoftLE significantly improves out-of-distribution generalization while maintaining satisfactory in-distribution accuracy. Our code is available at https://github.com/ZiruiHE99/sle
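One way to picture encoding a shortcut degree into a dummy class is the sketch below. It assumes the shortcut degree is a number in [0, 1] and that exactly that much probability mass moves from the true class to an extra class; the abstract does not specify the smoothing rule, so this is an illustrative assumption and not SoftLE's actual formula.

```python
def soft_label(true_class, num_classes, shortcut_degree):
    """Build a (num_classes + 1)-way soft label with a trailing dummy class.

    Assumed rule: the true class keeps 1 - shortcut_degree and the dummy
    class (last position) receives shortcut_degree.
    """
    if not 0.0 <= shortcut_degree <= 1.0:
        raise ValueError("shortcut_degree must be in [0, 1]")
    if not 0 <= true_class < num_classes:
        raise ValueError("true_class out of range")
    label = [0.0] * (num_classes + 1)
    label[true_class] = 1.0 - shortcut_degree
    label[num_classes] = shortcut_degree
    return label
```

Under this assumption, a sample that the teacher solves mostly through shortcuts contributes less confident supervision for its original class when training the student.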

Low-resource languages often face challenges in acquiring high-quality language data due to the reliance on translation-based methods, which can introduce the translationese effect. This phenomenon results in translated sentences that lack fluency and naturalness in the target language. In this paper, we propose a novel approach for data collection by leveraging storyboards to elicit more fluent and natural sentences. Our method involves presenting native speakers with visual stimuli in the form of storyboards and collecting their descriptions without direct exposure to the source text. We conducted a comprehensive evaluation comparing our storyboard-based approach with traditional text translation-based methods in terms of accuracy and fluency. Human annotators and quantitative metrics were used to assess translation quality. The results indicate a preference for text translation in terms of accuracy, while our method demonstrates worse accuracy but better fluency in the language under study.

Relation extraction is a critical task in the field of natural language processing with numerous real-world applications. Existing research primarily focuses on monolingual relation extraction or cross-lingual enhancement for relation extraction. Yet, there remains a significant gap in understanding relation extraction in the mix-lingual (or code-switching) scenario, where individuals intermix contents from different languages within sentences, generating mix-lingual content. Due to the lack of a dedicated dataset, the effectiveness of existing relation extraction models in such a scenario is largely unexplored. To address this issue, we introduce a novel task of considering relation extraction in the mix-lingual scenario called MixRE and constructing the human-annotated dataset MixRED to support this task. In addition to constructing the MixRED dataset, we evaluate both state-of-the-art supervised models and large language models (LLMs) on MixRED, revealing their respective advantages and limitations in the mix-lingual scenario. Furthermore, we delve into factors influencing model performance within the MixRE task and uncover promising directions for enhancing the performance of both supervised models and LLMs in this novel task.

Mixture-of-LoRAs: An Efficient Multitask Tuning Method for Large Language Models
Wenfeng Feng | Chuzhan Hao | Yuewei Zhang | Yu Han | Hao Wang

Instruction Tuning has the potential to stimulate or enhance specific capabilities of large language models (LLMs). However, achieving the right balance of data is crucial to prevent catastrophic forgetting and interference between tasks. To address these limitations and enhance training flexibility, we propose the Mixture-of-LoRAs (MoA) architecture, which is a novel and parameter-efficient tuning method designed for multi-task learning with LLMs. In this paper, we start by individually training multiple domain-specific LoRA modules using corresponding supervised corpus data. These LoRA modules can be aligned with the expert design principles observed in Mixture-of-Experts (MoE). Subsequently, we combine the multiple LoRAs using an explicit routing strategy and introduce domain labels to facilitate multi-task learning, which helps prevent interference between tasks and ultimately enhances the performance of each individual task. Furthermore, each LoRA model can be iteratively adapted to a new domain, allowing for quick domain-specific adaptation. Experiments on diverse tasks demonstrate superior and robust performance, which can further promote the wide application of domain-specific LLMs.
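The combination of several domain-specific LoRA modules behind a router can be sketched with plain lists. This assumes the standard LoRA form, in which a frozen weight matrix W is augmented with a low-rank product B·A, and it simplifies routing to a dictionary lookup by domain label; the paper's actual router and training procedure are not shown, and all names are hypothetical.

```python
def matvec(w, x):
    """Multiply matrix w (rows x cols) by column vector x (length cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def lora_forward(w_base, lora, x):
    """Compute W x + B (A x) for one LoRA module.

    lora is a pair (a, b) with a of shape r x d_in and b of shape d_out x r.
    """
    a, b = lora
    base = matvec(w_base, x)
    delta = matvec(b, matvec(a, x))
    return [u + v for u, v in zip(base, delta)]


def routed_forward(w_base, loras, domain, x):
    """Assumed explicit routing: select the LoRA module by domain label."""
    if domain not in loras:
        raise KeyError(f"no LoRA module for domain: {domain}")
    return lora_forward(w_base, loras[domain], x)
```

Because each input passes through only the module chosen for its domain, the base weights are shared while the domain-specific updates stay separate, which is the intuition behind reduced interference between tasks.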

Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding
Zichen Wu | Hsiu-Yuan Huang | Fanyi Qu | Yunfang Wu

Deep multimodal semantic understanding that goes beyond the mere superficial content relation mining has received increasing attention in the realm of artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks under this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM). Specifically, we design three experts of soft prompts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representation, and a unified prompt to assist multi-modal interaction. Additionally, we reorganize Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smoothens the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in few-shot setting, our proposed model not only surpasses the 8.2B model InstructBLIP with merely 2% parameters (150M), but also significantly outperforms other widely-used prompt methods on VLMs or task-specific methods.

MKeCL: Medical Knowledge-Enhanced Contrastive Learning for Few-shot Disease Diagnosis
Yutian Zhao | Huimin Wang | Xian Wu | Yefeng Zheng

Artificial intelligence (AI)-aided disease prediction has gained extensive research interest due to its capability to support clinical decision-making. Existing works mainly formulate disease prediction as a multi-label classification problem and use historical Electronic Medical Records (EMR) to train supervised models. However, in real-world clinics, such purely data-driven approaches face two main challenges: 1) long tail problem: there are excessive EMRs for common diseases and insufficient EMRs for rare diseases, thus training over an imbalanced data set could result in a biased model that ignores rare diseases in diagnosis; 2) easily misdiagnosed diseases: some diseases can be easily distinguished while others sharing analogous conditions are much more difficult. General classification models without emphasizing easily misdiagnosed diseases may generate incorrect predictions. To tackle these two problems, we propose a Medical Knowledge-Enhanced Contrastive Learning (MKeCL) approach to disease diagnosis in this paper. MKeCL incorporates medical knowledge graphs and medical licensing exams in modeling in order to compensate for the insufficient information on rare diseases. To handle hard-to-diagnose diseases, MKeCL introduces a contrastive learning strategy to separate diseases that are easily misdiagnosed. Moreover, we establish a new benchmark, named Jarvis-D, which contains clinical EMRs collected from various hospitals. Experiments on real clinical EMRs show that the proposed MKeCL outperforms existing disease prediction approaches, especially in few-shot and zero-shot scenarios.

MLDSP-MA: Multidimensional Attention for Multi-Round Long Dialogue Sentiment Prediction
Yunfei Yin | Congrui Zou | Zheng Yuan | Xianjian Bao

The intelligent chatbot takes dialogue sentiment prediction as the core, and it has to tackle long dialogue sentiment prediction problems in many real-world applications. Current state-of-the-art methods usually employ attention-based dialogue sentiment prediction models. However, as the conversation progresses, more topics are involved and the changes in sentiments become more frequent, which leads to a sharp decline in the accuracy and efficiency of the current methods. Therefore, we propose a Multi-round Long Dialogue Sentiment Prediction based on Multidimensional Attention (MLDSP-MA), which can focus on different topics. In particular, MLDSP-MA leverages a sliding window to capture different topics and traverses all historical dialogues. In each sliding window, the contextual dependency, sentiment persistence, and sentiment infectivity are characterized, and local attention cross fusion is performed. To learn dialogue sentiment globally, global attention is proposed to iteratively learn comprehensive sentiments from historical dialogues, and finally integrate with local attention. We conducted extensive experimental research on publicly available dialogue datasets. The experimental results show that, compared to the current state-of-the-art methods, our model improves by 3.5% in accuracy and 5.7% in Micro-F1 score.
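The sliding-window traversal of the dialogue history can be sketched as follows. Only the windowing is shown; the local and global attention computed over the windows is not. The window size and the step of one utterance are assumptions, since the abstract gives neither.

```python
def sliding_windows(utterances, size):
    """Return every contiguous window of `size` utterances, in order.

    A step of one utterance is assumed, so each historical utterance is
    covered by at least one window. Histories shorter than `size` yield a
    single window holding the whole history.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if len(utterances) <= size:
        return [list(utterances)]
    return [list(utterances[i:i + size])
            for i in range(len(utterances) - size + 1)]
```

Each window would then be passed to a local model, and the per-window outputs combined by a global component, as the abstract describes at a high level.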

Audio Description (AD) aims to generate narrations of information that is not accessible through unimodal hearing in movies to aid the visually impaired in following film narratives. Current solutions rely heavily on manual work, resulting in high costs and limited scalability. While automatic methods have been introduced, they often yield descriptions that are sparse and omit key details. Addressing these challenges, we propose a novel automated pipeline, the Multi-modal Movie Audio Description (MMAD). MMAD harnesses the capabilities of three key modules as well as the power of Llama2 to augment the depth and breadth of the generated descriptions. Specifically, first, we propose an Audio-aware Feature Enhancing Module to provide the model with multi-modal perception capabilities, enriching the background descriptions with a more comprehensive understanding of the environmental features. Second, we propose an Actor-tracking-aware Story Linking Module to aid in the generation of contextual and character-centric descriptions, thereby enhancing the richness of character depictions. Third, we incorporate a Subtitled Movie Clip Contextual Alignment Module, supplying semantic information about various time periods throughout the movie, which facilitates the consideration of the full movie narrative context when describing silent segments, thereby enhancing the richness of the descriptions. Experiments on widely used datasets convincingly demonstrate that MMAD significantly surpasses existing strong baselines in performance, establishing a new state-of-the-art in the field. Our code will be released at https://github.com/Daria8976/MMAD.

Given the long textual product information and the product image, Multi-modal Product Summarization (MPS) aims to increase customers’ desire to purchase by highlighting product characteristics with a short textual summary. Existing MPS methods can produce promising results. Nevertheless, they still 1) lack end-to-end product summarization, 2) lack multi-grained multi-modal modeling, and 3) lack multi-modal attribute modeling. To improve MPS, we propose an end-to-end multi-grained multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce. MMAPS jointly models product attributes and generates product summaries. We design several multi-grained multi-modal tasks to better guide the multi-modal learning of MMAPS. Furthermore, we model product attributes based on both text and image modalities so that multi-modal product characteristics can be manifested in the generated summaries. Extensive experiments on a real large-scale Chinese e-commerce dataset demonstrate that our model outperforms state-of-the-art product summarization methods w.r.t. several summarization metrics. Our code is publicly available at: https://github.com/KDEGroup/MMAPS.

MM-IGLU: Multi-Modal Interactive Grounded Language Understanding
Claudiu Daniel Hromei | Daniele Margiotta | Danilo Croce | Roberto Basili

This paper explores Interactive Grounded Language Understanding (IGLU) challenges within Human-Robot Interaction (HRI). In this setting, a robot interprets user commands related to its environment, aiming to discern whether a specific command can be executed. If faced with ambiguities or incomplete data, the robot poses relevant clarification questions. Drawing from the NeurIPS 2022 IGLU competition, we enrich the dataset by introducing our multi-modal data and natural language descriptions in MM-IGLU: Multi-Modal Interactive Grounded Language Understanding. Utilizing a BART-based model that integrates the user’s statement with the environment’s description, and a cutting-edge Multi-Modal Large Language Model that merges both visual and textual data, we offer a valuable resource for ongoing research in the domain. Additionally, we discuss the evaluation methods for such tasks, highlighting potential limitations imposed by traditional string-match-based evaluations on this intricate multi-modal challenge. Moreover, we provide an evaluation benchmark based on human judgment to address the limits and capabilities of such baseline models. This resource is released on a dedicated GitHub repository at https://github.com/crux82/MM-IGLU.

MNER-MI: A Multi-image Dataset for Multimodal Named Entity Recognition in Social Media
Shizhou Huang | Bo Xu | Changqun Li | Jiabo Ye | Xin Lin

Recently, multimodal named entity recognition (MNER) has emerged as a vital research area within named entity recognition. However, current MNER datasets and methods are predominantly based on text and a single accompanying image, leaving a significant research gap in MNER scenarios involving multiple images. To address the critical research gap and enhance the scope of MNER for real-world applications, we propose a novel human-annotated MNER dataset with multiple images called MNER-MI. Additionally, we construct a dataset named MNER-MI-Plus, derived from MNER-MI, to ensure its generality and applicability. Based on these datasets, we establish a comprehensive set of strong and representative baselines and we further propose a simple temporal prompt model with multiple images to address the new challenges in multi-image scenarios. We have conducted extensive experiments to demonstrate that considering multiple images provides a significant improvement over a single image and can offer substantial benefits for MNER. Furthermore, our proposed method achieves state-of-the-art results on both MNER-MI and MNER-MI-Plus, demonstrating its effectiveness. The datasets and source code can be found at https://github.com/JinFish/MNER-MI.

Modalities Should Be Appropriately Leveraged: Uncertainty Guidance for Multimodal Chinese Spelling Correction
Yongliang Lin | Zhen Zhang | Mengting Hu | Yufei Sun | Yuzhi Zhang

Chinese spelling correction (CSC) aims to detect and correct spelling errors in Chinese texts. Most misspelled characters are phonetically or graphically similar to the correct ones. Thus, recent works introduce multimodal features to improve performance. In this paper, we find that different spelling errors are biased toward different modalities, highlighting the importance of appropriately exploiting multimodal features. To achieve this goal, we propose the UGMSC framework, which incorporates uncertainty into both the feature learning and correction stages. Specifically, the UGMSC framework makes predictions with multimodal features and estimates the uncertainty of the corresponding modalities. Then it dynamically fuses the features of all modalities for model learning, and performs spelling correction under the uncertainty-guided strategy. Experimental results on three public datasets demonstrate that the proposed approach provides a significant improvement compared with previous strong multimodal models. The proposed framework is model-agnostic and can be easily applied to other multimodal models.

Chain-of-thought Distillation (CoTD) aims at distilling the Chain-of-thought (CoT) reasoning ability of large language models (LLMs) into much smaller student models. The core of CoTD is using a large teacher model to generate rationales and fine-tune smaller student models. However, current Chain-of-thought Distillation works have the following limitations: 1) Student models are separately distilled from specific reasoning tasks and lack a collaboration mechanism, hindering the enhancement of reasoning performance through collaboration among various reasoning tasks. 2) The parameter update of student models severely harms the CoT reasoning ability on other unseen reasoning tasks not included in the distillation process. In this work, we introduce a novel CoT Distillation method, MoDE-CoTD, which decouples the CoT reasoning abilities from the student model by distilling multiple LoRA-Experts and freezing the parameters of the student model. Subsequently, LoRA-Experts are combined and adapted to handle both seen and unseen reasoning tasks, enabling collaboration among diverse reasoning tasks to further enhance CoT reasoning performance. Experimental results on 14 datasets (including 4 unseen datasets) demonstrate the strength of MoDE-CoTD, with an average accuracy gain of 6.3% on seen datasets and 7.8% on unseen datasets.

Model-Agnostic Cross-Lingual Training for Discourse Representation Structure Parsing
Jiangming Liu

Discourse Representation Structure (DRS) is an innovative semantic representation designed to capture the meaning of texts with arbitrary lengths across languages. The semantic representation parsing is essential for achieving natural language understanding through logical forms. Nevertheless, the performance of DRS parsing models remains constrained when trained exclusively on monolingual data. To tackle this issue, we introduce a cross-lingual training strategy. The proposed method is model-agnostic yet highly effective. It leverages cross-lingual training data and fully exploits the alignments between languages encoded in pre-trained language models. The experiments conducted on the standard benchmarks demonstrate that models trained using the cross-lingual training method exhibit significant improvements in DRS clause and graph parsing in English, German, Italian and Dutch. Comparing our final models to previous works, we achieve state-of-the-art results in the standard benchmarks. Furthermore, the detailed analysis provides deep insights into the performance of the parsers, offering inspiration for future research in DRS parsing.

Health coaching helps patients achieve personalized and lifestyle-related goals, effectively managing chronic conditions and alleviating mental health issues. It is particularly beneficial, yet cost-prohibitive, for low-socioeconomic status populations due to its highly personalized and labor-intensive nature. In this paper, we propose a neuro-symbolic goal summarizer to support health coaches in keeping track of the goals and a text-units-text dialogue generation model that converses with patients and helps them create and accomplish specific goals for physical activities. Our models outperform previous state-of-the-art while eliminating the need for predefined schema and corresponding annotation. We also propose a new health coaching dataset extending previous work and a metric to measure the unconventionality of the patient’s response based on data difficulty, facilitating potential coach alerts during deployment.

Modeling Orthographic Variation Improves NLP Performance for Nigerian Pidgin
Pin-Jie Lin | Merel Scholman | Muhammed Saeed | Vera Demberg

Nigerian Pidgin is an English-derived contact language and is traditionally an oral language, spoken by approximately 100 million people. No orthographic standard has yet been adopted, and thus the few available Pidgin datasets that exist are characterised by noise in the form of orthographic variations. This contributes to under-performance of models in critical NLP tasks. The current work is the first to describe various types of orthographic variations commonly found in Nigerian Pidgin texts, and model this orthographic variation. The variations identified in the dataset form the basis of a phonetic-theoretic framework for word editing, which is used to generate orthographic variations to augment training data. We test the effect of this data augmentation on two critical NLP tasks: machine translation and sentiment analysis. The proposed variation generation framework augments the training data with new orthographic variants which are relevant for the test set but did not occur in the training set originally. Our results demonstrate the positive effect of augmenting the training data with a combination of real texts from other corpora as well as synthesized orthographic variation, resulting in performance improvements of 2.1 points in sentiment analysis and 1.4 BLEU points in translation to English.

Explanations are pervasive in our lives. Mostly, they occur in dialogical form where an explainer discusses a concept or phenomenon of interest with an explainee. Leaving the explainee with a clear understanding is not straightforward due to the knowledge gap between the two participants. Previous research looked at the interaction of explanation moves, dialogue acts, and topics in successful dialogues with expert explainers. However, daily-life explanations often fail, raising the question of what makes a dialogue successful. In this work, we study explanation dialogues in terms of the interactions between the explainer and explainee and how they correlate with the quality of explanations in terms of a successful understanding on the explainee’s side. In particular, we first construct a corpus of 399 dialogues from the Reddit forum Explain Like I am Five and annotate it for interaction flows and explanation quality. We then analyze the interaction flows, comparing them to those appearing in expert dialogues. Finally, we encode the interaction flows using two language models that can handle long inputs, and we provide empirical evidence for the effectiveness boost gained through the encoding in predicting the success of explanation dialogues.

Modelling and Linking an Old Latin-Portuguese Dictionary to the LiLa Knowledge Base
Lucas Consolin Dezotti | Marco Passarotti | Francesco Mambrini

This paper describes the steps undertaken to include data from Antonio Velez’s bilingual Latin-Portuguese dictionary (Index Totius Artis, 1744) into the LiLa Knowledge Base of interoperable linguistic resources for Latin. The paper focuses on how the lexical and lexicographic information of the source dictionary was modelled by using respectively the Lexicon Model for Ontologies (OntoLex-lemon) and its lexicog module. The linking process of the dictionary entries with those of the LiLa collection of Latin lemmas is detailed, discussing issues in dealing with ambiguities and typographical errors found in the source. The result is the first Latin-Portuguese lexical resource made interoperable with the (meta)data of the other linguistic resources for Latin interlinked in the LiLa Knowledge Base, providing new ways of assessing the dictionary information or using its content as starting point to explore the connections with other interlinked linguistic resources. A couple of use case scenarios illustrate those possibilities.

Modelling Argumentation for an User Opinion Aggregation Tool
Pablo Weingart | Thiemo Wambsganss | Matthias Soellner

We introduce an argumentation annotation scheme that models basic argumentative structure and additional contextual details across diverse user opinion domains. Drawing from established argumentation modeling approaches and related theory on user opinions, the scheme integrates the concepts of argumentative components, specificity, sentiment and aspects of the user opinion domain. Our freely available dataset includes 1,016 user opinions with 7,266 sentences, spanning products from 19 e-commerce categories, restaurants, hotels, local services, and mobile applications. Utilizing the dataset, we trained three transformer-based models, demonstrating their efficacy in predicting the annotated classes for identifying argumentative statements and contextual details from user opinion documents. Finally, we evaluate a prototypical dashboard that integrates the model inferences to aggregate information and rank exemplary products based on a vast array of user opinions. Early results from an experimental evaluation with eighteen users include positive user perceptions but also highlight challenges when condensing detailed argumentative information to users.

The effective use of monolingual and bilingual knowledge represents a critical challenge within the neural machine translation (NMT) community. In this paper, we propose a modular strategy that facilitates the cooperation of these two types of knowledge in translation tasks, while avoiding the issue of catastrophic forgetting and exhibiting superior model generalization and robustness. Our model is comprised of three functionally independent modules: an encoding module, a decoding module, and a transferring module. The former two acquire large-scale monolingual knowledge via self-supervised learning, while the latter is trained on parallel data and responsible for transferring latent features between the encoding and decoding modules. Extensive experiments in multi-domain translation tasks indicate our model yields remarkable performance, with up to 7 BLEU improvements in out-of-domain tests over the conventional pretrain-and-finetune approach. Our codes are available at https://github.com/NLP2CT/MoNMT.

Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level
Iqra Ali | Hidetaka Kamigaito | Taro Watanabe

Paraphrase detection is a task to identify if two sentences are semantically similar or not. It plays an important role in maintaining the integrity of written work such as plagiarism detection and text reuse detection. Formerly, researchers focused on developing large corpora for English. However, no research has been conducted on sentence-level paraphrase detection in the low-resource Pashto language. To bridge this gap, we introduce the first fully manually annotated Pashto sentential paraphrase detection corpus collected from authentic cases in journalism covering 10 different domains, including Sports, Health, Environment, and more. Our proposed corpus contains 6,727 sentences, encompassing 3,687 paraphrased and 3,040 non-paraphrased. Experimental findings reveal that our proposed corpus is sufficient to train XLM-RoBERTa to accurately detect paraphrased sentence pairs in Pashto with an F1 score of 84%. To compare our corpus with those in other languages, we also applied our fine-tuned model to the Indonesian and English paraphrase datasets in a zero-shot manner, achieving F1 scores of 82% and 78%, respectively. This result indicates that the quality of our corpus is not lower than that of commonly used datasets. It is a pioneering contribution to the field. We will publicize a subset of 1,800 instances from our corpus, free from any licensing issues.

Zero-shot dialogue state tracking (DST) transfers knowledge to unseen domains, reducing the cost of annotating new datasets. Previous zero-shot DST models mainly suffer from domain transferring and partial prediction problems. To address these challenges, we propose Mixture of Prefix Experts (MoPE) to establish connections between similar slots in different domains, which strengthens the model transfer performance in unseen domains. Empirical results demonstrate that MoPE-DST achieves the joint goal accuracy of 57.13% on MultiWOZ2.1 and 55.4.

Drawing upon the intuition that aligning different modalities to the same semantic embedding space would allow models to understand states and actions more easily, we propose a new perspective to the offline reinforcement learning (RL) challenge. More concretely, we transform it into a supervised learning task by integrating multimodal and pre-trained language models. Our approach incorporates state information derived from images and action-related data obtained from text, thereby bolstering RL training performance and promoting long-term strategic thinking. We emphasize the contextual understanding of language and demonstrate how decision-making in RL can benefit from aligning states’ and actions’ representation with languages’ representation. Our method significantly outperforms current baselines as evidenced by evaluations conducted on Atari and OpenAI Gym environments. This contributes to advancing offline RL performance and efficiency while providing a novel perspective on offline RL.

Morphemes serve as a strong linguistic feature to capture lexical semantics, with higher coverage than words and more natural than sememes. However, due to the lack of morpheme-informed resources and the expense of manual annotation, morpheme-enhanced methods remain largely unexplored in Computational Linguistics. To address this issue, we propose the task of Morpheme Sense Disambiguation (MSD), with two subtasks in-text and in-word, similar to Word Sense Disambiguation (WSD) and Sememe Prediction (SP), to generalize morpheme features on more tasks. We first build the MorDis resource for Chinese, including MorInv as a morpheme inventory, MorTxt and MorWrd as two types of morpheme-annotated datasets. Next, we provide two baselines in each evaluation; the best model yields a promising precision of 77.66% on in-text MSD and 88.19% on in-word MSD, indicating its comparability with WSD and superiority over SP. Finally, we demonstrate that predicted morphemes achieve comparable performance with the ground-truth ones on a downstream application of Definition Generation (DG). This validates the feasibility and applicability of our proposed tasks. The resources and workflow of MSD will provide new insights and solutions for downstream tasks, including DG as well as WSD, training pre-trained models, etc.

Across a number of sign languages, temporal and spatial characteristics of dominant hand articulation are used to express semantic and grammatical features. In this study of Austrian Sign Language (Österreichische Gebärdensprache, or ÖGS), motion capture data of four Deaf signers is used to quantitatively characterize the kinematic parameters of sign production in verbs and adjectives. We investigate (1) the difference in production between verbs involving a natural endpoint (telic verbs; e.g. arrive) and verbs lacking an endpoint (atelic verbs; e.g. analyze), and (2) adjective signs in intensified vs. non-intensified (plain) forms. Motion capture data analysis using linear-mixed effects models (LME) indicates that both the endpoint marking in verbs, as well as marking of intensification in adjectives, are expressed by movement modulation in ÖGS. While the semantic distinction between verb types (telic/atelic) is marked by higher peak velocity and shorter duration for telic signs compared to atelic ones, the grammatical distinction (intensification) in adjectives is expressed by longer duration for intensified compared to non-intensified adjectives. The observed individual differences of signers might be interpreted as personal signing style.

Motion Generation from Fine-grained Textual Descriptions
Kunhang Li | Yansong Feng

The task of **text2motion** is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., _“A man squats.”_, fine-grained descriptions specifying movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with step-by-step instructions with pseudo-code compulsory checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at [https://github.com/KunhangL/finemotiondiffuse](https://github.com/KunhangL/finemotiondiffuse).

Motivational Interviewing Transcripts Annotated with Global Scores
Ben Cohen | Moreah Zisquit | Stav Yosef | Doron Friedman | Kfir Bar

Motivational interviewing (MI) is a counseling approach that aims to increase intrinsic motivation and commitment to change. Despite its effectiveness in various disorders such as addiction, weight loss, and smoking cessation, publicly available annotated MI datasets are scarce, limiting the development and evaluation of MI language generation models. We present MI-TAGS, a new annotated dataset of MI therapy sessions written in English collected from video recordings available on public sources. The dataset includes 242 MI demonstration transcripts annotated with the MI Treatment Integrity (MITI) 4.2 therapist behavioral codes and global scores, and Client Language EAsy Rating (CLEAR) 1.0 tags for client speech. In this paper we describe the process of data collection, transcription, and annotation, and provide an analysis of the new dataset. Additionally, we explore the potential use of the dataset for training language models to perform several MITI classification tasks; our results suggest that models may be able to automatically provide utterance-level annotation as well as global scores, with performance comparable to human annotators.

Large language models (LLMs) have demonstrated impressive performance in various natural language processing (NLP) tasks. However, there is limited understanding of how well LLMs perform in specific domains (e.g., the intellectual property (IP) domain). In this paper, we contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain. The MoZIP benchmark includes three challenging tasks: IP multiple-choice quiz (IPQuiz), IP question answering (IPQA), and patent matching (PatentMatch). In addition, we also develop a new IP-oriented multilingual large language model (called MoZi), which is a BLOOMZ-based model that has been supervised fine-tuned with multilingual IP-related text data. We evaluate our proposed MoZi model and four well-known LLMs (i.e., BLOOMZ, BELLE, ChatGLM and ChatGPT) on the MoZIP benchmark. Experimental results demonstrate that MoZi outperforms BLOOMZ, BELLE and ChatGLM by a noticeable margin, while it had lower scores compared with ChatGPT. Notably, the performance of current LLMs on the MoZIP benchmark has much room for improvement, and even the most powerful ChatGPT does not reach the passing level. Our source code, data, and models are available at https://github.com/AI-for-Science/MoZi.

MRC-based Nested Medical NER with Co-prediction and Adaptive Pre-training
Xiaojing Du | Hanjie Zhao | Danyan Xing | Yuxiang Jia | Hongying Zan

In medical information extraction, medical Named Entity Recognition (NER) is indispensable, playing a crucial role in developing medical knowledge graphs, enhancing medical question-answering systems, and analyzing electronic medical records. The challenge in medical NER arises from the complex nested structures and sophisticated medical terminologies, distinguishing it from its counterparts in traditional domains. In response to these complexities, we propose a medical NER model based on Machine Reading Comprehension (MRC), which uses a task-adaptive pre-training strategy to improve the model’s capability in the medical field. Meanwhile, our model introduces multiple word-pair embeddings and multi-granularity dilated convolution to enhance the model’s representation ability and uses a combined predictor of Biaffine and MLP to improve the model’s recognition performance. Experimental evaluations conducted on the CMeEE, a benchmark for Chinese nested medical NER, demonstrate that our proposed model outperforms the compared state-of-the-art (SOTA) models.

As a fresh way to improve the user viewing experience, videos of time-sync comments have attracted a lot of interest. Many efforts have been made to explore the effectiveness of time-sync comments for various applications. However, due to the complexity of interactions among users, videos, and comments, it still remains challenging to understand users’ behavior on time-sync comments. Along this line, we study the problem of time-sync comment behavior prediction with considerations of both historical behaviors and multi-modal information of visual frames and textual comments. Specifically, we propose a novel Multi-modal short- and long-Range Temporal Convolutional Network model, namely MRT. Firstly, we design two amplified Temporal Convolutional Networks with different sizes of receptive fields, to capture both short- and long-range surrounding contexts for each frame and time-sync comments. Then, we design a bottle-neck fusion module to obtain the multi-modal enhanced representation. Furthermore, we take the user preferences into consideration to generate the personalized multi-modal semantic representation at each timestamp. Finally, we utilize the binary cross-entropy loss to optimize MRT on the basis of users’ historical records. Through comparing with representative baselines, we demonstrate the effectiveness of MRT and qualitatively verify the necessity and utility of short- and long-range contextual and multi-modal information through extensive experiments.

Conversational humor is the key to capturing dialogue semantics and dialogue comprehension, which is usually generated in multiple modalities, such as linguistic rhetoric (textual modality), exaggerated facial expressions or movements (visual modality), and quirky intonation (acoustic modality). However, existing multimodal corpora for conversation humor are coarse-grained, and the modality is insufficient to support the conversational humor recognition task. This paper designed an annotation scheme for multimodal humor datasets, and constructed a corpus based on a Chinese sitcom for conversational humor recognition, named MUCH. The MUCH corpus consists of 34,804 utterances in total, and 7,079 of them are humorous. We employed both unimodal and multimodal methods to test our MUCH corpus. Experimental results showed that the multimodal approach could achieve 75.94% in terms of F1-score and surpassed the performance of most unimodal methods, which demonstrated that the MUCH corpus was effective for multimodal humor recognition tasks.

Multi-Channel Spatio-Temporal Transformer for Sign Language Production
Xiaohan Ma | Rize Jin | Tae-Sun Chung

The task of Sign Language Production (SLP) in machine learning involves converting text-based spoken language into corresponding sign language expressions. Sign language conveys meaning through the continuous movement of multiple articulators, including manual and non-manual channels. However, most current Transformer-based SLP models convert these multi-channel sign poses into a unified feature representation, ignoring the inherent structural correlations between channels. This paper introduces a novel approach called MCST-Transformer for skeletal sign language production. It employs multi-channel spatial attention to capture correlations across various channels within each frame, and temporal attention to learn sequential dependencies for each channel over time. Additionally, the paper explores and experiments with multiple fusion techniques to combine the spatial and temporal representations into naturalistic sign sequences. To validate the effectiveness of the proposed MCST-Transformer model and its constituent components, extensive experiments were conducted on two benchmark sign language datasets from diverse cultures. The results demonstrate that this new approach outperforms state-of-the-art models on both datasets.

MULTICOLLAB: A Multimodal Corpus of Dialogues for Analyzing Collaboration and Frustration in Language
Michael Peechatt | Cecilia Ovesdotter Alm | Reynold Bailey

This paper addresses an existing resource gap for studying complex emotional states when a speaker collaborates with a partner to solve a task. We present a novel dialogue resource — the MULTICOLLAB corpus — where two interlocutors, an instructor and builder, communicated through a Zoom call while sensors recorded eye gaze, facial action units, and galvanic skin response, with transcribed speech signals, resulting in a unique, heavily multimodal corpus. The builder received instructions from the instructor. Half of the builders were privately told to disobey the instructor’s directions. After the task, participants watched the Zoom recording and annotated their instances of frustration. In this study, we introduce this new corpus and perform computational experiments with time series transformers, using early fusion through time for sensor data and late fusion for speech transcripts. We then average predictions from both methods to recognize instructor frustration. Using sensor and speech data in a 4.5 second time window, we find that the fusion of both models yields 21% improvement in classification accuracy (with a precision of 79% and F1 of 63%) over a comparison baseline, demonstrating that complex emotions can be recognized when rich multimodal data from transcribed spoken dialogue and biophysical sensor data are fused.
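
The prediction-averaging step mentioned in the abstract above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' code: the function name `fuse_predictions`, the 0.5 decision threshold, and the toy probabilities are hypothetical, and both models are assumed to emit one frustration probability per aligned time window.

```python
def fuse_predictions(sensor_probs, speech_probs, threshold=0.5):
    """Average two models' per-window probabilities and threshold the result.

    sensor_probs / speech_probs: frustration probabilities for the same
    sequence of time windows (hypothetical inputs for illustration).
    threshold: assumed decision boundary, not taken from the paper.
    """
    if len(sensor_probs) != len(speech_probs):
        raise ValueError("both models must score the same time windows")
    # Unweighted mean of the two predictions for each window.
    fused = [(s + t) / 2 for s, t in zip(sensor_probs, speech_probs)]
    # A window is labelled "frustrated" when the mean reaches the threshold.
    return [p >= threshold for p in fused]


if __name__ == "__main__":
    # Toy values only; they do not come from the MULTICOLLAB corpus.
    print(fuse_predictions([0.9, 0.2, 0.6], [0.5, 0.4, 0.3]))
```

Averaging is the simplest way to combine the two models; a weighted mean or a learned combiner would be alternative design choices that the abstract does not describe.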

Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean
Dojun Park | Sebastian Padó

Almost all frameworks for the manual or automatic evaluation of machine translation characterize the quality of an MT output with a single number. An exception is the Multidimensional Quality Metrics (MQM) framework which offers a fine-grained ontology of quality dimensions for scoring (such as style, fluency, accuracy, and terminology). Previous studies have demonstrated the feasibility of MQM annotation but there are, to our knowledge, no computational models that predict MQM scores for novel texts, due to a lack of resources. In this paper, we address these shortcomings by (a) providing a 1200-sentence MQM evaluation benchmark for the language pair English-Korean and (b) reframing MT evaluation as the multi-task problem of simultaneously predicting several MQM scores using SOTA language models, both in a reference-based MT evaluation setup and a reference-free quality estimation (QE) setup. We find that the reference-free setup outperforms its counterpart in the style dimension while reference-based models retain an edge regarding accuracy. Overall, RemBERT emerges as the most promising model. Through our evaluation, we offer an insight into the translation quality in a more fine-grained, interpretable manner.

Multi-domain Hate Speech Detection Using Dual Contrastive Learning and Paralinguistic Features
Somaiyeh Dehghan | Berrin Yanıkoğlu

Social networks have become venues where people can share and spread hate speech, especially when the platforms allow users to remain anonymous. Hate speech can have significant social and cultural effects, especially when it targets specific groups of people in terms of religion, race, ethnicity, culture or a specific social situation such as immigrants and refugees. In this study, we propose a hate speech detection model, BERTurk-DualCL, using a mixed objective with contrastive learning loss that is combined with the traditional cross-entropy loss used for classification. In addition, we study the effects of paralinguistic features, namely emojis and hashtags, on the performance of our model. We trained and evaluated our model on tweets in four different topics with heated discussions from two separate datasets, ranging from discussions about migrants to the Israel-Palestine conflict. Our multi-domain model outperforms comparable results in literature and the average results of four domain-specific models, achieving a macro-F1 score of 81.04% and 58.89% on two- and five-class tasks respectively.
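
The idea of a mixed objective, a contrastive loss combined with cross-entropy, can be illustrated with the sketch below. It is not the BERTurk-DualCL objective: a generic supervised contrastive loss is swapped in as a stand-in, and the function names, the `weight` and `temperature` parameters, and the additive combination are assumptions made purely for illustration.

```python
import math


def cross_entropy(class_probs, label):
    """Negative log-probability assigned to the gold label."""
    return -math.log(class_probs[label])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def supervised_contrastive(embeddings, labels, temperature=0.5):
    """Generic supervised contrastive loss over one batch (stand-in only).

    Each anchor is pulled towards same-label examples and pushed away from
    the rest; anchors with no same-label partner are skipped.
    """
    per_anchor = []
    for i, anchor in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue
        denom = sum(math.exp(dot(anchor, embeddings[j]) / temperature) for j in others)
        log_probs = [
            math.log(math.exp(dot(anchor, embeddings[j]) / temperature) / denom)
            for j in positives
        ]
        per_anchor.append(-sum(log_probs) / len(positives))
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


def mixed_objective(class_probs, labels, embeddings, weight=0.5):
    """Mean cross-entropy plus a weighted contrastive term (assumed form)."""
    ce = sum(cross_entropy(p, y) for p, y in zip(class_probs, labels)) / len(labels)
    return ce + weight * supervised_contrastive(embeddings, labels)
```

In this sketch the contrastive term shrinks when examples that share a label have similar embeddings, so it acts as an extra pressure on the representation while cross-entropy still drives the classification itself.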

Multi-Grained Conversational Graph Network for Retrieval-based Dialogue Systems
Quan Tu | Chongyang Tao | Rui Yan

Retrieval-based dialogue agents aim at selecting a proper response according to multi-turn conversational history. Existing methods have achieved great progress in terms of retrieval accuracy on benchmarks with pre-trained language models. However, these methods simply concatenate all turns in the dialogue history as the input, ignoring the dialogue dependency and structural information between the utterances. Besides, they usually reason the relationship of the context-response pair at a single level of abstraction (e.g., utterance level), which cannot comprehensively capture the fine-grained relation between the context and response. In this paper, we present the multi-grained conversational graph network (MCGN) that considers multiple levels of abstraction from dialogue histories and semantic dependencies within multi-turn dialogues to address these limitations. Evaluation results on two benchmarks indicate that the proposed multi-grained conversational graph network is helpful for dialogue context understanding and can bring consistent and significant improvement over the state-of-the-art methods.

Multi-Granularity Fusion Text Semantic Matching Based on WoBERT
Hongchun Yu | Wei Pan | Xing Fan | Hanqi Li

Text semantic matching is crucial in natural language processing, applied in information retrieval, question answering, and recommendation systems. Traditional text-matching methods struggle with semantic nuances in short text. Recent advancements in multi-granularity representation learning have led to increased interest in improving text semantic matching models. We propose a novel multi-granularity fusion model that harnesses WoBERT, a pre-trained language model, to enhance the accuracy of text semantic information capture. Initially, we process text using WoBERT to acquire semantic representations, effectively capturing individual text semantic nuances. Next, we employ a soft attention alignment mechanism, enabling multi-granularity fusions among characters, words, and sentences, thus further improving matching performance. Our approach was evaluated through experiments on common Chinese short text matching datasets, BQ and LCQMC. Results reveal a significant improvement in performance compared to traditional methods, particularly in terms of accuracy.

MultiLeg: Dataset for Text Sanitisation in Less-resourced Languages
Rinalds Vīksna | Inguna Skadiņa

Text sanitization is the task of detecting and removing personal information from the text. While it has been well-studied in monolingual settings, today, there is also a need for multilingual text sanitization. In this paper, we introduce MultiLeg: a parallel, multilingual named entity (NE) dataset consisting of documents from the Court of Justice of the European Union annotated with semantic categories suitable for text sanitization. The dataset is available in 8 languages, and it contains 3082 parallel text segments for each language. We also show that the pseudonymized dataset remains useful for downstream tasks.

Understanding the relation between the meanings of words is an important part of comprehending natural language. Prior work has either focused on analysing lexical semantic relations in word embeddings or probing pretrained language models (PLMs), with some exceptions. Given the rarity of highly multilingual benchmarks, it is unclear to what extent PLMs capture relational knowledge and are able to transfer it across languages. To start addressing this question, we propose MultiLexBATS, a multilingual parallel dataset of lexical semantic relations adapted from BATS in 15 languages including low-resource languages, such as Bambara, Lithuanian, and Albanian. As an experiment on cross-lingual transfer of relational knowledge, we test the PLMs’ ability to (1) capture analogies across languages, and (2) predict translation targets. We find considerable differences across relation types and languages with a clear preference for hypernymy and antonymy as well as Romance languages.

Multilingual Brain Surgeon: Large Language Models Can Be Compressed Leaving No Language behind
Hongchuan Zeng | Hongshen Xu | Lu Chen | Kai Yu

Large Language Models (LLMs) have ushered in a new era in Natural Language Processing, but their massive size demands effective compression techniques for practicality. Although numerous model compression techniques have been investigated, they typically rely on a calibration set that overlooks the multilingual context and results in significant accuracy degradation for low-resource languages. This paper introduces Multilingual Brain Surgeon (MBS), a novel calibration data sampling method for multilingual LLMs compression. MBS overcomes the English-centric limitations of existing methods by sampling calibration data from various languages proportionally to the language distribution of the model training datasets. Our experiments, conducted on the BLOOM multilingual LLM, demonstrate that MBS improves the performance of existing English-centric compression methods, especially for low-resource languages. We also uncover the dynamics of language interaction during compression, revealing that the larger the proportion of a language in the training set and the more similar the language is to the calibration language, the better performance the language retains after compression. In conclusion, MBS presents an innovative approach to compressing multilingual LLMs, addressing the performance disparities and improving the language inclusivity of existing compression techniques. Keywords: Large Language Model, Multilingual Model Compression
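
The proportional sampling idea described in the abstract above can be sketched roughly as follows. This is a minimal illustration rather than the MBS implementation: the function names, the largest-remainder rounding, the fixed seed, and the toy language shares are all assumptions introduced here.

```python
import random


def allocate_calibration_samples(language_share, total_samples):
    """Split total_samples across languages in proportion to language_share.

    language_share maps a language code to its (assumed) share of the
    training data; the shares need not sum to 1.
    """
    total_share = sum(language_share.values())
    exact = {
        lang: total_samples * share / total_share
        for lang, share in language_share.items()
    }
    counts = {lang: int(value) for lang, value in exact.items()}
    # Largest-remainder rounding so the counts add up to total_samples.
    leftover = total_samples - sum(counts.values())
    by_remainder = sorted(exact, key=lambda lang: exact[lang] - counts[lang], reverse=True)
    for lang in by_remainder[:leftover]:
        counts[lang] += 1
    return counts


def sample_calibration_set(corpus_by_language, language_share, total_samples, seed=0):
    """Draw a calibration set whose language mix follows language_share."""
    rng = random.Random(seed)
    counts = allocate_calibration_samples(language_share, total_samples)
    calibration = []
    for lang, n in counts.items():
        pool = corpus_by_language[lang]
        # Never request more segments than the language pool contains.
        calibration.extend(rng.sample(pool, min(n, len(pool))))
    return calibration
```

For example, with hypothetical shares of 6, 3 and 1 for three languages and a budget of 20 calibration segments, the allocation would be 12, 6 and 2 segments respectively.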

Multilingual Coreference Resolution in Low-resource South Asian Languages
Ritwik Mishra | Pooja Desur | Rajiv Ratn Shah | Ponnurangam Kumaraguru

Coreference resolution involves the task of identifying text spans within a discourse that pertain to the same real-world entity. While this task has been extensively explored in the English language, there has been a notable scarcity of publicly accessible resources and models for coreference resolution in South Asian languages. We introduce a Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages using off-the-shelf tools for translation and word-alignment. Nearly all of the predicted translations successfully pass a sanity check, and 75% of English references align with their predicted translations. Using multilingual encoders, two off-the-shelf coreference resolution models were trained on a concatenation of TransMuCoRes and a Hindi coreference resolution dataset with manual annotations. The best performing model achieved a score of 64 and 68 for LEA F1 and CoNLL F1, respectively, on our test-split of Hindi golden set. This study is the first to evaluate an end-to-end coreference resolution model on a Hindi golden set. Furthermore, this work underscores the limitations of current coreference evaluation metrics when applied to datasets with split antecedents, advocating for the development of more suitable evaluation metrics.

Multilingual Generation in Abstractive Summarization: A Comparative Study
Jinpeng Li | Jiaze Chen | Huadong Chen | Dongyan Zhao | Rui Yan

The emergence of pre-trained models marks a significant juncture for multilingual generation, offering unprecedented capabilities to comprehend and produce text across multiple languages. These models display commendable efficiency in high-resource languages. However, their performance notably falters in low-resource languages due to the extensive linguistic diversity encountered. Moreover, the lack of thorough analysis in existing work impairs the discovery of effective multilingual strategies, further complicating the advancement of current multilingual generation systems. This paper aims to appraise the efficacy of multilingual generation tasks, with a focus on summarization, through three resource availability scenarios: high-resource, low-resource, and zero-shot. We classify multilingual generation methodologies into three foundational categories based on their underlying modeling principles: Fine-tuning, Parameter-isolation, and Constraint-based approaches. Following this classification, we conduct a comprehensive comparative study of these methodologies across different resource contexts using two datasets that span six languages. This analysis provides insights into the unique advantages and limitations of each method. In addition, we introduce an innovative yet simple automatic metric LANGM designed to mitigate the prevalent problem of spurious correlations associated with language mixing. LANGM accurately measures the degree of code-mixing at the language level. Finally, we highlight several challenges and suggest potential avenues for future inquiry, aiming to spur further advancements within the field of multilingual text generation.

Multilinguality or Back-translation? A Case Study with Estonian
Elizaveta Korotkova | Taido Purason | Agnes Luhtaru | Mark Fishel

Machine translation quality is highly reliant on large amounts of training data, and, when a limited amount of parallel data is available, synthetic back-translated or multilingual data can be used in addition. In this work, we introduce SynEst, a synthetic corpus of translations from 11 languages into Estonian which totals over 1 billion sentence pairs. Using this corpus, we investigate whether adding synthetic or English-centric additional data yields better translation quality for translation directions that do not include English. Our results show that while both strategies are effective, synthetic data gives better results. Our final models improve the performance of the baseline No Language Left Behind model while retaining its source-side multilinguality.

Multilingual Sentence-T5: Scalable Sentence Encoders for Multilingual Applications
Chihiro Yano | Akihiko Fukuchi | Shoko Fukasawa | Hideyuki Tachibana | Yotaro Watanabe

Prior work on multilingual sentence embedding has demonstrated that the efficient use of natural language inference (NLI) data to build high-performance models can outperform conventional methods. However, the potential benefits from the recent “exponential” growth of language models with billions of parameters have not yet been fully explored. In this paper, we introduce Multilingual Sentence T5 (m-ST5), as a larger model of NLI-based multilingual sentence embedding, by extending Sentence T5, an existing monolingual model. By employing the low-rank adaptation (LoRA) technique, we have achieved a successful scaling of the model’s size to 5.7 billion parameters. We conducted experiments to evaluate the performance of sentence embedding and verified that the method outperforms the NLI-based prior approach. Furthermore, we also have confirmed a positive correlation between the size of the model and its performance. It was particularly noteworthy that languages with fewer resources or those with less linguistic similarity to English benefited more from the parameter increase. Our model is available at https://huggingface.co/pkshatech/m-ST5.

Multilingual Substitution-based Word Sense Induction
Denis Kokosinskii | Nikolay Arefyev

Word Sense Induction (WSI) is the task of discovering senses of an ambiguous word by grouping usages of this word into clusters corresponding to these senses. Many approaches were proposed to solve WSI in English and a few other languages, but these approaches are not easily adaptable to new languages. We present multilingual substitution-based WSI methods that support any of 100 languages covered by the underlying multilingual language model with minimal to no adaptation required. Despite the multilingual capabilities, our methods perform on par with the existing monolingual approaches on popular English WSI datasets. At the same time, they will be most useful for lower-resourced languages, which lack the lexical resources available for English and thus have a higher demand for unsupervised methods like WSI.

Multilingual Turn-taking Prediction Using Voice Activity Projection
Koji Inoue | Bing’er Jiang | Erik Ekstedt | Tatsuya Kawahara | Gabriel Skantze

This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two different audio encoders, contrastive predictive coding (CPC) pre-trained on English, with a recent model based on multilingual wav2vec 2.0 (MMS).

Multimodal and Multilingual Laughter Detection in Stand-Up Comedy Videos
Anna Kuznetsova | Carlo Strapparava

This paper presents the development of a novel multimodal multilingual dataset in Russian and English, with a particular emphasis on the exploration of laughter detection techniques. Data was collected from YouTube stand-up comedy videos with manually annotated subtitles, and our research covers data preparation and laughter labeling. We explore two laughter detection approaches presented in the literature: peak detection using preprocessed voiceless audio with an energy-based algorithm and machine learning approach with pretrained models to identify laughter presence and duration. While the machine learning approach currently outperforms peak detection in accuracy and generalization, the latter shows promise and warrants further study. Additionally, we explore unimodal and multimodal humor detection on the new dataset, showing the effectiveness of neural models in capturing humor in both languages, even with textual data. Multimodal experiments indicate that even basic models benefit from visual data, improving detection results. However, further research is needed to enhance laughter detection labeling quality and fully understand the impact of different modalities in a multimodal and multilingual context.

Multimodal Behaviour in an Online Environment: The GEHM Zoom Corpus Collection
Patrizia Paggio | Manex Agirrezabal | Costanza Navarretta | Leo Vitasovic

This paper introduces a novel multimodal corpus consisting of 12 video recordings of Zoom meetings held in English by an international group of researchers from September 2021 to March 2023. The meetings have an average duration of about 40 minutes each, for a total of 8 hours. The number of participants varies from 5 to 9 per meeting. The participants’ speech was transcribed automatically using WhisperX, while visual coordinates of several keypoints of the participants’ head, their shoulders and wrists, were extracted using OpenPose. The audio-visual recordings will be distributed together with the orthographic transcription as well as the visual coordinates. In the paper we describe the way the corpus was collected, transcribed and enriched with the visual coordinates, we give descriptive statistics concerning both the speech transcription and the visual keypoint values and we present and discuss visualisations of these values. Finally, we carry out a short preliminary analysis of the role of feedback in the meetings, and show how visualising the coordinates extracted via OpenPose can be used to see how gestural behaviour supports the use of feedback words during the interaction.

Event coreference resolution (ECR) is the task of determining whether distinct mentions of events within a multi-document corpus are actually linked to the same underlying occurrence. Images of the events can help facilitate resolution when language is ambiguous. Here, we propose a multimodal cross-document event coreference resolution method that integrates visual and textual cues with a simple linear map between vision and language models. As existing ECR benchmark datasets rarely provide images for all event mentions, we augment the popular ECB+ dataset with event-centric images scraped from the internet and generated using image diffusion models. We establish three methods that incorporate images and text for coreference: 1) a standard fused model with finetuning, 2) a novel linear mapping method without finetuning and 3) an ensembling approach based on splitting mention pairs by semantic and discourse-level difficulty. We evaluate on 2 datasets: the augmented ECB+, and AIDA Phase 1. Our ensemble systems using cross-modal linear mapping establish an upper limit (91.9 CoNLL F1) on ECB+ ECR performance given the preprocessing assumptions used, and establish a novel baseline on AIDA Phase 1. Our results demonstrate the utility of multimodal information in ECR for certain challenging coreference problems, and highlight a need for more multimodal resources in the coreference resolution space.

Multimodal Cross-lingual Phrase Retrieval
Chuanqi Dong | Wenjie Zhou | Xiangyu Duan | Yuqi Zhang | Min Zhang

Cross-lingual phrase retrieval aims to retrieve parallel phrases among languages. Current approaches only deal with the textual modality. Multimodal data resources and explorations for multimodal cross-lingual phrase retrieval (MXPR) are lacking. In this paper, we create the first MXPR data resource and propose a novel approach for MXPR to explore the effectiveness of multi-modality. The MXPR data resource is built by marrying the benchmark dataset for textual cross-lingual phrase retrieval with Wikimedia Commons, which is a media store containing tremendous texts and related images. In the built resource, the phrase pairs of the textual benchmark dataset are equipped with their related images. Based on this novel data resource, we introduce a strategy to bridge the gap between different modalities by multimodal relation generation with a large multimodal pre-trained model and consistency training. Experiments on a benchmark dataset covering eight language pairs show that our MXPR approach, which deals with multimodal phrases, performs significantly better than pure textual cross-lingual phrase retrieval.

Multimodal Language Models Show Evidence of Embodied Simulation
Cameron R. Jones | Sean Trott

Multimodal large language models (MLLMs) are gaining popularity as partial solutions to the “symbol grounding problem” faced by language models trained on text alone. However, little is known about whether and how these multiple modalities are integrated. We draw inspiration from analogous work in human psycholinguistics on embodied simulation, i.e., the hypothesis that language comprehension is grounded in sensorimotor representations. We show that MLLMs are sensitive to implicit visual features like object shape (e.g., “The egg was in the skillet” implies a frying egg rather than one in a shell). This suggests that MLLMs activate implicit information about object shape when it is implied by a verbal description of an event. We find mixed results for color and orientation, and rule out the possibility that this is due to models’ insensitivity to those features in our dataset overall. We suggest that both human psycholinguistics and computational models of language could benefit from cross-pollination, e.g., with the potential to establish whether grounded representations play a functional role in language processing.

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment
Ming Zhang | Ke Chang | Yunfang Wu

Multi-modal semantic understanding requires integrating information from different modalities to extract users’ real intention behind words. Most previous work applies a dual-encoder structure to separately encode image and text, but fails to learn cross-modal feature alignment, making it hard to achieve cross-modal deep information interaction. This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment, which projects the features derived from different modalities into a unified deep space. On multi-modal sarcasm detection (MMSD) and multi-modal sentiment analysis (MMSA) tasks, the experimental results show that our proposed model significantly outperforms several baselines, and our feature alignment strategy brings obvious performance gain over models with different aggregating methods and models even enriched with knowledge. More importantly, our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks. Our source codes are available at https://github.com/ChangKe123/CLFA.

Product review summarization aims to generate a concise summary based on product reviews to facilitate purchasing decisions. This intricate task gives rise to three challenges in existing work: factual accuracy, aspect comprehensiveness, and content relevance. In this paper, we first propose an FB-Thinker framework to improve the summarization ability of LLMs with multi-objective forward reasoning and multi-reward backward refinement. To enable LLM with these dual capabilities, we present two Chinese product review summarization datasets, Product-CSum and Product-CSum-Cross, for both instruction-tuning and cross-domain evaluation. Specifically, these datasets are collected via GPT-assisted manual annotations from an online forum and public datasets. We further design an evaluation mechanism Product-Eval, integrating both automatic and human evaluation across multiple dimensions for product summarization. Experimental results show the competitiveness and generalizability of our proposed framework in the product review summarization tasks.

Knowledge graph completion (KGC) is a widely used method to tackle incompleteness in knowledge graphs (KGs) by making predictions for missing links. Description-based KGC leverages pre-trained language models to learn entity and relation representations with their names or descriptions, which shows promising results. However, the performance of description-based KGC is still limited by the quality of text and the incomplete structure, as it lacks sufficient entity descriptions and relies solely on relation names, leading to sub-optimal results. To address this issue, we propose MPIKGC, a general framework to compensate for the deficiency of contextualized knowledge and improve KGC by querying large language models (LLMs) from various perspectives, which involves leveraging the reasoning, explanation, and summarization capabilities of LLMs to expand entity descriptions, understand relations, and extract structures, respectively. We conducted extensive evaluation of the effectiveness and improvement of our framework based on four description-based KGC models, for both link prediction and triplet classification tasks. All codes and generated data will be publicly available after review.

Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.
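A relative word error rate improvement expresses the WER reduction as a fraction of the baseline WER rather than as an absolute difference in percentage points. The sketch below shows that arithmetic; the function name and the example values are illustrative assumptions and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER improvement in percent: the share of the baseline
    error rate that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline at 10.0% WER and a new model at 8.0% WER
# differ by 2 absolute points, which is a 20% relative improvement.
improvement = relative_wer_improvement(10.0, 8.0)
print(f"{improvement:.1f}% relative improvement")
```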

The emotional support conversation (ESC) task aims to relieve the emotional distress of users who experience high-intensity negative emotions. However, because previous methods neglect emotion intensity modelling, which is essential for ESC, they fail to capture the transition of emotion intensity effectively. To this end, we propose a Multi-stream information Fusion Framework (MFF-ESC) to thoroughly fuse three streams (text semantics stream, emotion intensity stream, and feedback stream) for the modelling of emotion intensity, based on a designed multi-stream fusion unit. Given the difficulty of modelling subtle transitions of emotion intensity and the strong correlation between emotion intensity and feedback, we use the KL divergence between the feedback distribution and the emotion intensity distribution to further guide the learning of emotion intensities. Experimental results on automatic and human evaluations indicate the effectiveness of our method.
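For readers unfamiliar with the quantity mentioned above, the following minimal sketch computes the KL divergence between two discrete distributions. It is a generic illustration and not the MFF-ESC training objective: the direction KL(feedback || intensity), the three-bin distributions, and the `eps` floor are assumptions made here.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length
    sequences of probabilities. Terms with p_i == 0 contribute nothing;
    q_i is floored at `eps` to avoid division by zero."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three intensity levels (low, medium, high).
feedback_dist = [0.2, 0.5, 0.3]
intensity_dist = [0.1, 0.6, 0.3]
divergence = kl_divergence(feedback_dist, intensity_dist)
print(divergence)
```

The value is zero when the two distributions coincide and grows as they diverge, which is what makes it usable as a guiding signal during training.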

Multi-Tiered Cantonese Word Segmentation
Charles Lam | Chaak-ming Lau | Jackson L. Lee

Word segmentation for Chinese text data is essential for compiling corpora and any other tasks where the notion of “word” is assumed, since Chinese orthography does not have conventional word boundaries as languages such as English do. A perennial issue, however, is that there is no consensus about the definition of “word” in Chinese, which makes word segmentation challenging. Recent work in Chinese word segmentation has begun to embrace the idea of multiple word segmentation possibilities. In a similar spirit, this paper focuses on Cantonese, another major Chinese variety. We propose a linguistically motivated, multi-tiered word segmentation system for Cantonese, and release a Cantonese corpus of 150,000 characters word-segmented by this proposal. Our work will be of interest to researchers whose work involves Cantonese corpus data.

Murre24: Dialect Identification of Finnish Internet Forum Messages
Olli Kuparinen

This paper presents Murre24, a collection of dialectal messages posted on the largest Finnish internet forum, Suomi24. The messages posted in Finnish on the forum between 2001 and 2020 are classified to present either the standard language, one of the seven traditional dialects, a colloquial style or the Helsinki slang. We present a manually annotated dataset used to train dialect identification models as well as the automatic annotation of almost 94 million messages in total. We experiment with five different dialect identification methods and evaluate them on dialectally balanced and random test samples. The best performing method for differentiating standard Finnish from non-standard Finnish is a character n-gram based support vector machine (SVM), while fine-tuning a BERT-based model achieves best scores in the final dialect identification task. According to the automatic classification, most of the messages written on the forum are in standard Finnish, and most of the non-standard messages are in a colloquial variety used typically by young speakers in Finland. We moreover show that the proportion of non-standard messages declines over time, but the proportion of the traditional dialects stays relatively steady.

MVP: Minimal Viable Phrase for Long Text Understanding
Louis Clouatre | Amal Zouaq | Sarath Chandar

A recent renewal in interest in long text understanding has sparked the emergence of high-quality long text benchmarks, as well as new models demonstrating significant performance improvements on these benchmarks. However, gauging the implication of these advancements based solely on the length of the input text offers limited insight. Such benchmarks may require models to parse long-range dependencies or merely to locate and comprehend the relevant paragraph within a longer text. This work introduces the Minimal Viable Phrase (MVP), a novel metric that determines, through perturbations to the input text, the shortest average text length that needs to be preserved to execute the task with limited performance degradation. Our evaluation of the popular SCROLLS benchmark reveals that only one of its seven tasks necessitates an MVP of over 512 tokens–the maximum text length manageable by the previous generation of pre-trained models. We highlight the limited need for understanding long-range dependencies in resolving these tasks, discuss the specific design decisions that seem to have led to the QuALITY task requiring reliance on long-range dependencies to be solved, and point out specific modeling choices that seem to outperform on the QuALITY task.

MWE-Finder: A Demonstration
Jan Odijk | Martin Kroon | Tijmen Baarda | Ben Bonfil | Sheean Spoel

This paper introduces and demonstrates MWE Finder, an application to search for flexible multiword expressions (MWEs) in Dutch text corpora, starting from an example. If the example is in canonical form, the application automatically generates three queries to search for sentences that contain an occurrence of the MWE and thus enables efficient analysis of its properties. Searching is done in treebanks, so the grammatical structure of the sentences is taken into account.

myMediCon: End-to-End Burmese Automatic Speech Recognition for Medical Conversations
Hay Man Htun | Ye Kyaw Thu | Hutchatai Chanlekha | Kotaro Funakoshi | Thepchai Supnithi

End-to-End Automatic Speech Recognition (ASR) models have significantly advanced the field of speech processing by streamlining traditionally complex ASR system pipelines, promising enhanced accuracy and efficiency. Despite these advancements, there is a notable absence of freely available medical conversation speech corpora for Burmese, which is one of the low-resource languages. Addressing this gap, we present a manually curated Burmese Medical Speech Conversations (myMediCon) corpus, encapsulating conversations among medical doctors, nurses, and patients. Utilizing the ESPnet speech processing toolkit, we explore End-to-End ASR models for the Burmese language, focusing on Transformer and Recurrent Neural Network (RNN) architectures. Our corpus comprises 12 speakers, including three males and nine females, with a total speech duration of nearly 11 hours within the medical domain. To assess the ASR performance, we applied word and syllable segmentation to the text corpus. ASR models were evaluated using Character Error Rate (CER), Word Error Rate (WER), and Translation Error Rate (TER). The experimental results indicate that the RNN-based Burmese speech recognition with syllable-level segmentation achieved the best performance, yielding a CER of 9.7%. Moreover, the RNN approach significantly outperformed the Transformer model.
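Character and word error rates of the kind reported above are conventionally computed as the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The sketch below shows that standard definition; it is not the evaluation script used with ESPnet in the paper, and it assumes the text has already been segmented so that whitespace separates the units (words or syllables) being scored. The English example strings are placeholders.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions and deletions all cost 1), using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated units."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```

Because the denominator is the reference length, the choice of segmentation (word versus syllable) changes the reported rate even when the underlying recognition errors are identical.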

My Science Tutor (MyST)–a Large Corpus of Children’s Conversational Speech
Sameer Pradhan | Ronald A. Cole | Wayne H. Ward

This article describes the [corpus-name] corpus developed as part of the [project-name] project. To the best of our knowledge, this is one of the largest collections of children’s conversational speech that is freely available for non-commercial use under the creative commons license (CC BY-NC-SA 4.0). It comprises approximately 400 hours of speech, spanning some 230K utterances spread across about 10,500 virtual tutor sessions. Roughly 1,300 third, fourth and fifth grade students contributed to this corpus. The current release contains roughly 100K transcribed utterances. It is our hope that the corpus can be used to improve automatic speech recognition models and algorithms. We report the word error rate achieved on the test set using a model trained on the training and development portion of the corpus. The git repository of the corpus contains the complete training and evaluation setup in order to facilitate a fair and consistent evaluation. It is our hope that this corpus will contribute to the creation and evaluation of conversational AI agents having a better understanding of children’s speech, potentially opening doors to novel, effective, learning and therapeutic interventions.

It remains an open question how simultaneous interpretation (SI) data affects simultaneous machine translation (SiMT). Research has been limited due to the lack of a large-scale training corpus. In this work, we aim to fill in the gap by introducing NAIST-SIC-Aligned, which is an automatically-aligned parallel English-Japanese SI dataset. Starting with a non-aligned corpus NAIST-SIC, we propose a two-stage alignment approach to make the corpus parallel and thus suitable for model training. The first stage is coarse alignment where we perform a many-to-many mapping between source and target sentences, and the second stage is fine-grained alignment where we perform intra- and inter-sentence filtering to improve the quality of aligned pairs. To ensure the quality of the corpus, each step has been validated either quantitatively or qualitatively. This is the first open-sourced large-scale parallel SI dataset in the literature. We also manually curated a small test set for evaluation purposes. Our results show that models trained with SI data lead to significant improvement in translation quality and latency over baselines. We hope our work advances research on SI corpora construction and SiMT. Our data will be released upon the paper’s acceptance.

For the past decade, temporal annotation has been sparse: only a small portion of event pairs in a text was annotated. We present NarrativeTime, the first timeline-based annotation framework that achieves full coverage of all possible TLINKs. To compare with the previous SOTA in dense temporal annotation, we perform full re-annotation of the classic TimeBankDense corpus (American English), which shows comparable agreement with a significant increase in density. We contribute the TimeBankNT corpus (with each text fully annotated by two expert annotators), extensive annotation guidelines, open-source tools for annotation and conversion to TimeML format, and baseline results.

Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt; use of synonyms for label names; and the influence of integrating past memories during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.

NB Uttale: A Norwegian Pronunciation Lexicon with Dialect Variation
Marie Iversdatter Røsok | Ingerid Løyning Dale

We present a Norwegian pronunciation lexicon with Bokmål orthographic word forms and up to eight alternate phonological transcriptions per word form. The lexicon covers dialectal variations for five geographical areas, as well as pronunciation variations for spontaneous and manuscript-read speech. It is based on the NST Bokmål lexicon for East Norwegian, whose original phonological transcriptions have been corrected, before they were converted with dialect-specific regular expression rules. To evaluate the quality and consistency of the new, rule-generated transcriptions, we trained grapheme-to-phoneme (G2P) models and report our results with word- (WER) and phoneme-error-rate (PER) metrics. We found that the G2P models trained on lexica for Southwest and West Norwegian close-to-written transcriptions have the lowest WER scores, and that all error-corrected, close-to-written lexica yield better WER scores than the original NST lexicon. The lexicon is available under an open license, and can be used for various language technology applications and in linguistic research.

Negation Scope Conversion: Towards a Unified Negation-Annotated Dataset
Asahi Yoshida | Yoshihide Kato | Shigeki Matsubara

Negation scope resolution is the task that identifies the part of a sentence affected by the negation cue. The three major corpora used for this task, the BioScope corpus, the SFU review corpus and the Sherlock dataset, have different annotation schemes for negation scope. Due to the different annotations, the negation scope resolution models based on pre-trained language models (PLMs) perform worse when fine-tuned on the simply combined dataset consisting of the three corpora. To address this issue, we propose a method for automatically converting the scopes of BioScope and SFU to those of Sherlock and merge them into a unified dataset. To verify the effectiveness of the proposed method, we conducted experiments using the unified dataset for fine-tuning PLM-based models. The experimental results demonstrate that the performances of the models increase when fine-tuned on the unified dataset, unlike the simply combined one. In the token-level metric, the model fine-tuned on the unified dataset achieved state-of-the-art performance on the Sherlock dataset.

Previous work on negation understanding mainly focuses on negation cue detection and scope resolution (CDSR), without identifying the negation subject, which is also significant for downstream tasks. In this paper, we propose a new negation triplet extraction (NTE) task which aims to extract the negation subject along with the negation cue and scope. To achieve NTE, we devise a novel Syntax&Semantic-Enhanced Negation Extraction model, namely SSENE, which is built on a generative pretrained language model (PLM) of Encoder-Decoder architecture with a multi-task learning framework. Specifically, the given sentence’s syntactic dependency tree is incorporated into the PLM’s encoder to discover the correlations between the negation subject, cue and scope. Moreover, the semantic consistency between the sentence and the extracted triplet is ensured by auxiliary task learning. Furthermore, we have constructed a high-quality Chinese dataset, NegComment, based on users’ reviews from the real-world platform of Meituan, upon which our evaluations show that SSENE achieves the best NTE performance compared to the baselines. Our ablation and case studies also demonstrate that incorporating the syntactic information helps the PLM recognize the distant dependency between the subject and cue, and that the auxiliary task learning helps extract negation triplets with more semantic consistency. We further demonstrate that SSENE is also competitive on the traditional CDSR task.

nEMO: Dataset of Emotional Speech in Polish
Iwona Christop

Speech emotion recognition has become increasingly important in recent years due to its potential applications in healthcare, customer service, and personalization of dialogue systems. However, a major issue in this field is the lack of datasets that adequately represent basic emotional states across various language families. As datasets covering Slavic languages are rare, there is a need to address this research gap. This paper presents the development of nEMO, a novel corpus of emotional speech in Polish. The dataset comprises over 3 hours of samples recorded with the participation of nine actors portraying six emotional states: anger, fear, happiness, sadness, surprise, and a neutral state. The text material used was carefully selected to represent the phonetics of the Polish language adequately. The corpus is freely available under the terms of a Creative Commons license (CC BY-NC-SA 4.0).

Hierarchical text classification (HTC) is a significant but challenging task in natural language processing (NLP) due to its complex taxonomic label hierarchy. Recently, there have been a number of approaches that applied prompt learning to HTC problems, demonstrating impressive efficacy. The majority of prompt-based studies emphasize global hierarchical features by employing graph networks to represent the hierarchical structure as a whole, with limited research on maintaining path consistency within the internal hierarchy of the structure. In this paper, we formulate prompt-based HTC as a named entity recognition (NER) task and introduce conditional random fields (CRF) and Global Pointer to establish hierarchical dependencies. Specifically, we approach single- and multi-path HTC as flat and nested entity recognition tasks and model them using span- and token-based methods. By narrowing the gap between HTC and NER, we maintain the consistency of internal paths within the hierarchical structure through a simple and effective way. Extensive experiments on three public datasets show that our method achieves state-of-the-art (SoTA) performance.

Nested Event Extraction (NEE) aims to extract complex event structures where an event contains other events as its arguments recursively. Nested events involve a kind of Pivot Elements (PEs) that simultaneously act as arguments of outer-nest events and as triggers of inner-nest events, and thus connect them into nested structures. This special characteristic of PEs brings challenges to existing NEE methods, as they cannot well cope with the dual identities of PEs. Therefore, this paper proposes a new model, called PerNee, which extracts nested events mainly based on recognizing PEs. Specifically, PerNee first recognizes the triggers of both inner-nest and outer-nest events and further recognizes the PEs via classifying the relation type between trigger pairs. The model uses prompt learning to incorporate information from both event types and argument roles for better trigger and argument representations to improve NEE performance. Since existing NEE datasets (e.g., Genia11) are limited to specific domains and contain a narrow range of event types with nested structures, we systematically categorize nested events in the generic domain and construct a new NEE dataset, called ACE2005-Nest. Experimental results demonstrate that PerNee consistently achieves state-of-the-art performance on ACE2005-Nest, Genia11, and Genia13. The ACE2005-Nest dataset and the code of the PerNee model are available at https://github.com/waysonren/PerNee.

Nested Noun Phrase Identification Using BERT
Shweta Misra | Johan Boye

For several NLP tasks, an important substep is the identification of noun phrases in running text. This has typically been done by “chunking” – a way of finding minimal noun phrases by token classification. However, chunking-like methods do not represent the fact that noun phrases can be nested. This paper presents a novel method of finding all noun phrases in a sentence, nested to an arbitrary depth, using the BERT model for token classification. We show that our proposed method achieves very good results for both Swedish and English.

Neural Machine Translation between Low-Resource Languages with Synthetic Pivoting
Khalid Ahmed | Jan Buys

Training neural models for translating between low-resource languages is challenging due to the scarcity of direct parallel data between such languages. Pivot-based neural machine translation (NMT) systems overcome data scarcity by including a high-resource pivot language in the process of translating between low-resource languages. We propose synthetic pivoting, a novel approach to pivot-based translation in which the pivot sentences are generated synthetically from both the source and target languages. Synthetic pivot sentences are generated through sequence-level knowledge distillation, with the aim of changing the structure of pivot sentences to be closer to that of the source or target languages, thereby reducing pivot translation complexity. We incorporate synthetic pivoting into two paradigms for pivoting: cascading and direct translation using synthetic source and target sentences. We find that the performance of pivot-based systems highly depends on the quality of the NMT model used for sentence regeneration. Furthermore, training back-translation models on these sentences can make the models more robust to input-side noise. The results show that synthetic data generation improves pivot-based systems translating between low-resource Southern African languages by up to 5.6 BLEU points after fine-tuning.

Neural Multimodal Topic Modeling: A Comprehensive Evaluation
Felipe Gonzalez-Pizarro | Giuseppe Carenini

Neural topic models can successfully find coherent and diverse topics in textual data. However, they are limited in dealing with multimodal datasets (e.g., images and text). This paper presents the first systematic and comprehensive evaluation of multimodal topic modeling of documents containing both text and images. In the process, we propose two novel topic modeling solutions and two novel evaluation metrics. Overall, our evaluation on an unprecedented rich and diverse collection of datasets indicates that both of our models generate coherent and diverse topics. Nevertheless, the extent to which one method outperforms the other depends on the metrics and dataset combinations, which suggests further exploration of hybrid solutions in the future. Notably, our succinct human evaluation aligns with the outcomes determined by our proposed metrics. This alignment not only reinforces the credibility of our metrics but also highlights the potential for their application in guiding future multimodal topic modeling endeavors.

New Datasets for Automatic Detection of Textual Entailment and of Contradictions between Sentences in French
Maximos Skandalis | Richard Moot | Christian Retoré | Simon Robillard

This paper introduces DACCORD, an original dataset in French for automatic detection of contradictions between sentences. It also presents new, manually translated versions of two datasets, namely the well known dataset RTE3 and the recent dataset GQNLI, from English to French, for the task of natural language inference / recognising textual entailment, which is a sentence-pair classification task. These datasets help increase the admittedly limited number of datasets in French available for these tasks. DACCORD consists of 1034 pairs of sentences and is the first dataset exclusively dedicated to this task and covering among others the topic of the Russian invasion in Ukraine. RTE3-FR contains 800 examples for each of its validation and test subsets, while GQNLI-FR is composed of 300 pairs of sentences and focuses specifically on the use of generalised quantifiers. Our experiments on these datasets show that they are more challenging than the two already existing datasets for the mainstream NLI task in French (XNLI, FraCaS). For languages other than English, most deep learning models for NLI tasks currently have only XNLI available as a training set. Additional datasets, such as ours for French, could permit different training and evaluation strategies, producing more robust results and reducing the inevitable biases present in any single dataset.

New Evaluation Methodology for Qualitatively Comparing Classification Models
Ahmad Aljanaideh

Text Classification is one of the most common tasks in Natural Language Processing. When proposing new classification models, practitioners select a sample of items the proposed model classified correctly while the baseline did not, and then try to observe patterns across those items to understand the proposed model’s strengths. However, this approach is not comprehensive and requires the effort of observing patterns across text items. In this work, we propose a new evaluation methodology for performing qualitative assessment over multiple classification models. The proposed methodology is driven to discover clusters of text items where each cluster’s items 1) exhibit a linguistic pattern and 2) the proposed model significantly outperforms the baseline when classifying such items. This helps practitioners in learning what their proposed model is powerful at capturing in comparison with the baseline model without having to perform this process manually. We use a fine-tuned BERT and Logistic Regression as the two models to compare with Sentiment Analysis as the downstream task. We show how our proposed evaluation methodology discovers various clusters of text items which BERT classifies significantly more accurately than the Logistic Regression baseline, thus providing insight into what BERT is powerful at capturing.

New Intent Discovery (NID) aims to recognize known and infer new intent categories with the help of limited labeled and large-scale unlabeled data. The task is addressed as a feature-clustering problem and recent studies augment instance representation. However, existing methods fail to capture cluster-friendly representations, since they show less capability to effectively control and coordinate within-cluster and between-cluster distances. Tailored to the NID problem, we propose a Robust and Adaptive Prototypical learning (RAP) framework for globally distinct decision boundaries for both known and new intent categories. Specifically, a robust prototypical attracting learning (RPAL) method is designed to compel instances to gravitate toward their corresponding prototype, achieving greater within-cluster compactness. To attain larger between-cluster separation, another adaptive prototypical dispersing learning (APDL) method is devised to maximize the between-cluster distance from the prototype-to-prototype perspective. Experimental results on three challenging benchmarks (CLINC, BANKING, and StackOverflow) demonstrate that RAP, with its more cluster-friendly representations, brings substantial improvements over the current state-of-the-art methods (even large language models), with an average improvement of 5.5%.

This paper presents a new phonetic resource for Nigerian Pidgin, a low-resource language of West Africa. Aiming to provide a new tool for research on intonosyntax, we have augmented an existing syntactic treebank of Nigerian Pidgin, associating each orthographically transcribed token with a series of syllable-level alignments and phonetizations. Syllables are further described using a set of continuous and discrete prosodic features. This new approach provides a simple tool for researchers to explore the prosodic characteristics of various syntactic phenomena. In this paper, we present the format of the corpus, the various features added, and several explorations that can be performed using an online interface. We also present a prosodically specified lexicon extracted using this resource. In it, each orthographic form is accompanied by the frequency of its phoneme-level variants, as well as the suprasegmental features that most frequently accompany each syllable. Finally, we present several additional case studies on how this corpus can be used in the study of the language’s prosody.

New Proposal of Greenberg’s Universal 14 from Typometrics
Antoni Brosa-Rodríguez | Sylvain Kahane

In his Universal 14, Greenberg stated that the normal and dominant order in all world languages was to place the condition before the conclusion in conditional sentences. We review this claim quantitatively, based on occurrences in real texts in more than 50 languages. We find that Greenberg’s proposal is broadly correct but needs a reformulation to hold for all languages. We propose a quantitatively based and updated Universal 14, which gives a better account of the representation of the different languages analyzed and which is fulfilled in 100% of the cases (as opposed to Greenberg’s 60% in our sample). In addition, we also analyze adverbial sentences. Once we obtain the occurrence data in their direction (before or after the main verb), we plot a new Universal in a typometrical way: 100% of the languages show a higher proportion of preceding conditional clauses than of adverbial clauses, regardless of their type or the direction preference for adverbial clauses. The relationship between the SOV type and a stricter initial conditional location is also proposed.

New Semantic Task for the French Spoken Language Understanding MEDIA Benchmark
Nadège Alavoine | Gaëlle Laperrière | Christophe Servan | Sahar Ghannay | Sophie Rosset

Intent classification and slot-filling are essential tasks of Spoken Language Understanding (SLU). In most SLU systems, those tasks are realized by independent modules, but for about fifteen years, models achieving both of them jointly and exploiting their mutual enhancement have been proposed. A multilingual module using a joint model was envisioned to create a touristic dialogue system for a European project, HumanE-AI-Net. A combination of multiple datasets, including the MEDIA dataset, was suggested for training this joint model. The MEDIA SLU dataset is a French dataset distributed since 2005 by ELRA, mainly used by the French research community and free for academic research since 2020. Unfortunately, it is annotated only in slots but not intents. An enhanced version of MEDIA annotated with intents has been built to extend its use to more tasks and use cases. This paper presents the semi-automatic methodology used to obtain this enhanced version. In addition, we present the first results of SLU experiments on this enhanced dataset using joint models for intent classification and slot-filling.

The Nguni languages have over 20 million home language speakers in South Africa. There has been considerable growth in the datasets for Nguni languages, but so far no analysis of the performance of NLP models for these languages has been reported across languages and tasks. In this paper we study pretrained language models for the 4 Nguni languages - isiXhosa, isiZulu, isiNdebele, and Siswati. We compile publicly available datasets for natural language understanding and generation, spanning 6 tasks and 11 datasets. This benchmark, which we call NGLUEni, is the first centralised evaluation suite for the Nguni languages, allowing us to systematically evaluate the Nguni-language capabilities of pretrained language models (PLMs). Besides evaluating existing PLMs, we develop new PLMs for the Nguni languages through multilingual adaptive finetuning. Our models, Nguni-XLMR and Nguni-ByT5, outperform their base models and large-scale adapted models, showing that performance gains are obtainable through limited language group-based adaptation. We also perform experiments on cross-lingual transfer and machine translation. Our models achieve notable cross-lingual transfer improvements in the lower resourced Nguni languages (isiNdebele and Siswati). To facilitate future use of NGLUEni as a standardised evaluation suite for the Nguni languages, we create a web portal to access the collection of datasets and publicly release our models.

NLoPT: N-gram Enhanced Low-Rank Task Adaptive Pre-training for Efficient Language Model Adaption
Hao Gu | Jiangyan Yi | Zheng Lian | Jianhua Tao | Xinrui Yan

Pre-trained Language Models (PLMs) like BERT have achieved superior performance on different downstream tasks, even when such a model is trained on a general domain. Moreover, recent studies have shown that continued pre-training on task-specific data, known as task adaptive pre-training (TAPT), can further improve downstream task performance. However, conventional TAPT adjusts all the parameters of the PLM, which distorts the learned generic knowledge embedded in the original PLM weights, and it is expensive to store a whole model copy for each downstream task. In this paper, we propose NLoPT, a two-step n-gram enhanced low-rank task adaptive pre-training method, to effectively and efficiently customize a PLM to the downstream task. Specifically, we first apply low-rank adaption (LoRA), a prevalent parameter-efficient technique, for efficient TAPT. We further explicitly incorporate the task-specific multi-granularity n-gram information via the cross-attention mechanism. Experimental results on six datasets from four domains illustrate the effectiveness of NLoPT, demonstrating the superiority of LoRA-based TAPT and the necessity of incorporating task-specific n-gram information.
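To make the storage argument concrete, here is a small arithmetic sketch assuming the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in receives a trainable update B·A with B of shape d_out x r and A of shape r x d_in. The dimensions and rank below are illustrative values, not settings reported for NLoPT, and the n-gram cross-attention component is not modelled.

```python
def full_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values only: a BERT-base-sized square projection and rank 8.
d, r = 768, 8
full = full_params(d, d)
lora = lora_params(d, d, r)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

Under these assumed values, only the two small factors need to be stored per downstream task, while the original weights stay shared and unchanged.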

NLPre: A Revised Approach towards Language-centric Benchmarking of Natural Language Preprocessing Systems
Martyna Wiącek | Piotr Rybak | Łukasz Pszenny | Alina Wróblewska

With the advancements of transformer-based architectures, we observe the rise of natural language preprocessing (NLPre) tools capable of solving preliminary NLP tasks (e.g. tokenisation, part-of-speech tagging, dependency parsing, or morphological analysis) without any external linguistic guidance. It is arduous to compare novel solutions to well-entrenched preprocessing toolkits, relying on rule-based morphological analysers or dictionaries. Aware of the shortcomings of existing NLPre evaluation approaches, we investigate a novel method of reliable and fair evaluation and performance reporting. Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools, while credibly tracking their performance. The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark. Based on this benchmark, we conduct an extensive evaluation of a variety of Polish NLPre systems. To facilitate the construction of benchmarking environments for other languages, e.g. NLPre-GA for Irish or NLPre-ZH for Chinese, we ensure full customization of the publicly released source code of the benchmarking system. The links to all the resources (deployed platforms, source code, trained models, datasets etc.) can be found on the project website: https://sites.google.com/view/nlpre-benchmark.

No Need for Large-Scale Search: Exploring Large Language Models in Complex Knowledge Base Question Answering
Shouhui Wang | Biao Qin

Knowledge Base Question Answering (KBQA) systems play a pivotal role in the domain of natural language processing and information retrieval. Their primary objective is to bridge the gap between natural language questions and structured knowledge representations, especially for complex KBQA. Despite the significant progress in developing effective and interconnected KBQA technologies, the recent emergence of large language models (LLMs) offers an opportunity to address the challenges faced by KBQA systems more efficiently. This study adopts LLMs, such as Large Language Model Meta AI (LLaMA), as a channel to connect natural language questions with structured knowledge representations and proposes a Three-step Fine-tune Strategy based on a large language model to implement the KBQA system (TFS-KBQA). This method achieves direct conversion from natural language questions to structured knowledge representations, thereby overcoming the limitations of existing KBQA methods, such as addressing large search and reasoning spaces and ranking massive candidates. To evaluate the effectiveness of the proposed method, we conduct experiments using three popular complex KBQA datasets. The method achieves state-of-the-art performance across all three datasets, with particularly notable results on the WebQuestionSP dataset, where it achieves an F1 value of 79.9%.

Non-Essential Is NEcessary: Order-agnostic Multi-hop Question Generation
Kyungho Kim | Seongmin Park | Junseo Lee | Jihwa Lee

Existing multi-hop question generation (QG) methods treat answer-irrelevant documents as non-essential and remove them as impurities. However, this approach can create a training-inference discrepancy when impurities cannot be completely removed, which can lead to a decrease in model performance. To overcome this problem, we propose an auxiliary task, called order-agnostic, which leverages non-essential data in the training phase to create a robust model and extract consistent embeddings in real-world inference environments. Additionally, we use a single LM as both ranker and generator through a prompt-based approach, without applying additional external modules. Furthermore, we discover that appropriate utilization of the non-essential components can achieve a significant performance increase. Finally, experiments on the HotpotQA dataset achieve state-of-the-art performance.

NSina: A News Corpus for Sinhala
Hansi Hettiarachchi | Damith Premasiri | Lasitha Randunu Chandrakantha Uyangodage | Tharindu Ranasinghe

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus for Sinhala available to date.

Null Subjects in Spanish as a Machine Translation Problem
Jose Diego Suarez | Luis Chiruzzo

In this study we approach the detection of null subjects and impersonal constructions in Spanish using a machine translation methodology. We repurpose the Spanish AnCora corpus, converting it to a parallel set that transforms Spanish sentences into a format that allows us to detect and classify verbs, and train LSTM-based neural machine translation systems to perform this task. Various models differing in output format and hyperparameters were evaluated. Experimental results proved this approach to be highly resource-effective, obtaining results comparable to or surpassing the state of the art found in existing literature, while employing modest computational resources. Additionally, an improved dataset for training and evaluating Spanish null-subject detection tools was developed for this project, which could aid in creating further tools in the area and serve as a benchmark for them.

NumHG: A Dataset for Number-Focused Headline Generation
Jian-Tao Huang | Chung-Chi Chen | Hen-Hsen Huang | Hsin-Hsi Chen

Headline generation, a key task in abstractive summarization, strives to condense a full-length article into a succinct, single line of text. Notably, while contemporary encoder-decoder models excel based on the ROUGE metric, they often falter when it comes to the precise generation of numerals in headlines. We identify the lack of datasets providing fine-grained annotations for accurate numeral generation as a major roadblock. To address this, we introduce a new dataset, the NumHG, and provide over 27,000 annotated numeral-rich news articles for detailed investigation. Further, we evaluate five well-performing models from previous headline-generation tasks using human evaluation in terms of numerical accuracy, reasonableness, and readability. Our study reveals a need for improvement in numerical accuracy, demonstrating the potential of the NumHG dataset to drive progress in number-focused headline generation and stimulate further discussions in numeral-focused text generation.

NutFrame: Frame-based Conceptual Structure Induction with LLMs
Shaoru Guo | Yubo Chen | Kang Liu | Ru Li | Jun Zhao

Conceptual structure is fundamental to human cognition and natural language understanding. It is significant to explore whether Large Language Models (LLMs) understand such knowledge. Since FrameNet serves as a well-defined conceptual structure knowledge resource, with meaningful frames, fine-grained frame elements, and rich frame relations, we construct a benchmark for coNceptual structure induction based on FrameNet, called NutFrame. It contains three sub-tasks: Frame Induction, Frame Element Induction, and Frame Relation Induction. In addition, we utilize prompts to induce the conceptual structure of FrameNet with LLMs. Furthermore, we conduct extensive experiments on NutFrame to evaluate various widely-used LLMs. Experimental results demonstrate that FrameNet induction remains a challenge for LLMs.

OATS: A Challenge Dataset for Opinion Aspect Target Sentiment Joint Detection for Aspect-Based Sentiment Analysis
Siva Uday Sampreeth Chebolu | Franck Dernoncourt | Nedim Lipka | Thamar Solorio

Aspect-based sentiment analysis (ABSA) delves into understanding sentiments specific to distinct elements within a user-generated review. It aims to analyze user-generated reviews to determine a) the target entity being reviewed, b) the high-level aspect to which it belongs, c) the sentiment words used to express the opinion, and d) the sentiment expressed toward the targets and the aspects. While various benchmark datasets have fostered advancements in ABSA, they often come with domain limitations and data granularity challenges. Addressing these, we introduce the OATS dataset, which encompasses three fresh domains and consists of 27,470 sentence-level quadruples and 17,092 review-level tuples. Our initiative seeks to bridge specific observed gaps in existing datasets: the recurrent focus on familiar domains like restaurants and laptops, limited data for intricate quadruple extraction tasks, and an occasional oversight of the synergy between sentence and review-level sentiments. Moreover, to elucidate OATS’s potential and shed light on various ABSA subtasks that OATS can solve, we conducted experiments, establishing initial baselines. We hope the OATS dataset augments current resources, paving the way for an encompassing exploration of ABSA (https://github.com/RiTUAL-UH/OATS-ABSA).

OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Adnen Abdessaied | Manuel Hochmeister | Andreas Bulling

We present the Object Language Video Transformer (OLViT) – a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, they can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.

On an Intermediate Task for Classifying URL Citations on Scholarly Papers
Kazuhiro Wada | Masaya Tsunokake | Shigeki Matsubara

Citations using URLs (URL citations) that appear in scholarly papers can be used as an information source for research resource search engines. In particular, information about the types of cited resources and the reasons for their citation is crucial for describing the resources and their relations in the search services. To obtain this information, previous studies proposed some methods for classifying URL citations. However, their methods trained the model using a simple fine-tuning strategy and exhibited insufficient performance. We propose a classification method using a novel intermediate task. Our method trains the model on our intermediate task of identifying whether sample pairs belong to the same class before it is fine-tuned on the target task. In the experiment, our method outperformed previous methods using the simple fine-tuning strategy, with higher macro F-scores for different model sizes and architectures. Our analysis results indicate that the model learns the class boundaries of the target task by training on our intermediate task. Our intermediate task also demonstrated higher performance and computational efficiency than an alternative intermediate task using triplet loss. Finally, we applied our method to other text classification tasks and confirmed its effectiveness when a simple fine-tuning strategy does not work stably.
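The intermediate task described above, deciding whether two samples share a class, can be sketched as a data-construction step. The snippet below is a minimal illustration that enumerates all pairs exhaustively. How the paper actually samples or balances pairs is not stated in the abstract, and the function name, example texts, and class labels are hypothetical.

```python
from itertools import combinations


def build_same_class_pairs(samples):
    """Turn (text, label) samples into (text_a, text_b, same_class) pairs.

    same_class is 1 when both samples carry the same target-task label, else 0.
    Assumption: every unordered pair is kept; a real setup would likely subsample.
    """
    return [
        (text_a, text_b, int(label_a == label_b))
        for (text_a, label_a), (text_b, label_b) in combinations(samples, 2)
    ]


# Made-up citation contexts and labels, for illustration only.
samples = [
    ("We used the toolkit available at URL_1.", "tool"),
    ("Our parser is built on the library at URL_2.", "tool"),
    ("The corpus can be downloaded from URL_3.", "dataset"),
]
pairs = build_same_class_pairs(samples)
for text_a, text_b, same in pairs:
    print(same, "|", text_a, "|", text_b)
```

A model trained on such pairs only needs the binary same-class signal, after which it would be fine-tuned on the original labels of the target task.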

On Leveraging Encoder-only Pre-trained Language Models for Effective Keyphrase Generation
Di Wu | Wasi Ahmad | Kai-Wei Chang

This study addresses the application of encoder-only Pre-trained Language Models (PLMs) in keyphrase generation (KPG) amidst the broader availability of domain-tailored encoder-only models compared to encoder-decoder models. We investigate three core inquiries: (1) the efficacy of encoder-only PLMs in KPG, (2) optimal architectural decisions for employing encoder-only PLMs in KPG, and (3) a performance comparison between in-domain encoder-only and encoder-decoder PLMs across varied resource settings. Our findings, derived from extensive experimentation in two domains, reveal that with encoder-only PLMs, although keyphrase extraction with Conditional Random Fields slightly excels in identifying present keyphrases, the KPG formulation renders a broader spectrum of keyphrase predictions. Additionally, prefix-LM fine-tuning of encoder-only PLMs emerges as a strong and data-efficient strategy for KPG, outperforming general-domain seq2seq PLMs. We also identify a favorable parameter allocation towards model depth rather than width when employing encoder-decoder architectures initialized with encoder-only PLMs. The study sheds light on the potential of utilizing encoder-only PLMs for advancing KPG systems and provides a groundwork for future KPG methods. Our code and pre-trained checkpoints are released at https://github.com/uclanlp/DeepKPG.

In this article we look at how two different standards for lexical resources, TEI and OntoLex, deal with corpus citations in lexicons. We will focus on how corpus citations in retrodigitised dictionaries can be modelled using each of the two standards since this provides us with a suitably challenging use case. After looking at the structure of an example entry from a legacy dictionary, we examine the two approaches offered by the two different standards by outlining an encoding for the example entry using both of them (note that this article features the first extended discussion of how the Frequency Attestation and Corpus (FrAC) module of OntoLex deals with citations). After comparing the two approaches and looking at the advantages and disadvantages of both, we argue for a combination of both. In the last part of the article we discuss different ways of doing this, giving our preference for a strategy which makes use of RDFa.

One of the prominent issues stifling the current generation of large language models is their limited context length. Recent proprietary models such as GPT-4 and Claude 2 have introduced longer context lengths, 8k/32k and 100k, respectively; however, despite the efforts in the community, most common models, such as Llama-2, have a context length of 4k or less. Unlimiformer (Bertsch et al., 2023) is a recently popular vector-retrieval augmentation method that offloads cross-attention computations to a kNN index. However, its main limitation is that it is incompatible with decoder-only transformers out of the box. In this work, we explore practical considerations of adapting Unlimiformer to decoder-only transformers and introduce a series of modifications to overcome this limitation. Moreover, we expand the original experimental setup on summarization to include a new task (i.e., free-form Q&A) and an instruction-tuned model (i.e., a custom 6.7B GPT model). Our results showcase the effectiveness of these modifications on summarization, performing on par with a model with 2x the context length. Moreover, we discuss limitations and future directions for free-form Q&A and instruction-tuned models.
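The idea of offloading attention to a nearest-neighbour lookup can be sketched as attention restricted to the k highest-scoring keys. This is only an illustration under simplifying assumptions: it uses brute-force dot-product search instead of a real kNN index, a single query and head, and hypothetical names throughout. It is not the Unlimiformer implementation or the adaptation proposed in the paper.

```python
import math


def topk_attention(query, keys, values, k):
    """Attend only to the k keys with the highest dot-product score."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Brute-force stand-in for a kNN index lookup.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax over the retrieved keys only.
    max_score = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - max_score) for i in top]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w / total * values[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return top, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [5.0, 5.0]]
top, output = topk_attention(query, keys, values, k=2)
print(top, output)
```

Because only k keys enter the softmax, the cost of each attention step depends on k rather than on the full input length, which is what makes very long inputs tractable.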

On the Relationship between Skill Neurons and Robustness in Prompt Tuning
Leon Ackermann | Xenia Isabel Ohmer

Prompt Tuning is a popular parameter-efficient finetuning method for pre-trained large language models (PLMs). Based on experiments with RoBERTa, it has been suggested that Prompt Tuning activates specific neurons in the transformer’s feed-forward networks that are highly predictive and selective for the given task. In this paper, we study the robustness of Prompt Tuning in relation to these “skill neurons”, using RoBERTa and T5. We show that prompts tuned for a specific task are transferable to tasks of the same type but are not very robust to adversarial data. While prompts tuned for RoBERTa yield below-chance performance on adversarial data, prompts tuned for T5 are slightly more robust and retain above-chance performance in two out of three cases. At the same time, we replicate the finding that skill neurons exist in RoBERTa and further show that skill neurons also exist in T5. Interestingly, the skill neurons of T5 determined on non-adversarial data are also among the most predictive neurons on the adversarial data, which is not the case for RoBERTa. We conclude that higher adversarial robustness may be related to a model’s ability to consistently activate the relevant skill neurons on adversarial data.

On the Scaling Laws of Geographical Representation in Language Models
Nathan Godey | Éric de la Clergerie | Benoît Sagot

Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited by extending this result to Large Language Models (LLMs). In this paper, we propose to fill the gap between well-established and recent literature by observing how geographical knowledge evolves when scaling language models. We show that geographical knowledge is observable even for tiny models, and that it scales consistently as we increase the model size. Notably, we observe that larger language models cannot mitigate the geographical bias that is inherent to the training data.

On the Use of Silver Standard Data for Zero-shot Classification Tasks in Information Extraction
Jianwei Wang | Tianyin Wang | Ziqian Zeng

The superior performance of supervised classification methods in the information extraction (IE) area heavily relies on a large amount of gold standard data. Recent zero-shot classification methods converted the task to other NLP tasks (e.g., textual entailment) and used off-the-shelf models of these NLP tasks to directly perform inference on the test data without using a large amount of IE annotation data. A potentially valuable by-product of these methods is large-scale silver standard data, i.e., data pseudo-labeled by the off-the-shelf models of other NLP tasks. However, the use of these data has not been investigated further. In this paper, we propose a new framework, Clean-LaVe, which aims to utilize silver standard data to enhance the zero-shot performance. Clean-LaVe includes four phases: (1) obtaining silver data; (2) identifying relatively clean data from silver data; (3) finetuning the off-the-shelf model using clean data; (4) inference on the test data. The experimental results show that Clean-LaVe can outperform the baseline by 5% and 6% on the TACRED and Wiki80 datasets in the zero-shot relation classification task, by 3% to 7% on Smile (Korean and Polish) in the zero-shot cross-lingual relation classification task, and by 8% on ACE05-E+ in the zero-shot event argument classification task.

Modern Transformers have achieved impressive results on various Natural Language Processing tasks over the last few years. The one downside of this success is the size of these models. Their huge capacity, which sometimes surpasses billions of parameters, improves generalization abilities but makes them difficult to deploy. The developing field of model compression seeks to reduce model size and inference latency. This research focuses on one compression technique: Post-Training Quantization. We present a methodology to effectively quantize at least 95% of Transformer weights and corresponding activations to INT8 without any access to task-specific data, so that the drop in performance does not exceed 0.02%. Furthermore, we provide intriguing observations that reflect the cross-domain nature of some of the quantization properties.
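For readers unfamiliar with INT8 quantization, the basic mapping can be sketched as follows. This assumes the simplest variant, symmetric per-tensor quantization with a single scale; the abstract does not state which scheme the paper uses, and the function names and example weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 values back to approximate floats."""
    return [v * scale for v in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, so tensors with a few large outliers lose more precision than tensors with a narrow value range.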

On Zero-Shot Counterspeech Generation by LLMs
Punyajoy Saha | Aalok Agrawal | Abhik Jana | Chris Biemann | Animesh Mukherjee

With the emergence of numerous Large Language Models (LLMs), the usage of such models in various Natural Language Processing (NLP) applications is increasing extensively. Counterspeech generation is one such key task, where efforts are made to develop generative models by fine-tuning LLMs with hatespeech-counterspeech pairs, but none of these attempts explores the intrinsic properties of large language models in zero-shot settings. In this work, we present a comprehensive analysis of the performance of four LLMs, namely GPT-2, DialoGPT, ChatGPT and FlanT5, in zero-shot settings for counterspeech generation, which is the first of its kind. For GPT-2 and DialoGPT, we further investigate the deviation in performance with respect to the sizes (small, medium, large) of the models. In addition, we propose three different prompting strategies for generating different types of counterspeech and analyse the impact of such strategies on the performance of the models. Our analysis shows that, as model size increases, generation quality improves for two datasets (17%), but toxicity also increases (25%). Considering the type of model, GPT-2 and FlanT5 are significantly better in terms of counterspeech quality but also have high toxicity compared to DialoGPT. ChatGPT is much better at generating counterspeech than the other models across all metrics. In terms of prompting, we find that our proposed strategies help improve counterspeech generation across all the models.

OOVs in the Spotlight: How to Inflect Them?
Tomáš Sourada | Jana Straková | Rudolf Rosa

We focus on morphological inflection in out-of-vocabulary (OOV) conditions, an under-researched subtask in which state-of-the-art systems usually are less effective. We developed three systems: a retrograde model and two sequence-to-sequence (seq2seq) models based on LSTM and Transformer. For testing in OOV conditions, we automatically extracted a large dataset of nouns in the morphologically rich Czech language, with lemma-disjoint data splits, and we further manually annotated a real-world OOV dataset of neologisms. In the standard OOV conditions, Transformer achieves the best results, with increasing performance in ensemble with LSTM, the retrograde model and SIGMORPHON baselines. On the real-world OOV dataset of neologisms, the retrograde model outperforms all neural models. Finally, our seq2seq models achieve state-of-the-art results in 9 out of 16 languages from SIGMORPHON 2022 shared task data in the OOV evaluation (feature overlap) in the large data condition. We release the Czech OOV Inflection Dataset for rigorous evaluation in OOV conditions. Further, we release the inflection system with the seq2seq models as a ready-to-use Python library.

We develop and evaluate multilingual scientific documents similarity measurement (SDSM) models in this work. Such models can be used to find related papers in different languages, which can help multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we develop multilingual SDSM models by adjusting and extending the state-of-the-art methods designed for English SDSM tasks. We find that: (i) Some highly successful methods in English SDSM yield significantly worse performance in multilingual SDSM. (ii) Our best model, which enriches the non-English papers with English summaries, outperforms strong baselines by 7% (in mean average precision) on multilingual SDSM tasks, without compromising the performance on English SDSM tasks.
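Mean average precision, the metric quoted above, can be computed as in the following sketch. It uses the standard textbook definition over ranked lists; the exact evaluation protocol of the paper is not given in the abstract, and the query and document identifiers are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank over the ranks where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of average precision over all queries in `rankings`."""
    scores = [average_precision(rankings[q], relevance[q]) for q in rankings]
    return sum(scores) / len(scores)


# Two hypothetical query papers with ranked candidate papers.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevance = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(rankings, relevance)
print(round(map_score, 4))
```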

Opinion Mining Using Pre-Trained Large Language Models: Identifying the Type, Polarity, Intensity, Expression, and Source of Private States
Saeed Ahmadnia | Arash Yousefi Jordehi | Mahsa Hosseini Khasheh Heyran | SeyedAbolghasem Mirroshandel | Owen Rambow

Opinion mining is an important task in natural language processing. The MPQA Opinion Corpus is a fine-grained and comprehensive dataset of private states (i.e., the condition of a source who has an attitude which may be directed toward a target) based on context. Although this dataset was released years ago, because of its complex definition of annotations and hard-to-read data format, almost all existing research works have only focused on a small subset of the dataset. In this paper, we present a comprehensive study of the entire MPQA 2.0 dataset. In order to achieve this goal, we first provide a clean version of MPQA 2.0 in a more interpretable format. Then, we propose two novel approaches for opinion mining, establishing new high baselines for future work. We use two pre-trained large language models, BERT and T5, to automatically identify the type, polarity, and intensity of private states expressed in phrases, and we use T5 to detect opinion expressions and their agents (i.e., sources).

Because more than 70% of the texts in existing opinion summary datasets are positive, current opinion summarization approaches are reluctant to generate negative opinion summaries even when given negative opinions as input. To address this sentiment bias, two approaches are proposed from two perspectives: model-specific and model-agnostic. For the model-specific approach, a variational autoencoder is proposed to disentangle the input representation into sentiment-relevant and sentiment-irrelevant components through adversarial loss. The sentiment information in the input is thereby kept and employed in the subsequent decoding, which avoids interference of content information with emotional signals. To further avoid relying on specific opinion summarization frameworks, a model-agnostic approach based on counterfactual data augmentation is proposed. A dataset with a more balanced emotional polarity distribution is constructed using a large pre-trained language model based on pairwise and mini-edited principles. Experimental results show that the sentiment consistency of the generated summaries is significantly improved using the proposed approaches, while their semantic quality is unaffected.

Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.

Pretrained language models can be applied to various downstream tasks but are susceptible to subtle perturbations. Most adversarial defense methods introduce adversarial training during the fine-tuning phase to enhance empirical robustness. However, the repeated execution of adversarial training hinders training efficiency when transitioning to different tasks. In this paper, we explore the transferability of robustness within subnetworks and leverage this insight to introduce a novel adversarial defense method, ORTicket, eliminating the need for separate adversarial training across diverse downstream tasks. Specifically, we (i) prune the full model using the MLM task (the same task employed for BERT pretraining) to obtain a task-agnostic robust subnetwork (i.e., a winning ticket in the Lottery Ticket Hypothesis), and (ii) fine-tune this subnetwork for downstream tasks. Extensive experiments demonstrate that our approach achieves comparable robustness to other defense methods while retaining the efficiency of traditional fine-tuning. This also confirms the significance of selecting the MLM task for identifying the transferable robust subnetwork. Furthermore, our method is orthogonal to other adversarial training approaches, indicating the potential for further enhancement of model robustness.

Out-of-Domain Intent Detection Considering Multi-Turn Dialogue Contexts
Hao Lang | Yinhe Zheng | Binyuan Hui | Fei Huang | Yongbin Li

Out-of-Domain (OOD) intent detection is vital for practical dialogue systems, and it usually requires considering multi-turn dialogue contexts. However, most previous OOD intent detection approaches are limited to single dialogue turns. In this paper, we introduce a context-aware OOD intent detection (Caro) framework to model multi-turn contexts in OOD intent detection tasks. Specifically, we follow the information bottleneck principle to extract robust representations from multi-turn dialogue contexts. Two different views are constructed for each input sample, and the superfluous information not related to intent detection is removed using a multi-view information bottleneck loss. Moreover, we explore utilizing unlabeled data in Caro. A two-stage training process is introduced to mine OOD samples from these unlabeled data, and these OOD samples are used to train the resulting model with a bootstrapping approach. Comprehensive experiments demonstrate that Caro establishes state-of-the-art performance on multi-turn OOD detection tasks, improving the F1-OOD score by over 29% compared to the previous best method.

Out of the Mouths of MPs: Speaker Attribution in Parliamentary Debates
Ines Rehbein | Josef Ruppenhofer | Annelen Brunner | Simone Paolo Ponzetto

This paper presents GePaDe_SpkAtt, a new corpus for speaker attribution in German parliamentary debates, with more than 7,700 manually annotated events of speech, thought and writing. Our role inventory includes the sources, addressees, messages and topics of the speech event and also two additional roles, medium and evidence. We report baseline results for the automatic prediction of speech events and their roles, with high scores for both event triggers and roles. Then we apply our model to predict speech events in 20 years of parliamentary debates and investigate the use of factives in the rhetoric of MPs.

In an era characterized by the rapid proliferation of information, the pervasive issues of misinformation and disinformation have significantly impacted numerous individuals. Consequently, the evaluation of information’s truthfulness and accuracy has garnered substantial attention among researchers. In this work, we present a novel fact-checking framework called PACAR, fact-checking based on planning and customized action reasoning using LLMs. It comprises four modules: a claim decomposer with self-reflection, an LLM-centric planner module, an executor for carrying out planned actions, and a verifier module that assesses veracity and generates explanations based on the overall reasoning process. Unlike previous work that employs single-path decision-making and single-step verdict prediction, PACAR focuses on the use of LLMs in dynamic planning and execution of actions. Furthermore, in contrast to previous work that relied primarily on general reasoning, we introduce tailored actions such as numerical reasoning and entity disambiguation to effectively address potential challenges in fact-checking. Our PACAR framework, incorporating LLM-centric planning along with customized action reasoning, significantly outperforms baseline methods across three datasets from different domains and with varying complexity levels. Additional experiments, including multidimensional and sliced observations, demonstrate the effectiveness of PACAR and offer valuable insights for the advancement of automated fact-checking.

PAD: A Robustness Enhancement Ensemble Method via Promoting Attention Diversity
Yuting Yang | Pei Huang | Feifei Ma | Juan Cao | Jintao Li

Deep neural networks, even mainstream Transformer-based models, can be vulnerable to adversarial attacks. Although several robustness enhancement approaches have been proposed, they usually focus on a certain type of perturbation. As the types of attack can be varied and unpredictable in practical scenarios, a general and strong defense method is urgently required. We notice that most well-trained models can be weakly robust in the perturbation space, i.e., only a small ratio of adversarial examples exist. Inspired by this weak robustness property, this paper presents a novel ensemble method for enhancing robustness. We propose a lightweight framework, PAD, to save computational resources in realizing an ensemble. Instead of training multiple models, a plugin module is designed to perturb the parameters of a base model, which can achieve the effect of multiple models. Then, to diversify adversarial example distributions among different models, we encourage each model to have different attention patterns by optimizing a diversity measure that we define. Experiments on various widely-used datasets and target models show that PAD can consistently improve the defense ability against many types of adversarial attacks while maintaining accuracy on clean data. In addition, PAD offers good interpretability through visualization of diverse attention patterns.

Palmyra 3.0: A User-Friendly Cloud-Based Platform for Morphology and Dependency Syntax Annotation
Muhammed AbuOdeh | Long Phan | Ahmed Farouk Zakaria Elshabrawy | Nizar Habash

We present Palmyra 3.0, a cloud-based, configurable, and user-friendly platform for morphology and syntax annotation through dependency-tree visualization. Palmyra 3.0 implements a robust system that stores data on the cloud. By default, Palmyra 3.0 comes with an Arabic dependency parser that generates highly accurate trees, but it is easily configurable to support dependency parsers in other languages. Palmyra 3.0 provides default configuration files for a number of predefined formalisms, such as UD and CATiB, and a number of user-friendly features to support annotators.

Parameter-Efficient Transfer Learning for End-to-end Speech Translation
Yunlong Zhao | Kexin Wang | Qianqian Dong | Tom Ko

Recently, end-to-end speech translation (ST) has gained significant attention in research, but its progress is hindered by the limited availability of labeled data. To overcome this challenge, leveraging pre-trained models for knowledge transfer in ST has emerged as a promising direction. In this paper, we propose PETL-ST, which investigates parameter-efficient transfer learning for end-to-end speech translation. Our method utilizes two lightweight adaptation techniques, namely prefix and adapter, to modulate Attention and the Feed-Forward Network, respectively, while preserving the capabilities of pre-trained models. We conduct experiments on MuST-C En-De, Es, Fr, Ru datasets to evaluate the performance of our approach. The results demonstrate that PETL-ST outperforms strong baselines, achieving superior translation quality with high parameter efficiency. Moreover, our method exhibits remarkable data efficiency and significantly improves performance in low-resource settings.

ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages Using Wikidata
Jonne Sälevä | Constantine Lignos

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.

PaReNT (Parent Retrieval Neural Tool): A Deep Dive into Word Formation across Languages
Emil Svoboda | Magda Sevcikova

We present PaReNT (Parent Retrieval Neural Tool), a deep-learning-based multilingual tool performing parent retrieval and word formation classification in English, German, Dutch, Spanish, French, Russian, and Czech. Parent retrieval refers to determining the lexeme or lexemes the input lexeme was based on (e.g. “darkness” is traced back to “dark”; “waterfall” decomposes into “water” and “fall”). Additionally, PaReNT performs word formation classification, which classifies the input lexeme as a compound (e.g. “proofread”), a derivative (e.g. “deescalate”) or an unmotivated word (e.g. “dog”). These seven languages are selected from three major branches of the Indo-European language family (Germanic, Romance, Slavic). Data is aggregated from a range of word-formation resources, as well as Wiktionary, to train and test the tool. The tool is based on a custom-architecture hybrid transformer block-enriched sequence-to-sequence neural network utilizing both a character-based and a semantic representation of the input lexemes, with two output modules: one decoder-based, dedicated to parent retrieval, and one classifier-based, for word formation classification. PaReNT achieves a mean accuracy of 0.62 in parent retrieval and a mean balanced accuracy of 0.74 in word formation classification.
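Balanced accuracy, reported above for word formation classification, is the unweighted mean of per-class recall, which keeps a frequent class from dominating the score. The sketch below uses that standard definition; the gold and predicted labels are hypothetical toy data, not results from the paper.

```python
from collections import defaultdict


def balanced_accuracy(gold, predicted):
    """Unweighted mean of per-class recall over the classes present in gold."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


gold = ["compound", "derivative", "derivative",
        "unmotivated", "unmotivated", "unmotivated", "unmotivated"]
pred = ["compound", "derivative", "compound",
        "unmotivated", "unmotivated", "unmotivated", "derivative"]

score = balanced_accuracy(gold, pred)
plain = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(score, plain)
```

On this toy data the per-class recalls are 1.0, 0.5 and 0.75, so the balanced accuracy differs from the plain accuracy of 5/7.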

Parsing for Mauritian Creole Using Universal Dependencies
Neha Ramsurrun | Rolando Coto-Solano | Michael Gonzalez

This paper presents a first attempt to apply Universal Dependencies (De Marneffe et al., 2021) to train a parser for Mauritian Creole (MC), a French-based Creole language spoken on the island of Mauritius. This paper demonstrates the construction of a 161-sentence (1007-token) treebank for MC and evaluates the performance of a part-of-speech tagger and Universal Dependencies parser trained on this data. The sentences were collected from publicly available grammar books (Syea, 2013) and online resources (Baker and Kriegel, 2013), as well as from government-produced school textbooks (Antonio-Françoise et al., 2021; Natchoo et al., 2017). The parser, trained with UDPipe 2 (Straka, 2018), reached F1 scores of UPOS=86.2, UAS=80.8 and LAS=69.8. This fares favorably when compared to models of similar size for other under-resourced Indigenous and Creole languages. We then address some of the challenges faced when applying UD to Creole languages in general and to Mauritian Creole in particular. The main challenge was the handling of spelling variation in the input. Other issues include the tagging of modal verbs, middle voice sentences, and parts of the tense-aspect-mood system (such as the particle fek).
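The UAS and LAS figures above measure how many tokens receive the correct head (UAS) and the correct head plus dependency relation (LAS). A minimal sketch follows, assuming gold and predicted parses share the same tokenization, in which case the F1 variant of these scores reduces to plain accuracy. The function name and the four-token example are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two parses given as lists of (head, relation)."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("parses must be non-empty and have the same length")
    n = len(gold)
    # UAS: the head index matches.
    head_ok = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph)
    # LAS: both head index and relation label match.
    both_ok = sum(1 for g, p in zip(gold, predicted) if g == p)
    return head_ok / n, both_ok / n


# One (head, relation) pair per token; head 0 marks the root.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "amod"), (3, "obj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```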

Parsing Headed Constituencies
Katarzyna Krasnowska-Kieraś | Marcin Woliński

In this paper, we present a parsing technique that generates headed constituency trees, which combine information typically contained in constituency and dependency trees. We advocate for using such structures for syntactic representation. The parsing method combines prediction of dependency links with prediction of constituency spines in a ‘parsing as tagging’ approach and outputs a hybrid structure. An interesting feature is that the method can generate constituency trees with discontinuities. The parser is built on top of a BERT model for the given language and uses a specially crafted classifier for predicting dependency links. With suitable training data, the method can be applied to an arbitrary language; we report evaluation results for Polish and German.

Modeling social media users is the core of social governance in the digital society. Existing works have incorporated different digital traces to better learn the representations of social media users, including text information encoded by pre-trained language models and social network information encoded by graph models. However, limited by overloaded text information and hard-to-collect social network information, they cannot utilize global text information and cannot be generalized without social relationships. In this paper, we propose a Pre-training Architecture for Social Media User Modeling based on Text Graph (PASUM). We aggregate all microblogs to represent social media users based on the text graph model and learn the mapping from microblogs to user representation. We further design inter-user and intra-user contrastive learning tasks to inject general structural information into the mapping. In different scenarios, we can represent users based on text, even without social network information. Experimental results on various downstream tasks demonstrate the effectiveness and superiority of our framework.

Identifying the type of relationship between words (cognates, borrowings, inherited) provides a deeper insight into the history of a language and allows for a better characterization of language relatedness. In this paper, we propose a computational approach for discriminating between cognates and borrowings, one of the most difficult tasks in historical linguistics. We compare the discriminative power of graphic and phonetic features and we analyze the underlying linguistic factors that prove relevant in the classification task. We perform experiments for pairs of languages in the Romance language family (French, Italian, Spanish, Portuguese, and Romanian), based on a comprehensive database of Romance cognates and borrowings. To our knowledge, this is one of the first attempts of this kind and the most comprehensive in terms of covered languages.

Recently, we have witnessed breakthroughs of meta-learning in few-shot learning scenarios. Data augmentation is essential for meta-learning, particularly in situations where data is extremely scarce. However, existing text data augmentation methods cannot ensure the diversity and quality of the generated data, which leads to sub-optimal performance. Inspired by the recent success of large language models (LLMs), which demonstrate improved language comprehension abilities, we propose a Meta-learning framework with Progressive Data Augmentation (PDAMeta) for few-shot text classification, which contains a two-stage data augmentation strategy. First, the prompt-based data augmentation enriches the diversity of the training instances from a global perspective. Second, the attention-based data augmentation further improves the data quality from a local perspective. Last, we propose a dual-stream contrastive meta-learning strategy to learn discriminative text representations from both original and augmented instances. Extensive experiments conducted on four public few-shot text classification datasets show that PDAMeta significantly outperforms several state-of-the-art models and shows better robustness.

PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents
Nan Zhang | Connor Heaton | Sean Timothy Okonsky | Prasenjit Mitra | Hilal Ezgi Toraman

Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprise a significant part of the academic community and are the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.

PECC: Problem Extraction and Coding Challenges
Patrick Haller | Jonas Golde | Alan Akbik

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving and reasoning. Existing benchmarks evaluate tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions is still unexplored. Addressing this gap, we introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler, including 2396 problems. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs’ capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.

Misogyny is often expressed through figurative language. Some neutral words can assume a negative connotation when functioning as pejorative epithets. Disambiguating the meaning of such terms might help the detection of misogyny. To address this task, we present PejorativITy, a novel corpus of 1,200 manually annotated Italian tweets for pejorative language at the word level and misogyny at the sentence level. We evaluate the impact of injecting information about disambiguated words into a model targeting misogyny detection. In particular, we explore two different approaches for injection: concatenation of pejorative information and substitution of ambiguous words with univocal terms. Our experimental results, both on our corpus and on two popular benchmarks on Italian tweets, show that both approaches lead to a major classification improvement, indicating that word sense disambiguation is a promising preliminary step for misogyny detection. Furthermore, we investigate LLMs’ understanding of pejorative epithets by means of contextual word embeddings analysis and prompting.

pdf abs
Persona-aware Multi-party Conversation Response Generation
Khyati Mahajan | Samira Shaikh

Modeling interlocutor information is essential towards modeling multi-party conversations to account for the presence of multiple participants. We investigate the role of including the persona attributes of both the speaker and addressee relevant to each utterance, collected via 3 distinct mock social media experiments. The participants were recruited via MTurk, and were unaware of the persona attributes of the other users they interacted with on the platform. Our main contributions include 1) a multi-party conversation dataset with rich associated metadata (including persona), and 2) a persona-aware heterogeneous graph transformer response generation model. We find that PersonaHeterMPC provides a good baseline towards persona-aware generation for multi-party conversation modeling, generating responses which are relevant and consistent with the interlocutor personas relevant to the conversation.

pdf abs
Phonetic Segmentation of the UCLA Phonetics Lab Archive
Eleanor Chodroff | Blaž Pažon | Annie Baker | Steven Moran

Research in speech technologies and comparative linguistics depends on access to diverse and accessible speech data. The UCLA Phonetics Lab Archive is one of the earliest multilingual speech corpora, with long-form audio recordings and phonetic transcriptions for 314 languages (Ladefoged et al., 2009). Recently, 95 of these languages were time-aligned with word-level phonetic transcriptions (Li et al., 2021). Here we present VoxAngeles, a corpus of audited phonetic transcriptions and phone-level alignments of the UCLA Phonetics Lab Archive, which uses the 95-language CMU re-release as our starting point. VoxAngeles also includes word- and phone-level segmentations from the original UCLA corpus, as well as phonetic measurements of word and phone durations, vowel formants, and vowel f0. This corpus enhances the usability of the original data, particularly for quantitative phonetic typology, as demonstrated through a case study of vowel intrinsic f0. We also discuss the utility of the VoxAngeles corpus for general research and pedagogy in crosslinguistic phonetics, as well as for low-resource and multilingual speech technologies. VoxAngeles is free to download and use under a CC-BY-NC 4.0 license.

pdf abs
Phonotactic Complexity across Dialects
Ryan Soh-Eun Shim | Kalvin Chang | David R. Mortensen

Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012). We study this claim on a micro-level, using a tightly-controlled sample of Dutch dialects (across 366 collection sites) and Min dialects (across 60 sites), which enables a fairer comparison across varieties. Even at the dialect level, we find empirical evidence for a tradeoff between word length and a computational measure of phonotactic complexity from an LSTM-based phone-level language model, a result previously documented only at the language level. A generalized additive model (GAM) shows that dialects with low phonotactic complexity concentrate around the capital regions, which we hypothesize to correspond to prior hypotheses that language varieties of greater or more diverse populations show reduced phonotactic complexity. We also experiment with incorporating the auxiliary task of predicting syllable constituency, but do not find an increase in the strength of the negative correlation observed.

pdf abs
PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin
Stephen Bothwell | Brian DuSell | David Chiang | Brian Krostenko

Computational historical linguistics seeks to systematically understand processes of sound change, including during periods at which little to no formal recording of language is attested. At the same time, few computational resources exist which deeply explore phonological and morphological connections between proto-languages and their descendants. This is particularly true for the family of Italic languages. To assist historical linguists in the study of Italic sound change, we introduce the Proto-Italic to Latin (PILA) dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin. We provide a detailed description of how our dataset was created and organized. Then, we exhibit PILA’s value in two ways. First, we present baseline results for PILA on a pair of traditional computational historical linguistics tasks. Second, we demonstrate PILA’s capability for enhancing other historical-linguistic datasets through a dataset compatibility study.

pdf abs
PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods
Slawomir Dadas | Michał Perełkiewicz | Rafał Poświata

We present Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish. The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics. We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including the baseline models trained by us as well as other available Polish and multilingual methods. Finally, we introduce a three-step process for training highly effective language-specific retrievers, consisting of knowledge distillation, supervised fine-tuning, and building sparse-dense hybrid retrievers using a lightweight rescoring model. In order to validate our approach, we train new text encoders for Polish and compare their results with previously evaluated methods. Our dense models outperform the best solutions available to date, and the use of hybrid methods further improves their performance.

pdf abs
PLAES: Prompt-generalized and Level-aware Learning Framework for Cross-prompt Automated Essay Scoring
Yuan Chen | Xia Li

Current cross-prompt automatic essay scoring (AES) systems are primarily concerned with obtaining shared knowledge specific to the target prompt by using the source and target prompt essays. However, it may not be feasible in practical situations because the target prompt essays may not be available as training data. When constructing a model solely from source prompt essays, its capacity to generalize to the target prompt may be hindered by significant discrepancies among different prompt essays. In this study, a novel learning framework for cross-prompt AES is proposed in order to capture more general knowledge across prompts and improve the model’s capacity to distinguish between writing levels. To acquire generic knowledge across different prompts, a primary model is trained via meta learning with all source prompt essays. To improve the model’s ability to differentiate writing levels, we present a level-aware learning strategy consisting of a general scorer and three level scorers for low-, middle-, and high-level essays. Then, we introduce a contrastive learning strategy to bring the essay representation of the general scorer closer to its corresponding level representation and far away from the other two levels, thereby improving the system’s ability to differentiate writing levels as well as boosting scoring performance. Experimental results on public datasets illustrate the efficacy of our method.

pdf abs
Plots Made Quickly: An Efficient Approach for Generating Visualizations from Natural Language Queries
Henrik Voigt | Kai Lawonn | Sina Zarrieß

Generating visualizations from natural language queries is a useful extension to visualization libraries such as Vega-Lite. The goal of the NL2VIS task is to generate a valid Vega-Lite specification from a data frame and a natural language query as input, which can then be rendered as a visualization. To enable real-time interaction with the data, small model sizes and fast inferences are required. Previous work has introduced custom neural network solutions with custom visualization specifications and has not systematically tested pre-trained LMs to solve this problem. In this work, we opt for a more generic approach that (i) evaluates pre-trained LMs of different sizes and (ii) uses string encodings of data frames and visualization specifications instead of custom specifications. In our experiments, we show that these representations, in combination with pre-trained LMs, scale better than current state-of-the-art models. In addition, the small and base versions of the T5 architecture achieve real-time interaction, while LLMs far exceed latency thresholds suitable for visual exploration tasks. In summary, our models generate visualization specifications in real-time on a CPU and establish a new state of the art on the NL2VIS benchmark nvBench.

Although neural machine translation (NMT) models perform well in the general domain, it remains rather challenging to control their generation behavior to satisfy the requirement of different users. Given the expensive training cost and the data scarcity challenge of learning a new model from scratch for each user requirement, we propose a memory-augmented adapter to steer pretrained NMT models in a pluggable manner. Specifically, we construct a multi-granular memory based on the user-provided text samples and propose a new adapter architecture to combine the model representations and the retrieved results. We also propose a training strategy using memory dropout to reduce spurious dependencies between the NMT model and the memory. We validate our approach on both style- and domain-specific experiments and the results indicate that our method can outperform several representative pluggable baselines.

pdf abs
Pointing Out the Shortcomings of Relation Extraction Models with Semantically Motivated Adversarials
Gennaro Nolano | Moritz Blum | Basil Ell | Philipp Cimiano

In recent years, large language models have achieved state-of-the-art performance across various NLP tasks. However, investigations have shown that these models tend to rely on shortcut features, leading to inaccurate predictions and causing the models to be unreliable at generalization to out-of-distribution (OOD) samples. For instance, in the context of relation extraction (RE), we would expect a model to identify the same relation independently of the entities involved in it. For example, consider the sentence “Leonardo da Vinci painted the Mona Lisa” expressing the created(Leonardo_da_Vinci, Mona_Lisa) relation. If we substitute “Leonardo da Vinci” with “Barack Obama”, then the sentence still expresses the created relation. A robust model is supposed to detect the same relation in both cases. In this work, we describe several semantically-motivated strategies to generate adversarial examples by replacing entity mentions and investigate how state-of-the-art RE models perform under pressure. Our analyses show that the performance of these models significantly deteriorates on the modified datasets (avg. of -48.5% in F1), which indicates that these models rely to a great extent on shortcuts, such as surface forms (or patterns therein) of entities, without making full use of the information present in the sentences.
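The entity substitution described in this abstract can be illustrated with a minimal sketch. This is not the authors' generation pipeline: the function name `substitute_entity` and the dictionary layout are hypothetical, and the sketch only shows the basic idea of swapping an entity mention while keeping the gold relation label fixed.

```python
def substitute_entity(sentence: str, old_mention: str, new_mention: str) -> str:
    """Return a copy of `sentence` with one entity mention swapped for another.

    Hypothetical helper for illustration; the paper's strategies choose
    replacement entities according to semantic criteria not modeled here.
    """
    if old_mention not in sentence:
        raise ValueError(f"mention {old_mention!r} not found in sentence")
    return sentence.replace(old_mention, new_mention)


original = "Leonardo da Vinci painted the Mona Lisa"
adversarial = substitute_entity(original, "Leonardo da Vinci", "Barack Obama")

# The gold relation label is kept unchanged for the adversarial sentence:
# a robust model should still predict "created".
example = {"text": adversarial, "relation": "created"}
print(example)
```

A model that changes its prediction between `original` and `adversarial` is relying on the surface form of the entity rather than on the sentence context.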

pdf abs
Polish-ASTE: Aspect-Sentiment Triplet Extraction Datasets for Polish
Marta Lango | Borys Naglik | Mateusz Lango | Iwo Naglik

Aspect-Sentiment Triplet Extraction (ASTE) is one of the most challenging and complex tasks in sentiment analysis. It concerns the construction of triplets that contain an aspect, its associated sentiment polarity, and an opinion phrase that serves as a rationale for the assigned polarity. Despite the growing popularity of the task and the many machine learning methods being proposed to address it, the number of datasets for ASTE is very limited. In particular, no dataset is available for any of the Slavic languages. In this paper, we present two new datasets for ASTE containing customer opinions about hotels and purchased products expressed in Polish. We also perform experiments with two ASTE techniques combined with two large language models for Polish to investigate their performance and the difficulty of the assembled datasets. The new datasets are available under a permissive licence and have the same file format as the English datasets, facilitating their use in future research.

This paper presents the Polish Discourse Corpus, a pioneering resource of this kind for Polish and the first corpus in Poland to employ the ISO standard for discourse relation annotation. The Polish Discourse Corpus adopts ISO 24617-8, a segment of the Language Resource Management – Semantic Annotation Framework (SemAF), which outlines a set of core discourse relations adaptable for diverse languages and genres. The paper overviews the corpus architecture, annotation procedures, the challenges that the annotators have encountered, as well as key statistical data concerning discourse relations and connectives in the corpus. It further discusses the initial phases of the discourse parser tailored for the ISO 24617-8 framework. Evaluations on the efficacy and potential refinement areas of the corpus annotation and parsing strategies are also presented. The final part of the paper touches upon anticipated research plans to improve discourse analysis techniques in the project and to conduct discourse studies involving multiple languages.

pdf abs
PolitiCause: An Annotation Scheme and Corpus for Causality in Political Texts
Paulina Garcia Corral | Hanna Bechara | Ran Zhang | Slava Jankin

In this paper, we present PolitiCAUSE, a new corpus of political texts annotated for causality. We provide a detailed and robust annotation scheme for annotating two types of information: (1) whether a sentence contains a causal relation or not, and (2) the spans of text that correspond to the cause and effect components of the causal relation. We also provide statistics and analysis of the corpus, and outline the difficulties and limitations of the task. Finally, we test two transformer-based classification models on our dataset as a form of evaluation. The models achieve a moderate performance on the dataset, with an MCC score of 0.62. Our results show that PolitiCAUSE is a valuable resource for studying causality in texts, especially in the domain of political discourse, and that there is still room for improvement in developing more accurate and robust methods for this problem.
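For readers unfamiliar with the reported metric, the Matthews correlation coefficient (MCC) for a binary task such as causal versus non-causal sentences can be computed from confusion-matrix counts. The sketch below uses the standard formula; the counts in the usage lines are made up for illustration and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only (not from the paper).
perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
inverted = mcc(tp=0, tn=0, fp=50, fn=50)
print(perfect, chance, inverted)
```

Unlike accuracy, MCC stays informative when the two classes are imbalanced, which is common for sentence-level causality labels.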

pdf abs
PolQA: Polish Question Answering Dataset
Piotr Rybak | Piotr Przybyła | Maciej Ogrodniczuk

Recently proposed systems for open-domain question answering (OpenQA) require large amounts of training data to achieve state-of-the-art performance. However, data annotation is known to be time-consuming and therefore expensive to acquire. As a result, the appropriate datasets are available only for a handful of languages (mainly English and Chinese). In this work, we introduce and publicly release PolQA, the first Polish dataset for OpenQA. It consists of 7,000 questions, 87,525 manually labeled evidence passages, and a corpus of over 7,097,322 candidate passages. Each question is classified according to its formulation, type, as well as entity type of the answer. This resource allows us to evaluate the impact of different annotation choices on the performance of the QA system and propose an efficient annotation strategy that increases the passage retrieval accuracy@10 by 10.55 p.p. while reducing the annotation cost by 82%.
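The retrieval metric mentioned above, accuracy@10, is commonly defined as the fraction of questions for which at least one relevant passage appears among the top 10 retrieved passages. The following sketch assumes that definition; the function name, data layout, and toy identifiers are hypothetical and not taken from the PolQA code.

```python
from typing import Dict, List, Set


def accuracy_at_k(
    ranked_ids: Dict[str, List[str]],
    relevant_ids: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of questions with at least one relevant passage in the top k.

    `ranked_ids` maps a question id to its retrieved passage ids, best first.
    `relevant_ids` maps a question id to the set of passages labeled relevant.
    """
    if not ranked_ids:
        return 0.0
    hits = 0
    for qid, ranking in ranked_ids.items():
        gold = relevant_ids.get(qid, set())
        if any(pid in gold for pid in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Toy data for illustration only.
ranked = {"q1": ["p3", "p1"], "q2": ["p9"]}
gold = {"q1": {"p1"}, "q2": {"p2"}}
acc_at_1 = accuracy_at_k(ranked, gold, k=1)
acc_at_2 = accuracy_at_k(ranked, gold, k=2)
print(acc_at_1, acc_at_2)
```

A gain reported in percentage points (p.p.) is an absolute difference between two such scores expressed as percentages.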

pdf abs
PolyNERE: A Novel Ontology and Corpus for Named Entity Recognition and Relation Extraction in Polymer Science Domain
Van-Thuy Phi | Hiroki Teranishi | Yuji Matsumoto | Hiroyuki Oka | Masashi Ishii

Polymers are widely used in diverse fields, and the demand for efficient methods to extract and organize information about them is increasing. An automated approach that utilizes machine learning can accurately extract relevant information from scientific papers, providing a promising solution for automating information extraction using annotated training data. In this paper, we introduce a polymer-relevant ontology featuring crucial entities and relations to enhance information extraction in the polymer science field. Our ontology is customizable to adapt to specific research needs. We present PolyNERE, a high-quality named entity recognition (NER) and relation extraction (RE) corpus comprising 750 polymer abstracts annotated using our ontology. Distinctive features of PolyNERE include multiple entity types, relation categories, support for various NER settings, and the ability to assert entities and relations at different levels. PolyNERE also facilitates reasoning in the RE task through supporting evidence. While our experiments with recent advanced methods achieved promising results, challenges persist in adapting NER and RE from abstracts to full-text paragraphs. This emphasizes the need for robust information extraction systems in the polymer domain, making our corpus a valuable benchmark for future developments.

pdf abs
PopALM: Popularity-Aligned Language Models for Social Media Trendy Response Prediction
Erxin Yu | Jing Li | Chunpu Xu

Social media platforms exhibit millions of events daily. To preliminarily predict the mainstream public reaction to these events, we study trendy response prediction to automatically generate top-liked user replies to social media events. While previous works focus on generating responses without factoring in popularity, we propose Popularity-Aligned Language Models (PopALM) to distinguish responses liked by a larger audience through reinforcement learning. Recognizing the noisy labels from user “likes”, we tailor curriculum learning in proximal policy optimization (PPO) to help models capture the essential samples for easy-to-hard training. In experiments, we build a large-scale Weibo dataset for trendy response prediction, and its results show that PopALM can help boost the performance of advanced language models.

pdf abs
PopAut: An Annotated Corpus for Populism Detection in Austrian News Comments
Ahmadou Wagne | Julia Neidhardt | Thomas Elmar Kolb

Populism is a phenomenon that has been noticeably present in the political landscape of various countries over the past decades. While populism expressed by politicians has been thoroughly examined in the literature, populism expressed by citizens is still underresearched, especially when it comes to its automated detection in text. This work presents the PopAut corpus, which is the first annotated corpus of news comments for populism in the German language. It features 1,200 comments collected between 2019 and 2021 that are annotated for the populist motives of anti-elitism, people-centrism and people-sovereignty. Following the definition of Cas Mudde, populism is seen as a thin ideology. This work shows that annotators reach a high agreement when labeling news comments for these motives. The data set is collected to serve as the basis for automated populism detection using machine-learning methods. By using transformer-based models, we can outperform existing dictionaries tailored for automated populism detection in German social media content. Therefore our work provides a rich resource for future work on the classification of populist user comments in the German language.

pdf abs
Positive and Risky Message Assessment for Music Products
Yigeng Zhang | Mahsa Shafaei | Fabio Gonzalez | Thamar Solorio

In this work, we introduce a pioneering research challenge: evaluating positive and potentially harmful messages within music products. We begin by setting a multi-faceted, multi-task benchmark for music content assessment. Subsequently, we introduce an efficient multi-task predictive model fortified with ordinality-enforcement to address this challenge. Our findings reveal that the proposed method not only significantly outperforms robust task-specific alternatives but also possesses the capability to assess multiple aspects simultaneously. Furthermore, through detailed case studies, where we employed Large Language Models (LLMs) as surrogates for content assessment, we provide valuable insights to inform and guide future research on this topic. The code for dataset creation and model implementation is publicly available at https://github.com/RiTUAL-UH/music-message-assessment.

pdf abs
POS Tagging for the Endangered Dagur Language
Joanna Dolińska | Delphine Bernhard

The application of natural language processing tools opens new ways for the documentation and revitalization of under-resourced languages. In this article we aim to investigate the feasibility of automatic part-of-speech (POS) tagging for Dagur, which is an endangered Mongolic language spoken mainly in northeast China, with no official written standard for all Dagur dialects. We present a new manually annotated corpus for Dagur, which includes about 1,200 tokens, and detail the decisions made during the annotation process. This corpus is used to test transfer of models from other languages, especially from Buryat, which is currently the only Mongolic language included in the Universal Dependencies corpora. We applied the models trained by de Vries et al. (2022) to the Dagur corpus and continued training these models on Buryat. We analyse the results with respect to language families, script and POS distribution, in three different zero-shot settings: (1) unrelated, (2) related and (3) unrelated+related language.

pdf abs
Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
Heyang Liu | Yanfeng Wang | Yu Wang

The end-to-end (E2E) approach is gradually replacing hybrid models for automatic speech recognition (ASR) tasks. However, the optimization of E2E models lacks an intuitive method for handling decoding shifts, especially in scenarios with a large number of domain-specific rare words that hold specific important meanings. Furthermore, the absence of knowledge-intensive speech datasets in academia has been a significant limiting factor, and the commonly used speech corpora exhibit significant disparities with realistic conversation. To address these challenges, we present Medical Interview (MED-IT), a multi-turn consultation speech dataset that contains a substantial number of knowledge-intensive named entities. We also explore methods to enhance the recognition performance of rare words for E2E models. We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions. This guides the model to prioritize recognizing words in the biasing list. In our experiments, for subsets of rare words appearing in the training speech between 10 and 20 times, and between 1 and 5 times, the proposed method achieves a relative improvement of 9.3% and 5.1%, respectively.

pdf abs
PPORTAL_ner: An Annotated Corpus of Portuguese Literary Entities
Mariana O. Silva | Mirella M. Moro

The intersection of natural language processing (NLP) and literary analysis has yielded valuable insights and applications across various languages. However, the scarcity of labeled data tailored for Portuguese literary texts poses a notable challenge. To address this gap, we present the PPORTAL_ner corpus, an annotated dataset that simplifies the development of Named Entity Recognition (NER) models specifically adapted for Portuguese literary works. Our corpus includes annotations of PER, LOC, GPE, ORG, and DATE entities within a diverse set of 25 literary texts. Annotation of the corpus involved a two-step process: initial pre-annotation using a pre-trained spaCy model followed by correction and refinement using the Prodigy annotation tool. With a total of 125,059 tokens and 5,266 annotated entities, PPORTAL_ner corpus significantly enriches the landscape of resources available for computational literary analysis in Portuguese. This paper details the annotation methodology, guidelines, and dataset statistics while also evaluating four NER models over the PPORTAL_ner corpus. Our evaluation analysis reveals that fine-tuning on domain-specific data significantly improves NER model performance, demonstrating the value of the PPORTAL_ner corpus for developing domain-specific language models.

In this study, we analyze spontaneous speech transcripts from Hungarian patients with schizophrenia, schizoaffective, and bipolar disorders. Our goal is to identify distinctive linguistic features in these patient groups and controls. To our knowledge, no prior study has systematically examined the linguistic features of these disorders or explored their use in distinguishing between these patient groups. We collected recordings from 77 participants during three directed spontaneous speech tasks in a clinical setting, resulting in 458 texts. Our research group manually transcribed the recordings. We processed the written corpus texts using Natural Language Processing methods and tools. The final corpus consists of 179,515 tokens, excluding punctuation. Using this data, we analyze different linguistic features’ predictive power by computing and comparing their frequency distributions. We then attempt to automatically differentiate between patient groups and controls using our extensive set of linguistic features, employing the random forest algorithm in these experiments. Our results indicate that applying machine learning techniques based on distinctive features can effectively distinguish SZ, SAD, BD, and controls, surpassing baseline results.

While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-world application of these systems. In this work, we propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion. To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model. In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network. Prefix-diffusion is able to generate diverse captions with relatively fewer parameters, while maintaining the fluency and relevance of the captions benefiting from the generative capabilities of the diffusion model. Our work paves the way for scaling up diffusion models for image captioning, and achieves promising performance compared with recent approaches.

pdf abs
Pre-Trained Language Models Represent Some Geographic Populations Better than Others
Jonathan Dunn | Benjamin Adams | Harish Tayyar Madabushi

This paper measures the skew in how well two families of LLMs represent diverse geographic populations. A spatial probing task is used with geo-referenced corpora to measure the degree to which pre-trained language models from the OPT and BLOOM series represent diverse populations around the world. Results show that these models perform much better for some populations than others. In particular, populations across the US and the UK are represented quite well while those in South and Southeast Asia are poorly represented. Analysis shows that both families of models largely share the same skew across populations. At the same time, this skew cannot be fully explained by sociolinguistic factors, economic factors, or geographic factors. The basic conclusion from this analysis is that pre-trained models do not equally represent the world’s population: there is a strong skew towards specific geographic populations. This finding challenges the idea that a single model can be used for all populations.

Recent large-scale vision-language pre-training depends on image-text global alignment by contrastive learning and is further boosted by fine-grained alignment in a weakly contrastive manner for cross-modal retrieval. Nonetheless, besides semantic matching learned by contrastive learning, cross-modal retrieval also largely relies on object matching between modalities. This necessitates fine-grained categorical discriminative learning, which however suffers from scarce data in full-supervised scenarios and information asymmetry in weakly-supervised scenarios when applied to cross-modal retrieval. To address these issues, we propose expansive lexicon-patch alignment (ELA) to align image patches with a vocabulary rather than only the words explicitly in the text for annotation-free alignment and information augmentation, thus enabling more effective fine-grained categorical discriminative learning for cross-modal retrieval. Experimental results show that ELA could effectively learn representative fine-grained information and outperform state-of-the-art methods on cross-modal retrieval.

pdf abs
PRIMO: Progressive Induction for Multi-hop Open Rule Generation
Jianyu Liu | Sheng Bi | Guilin Qi

Open rules refer to the implication from premise atoms to hypothesis atoms, which captures various relationships between instances in the real world. Injecting open rule knowledge into the machine helps to improve the performance of downstream tasks such as dialogue and relation extraction. Existing approaches focus on single-hop open rule generation, ignoring scenarios involving multiple hops, leading to logical inconsistencies between premise and hypothesis atoms, as well as semantic duplication of generated rule atoms. To address these issues, we propose a progressive multi-stage open rule generation method called PRIMO. We introduce ontology information during the rule generation stage to reduce ambiguity and improve rule accuracy. PRIMO constructs a multi-stage structure consisting of generation, extraction, and rank modules to fully leverage the latent knowledge within the language model across multiple dimensions. Furthermore, we employ reinforcement learning from human feedback to further optimize the model, enhancing its understanding of commonsense knowledge. Experimental results demonstrate that compared to baseline models, PRIMO significantly enhances rule quality and diversity while reducing the repetition rate of rule atoms.

pdf abs
Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction
Yugo Murawaki

Bayesian approaches to reconstructing the evolutionary history of languages rely on the tree model, which assumes that these languages descended from a common ancestor and underwent modifications over time. However, this assumption can be violated to different extents due to contact and other factors. Understanding the degree to which this assumption is violated is crucial for validating the accuracy of phylolinguistic inference. In this paper, we propose a simple sanity check: projecting a reconstructed tree onto a space generated by principal component analysis. By using both synthetic and real data, we demonstrate that our method effectively visualizes anomalies, particularly in the form of jogging.

Prior Relational Schema Assists Effective Contrastive Learning for Inductive Knowledge Graph Completion
Ruilin Luo | Jiayi Li | Jianghangfan Zhang | Jing Xiao | Yujiu Yang

Knowledge Graph Completion (KGC) is a task aimed at uncovering the inherent relationships among known knowledge triplets in a Knowledge Graph (KG) and subsequently predicting missing links. Presently, there is a rising interest in inductive knowledge graph completion, where missing links may pertain to previously unobserved entities. Previous inductive KGC methods mainly rely on descriptive information of entities to improve the representation of unseen entities, neglecting to provide effective prior knowledge for relation modeling. To tackle this challenge, we capture prior schema-level interactions related to relations by leveraging entity type information, thereby furnishing effective prior constraints when reasoning with newly introduced entities. Moreover, we employ normal in-batch negatives and introduce schema-guided negatives to bolster the efficiency of normal contrastive representation learning. Experimental results demonstrate that our approach consistently achieves state-of-the-art performance on various established metrics across multiple benchmark datasets for link prediction. Notably, our method achieves a 20.5% relative increase in Hits@1 on the HumanWiki-Ind dataset.
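
As a reading aid for the in-batch negatives mentioned in the abstract above, here is a minimal sketch of a generic InfoNCE-style contrastive loss in which each query's positive is the key at the same batch index and all other keys act as negatives. It is not the authors' implementation: the function name, the temperature value, and the toy vectors are assumptions for illustration, and the schema-guided negatives the paper introduces are not modeled here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(queries, keys, temperature=0.1):
    """Mean InfoNCE loss: keys[i] is the positive for queries[i];
    every other key in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

# Hypothetical 2-d embeddings: matching pairs vs. swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each query is most similar to its own key (`aligned`) and large when the pairing is swapped (`shuffled`).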

Probe Then Retrieve and Reason: Distilling Probing and Reasoning Capabilities into Smaller Language Models
Yichun Zhao | Shuheng Zhou | Huijia Zhu

Step-by-step reasoning methods, such as the Chain-of-Thought (CoT), have been demonstrated to be highly effective in harnessing the reasoning capabilities of Large Language Models (LLMs). Recent research efforts have sought to distill LLMs into Small Language Models (SLMs), with a significant focus on transferring the reasoning capabilities of LLMs to SLMs via CoT. However, the outcomes of CoT distillation are inadequate for knowledge-intensive reasoning tasks. This is because generating accurate rationales requires crucial factual knowledge, which SLMs struggle to retain due to their parameter constraints. We propose a retrieval-based CoT distillation framework, named Probe then Retrieve and Reason (PRR), which distills the question probing and reasoning capabilities from LLMs into SLMs. We train two complementary distilled SLMs, a probing model and a reasoning model, in tandem. When presented with a new question, the probing model first identifies the necessary knowledge to answer it, generating queries for retrieval. Subsequently, the reasoning model uses the retrieved knowledge to construct a step-by-step rationale for the answer. In knowledge-intensive reasoning tasks, such as StrategyQA and OpenbookQA, our distillation framework yields superior performance for SLMs compared to conventional methods, including simple CoT distillation and knowledge-augmented distillation using raw questions.

Probing Large Language Models for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics
Fangru Lin | Daniel Altshuler | Janet B. Pierrehumbert

Scalar adjectives pertain to various domain scales and vary in intensity within each scale (e.g. certain is more intense than likely on the likelihood scale). Scalar implicatures arise from the consideration of alternative statements which could have been made. They can be triggered by scalar adjectives and require listeners to reason pragmatically about them. Some scalar adjectives are more likely to trigger scalar implicatures than others. This phenomenon is referred to as scalar diversity. In this study, we probe different families of Large Language Models such as GPT-4 for their knowledge of the lexical semantics of scalar adjectives and one specific aspect of their pragmatics, namely scalar diversity. We find that they encode rich lexical-semantic information about scalar adjectives. However, the rich lexical-semantic knowledge does not entail a good understanding of scalar diversity. We also compare current models of different sizes and complexities and find that larger models are not always better. Finally, we explain our probing results by leveraging linguistic intuitions and model training objectives.

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving state-of-the-art performance on image-to-text tasks. However, few studies explore which layers of MLLMs contribute most to encoding global image information, which plays a vital role in multimodal comprehension and generation. In this study, we find that the intermediate layers of models, rather than the topmost layers, can encode more global semantic information, and that their representation vectors perform better on visual-language entailment tasks. We further probe models regarding local semantic representations through object recognition tasks. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information. Our code and data are released via https://github.com/kobayashikanna01/probing_MLLM_rep.

ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search
Zehan Li | Jianfei Zhang | Chuantao Yin | Yuanxin Ouyang | Wenge Rong

Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pretraining models using crafted bi-modal and uni-modal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.

PRODIS - a Speech Database and a Phoneme-based Language Model for the Study of Predictability Effects in Polish
Zofia Malisz | Jan Foremski | Małgorzata Kul

We present a speech database and a phoneme-level language model of Polish. The database and model are designed for the analysis of prosodic and discourse factors interacting with predictability effects on acoustic parameters. The database is also the first large, publicly available Polish speech corpus of excellent acoustic quality that can be used for phonetic analysis and training of multi-speaker speech technology systems. The speech in the database is processed in a pipeline that achieves a 90% degree of automation. It incorporates state-of-the-art, freely available tools enabling database expansion or adaptation to additional languages.

Producing a Parallel Universal Dependencies Treebank of Ancient Hebrew and Ancient Greek via Cross-Lingual Projection
Daniel G. Swanson | Bryce D. Bussert | Francis Tyers

In this paper we present the initial construction of a treebank of Ancient Greek containing portions of the Septuagint, a translation of the Hebrew Scriptures (1576 sentences, 39K tokens, roughly 7% of the total corpus). We construct the treebank by word-aligning and projecting from the parallel text in Ancient Hebrew before automatically correcting systematic syntactic mismatches and manually correcting other errors.

Projective Methods for Mitigating Gender Bias in Pre-trained Language Models
Hillary Dawkins | Isar Nejadgholi | Daniel Gillis | Judi McCuaig

Mitigation of gender bias in NLP has a long history tied to debiasing static word embeddings. More recently, attention has shifted to debiasing pre-trained language models. We study to what extent the simplest projective debiasing methods, developed for word embeddings, can help when applied to BERT’s internal representations. Projective methods are fast to implement, use a small number of saved parameters, and make no updates to the existing model parameters. We evaluate the efficacy of the methods both in reducing intrinsic bias, as measured by BERT’s next sentence prediction task, and in mitigating observed bias in a downstream setting when fine-tuned. To this end, we also provide a critical analysis of a popular gender-bias assessment test for quantifying intrinsic bias, resulting in an enhanced test set and new bias measures. We find that projective methods can be effective at both intrinsic bias and downstream bias mitigation, but that the two outcomes are not necessarily correlated. This finding serves as a warning that intrinsic bias test sets, based either on language modeling tasks or next sentence prediction, should not be the only benchmark in developing a debiased language model.
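
To illustrate what a projective method does in general, the sketch below removes from a vector its component along a single bias direction, which is the basic linear projection used for static word embeddings. The abstract does not specify the exact variants applied to BERT, so the helper name, the 3-dimensional toy vectors, and the choice of a single direction are assumptions for illustration only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(vec, direction):
    """Return vec minus its orthogonal projection onto direction."""
    scale = dot(vec, direction) / dot(direction, direction)
    return [x - scale * d for x, d in zip(vec, direction)]

# Hypothetical bias direction and hidden representation (toy values).
gender_direction = [1.0, 0.0, 0.0]
hidden = [0.5, 2.0, -1.0]
debiased = project_out(hidden, gender_direction)
```

After the projection, `debiased` is orthogonal to `gender_direction`, and the model parameters themselves are untouched, which matches the abstract's point that projective methods only store a few extra parameters (here, the direction).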

Project MOSLA: Recording Every Moment of Second Language Acquisition
Masato Hagiwara | Joshua B. Tanner

Second language acquisition (SLA) is a complex and dynamic process. Many SLA studies that have attempted to record and analyze this process have typically focused on a single modality (e.g., textual output of learners), covered only a short period of time, and/or lacked control (e.g., failed to capture every aspect of the learning process). In Project MOSLA (Moments of Second Language Acquisition), we have created a longitudinal, multimodal, multilingual, and controlled dataset by inviting participants to learn one of three target languages (Arabic, Spanish, and Chinese) from scratch over a span of two years, exclusively through online instruction, and recording every lesson using Zoom. The dataset is semi-automatically annotated with speaker/language IDs and transcripts by both human annotators and fine-tuned state-of-the-art speech models. Our experiments reveal linguistic insights into learners’ proficiency development over time, as well as the potential for automatically detecting the areas of focus on the screen purely from the unannotated multimodal data. Our dataset is freely available for research purposes and can serve as a valuable resource for a wide range of applications, including but not limited to SLA, proficiency assessment, language and speech processing, pedagogy, and multimodal learning analytics.

PROM: A Phrase-level Copying Mechanism with Pre-training for Abstractive Summarization
Xinbei Ma | Yeyun Gong | Pengcheng He | Hai Zhao | Nan Duan

Based on the remarkable achievements of pre-trained language models in abstractive summarization, the copying mechanism has proved helpful by improving the factuality, stability, and overall performance. This work proposes PROM, a new PhRase-level cOpying Mechanism that enhances attention on n-grams, which can be applied to zero-shot summarization with pre-training. PROM adds an indicator layer to explicitly pick up tokens in n-grams that can be copied from the source, and calculates an auxiliary loss for the copying prediction. Empirical studies show that PROM makes significant improvements in fine-tuning on benchmarks. In the zero-shot setting, PROM is utilized in the self-supervised pre-training on raw corpora and provides new general baselines on a wide range of summarization datasets. Further analysis shows that PROM performs more reasonable copying and contributes to faithfulness. Our code is publicly available at https://github.com/xbmxb/PROM.

PromISe: Releasing the Capabilities of LLMs with Prompt Introspective Search
Minzheng Wang | Nan Xu | Jiahao Zhao | Yin Luo | Wenji Mao

The development of large language models (LLMs) raises the importance of assessing the fairness and completeness of various evaluation benchmarks. Regrettably, these benchmarks predominantly utilize uniform manual prompts, which may not fully capture the expansive capabilities of LLMs—potentially leading to an underestimation of their performance. To unlock the potential of LLMs, researchers pay attention to automated prompt search methods, which employ LLMs as optimizers to discover optimal prompts. However, previous methods generate the solutions implicitly, which overlook the underlying thought process and lack explicit feedback. In this paper, we propose a novel prompt introspective search framework, namely PromISe, to better release the capabilities of LLMs. It converts the process of optimizing prompts into an explicit chain of thought, through a step-by-step procedure that integrates self-introspect and self-refine. Extensive experiments, conducted over 73 tasks on two major benchmarks, demonstrate that our proposed PromISe significantly boosts the performance of 12 well-known LLMs compared to the baseline approach. Moreover, our study offers enhanced insights into the interaction between humans and LLMs, potentially serving as a foundation for future designs and implementations. Keywords: large language models, prompt search, self-introspect, self-refine

Prompt-based Generation of Natural Language Explanations of Synthetic Lethality for Cancer Drug Discovery
Ke Zhang | Yimiao Feng | Jie Zheng

Synthetic lethality (SL) offers a promising approach for targeted anti-cancer therapy. Deeply understanding SL gene pair mechanisms is vital for anti-cancer drug discovery. However, current wet-lab and machine learning-based SL prediction methods lack user-friendly and quantitatively evaluable explanations. To address these problems, we propose a prompt-based pipeline for generating natural language explanations. We first construct a natural language dataset named NexLeth. This dataset is derived from New Bing through prompt-based queries and expert annotations and contains 707 instances. NexLeth enhances the understanding of SL mechanisms and it is a benchmark for evaluating SL explanation methods. For the task of natural language generation for SL explanations, we combine subgraph explanations from an SL knowledge graph (KG) with instructions to construct novel personalized prompts, so as to inject the domain knowledge into the generation process. We then leverage the prompts to fine-tune pre-trained biomedical language models on our dataset. Experimental results show that the fine-tuned model equipped with designed prompts performs better than existing biomedical language models in terms of text quality and explainability, suggesting the potential of our dataset and the fine-tuned model for generating understandable and reliable explanations of SL mechanisms.

Prompt-based Zero-shot Relation Extraction with Semantic Knowledge Augmentation
Jiaying Gong | Hoda Eldardiry

In relation triplet extraction (RTE), recognizing unseen relations for which there are no training instances is a challenging task. Efforts have been made to recognize unseen relations based on question-answering models or relation descriptions. However, these approaches miss the semantic information about connections between seen and unseen relations. In this paper, we propose a prompt-based model with semantic knowledge augmentation (ZS-SKA) to recognize unseen relations under the zero-shot setting. We present a new word-level analogy-based sentence translation rule and generate augmented instances with unseen relations from instances with seen relations using that new rule. We design prompts with weighted virtual label construction based on an external knowledge graph to integrate semantic knowledge information learned from seen relations. Instead of using the actual label sets in the prompt template, we construct weighted virtual label words. We learn the representations of both seen and unseen relations with augmented instances and prompts. We then calculate the distance between the generated representations using prototypical networks to predict unseen relations. Extensive experiments conducted on three public datasets (FewRel, Wiki-ZSL, and NYT) show that ZS-SKA outperforms other methods under the zero-shot setting. Results also demonstrate the effectiveness and robustness of ZS-SKA.
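
The distance-based prediction step mentioned in the abstract above can be pictured with a minimal prototypical-network sketch: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype by Euclidean distance. This is the generic technique rather than ZS-SKA itself; the relation labels, the 2-dimensional embeddings, and the function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest_prototype(query, support):
    """Return the label whose prototype (mean embedding) is closest to query."""
    prototypes = {label: mean_vector(vs) for label, vs in support.items()}
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Hypothetical relation labels with toy support embeddings.
support = {
    "founded_by": [[1.0, 0.0], [0.8, 0.2]],
    "located_in": [[0.0, 1.0], [0.2, 0.8]],
}
prediction = nearest_prototype([0.9, 0.1], support)
```

In the zero-shot setting described above, the support embeddings for an unseen relation would presumably come from the augmented instances and prompts rather than from labeled training data.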

Prompt-fused Framework for Inductive Logical Query Answering
Zezhong Xu | Wen Zhang | Peng Ye | Lei Liang | Huajun Chen

Answering logical queries on knowledge graphs (KG) poses a significant challenge for machine reasoning. The primary obstacle in this task stems from the inherent incompleteness of KGs. Existing research has predominantly focused on addressing the issue of missing edges in KGs, thereby neglecting another aspect of incompleteness: the emergence of new entities. Furthermore, most of the existing methods tend to reason over each logical operator separately, rather than comprehensively analyzing the query as a whole during the reasoning process. In this paper, we propose a query-aware prompt-fused framework named Pro-QE, which could incorporate existing query embedding methods and address the embedding of emerging entities through contextual information aggregation. Additionally, a query prompt, which is generated by encoding the symbolic query, is introduced to gather information relevant to the query from a holistic perspective. To evaluate the efficacy of our model in the inductive setting, we introduce two new challenging benchmarks. Experimental results demonstrate that our model successfully handles the issue of unseen entities in logical queries. Furthermore, the ablation study confirms the efficacy of the aggregator and prompt components.

Prompting-based Synthetic Data Generation for Few-Shot Question Answering
Maximilian Schmidt | Andrea Bartezzaghi | Ngoc Thang Vu

Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.

Prompting Explicit and Implicit Knowledge for Multi-hop Question Answering Based on Human Reading Process
Guangming Huang | Yunfei Long | Cunjin Luo | Jiaxing Shen | Xia Sun

Pre-trained language models (PLMs) leverage chains-of-thought (CoT) to simulate human reasoning and inference processes, achieving proficient performance in multi-hop QA. However, a gap persists between PLMs’ reasoning abilities and those of humans when tackling complex problems. Psychological studies suggest a vital connection between explicit information in passages and human prior knowledge during reading. Nevertheless, current research has given insufficient attention to linking input passages and PLMs’ pre-training-based knowledge from the perspective of human cognition studies. In this study, we introduce a Prompting Explicit and Implicit knowledge (PEI) framework, which uses prompts to connect explicit and implicit knowledge, aligning with the human reading process for multi-hop QA. We consider the input passages as explicit knowledge, employing them to elicit implicit knowledge through unified prompt reasoning. Furthermore, our model incorporates type-specific reasoning via prompts, a form of implicit knowledge. Experimental results show that PEI performs comparably to the state-of-the-art on HotpotQA. Ablation studies confirm the efficacy of our model in bridging and integrating explicit and implicit knowledge.

Prompting for Numerical Sequences: A Case Study on Market Comment Generation
Masayuki Kawarada | Tatsuya Ishigaki | Hiroya Takamura

Large language models (LLMs) have been applied to a wide range of data-to-text generation tasks, including tables, graphs, and time-series numerical data-to-text settings. While research on generating prompts for structured data such as tables and graphs is gaining momentum, in-depth investigations into prompting for time-series numerical data are lacking. Therefore, this study explores various input representations, including sequences of tokens and structured formats such as HTML, LaTeX, and Python-style codes. In our experiments, we focus on the task of Market Comment Generation, which involves taking a numerical sequence of stock prices as input and generating a corresponding market comment. Contrary to our expectations, the results show that prompts resembling programming languages yield better outcomes, whereas those similar to natural languages and longer formats, such as HTML and LaTeX, are less effective. Our findings offer insights into creating effective prompts for tasks that generate text from numerical sequences.
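
To make the contrast between input representations concrete, the sketch below serializes one short price sequence as a plain token sequence, as a Python-style assignment, and as an HTML table row. The exact templates used in the paper are not given in the abstract, so these three formats, the function names, and the toy prices are assumptions for illustration.

```python
def as_token_sequence(prices):
    """Space-separated numbers, the plainest representation."""
    return " ".join(str(p) for p in prices)

def as_python_list(prices, name="prices"):
    """Python-style assignment, resembling program code."""
    return f"{name} = {list(prices)!r}"

def as_html_row(prices):
    """Single-row HTML table, a longer structured format."""
    cells = "".join(f"<td>{p}</td>" for p in prices)
    return f"<table><tr>{cells}</tr></table>"

# Hypothetical stock prices.
prices = [101.5, 102.0, 99.8]
formats = {
    "tokens": as_token_sequence(prices),
    "python": as_python_list(prices),
    "html": as_html_row(prices),
}
```

Even for three values the HTML form is several times longer than the other two, which is consistent with the abstract's observation that longer formats were less effective.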

Prompting Large Language Models for Counterfactual Generation: An Empirical Study
Yongqi Li | Mayi Xu | Xin Miao | Shen Zhou | Tieyun Qian

Large language models (LLMs) have made remarkable progress in a wide range of natural language understanding and generation tasks. However, their ability to generate counterfactuals has not been examined systematically. To bridge this gap, we present a comprehensive evaluation framework on various types of NLU tasks, which covers all key factors in determining LLMs’ capability of generating counterfactuals. Based on this framework, we 1) investigate the strengths and weaknesses of LLMs as the counterfactual generator, and 2) disclose the factors that affect LLMs when generating counterfactuals, including both the intrinsic properties of LLMs and prompt designing. The results show that, though LLMs are promising in most cases, they face challenges in complex tasks like RE since they are bounded by task-specific performance, entity constraints, and inherent selection bias. We also find that alignment techniques, e.g., instruction-tuning and reinforcement learning from human feedback, may potentially enhance the counterfactual generation ability of LLMs. On the contrary, simply increasing the parameter size does not yield the desired improvements. Besides, from the perspective of prompt designing, task guidelines unsurprisingly play an important role. However, the chain-of-thought approach does not always help due to inconsistency issues.

PromptStream: Self-Supervised News Story Discovery Using Topic-Aware Article Representations
Arezoo Hatefi | Anton Eklund | Mona Forsman

Given the importance of identifying and monitoring news stories within the continuous flow of news articles, this paper presents PromptStream, a novel method for unsupervised news story discovery. In order to identify coherent and comprehensive stories across the stream, it is crucial to create article representations that incorporate as much topic-related information from the articles as possible. PromptStream constructs these article embeddings using cloze-style prompting. These representations continually adjust to the evolving context of the news stream through self-supervised learning, employing a contrastive loss and a memory of the most confident article-story assignments from the most recent days. Extensive experiments with real news datasets highlight the notable performance of our model, establishing a new state of the art. Additionally, we delve into selected news stories to reveal how the model’s structuring of the article stream aligns with story progression.

Prompt Tuning for Few-shot Relation Extraction via Modeling Global and Local Graphs
Zirui Zhang | Yiyu Yang | Benhui Chen

Recently, prompt-tuning has achieved very significant results for few-shot tasks. The core idea of prompt-tuning is to insert prompt templates into the input, thus converting the classification task into a masked language modeling problem. However, for few-shot relation extraction tasks, how to mine more information from limited resources becomes particularly important. In this paper, we first construct a global relation graph based on label consistency to optimize the feature representation of samples between different relations. Then the global relation graph is further divided to form a local relation subgraph for each relation type to optimize the feature representation of samples within the same relation. This fully uses the limited supervised information and improves the tuning efficiency. In addition, the existence of rich semantic knowledge in relation labels cannot be ignored. For this reason, this paper incorporates the knowledge in relation labels into prompt-tuning. Specifically, the potential knowledge implicit in relation labels is injected into constructing learnable prompt templates. In this paper, we conduct extensive experiments on four datasets under low-resource settings, showing that this method achieves significant results.

PrOnto: Language Model Evaluations for 859 Languages
Luke Gessler

Evaluation datasets are critical resources for measuring the quality of pretrained language models. However, due to the high cost of dataset annotation, these resources are scarce for most languages other than English, making it difficult to assess the quality of language models. In this work, we present a new method for evaluation dataset construction which enables any language with a New Testament translation to receive a suite of evaluation datasets suitable for pretrained language model evaluation. The method critically involves aligning verses with those in the New Testament portion of English OntoNotes, and then projecting annotations from English to the target language, with no manual annotation required. We apply this method to 1051 New Testament translations in 859 languages and make them publicly available. Additionally, we conduct experiments which demonstrate the efficacy of our method for creating evaluation tasks which can assess language model quality.

Prophecy Distillation for Boosting Abstractive Summarization
Jiaxin Duan | Fengyu Lu | Junfei Liu

Abstractive summarization models learned with maximum likelihood estimation (MLE) have long been guilty of generating unfaithful facts alongside ambiguous focus. An improved paradigm under the guidance of reference-identified words, i.e., guided summarization, has exhibited remarkable advantages in overcoming this problem. However, it has limited real-world application, since the prophetic guidance is practically agnostic at inference. In this paper, we introduce a novel teacher-student framework, which learns a regular summarization model to mimic the behavior of being guided by prophecy for boosting abstractive summaries. Specifically, by training in probability spaces to follow and distinguish a guided teacher model, a student model learns the key to generating teacher-like quality summaries without any guidance. We refer to this process as prophecy distillation, and it breaks the limitations of both standard and guided summarization. Through extensive experiments, we show that our method achieves new or matched state-of-the-art on four well-known datasets, including ROUGE scores, faithfulness, and saliency awareness. Human evaluations are also carried out to evidence these merits. Furthermore, we conduct empirical studies to analyze how the hyperparameter settings and the guidance choice affect TPG performance.

Few-shot Event Detection (FSED) is a meaningful task due to the limited labeled data and expensive manual labeling. Some prompt-based methods are used in FSED. However, these methods require large GPU memory due to the increased length of input tokens caused by concatenating prompts, as well as additional human effort for designing verbalizers. Moreover, they ignore instance and prompt biases arising from the confounding effects between prompts and texts. In this paper, we propose a prototype-based prompt-instance Interaction with causal Intervention (2xInter) model to conveniently utilize both prompts and verbalizers and effectively eliminate all biases. Specifically, 2xInter first presents a Prototype-based Prompt-Instance Interaction (PPII) module that applies an interactive approach for texts and prompts to reduce memory and regards class prototypes as verbalizers to avoid design costs. Next, 2xInter constructs a Structural Causal Model (SCM) to explain instance and prompt biases and designs a Double-View Causal Intervention (DVCI) module to eliminate these biases. Due to limited supervised information, DVCI devises a generation-based prompt adjustment for instance intervention and a Siamese network-based instance contrasting for prompt intervention. Finally, the experimental results show that 2xInter achieves state-of-the-art performance on RAMS and ACE datasets.

Pruning before Fine-tuning: A Retraining-free Compression Framework for Pre-trained Language Models
Pingjie Wang | Hongcheng Liu | Yanfeng Wang | Yu Wang

Structured pruning is an effective technique for compressing pre-trained language models (PLMs), reducing model size and improving inference speed for efficient deployment. However, most existing pruning algorithms require retraining, leading to additional computational overhead. While some retraining-free approaches have been proposed for classification tasks, they still require a fully fine-tuned model for the task, and may cause catastrophic performance degradation on generative tasks. To address these challenges, we propose P-pruning (pre-pruning), an innovative task-specific compression framework. P-pruning prunes redundant modules of PLMs before fine-tuning, reducing the costs associated with fine-tuning. We also introduce a pruning algorithm for this framework, which includes two techniques: (1) module clustering, which clusters the outputs of all heads and neurons based on the task input; and (2) centroid selection, which identifies the most salient element in each cluster and prunes the others. We apply our method to BERT and GPT-2 and evaluate its effectiveness on GLUE, SQuAD, WikiText-2, WikiText-103, and PTB datasets. Experimental results demonstrate that our approach achieves higher performance in both classification and generative tasks, while also reducing the time required for fine-tuning.

PSentScore: Evaluating Sentiment Polarity in Dialogue Summarization
Yongxin Zhou | Fabien Ringeval | François Portet

Automatic dialogue summarization is a well-established task with the goal of distilling the most crucial information from human conversations into concise textual summaries. However, most existing research has predominantly focused on summarizing factual information, neglecting the affective content, which can hold valuable insights for analyzing, monitoring, or facilitating human interactions. In this paper, we introduce and assess a set of measures, PSentScore, aimed at quantifying the preservation of affective content in dialogue summaries. Our findings indicate that state-of-the-art summarization models do not preserve the affective content well within their summaries. Moreover, we demonstrate that a careful selection of the training set for dialogue samples can lead to improved preservation of affective content in the generated summaries, albeit with a minor reduction in content-related metrics.

Linguistic data, a component critical not only for research in a variety of fields but also for the development of various Natural Language Processing (NLP) applications, can contain personal information. As a result, its accessibility is limited, both from a legal and an ethical standpoint. One of the solutions is the pseudonymization of the data. Key stages of this process include the identification of sensitive elements and the generation of suitable surrogates in a way that the data is still useful for the intended task. Within this paper, we conduct an analysis of tagsets that have previously been utilized in anonymization and pseudonymization. We also investigate what kinds of Personally Identifiable Information (PII) appear in various domains. These reveal that none of the analyzed tagsets account for all of the PII types present cross-domain at the level of detailedness seemingly required for pseudonymization. We advocate for a universal system of tags for categorizing PIIs leading up to their replacement. Such categorization could facilitate the generation of grammatically, semantically, and sociolinguistically appropriate surrogates for the kinds of information that are considered sensitive in a given domain, resulting in a system that would enable dynamic pseudonymization while keeping the texts readable and useful for future research in various fields.

Face-to-face interactions between representatives of the state and citizens are a key intercept in public service delivery, for instance when providing social benefits to vulnerable groups. Despite the relevance of these encounters for the individual, but also for society at large, there is a significant research gap in the systematic empirical study of the communication taking place. This is mainly due to the high institutional and data protection barriers for collecting data in a very sensitive and private setting in which citizens request support from the state. In this paper, we describe the procedure of compiling the first open access dataset of transcribed recordings of so-called Public Service Encounters in Germany, i.e., meetings between state officials and citizens in which there is direct communication in order to allocate state services. This dataset sets a new research directive in the social sciences, because it allows the community to open up the black box of direct state-citizen interaction. With data of this kind it becomes possible to directly and systematically investigate bias, bureaucratic discrimination and other power-driven dynamics in the actual communication and ideally propose guidelines to alleviate these issues.

PSYDIAL: Personality-based Synthetic Dialogue Generation Using Large Language Models
Ji-Eun Han | Jun-Seok Koh | Hyeon-Tae Seo | Du-Seong Chang | Kyung-Ah Sohn

We present a novel end-to-end personality-based synthetic dialogue data generation pipeline, specifically designed to elicit responses from large language models via prompting. We design the prompts to generate more human-like dialogues considering real-world scenarios when users engage with chatbots. We introduce PSYDIAL, the first Korean dialogue dataset focused on personality-based dialogues, curated using our proposed pipeline. Notably, we focus on the Extraversion dimension of the Big Five personality model in our research. Experimental results indicate that while pre-trained models and those fine-tuned with a chit-chat dataset struggle to generate responses reflecting personality, models trained with PSYDIAL show significant improvements. The versatility of our pipeline extends beyond dialogue tasks, offering potential for other non-dialogue related applications. This research opens doors for more nuanced, personality-driven conversational AI in Korean and potentially other languages.

Humor is an intricate part of verbal communication and dealing with this kind of phenomenon is essential to building systems that can process language at large with all of its complexities. In this paper, we introduce Puntuguese, a new corpus of punning humor in Portuguese, motivated by previous works showing that currently available corpora for this language are still unfit for Machine Learning due to data leakage. Puntuguese comprises 4,903 manually-gathered punning one-liners in Brazilian and European Portuguese. To create negative examples that differ exclusively in terms of funniness, we carried out a micro-editing process, in which all jokes were edited by fluent Portuguese speakers to make the texts unfunny. Finally, we did some experiments on Humor Recognition, showing that Puntuguese is considerably more difficult than the previous corpus, achieving an F1-Score of 68.9%. With this new dataset, we hope to enable research not only in NLP but also in other fields that are interested in studying humor; thus, the data is publicly available.

Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.

PyRater: A Python Toolkit for Annotation Analysis
Angelo Basile | Marc Franco-Salvador | Paolo Rosso

We introduce PyRater, an open-source Python toolkit designed for analysing corpora annotations. When creating new annotated language resources, probabilistic models of annotation are the state-of-the-art solution for identifying the best annotators, retrieving the gold standard, and more generally separating annotation signal from noise. PyRater offers a unified interface for several such models and includes an API for the addition of new ones. Additionally, the toolkit has built-in functions to read datasets with multiple annotations and plot the analysis outcomes. In this work, we also demonstrate a novel application of PyRater to zero-shot classifiers, where it effectively selects the best-performing prompt. We make PyRater available to the research community.

Qabas: An Open-Source Arabic Lexicographic Database
Mustafa Jarrar | Tymaa Hasanain Hammouda

We present Qabas, a novel open-source Arabic lexicon designed for NLP applications. The novelty of Qabas lies in its synthesis of 110 lexicons. Specifically, Qabas lexical entries (lemmas) are assembled by linking lemmas from 110 lexicons. Furthermore, Qabas lemmas are also linked to 12 morphologically annotated corpora (about 2M tokens), making it the first Arabic lexicon to be linked to lexicons and corpora. Qabas was developed semi-automatically, utilizing a mapping framework and a web-based tool. Compared with other lexicons, Qabas stands as the most extensive Arabic lexicon, encompassing about 58K lemmas (45K nominal lemmas, 12.5K verbal lemmas, and 473 functional-word lemmas). Qabas is open-source and accessible online at https://sina.birzeit.edu/qabas

QA-based Event Start-Points Ordering for Clinical Temporal Relation Annotation
Seiji Shimizu | Lis Pereira | Shuntaro Yada | Eiji Aramaki

Temporal relation annotation in the clinical domain is crucial yet challenging due to its workload and the medical expertise required. In this paper, we propose a novel annotation method that integrates event start-points ordering and question-answering (QA) as the annotation format. By focusing only on two points on a timeline, start-points ordering reduces ambiguity and simplifies the relation set to be considered during annotation. QA as annotation recasts temporal relation annotation into a reading comprehension task, allowing annotators to use natural language instead of the formalisms commonly adopted in temporal relation annotation. Based on our method, most of the relations in a document are inferable from a significantly smaller number of explicitly annotated relations, showing the efficiency of our proposed method. Using these inferred relations, we develop a temporal relation classification model that achieves a 0.72 F1 score. Also, by decomposing the annotation process into QA generation and QA validation, our method enables collaboration among medical experts and non-experts. We obtained high inter-annotator agreement (IAA) scores, which indicate the positive prospect of such collaboration in the annotation process. Our annotated corpus, annotation tool, and trained model are publicly available: https://github.com/seiji-shimizu/qa-start-ordering.
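
The claim that most relations are inferable from a smaller annotated set can be illustrated with a transitive closure over start-point orderings. This is a minimal sketch, not the authors' implementation: the abstract does not specify the inference procedure, and the function name `infer_start_orderings` and the example events are hypothetical.

```python
from itertools import product


def infer_start_orderings(events, annotated_before):
    """Return every (a, b) pair where a starts before b.

    Assumption: inference is a plain transitive closure over the
    explicitly annotated "starts before" pairs.
    """
    before = {event: set() for event in events}
    for earlier, later in annotated_before:
        before[earlier].add(later)
    # Warshall-style closure; product() varies its first element slowest,
    # so the intermediate event k is the outermost loop, as required.
    for k, i, j in product(events, repeat=3):
        if k in before[i] and j in before[k]:
            before[i].add(j)
    return {(a, b) for a in events for b in before[a]}


# Hypothetical clinical events with three explicitly annotated pairs.
events = ["admission", "surgery", "discharge", "follow_up"]
annotated = [
    ("admission", "surgery"),
    ("surgery", "discharge"),
    ("discharge", "follow_up"),
]
all_pairs = infer_start_orderings(events, annotated)
inferred_only = all_pairs - set(annotated)
print(len(annotated), "annotated;", len(inferred_only), "inferred")
```

In this toy chain, three annotated pairs yield three additional inferred pairs; the ratio grows with longer chains, which is the efficiency argument the abstract makes.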

QCAW 1.0: Building a Qatari Corpus of Student Argumentative Writing
Wajdi Zaghouani | Abdelhamid Ahmed | Xiao Zhang | Lameya Rezk

This paper presents the creation of the Qatari Corpus of Argumentative Writing (QCAW) as an annotated L1 Arabic and L2 English bilingual writer corpus. It comprises 200,000 tokens of argumentative writing by Qatari university students in L1 Arabic and L2 English. The corpus includes 195 essays written by 195 students, 159 females and 36 males. The students were native Arabic speakers proficient in English as a second language. The corpus is divided into Arabic and English sections, accompanied by part-of-speech annotated files. The metadata contains information about the students (gender, major, first and second languages) and the essays (text serial numbers, word limits, genre, writing date, time spent, and location). The paper outlines the steps for collecting and analysing the corpus, including details on essay writers, topic selection, pre-analysis text modifications, proficiency level, gender, and major ratings. Statistical analyses were applied to examine the corpus. The QCAW offers a valuable bilingual data source authored by the same students in Arabic and English, with implications for further research.

Chain-of-Thought prompting has improved the reasoning capability of large language models (LLMs). However, it is still challenging to guarantee effectiveness and stability for questions requiring complicated reasoning. Recently, Plan-and-Solve prompting has enhanced the reasoning capability for complex questions by first planning the solution steps and then solving them step by step, but it has difficulty representing and executing the problem-solving logic of complex questions. To deal with these challenges, in this work, we propose a novel Plan-and-Solve prompting method based on Question Decomposition Meaning Representation (QDMR). Specifically, this method first allows the LLM to generate a QDMR graph to represent the problem-solving logic, which is a directed acyclic graph composed of sub-questions. Then, the LLM generates a specific solving process based on the QDMR graph. When solving each sub-question, it can locate the preceding sub-questions and their answers according to the QDMR graph, and then utilize this information to solve it. Compared with existing Plan-and-Solve prompting techniques, our method can not only represent the problem-solving logic of complicated questions more accurately with the aid of the QDMR graph, but also deliver the dependence information accurately for different solution steps according to the QDMR graph. In addition, with supervised fine-tuning on the Allen Institute dataset, the decomposing capability of the LLM for complicated questions can be considerably enhanced. Extensive experiments show that our method achieves significant improvements in arithmetic reasoning and commonsense reasoning tasks compared with the classical Chain-of-Thought prompting and Plan-and-Solve prompting techniques, and the improvements achieved are even greater for problems with more reasoning steps.
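
The idea of solving sub-questions in the order given by a directed acyclic graph can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's method: the graph encoding, the function `solve_with_qdmr`, and the `toy_answer` stand-in for an LLM call are all hypothetical.

```python
from graphlib import TopologicalSorter


def solve_with_qdmr(sub_questions, depends_on, answer_fn):
    """Answer sub-questions in dependency order.

    sub_questions: dict mapping a sub-question id to its text.
    depends_on: dict mapping an id to the ids it depends on.
    answer_fn: callable(text, context) -> answer; a placeholder for
        whatever model call would produce the answer.
    """
    graph = {qid: depends_on.get(qid, []) for qid in sub_questions}
    answers = {}
    # static_order() yields each node only after all of its predecessors.
    for qid in TopologicalSorter(graph).static_order():
        context = {dep: answers[dep] for dep in graph[qid]}
        answers[qid] = answer_fn(sub_questions[qid], context)
    return answers


def toy_answer(text, context):
    """Hypothetical solver: looks up facts or multiplies predecessor answers."""
    if text == "total apples":
        return context["q1"] * context["q2"]
    return {"apples per box": 12, "number of boxes": 5}[text]


sub_questions = {
    "q1": "apples per box",
    "q2": "number of boxes",
    "q3": "total apples",
}
depends_on = {"q3": ["q1", "q2"]}
answers = solve_with_qdmr(sub_questions, depends_on, toy_answer)
print(answers["q3"])
```

Passing only the answers of direct predecessors as context mirrors the abstract's point that each step receives exactly the dependence information the graph specifies.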

Qsnail: A Questionnaire Dataset for Sequential Question Generation
Yan Lei | Liang Pang | Yuanzhuo Wang | Huawei Shen | Xueqi Cheng

The questionnaire is a professional research methodology used for both qualitative and quantitative analysis of human opinions, preferences, attitudes, and behaviors. However, designing and evaluating questionnaires demands significant effort due to their intricate and complex structure. Questionnaires entail a series of questions that must conform to intricate constraints involving the questions, options, and overall structure. Specifically, the questions should be relevant and specific to the given research topic and intent. The options should be tailored to the questions, ensuring they are mutually exclusive, complete, and ordered sensibly. Moreover, the sequence of questions should follow a logical order, grouping similar topics together. As a result, automatically generating questionnaires presents a significant challenge and this area has received limited attention primarily due to the scarcity of high-quality datasets. To address these issues, we present Qsnail, the first dataset specifically constructed for the questionnaire generation task, which comprises 13,168 human-written questionnaires gathered from online platforms. We further conduct experiments on Qsnail, and the results reveal that retrieval models and traditional generative models do not fully align with the given research topic and intents. Large language models, while more closely related to the research topic and intents, exhibit significant limitations in terms of diversity and specificity. Despite enhancements through the chain-of-thought prompt and finetuning, questionnaires generated by language models still fall short of human-written questionnaires. Therefore, questionnaire generation is challenging and needs to be further explored. The dataset will be published in the future.

Quantifying the Impact of Disfluency on Spoken Content Summarization
Maria Teleki | Xiangjue Dong | James Caverlee

Spoken content is abundant – including podcasts, meeting transcripts, and TikTok-like short videos. And yet, many important tasks like summarization are often designed for written content rather than the looser, noisier, and more disfluent style of spoken content. Hence, we aim in this paper to quantify the impact of disfluency on spoken content summarization. Do disfluencies negatively impact the quality of summaries generated by existing approaches? And if so, to what degree? Coupled with these goals, we also investigate two methods towards improving summarization in the presence of such disfluencies. We find that summarization quality does degrade with an increase in these disfluencies and that a combination of multiple disfluency types leads to even greater degradation. Further, our experimental results show that naively removing disfluencies and augmenting with special tags can worsen the summarization when used for testing, but that removing disfluencies for fine-tuning yields the best results. We make the code available at https://github.com/mariateleki/Quantifying-Impact-Disfluency.

The paper describes a dataset composed of two sub-corpora from two different sources in Italian. The QUEEREOTYPES corpus includes social media texts regarding LGBTQIA+ individuals, behaviors, ideology and events. The texts were collected from Facebook and Twitter in 2018 and were annotated for the presence of stereotypes, and orthogonal dimensions (such as hate speech, aggressiveness, offensiveness, and irony in one sub-corpus, and stance in the other). The resource was developed by Natural Language Processing researchers together with activists from an Italian LGBTQIA+ not-for-profit organization. The creation of the dataset allows the NLP community to study stereotypes against marginalized groups, individuals and, ultimately, to develop proper tools and measures to reduce the online spread of such stereotypes. A test for the robustness of the language resource has been performed by means of 5-fold cross-validation experiments. Finally, text classification experiments have been carried out with a fine-tuned version of AlBERTo (a BERT-based model pre-trained on Italian tweets) and mBERT, obtaining good results on the task of stereotype detection, suggesting that stereotypes towards different targets might share common traits.

Query-driven Relevant Paragraph Extraction from Legal Judgments
Santosh T.y.s.s. | Elvin A. Quero Hernandez | Matthias Grabmair

Legal professionals often grapple with navigating lengthy legal judgements to pinpoint information that directly addresses their queries. This paper focuses on the task of extracting relevant paragraphs from legal judgements based on the query. We construct a specialized dataset for this task from the European Court of Human Rights (ECtHR) using the case law guides. We assess the performance of current retrieval models in a zero-shot way and also establish fine-tuning benchmarks using various models. The results highlight the significant gap between fine-tuned and zero-shot performance, emphasizing the challenge of handling distribution shift in the legal domain. We notice that legal pre-training handles distribution shift on the corpus side but still struggles with query-side distribution shift, with unseen legal queries. We also explore various Parameter Efficient Fine-Tuning (PEFT) methods to evaluate their practicality within the context of information retrieval, shedding light on the effectiveness of different PEFT methods across diverse configurations, with pre-training and model architectures influencing the choice of PEFT method.

QueryNER: Segmentation of E-commerce Queries
Chester Palen-Michel | Lizzie Liang | Zhe Wu | Constantine Lignos

We present QueryNER, a manually-annotated dataset and accompanying model for e-commerce query segmentation. Prior work in sequence labeling for e-commerce has largely addressed aspect-value extraction which focuses on extracting portions of a product title or query for narrowly defined aspects. Our work instead focuses on the goal of dividing a query into meaningful chunks with broadly applicable types. We report baseline tagging results and conduct experiments comparing token and entity dropping for null and low recall query recovery. Challenging test sets are created using automatic transformations and show how simple data augmentation techniques can make the models more robust to noise. We make the QueryNER dataset publicly available.

Question Answering over Tabular Data with DataBench: A Large-Scale Empirical Evaluation of LLMs
Jorge Osés Grijalba | L. Alfonso Ureña-López | Eugenio Martínez Cámara | Jose Camacho-Collados

Large Language Models (LLMs) are showing emerging abilities, and one of the latest recognized ones deals with their ability to reason and answer questions from tabular data. Although there are some available datasets to assess question answering systems on tabular data, they are not large and diverse enough to properly assess the capabilities of LLMs. To this end, we propose DataBench, a benchmark composed of 65 real-world datasets over several domains, including 20 human-generated questions per dataset, totaling 1,300 questions and answers overall. Using this benchmark, we perform a large-scale empirical comparison of several open and closed source models, including both code-generating and in-context learning models. The results highlight the current gap between open-source and closed-source models, with all types of model having room for improvement even on simple boolean questions or questions involving a single column.

Quite Good, but Not Enough: Nationality Bias in Large Language Models - a Case Study of ChatGPT
Shucheng Zhu | Weikang Wang | Ying Liu

While nationality is a pivotal demographic element that enhances the performance of language models, it has received far less scrutiny regarding inherent biases. This study investigates nationality bias in ChatGPT (GPT-3.5), a large language model (LLM) designed for text generation. The research covers 195 countries, 4 temperature settings, and 3 distinct prompt types, generating 4,680 discourses about nationality descriptions in Chinese and English. Automated metrics were used to analyze the nationality bias, and expert annotators alongside ChatGPT itself evaluated the perceived bias. The results show that ChatGPT’s generated discourses are predominantly positive, especially compared to its predecessor, GPT-2. However, when prompted with negative inclinations, it occasionally produces negative content. Despite ChatGPT considering its generated text as neutral, it shows consistent self-awareness about nationality bias when subjected to the same pair-wise comparison annotation framework used by human annotators. In conclusion, while ChatGPT’s generated texts seem friendly and positive, they reflect the inherent nationality biases in the real world. This bias may vary across different language versions of ChatGPT, indicating diverse cultural perspectives. The study highlights the subtle and pervasive nature of biases within LLMs, emphasizing the need for further scrutiny.

Move structures have been studied in English for Specific Purposes (ESP) and English for Academic Purposes (EAP) for decades. However, there are few move annotation corpora for Research Article (RA) abstracts. In this paper, we introduce RAAMove, a comprehensive multi-domain corpus dedicated to the annotation of move structures in RA abstracts. The primary objective of RAAMove is to facilitate move analysis and automatic move identification. This paper provides a thorough discussion of the corpus construction process, including the scheme, data collection, annotation guidelines, and annotation procedures. The corpus is constructed through two stages: initially, expert annotators manually annotate high-quality data; subsequently, based on the human-annotated data, a BERT-based model is employed for automatic annotation with the help of experts’ modification. The result is a large-scale and high-quality corpus comprising 33,988 annotated instances. We also conduct preliminary move identification experiments using the BERT-based model to verify the effectiveness of the proposed corpus and model. The annotated corpus is available for academic research purposes and can serve as essential resources for move analysis, English language teaching and writing, as well as move/discourse-related tasks in Natural Language Processing (NLP).

RADCoT: Retrieval-Augmented Distillation to Specialization Models for Generating Chain-of-Thoughts in Query Expansion
Sung-Min Lee | Eunhwan Park | DongHyeon Jeon | Inho Kang | Seung-Hoon Na

Large language models (LLMs) have demonstrated superior performance to that of small language models (SLMs) in information retrieval for various subtasks including dense retrieval, reranking, query expansion, and pseudo-document generation. However, the parameter sizes of LLMs are extremely large, making it expensive to operate LLMs stably for providing LLM-based retrieval services. Recently, retrieval-augmented language models have been widely employed to significantly reduce the parameter size by retrieving relevant knowledge from large-scale corpora and exploiting the resulting “in-context” knowledge as additional model input, thereby substantially reducing the burden of internalizing and retaining world knowledge in model parameters. Armed with retrieval-augmented language models, we present a retrieval-augmented model specialization that distills the capability of LLMs to generate chain-of-thoughts (CoT) for query expansion – that is, injects the LLM’s capability to generate CoT into a retrieval-augmented SLM – referred to as RADCoT. Experimental results on the MS-MARCO and TREC DL 19 and 20 datasets show that RADCoT yields consistent improvements over distillation without retrieval, achieving comparable performance to that of the query expansion method using LLM-based CoTs. Our code is publicly available at https://github.com/ZIZUN/RADCoT.

Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Existing solutions, such as deploying task-specific verifiers or voting over multiple reasoning paths, either require extensive human annotations or fail in scenarios with inconsistent responses. To address these challenges, we introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources. RankPrompt breaks down the ranking problem into a series of comparisons among diverse responses, leveraging the inherent capabilities of LLMs to generate chains of comparison as contextual exemplars. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4, with improvements of up to 13%. Moreover, RankPrompt excels in LLM-based automatic evaluations for open-ended tasks, aligning with human judgments 74% of the time in the AlpacaEval dataset. It also exhibits robustness to variations in response order and consistency. Collectively, our results validate RankPrompt as an effective method for eliciting high-quality feedback from language models.

The creation of instruction data and evaluation benchmarks for serving large language models (LLMs) often involves enormous human annotation. This issue becomes particularly pronounced when rapidly developing such resources for a non-English language like Japanese. Instead of following the popular practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), we propose an efficient self-instruct method based on GPT-4. We first translate a small amount of English instructions into Japanese and post-edit them to obtain native-level quality. GPT-4 then utilizes them as demonstrations to automatically generate Japanese instruction data. We also construct an evaluation benchmark containing 80 questions across 8 categories, using GPT-4 to automatically assess the response quality of LLMs without human references. The empirical results suggest that the models fine-tuned on our GPT-4 self-instruct data significantly outperformed Japanese-Alpaca across all three base pre-trained models. Our GPT-4 self-instruct data allowed the LLaMA 13B model to defeat GPT-3.5 (Davinci-003) with a 54.37% win-rate. The human evaluation exhibits the consistency between GPT-4’s assessments and human preference. Our high-quality instruction data and evaluation benchmark are released here.

Simultaneous interpretation is a cognitively taxing task, and even seasoned professionals benefit from real-time assistance. However, both recruiting professional interpreters and evaluating new assistance techniques are difficult. We present a novel, realistic simultaneous interpretation task that mimics the cognitive load of interpretation with crowdworker surrogates. Our task tests different real-time assistance methods in a Wizard-of-Oz experiment with a large pool of proxy users and compares against professional interpreters. Both professional and proxy participants respond similarly to changes in interpreting conditions, including improvement with two assistance interventions—translation of specific terms and of numbers—compared to a no-assistance control.

Rationale-based Learning Using Self-Supervised Narrative Events for Text Summarisation of Interactive Digital Narratives
Ashwathy T Revi | Stuart E. Middleton | David E. Millard

This paper explores using rationale-based learning with supervised attention to focus the training of text summarisation models on words and sentences surrounding choice points for Interactive Digital Narratives (IDNs). IDNs allow players to interact with the story via choice points, making choices central to these narratives. Exploiting such knowledge about narrative structure during model training can help ensure key narrative information appears in generated summaries of narrative-based text and thus improve the quality of these summaries. We experiment with using word-level and sentence-level rationales indicating the proximity of words and sentences to self-supervised choice points. Our results indicate that rationale-based learning can improve the ability of attention-based text summarisation models to create higher quality summaries that encode key narrative information better for different playthroughs of the same interactive narrative. These results suggest a promising new direction for narrative-based text summarisation models.

Reading Does Not Equal Reading: Comparing, Simulating and Exploiting Reading Behavior across Populations
David R. Reich | Shuwen Deng | Marina Björnsdóttir | Lena Jäger | Nora Hollenstein

Eye-tracking-while-reading corpora play a crucial role in the study of human language processing, and, more recently, have been leveraged for cognitively enhancing neural language models. A critical limitation of existing corpora is that they often lack diversity, comprising primarily native speakers. In this study, we expand the eye-tracking-while-reading dataset CopCo, which initially included only Danish L1 readers with and without dyslexia, by incorporating a new dataset of L2 readers with diverse L1 backgrounds. Thus, the extended CopCo corpus constitutes the first eye-tracking-while-reading dataset encompassing neurotypical L1 and L1 readers with dyslexia as well as L2 readers, all reading the same materials. We first provide extensive descriptive statistics of the extended CopCo corpus. Second, we investigate how different degrees of diversity of the training data affect a state-of-the-art generative model of eye movements in reading. Finally, we use this scanpath generation model for gaze-augmented language modeling and investigate the impact of diversity in the training data on the model’s performance on a range of NLP downstream tasks. The code can be found here: https://github.com/norahollenstein/copco-processing.

The paper presents the design and construction of a time-stamped multimodal dataset for reading research, including multiple time-aligned temporal signals elicited with four experimental trials of connected text reading by both child and adult readers. We present the experimental protocols, as well as the data acquisition process and the post-processing phase of data annotation/augmentation. To evaluate the potential and usefulness of a time-aligned multimodal dataset for reading research, we present a few statistical analyses showing the correlation and complementarity of multimodal time-series of reading data, as well as some results of modelling adults’ reading data by integrating different modalities. The total dataset size amounts to about 2.5 GByte in compressed format.

Reassessing Semantic Knowledge Encoded in Large Language Models through the Word-in-Context Task
Yoshihiko Hayashi

Despite the remarkable recent advancements in large language models (LLMs), a comprehensive understanding of their inner workings and the depth of their knowledge remains elusive. This study aims to reassess the semantic knowledge encoded in LLMs by utilizing the Word-in-Context (WiC) task, which involves predicting the semantic equivalence of a target word across different contexts, as a probing task. To address this challenge, we start by prompting LLMs, specifically GPT-3 and GPT-4, to generate natural language descriptions that contrast the meanings of the target word in two contextual sentences given in the WiC dataset. Subsequently, we conduct a manual analysis to examine their linguistic attributes. In parallel, we train a text classification model that utilizes the generated descriptions as supervision and assesses their practical effectiveness in the WiC task. The linguistic and empirical findings reveal a consistent provision of valid and valuable descriptions by LLMs, with LLM-generated descriptions significantly improving classification accuracy. Notably, the highest classification result achieved with GPT-3-generated descriptions largely surpassed GPT-3’s zero-shot baseline. However, the GPT-4-generated descriptions performed slightly below GPT-4’s zero-shot baseline, suggesting that the full potential of the most advanced large language models, such as GPT-4, is yet to be fully revealed.

Rebalancing Label Distribution While Eliminating Inherent Waiting Time in Multi Label Active Learning Applied to Transformers
Maxime Arens | Lucile Callebert | Mohand Boughanem | Jose G. Moreno

Data annotation is crucial for machine learning, notably in technical domains, where the quality and quantity of annotated data significantly affect the effectiveness of trained models. Employing humans is costly, especially when annotating for multi-label classification, as instances may bear multiple labels. Active Learning (AL) aims to alleviate annotation costs by intelligently selecting instances for annotation, rather than randomly annotating. Recent attention on transformers has spotlighted the potential of AL in this context. However, in practical settings, implementing AL faces challenges beyond theory. Notably, the gap between AL cycles presents idle time for annotators. To address this issue, we investigate alternative instance selection methods, aiming to maximize annotation efficiency by seamlessly integrating with the AL process. We begin by evaluating two existing methods in our transformer setting, employing random sampling and outdated information, respectively. Following this, we propose our novel method based on annotating instances to rebalance label distribution. Our approach mitigates biases, enhances model performance (up to 23% improvement in F1 score), reduces strategy-dependent disparities (decrease of nearly 50% in standard deviation) and reduces label imbalance (decrease of 30% in Mean Imbalance Ratio).
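
For readers unfamiliar with the Mean Imbalance Ratio, the sketch below computes it for a small multi-label dataset. The abstract does not define the metric, so this assumes the definition commonly used in the multi-label imbalance literature: for each label, the count of the most frequent label divided by that label's count, averaged over all observed labels. The function name and the example data are hypothetical.

```python
from collections import Counter


def mean_imbalance_ratio(label_sets):
    """Average, over observed labels, of max label count / label count.

    Assumption: labels that never occur are simply not counted, since
    their ratio would be undefined.
    """
    counts = Counter(label for labels in label_sets for label in labels)
    if not counts:
        raise ValueError("no labels found")
    max_count = max(counts.values())
    ratios = [max_count / count for count in counts.values()]
    return sum(ratios) / len(ratios)


# Hypothetical annotations: label "a" appears 4 times, "b" twice, "c" once.
annotations = [{"a", "b"}, {"a"}, {"a", "c"}, {"a", "b"}]
mean_ir = mean_imbalance_ratio(annotations)
print(round(mean_ir, 3))
```

A value of 1.0 means perfectly balanced labels; larger values indicate stronger imbalance, so a selection strategy that favors instances carrying rare labels would be expected to push the value down.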

pdf abs
ReCAP: Semantic Role Enhanced Caption Generation
Abhidip Bhattacharyya | Martha Palmer | Christoffer Heckman

Even though current vision language (V+L) models have achieved success in generating image captions, they often lack specificity and overlook various aspects of the image. Additionally, the attention learned through weak supervision operates opaquely and is difficult to control. To address these limitations, we propose the use of semantic roles as control signals in caption generation. Our hypothesis is that, by incorporating semantic roles as signals, the generated captions can be guided to follow specific predicate argument structures. To validate the effectiveness of our approach, we conducted experiments using data and compared the results with a baseline model, VL-BART. The experiments showed a significant improvement, with a gain of 45% in Smatch score (a standard NLP evaluation metric for semantic representations), demonstrating the efficacy of our approach. By focusing on specific objects and their associated semantic roles instead of providing a general description, our framework produces captions that exhibit enhanced quality, diversity, and controllability.

pdf abs
Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations
Yi-Pei Chen | Noriki Nishida | Hideki Nakayama | Yuji Matsumoto

Enhancing user engagement through personalization in conversational agents has gained significance, especially with the advent of large language models that generate fluent responses. Personalized dialogue generation, however, is multifaceted and varies in its definition – ranging from instilling a persona in the agent to capturing users’ explicit and implicit cues. This paper seeks to systematically survey the recent landscape of personalized dialogue generation, including the datasets employed, methodologies developed, and evaluation metrics applied. Covering 22 datasets, we highlight benchmark datasets and newer ones enriched with additional features. We further analyze 17 seminal works from top conferences from 2021 to 2023 and identify five distinct types of problems. We also shed light on recent progress by LLMs in personalized dialogue generation. Our evaluation section offers a comprehensive summary of assessment facets and metrics utilized in these works. In conclusion, we discuss prevailing challenges and envision prospective directions for future research in personalized dialogue generation.

The integration of generative AI in education is expanding, yet empirical analyses of large-scale and real-world interactions between students and AI systems remain limited. Addressing this gap, we present RECIPE4U (RECIPE for University), a dataset sourced from a semester-long experiment with 212 college students in English as a Foreign Language (EFL) writing courses. During the study, students engaged in dialogues with ChatGPT to revise their essays. RECIPE4U includes comprehensive records of these interactions, including conversation logs, students’ intent, students’ self-rated satisfaction, and students’ essay edit histories. In particular, we annotate the students’ utterances in RECIPE4U with 13 intention labels based on our coding schemes. We establish baseline results for two subtasks in task-oriented dialogue systems within educational contexts: intent detection and satisfaction estimation. As a foundational step, we explore student-ChatGPT interaction patterns through RECIPE4U and analyze them by focusing on students’ dialogue, essay data statistics, and students’ essay edits. We further illustrate potential applications of the RECIPE4U dataset for enhancing the incorporation of LLMs in educational frameworks. RECIPE4U is publicly available at https://zeunie.github.io/RECIPE4U/.

pdf abs
Recognizing Social Cues in Crisis Situations
Di Wang | Yuan Zhuang | Ellen Riloff | Marina Kogan

During crisis situations, observations of other people’s behaviors often play an essential role in a person’s decision-making. For example, a person might evacuate before a hurricane only if everyone else in the neighborhood does so. Conversely, a person might stay if no one else is leaving. Such observations are called social cues. Social cues are important for understanding people’s response to crises, so recognizing them can help inform the decisions of government officials and emergency responders. In this paper, we propose the first NLP task to categorize social cues in social media posts during crisis situations. We introduce a manually annotated dataset of 6,000 tweets, labeled with respect to eight social cue categories. We also present experimental results of several classification models, which show that some types of social cues can be recognized reasonably well, but overall this task is challenging for NLP systems. We further present error analyses to identify specific types of mistakes and promising directions for future research on this task.

Understanding the implicit values and beliefs of diverse groups and cultures using qualitative texts – such as long-form narratives – and domain-expert interviews is a fundamental goal of social anthropology. This paper builds upon a 2022 study that introduced the NLP task of Recognizing Value Resonance (RVR) for gauging perspective – positive, negative, or neutral – on implicit values and beliefs in textual pairs. That study included a novel hand-annotated dataset, the World Values Corpus (WVC), designed to simulate the task of RVR, and a transformer-based model, Resonance-Tuned RoBERTa, designed to model the task. We extend existing work by refining the task definition and releasing the WVC dataset. We further conduct several validation experiments designed to robustly evaluate the need for task-specific modeling, even in the world of LLMs. Finally, we present two additional Resonance-Tuned models trained over extended RVR datasets, designed to improve RVR model versatility and robustness. Our results demonstrate that the Resonance-Tuned models outperform top-performing Recognizing Textual Entailment (RTE) models in recognizing value resonance, as well as zero-shot GPT-3.5 under several different prompt structures, emphasizing their practical applicability. Our findings highlight the potential of RVR in capturing cultural values within texts and the importance of task-specific modeling.

Citing comprehensively and appropriately has become a challenging task with the explosive growth of scientific publications. Current citation recommendation systems aim to recommend a list of scientific papers for a given text context or a draft paper. However, none of the existing work focuses on already included citations of full papers, which are imperfect and still have much room for improvement. In the scenario of peer reviewing, it is a common phenomenon that submissions are identified as missing vital citations by reviewers. This may lead to a negative impact on the credibility and validity of the research presented. To help improve citations of full papers, we first define a novel task of Recommending Missed Citations Identified by Reviewers (RMC) and construct a corresponding expert-labeled dataset called CitationR. We conduct an extensive evaluation of several state-of-the-art methods on CitationR. Furthermore, we propose a new framework RMCNet with an Attentive Reference Encoder module mining the relevance between papers, already-made citations, and missed citations. Empirical results prove that RMC is challenging, with the proposed architecture outperforming previous methods in all metrics. We release our dataset and benchmark models to motivate future research on this challenging new task.

pdf abs
Reconstruction of Cuneiform Literary Texts as Text Matching
Fabian Simonjetz | Jussi Laasonen | Yunus Cobanoglu | Alexander Fraser | Enrique Jiménez

Ancient Mesopotamian literature is riddled with gaps, caused by the decay and fragmentation of its writing material, clay tablets. The discovery of overlaps between fragments allows reconstruction to advance, but it is a slow and unsystematic process. Since new pieces are found and digitized constantly, NLP techniques can help to identify fragments and match them with existing text collections to restore complete literary works. We compare a number of approaches and determine that a character-level n-gram-based similarity matching approach works well for this problem, leading to a large speed-up for researchers in Assyriology.
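A minimal sketch of character-level n-gram similarity matching, the family of approach the abstract reports as working well. The trigram size, the cosine measure, the function names, and the toy transliterations are all assumptions for illustration; the abstract does not specify the authors' exact configuration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after collapsing whitespace."""
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(fragment: str, collection: Dict[str, str], n: int = 3) -> List[Tuple[str, float]]:
    """Rank known texts by n-gram similarity to a newly digitized fragment."""
    frag = char_ngrams(fragment, n)
    scored = [(name, cosine(frag, char_ngrams(text, n))) for name, text in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical transliterated lines, used only to exercise the functions.
collection = {"text_a": "sha naq-ba i-mu-ru", "text_b": "e-nu-ma e-lish"}
ranking = rank_candidates("naq-ba i-mu", collection)
```

Working at the character level rather than the word level makes the comparison tolerant of broken signs at fragment edges, since a partially preserved word still shares most of its n-grams with the intact version.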

pdf abs
Reduce Redundancy Then Rerank: Enhancing Code Summarization with a Novel Pipeline Framework
Xiaoyu Hu | Xu Zhang | Zexu Lin | Deyu Zhou

Code summarization is the task of automatically generating natural language descriptions from source code. Recently, pre-trained language models have gained significant popularity in code summarization due to their capacity to capture richer semantic representations of both code and natural language. Nonetheless, contemporary code summarization models grapple with two fundamental limitations. (1) Some tokens in the code are irrelevant to the natural language description and damage the alignment of the representation spaces for code and language. (2) Most approaches are based on the encoder-decoder framework, which is often plagued by the exposure bias problem, hampering the effectiveness of their decoding sampling strategies. To address the two challenges, we propose a novel pipeline framework named Reduce Redundancy then Rerank (Re^3). Specifically, a redundancy reduction component is introduced to eliminate redundant information in code representation space. Moreover, a re-ranking model is incorporated to select more suitable summary candidates, alleviating the exposure bias problem. The experimental results show the effectiveness of Re^3 over some state-of-the-art approaches across six different datasets from the CodeSearchNet benchmark.

pdf abs
Re-evaluating the Tomes for the Times
Ryan Brate | Marieke van Erp | Antal van den Bosch

Literature is to some degree a snapshot of the time it was written in and the societal attitudes of the time. Not all depictions are pleasant or in line with modern-day sensibilities; this becomes problematic when the prevalent depictions over a large body of work are negatively biased, leading to their normalisation. Many much-loved and much-read classics are set in periods of heightened social inequality: slavery, the era before women’s rights movements, colonialism, etc. In this paper, we exploit known text co-occurrence metrics with respect to token-level contexts to identify prevailing themes associated with known problematic descriptors. We see that prevalent, negative depictions are perpetuated by classic literature. We propose that such a methodology could form the basis of a system for making such problematic associations explicit for interested parties, such as sensitivity coordinators of publishing houses, library curators, or organisations concerned with social justice.

pdf abs
REFeREE: A REference-FREE Model-Based Metric for Text Simplification
Yichen Huang | Ekaterina Kochmar

Text simplification lacks a universal standard of quality, and annotated reference simplifications are scarce and costly. We propose to alleviate such limitations by introducing REFeREE, a reference-free model-based metric with a 3-stage curriculum. REFeREE leverages an arbitrarily scalable pretraining stage and can be applied to any quality standard as long as a small number of human annotations are available. Our experiments show that our metric outperforms existing reference-based metrics in predicting overall ratings and reaches competitive and consistent performance in predicting specific ratings while requiring no reference simplifications at inference time.

In this paper, we introduce the task of style-consistent content transfer, which concerns modifying a text’s content based on a provided reference statement while preserving its original style. We approach the task by employing multi-task learning to ensure that the modified text meets three important conditions: reference faithfulness, style adherence, and coherence. In particular, we train three independent classifiers for each condition. During inference, these classifiers are used to determine the best modified text variant. Our evaluation, conducted on hotel reviews and news articles, compares our approach with sequence-to-sequence and error correction baselines. The results demonstrate that our approach reasonably generates text satisfying all three conditions. In subsequent analyses, we highlight the strengths and limitations of our approach, providing valuable insights for future research directions.

Sensitising language models (LMs) to external context helps them to more effectively capture the speaking patterns of individuals with specific characteristics or in particular environments. This work investigates to what extent detailed character and film annotations can be leveraged to personalise LMs in a scalable manner. We then explore the use of such models in evaluating context specificity in machine translation. We build LMs which leverage rich contextual information to reduce perplexity by up to 6.5% compared to a non-contextual model, and generalise well to a scenario with no speaker-specific data, relying on combinations of demographic characteristics expressed via metadata. Our findings are consistent across two corpora, one of which (Cornell-rich) is also a contribution of this paper. We then use our personalised LMs to measure the co-occurrence of extra-textual context and translation hypotheses in a machine translation setting. Our results suggest that the degree to which professional translations in our domain are context-specific can be preserved to a better extent by a contextual machine translation model than a non-contextual model, which is also reflected in the contextual model’s superior reference-based scores.

pdf abs
Refining Idioms Semantics Comprehension via Contrastive Learning and Cross-Attention
Mingmin Wu | Guixin Su | Yongcheng Zhang | Zhongqiang Huang | Ying Sha

Chinese idioms on social media demand a nuanced understanding for correct usage. The Chinese idiom cloze test poses a unique challenge for machine reading comprehension due to the figurative meanings of idioms deviating from their literal interpretations, resulting in a semantic bias in models’ comprehension of idioms. Furthermore, given that the figurative meanings of many idioms are similar, their use as suboptimal options can interfere with optimal selection. Despite achieving some success in the Chinese idiom cloze test, existing methods based on deep learning still struggle to comprehensively grasp idiom semantics due to the aforementioned issues. To tackle these challenges, we introduce a Refining Idioms Semantics Comprehension Framework (RISCF) to capture the comprehensive idioms semantics. Specifically, we propose a semantic sense contrastive learning module to enhance the representation of idiom semantics, diminishing the semantic bias between figurative and literal meanings of idioms. Meanwhile, we propose an interference-resistant cross-attention module to attenuate the interference of suboptimal options, which considers the interaction between the candidate idioms and the blank space in the context. Experimental results on the benchmark datasets demonstrate the effectiveness of our RISCF model, which outperforms state-of-the-art methods significantly.

pdf abs
Refining rtMRI Landmark-Based Vocal Tract Contour Labels with FCN-Based Smoothing and Point-to-Curve Projection
Mushaffa Rasyid Ridha | Sakriani Sakti

Advanced real-time Magnetic Resonance Imaging (rtMRI) enables researchers to study dynamic articulatory movements during speech production with high temporal resolution. However, accurately outlining articulator contours in high-frame-rate rtMRI presents challenges due to data scalability and image quality issues, making manual and automatic labeling difficult. The widely used publicly available USC-TIMIT dataset offers rtMRI data with landmark-based contour labels derived from unsupervised region segmentation using spatial frequency domain representation and gradient descent optimization. Unfortunately, occasional labeling errors exist, and many contour detection methods were trained and tested based on this ground truth, which is not purely a gold label, with the resulting contour data largely remaining undisclosed to the public. This paper offers a refinement of landmark-based vocal-tract contour labels by employing outlier removal, fully convolutional network (FCN)-based smoothing, and a landmark point-to-edge curve projection technique. Since there is no established ground truth label, we evaluate the quality of the new labels through subjective assessments of several contour areas, comparing them to the existing data labels.

pdf abs
Reflecting the Male Gaze: Quantifying Female Objectification in 19th and 20th Century Novels
Kexin Luo | Yue Mao | Bei Zhang | Sophie Hao

Inspired by the concept of the male gaze (Mulvey, 1975) in literature and media studies, this paper proposes a framework for analyzing gender bias in terms of female objectification—the extent to which a text portrays female individuals as objects of visual pleasure. Our framework measures female objectification along two axes. First, we compute an agency bias score that indicates whether male entities are more likely to appear in the text as grammatical agents than female entities. Next, by analyzing the word embedding space induced by a text (Caliskan et al., 2017), we compute an appearance bias score that indicates whether female entities are more closely associated with appearance-related words than male entities. Applying our framework to 19th and 20th century novels reveals evidence of female objectification in literature: we find that novels written from a male perspective systematically objectify female characters, while novels written from a female perspective do not exhibit statistically significant objectification of any gender.
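The appearance bias idea can be illustrated with a simple embedding-association difference in the spirit of Caliskan et al. (2017). This is only a sketch under stated assumptions: the abstract does not give the exact formula, so the mean-cosine difference below, the function names, and the two-dimensional toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_association(entity_vecs: List[Vector], attribute_vecs: List[Vector]) -> float:
    """Average cosine similarity between every entity word and every attribute word."""
    sims = [cosine(e, a) for e in entity_vecs for a in attribute_vecs]
    return sum(sims) / len(sims)


def appearance_bias(female_vecs: List[Vector], male_vecs: List[Vector],
                    appearance_vecs: List[Vector]) -> float:
    """Positive when female terms sit closer to appearance words than male terms do."""
    return (mean_association(female_vecs, appearance_vecs)
            - mean_association(male_vecs, appearance_vecs))


# Toy 2-d vectors standing in for embeddings trained on a single novel.
score = appearance_bias([[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]])
```

In practice the vectors would come from an embedding space induced by the text under study, and a significance test over word sets would be needed before reading anything into the sign of the score.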

pdf abs
Reflections & Resonance: Two-Agent Partnership for Advancing LLM-based Story Annotation
Yuetian Chen | Mei Si

We introduce a novel multi-agent system for automating story annotation through the generation of tailored prompts for a large language model (LLM). This system utilizes two agents: Agent A is responsible for generating prompts that identify the key information necessary for reconstructing the story, while Agent B reconstructs the story from these annotations and provides feedback to refine the initial prompts. Human evaluations and perplexity scores revealed that optimized prompts significantly enhance the model’s narrative reconstruction accuracy and confidence, demonstrating that dynamic interaction between agents substantially boosts the annotation process’s precision and efficiency. Utilizing this innovative approach, we created the “StorySense” corpus, containing 615 stories, meticulously annotated to facilitate comprehensive story analysis. The paper also demonstrates the practical application of our annotated dataset by drawing the story arcs of two distinct stories, showcasing the utility of the annotated information in story structure analysis and understanding.

pdf abs
ReflectSumm: A Benchmark for Course Reflection Summarization
Yang Zhong | Mohamed Elaraby | Diane Litman | Ahmed Ashraf Butt | Muhsin Menekse

This paper introduces ReflectSumm, a novel summarization dataset specifically designed for summarizing students’ reflective writing. The goal of ReflectSumm is to facilitate developing and evaluating novel summarization techniques tailored to real-world scenarios with little training data, with potential implications in the opinion summarization domain in general and the educational domain in particular. The dataset encompasses a diverse range of summarization tasks and includes comprehensive metadata, enabling the exploration of various research questions and supporting different applications. To showcase its utility, we conducted extensive evaluations using multiple state-of-the-art baselines. The results provide benchmarks for facilitating further research in this area.

pdf abs
Reimagining Intent Prediction: Insights from Graph-Based Dialogue Modeling and Sentence Encoders
Daria Romanovna Ledneva | Denis Pavlovich Kuznetsov

This paper presents an innovative approach tailored to the specific characteristics of closed-domain dialogue systems. Leveraging scenario dialog graphs, our method effectively addresses the challenges posed by highly specialized fields, where context comprehension is of paramount importance. By modeling dialogues as sequences of transitions between intents, representing distinct goals or requests, our approach focuses on accurate intent prediction for generating contextually relevant responses. The study conducts a thorough evaluation, comparing the performance of state-of-the-art sentence encoders in conjunction with graph-based models across diverse datasets encompassing both open and closed domains. The results highlight the superiority of our methodology, offering fresh perspectives on the integration of advanced sentence encoders and graph models for precise and contextually driven intent prediction in dialogue systems. Additionally, the use of this approach enhances the transparency of generated output, enabling a deeper understanding of the reasoning behind system responses. This study significantly advances the field of dialogue systems, providing valuable insights into the effectiveness and potential limitations of the proposed approaches.

pdf abs
Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM
Xuan Zhang | Wei Gao

Retrieval-augmented language models have exhibited promising performance across various areas of natural language processing (NLP), including fact-critical tasks. However, due to the black-box nature of advanced large language models (LLMs) and the non-retrieval-oriented supervision signal of specific tasks, training the retrieval model faces significant challenges in the black-box LLM setting. We propose an approach leveraging Fine-grained Feedback with Reinforcement Retrieval (FFRR) to enhance fact-checking on news claims by using a black-box LLM. FFRR adopts a two-level strategy to gather fine-grained feedback from the LLM, which serves as a reward for optimizing the retrieval policy, by rating the retrieved documents based on the non-retrieval ground truth of the task. We evaluate our model on two public datasets for real-world news claim verification, and the results demonstrate that FFRR achieves significant improvements over strong LLM-enabled and non-LLM baselines.

In modern times, generative artificial intelligence is used in several industries and by many people. One use case that can be considered important but somewhat redundant is the act of searching for related work and other references to cite. As an avenue to better ascertain the value of citations and their corresponding locations, we focus on the common “related work” section as a target of experimentation, with the overall objective of generating the section. In this article, we present a corpus with 400k annotations that distinguish related work from the rest of the references. Additionally, we show that for the papers in our experiments, the related work section represents the paper just as well as, and in many cases better than, the rest of the references. We show that this is the case for more than 74% of the articles when using cosine similarity to measure the distance between two common graph neural network algorithms: Prone and Specter.

pdf abs
Relation between Cross-Genre and Cross-Topic Transfer in Dependency Parsing
Vera Danilova | Sara Stymne

Matching genre in training and test data has been shown to improve dependency parsing. However, it is not clear whether the used methods capture only the genre feature. We hypothesize that successful transfer may also depend on topic similarity. Using topic modelling, we assess whether cross-genre transfer in dependency parsing is stable with respect to topic distribution. We show that LAS scores in cross-genre transfer within and across treebanks typically align with topic distances. This indicates that topic is an important explanatory factor for genre transfer.

pdf abs
Relation Classification via Bidirectional Prompt Learning with Data Augmentation by Large Language Model
Yizhi Jiang | Jinlong Li | Huanhuan Chen

The Relation Extraction (RE) task aims to extract the relation between two entities in a sentence. As the performance of methods on the RE task depends on the quantity and quality of the datasets, in this paper we propose to use a Large Language Model (LLM) for data augmentation. Moreover, compared to traditional fine-tuning methods, more research now focuses on prompt learning. However, all of the existing prompt templates ignore the relative order of entities, which we believe affects the prediction error. Therefore, we propose novel bidirectional prompt templates for prompt learning and design a training strategy for utilizing the templates. Then we try to fit the probability distributions of both prompt learning and fine-tuning methods into our model. To this end, we propose Relation Classification via Bidirectional Prompt learning with data augmentation by LLM (RCBP) and conduct experiments on four datasets: TACRED, RETACRED, TACREV and SemEval. The results show that RCBP performs well on these datasets and outperforms the state of the art on the TACREV and RETACRED datasets.

AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models specialize in the English language, and thus, AI democratization in non-English-speaking communities is lagging significantly. To reduce this gap in AI access, we released Generative Pre-trained Transformer (GPT), Contrastive Language and Image Pre-training (CLIP), Stable Diffusion, and Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) pre-trained in Japanese. By providing these models, users can freely interface with AI that aligns with Japanese cultural values and ensures the identity of Japanese culture, thus enhancing the democratization of AI. Additionally, experiments showed that pre-trained models specialized for Japanese can efficiently achieve high performance in Japanese tasks.

pdf abs
Releasing the Capacity of GANs in Non-Autoregressive Image Captioning
Da Ren | Qing Li

Building Non-autoregressive (NAR) models in image captioning can fundamentally tackle the high inference latency of autoregressive models. However, existing NAR image captioning models are trained with maximum likelihood estimation, and suffer from their inherent multi-modality problem. Although constructing NAR models based on GANs can theoretically tackle this problem, existing GAN-based NAR models obtain poor performance when transferred to image captioning due to their inability to model complicated relations between images and text. To tackle this problem, we propose an Adversarial Non-autoregressive Transformer for Image Captioning (CaptionANT) by improving performance from two aspects: 1) modifying the model structure so as to be compatible with contrastive learning to effectively make use of unpaired samples; 2) integrating a reconstruction process to better utilize paired samples. By further combining with other effective techniques and our proposed lightweight structure, CaptionANT can better align input images and output text, and thus achieves new state-of-the-art performance for fully NAR models on the challenging MSCOCO dataset. More importantly, CaptionANT achieves a 26.72 times speedup compared to the autoregressive baseline with only 36.3% of the number of parameters of the existing best fully NAR model for image captioning.

Temporal knowledge graph completion is a critical task within the knowledge graph domain. Existing approaches encompass deep neural network-based methods for temporal knowledge graph embedding and rule-based logical symbolic reasoning. However, the former may not adequately account for structural dependencies between relations. Conversely, the latter relies heavily on strict logical rule reasoning and lacks robustness in the face of fuzzy or noisy data. In response to these challenges, we present RENN, a groundbreaking framework that enhances temporal knowledge graph completion through rule embedding. RENN employs a three-step approach. First, it utilizes temporal random walks to extract temporal logic rules. Then, it pre-trains by learning embeddings for each logical rule and its associated relations, thereby enhancing the likelihood of existing quadruples and logical rules. Finally, it incorporates the embeddings of logical rules into the deep neural network. Our methodology has been validated through experiments conducted on various temporal knowledge graph models and datasets, consistently demonstrating its effectiveness and potential in improving temporal knowledge graph completion.

Patients cannot always completely understand medical documents given the myriad of technical terms they contain. Automatic text simplification techniques can help, but they must guarantee that the content is conveyed rigorously without introducing wrong information. In this work, we tested: 1) lexicon-based simplification approaches, using a Spanish lexicon of technical and laymen terms collected for this task (SimpMedLexSp); 2) deep-learning (DL) based methods, with BART-based and prompt-learning-based models; and 3) a combination of both techniques. As a test set, we used 5000 parallel (technical and laymen) sentence pairs: 3800 manually aligned sentences from the CLARA-MeD corpus; and 1200 sentences from clinical trials simplified by linguists. We conducted a quantitative evaluation with standard measures (BLEU, ROUGE and SARI) and a human evaluation, in which eleven subjects scored the simplification output of several methods. In our experiments, the lexicon improved the quantitative results when combined with the DL models. The sentences simplified using only the lexicon received the highest scores for semantic adequacy; however, their fluency needs to be improved. The prompt-based method had similar ratings in this aspect and in simplification. We make available the models and the data to reproduce our results.

pdf abs
Representation Degeneration Problem in Prompt-based Models for Natural Language Understanding
Qingyan Zhao | Ruifang He | Jinpeng Zhang | Chang Liu | Bo Wang

Prompt-based fine-tuning (PF), by aligning with the training objective of pre-trained language models (PLMs), has shown improved performance on many few-shot natural language understanding (NLU) benchmarks. However, the word embedding space of PLMs exhibits anisotropy, which is called the representation degeneration problem. In this paper, we explore the self-similarity within the same context and identify the anisotropy of the feature embedding space in PF models. Given that the performance of PF models depends on feature embeddings, we hypothesize that this anisotropy limits the performance of PF models. Based on our experimental findings, we propose CLMA, a Contrastive Learning framework based on the [MASK] token and Answers, to alleviate the anisotropy in the embedding space. Combined with our proposed counter-intuitive SSD, a Supervised Signal based on embedding Distance, our approach outperforms mainstream methods on many NLU benchmarks in few-shot experimental settings. In subsequent experiments, we analyze the capability of our method to capture deep semantic cues and the impact of the anisotropy in the feature embedding space on the performance of the PF model.

Representing Compounding with OntoLex. An Evaluation of Vocabularies for Word Formation Resources
Elena Benzoni | Matteo Pellegrini | Francesco Dedè | Marco Passarotti

This paper explores how compounds are represented in resources documenting word formation, and proposes ways to convert them into Linked Open Data using the OntoLex model. The ultimate purpose is to offer a broad empirical evaluation of which of the two OntoLex modules allowing for the representation of compounds – Decomp and Morph – fits best the different formats and theoretical approaches of the resources we examine. We show that the vocabulary of Decomp alone is rarely sufficient to account for all relevant facts; in almost all cases, it is necessary to resort to the vocabulary of Morph, either to reify the relation between compounds and their constituents or to represent specifically morphological information or other aspects. Special attention is devoted to the format of the Universal Derivations project: the modelling strategy that we propose can be applied to all resources harmonized in that format, potentially allowing for the conversion into Linked Open Data of a large amount of structured data.

Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems
Songbo Hu | Ivan Vulić | Fangyu Liu | Anna Korhonen

End-to-end task-oriented dialogue systems are prone to fall into the so-called ‘likelihood trap’, resulting in generated responses which are dull, repetitive, and often inconsistent with dialogue history. Comparing ranked lists of multiple generated responses against the ‘gold response’ reveals a wide diversity in quality, with many good responses placed lower in the ranked list. The main challenge addressed in this work is how to reach beyond greedily generated system responses, that is, how to obtain and select high-quality responses from the list of overgenerated responses at inference without the availability of the gold response. To this end, we propose a simple yet effective reranking method to select high-quality items from the lists of initially overgenerated responses. The idea is to use any sequence-level scoring function to divide the semantic space of responses into high-scoring versus low-scoring partitions. At training, the high-scoring partition comprises all generated responses whose similarity to the gold response is higher than the similarity of the greedy response to the gold response. At inference, the aim is to estimate the probability that each overgenerated response belongs to the high-scoring partition. We evaluate our proposed method on the standard MultiWOZ dataset, the BiTOD dataset, and with human evaluation.
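
The training-time partition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract allows any sequence-level scoring function, and the token-overlap F1 used here, together with all function names and example strings, is an assumption chosen to keep the example self-contained.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-overlap F1, standing in for 'any sequence-level scoring function'."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def label_high_scoring(candidates, greedy, gold, score=token_f1):
    """Mark a candidate as high-scoring if it is closer to the gold
    response than the greedy response is."""
    threshold = score(greedy, gold)
    return [(c, score(c, gold) > threshold) for c in candidates]


# Hypothetical dialogue responses.
gold = "the hotel is in the north and costs 50 pounds"
greedy = "there is a hotel"
candidates = ["the hotel is in the north", "sorry i do not know"]

for text, is_high in label_high_scoring(candidates, greedy, gold):
    print(is_high, text)
```

Labels produced this way could serve as training targets for a classifier that, at inference, estimates the probability that a candidate belongs to the high-scoring partition without access to the gold response.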

Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents
Ramona Christen | Anastassia Shaitarova | Matthias Stürmer | Joel Niklaus

Resolving the scope of a negation within a sentence is a challenging NLP task. The complexity of legal texts and the lack of annotated in-domain negation corpora pose challenges for state-of-the-art (SotA) models when performing negation scope resolution on multilingual legal data. Our experiments demonstrate that models pre-trained without legal data underperform in the task of negation scope resolution. We release a new set of annotated court decisions in German, French, and Italian and use it to improve negation scope resolution in both zero-shot and multilingual settings. We achieve token-level F1-scores of up to 86.7% in our zero-shot cross-lingual experiments, where the models are trained on two languages of our legal datasets and evaluated on the third. Our multilingual experiments, where the models were trained on all available negation data and evaluated on our legal datasets, resulted in F1-scores of up to 91.1%.
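
For readers unfamiliar with the metric, a token-level F1 for scope resolution can be computed as in the sketch below, where each token carries a binary label (1 if inside the negation scope). The exact evaluation protocol of the paper may differ; the function name and the label sequences are assumptions made for illustration.

```python
def scope_token_f1(gold, predicted):
    """F1 over tokens labelled as inside (1) or outside (0) the negation scope."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # in scope, predicted in scope
    fp = sum(1 for g, p in pairs if not g and p)    # out of scope, predicted in scope
    fn = sum(1 for g, p in pairs if g and not p)    # in scope, missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical six-token sentence: the prediction is shifted by one token.
gold_labels = [0, 1, 1, 1, 0, 0]
pred_labels = [0, 0, 1, 1, 1, 0]
print(scope_token_f1(gold_labels, pred_labels))
```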

Restoring Ancient Ideograph: A Multimodal Multitask Neural Network Approach
Siyu Duan | Jun Wang | Qi Su

Cultural heritage serves as the enduring record of human thought and history. Despite significant efforts dedicated to the preservation of cultural relics, many ancient artefacts have been ravaged irreversibly by natural deterioration and human actions. Deep learning technology has emerged as a valuable tool for restoring various kinds of cultural heritage, including ancient texts. Previous research has approached ancient text restoration from either visual or textual perspectives, often overlooking the potential of synergizing multimodal information. This paper proposes a novel Multimodal Multitask Restoring Model (MMRM) to restore ancient texts, particularly emphasising the ideograph. This model combines context understanding with residual visual information from damaged ancient artefacts, enabling it to predict damaged characters and generate restored images simultaneously. We tested the MMRM model through experiments conducted on both simulated datasets and authentic ancient inscriptions. The results show that the proposed method gives insightful restoration suggestions in both simulation experiments and real-world scenarios. To the best of our knowledge, this work represents the pioneering application of multimodal deep learning in ancient text restoration, which will contribute to the understanding of ancient society and culture in digital humanities fields.

Memory is one of the most essential cognitive functions, serving as a repository of world knowledge and episodes of activities. In recent years, large-scale pre-trained language models have shown remarkable memorizing ability. In contrast, vanilla neural networks without pre-training have long been observed to suffer from the catastrophic forgetting problem. To investigate this retentive-forgetful contradiction and understand the memorizing dynamic mechanism of language models, we conduct thorough experiments by controlling the target knowledge types, the learning strategies and the learning schedules. We find that: 1) Vanilla language models without pre-training are forgetful; 2) Pre-training leads to retentive language models; 3) Knowledge relevance and diversification significantly influence the memory formation. These conclusions are useful for understanding the abilities of pre-trained language models and shed light on designing and evaluating new learning and inference algorithms of language models.

Rethinking Word-level Adversarial Attack: The Trade-off between Efficiency, Effectiveness, and Imperceptibility
Pengwei Zhan | Jing Yang | He Wang | Chao Zheng | Liming Wang

Neural language models have demonstrated impressive performance in various tasks but remain vulnerable to word-level adversarial attacks. Word-level adversarial attacks can be formulated as a combinatorial optimization problem, and thus, an attack method can be decomposed into search space and search method. Despite the significance of these two components, previous works inadequately distinguish them, which may lead to unfair comparisons and insufficient evaluations. In this paper, to address the inappropriate practices in previous works, we perform thorough ablation studies on the search space, illustrating the substantial influence of search space on attack efficiency, effectiveness, and imperceptibility. Based on the ablation study, we propose two standardized search spaces: the Search Space for ImPerceptibility (SSIP) and Search Space for EffecTiveness (SSET). The reevaluation of eight previous attack methods demonstrates the success of SSIP and SSET in achieving better trade-offs between efficiency, effectiveness, and imperceptibility in different scenarios, offering fair and comprehensive evaluations of previous attack methods and providing potential guidance for future works.

Retrieval-Augmented Modular Prompt Tuning for Low-Resource Data-to-Text Generation
Ruitao Feng | Xudong Hong | Mayank Jobanputra | Mattes Warning | Vera Demberg

Data-to-text (D2T) generation describes the task of verbalizing data, often given as attribute-value pairs. While this task is relevant for many different data domains beyond the traditionally well-explored tasks of weather forecasting, restaurant recommendations, and sports reporting, a major challenge to the applicability of data-to-text generation methods is typically data sparsity. For many applications, there is extremely little training data in terms of attribute-value inputs and target language outputs available for training a model. Given the sparse data setting, recently developed prompting methods seem most suitable for addressing D2T tasks since they do not require substantial amounts of training data, unlike finetuning approaches. However, prompt-based approaches are also challenging, as a) the design and search of prompts are non-trivial; and b) hallucination problems may occur because of the strong inductive bias of these models. In this paper, we propose a retrieval-augmented modular prompt tuning method, which constructs prompts that fit the input data closely, thereby bridging the domain gap between the large-scale language model and the structured input data. Experiments show that our method generates texts with few hallucinations and achieves state-of-the-art performance on a dataset for drone handover message generation.

Retrieval-based Question Answering with Passage Expansion Using a Knowledge Graph
Benno Kruit | Yiming Xu | Jan-Christoph Kalo

Recent advancements in dense neural retrievers and language models have led to large improvements in state-of-the-art approaches to open-domain Question Answering (QA) based on retriever-reader architectures. However, issues stemming from data quality and imbalances in the use of dense embeddings have hindered performance, particularly for less common entities and facts. To tackle these problems, this study explores a multi-modal passage retrieval model’s potential to bolster QA system performance. This study poses three key questions: (1) Can a distantly supervised question-relation extraction model enhance retrieval using a knowledge graph (KG), compensating for dense neural retrievers’ shortcomings with rare entities? (2) How does this multi-modal approach compare to existing QA systems based on textual features? (3) Can this QA system alleviate poor performance on less common entities on common benchmarks? We devise a multi-modal retriever combining entity features and textual data, leading to improved retrieval precision in some situations, particularly for less common entities. Experiments across different datasets confirm enhanced performance for entity-centric questions, but challenges remain in handling complex generalized questions.

Revisiting Context Choices for Context-aware Machine Translation
Matiss Rikters | Toshiaki Nakazawa

One of the most popular methods for context-aware machine translation (MT) is to use separate encoders for the source sentence and context as multiple sources for one target sentence. Recent work has cast doubt on whether these models actually learn useful signals from the context or whether improvements in automatic evaluation metrics are just a side effect. We show that multi-source transformer models improve MT over standard transformer-base models even with empty lines provided as context, but the translation quality improves significantly (1.51 - 2.65 BLEU) when a sufficient amount of correct context is provided. We also show that even though randomly shuffling in-domain context can also improve over baselines, the correct context further improves translation quality and random out-of-domain context further degrades it.

With the growing privacy concerns surrounding natural language understanding (NLU) applications, the need to train high-quality models while safeguarding data privacy has reached unprecedented importance. Federated learning (FL) offers a promising approach to collaborative model training by exchanging model gradients. However, many studies show that eavesdroppers in FL could develop sophisticated data reconstruction attacks (DRA) to accurately reconstruct clients’ data from the shared gradients. Regrettably, current DRA methods in federated NLU have been mostly evaluated on public datasets, lacking a comprehensive evaluation on real-world privacy datasets. To address this limitation, this paper presents a pioneering study that reexamines the performance of these DRA methods as well as corresponding defense methods. Specifically, we introduce a novel real-world privacy dataset called FedAttack, which leads to a significant discovery: existing DRA methods usually fail to accurately recover the original text of real-world privacy data. In detail, the tokens within a recovered sentence are disordered and intertwined with tokens from other sentences in the same training batch. Moreover, our experiments demonstrate that the performance of DRA is also influenced by different languages and domains. With these findings, our work lays a solid foundation for further research into the development of more practical DRA methods and corresponding defenses.

Revisiting the Classics: A Study on Identifying and Rectifying Gender Stereotypes in Rhymes and Poems
Aditya Narayan Sankaran | Vigneshwaran Shankaran | Sampath Lonka | Rajesh Sharma

Rhymes and poems are a powerful medium for transmitting cultural norms and societal roles. However, the pervasive existence of gender stereotypes in these works perpetuates biased perceptions and limits the scope of individuals’ identities. Past works have shown that stereotyping and prejudice emerge in early childhood, and developmental research on causal mechanisms is critical for understanding and controlling stereotyping and prejudice. This work contributes by gathering a dataset of rhymes and poems to identify gender stereotypes and by proposing a model with 97% accuracy to identify gender bias. Gender stereotypes were rectified using a Large Language Model (LLM) and its effectiveness was evaluated in a comparative survey against human educator rectifications. To summarize, this work highlights the pervasive nature of gender stereotypes in literary works and reveals the potential of LLMs to rectify gender stereotypes. This study raises awareness and promotes inclusivity within artistic expressions, making a significant contribution to the discourse on gender equality.

Revisiting the Self-Consistency Challenges in Multi-Choice Question Formats for Large Language Model Evaluation
Wenjie Zhou | Qiang Wang | Mingzhou Xu | Ming Chen | Xiangyu Duan

Multi-choice questions (MCQ) are a common method for assessing the world knowledge of large language models (LLMs), demonstrated by benchmarks such as MMLU and C-Eval. However, recent findings indicate that even top-tier LLMs, such as ChatGPT and GPT4, might display inconsistencies when faced with slightly varied inputs. This raises concerns about the credibility of MCQ-based evaluations. To address this issue, we introduced three knowledge-equivalent question variants: option position shuffle, option label replacement, and conversion to a True/False format. We rigorously tested a range of LLMs, varying in model size (from 6B to 70B) and types—pretrained language model (PLM), supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). Our findings from MMLU and C-Eval revealed that accuracy for individual questions lacks robustness, particularly in smaller models (<30B) and PLMs. Consequently, we advocate that consistent accuracy may serve as a more reliable metric for evaluating and ranking LLMs.
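
Two of the knowledge-equivalent variants and a consistency-aware score can be sketched as below. The abstract does not define "consistent accuracy" formally; this sketch assumes a question counts as correct only when the model answers every variant of it correctly. All function names, the seed, and the example data are hypothetical.

```python
import random


def shuffle_options(options, answer_index, seed=0):
    """Option position shuffle: reorder options and track where the answer moved."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order.index(answer_index)


def relabel(options, labels):
    """Option label replacement: attach a different label set to the same options."""
    return [f"{label}. {text}" for label, text in zip(labels, options)]


def consistent_accuracy(results):
    """Share of questions answered correctly under every variant.
    `results` holds one list of booleans per question, one entry per variant."""
    return sum(all(per_variant) for per_variant in results) / len(results)


options = ["Paris", "Rome", "Berlin", "Madrid"]
shuffled, new_answer = shuffle_options(options, answer_index=0)
print(relabel(shuffled, ["1", "2", "3", "4"]), "answer:", shuffled[new_answer])

# Per-question correctness across three variants (made-up values).
results = [[True, True, True], [True, False, True],
           [False, False, False], [True, True, True]]
print(consistent_accuracy(results))  # 0.5, while plain accuracy per variant is higher
```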

Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
Christina Tånnander | Jens Edlund | Joakim Gustafson

In order to investigate the strengths and weaknesses of Audience Response Systems (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirm that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses in ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluation.

Rewiring the Transformer with Depth-Wise LSTMs
Hongfei Xu | Yang Song | Qiuhui Liu | Josef van Genabith | Deyi Xiong

Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may make the model “forget” distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that layer normalization and feed-forward computation within a Transformer layer can be absorbed into depth-wise LSTMs connecting pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements in both WMT 14 English-German / French tasks and the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTM on the convergence and performance of deep Transformers.

RISE: Robust Early-exiting Internal Classifiers for Suicide Risk Evaluation
Ritesh Singh Soun | Atula Tejaswi Neerkaje | Ramit Sawhney | Nikolaos Aletras | Preslav Nakov

Suicide is a serious public health issue, but it is preventable with timely intervention. Emerging studies have suggested there is a noticeable increase in the number of individuals sharing suicidal thoughts online. As a result, utilising advanced Natural Language Processing techniques to build automated systems for risk assessment is a viable alternative. However, existing systems are prone to incorrectly predicting risk severity and have no early detection mechanisms. Therefore, we propose RISE, a novel robust mechanism for accurate early detection of suicide risk by ensembling Hyperbolic Internal Classifiers equipped with an abstention mechanism and early-exit inference capabilities. Through quantitative, qualitative and ablative experiments, we demonstrate RISE as an efficient and robust human-in-the-loop approach for risk assessment over the Columbia Suicide Severity Risk Scale (C-SSRS) and CLPsych 2022 datasets. It is able to successfully abstain from 84% of incorrect predictions on Reddit data while out-predicting state-of-the-art models up to 3.5x earlier.
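
The general idea of early-exit inference with abstention can be sketched as follows. This is a generic confidence-threshold scheme, not RISE itself: it does not model hyperbolic classifiers or ensembling, and the function name, thresholds, and probabilities are assumptions made for illustration.

```python
def early_exit_predict(layer_probs, exit_threshold=0.9, abstain_threshold=0.6):
    """Return (predicted_class, exit_depth).

    `layer_probs` holds one class-probability list per internal classifier,
    ordered from the shallowest layer to the deepest. The prediction is None
    when the model abstains and defers to a human reviewer.
    """
    for depth, probs in enumerate(layer_probs, start=1):
        confidence = max(probs)
        if confidence >= exit_threshold:
            # Confident enough: stop here and skip the remaining layers.
            return probs.index(confidence), depth
    # No layer was confident enough to exit early: use the last classifier,
    # but abstain if even that one is too uncertain.
    final = layer_probs[-1]
    confidence = max(final)
    if confidence < abstain_threshold:
        return None, len(layer_probs)
    return final.index(confidence), len(layer_probs)


print(early_exit_predict([[0.5, 0.5], [0.05, 0.95], [0.0, 1.0]]))  # exits at layer 2
print(early_exit_predict([[0.5, 0.5], [0.55, 0.45]]))              # abstains
```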

RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian
Krenare Pireva Nuci | Paul Landes | Barbara Di Eugenio

The education domain has been a popular area of collaboration with NLP researchers for decades. However, many recent breakthroughs, such as large transformer-based language models, have provided new opportunities for solving interesting but difficult problems. One such problem is assigning sentiment to reviews of educators’ performance. We present EduSenti: a corpus of 1,163 Albanian and 624 English reviews of educational instructors’ performance, annotated for sentiment, emotion and educational topic. In this work, we experiment with fine-tuning several language models on the EduSenti corpus and then compare with an Albanian masked language trained model from the last XLM-RoBERTa checkpoint. We show promising baseline results, which include an F1 of 71.9 in Albanian and 73.8 in English. Our contributions are: (i) a sentiment analysis corpus in Albanian and English, (ii) a large Albanian corpus of crawled data useful for unsupervised training of language models, and (iii) the source code for our experiments.

In this paper, we introduce a new far-field speaker recognition benchmark called RoboVox. RoboVox is a French corpus recorded by a mobile robot. The files are recorded from different distances under severe acoustical conditions with the presence of several types of noise and reverberation. In addition to noise and reverberation, the robot’s internal noise acts as an extra additive noise. RoboVox can be used for both single-channel and multi-channel speaker recognition. In the evaluation protocols, we consider both cases. The obtained results demonstrate a significant decline in performance in far-field speaker recognition and urge the community to conduct further research in this domain.

Large language models (LLMs) can make predictions using *parametric knowledge* – knowledge encoded in the model weights – or *contextual knowledge* – knowledge presented in the context. In many scenarios, a desirable behavior is that LLMs give precedence to contextual knowledge when it conflicts with the parametric knowledge, and fall back to using their parametric knowledge when the context is irrelevant. This enables updating and correcting the model’s knowledge by in-context editing instead of retraining. Previous works have shown that LLMs are inclined to ignore contextual knowledge and fail to reliably fall back to parametric knowledge when presented with irrelevant context. In this work, we discover that, with proper prompting methods, instruction-finetuned LLMs can be highly controllable by contextual knowledge and robust to irrelevant context. Utilizing this feature, we propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing. To better evaluate the robustness of model editors, we collect a new dataset that contains irrelevant questions that are more challenging than the ones in existing datasets. Empirical results show that our method outperforms current state-of-the-art methods by a large margin. Unlike existing techniques, it can integrate knowledge from multiple edits, and correctly respond to syntactically similar but semantically unrelated inputs (and vice versa). The source code can be found at https://github.com/thunlp/EREN.

RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian
Adrian Cosma | Ioan-Bogdan Iordache | Paolo Rosso

Recently, large language models (LLMs) have become increasingly powerful and have become capable of solving a plethora of tasks through proper instructions in natural language. However, the vast majority of testing suites assume that the instructions are written in English, the de facto prompting language. Code intelligence and problem solving still remain a difficult task, even for the most advanced LLMs. Currently, there are no datasets to measure the generalization power for code-generation models in a language other than English. In this work, we present RoCode, a competitive programming dataset, consisting of 2,642 problems written in Romanian, 11k solutions in C, C++ and Python and comprehensive testing suites for each problem. The purpose of RoCode is to provide a benchmark for evaluating the code intelligence of language models trained on Romanian / multilingual text as well as a fine-tuning set for pretrained Romanian models. Through our results and review of related works, we argue for the need to develop code models for languages other than English.

Large Language Models (LLMs) have showcased remarkable capabilities in following human instructions. However, recent studies have raised concerns about the robustness of LLMs for natural language understanding (NLU) tasks when prompted with instructions combining textual adversarial samples. In this paper, drawing inspiration from recent works showing that LLMs are sensitive to the design of the instructions, we utilize instructions in code style, which are more structural and less ambiguous, to replace typical natural language instructions. Through this conversion, we provide LLMs with more precise instructions and strengthen the robustness of LLMs. Moreover, under few-shot scenarios, we propose a novel method to compose in-context demonstrations using both clean and adversarial samples (adversarial context method) to further boost the robustness of the LLMs. Experiments on eight robustness datasets show that our method consistently outperforms prompting LLMs with natural language. For example, with gpt-3.5-turbo, our method achieves on average an improvement of 5.68% in test set accuracy and a reduction of 5.66 points in Attack Success Rate (ASR).

RT-VQ2A2: Real Time Vector Quantized Question Answering with ASR
Kyungho Kim | Seongmin Park | Jihwa Lee

In Spoken Question Answering (SQA), automatic speech recognition (ASR) outputs are often relayed to language models for QA. However, constructing such a cascaded framework with large language models (LLMs) in a real-time SQA setting involves realistic challenges, such as noise in the ASR output, the limited context length of LLMs, and latency in processing large models. This paper proposes a novel model-agnostic framework, RT-VQ2A2, to address these challenges. RT-VQ2A2 consists of three steps: codebook preparation, quantized semantic vector extractor, and dual segment selector. We construct a codebook from clustering, removing outliers on a text corpus derived from ASR to mitigate the influence of ASR error. Extracting quantized semantic vectors through a pre-built codebook shows significant speed and performance improvements in relevant context retrieval. Dual segment selector considers both semantic and lexical aspects to deal with ASR error. The efficacy of RT-VQ2A2 is validated on the widely used Spoken-SQuAD dataset.

Fact-checking is the task of verifying the factuality of a given claim by examining the available evidence. High-quality evidence plays a vital role in enhancing fact-checking systems and facilitating the generation of explanations that are understandable to humans. However, the provision of both sufficient and relevant evidence for explainable fact-checking systems poses a challenge. To tackle this challenge, we propose a method based on a Large Language Model to automatically retrieve and summarize evidence from the Web. Furthermore, we construct RU22Fact, a novel multilingual explainable fact-checking dataset on the Russia-Ukraine conflict in 2022 of 16K samples, each containing real-world claims, optimized evidence, and referenced explanation. To establish a baseline for our dataset, we also develop an end-to-end explainable fact-checking system to verify claims and generate explanations. Experimental results demonstrate the prospect of optimized evidence in increasing fact-checking performance and also indicate the possibility of further progress in the end-to-end claim verification and explanation generation tasks.

RuBia: A Russian Language Bias Detection Dataset
Veronika Grigoreva | Anastasiia Ivanova | Ilseyar Alimova | Ekaterina Artemova

Warning: this work contains upsetting or disturbing content. Large language models (LLMs) tend to learn the social and cultural biases present in the raw pre-training data. To test if an LLM’s behavior is fair, functional datasets are employed, and due to their purpose, these datasets are highly language and culture-specific. In this paper, we address a gap in the scope of multilingual bias evaluation by presenting a bias detection dataset specifically designed for the Russian language, dubbed RuBia. The RuBia dataset is divided into 4 domains: gender, nationality, socio-economic status, and diverse; each domain is further divided into multiple fine-grained subdomains. Every example in the dataset consists of two sentences, with the first reinforcing a potentially harmful stereotype or trope and the second contradicting it. These sentence pairs were first written by volunteers and then validated by native-speaking crowdsourcing workers. Overall, there are nearly 2,000 unique sentence pairs spread over 19 subdomains in RuBia. To illustrate the dataset’s purpose, we conduct a diagnostic evaluation of state-of-the-art or near-state-of-the-art LLMs and discuss the LLMs’ predisposition to social biases.

Russian Learner Corpus (RLC) is a large collection of learner texts in Russian written by native speakers of over forty languages. Learner errors in part of the corpus are manually corrected and annotated. Diverging from conventional error classifications, which typically focus on isolated lexical and grammatical features, the RLC error classification intends to highlight learners’ strategies employed in the process of text production, such as derivational patterns and syntactic relations (including agreement and government). In this paper, we present two open datasets derived from RLC: a manually annotated full-text dataset and a dataset with crowdsourced corrections for individual sentences. In addition, we introduce an automatic error annotation tool that, given an original sentence and its correction, locates and labels errors according to a simplified version of the RLC error-type system. We evaluate the performance of the tool on manually annotated data from RLC.

S3Prompt: Instructing the Model with Self-calibration, Self-recall and Self-aggregation to Improve In-context Learning
Junda Chen | Jianting Liu

Large language models achieve impressive results by inferring conditional probability distributions in the context of user input to generate responses. However, they still have the following limitations in practical applications: 1) User queries are often colloquial and do not conform to the conditional probability distribution of the LLM. 2) Unsupervised generation and recall of in-context examples (compared to random sampling) remains an open problem. To alleviate the above problems, we propose a novel Self-calibration, Self-recall and Self-aggregation prompt pipeline (S3Prompt). Specifically, we first design a question calibration prompt to align colloquial queries with the LLM context. Secondly, we construct a candidate recall prompt that allows the LLM to generate potential background information, which is different from traditional retrieval-based QA. Finally, we design an information aggregation prompt to generate the final answer by aggregating the recalled information. Notably, we find that the information self-generated by the LLM has a smaller gap when fused with the LLM. We conducted comprehensive experiments on various datasets, including numerical reasoning, common sense reasoning, logical reasoning, and reading comprehension. The results showed that the performance of the LLM can be significantly improved by using question calibration, candidate recall, and information aggregation, without requiring annotated datasets and model parameter updates.

SaGE: Evaluating Moral Consistency in Large Language Models
Vamshi Krishna Bonagiri | Sreeram Vennam | Priyanshul Govil | Ponnurangam Kumaraguru | Manas Gaur

Despite recent advancements showcasing the impressive capabilities of Large Language Models (LLMs) in conversational systems, we show that even state-of-the-art LLMs are morally inconsistent in their generations, questioning their reliability (and trustworthiness in general). Prior works in LLM evaluation focus on developing ground-truth data to measure accuracy on specific tasks. However, for moral scenarios that often lack universally agreed-upon answers, consistency in model responses becomes crucial for their reliability. To address this issue, we propose an information-theoretic measure called Semantic Graph Entropy (SaGE), grounded in the concept of “Rules of Thumb” (RoTs) to measure a model’s moral consistency. RoTs are abstract principles learned by a model and can help explain its decision-making strategies effectively. To this end, we construct the Moral Consistency Corpus (MCC), containing 50K moral questions, responses to them by LLMs, and the RoTs that these models followed. Furthermore, to illustrate the generalizability of SaGE, we use it to investigate LLM consistency on two popular datasets – TruthfulQA and HellaSwag. Our results reveal that task accuracy and consistency are independent problems, and there is a dire need to investigate these issues further.

Predicting price variations of financial instruments for risk modeling and stock trading is challenging due to the stochastic nature of the stock market. While recent advancements in the Financial AI realm have expanded the scope of data and methods they use, such as textual and audio cues from financial earnings calls, limitations exist. Most datasets are small and show domain distribution shifts due to the nature of their source, which motivates exploring robust data augmentation strategies such as Mixup. To tackle such challenges in the financial domain, we propose SH-Mix: Saliency-guided Hierarchical Mixup augmentation technique for multimodal financial prediction tasks. SH-Mix combines multi-level embedding mixup strategies based on the contribution of each modality and context subsequences. Through extensive quantitative and qualitative experiments on financial earnings and conference call datasets consisting of text and speech, we show that SH-Mix outperforms state-of-the-art methods by 3-7%. Additionally, we show that SH-Mix is generalizable across different modalities and models.

We release Saamayik, a dataset of around 53,000 parallel English-Sanskrit sentences, written in contemporary prose. Sanskrit is a classical language that is still in use and has a rich documented heritage. However, due to the limited availability of digitized content, it remains a low-resource language. Existing Sanskrit corpora, whether monolingual or bilingual, have predominantly focused on poetry and offer limited coverage of contemporary written materials. Saamayik is curated from a diverse range of domains, including language instruction material, textual teaching pedagogy, and online tutorials, among others. It stands out as a unique resource that specifically caters to the contemporary usage of Sanskrit, with a primary emphasis on prose writing. Translation models trained on our dataset demonstrate statistically significant improvements when translating out-of-domain contemporary corpora, outperforming models trained on older classical-era poetry datasets. Finally, we also release benchmark models by adapting four multilingual pre-trained models for translating between English and Sanskrit: three of them have not previously been exposed to Sanskrit, while one is a multilingual pre-trained translation model that includes English and Sanskrit. The dataset and source code can be found at https://github.com/ayushbits/saamayik.

Samrómur Milljón: An ASR Corpus of One Million Verified Read Prompts in Icelandic
Carlos Daniel Hernandez Mena | Þorsteinn Daði Gunnarsson | Jon Gudnason

The platform samromur.is, or “Samrómur” for short, is a crowdsourcing web application built on Mozilla’s Common Voice, designed to accumulate speech data for the advancement of language technologies in Icelandic. Over the years, Samrómur has proven to be remarkably successful in amassing a significant number of high-quality audio clips from thousands of users. However, the challenge of manually verifying the entirety of the collected data has hindered its effective exploitation, especially in the realm of Automatic Speech Recognition (ASR), its original purpose. In this paper, we introduce the “Samrómur Milljón” corpus, an ASR dataset comprising one million audio clips from Samrómur. These clips have been automatically verified using state-of-the-art speech recognition systems such as NeMo, Wav2Vec2, and Whisper. Additionally, we present the ASR results obtained from creating acoustic models based on Samrómur Milljón. These results demonstrate significant promise when compared to other acoustic models trained with a similar volume of Icelandic data from different sources.

Sarcasm Detection in a Disaster Context
Tiberiu Sosea | Junyi Jessy Li | Cornelia Caragea

During natural disasters, people often use social media platforms such as Twitter to ask for help, to provide information about the disaster situation, or to express contempt about the unfolding event or public policies and guidelines. This contempt is in some cases expressed as sarcasm or irony. Understanding this form of speech in a disaster-centric context is essential to improving natural language understanding of disaster-related tweets. In this paper, we introduce HurricaneSARC, a dataset of 15,000 tweets annotated for intended sarcasm, and provide a comprehensive investigation of sarcasm detection using pre-trained language models. Our best model is able to obtain as much as 0.70 F1 on our dataset. We also demonstrate that the performance on HurricaneSARC can be improved by leveraging intermediate task transfer learning.

SarcNet: A Multilingual Multimodal Sarcasm Detection Dataset
Tan Yue | Xuzhao Shi | Rui Mao | Zonghai Hu | Erik Cambria

Sarcasm poses a challenge in linguistic analysis due to its implicit nature, involving an intended meaning that contradicts the literal expression. The advent of social networks has propelled the utilization of multimodal data to enhance sarcasm detection performance. In prior multimodal sarcasm detection datasets, a single label is assigned to a multimodal instance. Subsequent experiments often highlight the superiority of multimodal models by demonstrating their improvements compared to unimodal models based on these unified labels across multiple modalities. However, our investigation revealed that numerous instances of sarcasm cannot be identified using a single modality. Humans employ the conflict between a statement and factual information as a cue to detect sarcasm, and these cues can stem from different modalities. Thus, a unified label for a multimodal instance may not be suitable for the associated text or image. In this work, we introduce SarcNet, a multilingual and multimodal sarcasm detection dataset in English and Chinese, consisting of 3,335 image-text pair samples. We provide annotations for sarcasm in visual, textual, and multimodal data, respectively, resulting in over 10,000 labeled instances. The separated annotation schema for unimodal and multimodal data facilitates a more accurate and reasonable assessment of unimodal and multimodal models.

Automated patent classification typically involves assigning labels to a patent from a taxonomy, using multi-class multi-label classification models. However, classification-based models face challenges in scaling to large numbers of labels, struggle with generalizing to new labels, and fail to effectively utilize the rich information and multiple views of patents and labels. In this work, we propose a multi-view ranking-based method to address these limitations. Our method consists of four ranking-based models that incorporate different views of patents and a meta-model that aggregates and re-ranks the candidate labels given by the four ranking models. We compared our approach against the state-of-the-art baselines on two publicly available patent classification datasets, USPTO-2M and CLEF-IP-2011. We demonstrate that our approach can alleviate the aforementioned limitations and achieve a new state-of-the-art performance by a significant margin.

Scale-VAE: Preventing Posterior Collapse in Variational Autoencoder
Tianbao Song | Jingbo Sun | Xin Liu | Weiming Peng

Variational autoencoder (VAE) is a widely used generative model that gains great popularity for its capability in density estimation and representation learning. However, when employing a strong autoregressive generation network, VAE tends to converge to a degenerate local optimum known as posterior collapse. In this paper, we propose a model named Scale-VAE to solve this problem. Scale-VAE does not force the KL term to be larger than a positive constant, but aims to make the latent variables easier to be exploited by the generation network. Specifically, each dimension of the mean for the approximate posterior distribution is multiplied by a factor to keep that dimension discriminative across data instances. The same factors are used for all data instances so as not to change the relative relationship between the posterior distributions. Latent variables from the scaled-up posteriors are fed into the generation network, but the original posteriors are still used to calculate the KL term. In this way, Scale-VAE can solve the posterior collapse problem with a training cost similar to or even lower than the basic VAE. Experimental results show that Scale-VAE outperforms state-of-the-art models in density estimation, representation learning, and consistency of the latent space, and is competitive with other models in generation.
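
The core idea stated in this abstract (scale the posterior mean per dimension for the decoder input, but compute the KL term from the original posterior) can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: the function names and the factor values are hypothetical, and how the factors are chosen is not specified here.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single data instance."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def sample_scaled_latent(mu, logvar, factors, rng):
    """Reparameterised sample whose mean is multiplied by a per-dimension factor."""
    return [
        f * m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv, f in zip(mu, logvar, factors)
    ]


rng = random.Random(0)
mu = [0.02, -0.01]        # a nearly collapsed posterior mean
logvar = [0.0, 0.0]
factors = [10.0, 10.0]    # hypothetical values, shared by every data instance

z = sample_scaled_latent(mu, logvar, factors, rng)  # fed to the generation network
kl = kl_to_standard_normal(mu, logvar)              # still uses the unscaled posterior
```

Because the same factors apply to every instance, the relative arrangement of the posterior means is preserved while their spread, as seen by the decoder, grows.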

Alignment with human preference prevents large language models (LLMs) from generating misleading or toxic content, but it requires high-cost human feedback. Assuming resources for human annotation are limited, there are two ways of allocating them: labeling more diverse PROMPTS or more diverse RESPONSES. Nonetheless, a straightforward comparison of their impact is absent. In this work, we first control the diversity of both sides according to the number of samples for fine-tuning, which can directly reflect their influence. We find that more responses with fewer prompts, rather than numerous prompts, better trigger LLMs for human alignment. Additionally, the concept of diversity for prompts can be more complex than that of responses, which are typically quantified by single digits. Consequently, a new formulation of prompt diversity is proposed, further implying a linear correlation with the final performance of LLMs after fine-tuning. We also leverage it for data augmentation and conduct experiments to show its effect on different algorithms.

Scansion-based Lyrics Generation
Yiwen Chen | Simone Teufel

We aim to generate lyrics for Mandarin songs with a good match between the melody and the tonal contour of the lyrics. Our solution relies on mBart, treating lyrics generation as a translation problem, but rather than translating directly from the melody as is common, our novelty in this paper is that we generate from scansion as an intermediate contour representation that can fit a given melody. One of the advantages of our solution is that it does not require a parallel melody-lyrics dataset. We also present a thorough automatic evaluation of our system against competitors, using several new evaluation metrics. These measure intelligibility, fit to melody, and use proxies for quantifying creativity (variation to other songs created by the same system in different settings, semantic similarity to keywords given to the system, perplexity). When comparing different implementations of scansion to competitor systems, a varied picture emerges. Our best system outperforms all others in lyric-melody fit and is in the top group of systems for two of the creativity metrics (variation and perplexity), overshadowing two large language models (LLM) specialised to this task.

Schema-based Data Augmentation for Event Extraction
Xiaomeng Jin | Heng Ji

Event extraction is a crucial task for semantic understanding and structured knowledge construction. However, the expense of collecting and labeling data for training event extraction models is usually high. To address this issue, we propose a novel schema-based data augmentation method that utilizes event schemas to guide the data generation process. The event schemas depict the typical patterns of complex events and can be used to create new synthetic data for event extraction. Specifically, we sub-sample from the schema graph to obtain a subgraph, instantiate the schema subgraph, and then convert the instantiated subgraph to natural language texts. We conduct extensive experiments on event trigger detection, event trigger extraction, and event argument extraction tasks using two datasets (including five scenarios). The experimental results demonstrate that our proposed data-augmentation method produces high-quality generated data and significantly enhances the model performance, with up to 12% increase in F1 score compared to baseline methods.

Schema Learning Corpus: Data and Annotation Focused on Complex Events
Song Chen | Jennifer Tracey | Ann Bies | Stephanie Strassel

The Schema Learning Corpus (SLC) is a new linguistic resource designed to support research into the structure of complex events in multilingual, multimedia data. The SLC incorporates large volumes of background data in English, Spanish and Russian, and defines 100 complex events (CEs) across 12 domains, with CE profiles containing information about the typical steps and substeps and expected event categories for the CE. Multiple documents are labeled for each CE, with pointers to evidence in the document for each CE step, plus labeled events and relations along with their arguments across a large tag set. The SLC was designed to support development and evaluation of technology capable of understanding and reasoning about complex real-world events in multimedia, multilingual data streams in order to provide users with a deeper understanding of the potential relationships among seemingly disparate events and actors, and to allow users to make better predictions about how future events are likely to unfold. The Schema Learning Corpus will be made available to the research community through publication in the Linguistic Data Consortium catalog.

Schroedinger’s Threshold: When the AUC Doesn’t Predict Accuracy
Juri Opitz

The Area Under Curve measure (AUC) seems apt to evaluate and compare diverse models, possibly without calibration. An important example of AUC application is the evaluation and benchmarking of models that predict faithfulness of generated text. But we show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the actual accuracy observed in application, yielding significant changes in benchmark rankings. To paint a more realistic picture of downstream model performance (and prepare it for actual application), we explore different calibration modes, testing calibration data and method.
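
The gap between AUC and accuracy at a fixed threshold can be seen in a small constructed example. The data below are invented for illustration and the 0.5 threshold is an assumption about how an uncalibrated model might be deployed; none of it comes from the paper's experiments.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.5):
    """Accuracy when every score at or above the threshold is predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


labels = [1, 1, 1, 0, 0, 0]
# Model A ranks every positive above every negative, but all scores sit below 0.5.
model_a = [0.40, 0.35, 0.30, 0.20, 0.15, 0.10]
# Model B makes one ranking mistake, but its scores straddle the threshold.
model_b = [0.90, 0.80, 0.60, 0.70, 0.20, 0.10]

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, round(auc(scores, labels), 3), round(accuracy(scores, labels), 3))
```

Model A has the higher AUC yet the lower accuracy at the fixed threshold, so a ranking by AUC and a ranking by deployed accuracy disagree until the threshold is calibrated.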

SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions
Huitong Pan | Qi Zhang | Cornelia Caragea | Eduard Dragut | Longin Jan Latecki

We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus’s scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus’s utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.

SciMRC: Multi-perspective Scientific Machine Reading Comprehension
Xiao Zhang | Heqi Zheng | Yuxiang Nie | Heyan Huang | Xian-Ling Mao

Scientific Machine Reading Comprehension (SMRC) aims to facilitate the understanding of scientific texts through human-machine interactions. While existing datasets have significantly contributed to this field, they predominantly focus on single-perspective question-answer pairs, thereby overlooking the inherent variation in comprehension levels among different readers. To address this limitation, we introduce a novel multi-perspective scientific machine reading comprehension dataset, SciMRC, which incorporates perspectives from beginners, students, and experts. Our dataset comprises 741 scientific papers and 6,057 question-answer pairs, with 3,306, 1,800, and 951 pairs corresponding to beginners, students, and experts respectively. Extensive experiments conducted on SciMRC using pre-trained models underscore the importance of considering diverse perspectives in SMRC and highlight the challenging nature of our scientific machine comprehension tasks.

SciNews: From Scholarly Complexities to Public Narratives – a Dataset for Scientific News Report Generation
Dongqi Pu | Yifan Wang | Jia Loy | Vera Demberg

Scientific news reports serve as a bridge, adeptly translating complex research articles into reports that resonate with the broader public. The automated generation of such narratives enhances the accessibility of scholarly insights. In this paper, we present a new corpus to facilitate this paradigm development. Our corpus comprises a parallel compilation of academic publications and their corresponding scientific news reports across nine disciplines. To demonstrate the utility and reliability of our dataset, we conduct an extensive analysis, highlighting the divergences in readability and brevity between scientific news narratives and academic manuscripts. We benchmark our dataset employing state-of-the-art text generation models. The evaluation process involves both automatic and human evaluation, which lays the groundwork for future explorations into the automated generation of scientific news reports. The dataset and code related to this work are available at https://dongqi.me/projects/SciNews.

We introduce the Situated Corpus Of Understanding Transactions (SCOUT), a multi-modal collection of human-robot dialogue in the task domain of collaborative exploration. The corpus was constructed from multiple Wizard-of-Oz experiments where human participants gave verbal instructions to a remotely-located robot to move and gather information about its surroundings. SCOUT contains 89,056 utterances and 310,095 words from 278 dialogues averaging 320 utterances per dialogue. The dialogues are aligned with the multi-modal data streams available during the experiments: 5,785 images and 30 maps. The corpus has been annotated with Abstract Meaning Representation and Dialogue-AMR to identify the speaker’s intent and meaning within an utterance, and with Transactional Units and Relations to track relationships between utterances to reveal patterns of the Dialogue Structure. We describe how the corpus and its annotations have been used to develop autonomous human-robot systems and enable research in open questions of how humans speak to robots. We release this corpus to accelerate progress in autonomous, situated, human-robot dialogue, especially in the context of navigation tasks where details about the environment need to be discovered.

SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning
Dongsheng Zhu | Zhenyu Mao | Jinghui Lu | Rui Zhao | Fei Tan

Contrastive learning has recently achieved compelling performance in unsupervised sentence representation. As an essential element, data augmentation protocols, however, have not been well explored. The pioneering work SimCSE, resorting to a simple dropout mechanism (viewed as continuous augmentation), surprisingly dominates discrete augmentations such as cropping, word deletion, and synonym replacement, as reported. To understand the underlying rationales, we revisit existing approaches and attempt to hypothesize the desiderata of reasonable data augmentation methods: balance of semantic consistency and expression diversity. We then develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation. They act as minimal noise at the lexical level to produce diverse forms of sentences. Furthermore, standard negation is capitalized on to generate negative samples for alleviating feature suppression involved in contrastive learning. We experimented extensively with semantic textual similarity on diverse datasets. The results support the superiority of the proposed methods consistently. Our key code is available at https://github.com/Zhudongsheng75/SDA.

Searching by Code: A New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets
Ivan Sedykh | Nikita Sorokin | Dmitry Abulkhanov | Sergey I. Nikolenko | Valentin Malykh

Code search is an important and well-studied task, but it usually means searching for code by a text query. We argue that using a code snippet (and possibly an error traceback) as a query while looking for bugfixing instructions and code samples is a natural use case not covered by prior art. Moreover, existing datasets use code comments rather than full-text descriptions as text, making them unsuitable for this use case. We present a new SearchBySnippet dataset implementing the search-by-code use case based on StackOverflow data; we show that on SearchBySnippet, existing architectures fall short of a simple BM25 baseline even after fine-tuning. We present a new single encoder model SnippeR that outperforms several strong baselines on SearchBySnippet with a result of 0.451 Recall@10; we propose the SearchBySnippet dataset and SnippeR as a new important benchmark for code search evaluation.
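
For readers unfamiliar with the Recall@10 figure reported in this abstract, the following is a minimal sketch of one common way to compute Recall@k. The function names and the toy rankings are hypothetical, and the paper's exact evaluation protocol may differ from this definition.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear in the top-k of one ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant item")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, k=10):
    """Average Recall@k over (ranking, relevant_ids) pairs, one pair per query."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy example: two queries, each with a single relevant post.
runs = [
    (["post7", "post2", "post9"], ["post2"]),   # relevant item at rank 2
    (["post4", "post1", "post5"], ["post5"]),   # relevant item at rank 3
]
score = mean_recall_at_k(runs, k=2)
```

In this toy run only the first query has its relevant post within the top two results, so the mean Recall@2 is one half.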

Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

Seeing Eye-to-Eye: Cross-Modal Coherence Relations Inform Eye-gaze Patterns During Comprehension & Production
Mert Inan | Malihe Alikhani

Context influences how we engage with multimodal documents. Describing and processing the content of images is highly correlated with the goals of the discourse. It is known that these underlying cognitive processes can be tapped into by looking at eye movements, but the connection between discourse goals and eye movements is a missing link. In this study, we carry out both augmented reality and webcam-based eye-tracking experiments during comprehension and production tasks. We build on computational frameworks of coherence in text and images that study causal, logical, elaborative, and temporal inferences to understand how eye gaze patterns and coherence relations influence each other. No state-of-the-art techniques exist to analyze eye movements in multimodal language settings. So, we introduce a new eye gaze pattern ranking algorithm and a semantic gaze visualization technique to study this phenomenon better. Our results demonstrate that eye gaze durations are person-dependent, and during comprehension and production, ranked gaze patterns are significantly different for different types of coherence relations. We also present a case study of how Multimodal Large Language Models represent this connection of eye gaze patterns and coherence relations. We make all of our code and novel analysis tools available through https://github.com/Merterm/eye-gaze-coherence.

Over the last few years, artificial intelligence-based clinical assistance has gained immense popularity and demand in telemedicine, including automatic disease diagnosis. Patients often describe their signs and symptoms to doctors using visual aids, which provide vital evidence for identifying a medical condition. In addition to learning from our experiences, we learn from well-established theories/knowledge. With the motivation of leveraging visual cues and medical knowledge, we propose a transformer-based, knowledge-infused multi-modal medical dialogue generation (KI-MMDG) framework. In addition, we present a discourse-aware image identifier (DII) that recognizes signs and their severity by leveraging the current conversation context in addition to the image of the signs. We first curate an empathy and severity-aware multi-modal medical dialogue (ES-MMD) corpus in English, which is annotated with intent, symptoms, and visual signs with severity information. Experimental results show the superior performance of the proposed KI-MMDG model over uni-modal and non-knowledge infused generative models, demonstrating the importance of visual signs and knowledge infusion in symptom investigation and diagnosis. We also observed that the DII model surpasses the existing state-of-the-art model by 7.84%, indicating the crucial significance of dialogue context for identifying a sign image surfaced during conversations. The code and dataset are available at https://github.com/NLP-RL/KI-MMDG.

Segmentation of Complex Question Turns for Argument Mining: A Corpus-based Study in the Financial Domain
Giulia D’Agostino | Chris A. Reed | Daniele Puccinelli

Within the financial communication domain, Earnings Conference Calls (ECCs) play a pivotal role in tracing (a) the presentational strategies and trust-building devices used by company representatives and (b) the relevant hot-topics for stakeholders, from which they form an (e)valuation of the company. Due to their formally regulated nature, ECCs are a favoured domain for the study of argumentation in context and the extraction of Argumentative Discourse Units (ADUs). However, the idiosyncratic structure of dialogical exchanges in Q&A sessions of ECCs, particularly at the level of question formulation, challenges existing models of argument mining, which assume adjacency of related question and answer turns in the dialogue. Maximal Interrogative Units (MIUs) are a novel approach to grouping together topically contiguous argumentative components within a question turn. MIU identification allows application of existing argument mining techniques to a less noisy unit of text, following removal of discourse regulators and splitting into sub-units of thematically related text. Evaluation of an automated method for MIU recognition is also presented with respect to gold-standard manual annotation.

Select and Reorder: A Novel Approach for Neural Sign Language Production
Harry Walsh | Ben Saunders | Richard Bowden

Sign languages, often categorised as low-resource languages, face significant challenges in achieving accurate translation due to the scarcity of parallel annotated datasets. This paper introduces Select and Reorder (S&R), a novel approach that addresses data scarcity by breaking down the translation process into two distinct steps: Gloss Selection (GS) and Gloss Reordering (GR). Our method leverages large spoken language models and the substantial lexical overlap between source spoken languages and target sign languages to establish an initial alignment. Both steps make use of Non-AutoRegressive (NAR) decoding for reduced computation and faster inference speeds. Through this disentanglement of tasks, we achieve state-of-the-art BLEU and ROUGE scores on the Meine DGS Annotated (mDGS) dataset, demonstrating a substantial BLEU-1 improvement of 37.88% in Text to Gloss (T2G) Translation. This innovative approach paves the way for more effective translation models for sign languages, even in resource-constrained settings.

Select High-quality Synthetic QA Pairs to Augment Training Data in MRC under the Reward Guidance of Generative Language Models
Jing Jin | Houfeng Wang

Synthesizing QA pairs via question generator (QG) for data augmentation is widely used in Machine Reading Comprehension (MRC), especially in data-scarce scenarios like limited labeled data or domain adaptation. However, the quality of generated QA pairs varies, and it is necessary to select the ones with high quality from them. Existing approaches focus on downstream metrics to choose QA pairs, which lacks generalization across different metrics and datasets. In this paper, we propose a general selection method that employs a generative large pre-trained language model as a reward model in a Reinforcement Learning (RL) framework for the training of the selection agent. Our experiments on both generative and extractive datasets demonstrate that our selection method leads to better downstream performance. We also find that using the large language model (LLM) as a reward model is more beneficial than using it as a direct selector or QA model. Furthermore, we assess the selected QA pairs from multiple angles, not just downstream metrics, highlighting their superior quality compared to other methods. Our work has better flexibility across metrics, provides interpretability for the selected data, and expands the potential of leveraging generative large language models in the field of MRC and RL training. Our code is available at https://github.com/JulieJin-km/LLM_RL_Selection.

Temporal Knowledge Graph (TKG), which characterizes temporally evolving facts in the form of (subject, relation, object, timestamp), has attracted much attention recently. TKG reasoning aims to predict future facts based on given historical ones. However, existing TKG reasoning models are unable to abstain from predictions they are uncertain about, which will inevitably bring risks in real-world applications. Thus, in this paper, we propose an abstention mechanism for TKG reasoning, which helps the existing models make selective, instead of indiscriminate, predictions. Specifically, we develop a confidence estimator, called Confidence Estimator with History (CEHis), to enable the existing TKG reasoning models to first estimate their confidence in making predictions, and then abstain from those with low confidence. To do so, CEHis takes two kinds of information into consideration, namely, the certainty of the current prediction and the accuracy of historical predictions. Experiments with representative TKG reasoning models on two benchmark datasets demonstrate the effectiveness of the proposed CEHis.
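
Selective prediction of the kind described in this abstract can be sketched as follows. The abstract names two signals (certainty of the current prediction and accuracy of historical predictions) but not how they are combined, so the weighted average, the threshold, and all function names below are assumptions made purely for illustration and should not be read as the CEHis estimator itself.

```python
def confidence(top_probability, past_outcomes, weight=0.5):
    """Blend current certainty with historical accuracy (hypothetical combination rule)."""
    history_accuracy = sum(past_outcomes) / len(past_outcomes) if past_outcomes else 0.0
    return weight * top_probability + (1.0 - weight) * history_accuracy


def predict_or_abstain(candidates, past_outcomes, threshold=0.6):
    """Return the top-scoring entity, or None when confidence falls below the threshold.

    candidates: dict mapping candidate entity -> model probability.
    past_outcomes: list of 1/0 flags marking whether earlier predictions were correct.
    """
    best, probability = max(candidates.items(), key=lambda item: item[1])
    if confidence(probability, past_outcomes) < threshold:
        return None  # abstain instead of emitting a low-confidence fact
    return best


confident = predict_or_abstain({"Paris": 0.9, "Rome": 0.1}, [1, 1, 1, 0])
uncertain = predict_or_abstain({"Paris": 0.4, "Rome": 0.35, "Oslo": 0.25}, [0, 0, 1, 0])
```

Here the first query is answered, while the second is withheld because both the prediction certainty and the track record are weak.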

Task-oriented dialogue (TOD) systems facilitate users in executing various activities via multi-turn dialogues, but Large Language Models (LLMs) often struggle to comprehend these intricate contexts. In this study, we propose a novel “Self-Explanation” prompting strategy to enhance the comprehension abilities of LLMs in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks. Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts, demonstrating its potential as a powerful tool in enhancing LLMs’ comprehension in complex dialogue tasks.

Temporal Knowledge Graph Question Answering (TKGQA) aims to answer questions with temporal intent over Temporal Knowledge Graphs (TKGs). The core challenge of this task lies in understanding the complex semantic information regarding multiple types of time constraints (e.g., before, first) in questions. Existing end-to-end methods implicitly model the time constraints by learning time-aware embeddings of questions and candidate answers, which is far from understanding the question comprehensively. Motivated by semantic-parsing-based approaches that explicitly model constraints in questions by generating logical forms with symbolic operators, we design fundamental temporal operators for time constraints and introduce a novel self-improvement Programming method for TKGQA (Prog-TQA). Specifically, Prog-TQA leverages the in-context learning ability of Large Language Models (LLMs) to understand the combinatory time constraints in the questions and generate corresponding program drafts with a few examples given. Then, it aligns these drafts to TKGs with the linking module and subsequently executes them to generate the answers. To enhance the ability to understand questions, Prog-TQA is further equipped with a self-improvement strategy to effectively bootstrap LLMs using high-quality self-generated drafts. Extensive experiments demonstrate the superiority of the proposed Prog-TQA on MultiTQ and CronQuestions datasets, especially in the Hits@1 metric.

pdf abs
Self-Knowledge Distillation for Knowledge Graph Embedding
Haotian Xu | Yuhua Wang | Jiahui Fan

Knowledge graph embedding (KGE) is an important task that benefits many downstream applications. General KGE models can increase the embedding dimension to improve performance, but high-dimensional KGE significantly increases the number of model parameters and the training time. Therefore, knowledge distillation has been proposed for learning a low-dimensional model from a pre-trained high-dimensional model. To avoid introducing a complex teacher model, we use self-knowledge distillation. However, self-knowledge distillation still has some issues. One is misdirection from incorrect predictions during model training. Another is the loss of discrimination information caused by excessive distillation temperature. To address these issues, we apply self-knowledge distillation, knowledge adjustment and dynamic temperature distillation to KGE. Self-knowledge distillation uses the information from the latest iteration to guide the training in the current iteration. Knowledge adjustment fixes the predictions of misjudged training samples. Dynamic temperature distillation designs dynamic sample-wise temperatures to compute soft targets. Our model not only improves performance but also yields a lightweight model. Experimental results demonstrate the effectiveness and generalization ability of our model in link prediction. The lightweight model maintains good performance while reducing the number of model parameters and training time.
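
The effect of the distillation temperature on soft targets can be sketched as follows. This is the standard temperature-scaled softmax, shown only to illustrate why an excessive temperature loses discrimination information. The logits and temperatures are made-up values, and the rule the authors use to choose sample-wise temperatures is not reproduced here.

```python
import math
from typing import List


def soft_targets(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax: higher temperatures give flatter targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]              # illustrative candidate scores
sharp = soft_targets(logits, 1.0)     # low temperature keeps the ranking pronounced
flat = soft_targets(logits, 10.0)     # high temperature pushes towards uniform
```

With the higher temperature the gap between the best and worst candidate shrinks, which is the loss of discrimination information the abstract refers to.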

pdf abs
Self-reported Demographics and Discourse Dynamics in a Persuasive Online Forum
Agnieszka Falenska | Eva Maria Vecchi | Gabriella Lapesa

Research on language as interactive discourse underscores the deliberate use of demographic parameters such as gender, ethnicity, and class to shape social identities. For example, by explicitly disclosing one’s information and enforcing one’s social identity to an online community, the reception by and interaction with the said community is impacted, e.g., strengthening one’s opinions by depicting the speaker as credible through their experience in the subject. Here, we present a first thorough study of the role and effects of self-disclosures on online discourse dynamics, focusing on a pervasive type of self-disclosure: author gender. Concretely, we investigate the contexts and properties of gender self-disclosures and their impact on interaction dynamics in an online persuasive forum, ChangeMyView. Our contribution is twofold. At the level of the target phenomenon, we fill a research gap in the understanding of the impact of these self-disclosures on the discourse by bringing together features related to forum activity (votes, number of comments), linguistic/stylistic features from the literature, and discourse topics. At the level of the contributed resource, we enrich and release a comprehensive dataset that will provide a further impulse for research on the interplay between gender disclosures, community interaction, and persuasion in online discourse.

pdf abs
Semantic Frame Extraction in Multilingual Olfactory Events
Stefano Menini

In this work we present a system for multilingual olfactory information extraction covering six European languages, introducing new models to extract olfactory information from large amounts of text in a structured and scalable way. For the task we rely on a supervised multi-task approach to detect olfactory related text adopting a FrameNet-like structure, identifying the lexical units triggering the smell event and a related set of frame elements.

pdf abs
Semantic Map-based Generation of Navigation Instructions
Chengzu Li | Chao Zhang | Simone Teufel | Rama Sanand Doddipatla | Svetlana Stoyanchev

We are interested in the generation of navigation instructions, either in their own right or as training material for robotic navigation tasks. In this paper, we propose a new approach to navigation instruction generation by framing the problem as an image captioning task using semantic maps as visual input. Conventional approaches employ a sequence of panorama images to generate navigation instructions. Semantic maps abstract away from visual details and fuse the information in multiple panorama images into a single top-down representation, thereby reducing the computational complexity of processing the input. We present a benchmark dataset for instruction generation using semantic maps, propose an initial model and ask human subjects to manually assess the quality of generated instructions. Our initial investigations show promise in using semantic maps for instruction generation instead of a sequence of panorama images, but there is vast scope for improvement. We release the code for data preparation and model training at https://github.com/chengzu-li/VLGen.

Identifying unexpected domain-shifted instances in natural language processing is crucial in real-world applications. Previous works identify the out-of-distribution (OOD) instance by leveraging a single global feature embedding to represent the sentence, which cannot characterize subtle OOD patterns well. Another major challenge current OOD methods face is learning effective low-dimensional sentence representations to identify the hard OOD instances that are semantically similar to the in-distribution (ID) data. In this paper, we propose a new unsupervised OOD detection method, namely Semantic Role Labeling Guided Out-of-distribution Detection (SRLOOD), that separates, extracts, and learns the semantic role labeling (SRL) guided fine-grained local feature representations from different arguments of a sentence and the global feature representations of the full sentence using a margin-based contrastive loss. A novel self-supervised approach is also introduced to enhance such global-local feature learning by predicting the SRL extracted role. The resulting model achieves SOTA performance on four OOD benchmarks, indicating the effectiveness of our approach. The code is publicly accessible via https://github.com/cytai/SRLOOD.

pdf abs
Semantics-Aware Dual Graph Convolutional Networks for Argument Pair Extraction
Minzhao Guan | Zhixun Qiu | Fenghuan Li | Yun Xue

Argument pair extraction (APE) is a task that aims to extract interactive argument pairs from two argument passages. Generally, existing works focus on either simple argument interaction or task form conversion, instead of thorough deep-level feature exploitation of argument pairs. To address this issue, a Semantics-Aware Dual Graph Convolutional Network (SADGCN) is proposed for APE. Specifically, the co-occurring word graph is designed to tackle the lexical and semantic relevance of arguments with a pre-trained Rouge-guided Transformer (ROT). Considering the topic relevance in argument pairs, a topic graph is constructed by the neural topic model to leverage the topic information of argument passages. The two graphs are fused via a gating mechanism, which contributes to the extraction of argument pairs. Experimental results indicate that our approach achieves state-of-the-art performance, improving the F1 score by 6.56% over the best existing alternative.

In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and text is not sufficiently involved in masked modeling. These two drawbacks limit the effect of MIM in facilitating cross-modal semantic alignment. In this work, we propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning. Specifically, to provide more semantically meaningful supervision for MIM, we propose a local semantics enhancing approach, which harvests high-level semantics from global image features via self-supervised agreement learning and transfers them to local patch encodings by sharing the encoding space. Moreover, to achieve deep involvement of text during the entire MIM process, we propose a text-guided masking strategy and devise an efficient way of injecting textual information in both masked modeling and reconstruction target acquisition. Experimental results validate that our method improves the effectiveness of the MIM task in facilitating cross-modal semantic alignment. Compared to previous VLP models with similar model size and data scale, our SemMIM model achieves state-of-the-art or competitive performance on multiple downstream vision-language tasks.

pdf abs
Sense of the Day: Short Timeframe Temporal-Aware Word Sense Disambiguation
Yuchen Wei | Milton King

The predominant sense of a lemma can vary based on the timeframe (years, decades, centuries) in which the text was written. In our work, we explore the predominant sense of shorter timeframes (days, months, seasons, etc.) and find that different short timeframes can have different predominant senses from each other and from the predominant sense of a corpus. Leveraging the predominant sense and sense distribution of a short timeframe, we design short timeframe temporal-aware word sense disambiguation (WSD) models that outperform a temporal-agnostic model. Likewise, author-aware WSD models tend to outperform author-agnostic models. We therefore augment our temporal-aware models to leverage knowledge of author-level predominant senses and sense distributions to create temporal and author-aware WSD models. In addition, we find that considering recent usages of a lemma by the same author can assist a WSD model. Our approach requires only a small amount of text from authors and timeframes.
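
The notion of a per-timeframe predominant sense can be sketched with simple counting. The sense labels, the month-level timeframes, and the toy instances for the lemma "cold" are hypothetical, and this sketch covers only the counting step, not the WSD models described above.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def predominant_senses(instances: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each timeframe to its most frequent sense.

    `instances` holds (timeframe, sense) pairs for a single lemma.
    """
    by_timeframe: Dict[str, Counter] = defaultdict(Counter)
    for timeframe, sense in instances:
        by_timeframe[timeframe][sense] += 1
    return {t: counts.most_common(1)[0][0] for t, counts in by_timeframe.items()}


# Toy sense-tagged usages of the lemma "cold" (illustrative only).
instances = [
    ("december", "low_temperature"),
    ("december", "low_temperature"),
    ("december", "low_temperature"),
    ("december", "illness"),
    ("july", "illness"),
    ("july", "illness"),
    ("july", "low_temperature"),
]

per_timeframe = predominant_senses(instances)
corpus_level = Counter(sense for _, sense in instances).most_common(1)[0][0]
```

In this toy data the predominant sense for July differs from the corpus-level predominant sense, which is the kind of divergence a temporal-aware model can exploit.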

pdf abs
SENTA: Sentence Simplification System for Slovene
Aleš Žagar | Matej Klemen | Marko Robnik-Šikonja | Iztok Kosem

Ensuring universal access to written content, regardless of users’ language proficiency and cognitive abilities, is of paramount importance. Sentence simplification, which involves converting complex sentences into more accessible forms while preserving their meaning, plays a crucial role in enhancing text accessibility. This paper introduces SENTA, a system for sentence simplification in Slovene. The system consists of two components. First, a neural classifier identifies sentences that require simplification, and second, a large Slovene language model based on the T5 architecture is fine-tuned to transform complex texts into a simpler form, achieving an excellent SARI score of 41. Both automatic and qualitative evaluations provide important insights into the problem, highlighting areas for future research in multilingual applications and fluency maintenance. Finally, SENTA is integrated into a freely accessible, user-friendly interface, offering a valuable service to less-fluent Slovene users.

pdf abs
SentiCSE: A Sentiment-aware Contrastive Sentence Embedding Framework with Sentiment-guided Textual Similarity
Jaemin Kim | Yohan Na | Kangmin Kim | Sang-Rak Lee | Dong-Kyu Chae

Recently, sentiment-aware pre-trained language models (PLMs) have demonstrated impressive results in downstream sentiment analysis tasks. However, they neglect to evaluate the quality of their constructed sentiment representations; they just focus on improving the fine-tuning performance, which overshadows the representation quality. We argue that without guaranteeing the representation quality, their downstream performance can be highly dependent on the supervision of the fine-tuning data rather than representation quality. This problem makes it difficult to extend them to other sentiment-related domains, especially where labeled data is scarce. We first propose Sentiment-guided Textual Similarity (SgTS), a novel metric for evaluating the quality of sentiment representations, which is designed based on the degree of equivalence in sentiment polarity between two sentences. We then propose SentiCSE, a novel Sentiment-aware Contrastive Sentence Embedding framework for constructing sentiment representations via combined word-level and sentence-level objectives, whose quality is guaranteed by SgTS. Qualitative and quantitative comparison with the previous sentiment-aware PLMs shows the superiority of our work. Our code is available at: https://github.com/nayohan/SentiCSE

pdf abs
Sequence Reducible Holdout Loss for Language Model Pretraining
Raghuveer Thirukovalluru | Nicholas Monath | Bhuwan Dhingra | Sam Wiseman

Data selection techniques, which adaptively select datapoints inside the training loop, have demonstrated empirical benefits in reducing the number of gradient steps to train neural models. However, these techniques have so far largely been applied to classification. In this work, we study their applicability to language model pretraining, a highly time-intensive task. We propose a simple modification to an existing data selection technique (reducible hold-out loss training) in order to adapt it to the sequence losses typical in language modeling. We experiment on both autoregressive and masked language modelling, and show that applying data selection to pretraining offers notable benefits, including a 4.3% reduction in the total number of steps and a 21.5% average reduction in steps to reach an intermediate target perplexity, over the course of pretraining an autoregressive language model. Further, language models trained with data selection demonstrate significantly better generalization ability on out-of-domain datasets: a 7.9% reduction in the total number of steps and a 23.2% average reduction in steps to reach an intermediate target perplexity.
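
The selection step of reducible hold-out loss training can be sketched as follows. Each candidate is scored by its loss under the model being trained minus its loss under a reference model trained on held-out data, and the highest-scoring candidates are kept. Aggregating per-token losses by their mean is an assumption made for this sketch and is not necessarily the modification the authors propose. The function names and loss values are hypothetical.

```python
from typing import List


def mean_token_loss(token_losses: List[float]) -> float:
    """Collapse per-token losses into one sequence-level score (assumed: mean)."""
    return sum(token_losses) / len(token_losses)


def select_by_reducible_loss(
    train_losses: List[List[float]],
    holdout_losses: List[List[float]],
    k: int,
) -> List[int]:
    """Return indices of the k sequences with the largest reducible loss.

    train_losses[i]: per-token losses of sequence i under the current model.
    holdout_losses[i]: per-token losses under a hold-out reference model.
    """
    scores = [
        mean_token_loss(t) - mean_token_loss(h)
        for t, h in zip(train_losses, holdout_losses)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Three candidate sequences with two tokens each (made-up loss values).
train = [[2.0, 3.0], [1.0, 1.0], [4.0, 4.0]]
holdout = [[1.0, 1.0], [0.9, 1.1], [3.9, 4.1]]
selected = select_by_reducible_loss(train, holdout, k=1)
```

Sequence 2 has the highest raw loss but is equally hard for the reference model, so it is not selected. Sequence 0 is selected because its loss is high only for the current model.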

pdf abs
Sequence-to-Sequence Language Models for Character and Emotion Detection in Dream Narratives
Gustave Cortal

The study of dreams has been central to understanding human (un)consciousness, cognition, and culture for centuries. Analyzing dreams quantitatively depends on labor-intensive, manual annotation of dream narratives. We automate this process through a natural language sequence-to-sequence generation framework. This paper presents the first study on character and emotion detection in the English portion of the open DreamBank corpus of dream narratives. Our results show that language models can effectively address this complex task. To get insight into prediction performance, we evaluate the impact of model size, prediction order of characters, and the consideration of proper names and character traits. We compare our approach with a large language model using in-context learning. Our supervised models perform better while having 28 times fewer parameters. Our model and its generated annotations are made publicly available.

pdf abs
Sequence-to-Sequence Spanish Pre-trained Language Models
Vladimir Araujo | Maria Mihaela Trusca | Rodrigo Tufiño | Marie-Francine Moens

In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.

Temporal Knowledge Graph (TKG) reasoning has received growing interest recently, especially in forecasting future facts based on historical KG sequences. Existing studies typically utilize a recurrent neural network to learn the evolutional representations of entities for temporal reasoning. However, these methods struggle to accurately capture complex temporal evolutional patterns such as sequential and repetitive patterns. To tackle this challenge, we propose a novel Sequential and Repetitive Pattern Learning (SRPL) method, which comprehensively captures both the sequential and repetitive patterns. Specifically, a Dependency-aware Sequential Pattern Learning (DSPL) component expresses the temporal dependencies of each historical timestamp as embeddings for accurately capturing the sequential patterns across temporally adjacent facts. A Time-interval guided Repetitive Pattern Learning (TRPL) component models the irregular time intervals between historical repetitive facts for capturing the repetitive patterns. Extensive experiments on four representative benchmarks demonstrate that our proposed method outperforms state-of-the-art methods in all metrics by an obvious margin, especially on the GDELT dataset, where the improvement in MRR reaches up to 18.84%.

pdf abs
SGCM: Salience-Guided Context Modeling for Question Generation
Chuyao Ding | Yu Hong | Jianmin Yao

We tackle Paragraph-level Question Generation (PQG) in this paper. PQG is the task of automatically generating questions given paragraphs and answers. Identifying the sentences relevant to the answers is crucial for reasoning about the possible questions before generation. Accordingly, we propose a salience-guided approach to enhance PQG. Specifically, we construct an auxiliary task of identifying salient sentences that manifest relevance. Grounded on this auxiliary task and the main task of PQG, we strengthen the BART encoder during training within a multitask learning framework. In particular, we utilize the identified salient sentences as explicit guidance to enable the salience-aware attention computation in the BART decoder. We experiment on the benchmark dataset FairytaleQA. The test results show that our approach yields substantial improvements compared to the BART baseline, achieving Rouge-L, BLEU4, BERTScore, Q-BLEU-3 and F1 scores of about 56.56%, 19.78%, 61.19%, 54.33% and 43.55%, respectively. Both the source code and models will be publicly available.

pdf abs
ShadowSense: A Multi-annotated Dataset for Evaluating Word Sense Induction
Ondřej Herman | Miloš Jakubíček

In this paper we present a novel bilingual (Czech, English) dataset called ShadowSense developed for the purposes of word sense induction (WSI) evaluation. Unlike existing WSI datasets, ShadowSense is annotated by multiple annotators whose inter-annotator agreement represents a key reliability score to be used for the evaluation of systems automatically inducing word senses. In this paper we clarify the motivation for such an approach, describe the dataset in detail and provide an evaluation of three neural WSI systems showing substantial differences compared to traditional evaluation paradigms.

pdf abs
Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies
Philipp Sadler | Sherzod Hakimov | David Schlangen

In collaborative goal-oriented settings, the participants are not only interested in achieving a successful outcome, but also implicitly negotiate the effort they put into the interaction (by adapting to each other). In this work, we propose a challenging interactive reference game that requires two players to coordinate on vision and language observations. The learning signal in this game is a score (given after playing) that takes into account the achieved goal and the players’ assumed efforts during the interaction. We show that a standard Proximal Policy Optimization (PPO) setup achieves a high success rate when bootstrapped with heuristic partner behaviors that implement insights from the analysis of human-human interactions. We also find that a pairing of neural partners indeed reduces the measured joint effort when playing together repeatedly. However, we observe that in comparison to a reasonable heuristic pairing there is still room for improvement, which invites further research in the direction of cost-sharing in collaborative interactions.

pdf abs
SIGA: A Naturalistic NLI Dataset of English Scalar Implicatures with Gradable Adjectives
Rashid Nizamani | Sebastian Schuster | Vera Demberg

Many utterances convey meanings that go beyond the literal meaning of a sentence. One class of such meanings is scalar implicatures, a phenomenon by which a speaker conveys the negation of a more informative utterance by producing a less informative utterance. This paper introduces a Natural Language Inference (NLI) dataset designed to investigate the ability of language models to interpret utterances with scalar implicatures. Our dataset consists of text extracted from the C4 English text corpus and annotated with both crowd-sourced and expert annotations. We evaluate NLI models based on DeBERTa to investigate 1) whether NLI models can learn to predict pragmatic inferences involving gradable adjectives and 2) whether models generalize to utterances involving unseen adjectives. We find that fine-tuning NLI models on our dataset significantly improves their performance in deriving scalar implicatures, both for in-domain and for out-of-domain examples. At the same time, we find that the investigated models still perform considerably worse on examples with scalar implicatures than on other types of NLI examples, highlighting that pragmatic inferences still pose challenges for current models.

pdf abs
SignBLEU: Automatic Evaluation of Multi-channel Sign Language Translation
Jung-Ho Kim | Mathew Huerta-Enochian | Changyong Ko | Du Hui Lee

Sign languages are multi-channel languages that communicate information through not just the hands (manual signals) but also facial expressions and upper body movements (non-manual signals). However, since automatic sign language translation is usually performed by generating a single sequence of glosses, researchers eschew non-manual and co-occurring manual signals in favor of a simplified list of manual glosses. This can lead to significant information loss and ambiguity. In this paper, we introduce a new task named multi-channel sign language translation (MCSLT) and present a novel metric, SignBLEU, designed to capture multiple signal channels. We validated SignBLEU on a system-level task using three sign language corpora with varied linguistic structures and transcription methodologies and examined its correlation with human judgment through two segment-level tasks. We found that SignBLEU consistently correlates better with human judgment than competing metrics. To facilitate further MCSLT research, we report benchmark scores for the three sign language corpora and release the source code for SignBLEU at https://github.com/eq4all-projects/SignBLEU.

pdf abs
SilverAlign: MT-Based Silver Data Algorithm for Evaluating Word Alignment
Abdullatif Koksal | Silvia Severini | Hinrich Schütze

Word alignments are essential for a variety of NLP tasks. Therefore, choosing the best approaches for their creation is crucial. However, the scarce availability of gold evaluation data makes the choice difficult. We propose SilverAlign, a new method to automatically create silver data for the evaluation of word aligners by exploiting machine translation and minimal pairs. We show that performance on our silver data correlates well with gold benchmarks for 9 language pairs, making our approach a valid resource for evaluation of different languages and domains when gold data is not available. This addresses the important scenario of missing gold data alignments for low-resource languages.

pdf abs
Silver Retriever: Advancing Neural Passage Retrieval for Polish Question Answering
Piotr Rybak | Maciej Ogrodniczuk

Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts necessary to answer the question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of the work concerns popular languages such as English or Chinese. For others, such as Polish, few models are available. In this work, we present Silver Retriever, a neural retriever for Polish trained on a diverse collection of manually or weakly labeled datasets. Silver Retriever achieves much better results than other Polish models and is competitive with larger multilingual models. Together with the model, we open-source five new passage retrieval datasets.

pdf abs
SimLex-999 for Dutch
Lizzy Brans | Jelke Bloem

Word embeddings revolutionised natural language processing by effectively representing words as dense vectors. Although many datasets exist to evaluate English embeddings, few cater to Dutch. We developed a Dutch variant of the SimLex-999 word similarity dataset by gathering similarity judgements from 235 native Dutch speakers. Subsequently, we evaluated two popular Dutch language models, Bertje and RobBERT, finding that Bertje showed superior alignment with human semantic similarity judgments compared to RobBERT. This study provides the first intrinsic Dutch word embedding evaluation dataset, which enables accurate assessment of these embeddings and fosters the development of effective Dutch language models.
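
Word similarity datasets of this kind are commonly used by correlating human ratings with model similarity scores, often with Spearman's rank correlation. The abstract does not state which statistic was used, so the choice of Spearman here is an assumption. The ratings and model scores below are invented, and obtaining similarity scores from Bertje or RobBERT is out of scope for this sketch.

```python
import math
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [9.1, 7.3, 2.0, 0.5]      # invented human similarity ratings
model = [0.80, 0.62, 0.40, 0.10]  # invented cosine similarities
rho = spearman(human, model)
```

A rank-based statistic is a natural fit because human ratings and cosine similarities live on different scales, and only the ordering of word pairs is compared.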

Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which degrade logits-based KD for diverse NLP tasks. We propose the Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, by exploiting properties of the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the perception of divergence to each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
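
The Sinkhorn distance mentioned above can be sketched for two small discrete distributions. This is the generic entropy-regularised optimal transport iteration, not the batch-wise SinKD formulation. The ground cost |i - j|, the regularisation strength, the iteration count, and the toy distributions are assumptions for illustration.

```python
import math
from typing import List


def sinkhorn_distance(
    p: List[float],
    q: List[float],
    cost: List[List[float]],
    eps: float = 0.1,
    iters: int = 200,
) -> float:
    """Approximate transport cost between p and q via Sinkhorn iterations."""
    n, m = len(p), len(q)
    # Gibbs kernel derived from the ground cost.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [p[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))


# Assumed ground cost: distance between class indices.
cost = [[float(abs(i - j)) for j in range(3)] for i in range(3)]

teacher = [0.98, 0.01, 0.01]
student_near = [0.01, 0.98, 0.01]  # mass sits one class away from the teacher
student_far = [0.01, 0.01, 0.98]   # mass sits two classes away

d_same = sinkhorn_distance(teacher, teacher, cost)
d_near = sinkhorn_distance(teacher, student_near, cost)
d_far = sinkhorn_distance(teacher, student_far, cost)
```

Both students overlap very little with the teacher, yet the transport-based distance still ranks the nearer student as closer, which is the kind of graded signal the abstract argues overlap-dependent divergences fail to provide.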

pdf abs
SI-NLI: A Slovene Natural Language Inference Dataset and Its Evaluation
Matej Klemen | Aleš Žagar | Jaka Čibej | Marko Robnik-Šikonja

Natural language inference (NLI) is an important language understanding benchmark. Two deficiencies of this benchmark are: i) most existing NLI datasets exist for English and a few other well-resourced languages, and ii) most NLI datasets are formed with a narrow set of annotators’ instructions, allowing the prediction models to capture linguistic clues instead of measuring true reasoning capability. We address both issues and introduce SI-NLI, the first dataset for Slovene natural language inference. The dataset is constructed from scratch using knowledgeable annotators with carefully crafted guidelines aiming to avoid commonly encountered problems in existing NLI datasets. We also manually translate the SI-NLI to English to enable cross-lingual model training and evaluation. Using the newly created dataset and its translation, we train and evaluate a variety of large transformer language models in a monolingual and cross-lingual setting. The results indicate that larger models, in general, achieve better performance. The qualitative analysis shows that the SI-NLI dataset is diverse and that there remains plenty of room for improvement even for the largest models.

pdf abs
SkOTaPA: A Dataset for Skepticism Detection in Online Text after Persuasion Attempt
Smitha Muthya Sudheendra | Maral Abdollahi | Dongyeop Kang | Jisu Huh | Jaideep Srivastava

Individuals often encounter persuasion attempts, during which a persuasion agent aims to persuade a target to change the target’s emotions, beliefs, and behaviors. These persuasion attempts can be observed in various social settings, such as advertising, public health, political campaigns, and personal relationships. During these persuasion attempts, targets generally like to preserve their autonomy, so their responses often manifest in some form of resistance, like a skeptical reaction. In order to detect such skepticism in response to persuasion attempts on social media, we developed a corpus based on consumer psychology. In this paper, we consider one of the most prominent areas in which persuasion attempts unfold: social media influencer marketing. We introduce the skepticism detection corpus, SkOTaPA, which was developed using multiple independent human annotations; inter-coder reliability was evaluated with Krippendorff’s alpha (0.709). We performed validity tests to show that skepticism cannot be detected using other potential proxy variables like sentiment and sarcasm.
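
For readers unfamiliar with the reliability statistic, Krippendorff's alpha for nominal labels can be sketched as below. The toy annotations are invented and the function name is hypothetical. The corpus's actual label scheme and annotation setup are not reproduced here.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Tuple


def krippendorff_alpha_nominal(units: List[List[str]]) -> float:
    """Krippendorff's alpha for nominal data.

    `units` holds, per annotated item, the labels assigned by its coders.
    Items with fewer than two labels are ignored. Assumes at least two
    distinct labels occur overall.
    """
    coincidences: Dict[Tuple[str, str], float] = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of ratings within an item, weighted by 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals: Dict[str, float] = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    return 1.0 - (n - 1) * observed / expected


# Two coders, four items, with one disagreement (invented labels).
annotations = [
    ["skeptical", "skeptical"],
    ["skeptical", "not_skeptical"],
    ["not_skeptical", "not_skeptical"],
    ["not_skeptical", "not_skeptical"],
]
alpha = krippendorff_alpha_nominal(annotations)
```

Unlike raw percentage agreement, alpha corrects for the agreement expected by chance given the label distribution, so a single disagreement in four items already lowers it to roughly 0.53 in this toy example.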

Identifying early markers of Alzheimer’s disease (AD) trajectory enables intervention in early disease stages, when currently available interventions are most likely to be beneficial. Research has shown that alterations in speech, as well as linguistic and semantic deviations in spontaneous conversation detected using natural language processing, manifest early in AD prior to some other observed cognitive deficits. Recent studies show that cerebrospinal fluid (CSF) levels serve as useful early biomarkers for identifying early AD, but CSF biomarkers are challenging to collect. A simpler alternative that has seen very rapid development is based on the use of plasma biomarkers, as a blood draw is minimally invasive. Associating verbal and nonverbal characteristics from speech data with CSF and plasma biomarkers may open the door to less invasive, more efficient methods for early AD detection. We present SLaCAD, a new dataset to facilitate this process. We describe our data collection procedures, analyze the resulting corpus, and present preliminary findings that relate measures extracted from the audio and transcribed text to clinical diagnoses, CSF levels, and plasma biomarkers. Our findings demonstrate the feasibility of this approach and indicate that the collected data can be used to improve assessments of early AD.

pdf abs
Slot and Intent Detection Resources for Bavarian and Lithuanian: Assessing Translations vs Natural Queries to Digital Assistants
Miriam Winkler | Virginija Juozapaityte | Rob van der Goot | Barbara Plank

Digital assistants perform well in high-resource languages like English, where tasks like slot and intent detection (SID) are well-supported. Many recent SID datasets have started to include multiple language varieties. However, it is unclear how realistic these translated datasets are. Therefore, we extend one such dataset, namely xSID-0.4, to include two underrepresented languages: Bavarian, a German dialect, and Lithuanian, a Baltic language. Both language variants have limited speaker populations and are often not included in multilingual projects. In addition to translations, we provide “natural” queries to digital assistants generated by native speakers. We further include utterances from another dataset for Bavarian to build the richest SID dataset available today for a low-resource dialect without standard orthography. We then evaluate models trained on English in a zero-shot scenario on our target language variants. Our evaluation reveals that translated data can produce overly optimistic scores. However, the error patterns in translated and natural datasets are highly similar. Cross-dataset experiments demonstrate that data collection methods influence performance, with scores lower than those achieved with single-dataset translations. This work contributes to enhancing SID datasets for underrepresented languages, yielding NaLiBaSID, a new evaluation dataset for Bavarian and Lithuanian.

SlovakSum: A Large Scale Slovak Summarization Dataset
Viktoria Ondrejova | Marek Suppa

The ability to automatically summarize news articles has become increasingly important due to the vast amount of information available online. Together with the rise of chatbots, Natural Language Processing (NLP) has recently experienced a tremendous amount of development. Despite these advancements, the majority of research is focused on established well-resourced languages, such as English. To contribute to the development of the low-resource Slovak language, we introduce SlovakSum, a Slovak news summarization dataset consisting of over 200 thousand news articles with titles and short abstracts obtained from multiple Slovak newspapers. The abstractive approach, including MBART and mT5 models, was used to evaluate various baselines. The code for the reproduction of our dataset and experiments can be found at https://github.com/NaiveNeuron/slovaksum

Small Language Models Are Good Too: An Empirical Study of Zero-Shot Classification
Pierre Lepagnol | Thomas Gerald | Sahar Ghannay | Christophe Servan | Sophie Rosset

This study is part of the debate on the efficiency of large versus small language models for text classification by prompting. We assess the performance of small language models in zero-shot text classification, challenging the prevailing dominance of large models. Across 15 datasets, our investigation benchmarks language models from 77M to 40B parameters using different architectures and scoring functions. Our findings reveal that small models can effectively classify texts, getting on par with or surpassing their larger counterparts. We developed and shared a comprehensive open-source repository that encapsulates our methodologies. This research underscores the notion that bigger isn’t always better, suggesting that resource-efficient small models may offer viable solutions for specific data classification challenges.

Despite achieving remarkable performance on various vision-language tasks, Transformer-based Vision-Language Models (VLMs) suffer from redundancy in inputs and parameters, significantly hampering their efficiency in real-world applications. Moreover, the degree of redundancy in token representations and model parameters, such as attention heads, varies significantly for different inputs. In light of these challenges, we propose SmartTrim, an adaptive acceleration framework for VLMs, which adjusts the computational overhead per instance. Specifically, we integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer. Furthermore, we devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart. Experimental results across various vision-language tasks consistently demonstrate that SmartTrim accelerates the original model by 2-3 times with minimal performance degradation, highlighting the effectiveness and efficiency compared to previous approaches. Code will be available at https://github.com/kugwzk/SmartTrim.

This article introduces SM-FEEL-BG – the first Bulgarian-language package, containing 6 datasets with Social Media (SM) texts with emotion, feeling, and sentiment labels and 4 classifiers trained on them. All but one of these datasets are freely accessible for research purposes. The largest dataset contains 6000 Twitter, Telegram, and Facebook texts, manually annotated with 21 fine-grained emotion/feeling categories. The fine-grained labels are automatically merged into three coarse-grained sentiment categories, producing a dataset with two parallel sets of labels. Several classification experiments are run on different subsets of the fine-grained categories and their respective sentiment labels with a Bulgarian fine-tuned BERT. The highest Acc. reached was 0.61 for 16 emotions and 0.70 for 11 emotions (incl. 310 ChatGPT 4-generated texts). The sentiment Acc. of the 11-emotion dataset was also the highest (0.79). As Facebook posts cannot be shared, we ran experiments on the Twitter and Telegram subset of the 11-emotion dataset, obtaining 0.73 Acc. for emotions and 0.80 for sentiments. The article describes the annotation procedures, guidelines, experiments, and results. We believe that this package will be of significant benefit to researchers working on emotion detection and sentiment analysis in Bulgarian.

SOBR: A Corpus for Stylometry, Obfuscation, and Bias on Reddit
Chris Emmery | Marilù Miotto | Sergey Kramp | Bennett Kleinberg

Sharing textual content in the form of public posts on online platforms remains a significant part of the social web. Research on stylometric profiling suggests that despite users’ discreetness, and even under the guise of anonymity, the content and style of such posts may still reveal detailed author information. Studying how this might be inferred and obscured is relevant not only to the domain of cybersecurity, but also to those studying bias of classifiers drawing features from web corpora. While the collection of gold standard data is expensive, prior work shows that distant labels (i.e., those gathered via heuristics) offer an effective alternative. Currently, however, pre-existing corpora are limited in scope (e.g., variety of attributes and size). We present the SOBR corpus: 235M Reddit posts for which we used subreddits, flairs, and self-reports as distant labels for author attributes (age, gender, nationality, personality, and political leaning). In addition to detailing the data collection pipeline and sampling strategy, we report corpus statistics and provide a discussion on the various tasks and research avenues to be pursued using this resource. Along with the raw corpus, we provide sampled splits of the data, and suggest baselines for stylometric profiling. We close our work with a detailed set of ethical considerations relevant to the proposed lines of research.

Social Convos: Capturing Agendas and Emotions on Social Media
Ankita Bhaumik | Ning Sa | Gregorios Katsios | Tomek Strzalkowski

Social media platforms are popular tools for disseminating targeted information during major public events like elections or pandemics. Systematic analysis of the message traffic can provide valuable insights into prevailing opinions and social dynamics among different segments of the population. We are specifically interested in influence spread, and in particular whether more deliberate influence operations can be detected. However, filtering out the essential messages with telltale influence indicators from the extensive and often chaotic social media traffic is a major challenge. In this paper we present a novel approach to extract influence indicators from messages circulating among groups of users discussing particular topics. We build upon the concept of a convo to identify influential authors who are actively promoting some particular agenda around that topic within the group. We focus on two influence indicators: the (control of) agenda and the use of emotional language.

There are many settings where it is useful to predict and explain the success or failure of a dialogue. Circumplex theory from psychology models the social orientations (e.g., Warm-Agreeable, Arrogant-Calculating) of conversation participants and can be used to predict and explain the outcome of social interactions. Our work is novel in its systematic application of social orientation tags to modeling conversation outcomes. In this paper, we introduce a new data set of dialogue utterances machine-labeled with social orientation tags. We show that social orientation tags improve task performance, especially in low-resource settings, on both English and Chinese language benchmarks. We also demonstrate how social orientation tags help explain the outcomes of social interactions when used in neural models. Based on these results showing the utility of social orientation tags for dialogue outcome prediction tasks, we release our data sets, code, and models that are fine-tuned to predict social orientation tags on dialogue utterances.

SoftMCL: Soft Momentum Contrastive Learning for Fine-grained Sentiment-aware Pre-training
Jin Wang | Liang-Chih Yu | Xuejie Zhang

The pre-training for language models captures general language understanding but fails to distinguish the affective impact of a particular context to a specific word. Recent works have sought to introduce contrastive learning (CL) for sentiment-aware pre-training in acquiring affective information. Nevertheless, these methods present two significant limitations. First, the compatibility of the GPU memory often limits the number of negative samples, hindering the opportunities to learn good representations. In addition, using only a few sentiment polarities as hard labels, e.g., positive, neutral, and negative, to supervise CL will force all representations to converge to a few points, leading to the issue of latent space collapse. This study proposes a soft momentum contrastive learning (SoftMCL) for fine-grained sentiment-aware pre-training. Instead of hard labels, we introduce valence ratings as soft-label supervision for CL to measure the sentiment similarities between samples at a fine-grained level. The proposed SoftMCL conducts CL on both the word- and sentence-level to enhance the model’s ability to learn affective information. A momentum queue was introduced to expand the contrastive samples, allowing storing and involving more negatives to overcome the limitations of hardware platforms. Extensive experiments were conducted on four different sentiment-related tasks, which demonstrate the effectiveness of the proposed SoftMCL method. The code and data of the proposed SoftMCL are available at: https://www.github.com/wangjin0818/SoftMCL/.

The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also models each step as a reasoning aggregation graph to cope with the overlooked multiple aspects of thinking in single-step reasoning. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.

Soft Well-Formed Semantic Parsing with Score-Based Selection
Jiangming Liu

Semantic parsing is the task of translating natural language into a structured, formal semantic representation that can be interpreted by machines. These semantic representations are organized with complex structures. While various models have been developed for semantic parsing, there has been limited focus on generating semantic representations with well-formed structures. In this study, we introduce a score-based method to select well-formed outputs from candidates generated by beam search algorithms. Our experiments focus on parsing texts into discourse representation structures, which are innovative semantic representations designed to capture the meaning of texts with arbitrary lengths across languages. Our experimental results demonstrate that models utilizing the proposed method can reduce the number of ill-formed outputs and improve F1 scores in English. Furthermore, our final model achieves significant improvements in German, Italian and Dutch zero-shot DRS parsing by effectively preventing ill-formed outputs.

So Hateful! Building a Multi-Label Hate Speech Annotated Arabic Dataset
Wajdi Zaghouani | Hamdy Mubarak | Md. Rafiul Biswas

Social media enables widespread propagation of hate speech targeting groups based on ethnicity, religion, or other characteristics. With manual content moderation being infeasible given the volume, automatic hate speech detection is essential. This paper analyzes 70,000 Arabic tweets, from which 15,965 tweets were selected and annotated, to identify hate speech patterns and train classification models. Annotators labeled the Arabic tweets for offensive content, hate speech, emotion intensity and type, effect on readers, humor, factuality, and spam. Key findings reveal 15% of tweets contain offensive language while 6% have hate speech, mostly targeted towards groups with common ideological or political affiliations. Annotations capture diverse emotions, and sarcasm is more prevalent than humor. Additionally, 10% of tweets provide verifiable factual claims, and 7% are deemed important. For hate speech detection, deep learning models like AraBERT outperform classical machine learning approaches. By providing insights into hate speech characteristics, this work enables improved content moderation and reduced exposure to online hate. The annotated dataset advances Arabic natural language processing research and resources.

Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.

Unsupervised Domain Adaptation (UDA) of the Aspect-based Sentiment Analysis (ABSA) task aims to transfer knowledge learned from labeled source domain datasets to unlabeled target domains on the assumption that samples from the source domain are freely accessible during the training period. However, this assumption can easily lead to privacy invasion issues in real-world applications, especially when the source data involves privacy-preserving domains such as healthcare and finance. In this paper, we introduce the Source-Free Domain Adaptation Framework for ABSA (SF-ABSA), which only allows model parameter transfer, not data transfer, between different domains. Specifically, the proposed SF-ABSA framework consists of two parts, i.e., feature-based adaptation and pseudo-label-based adaptation. Experiment results on four benchmarks show that the proposed framework performs competitively with traditional unsupervised domain adaptation methods under the premise of insufficient information, which demonstrates the superiority of our method under privacy conditions.

SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation
Andres Garcia-Silva | Cristian Berrio | Jose Manuel Gomez-Perez

Detecting salient parts in text using natural language processing has been widely used to mitigate the effects of information overflow. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection from innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic and business-oriented writing styles. In addition to a manually annotated dataset we release an extended version that is annotated using a large generative language model. We train different sentence and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged using multitask learning to train better classifiers.

Spanish Resource Grammar Version 2023
Olga Zamaraeva | Lorena S. Allegue | Carlos Gómez-Rodríguez

We present the latest version of the Spanish Resource Grammar (SRG), a grammar of Spanish implemented in the HPSG formalism. Such grammars encode a complex set of hypotheses about syntax, making them a resource for empirical testing of linguistic theory. They also encode a strict notion of grammaticality, which makes them a resource for natural language processing applications in computer-assisted language learning. This version of the SRG uses the recent version of the Freeling morphological analyzer and is released along with an automatically created, manually verified treebank of 2,291 sentences. We explain the treebanking process, emphasizing how it is different from treebanking with manual annotation and how it contributes to empirically-driven development of syntactic theory. The treebanks’ high level of consistency and detail makes them a resource for training high-quality semantic parsers and, more generally, systems that benefit from precise and detailed semantics. Finally, we present the grammar’s coverage and overgeneration on 100 sentences from a learner corpus, a new research line related to developing methodologies for robust empirical evaluation of hypotheses in second language acquisition.

Spanless Event Annotation for Corpus-Wide Complex Event Understanding
Ann Bies | Jennifer Tracey | Ann O’Brien | Song Chen | Stephanie Strassel

We present a new approach to event annotation designed to promote whole-corpus understanding of complex events in multilingual, multimedia data as part of the DARPA Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) Program. KAIROS aims to build technology capable of reasoning about complex real-world events like a specific terrorist attack in order to provide actionable insights to end users. KAIROS systems extract events from a corpus, aggregate information into a coherent semantic representation, and instantiate observed events or predict unseen but expected events using a relevant event schema selected from a generalized schema library. To support development and testing for KAIROS Phase 2B we created a complex event annotation corpus that, instead of individual event mentions anchored in document spans with pre-defined event type labels, comprises a series of temporally ordered event frames populated with information aggregated from the whole corpus and labeled with an unconstrained tag set based on Wikidata Qnodes. The corpus makes a unique contribution to the resource landscape for information extraction, addressing gaps in the availability of multilingual, multimedia corpora for schema-based event representation. The corpus will be made available through publication in the Linguistic Data Consortium (LDC) catalog.

Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks
Santiago Herrera | Caio Corro | Sylvain Kahane

Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking order of the extracted rules. For that, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information to each rule, and we compare the ranking of the model’s results to those of other quantitative and statistical measures. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.

Specifying Genericity through Inclusiveness and Abstractness Continuous Scales
Claudia Collacciani | Andrea Amelio Ravelli | Marianna Bolognesi

This paper introduces a novel annotation framework for the fine-grained modeling of Noun Phrases’ (NPs) genericity in natural language. The framework is designed to be simple and intuitive, making it accessible to non-expert annotators and suitable for crowd-sourced tasks. Drawing from theoretical and cognitive literature on genericity, this framework is grounded in established linguistic theory. Through a pilot study, we created a small but crucial annotated dataset of 324 sentences, serving as a foundation for future research. To validate our approach, we conducted an evaluation comparing our continuous annotations with existing binary annotations on the same dataset, demonstrating the framework’s effectiveness in capturing nuanced aspects of genericity. Our work offers a practical resource for linguists, providing a first annotated dataset and an annotation scheme designed to build real-language datasets that can be used in studies on the semantics of genericity, and NLP practitioners, contributing to the development of commonsense knowledge repositories valuable in enhancing various NLP applications.

SpeechAlign: A Framework for Speech Translation Alignment Evaluation
Belen Alastruey | Aleix Sant | Gerard I. Gállego | David Dale | Marta R. Costa-jussà

Speech-to-Speech and Speech-to-Text translation are currently dynamic areas of research. In our commitment to advance these fields, we present SpeechAlign, a framework designed to evaluate the underexplored field of source-target alignment in speech models. The SpeechAlign framework has two core components. First, to tackle the absence of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset, built upon an English-German text translation gold alignment dataset. Secondly, we introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER), which enable the evaluation of alignment quality within speech models. While the former gives equal importance to each word, the latter assigns weights based on the length of the words in the speech signal. By publishing SpeechAlign we provide an accessible evaluation framework for model assessment, and we employ it to benchmark open-source Speech Translation models. In doing so, we contribute to the ongoing research progress within the fields of Speech-to-Speech and Speech-to-Text translation.

Speech Analysis of Language Varieties in Italy
Moreno La Quatra | Alkis Koudounas | Elena Baralis | Sabato Marco Siniscalchi

Italy exhibits rich linguistic diversity across its territory due to the distinct regional languages spoken in different areas. Recent advances in self-supervised learning provide new opportunities to analyze Italy’s linguistic varieties using speech data alone. This includes the potential to leverage representations learned from large amounts of data to better examine nuances between closely related linguistic varieties. In this study, we focus on automatically identifying the geographic region of origin of speech samples drawn from Italy’s diverse language varieties. We leverage self-supervised learning models to tackle this task and analyze differences and similarities between Italy’s regional languages. In doing so, we also seek to uncover new insights into the relationships among these diverse yet closely related varieties, which may help linguists understand their interconnected evolution and regional development over time and space. To improve the discriminative ability of learned representations, we evaluate several supervised contrastive learning objectives, both as pre-training steps and additional fine-tuning objectives. Experimental evidence shows that pre-trained self-supervised models can effectively identify regions from speech recordings. Additionally, incorporating contrastive objectives during fine-tuning improves classification accuracy and yields embeddings that distinctly separate regional varieties, demonstrating the value of combining self-supervised pre-training and contrastive learning for this task.

Speech Corpus for Korean Children with Autism Spectrum Disorder: Towards Automatic Assessment Systems
Seonwoo Lee | Jihyun Mun | Sunhee Kim | Minhwa Chung

Despite the growing demand for digital therapeutics for children with Autism Spectrum Disorder (ASD), there is currently no speech corpus available for Korean children with ASD. This paper introduces a speech corpus specifically designed for Korean children with ASD, aiming to advance speech technologies such as pronunciation and severity evaluation. Speech recordings from speech and language evaluation sessions were transcribed, and annotated for articulatory and linguistic characteristics. Three speech and language pathologists rated these recordings for social communication severity (SCS) and pronunciation proficiency (PP) using a 3-point Likert scale. The total number of participants will be 300 for children with ASD and 50 for typically developing (TD) children. The paper also analyzes acoustic and linguistic features extracted from speech data collected and completed for annotation from 73 children with ASD and 9 TD children to investigate the characteristics of children with ASD and identify significant features that correlate with the clinical scores. The results reveal some speech and linguistic characteristics in children with ASD that differ from those in TD children or another subgroup of ASD categorized by clinical scores, demonstrating the potential for developing automatic assessment systems for SCS and PP.

Speech Recognition Corpus of the Khinalug Language for Documenting Endangered Languages
Zhaolin Li | Monika Rind-Pawlowski | Jan Niehues

Automatic Speech Recognition (ASR) can be a valuable tool to document endangered languages. However, building ASR tools for these languages poses several difficult research challenges, notably data scarcity. In this paper, we show the whole process of creating a useful ASR tool for language documentation scenarios. We publish the first speech corpus for Khinalug, an endangered language spoken in Northern Azerbaijan. The corpus consists of 2.67 hours of labeled data from recordings of spontaneous speech about various topics. As Khinalug is an extremely low-resource language, we investigate the benefits of multilingual models for self-supervised learning and supervised learning and achieve a performance of 6.65 Character Error Rate (CER) points and 25.53 Word Error Rate (WER) points. The benefits of multilingual models are further validated through experimentation with three additional under-resourced languages. Lastly, this work conducts quality assessments with linguists on new recordings to investigate the model’s usefulness in language documentation. We observe an evident degradation for new recordings, indicating the importance of enhancing model robustness. In addition, we find that inaudible content is the main cause of wrong ASR predictions, suggesting related work on incorporating contextual information.
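
For readers unfamiliar with the two metrics reported in the abstract above, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length and expressed in points. The function names and the choices to split on whitespace and to count spaces as characters are assumptions made for illustration; the paper's own scoring setup and text normalization are not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, insertions, and deletions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by reference length, in points (0-100+)."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    # Assumption: words are whitespace-separated tokens.
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    # Assumption: every character, including spaces, counts as a unit.
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    print(round(wer("the cat sat", "the cat sit"), 2))
    print(round(cer("the cat sat", "the cat sit"), 2))
```

Because the denominator is the reference length, both rates can exceed 100 when the hypothesis contains many insertions.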

SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels
Elena Shushkevich | Long Thanh Mai | Manuel V. Loureiro | Steven Derby | Tri Kurniawan Wijaya

The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such as whether a pair of news articles are both about politics can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics under more narrow domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a novel dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four different levels of complexity, specifically designed for the news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.
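
The abstract above names MinHash as one of its baselines without describing the configuration. As a generic illustration of the technique, and not the authors' setup, the sketch below estimates the Jaccard similarity of two texts from MinHash signatures. The word-trigram shingling, the 64 seeded hash functions built from BLAKE2b, and all function names are assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=3):
    """Set of lowercase word k-grams (assumption: k=3, whitespace tokens)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """64-bit hash of an item under a given seed (one hash function per seed)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8)
    return int.from_bytes(digest.digest(), "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "the central bank raised interest rates on tuesday amid inflation fears"
    b = "the central bank raised interest rates on tuesday after strong jobs data"
    sig_a = minhash_signature(shingles(a))
    sig_b = minhash_signature(shingles(b))
    print(estimated_jaccard(sig_a, sig_b))
```

More hash functions reduce the variance of the estimate at the cost of longer signatures; a pair of articles would then be flagged as similar when the estimate exceeds a chosen threshold.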

SPLICE: A Singleton-Enhanced PipeLIne for Coreference REsolution
Yilun Zhu | Siyao Peng | Sameer Pradhan | Amir Zeldes

Singleton mentions, i.e. entities mentioned only once in a text, are important to how humans understand discourse from a theoretical perspective. However, previous attempts to incorporate their detection in end-to-end neural coreference resolution for English have been hampered by the lack of singleton mention spans in the OntoNotes benchmark. This paper addresses this limitation by combining predicted mentions from existing nested NER systems and features derived from OntoNotes syntax trees. With this approach, we create a near approximation of the OntoNotes dataset with all singleton mentions, achieving ~94% recall on a sample of gold singletons. We then propose a two-step neural mention and coreference resolution system, named SPLICE, and compare its performance to the end-to-end approach in two scenarios: the OntoNotes test set and the out-of-domain (OOD) OntoGUM corpus. Results indicate that reconstructed singleton training yields results comparable to end-to-end systems for OntoNotes, while improving OOD stability (+1.1 avg. F1). We conduct error analysis for mention detection and delve into its impact on coreference clustering, revealing that precision improvements deliver more substantial benefits than increases in recall for resolving coreference chains.

Linguistic conventions that arise in dialogue reflect common ground and can increase communicative efficiency. Social robots that can understand these conventions and the process by which they arise have the potential to become efficient communication partners. Nevertheless, it is unclear how robots can engage in convention formation when presented with both familiar and new information. We introduce an adaptable game platform, SPOTTER, to study the dynamics of convention formation for visually grounded referring expressions in both human-human and human-robot interaction. Specifically, we seek to elicit convention forming for members of an inner circle of well-known individuals in the common ground, as opposed to individuals from an outer circle, who are unfamiliar. We release an initial corpus of 5000 utterances from two exploratory pilot experiments in Dutch. Different from previous work focussing on human-human interaction, we find that referring expressions for both familiar and unfamiliar individuals maintain their length throughout human-robot interaction. Stable conventions are formed, although these conventions can be impacted by distracting outer circle individuals. With our distinction between familiar and unfamiliar, we create a contrastive operationalization of common ground, which aids research into convention formation.

SpreadNaLa: A Naturalistic Code Generation Evaluation Dataset of Spreadsheet Formulas
Sebastian Schuster | Ayesha Ansar | Om Agarwal | Vera Demberg

Automatic generation of code from natural language descriptions has emerged as one of the main use cases of large language models (LLMs). This has also led to a proliferation of datasets to track progress in the reliability of code generation models, including domains such as programming challenges and common data science tasks. However, existing datasets primarily target the use of code generation models to aid expert programmers in writing code. In this work, we consider a domain of code generation which is more frequently used by users without sophisticated programming skills: translating English descriptions to spreadsheet formulas that can be used to do everyday data processing tasks. We extract naturalistic instructions from StackOverflow posts and manually verify and standardize the corresponding spreadsheet formulas. We use this dataset to evaluate an off-the-shelf code generation model (GPT 3.5 text-davinci-003) as well as recently proposed pragmatic code generation procedures and find that Code Reviewer reranking (Zhang et al., 2022) performs best among the evaluated methods but still frequently generates formulas that differ from human-generated ones.

STAF: Pushing the Boundaries of Test-Time Adaptation towards Practical Noise Scenarios
Haoyu Xiong | Xinchun Zhang | Leixin Yang | Yu Xiang | Gang Fang

Test-time adaptation (TTA) aims to adapt the neural network to the distribution of the target domain using only unlabeled test data. Most previous TTA methods have achieved success under mild conditions, such as considering only a single or multiple independent static domains. However, in real-world settings, the test data is sampled in a correlated manner and the test environments undergo continual changes over time, which may cause previous TTA methods to fail in practical noise scenarios, i.e., independent noise distribution shifts, continual noise distribution shifts, and continual mixed distribution shifts. To address these issues, we elaborate a Stable Test-time Adaptation Framework, called STAF, to stabilize the adaptation process. Specifically, to boost model robustness to noise distribution shifts, we present a multi-stream perturbation consistency method, enabling weak-to-strong views to be consistent, guided by the weak view from the original sample. Meanwhile, we develop a reliable memory-based corrector which utilizes reliable snapshots between the anchor model and the adapt model to correct prediction bias. Furthermore, we propose a dynamic parameter restoration strategy to alleviate error accumulation and catastrophic forgetting that takes into account both the distribution shift and sample adaptation degree. Extensive experiments demonstrate the robustness and effectiveness of STAF, which pushes the boundaries of test-time adaptation to more realistic scenarios and paves the way for stable deployment of real-world applications.

STAGE: Simple Text Data Augmentation by Graph Exploration
Ho-Seung Kim | YongHoon Kang | Jee-Hyong Lee

Pre-trained language models (PLMs) are widely used for various tasks, but fine-tuning them requires sufficient data. Data augmentation approaches have been proposed as alternatives, but they vary in complexity, cost, and performance. To address these challenges, we propose STAGE (Simple Text Data Augmentation by Graph Exploration), a highly effective method for data augmentation. STAGE utilizes simple modification operations such as insertion, deletion, replacement, and swap. What distinguishes STAGE, however, is its selection of optimal words for each modification. This is achieved by leveraging a word-relation graph called the co-graph. The co-graph takes into account both word frequency and co-occurrence, providing valuable information for operand selection. To assess the performance of STAGE, we conduct evaluations using seven representative datasets and three different PLMs. Our results demonstrate the effectiveness of STAGE across diverse data domains, varying data sizes, and different PLMs. Also, STAGE demonstrates superior performance when compared to previous methods that use simple modification operations or large language models like GPT-3.

Stance Reasoner: Zero-Shot Stance Detection on Social Media with Explicit Reasoning
Maksym Taranukhin | Vered Shwartz | Evangelos Milios

Social media platforms are rich sources of opinionated content. Stance detection allows the automatic extraction of users’ opinions on various topics from such content. We focus on zero-shot stance detection, where the model’s success relies on (a) having knowledge about the target topic; and (b) learning general reasoning strategies that can be employed for new topics. We present Stance Reasoner, an approach to zero-shot stance detection on social media that leverages explicit reasoning over background knowledge to guide the model’s inference about the document’s stance on a target. Specifically, our method uses a pre-trained language model as a source of world knowledge, with the chain-of-thought in-context learning approach to generate intermediate reasoning steps. Stance Reasoner outperforms the current state-of-the-art models on 3 Twitter datasets, including fully supervised models. It can better generalize across targets, while at the same time providing explicit and interpretable explanations for its predictions.

STEntConv: Predicting Disagreement between Reddit Users with Stance Detection and a Signed Graph Convolutional Network
Isabelle Lorge | Li Zhang | Xiaowen Dong | Janet Pierrehumbert

The rise of social media platforms has led to an increase in polarised online discussions, especially on political and socio-cultural topics such as elections and climate change. We propose a simple and entirely novel unsupervised method to better predict whether the authors of two posts agree or disagree, leveraging user stances about named entities obtained from their posts. We present STEntConv, a model which builds a graph of users and named entities weighted by stance and trains a Signed Graph Convolutional Network (SGCN) to detect disagreement between comment and reply posts. We run experiments and ablation studies and show that including this information improves disagreement detection performance on a dataset of Reddit posts for a range of controversial subreddit topics, without the need for platform-specific features or user history.

Step-by-Step: Controlling Arbitrary Style in Text with Large Language Models
Pusheng Liu | Lianwei Wu | Linyong Wang | Sensen Guo | Yang Liu

Recently, the autoregressive framework based on large language models (LLMs) has achieved excellent performance in controlling the generated text to adhere to the required style. These methods guide LLMs through prompt learning to generate target text in an autoregressive manner. However, this approach offers lower controllability and suffers from accumulating errors, where early prediction inaccuracies might influence subsequent word generation. Furthermore, existing prompt-based methods overlook specific region editing, resulting in a lack of localized control over the input text. To overcome these challenges, we propose a novel three-stage prompt-based approach for specific region editing. To alleviate the issue of accumulating errors, we transform the text style transfer task into a text infilling task, guiding the LLMs to modify only a small portion of text within the editing region to achieve style transfer, thus reducing the number of autoregressive iterations. To achieve an effective specific editing region, we adopt both prompt-based and word frequency-based strategies for region selection, subsequently employing a discriminator to validate the efficacy of the selected region. Experiments conducted on several publicly competitive datasets for the text style transfer task confirm that our proposed approach achieves state-of-the-art performance. Keywords: text style transfer, natural language generation, large language models

Step Feasibility-Aware and Error-Correctable Entailment Tree Generation
Junyue Song | Xin Wu | Yi Cai

An entailment tree is a structured reasoning path that clearly demonstrates the process of deriving hypotheses through multiple steps of inference from known premises. It enhances the interpretability of QA systems. Existing methods for generating entailment trees typically employ iterative frameworks to ensure reasoning faithfulness. However, they often suffer from the issue of false feasible steps, where selected steps appear feasible but actually lead to incorrect intermediate conclusions. Moreover, the existing iterative frameworks do not consider error-prone search branches, resulting in error propagation. In this work, we propose SPEH: an iterative entailment tree generation framework with Step feasibility Perception and state Error Handling mechanisms. Step Feasibility Perception enables the model to learn how to choose steps that are not false feasible. State Error Handling includes error detection and backtracking, allowing the model to correct errors when entering incorrect search branches. Experimental results demonstrate the effectiveness of our approach in improving the generation of entailment trees.

Still All Greeklish to Me: Greeklish to Greek Transliteration
Anastasios Toumazatos | John Pavlopoulos | Ion Androutsopoulos | Stavros Vassos

Modern Greek is normally written in the Greek alphabet. In informal online messages, however, Greek is often written using characters available on Latin-character keyboards, a form known as Greeklish. Originally used to bypass the lack of support for the Greek alphabet in older computers, Greeklish is now also used to avoid switching languages on multilingual keyboards, hide spelling mistakes, or as a form of slang. There is no consensus mapping, hence the same Greek word can be written in numerous different ways in Greeklish. Even native Greek speakers may struggle to understand (or be annoyed by) Greeklish, which requires paying careful attention to context to decipher. Greeklish may also be a problem for NLP models trained on Greek datasets written in the Greek alphabet. Experimenting with a range of statistical and deep learning models on both artificial and real-life Greeklish data, we find that: (i) prompting large language models (e.g., GPT-4) performs impressively well with few- or even zero-shot training, outperforming several fine-tuned encoder-decoder models; however (ii) a twenty-year-old statistical Greeklish transliteration model is still very competitive; and (iii) the problem is still far from having been solved; (iv) nevertheless, downstream Greek NLP systems that need to cope with Greeklish, such as moderation classifiers, can benefit significantly even with the current imperfect transliteration systems. We make all our code, models, and data available and suggest future improvements, based on an analysis of our experimental results.

Stories and Personal Experiences in the COVID-19 Discourse
Neele Falk | Gabriella Lapesa

Storytelling, i.e., the use of anecdotes and personal experiences, plays a crucial role in everyday argumentation. This is particularly true for the highly controversial debates that spark in times of crisis - where the focus of the discussion is on heterogeneous aspects of everyday life. For individuals, stories can have a strong persuasive power; for a larger collective, stories can help decision-makers to develop strategies for addressing the challenges people are facing, especially in times of crisis. In this paper, we analyse the use of storytelling in the COVID-19 discourse. We carry out our analysis on three publicly available Reddit datasets, for a total of 367K comments. We automatically annotate the Reddit datasets by detecting spans containing storytelling and classifying them into: a) personal vs. general – is the story experienced by the speaker? b) argumentative function (Does the story clarify a problem, potentially consisting in harm to a specific group? Does it exemplify a solution to a problem, or does it establish the credibility of the speaker?), and c) topic. We then carry out an analysis which establishes the relevance of storytelling in the COVID discourse and further uncovers interactions between topics and the types of stories associated with them.

Strengthening the WiC: New Polysemy Dataset in Hindi and Lack of Cross Lingual Transfer
Haim Dubossarsky | Farheen Dairkee

This study addresses the critical issue of Natural Language Processing in low-resource languages such as Hindi, which, despite having a substantial number of speakers, is limited in linguistic resources. The paper focuses on Word Sense Disambiguation, a fundamental NLP task that deals with polysemous words. It introduces a novel Hindi WSD dataset in the modern WiC format, enabling the training and testing of contextualized models. The primary contributions of this work lie in testing the efficacy of multilingual models to transfer across languages and hence to handle polysemy in low-resource languages, and in providing insights into the minimum training data required for a viable solution. Experiments compare different contextualized models on the WiC task via transfer learning from English to Hindi. Models purely transferred from English yield a poor 55% accuracy, while fine-tuning on Hindi dramatically improves performance to 90% accuracy. This demonstrates the need for language-specific tuning and resources like the introduced Hindi WiC dataset to drive advances in Hindi NLP. The findings offer valuable insights into addressing the NLP needs of widely spoken yet low-resourced languages, shedding light on the problem of transfer learning in these contexts.

StructAM: Enhancing Address Matching through Semantic Understanding of Structure-aware Information
Zhaoqi Zhang | Pasquale Balsebre | Siqiang Luo | Zhen Hai | Jiangping Huang

The task of address matching involves linking unstructured addresses to standard ones in a database. The challenges presented by this task are manifold: misspellings, incomplete information, and variations in address content are some examples. While there have been previous studies on entity matching in natural language processing, existing approaches to address matching still rely on string-based similarity matching or manually designed rules. In this paper, we propose StructAM, a novel method based on pre-trained language models (LMs) and graph neural networks to extract the textual and structured information of the addresses. The proposed method leverages the knowledge acquired by large language models during the pre-training phase, and refines it during the fine-tuning process on the address domain, to obtain address-specific semantic features. Meanwhile, it also applies an attribute attention mechanism based on a Graph Sampling and Aggregation (GraphSAGE) module to capture internal hierarchy information of the address text. To further enhance the accuracy of our algorithm in dirty settings, we incorporate spatial coordinates and contextual information from the surrounding area as auxiliary guidance. We conduct extensive experiments on real-world datasets from four different countries and the results show that StructAM outperforms state-of-the-art baseline approaches for address matching.

Over the past few years, we have witnessed remarkable advancements in Code Pre-trained Models (CodePTMs). These models achieved excellent representation capabilities by designing structure-based pre-training tasks for code. However, how to enhance the absorption of structural knowledge when fine-tuning CodePTMs still remains a significant challenge. To fill this gap, in this paper, we present SAT, a novel structure-enhanced and plug-and-play fine-tuning method for CodePTMs. We first propose a structure loss to quantify the difference between the information learned by CodePTMs and the knowledge extracted from code structure. Specifically, we use the attention scores from Transformer layer as the learned information, and the shortest path length between leaves in abstract syntax trees as the structural knowledge. Subsequently, multi-task learning is introduced to improve the performance of fine-tuning. Experiments conducted on four pre-trained models and two generation tasks demonstrate the effectiveness of our proposed method as a plug-and-play solution. Furthermore, we observed that SAT can benefit CodePTMs more with limited training data.
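
The structural signal named in the abstract above, the shortest path length between leaves of an abstract syntax tree, can be illustrated with a small sketch. This is not the SAT implementation: it uses Python's standard `ast` module as a stand-in parser, and the helper names `leaf_paths`, `leaf_distance`, and `leaf_distance_matrix` are hypothetical.

```python
import ast


def leaf_paths(tree):
    """Collect (node type name, path of child indices from the root) for every leaf."""
    leaves = []

    def walk(node, path):
        children = list(ast.iter_child_nodes(node))
        if not children:
            leaves.append((type(node).__name__, path))
        for index, child in enumerate(children):
            walk(child, path + (index,))

    walk(tree, ())
    return leaves


def leaf_distance(path_a, path_b):
    """Number of edges on the unique tree path between two leaves."""
    common = 0
    for step_a, step_b in zip(path_a, path_b):
        if step_a != step_b:
            break
        common += 1
    # Climb from each leaf up to the lowest common ancestor.
    return (len(path_a) - common) + (len(path_b) - common)


def leaf_distance_matrix(source):
    """Parse source code and return (leaf type names, pairwise distance matrix)."""
    leaves = leaf_paths(ast.parse(source))
    names = [name for name, _ in leaves]
    matrix = [[leaf_distance(p, q) for _, q in leaves] for _, p in leaves]
    return names, matrix
```

Calling `leaf_distance_matrix("x = a + b")` returns one row per AST leaf. How such distances would be compared against attention scores in a structure loss is described in the paper itself and is not reproduced here.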

Structure-aware Generation Model for Cross-Domain Aspect-based Sentiment Classification
Shichen Li | Zhongqing Wang | Yanzhi Xu | Guodong Zhou

Employing pre-trained generation models for cross-domain aspect-based sentiment classification has recently led to large improvements. However, they ignore the importance of syntactic structures, which have shown appealing effectiveness in classification-based models. Different from previous studies, efficiently encoding the syntactic structure in a generation model is challenging because such models are pretrained on natural language, and modeling structured data may lead to catastrophic forgetting of distributional knowledge. In this study, we propose a novel structure-aware generation model to tackle this challenge. In particular, a prompt-driven strategy is designed to bridge the gap between different domains, by capturing implicit syntactic information from the input and output sides. Furthermore, the syntactic structure is explicitly encoded into the structure-aware generation model, which can effectively learn domain-irrelevant features based on syntactic pivot features. Empirical results demonstrate the effectiveness of the proposed structure-aware generation model over several strong baselines. The results also indicate the proposed model is capable of leveraging the input syntactic structure in the generation model.

Unsupervised text style transfer aims to modify the style of a sentence while preserving its content without parallel corpora. Existing approaches attempt to separate content from style, but some words contain both content and style information. This makes them difficult to disentangle, and unsatisfactory disentanglement results in the loss of content information or of the target style. To address this issue, researchers adopted a “cycle reconstruction” mechanism to maintain content information, but it is still hard to achieve satisfactory content preservation due to incomplete disentanglement. In this paper, we propose a new disentanglement-based method, StyleFlow, which effectively avoids the loss of content through a better cycle reconstruction via a reversible encoder. The reversible encoder is a normalizing flow that can not only produce output given input but also, in reverse, infer the exact input given the output. We design a stack of attention-aware coupling layers, where each layer is reversible and adopts the attention mechanism to improve the content-style disentanglement. Moreover, we propose a data augmentation method based on normalizing flow to enhance the training data. Our experiments on sentiment transfer and formality transfer tasks show that StyleFlow outperforms strong baselines on both content preservation and style transfer.
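
The reversibility property described above can be seen in a generic affine coupling layer, the basic building block of many normalizing flows. The sketch below is a plain coupling layer, not the attention-aware layer of StyleFlow; the conditioner `_log_scale_and_shift` and its constants are arbitrary assumptions chosen only to make the example runnable.

```python
import math


def _log_scale_and_shift(x_a):
    """Hypothetical conditioner: any function of x_a works, since it is never inverted."""
    log_scale = [math.tanh(0.5 * v) for v in x_a]
    shift = [0.3 * v + 0.1 for v in x_a]
    return log_scale, shift


def coupling_forward(x):
    """Keep the first half unchanged; scale and shift the second half conditioned on it."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = _log_scale_and_shift(x_a)
    y_b = [b * math.exp(s) + t for b, s, t in zip(x_b, log_scale, shift)]
    return x_a + y_b


def coupling_inverse(y):
    """Recover the input: the untouched half lets us recompute the same scale and shift."""
    if len(y) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = _log_scale_and_shift(y_a)
    x_b = [(b - t) * math.exp(-s) for b, s, t in zip(y_b, log_scale, shift)]
    return y_a + x_b
```

Applying `coupling_inverse(coupling_forward(x))` returns `x` up to floating-point error, which is the property that lets a flow-based encoder reconstruct its input from its output.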

Large Language Models (LLMs) have demonstrated impressive performance across various NLP tasks with just a few prompts via in-context learning. Previous studies have emphasized the pivotal role of well-chosen examples in in-context learning, as opposed to randomly selected instances that exhibit unstable results. A successful example selection scheme depends on multiple factors, while in the context of LLM-based machine translation, the common selection algorithms only consider a single factor, i.e., the similarity between the example source sentence and the input sentence. In this paper, we introduce a novel approach to use multiple translational factors for in-context example selection by using monotone submodular function maximization. The factors include surface/semantic similarity between examples and inputs on both source and target sides, as well as the diversity within examples. Importantly, our framework mathematically guarantees the coordination between these factors, which are different and challenging to reconcile. Additionally, our research uncovers a previously unexamined dimension: unlike other NLP tasks, the translation part of an example is also crucial, a facet disregarded in prior studies. Experiments conducted on BLOOMZ-7.1B and LLAMA2-13B demonstrate that our approach significantly outperforms random selection and robust single-factor baselines across various machine translation tasks.
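
Monotone submodular maximization is commonly approached with a greedy algorithm, which can be sketched as follows. The objective here (a non-negative relevance score plus a word-coverage term that rewards diversity) is an illustrative assumption and not the paper's objective; `greedy_select`, `relevance`, and `coverage_weight` are hypothetical names.

```python
def greedy_select(candidates, relevance, k, coverage_weight=1.0):
    """Greedily pick k example indices maximizing relevance plus word coverage.

    candidates: list of example strings.
    relevance: non-negative score per candidate (assumed precomputed similarity to the input).
    The coverage term counts distinct words covered by the selection, so adding
    near-duplicate examples yields diminishing returns.
    """
    if len(candidates) != len(relevance):
        raise ValueError("candidates and relevance must have the same length")
    selected = []
    covered = set()
    remaining = set(range(len(candidates)))

    for _ in range(min(k, len(candidates))):
        def marginal_gain(index):
            new_words = set(candidates[index].split()) - covered
            return relevance[index] + coverage_weight * len(new_words)

        # Iterate in index order so ties are broken deterministically.
        best = max(sorted(remaining), key=marginal_gain)
        selected.append(best)
        covered |= set(candidates[best].split())
        remaining.remove(best)

    return selected
```

With two near-identical candidates and one different candidate, the second pick goes to the different one even when its relevance is lower, because the duplicate adds no new coverage.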

Deep neural networks (DNNs) are notoriously vulnerable to adversarial attacks that place carefully crafted perturbations on normal examples to fool DNNs. To better understand such attacks, a characterization of the features carried by adversarial examples is needed. In this paper, we tackle this challenge by inspecting the subspaces of sample features through spectral analysis. We first empirically show that the features of clean signals and of adversarial perturbations are each redundant and span low-dimensional linear subspaces with minimal overlap, and that classical low-dimensional subspace projection can suppress perturbation features that lie outside the subspace of clean signals. This makes it possible for DNNs to learn a subspace where only features of clean signals exist while those of perturbations are discarded, which can facilitate the distinction of adversarial examples. To prevent the residual perturbations that are inevitable in subspace learning, we propose an independence criterion to disentangle clean signals from perturbations. Experimental results show that the proposed strategy enables the model to inherently suppress adversaries, which not only boosts model robustness but also motivates new directions of effective adversarial defense.

Sub-Table Rescorer for Table Question Answering
Atsushi Kojima

We propose a sub-table rescorer (STR) to improve the performance of inner table retriever (ITR)-based inference for table question answering. Tabular language models (TLMs) truncate the sequence of a long table due to their input token limits, which degrades accuracy. To solve this problem, ITR extracts sub-table candidates, each corresponding to a part of the larger original table, on the basis of relevance scores to the question for each of the columns and rows. Then, the topN longest sub-tables are selected. Our proposed STR estimates the relevance score between a question and each sub-table. In this work, we explore two different methods to integrate STR into the ITR-based inference. In the first method, STR rescores sub-table candidates, and the topN sub-tables are chosen. Then, the TLM outputs the most confident answer. In the second method, the score calculated by STR is interpolated with the score calculated by the TLM. Then, the most confident answer is chosen. In the experiment, we evaluate the performance on the WikiTableQuestions dataset. By applying STR to the ITR-based inference, we observed 4.4% and 6.3% relative reductions in error rate with the rescoring- and score-fusion-based methods, respectively.
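
The second integration method above, interpolating the STR score with the TLM score, can be sketched as a simple linear fusion. The abstract does not give the interpolation formula or weight, so the linear form, the `weight` parameter, and the function name `fuse_and_pick` are assumptions made for illustration.

```python
def fuse_and_pick(candidates, weight=0.5):
    """Pick the answer with the highest interpolated score.

    candidates: list of (answer, tlm_score, str_score) tuples, one per sub-table.
    weight: assumed interpolation weight on the STR score, between 0 and 1.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if not candidates:
        raise ValueError("at least one candidate is required")

    def fused_score(candidate):
        _, tlm_score, str_score = candidate
        return (1.0 - weight) * tlm_score + weight * str_score

    best_answer, _, _ = max(candidates, key=fused_score)
    return best_answer
```

With `weight=0.0` the choice reduces to the TLM's most confident answer, and raising the weight lets a highly relevant sub-table override a slightly more confident answer drawn from a less relevant one.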

This paper introduces the upgrade of a training corpus for linguistic annotation of modern standard Slovene. The enhancement spans both the size of the corpus and the depth of annotation layers. The revised SUK 1.0 corpus, building on its predecessor ssj500k 2.3, has doubled in size, containing over a million tokens. This expansion integrates three preexisting open-access datasets, all of which have undergone automatic tagging and meticulous manual review across multiple annotation layers, each represented in varying proportions. These layers span tokenization, segmentation, lemmatization, MULTEXT-East morphology, Universal Dependencies, JOS-SYN syntax, semantic role labeling, named entity recognition, and the newly incorporated coreferences. The paper illustrates the annotation processes for each layer while also presenting the results of the new CLASSLA-Stanza annotation tool, trained on the SUK corpus data. As one of the fundamental language resources of modern Slovene, the SUK corpus calls for constant development, as outlined in the concluding section.

SuperST: Superficial Self-Training for Few-Shot Text Classification
Ju-Hyoung Lee | Joonghyuk Hahn | Hyeon-Tae Seo | Jiho Park | Yo-Sub Han

In few-shot text classification, self-training is a popular tool in semi-supervised learning (SSL). It relies on pseudo-labels to expand data, which has demonstrated success. However, these pseudo-labels contain potential noise and provoke a risk of underfitting the decision boundary. While the pseudo-labeled data can indeed be noisy, fully acquiring this flawed data can result in the accumulation of further noise and eventually impact the model performance. Consequently, self-training presents a challenge: mitigating the accumulation of noise in the pseudo-labels. Confronting this challenge, we introduce superficial learning, inspired by pedagogy’s focus on essential knowledge. Superficial learning in pedagogy is a learning scheme that only learns the material ‘to some extent’, without fully understanding it. This approach is usually avoided in education, but, counter-intuitively, in our context we employ superficial learning to acquire only the necessary context from noisy data, effectively avoiding the noise. This concept serves as the foundation for SuperST, our self-training framework. SuperST applies superficial learning to the noisy data and fine-tuning to the less noisy data, creating an efficient learning cycle that prevents overfitting to the noise and spans the decision boundary effectively. Notably, SuperST improves the classifier accuracy for few-shot text classification by up to 18.5% and by 8% on average, compared with the state-of-the-art SSL baselines. We substantiate our claim through empirical experiments and decision boundary analysis.

SwissSLi: The Multi-parallel Sign Language Corpus for Switzerland
Zifan Jiang | Anne Göhring | Amit Moryossef | Rico Sennrich | Sarah Ebling

In this work, we introduce SwissSLi, the first sign language corpus that contains parallel data of all three Swiss sign languages, namely Swiss German Sign Language (DSGS), French Sign Language of Switzerland (LSF-CH), and Italian Sign Language of Switzerland (LIS-CH). The data underlying this corpus originates from television programs in three spoken languages: German, French, and Italian. The programs have for the most part been translated into sign language by deaf translators, resulting in a unique, up to six-way multi-parallel dataset between spoken and sign languages. We describe and release the sign language videos and spoken language subtitles as well as the overall statistics and some derivatives of the raw material. These derived components include cropped videos, pose estimation, phrase/sign-segmented videos, and sentence-segmented subtitles, all of which facilitate downstream tasks such as sign language transcription (glossing) and machine translation. The corpus is publicly available on the SWISSUbase data platform for research purposes only under a CC BY-NC-SA 4.0 license.

Joint entity-relation extraction remains a challenging task in information retrieval, given the intrinsic difficulty in modelling the interdependence between named entity recognition (NER) and relation extraction (RE) sub-tasks. Most existing joint extraction models encode entity and relation features in a sequential or parallel manner, allowing for limited one-way interaction. However, it is not yet clear how to capture the interdependence between these two sub-tasks in a synergistic and mutually reinforcing fashion. With this in mind, we propose a novel approach for joint entity-relation extraction, named Synergetic Interaction Network (SINET), which utilizes a cross-task attention mechanism to effectively leverage contextual associations between NER and RE. Specifically, we construct two sets of distinct token representations for the NER and RE sub-tasks, respectively. Then, both sets of representations interact with one another via a cross-task attention mechanism, which exploits associated contextual information produced by concerted efforts of both NER and RE. Experiments on three benchmark datasets demonstrate that the proposed model achieves significantly better performance in joint entity-relation extraction. Moreover, extended analysis validates that the proposed mechanism can indeed leverage the semantic information produced by NER and RE sub-tasks to boost one another in a complementary way. The source code is available to the public online.

Although there have been some works using prompt learning for Aspect-based Sentiment Analysis (ABSA) tasks, their methods of prompt-tuning are simple and crude. Compared with vanilla fine-tuning methods, prompt learning intuitively bridges the objective form gap between pre-training and fine-tuning. Concretely, simply constructing prompts related to aspect words fails to fully exploit the potential of Pre-trained Language Models (PLMs), and conducting more robust and professional prompt engineering for downstream tasks is a challenging problem that needs to be solved urgently. Therefore, in this paper, we propose a novel Syntax-aware Enhanced Prompt method (SynPrompt), which sufficiently mines the key syntactic information related to aspect words from the syntactic dependency tree. Additionally, to effectively harness the domain-specific knowledge embedded within PLMs for the ABSA tasks, we construct two adaptive prompt frameworks to enhance the perception ability of the above method. After conducting extensive experiments on three benchmark datasets, we have found that our method consistently achieves favorable results. These findings not only demonstrate the effectiveness and rationality of our proposed methods but also provide a powerful alternative to traditional prompt-tuning.

Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation
Kartik Kartik | Sanjana Soni | Anoop Kunchukuttan | Tanmoy Chakraborty | Md. Shad Akhtar

The widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has resulted in a formidable challenge for computational models due to the scarcity of annotated data and the presence of noise. A potential solution to mitigate the data scarcity problem in a low-resource setup is to leverage existing data in a resource-rich language through translation. In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation. First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English, with ~4.2M sentence pairs. Subsequently, we propose RCMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text by parameter sharing across clean and noisy words. Further, we show the adaptability of RCMT in a zero-shot setup for Bengalish to English translation. Our evaluation and comprehensive analyses qualitatively and quantitatively demonstrate the superiority of RCMT over state-of-the-art code-mixed and robust translation methods.

SynTOD: Augmented Response Synthesis for Robust End-to-End Task-Oriented Dialogue System
Nguyen Quang Chieu | Quang-Minh Tran | Khac-Hoai Nam Bui

Task-oriented dialogue (TOD) systems are introduced to solve specific tasks, which focus on training multiple tasks such as language understanding, tracking states, and generating appropriate responses to help users achieve their specific goals. Currently, one of the remaining challenges in this emergent research field is the capability to produce more robust architectures fine-tuned for end-to-end TOD systems. In this study, we consider this issue by exploiting the ability of pre-trained models to provide synthesized responses, which are then used as the input for the fine-tuning process. The main idea is to overcome the gap between the training process and the inference process when fine-tuning end-to-end TOD systems. The experiment on MultiWOZ datasets shows the effectiveness of our model compared with strong baselines in this research field. The source code is available for further exploitation.

Code search with natural language helps us reuse existing code snippets. Thanks to Transformer-based pretraining models, the performance of code search has improved significantly. However, due to the quadratic complexity of multi-head self-attention, there is a limit on the input token length. For efficient training on standard GPUs like the V100, existing pretrained code models, including GraphCodeBERT, CodeBERT, and RoBERTa (code), take the first 256 tokens by default, which makes them unable to represent the complete information of long code that exceeds 256 tokens. To tackle the long code problem, we propose a new baseline, SEA (Split, Encode and Aggregate), which splits long code into code blocks, encodes these blocks into embeddings, and aggregates them to obtain a comprehensive long code representation. With SEA, we can directly use Transformer-based pretraining models to model long code without changing their internal structure or re-pretraining. We also compare SEA with sparse Transformer methods. With GraphCodeBERT as the encoder, SEA achieves an overall mean reciprocal ranking score of 0.785, which is 10.1% higher than GraphCodeBERT on the CodeSearchNet benchmark, justifying SEA as a strong baseline for long code search.
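
The split, encode, and aggregate steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the fixed-size token split, the `toy_encode` stand-in encoder, and the mean aggregation are all assumptions made for this example, and a real system would call a pretrained code encoder instead.

```python
from typing import Callable, List, Sequence


def split_into_blocks(tokens: Sequence[str], block_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive blocks of at most block_size tokens."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]


def toy_encode(block: Sequence[str], dim: int = 4) -> List[float]:
    """Hypothetical stand-in encoder: a bag-of-characters count vector.

    A real pipeline would run a pretrained code model on the block here.
    """
    vec = [0.0] * dim
    for token in block:
        for ch in token:
            vec[ord(ch) % dim] += 1.0
    return vec


def mean_aggregate(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate block embeddings by element-wise mean (an assumed choice)."""
    if not embeddings:
        raise ValueError("need at least one block embedding")
    dim = len(embeddings[0])
    return [sum(e[k] for e in embeddings) / len(embeddings) for k in range(dim)]


def encode_long_code(
    tokens: Sequence[str],
    block_size: int = 256,
    encode: Callable[[Sequence[str]], List[float]] = toy_encode,
) -> List[float]:
    """Split long code into blocks, encode each block, and aggregate the results."""
    blocks = split_into_blocks(tokens, block_size)
    return mean_aggregate([encode(block) for block in blocks])
```

Because each block stays within the encoder's input limit, the encoder itself does not need to change; only the splitting and aggregation steps are added around it.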

Event relation extraction (ERE) is a critical and fundamental challenge for natural language processing. Existing work mainly focuses on directly modeling the entire document, which cannot effectively handle long-range dependencies and information redundancy. To address these issues, we propose a cluster-aware compression method for improving event relation extraction (TacoERE), which explores a compression-then-extraction paradigm. Specifically, we first introduce document clustering for modeling event dependencies. It splits the document into intra- and inter-clusters, where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model the related events at arbitrary distances. Secondly, we utilize cluster summarization to simplify and highlight important text content of clusters for mitigating information redundancy and event distance. We have conducted extensive experiments on both pre-trained language models, such as RoBERTa, and large language models, such as ChatGPT and GPT-4, on three ERE datasets, i.e., MAVEN-ERE, EventStoryLine and HiEve. Experimental results demonstrate that TacoERE is an effective method for ERE.

TACO – Twitter Arguments from COnversations
Marc Feger | Stefan Dietze

Twitter has emerged as a global hub for engaging in online conversations and as a research corpus for various disciplines that have recognized the significance of its user-generated content. Argument mining is an important analytical task for processing and understanding online discourse. Specifically, it aims to identify the structural elements of arguments, denoted as information and inference. These elements, however, are not static and may require context within the conversation they are in, yet there is a lack of data and annotation frameworks addressing this dynamic aspect on Twitter. We contribute TACO, the first dataset of Twitter Arguments utilizing 1,814 tweets covering 200 entire COnversations spanning six heterogeneous topics annotated with an agreement of 0.718 Krippendorff’s α among six experts. Second, we provide our annotation framework, incorporating definitions from the Cambridge Dictionary, to define and identify argument components on Twitter. Our transformer-based classifier achieves an 85.06% macro F1 baseline score in detecting arguments. Moreover, our data reveals that Twitter users tend to engage in discussions involving informed inferences and information. TACO serves multiple purposes, such as training tweet classifiers to manage tweets based on inference and information elements, while also providing valuable insights into the conversational reply patterns of tweets.

TAeKD: Teacher Assistant Enhanced Knowledge Distillation for Closed-Source Multilingual Neural Machine Translation
Bo Lv | Xin Liu | Kaiwen Wei | Ping Luo | Yue Yu

Knowledge Distillation (KD) serves as an efficient method for transferring language knowledge from open-source large language models (LLMs) to more computationally efficient models. However, challenges arise when attempting to apply vanilla KD methods to transfer knowledge from closed-source Multilingual Neural Machine Translation (MNMT) models based on LLMs. In this scenario, the soft labels and training data are not accessible, making it difficult to achieve effective knowledge transfer. To address this issue, this paper proposes a Teacher Assistant enhanced Knowledge Distillation (TAeKD) method to augment the knowledge transfer capacity from closed-source MNMT models. Specifically, TAeKD designs a fusion model that integrates translation outputs from multiple closed-source models to generate soft labels and training samples. Furthermore, a quality assessment learning mechanism is introduced to enhance the generalization of the fusion model and elevate the quality of the fusion data used to train the student model. To facilitate research on knowledge transfer from MNMT models, we also introduce FuseData, a benchmark consisting of a blend of translations from multiple closed-source systems. The experimental results show that TAeKD outperforms the previous state-of-the-art KD methods on both WMT22 and FLORES-101 test sets.

TaiChi: Improving the Robustness of NLP Models by Seeking Common Ground While Reserving Differences
Huimin Chen | Chengyu Wang | Yanhao Wang | Cen Chen | Yinggui Wang

Recent studies have shown that Pre-trained Language Models (PLMs) are vulnerable to adversarial examples, crafted by introducing human-imperceptible perturbations to clean examples to deceive the models. This vulnerability stems from the divergence in the data distributions of clean and adversarial examples. Therefore, addressing this issue involves teaching the model to diminish the differences between the two types of samples and to focus more on their similarities. To this end, we propose a novel approach named TaiChi that employs a Siamese network architecture. Specifically, it consists of two sub-networks sharing the same structure but trained on clean and adversarial samples, respectively, and uses a contrastive learning strategy to encourage the generation of similar language representations for both kinds of samples. Furthermore, it utilizes the Kullback-Leibler (KL) divergence loss to enhance the consistency in the predictive behavior of the two sub-networks. Extensive experiments across three widely used datasets demonstrate that TaiChi achieves superior trade-offs between robustness to adversarial attacks at token and character levels and accuracy on clean examples compared to previous defense methods. Our code and data are publicly available at https://github.com/sai4july/TaiChi.
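
As a rough illustration of a KL-based consistency term between two sub-networks, the sketch below compares the class distributions predicted for a clean example and its adversarial counterpart. The symmetric form and the function names are assumptions for this example; the abstract does not specify the exact formulation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def consistency_loss(clean_logits: Sequence[float], adv_logits: Sequence[float]) -> float:
    """Symmetric KL between clean and adversarial predictions (assumed form)."""
    p = softmax(clean_logits)
    q = softmax(adv_logits)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

The loss is zero when both sub-networks predict the same distribution and grows as their predictions diverge.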

Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction
Ziyang Xu | Keqin Peng | Liang Ding | Dacheng Tao | Xiliang Lu

Recent research shows that pre-trained language models (PLMs) suffer from “prompt bias” in factual knowledge extraction, i.e., prompts tend to introduce biases toward specific labels. Prompt bias presents a significant challenge in assessing the factual knowledge within PLMs. Therefore, this paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias. We show that: 1) all prompts in the experiments exhibit non-negligible bias, with gradient-based prompts like AutoPrompt and OptiPrompt displaying significantly higher levels of bias; 2) prompt bias can amplify benchmark accuracy unreasonably by overfitting the test datasets, especially on imbalanced datasets like LAMA. Based on these findings, we propose a representation-based approach to mitigate the prompt bias during inference time. Specifically, we first estimate the biased representation using prompt-only querying, and then remove it from the model’s internal representations to generate the debiased representations, which are used to produce the final debiased outputs. Experiments across various prompts, PLMs, and benchmarks show that our approach can not only correct the overfitted performance caused by prompt bias, but also significantly improve the prompt retrieval capability (up to 10% absolute performance gain). These results indicate that our approach effectively alleviates prompt bias in knowledge evaluation, thereby enhancing the reliability of benchmark assessments. Hopefully, our plug-and-play approach can be a gold standard to strengthen PLMs toward reliable knowledge bases. Code and data are released at https://github.com/FelliYang/PromptBias.
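
The idea of estimating a bias representation from a prompt-only query and removing it before prediction can be sketched as below. This is a toy illustration under strong assumptions: representations are plain vectors, removal is a simple subtraction, and labels are scored by dot product; the names `debias`, `score_labels`, and `predict` are hypothetical and do not come from the released code.

```python
from typing import Dict, List, Sequence


def debias(representation: Sequence[float], prompt_only: Sequence[float]) -> List[float]:
    """Remove the prompt-only (bias) representation by subtraction (assumed form)."""
    return [r - b for r, b in zip(representation, prompt_only)]


def score_labels(
    representation: Sequence[float], label_vectors: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Score each candidate label by a dot product with the representation."""
    return {
        label: sum(r * w for r, w in zip(representation, vec))
        for label, vec in label_vectors.items()
    }


def predict(representation: Sequence[float], label_vectors: Dict[str, Sequence[float]]) -> str:
    """Return the highest-scoring label."""
    scores = score_labels(representation, label_vectors)
    return max(scores, key=scores.get)


# Toy example: the prompt alone already leans toward "London".
label_vectors = {"Paris": [1.0, 0.0], "London": [0.0, 1.0]}
prompt_only_representation = [0.0, 2.0]   # obtained from a query without the subject
full_representation = [1.5, 2.5]          # obtained from the full query

biased_prediction = predict(full_representation, label_vectors)
debiased_prediction = predict(
    debias(full_representation, prompt_only_representation), label_vectors
)
```

In this toy setting the biased prediction follows the prompt's preference, while the debiased representation favors the label supported by the rest of the query.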

Researchers have attempted to mitigate lexical bias in toxic language detection (TLD). However, existing methods fail to disentangle the “useful” and “misleading” impact of lexical bias on model decisions. Therefore, they do not effectively exploit the positive effects of the bias and lead to a degradation in the detection performance of the debiased model. In this paper, we propose a Counterfactual Causal Debiasing Framework (CCDF) to mitigate lexical bias in TLD. It preserves the “useful impact” of lexical bias and eliminates the “misleading impact”. Specifically, we first represent the total effect of the original sentence and biased tokens on decisions from a causal view. We then conduct counterfactual inference to exclude the direct causal effect of lexical bias from the total effect. Empirical evaluations demonstrate that the debiased TLD model incorporating CCDF achieves state-of-the-art performance in both accuracy and fairness compared to competitive baselines applied on several vanilla models. The generalization capability of our model outperforms current debiased models for out-of-distribution data.

TAPASGO: Transfer Learning towards a German-Language Tabular Question Answering Model
Dominik Andreas Kowieski | Michael Hellwig | Thomas Feilhauer

Processing tabular data holds significant importance across various domains and applications. This study investigates the performance and limitations of fine-tuned models for tabular data analysis, specifically focusing on using fine-tuning mechanics on an English model towards a potential German model. The validation of the effectiveness of the transfer learning approach compares the performance of the fine-tuned German model and of the original English model on test data from the German training set. A potential shortcut that translates the German test data into English serves for comparison. Results reveal that the fine-tuned model outperforms the original model significantly, demonstrating the effectiveness of transfer learning even for a limited amount of training data. One also observes that the English model can effectively process translated German tabular data, albeit with a slight accuracy drop compared to fine-tuning. The model evaluation extends to real-world data extracted from the sustainability reports of a financial institution. The fine-tuned model proves superior in extracting knowledge from these training-unrelated tables, indicating its potential applicability in practical scenarios. This paper also releases the first manually annotated dataset for German Table Question Answering and the related annotation tool.

Target-Adaptive Consistency Enhanced Prompt-Tuning for Multi-Domain Stance Detection
Shaokang Wang | Li Pan

Stance detection is a fundamental task in Natural Language Processing (NLP). It is challenging due to diverse expressions and topics related to the targets from multiple domains. Recently, prompt-tuning has been introduced to convert the original task into a cloze-style prediction task, achieving impressive results. Many prompt-tuning-based methods focus on one or two classic scenarios with concrete external knowledge enhancement. However, when facing intricate information in multi-domain stance detection, these methods cannot adapt to multi-domain semantics. In this paper, we propose a novel target-adaptive consistency enhanced prompt-tuning method (TCP) for stance detection with multiple domains. TCP incorporates target knowledge and prior knowledge to construct target-adaptive verbalizers for diverse domains and employs pilot experiments distillation to enhance the consistency between verbalizers and model training. Specifically, to capture the knowledge from multiple domains, TCP uses a target-adaptive candidate mining strategy to obtain the domain-related candidates. Then, TCP refines them with prior attributes to ensure prediction consistency. Because the Pre-trained Language Models (PLMs) used in prompt-tuning have large-scale parameters, changing only the verbalizer without corresponding tuning has a limited impact on the training process. Target-aware pilot experiments are therefore conducted to enhance the consistency between the verbalizer and training by distilling the target-adaptive knowledge into prompt-tuning. Extensive experiments and ablation studies demonstrate that TCP outperforms the state-of-the-art methods on nine stance detection datasets from multiple domains.

Targeted Syntactic Evaluation on the Chomsky Hierarchy
Taiga Someya | Ryo Yoshida | Yohei Oseki

In this paper, we propose a novel evaluation paradigm for Targeted Syntactic Evaluations, where we assess how well language models can recognize linguistic phenomena situated at different levels of the Chomsky hierarchy. Specifically, we create formal languages that abstract four syntactic phenomena in natural languages, each identified at a different level of the Chomsky hierarchy, and use these to evaluate the capabilities of language models: (1) (Adj)^n NP type, (2) NP^n VP^n type, (3) Nested Dependency type, and (4) Cross Serial Dependency type. We first train three different language models (LSTM, Transformer LM, and Stack-RNN) on language modeling tasks and then evaluate them using pairs of a positive and a negative sentence by investigating whether they can assign a higher probability to the positive sentence than the negative one. Our results demonstrate that all language models have the ability to capture the structural patterns of the (Adj)^n NP type formal language. However, LSTM and Transformer LM failed to capture the NP^n VP^n type language, and no architecture could recognize the Nested Dependency and Cross Serial Dependency types correctly. Neural language models, especially Transformer LMs, have exhibited high performance across a multitude of downstream tasks, leading to the perception that they possess an understanding of natural languages. However, our findings suggest that these models may not necessarily comprehend the syntactic structures that underlie natural language phenomena such as dependency. Rather, it appears that they may extend grammatical rules equivalent to regular grammars to approximate the rules governing dependencies.
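
The pairwise evaluation described above can be illustrated with a small sketch for the second language type (n NP tokens followed by n VP tokens). How the paper constructs its negative sentences is not stated in the abstract; here a negative is assumed to be a string with a mismatched number of VP tokens, and `oracle_log_prob` is a hypothetical scorer standing in for a trained language model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sentence = List[str]


def make_pair(n: int, rng: random.Random) -> Tuple[Sentence, Sentence]:
    """Return (positive, negative) strings for the NP^n VP^n language.

    The negative keeps n NP tokens but uses a different number of VP tokens
    (an assumed way of building ungrammatical strings).
    """
    positive = ["NP"] * n + ["VP"] * n
    wrong_counts = [m for m in range(1, 2 * n + 1) if m != n]
    negative = ["NP"] * n + ["VP"] * rng.choice(wrong_counts)
    return positive, negative


def pair_accuracy(
    pairs: Sequence[Tuple[Sentence, Sentence]],
    log_prob: Callable[[Sentence], float],
) -> float:
    """Fraction of pairs where the positive sentence receives the higher score."""
    correct = sum(1 for pos, neg in pairs if log_prob(pos) > log_prob(neg))
    return correct / len(pairs)


def oracle_log_prob(sentence: Sentence) -> float:
    """Hypothetical scorer that prefers strings with matching NP and VP counts."""
    return 0.0 if sentence.count("NP") == sentence.count("VP") else -1.0


rng = random.Random(0)
pairs = [make_pair(n, rng) for n in range(1, 6)]
accuracy = pair_accuracy(pairs, oracle_log_prob)
```

Replacing `oracle_log_prob` with the sentence log-probability of a trained model gives the accuracy of that model on this language type.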

In recent years, there has been a significant increase in interest in developing Spoken Language Understanding (SLU) systems. SLU involves extracting a list of semantic information from the speech signal. A major issue for SLU systems is the lack of a sufficient amount of bi-modal (audio and textual semantic annotation) training data. Existing SLU resources are mainly available in high-resource languages such as English, Mandarin and French. However, one of the current challenges concerning low-resourced languages is data collection and annotation. In this work, we present a new freely available corpus, named TARIC-SLU, composed of railway transport conversations in Tunisian dialect that is continuously annotated in dialogue acts and slots. We describe the semantic model of the dataset, the data, and the experiments conducted to build ASR-based and SLU-based baseline models. To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and will be integrated into SpeechBrain, a popular open-source conversational AI toolkit based on PyTorch.

As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically. Different from the image captioning task, visual storytelling requires not only modeling the relationships between objects in the image but also mining the connections between adjacent images. Recent approaches primarily utilize either end-to-end frameworks or multi-stage frameworks to generate relevant stories, but they usually overlook latent topic information. In this paper, in order to generate a more coherent and relevant story, we propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST). In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives. Then we apply two topic-consistent reinforcement learning rewards to identify the discrepancy between the generated story and the human-labeled story so as to refine the whole generation process. Extensive experimental results on the VIST dataset and human evaluation demonstrate that our proposed model outperforms most of the competitive models across multiple evaluation metrics.

Task-agnostic Distillation of Encoder-Decoder Language Models
Chen Zhang | Yang Yang | Qiuchi Li | Jingang Wang | Dawei Song

Finetuning pretrained language models (LMs) has enabled appealing performance on a diverse array of tasks. The intriguing task-agnostic property has driven a shift in focus from task-specific to task-agnostic distillation of LMs. While task-agnostic, compute-efficient, performance-preserved LMs can be yielded by task-agnostic distillation, previous studies mainly focus on the distillation of either encoder-only LMs (e.g., BERT) or decoder-only ones (e.g., GPT), yet largely neglect that distillation of encoder-decoder LMs (e.g., T5) can exhibit very distinct behaviors. Frustratingly, we discover that existing task-agnostic distillation methods can fail to handle the distillation of encoder-decoder LMs. To meet this demand, we explore a few paths and uncover a path named MiniEnD that successfully tackles the distillation of encoder-decoder LMs in a task-agnostic fashion. We examine MiniEnD on language understanding and abstractive summarization. The results showcase that MiniEnD is generally effective and is competitive compared to other alternatives. We further scale MiniEnD up to distillation of 3B encoder-decoder language models with interpolated distillation. The results imply the opportunities and challenges in distilling large language models (e.g., LLaMA).

Task-Oriented Paraphrase Analytics
Marcel Gohsen | Matthias Hagen | Martin Potthast | Benno Stein

Since paraphrasing is an ill-defined task, the term “paraphrasing” covers text transformation tasks with different characteristics. Consequently, existing paraphrasing studies have applied quite different (explicit and implicit) criteria as to when a pair of texts is to be considered a paraphrase, all of which amount to postulating a certain level of semantic or lexical similarity. In this paper, we conduct a literature review and propose a taxonomy to organize the 25 identified paraphrasing (sub-)tasks. Using classifiers trained to identify the tasks that a given paraphrasing instance fits, we find that the distributions of task-specific instances in the known paraphrase corpora vary substantially. This means that the use of these corpora, without the respective paraphrase conditions being clearly defined (which is the normal case), must lead to incomparable and misleading results.

tasksource: A Large Collection of NLP tasks with a Structured Dataset Preprocessing Framework
Damien Sileo

The HuggingFace Datasets Hub hosts thousands of datasets, offering exciting opportunities for language model training and evaluation. However, datasets for a specific task type often have different structures, making harmonization challenging, which prevents the interchangeable use of comparable datasets. As a result, multi-task training or evaluation necessitates manual work to fit data into task templates. Several initiatives independently tackle this issue by releasing harmonized datasets or providing harmonization code to preprocess datasets into a consistent format. We identify patterns in such preprocessings, such as column renaming, or more complex patterns. We then propose an annotation framework that enables concise, readable, and reusable preprocessing annotations. tasksource annotates more than 600 task preprocessings and provides a backend to automate dataset alignment. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable parameter count according to an external evaluation.
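
To make the column-renaming pattern concrete, the sketch below expresses a preprocessing step as a small declarative mapping from a shared task schema to dataset-specific column names. It does not use or reproduce the tasksource API; the function `apply_column_mapping`, the schema fields, and the sample rows are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def apply_column_mapping(rows: List[Row], mapping: Dict[str, str]) -> List[Row]:
    """Rename dataset-specific columns to a shared schema.

    `mapping` goes from target field name to source column name; columns that
    are not mentioned in the mapping are dropped.
    """
    return [{target: row[source] for target, source in mapping.items()} for row in rows]


# Two hypothetical datasets for the same task type, with different column names.
dataset_a = [{"premise": "A dog runs.", "hypothesis": "An animal moves.", "gold": 0}]
dataset_b = [{"text_a": "It rains.", "text_b": "It is dry.", "y": 2, "source": "web"}]

# One short annotation per dataset maps it onto the shared schema.
mapping_a = {"sentence1": "premise", "sentence2": "hypothesis", "labels": "gold"}
mapping_b = {"sentence1": "text_a", "sentence2": "text_b", "labels": "y"}

harmonized = apply_column_mapping(dataset_a, mapping_a) + apply_column_mapping(dataset_b, mapping_b)
```

After the mapping, both datasets expose the same fields and can be mixed in multi-task training or evaluation without per-dataset glue code.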

Large Language Models (LLMs) have achieved impressive results in Machine Translation by simply following instructions, even without training on parallel data. However, LLMs still face challenges on low-resource languages due to the lack of pre-training data. In real-world situations, humans can become proficient in their native languages through abundant and meaningful social interactions and can also learn foreign languages effectively using well-organized textbooks. Drawing inspiration from human learning patterns, we introduce the Translate After LEarNing Textbook (TALENT) approach, which aims to enhance LLMs’ ability to translate low-resource languages by learning from a textbook. TALENT follows a step-by-step process: (1) Creating a Textbook for low-resource languages. (2) Guiding LLMs to absorb the Textbook’s content for Syntax Patterns. (3) Enhancing translation by utilizing the Textbook and Syntax Patterns. We thoroughly assess TALENT’s performance using 112 low-resource languages from FLORES-200 with two LLMs: ChatGPT and BLOOMZ. Evaluation across three different metrics reveals that TALENT consistently enhances translation performance by 14.8% compared to zero-shot baselines. Further analysis demonstrates that TALENT not only improves LLMs’ comprehension of low-resource languages but also equips them with the knowledge needed to generate accurate and fluent sentences in these languages.

TECA: A Two-stage Approach with Controllable Attention Soft Prompt for Few-shot Nested Named Entity Recognition
Yuanyuan Xu | Linhai Zhang | Deyu Zhou

Few-shot nested named entity recognition (NER), which identifies nested named entities from a small amount of labeled data, has attracted much attention. Recently, a span-based method based on three stages (focusing, bridging and prompting) has been proposed for few-shot nested NER. However, such a span-based approach for few-shot nested NER suffers from two challenges: 1) error propagation because of its 3-stage pipeline-based framework; 2) ignoring the relationship between inner and outer entities, which is crucial for few-shot nested NER. Therefore, in this work, we propose a two-stage approach with a controllable attention soft prompt for few-shot nested named entity recognition (TECA). It consists of two components: span part identification and entity mention recognition. The span part identification provides possible entity mentions without an extra filtering module. The entity mention recognition pays fine-grained attention to the inner and outer entities and the corresponding adjacent context through the controllable attention soft prompt to classify the candidate entity mentions. Experimental results show that the TECA approach achieves state-of-the-art performance consistently on the four benchmark datasets (ACE2004, ACE2005, GENIA, and KBP2017) and outperforms several competing baseline models on F1-score by 5.62% on ACE04, 5.11% on ACE05, 3.41% on KBP2017 and 0.7% on GENIA in the 10-shot setting.

TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu
Gopichand Kanumolu | Lokesh Madasu | Nirmal Surange | Manish Shrivastava

News headline generation is a crucial task in increasing productivity for both the readers and producers of news. This task can easily be aided by automated news headline-generation models. However, the presence of irrelevant headlines in scraped news articles results in sub-optimal performance of generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well-established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first-ever human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models using the TeClass dataset. The headlines generated by the models fine-tuned on highly relevant article-headline pairs showed an increase of about 5 points in ROUGE-L scores. To encourage future research, the annotated dataset as well as the annotation guidelines will be made publicly available.

TED-EL: A Corpus for Speech Entity Linking
Silin Li | Ruoyu Song | Tianwei Lan | Zeming Liu | Yuhang Guo

Speech entity linking aims to recognize mentions from speech and link them to entities in knowledge bases. Previous work on entity linking mainly focuses on visual context and text context. In contrast, speech entity linking focuses on audio context. In this paper, we first propose the speech entity linking task. To facilitate the study of this task, we propose the first speech entity linking dataset, TED-EL. Our corpus is a high-quality, human-annotated, audio, text, and mention-entity pair parallel dataset derived from Technology, Entertainment, Design (TED) talks and includes a wide range of entity types (24 types). Based on TED-EL, we designed two types of models: ranking-based and generative speech entity linking models. We conducted experiments on the TED-EL dataset for both types of models. The results show that the ranking-based models outperform the generative models, achieving an F1 score of 60.68%.

Tell Me Again! a Large-Scale Dataset of Multiple Summaries for the Same Story
Hans Ole Hatzel | Chris Biemann

A wide body of research is concerned with the semantics of narratives, both in terms of understanding narratives and generating fictional narratives and stories. We provide a dataset of summaries to be used as a proxy for entire stories or for the analysis of the summaries themselves. Our dataset consists of a total of 96,831 individual summaries across 29,505 stories. We intend for the dataset to be used for training and evaluation of embedding representations for stories, specifically the stories’ narratives. The summary data is harvested from five different language versions of Wikipedia. Our dataset comes with rich metadata, which we extract from Wikidata, enabling a wide range of applications that operate on story summaries in conjunction with metadata. To set baseline results, we run retrieval experiments on the dataset, exploring the capability of similarity models in retrieving summaries of the same story. For this retrieval, a crucial element is to not place too much emphasis on the named entities, as this can enable retrieval of other summaries for the same work without taking the narrative into account.

Reasoning over the Temporal Knowledge Graph (TKG) that predicts facts in the future has received much attention. Most previous works attempt to model temporal dynamics with knowledge graphs and graph convolution networks. However, these methods lack the consideration of high-order interactions between objects in TKG, which is an important factor to predict future facts. To address this problem, we introduce dynamic hypergraph embedding for temporal knowledge graph reasoning. Specifically, we obtain high-order interactions by constructing hypergraphs based on temporal knowledge graphs at different timestamps. Besides, we integrate the differences caused by time into the hypergraph representation in order to fit TKG. Then, we adapt dynamic meta-embedding for temporal hypergraph representation that allows our model to choose the appropriate high-order interactions for downstream reasoning. Experimental results on public TKG datasets show that our method outperforms the baselines. Furthermore, the analysis part demonstrates that the proposed method brings good interpretation for the predicted results.

Term-Driven Forward-Looking Claim Synthesis in Earnings Calls
Chung-Chi Chen | Hiroya Takamura

Argument synthesis aims to generate rational claims, representing a fundamental objective in this field. While existing models excel in summarizing arguments and engaging in debates, we observe a critical gap in their ability to generate accurate arguments that incorporate forward-looking perspectives. In light of this observation, this paper introduces a novel task called “forward-looking claim planning.” We delve into this task by exploring the efficacy of well-performing classification and generation models. Furthermore, we propose several customized preprocessing methods that yield substantial performance improvements. Through comprehensive discussion and analysis, we also outline a future research agenda for the forward-looking claim planning task.

text2story: A Python Toolkit to Extract and Visualize Story Components of Narrative Text
Evelin Amorim | Ricardo Campos | Alipio Jorge | Pedro Mota | Rúben Almeida

Story components, namely, events, time, participants, and their relations, are present in narrative texts from different domains such as journalism, medicine, finance, and law. The automatic extraction of narrative elements encompasses several NLP tasks such as Named Entity Recognition, Semantic Role Labeling, Event Extraction, Coreference Resolution, and Temporal Inference. The text2story Python package, an easy-to-use modular library, supports the narrative extraction and visualization pipeline. The package contains an array of narrative extraction tools that can be used separately or in sequence. With this toolkit, end users can process free text in English or Portuguese and obtain formal representations, like standard annotation files or a formal logical representation. The toolkit also enables narrative visualization as Message Sequence Charts (MSC), Knowledge Graphs, and Bubble Diagrams, making it useful to visualize and transform human-annotated narratives. The package combines the use of off-the-shelf and custom tools and is easily patched (replacing existing components) and extended (e.g. with new visualizations). It includes an experimental module for narrative element effectiveness assessment and is therefore also a valuable asset for researchers developing solutions for narrative extraction. To evaluate the baseline components, we present some results of the main annotators embedded in our packages for datasets in English and Portuguese. We also compare the results with the extraction of narrative elements by GPT-3, a robust LLM.

Narratives have been the subject of extensive research across various scientific fields such as linguistics and computer science. However, the scarcity of freely available datasets, essential for studying this genre, remains a significant obstacle. Furthermore, datasets annotated with narrative components and their morphosyntactic and semantic information are even scarcer. To address this gap, we developed the Text2Story Lusa datasets, which consist of a collection of news articles in European Portuguese. The first dataset consists of 357 news articles and the second dataset comprises a subset of 117 manually densely annotated articles, totaling over 50 thousand individual annotations. By focusing on texts with substantial narrative elements, we aim to provide a valuable resource for studying narrative structures in European Portuguese news articles. On the one hand, the first dataset provides researchers with data to study narratives from various perspectives. On the other hand, the annotated dataset facilitates research in information extraction and related tasks, particularly in the context of narrative extraction pipelines. Both datasets are made available adhering to FAIR principles, thereby enhancing their utility within the research community.

Text360Nav: 360-Degree Image Captioning Dataset for Urban Pedestrians Navigation
Chieko Nishimura | Shuhei Kurita | Yohei Seki

Text feedback from urban scenes is a crucial tool for pedestrians to understand surroundings, obstacles, and safe pathways. However, existing image captioning datasets often concentrate on the overall image description and lack detailed scene descriptions, overlooking features for pedestrians walking on urban streets. We developed a new dataset to assist pedestrians in urban scenes using 360-degree camera images. Through our dataset of Text360Nav, we aim to provide textual feedback from machinery visual perception such as 360-degree cameras to visually impaired individuals and distracted pedestrians navigating urban streets, including those engrossed in their smartphones while walking. In experiments, we combined our dataset with multimodal generative models and observed that models trained with our dataset can generate textual descriptions focusing on street objects and obstacles that are meaningful in urban scenes in both quantitative and qualitative analyses, thus supporting the effectiveness of our dataset for urban pedestrian navigation.

Text Filtering Classifiers for Medium-Resource Languages
Jón Daðason | Hrafn Loftsson

Web-crawled corpora are essential resources for linguistic and NLP research, offering far more data than is available from curated corpora. However, they often contain a great deal of low-quality texts which can complicate research and degrade the quality of pre-trained language models. Therefore, they are typically filtered, e.g. by applying rules or classifiers. In this paper, we compare the effectiveness of various text filtering classifiers and measure their impact on language model performance for three medium-resource languages. We present TQ-IS, an Icelandic text quality dataset consisting of 2,000 web-crawled documents, in which spans of low-quality text have been manually identified and labeled. We then evaluate a perplexity-based classifier, a supervised classifier trained on TQ-IS, and a self-supervised classifier trained to discern between documents from curated and web-crawled corpora on Icelandic, Estonian and Basque. We find that these classifiers obtain F1 scores of 94.48%, 99.01% and 93.40%, respectively, when evaluated on the TQ-IS dataset. Furthermore, our results show that while adding filtered web-crawled text to a pre-training corpus can improve downstream performance for pre-trained language models, any improvement is likely to remain modest unless the web-crawled corpus is significantly larger in size.

Text Style Transfer Evaluation Using Large Language Models
Phil Sidney Ostheimer | Mayank Kumar Nagda | Marius Kloft | Sophie Fellenz

Evaluating Text Style Transfer (TST) is a complex task due to its multi-faceted nature. The quality of the generated text is measured based on challenging factors, such as style transfer accuracy, content preservation, and overall fluency. While human evaluation is considered to be the gold standard in TST assessment, it is costly and often hard to reproduce. Therefore, automated metrics are prevalent in these domains. Nonetheless, it is uncertain whether and to what extent these automated metrics correlate with human evaluations. Recent strides in Large Language Models (LLMs) have showcased their capacity to match and even exceed average human performance across diverse, unseen tasks. This suggests that LLMs could be a viable alternative to human evaluation and other automated metrics in TST evaluation. We compare the results of different LLMs in TST evaluation using multiple input prompts. Our findings highlight a strong correlation between (even zero-shot) prompting and human evaluation, showing that LLMs often outperform traditional automated metrics. Furthermore, we introduce the concept of prompt ensembling, demonstrating its ability to enhance the robustness of TST evaluation. This research contributes to the ongoing efforts for more robust and diverse evaluation methods by standardizing and validating TST evaluation with LLMs.

Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer
Pranav Arora | Selen Pehlivan | Jorma Laaksonen

The rapid proliferation of multimedia content has necessitated the development of effective multimodal video retrieval systems. Multimodal video retrieval is a non-trivial task involving retrieval of relevant information across different modalities, such as text, audio, and visual. This work aims to improve multimodal retrieval by guiding the creation of a shared embedding space with task-specific contrastive loss functions. An important aspect of our work is to propose a model that learns retrieval cues for the textual query from multiple modalities both separately and jointly within a hierarchical architecture that can be flexibly extended and fine-tuned for any number of modalities. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the textual and cross-modal representations. The proposed approach is quantitatively evaluated on the MSR-VTT and YouCook2 text-to-video retrieval benchmark datasets. The results showcase that the approach not only holds its own against state-of-the-art methods, but also outperforms them in a number of scenarios, with notable relative improvements over the baseline in R@1, R@5 and R@10 metrics.

Textual Coverage of Eventive Entries in Lexical Semantic Resources
Eva Fučíková | Cristina Fernández Alcaina | Jan Hajič | Zdeňka Urešová

This short paper focuses on the coverage of eventive entries (verbs, predicates, etc.) of some well-known lexical semantic resources when applied to random running texts taken from the internet. While coverage gaps are often reported for manually created lexicons (which is the case of most semantically-oriented lexical ones), it was our aim to quantify these gaps, cross-lingually, on a new purely textual resource set produced by the HPLT Project from crawled internet data. Several English, German, Spanish and Czech lexical semantic resources (which, for the most part, focus on verbs and predicates) have been selected for this experiment. We also describe the challenges related to the fact that these resources are (to a varying extent) semantically oriented, meaning that the texts have to be preprocessed to obtain lemmas (base forms) and some types of MWEs before the coverage can be reasonably evaluated, and thus the results are necessarily only approximate. The coverage of these resources, with some exclusions as described in the paper, ranges from 41.00% to 97.33%, confirming the need to expand at least some - even well-known - resources to cover the prevailing source of today’s textual resources with regard to lexical units describing events or states (or possibly other eventive mentions).

The Challenges of Creating a Parallel Multilingual Hate Speech Corpus: An Exploration
Katerina Korre | Arianna Muti | Alberto Barrón-Cedeño

Hate speech is infamously one of the most demanding topics in Natural Language Processing, as its multifacetedness is accompanied by a handful of challenges, such as multilinguality and cross-linguality. Hate speech has a subjective aspect that intensifies when referring to different cultures and different languages. In this respect, we design a pipeline that will help us explore the possibility of the creation of a parallel multilingual hate speech dataset, using machine translation. In this paper, we evaluate how/whether this is feasible by assessing the quality of the translations, calculating the toxicity levels of original and target texts, and calculating correlations between the newly obtained scores. Finally, we perform a qualitative analysis to gain further semantic and grammatical insights. With this pipeline we aim at exploring ways of filtering hate speech texts in order to parallelize sentences in multiple languages, examining the challenges of the task.

The Contextual Variability of English Nouns: The Impact of Categorical Specificity beyond Conceptual Concreteness
Giulia Rambelli | Marianna Bolognesi

Research on conceptual abstraction has investigated the differences in contextual distributions, or “contextual variability,” of abstract and concrete concept words (e.g., *love* vs. *cat*). Empirical studies on this topic show that abstract words tend to occur in diverse linguistic contexts, while concrete words are typically constrained within more homogeneous contexts. Nonetheless, these investigations have somewhat overlooked a factor that influences both abstract and concrete concepts: *Categorial Specificity*, which denotes the inclusiveness of a category (e.g., *ragdoll* vs. *mammal*). We argue that more specific words are tied to narrower domains, independently of whether they are concrete or abstract, thus resulting in a diminished degree of contextual variability when compared to generic terms. In this study, we used distributional models to investigate the interplay between contextual variability, concreteness, specificity, and their interaction. Analyzing 676 English nouns, we found that contextual variability is explained by both concreteness and specificity: more specific words have closer contexts, while generic words, whether abstract or concrete, exhibit less related contexts.

We introduce a new corpus, named AIKIA, for Offensive Language Detection (OLD) in Modern Greek (EL). EL is a less-resourced language regarding OLD. AIKIA offers free access to annotated data leveraged from EL Twitter and fiction texts using the lexicon of offensive terms, ERIS, that originates from HurtLex. AIKIA has been annotated for offensive values with the Best Worst Scaling (BWS) method, which is designed to avoid problems of categorical and scalar annotation methods. BWS assigns continuous offensive scores in the form of floating point numbers instead of binary arithmetical or categorical values. AIKIA’s performance in OLD was tested by fine-tuning a variety of pre-trained language models in a binary classification task. Experimentation with a number of thresholds showed that the best mapping of the continuous values to binary labels should occur in the range [0.5-0.6] of BWS values and that the models pre-trained on EL data achieved the highest Macro-F1 scores. Greek-Media-BERT outperformed all models with a threshold of 0.6 by obtaining a Macro-F1 score of 0.92.
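
The mapping from continuous BWS scores to binary labels mentioned above can be illustrated with a minimal sketch. The scores below are invented for illustration, the function name `binarize` is hypothetical, and treating a score equal to the threshold as offensive is an assumption; the abstract does not state how ties are handled or how the corpus stores its scores.

```python
def binarize(scores, threshold):
    """Map continuous scores to binary labels (1 = offensive, 0 = not offensive).

    Assumption: a score equal to the threshold counts as offensive.
    """
    return [1 if score >= threshold else 0 for score in scores]


# Hypothetical BWS scores for five texts (not taken from AIKIA).
example_scores = [0.12, 0.48, 0.55, 0.61, 0.90]

# The two ends of the threshold range reported in the abstract.
labels_at_05 = binarize(example_scores, threshold=0.5)
labels_at_06 = binarize(example_scores, threshold=0.6)

print(labels_at_05)
print(labels_at_06)
```

Moving the threshold from 0.5 to 0.6 relabels the borderline text (score 0.55) as not offensive, which is the kind of shift a threshold sweep would probe.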

The Distracted Ear: How Listeners Shape Conversational Dynamics
Auriane Boudin | Stéphane Rauzy | Roxane Bertrand | Magalie Ochs | Philippe Blache

In the realm of human communication, feedback plays a pivotal role in shaping the dynamics of conversations. This study delves into the multifaceted relationship between listener feedback, narration quality and distraction effects. We present an analysis conducted on the SMYLE corpus, specifically enriched for this study, where 30 dyads of participants engaged in 1) face-to-face storytelling (8.2 hours) followed by 2) a free conversation (7.8 hours). The storytelling task unfolds in two conditions, where a storyteller engages with either a “normal” or a “distracted” listener. Examining the feedback impact on storytellers, we discover a positive correlation between the frequency of specific feedback and the narration quality in normal conditions, providing an encouraging conclusion regarding the enhancement of interaction through specific feedback in distraction-free settings. In contrast, in distracted settings, a negative correlation emerges, suggesting that increased specific feedback may disrupt narration quality, underscoring the complexity of feedback dynamics in human communication. The contribution of this paper is twofold: first presenting a new and highly enriched resource for the analysis of discourse phenomena in controlled and normal conditions; second providing new results on feedback production, its form and its consequence on the discourse quality (with direct applications in human-machine interaction).

The Effects of Pretraining in Video-Guided Machine Translation
Ammon Shurtz | Lawry Sorenson | Stephen D. Richardson

We propose an approach that improves the performance of VMT (Video-guided Machine Translation) models, which integrate text and video modalities. We experiment with the MAD (Movie Audio Descriptions) dataset, a new dataset which contains transcribed audio descriptions of movies. We find that the MAD dataset is more lexically rich than the VATEX dataset (the current VMT baseline), and we experiment with MAD pretraining to improve performance on the VATEX dataset. We experiment with two different video encoder architectures: a Conformer (Convolution-augmented Transformer) and a Transformer. Additionally, we conduct experiments by masking the source sentences to assess the degree to which the performance of both architectures improves due to pretraining on additional video data. Finally, we conduct an analysis of the transfer learning potential of a video dataset and compare it to pretraining on a text-only dataset. Our findings demonstrate that pretraining with a lexically rich dataset leads to significant improvements in model performance when models use both text and video modalities.

The ELCo Dataset: Bridging Emoji and Lexical Composition
Zi Yun Yang | Ziqing Zhang | Yisong Miao

Can emojis be composed to convey intricate meanings like English phrases? As a pioneering study, we present the Emoji-Lexical Composition (ELCo) dataset, a new resource that offers parallel annotations of emoji sequences corresponding to English phrases. Our dataset contains 1,655 instances, spanning 209 diverse concepts from tangible ones like “right man” (✔️👨) to abstract ones such as “full attention” (🧐✍️, illustrating a metaphoric composition of a focusing face and writing hand). ELCo enables the analysis of the patterns shared between emoji and lexical composition. Through a corpus study, we discovered that simple strategies like direct representation and reduplication are sufficient for conveying certain concepts, but a richer, metaphorical strategy is essential for expressing more abstract ideas. We further introduce an evaluative task, Emoji-based Textual Entailment (EmoTE), to assess the proficiency of NLP models in comprehending emoji compositions. Our findings reveal the challenge of understanding emoji composition in a zero-shot setting for current models, including ChatGPT. Our analysis indicates that the intricacy of metaphorical compositions contributes to this challenge. Encouragingly, models show marked improvement when fine-tuned on the ELCo dataset, with larger models excelling in deciphering nuanced metaphorical compositions.

The Emergence of Semantic Units in Massively Multilingual Models
Andrea Gregor de Varda | Marco Marelli

Massively multilingual models can process text in several languages relying on a shared set of parameters; however, little is known about the encoding of multilingual information in single network units. In this work, we study how two semantic variables, namely valence and arousal, are processed in the latent dimensions of mBERT and XLM-R across 13 languages. We report a significant cross-lingual overlap in the individual neurons processing affective information, which is more pronounced when considering XLM-R vis-à-vis mBERT. Furthermore, we uncover a positive relationship between cross-lingual alignment and performance, where the languages that rely more heavily on a shared cross-lingual neural substrate achieve higher performance scores in semantic encoding.

Creating language technology based on language data has become very popular with the recent advances of large language models and neural network technologies. This makes language resources very valuable, and especially in the case of indigenous languages, the scarce resources are even more precious. Given the good results of simply fetching everything you can from the internet and feeding it to neural networks in English, there has been more work on doing the same for all languages. However, indigenous language resources as they are on the web are not comparable in that they would encode the most recent normativised language in all its aspects. The problem is compounded when the people who work on the models do not understand the texts that are input to or output by them. Corpora also have intellectual property rights and copyrights that are not respected. Furthermore, the web is filled with language-model-generated texts. In this article we describe an ethical and sustainable way to work with indigenous languages.

The Igbo language is facing a risk of becoming endangered, as indicated by a 2025 UNESCO study. This highlights the need to develop language technologies for Igbo to foster communication, learning and preservation. To create robust, impactful, and widely adopted language technologies for Igbo, it is essential to incorporate the multi-dialectal nature of the language. The primary obstacle in achieving dialectal-aware language technologies is the lack of comprehensive dialectal datasets. In response, we present the IgboAPI dataset, a multi-dialectal Igbo-English dictionary dataset, developed with the aim of enhancing the representation of Igbo dialects. Furthermore, we illustrate the practicality of the IgboAPI dataset through two distinct studies: one focusing on Igbo semantic lexicon and the other on machine translation. In the semantic lexicon project, we successfully establish an initial Igbo semantic lexicon for the Igbo semantic tagger, while in the machine translation study, we demonstrate that by finetuning existing machine translation systems using the IgboAPI dataset, we significantly improve their ability to handle dialectal variations in sentences.

The Impact of Stance Object Type on the Quality of Stance Detection
Maxwell A. Weinzierl | Sanda M. Harabagiu

Stance as an expression of an author’s standpoint and as a means of communication has long been studied by computational linguists. Automatically identifying the stance of a subject toward an object is an active area of research in natural language processing. Significant work has employed topics and claims as the object of stance, with frames of communication becoming more recently considered as alternative objects of stance. However, little attention has been paid to finding what are the benefits and what are the drawbacks when inferring the stance of a text towards different possible stance objects. In this paper we seek to answer this question by analyzing the implied knowledge and the judgments required when deciding the stance of a text towards each stance object type. Our analysis informed experiments with models capable of inferring the stance of a text towards any of the stance object types considered, namely topics, claims, and frames of communication. Experiments clearly indicate that it is best to infer the stance of a text towards a frame of communication, rather than a claim or a topic. It is also better to infer the stance of a text towards a claim rather than a topic. Therefore we advocate that rather than continuing efforts to annotate the stance of texts towards topics, it is better to use those efforts to produce annotations towards frames of communication. These efforts will allow us to better capture the stance towards claims and topics as well.

The Influence of Automatic Speech Recognition on Linguistic Features and Automatic Alzheimer’s Disease Detection from Spontaneous Speech
Jonathan Heitz | Gerold Schneider | Nicolas Langer

Alzheimer’s disease (AD) represents a major problem for society and a heavy burden for those affected. The study of changes in speech offers a potential means for large-scale AD screening that is non-invasive and inexpensive. Automatic Speech Recognition (ASR) is necessary for a fully automated system. We compare different ASR systems in terms of Word Error Rate (WER) using a publicly available benchmark dataset of speech recordings of AD patients and controls. Furthermore, this study is the first to quantify how popular linguistic features change when replacing manual transcriptions with ASR output. This contributes to the understanding of linguistic features in the context of AD detection. Moreover, we investigate how ASR affects AD classification performance by implementing two popular approaches: A fine-tuned BERT model, and Random Forest on popular linguistic features. Our results show best classification performance when using manual transcripts, but the degradation when using ASR is not dramatic. Performance stays strong, achieving an AUROC of 0.87. Our BERT-based approach is affected more strongly by ASR transcription errors than the simpler and more explainable approach based on linguistic features.
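
For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()  # assumption: simple whitespace tokenisation
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# One of six reference words is missing from the hypothesis.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.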

The Key Points: Using Feature Importance to Identify Shortcomings in Sign Language Recognition Models
Ruth M. Holmes | Ellen Rushe | Anthony Ventresque

Pose estimation keypoints are widely used in sign language recognition (SLR) as a means of generalising to unseen signers. Despite the advantages of keypoints, SLR models struggle to achieve high recognition accuracy for many signed languages due to the large degree of variability between occurrences of the same signs, the lack of large datasets and the imbalanced nature of the data therein. In this paper we seek to provide a deeper analysis into the ways that these keypoints are used by models in order to determine which are most informative to SLR, identify potentially redundant ones and investigate whether keypoints that are central to differentiating signs in practice are being effectively used as expected by models.

The Low Saxon LSDC Dataset at Universal Dependencies
Janine Siewert | Jack Rueter

We present an extension of the Low Saxon Universal Dependencies dataset and discuss a few annotation-related challenges. Low Saxon is a West-Germanic low-resource language that lacks a common standard and therefore poses challenges for NLP. The 1,000 sentences in our dataset cover the last 200 years and 8 of the 9 major dialects. They are presented both in original and in normalised spelling and two lemmata are provided: A Modern Low Saxon lemma and a Middle Low Saxon lemma. Several annotation-related issues result from dialectal variation in morphological categories, and we explain differences in the pronoun, gender, case, and mood system. Furthermore, we take up three syntactic constructions that do not occur in Standard Dutch or Standard German: the possessive dative, pro-drop in pronominal adverbs, and complementiser doubling in subordinate interrogative clauses. These constructions are also rare in the other Germanic UD datasets and have not always been annotated consistently.

The Onomastic Repertoire of the Roman d’Alexandre (ORNARE). Designing an Integrated Digital Onomastic Tool for Medieval French Romance
Marta Milazzo | Giorgio Maria Di Nunzio

The paper reports on the first results of the design and implementation of a new digital tool for romance philology: the digital Onomastic Repertoire for the medieval French romance (12th-15th centuries). This tool, projected with a modular and integrable architecture, was implemented from a selection of romances, the corpus of the Medieval French Roman d’Alexandre. After introducing the peculiarities of the onomastic system in the Middle Ages (and, more generally, the peculiarities of medieval literary texts), the paper describes 1) the methodological challenges faced in the preparatory work, illustrates and comments on the first results achieved and 2) the design and implementation of the first integrated system for the interactive creation of the Onomastic Repertoire of the romaN d’AlexandRE (ORNARE), and 3) the current research output in terms of both a digital edition and the digital onomastic index of the corpus.

The Open-World Lottery Ticket Hypothesis for OOD Intent Classification
Yunhua Zhou | Pengyu Wang | Peiju Liu | Yuxin Wang | Xipeng Qiu

Most existing methods of Out-of-Domain (OOD) intent classification rely on extensive auxiliary OOD corpora or specific training paradigms. However, they are underdeveloped in the underlying principle that the models should have differentiated confidence in In- and Out-of-domain intent. In this work, we shed light on the fundamental cause of model overconfidence on OOD and demonstrate that calibrated subnetworks can be uncovered by pruning the overparameterized model. Calibrated confidence provided by the subnetwork can better distinguish In- and Out-of-domain, which can be a benefit for almost all post hoc methods. In addition to bringing fundamental insights, we also extend the Lottery Ticket Hypothesis to open-world scenarios. We conduct extensive experiments on four real-world datasets to demonstrate our approach can establish consistent improvements compared with a suite of competitive baselines.

Theoretical and Empirical Advantages of Dense-Vector to One-Hot Encoding of Intent Classes in Open-World Scenarios
Paulo Cavalin | Claudio Santos Pinhanez

This work explores the intrinsic limitations of the popular one-hot encoding method in classification of intents when detection of out-of-scope (OOS) inputs is required. Although recent work has shown that there can be significant improvements in OOS detection when the intent classes are represented as dense-vectors based on domain-specific knowledge, we argue in this paper that such gains are more likely due to advantages of the much richer topologies that can be created with dense vectors compared to the equidistant class representation assumed by one-hot encodings. We start by demonstrating how dense-vector encodings are able to create OOS spaces with much richer topologies. Then, we show empirically, using four standard intent classification datasets, that knowledge-free, randomly generated dense-vector encodings of intent classes can yield over 20% gains over one-hot encodings, producing better systems for open-world classification tasks, mostly from improvements in OOS detection.
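
The contrast between equidistant one-hot class encodings and the more varied geometry of dense vectors can be seen in a small sketch. This is not the authors' experimental setup: the helper names, the number of classes, the dimensionality and the Gaussian sampling are all assumptions chosen only to make the geometric point concrete.

```python
import itertools
import math
import random


def one_hot_encodings(num_classes):
    """One vector per class with a single 1.0 at the class index."""
    return [[1.0 if i == j else 0.0 for j in range(num_classes)]
            for i in range(num_classes)]


def random_dense_encodings(num_classes, dim, seed=0):
    """Knowledge-free dense class vectors drawn from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_classes)]


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of class vectors."""
    return [math.dist(a, b) for a, b in itertools.combinations(vectors, 2)]


one_hot_dists = pairwise_distances(one_hot_encodings(4))
dense_dists = pairwise_distances(random_dense_encodings(4, dim=8))

# Every pair of one-hot classes is exactly sqrt(2) apart.
print(sorted(set(round(d, 6) for d in one_hot_dists)))
# Dense class vectors sit at varied distances from one another.
print(sorted(round(d, 3) for d in dense_dists))
```

With one-hot targets every class is the same distance from every other, so the target space carries no neighbourhood structure; dense vectors allow some classes to lie closer together than others.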

Parallel corpora are still scarce for most of the world’s language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From its outset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.

The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings
Michal Mochtak | Peter Rupnik | Nikola Ljubešić

The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings. The paper additionally introduces the first domain-specific multilingual transformer language model for political science applications, which was additionally pre-trained on 1.72 billion words from parliamentary proceedings of 27 European parliaments. We present experiments demonstrating how the additional pre-training on parliamentary data can significantly improve the model downstream performance, in our case, sentiment identification in parliamentary proceedings. We further show that our multilingual model performs very well on languages not seen during fine-tuning, and that additional fine-tuning data from other languages significantly improves the target parliament’s results. The paper makes an important contribution to multiple disciplines inside the social sciences, and bridges them with computer science and computational linguistics. Lastly, the resulting fine-tuned language model sets up a more robust approach to sentiment analysis of political texts across languages, which allows scholars to study political sentiment from a comparative perspective using standardized tools and techniques.

There’s Something New about the Italian Parliament: The IPSA Corpus
Valentino Frasnelli | Alessio Palmero Aprosio

Parliamentary debates constitute a substantial and somewhat underutilized reservoir of publicly available written content. Despite their potential, the Italian parliamentary documents remain largely unexplored and most importantly inaccessible in their original paper-based form. In this paper we attempt to transform these valuable historical documents into IPSA, a digitally readable structured corpus containing speeches, reports of the Standing Committees, and law proposals spanning 175 years of Italian history, from the issuing of the Statuto Albertino in 1848, up to the present day. At first, the PDF documents, available on the official websites of Senato della Repubblica and Camera dei Deputati, the two chambers that form the Italian Parliament, are digitized using Optical Character Recognition (OCR) techniques. Then, the speeches are tagged with the corresponding speakers. The final dataset is released both in textual and structured format.

The RIP Corpus of Collaborative Hypothesis-Making
Ella Schad | Jacky Visser | Chris Reed

The dearth of literature combining hypothesis-making and collaborative problem solving presents a problem in the investigation into how hypotheses are generated in group environments. A new dataset, the Resolving Investigative hyPotheses (RIP) corpus, is introduced to address this issue. The corpus uses the fictionalised environment of a murder investigation game. An artificial environment restricts the number of possible hypotheses compared to real-world situations, allowing a deeper dive into the data. In three groups of three, participants collaborated to solve the mystery: two groups came to the wrong conclusion in different ways, and one succeeded in solving the game. RIP is a 49k-word dialogical corpus, consisting of three sub-corpora, annotated for argumentation and discourse structure on the basis of Inference Anchoring Theory. The corpus shows the emergent roles individuals took on and the strategies the groups employed, showing what can be gained through a deeper exploration of this domain. The corpus bridges the gap between these two areas – hypothesis generation and collaborative problem solving – by using an environment rich with potential for hypothesising within a highly collaborative space.

The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS
Harm Lameris | Eva Szekely | Joakim Gustafson

Recent advancements in spontaneous text-to-speech (TTS) have enabled the realistic synthesis of creaky voice, a voice quality known for its diverse pragmatic and paralinguistic functions. In this study, we used synthesized creaky voice in perceptual tests, to explore how listeners without formal training perceive two distinct types of creaky voice. We annotated a spontaneous speech corpus using creaky voice detection tools and modified a neural TTS engine with a creaky phonation embedding to control the presence of creaky phonation in the synthesized speech. We performed an objective analysis using a creak detection tool which revealed significant differences in creaky phonation levels between the two creaky voice types and modal voice. Two subjective listening experiments were performed to investigate the effect of creaky voice on perceived certainty, valence, sarcasm, and turn finality. Participants rated non-positional creak as less certain, less positive, and more indicative of turn finality, while positional creak was rated significantly more turn final compared to modal phonation.

pdf abs
The Role of Syntactic Span Preferences in Post-Hoc Explanation Disagreement
Jonathan Kamp | Lisa Beinborn | Antske Fokkens

Post-hoc explanation methods are an important tool for increasing model transparency for users. Unfortunately, the currently used methods for attributing token importance often yield diverging patterns. In this work, we study potential sources of disagreement across methods from a linguistic perspective. We find that different methods systematically select different classes of words and that methods that agree most with other methods and with humans display similar linguistic preferences. Token-level differences between methods are smoothed out if we compare them on the syntactic span level. We also find higher agreement across methods by estimating the most important spans dynamically instead of relying on a fixed subset of size k. We systematically investigate the interaction between k and spans and propose an improved configuration for selecting important tokens.

pdf abs
The SAMER Arabic Text Simplification Corpus
Bashar Alhafni | Reem Hazim | Juan David Pineros Liberato | Muhamed Al Khalil | Nizar Habash

We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process, and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.

pdf abs
The Slovak Autistic and Non-Autistic Child Speech Corpus: Task-Oriented Child-Adult Interactions
Joanna Kruyt | Róbert Sabo | Katarína Polónyiová | Daniela Ostatníková | Štefan Beňuš

This paper presents the Slovak Autistic and Non-Autistic Child Speech Corpus, which consists of audio-recordings and transcripts of collaborative, task-oriented conversations between children (with or without autism spectrum disorder, ASD) and a non-autistic adult experimenter. The task used to elicit this corpus was the Maps task. This corpus was primarily recorded to investigate lexical alignment, but can also be used to study other conversation coordination strategies and behaviours. Scores on various standardised psychometric tests, such as those measuring IQ, executive functioning, and theory of mind, are included for each participant. In total, the corpus contains over 15 hours of speech. This relatively large database contains a non-Germanic language and can be shared with any qualified researcher, making it a valuable resource for replication of existing findings regarding communication and ASD as well as future research into communication between individuals with and without ASD.

The Swedish parliamentary records are an important source material for social science and humanities researchers. We introduce a new research corpus, the Swedish Parliament Corpus, which is larger and more developed than previously available research corpora for the Swedish parliament. The corpus contains annotated and structured parliamentary records over more than 150 years, through the bicameral parliament (1867–1970) and the unicameral parliament (1971–). In addition to the records, which contain all speeches in the parliament, we also provide a database of all members of parliament over the same period. Along with the corpus, we describe procedures to ensure data quality. The corpus facilitates detailed analysis of parliamentary speeches in several research fields.

pdf abs
The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English
Tom S Juzek

We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: Half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure a representation of the contemporary discourse. Each entry is labeled with its grammatical status (“well-formedness” according to syntactic formalisms) extracted from the literature, as well as its acceptability status (“intuitive goodness” as determined by native speakers) obtained through crowdsourcing, with highest experimental standards. Even in its preliminary form, this dataset stands as the largest of its kind that is publicly accessible. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics: We observe that grammaticality and acceptability judgments converge in about 83% of the cases and that “in-betweenness” occurs frequently. This corroborates existing research. We also find that while machine learning models struggle with predicting grammaticality, they perform considerably better in predicting acceptability. This is a novel finding. Future work will focus on expanding the dataset.

While human values play a crucial role in making arguments persuasive, we currently lack the necessary extensive datasets to develop methods for analyzing the values underlying these arguments on a large scale. To address this gap, we present the Touché23-ValueEval dataset, an expansion of the Webis-ArgValues-22 dataset. We collected and annotated an additional 4780 new arguments, doubling the dataset’s size to 9324 arguments. These arguments were sourced from six diverse sources, covering religious texts, community discussions, free-text arguments, newspaper editorials, and political debates. Each argument is annotated by three crowdworkers for 54 human values, following the methodology established in the original dataset. The Touché23-ValueEval dataset was utilized in the SemEval 2023 Task 4. ValueEval: Identification of Human Values behind Arguments, where an ensemble of transformer models demonstrated state-of-the-art performance. Furthermore, our experiments show that a fine-tuned large language model, Llama-2-7B, achieves comparable results.

pdf abs
TIGER: A Unified Generative Model Framework for Multimodal Dialogue Response Generation
Fanheng Kong | Peidong Wang | Shi Feng | Daling Wang | Yifei Zhang

Responding with multimodal content has been recognized as one of the essential functionalities of intelligent conversational agents. However, existing research on multimodal dialogues primarily focuses on two topics: (1) textual response generation that grounds the conversation on a given image; and (2) visual response selection based on the dialogue context. In light of the aforementioned gap, we propose mulTImodal GEnerator for dialogue Response (TIGER), a unified generative model framework for multimodal dialogue response generation. Through extensive experiments, TIGER has demonstrated new state-of-the-art results, providing users with an enhanced conversational experience. A multimodal dialogue system based on TIGER is available at https://github.com/friedrichor/TIGER. A video demonstrating the system is available at https://www.youtube.com/watch?v=Kd0CMwDs8Rk.

pdf abs
TIGQA: An Expert-Annotated Question-Answering Dataset in Tigrinya
Hailay Kidu Teklehaymanot | Dren Fazlija | Niloy Ganguly | Gourab Kumar Patro | Wolfgang Nejdl

The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated dataset containing 2,685 question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pre-trained models. The notable disparities between human performance and the best model performance underscore the potential for future enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC. Keywords: Tigrinya QA dataset, Low resource QA dataset, domain specific QA

pdf abs
Time-aware COMET: A Commonsense Knowledge Model with Temporal Knowledge
Eiki Murata | Daisuke Kawahara

To better handle commonsense knowledge, which is difficult to acquire in ordinary training of language models, commonsense knowledge graphs and commonsense knowledge models have been constructed. The former manually and symbolically represents commonsense, and the latter stores these graphs’ knowledge in the models’ parameters. However, the existing commonsense knowledge models that deal with events do not consider granularity or time axes. In this paper, we propose a time-aware commonsense knowledge model, TaCOMET. The construction of TaCOMET consists of two steps. First, we create TimeATOMIC using ChatGPT, which is a commonsense knowledge graph with time. Second, TaCOMET is built by continually finetuning an existing commonsense knowledge model on TimeATOMIC. TimeATOMIC and continual finetuning let the model make more time-aware generations with rich commonsense than the existing commonsense models. We also verify the applicability of TaCOMET on a robotic decision-making task. TaCOMET outperformed the existing commonsense knowledge model when proper times are input. Our dataset and models will be made publicly available.

pdf abs
Title-based Extractive Summarization via MRC Framework
Hongjin Kim | Jai-Eun Kim | Harksoo Kim

Existing studies on extractive summarization have primarily focused on scoring and selecting summary sentences independently. However, these models are limited to sentence-level extraction and tend to select highly generalized sentences while overlooking the overall content of a document. To effectively consider the semantics of a document, in this study, we introduce a novel machine reading comprehension (MRC) framework for extractive summarization (MRCSum) by setting a query as the title. Our framework enables MRCSum to consider the semantic coherence and relevance of summary sentences in relation to the overall content. In particular, when a title is not available, we generate a title-like query, which is expected to achieve the same effect as a title. Our title-like query consists of the topic and keywords to serve as information on the main topic or theme of the document. We conduct experiments in both Korean and English languages, evaluating the performance of MRCSum on datasets comprising both long and short summaries. Our results demonstrate the effectiveness of MRCSum in extractive summarization, showcasing its ability to generate concise and informative summaries with or without explicit titles. Furthermore, our MRCSum outperforms existing models by capturing the essence of the document content and producing more coherent summaries.

End-to-end multimodal aspect-based sentiment analysis (MABSA) combines multimodal aspect terms extraction (MATE) with multimodal aspect sentiment classification (MASC), aiming to simultaneously extract aspect words and classify the sentiment polarity of each aspect. However, existing MABSA methods have overlooked two issues: (i) They only focus on fusing image regional information and textual words for the two subtasks of MABSA, whereas the MATE subtask relies more on global image information to assist in obtaining the quantity and attributes of aspects. Ignoring the integration with global information may affect the performance of MABSA methods. (ii) They fail to take advantage of target information, even though the fine-grained details of targets are important for classifying sentiments of aspects. To solve these problems, we propose a Target-oriented Multi-grained Fusion Network (TMFN). It fuses text information with global coarse-grained image information for the MATE subtask and with fine-grained image information for the MASC subtask. In addition, a target-oriented feature alignment (TOFA) module is designed to enhance target-related information in image features with target details. In this way, image features contain more target-related emotional information, which is beneficial to sentiment classification. Extensive experiments show that our method outperforms state-of-the-art methods on two benchmark datasets.

pdf abs
To Drop or Not to Drop? Predicting Argument Ellipsis Judgments: A Case Study in Japanese
Yukiko Ishizuki | Tatsuki Kuribayashi | Yuichiroh Matsubayashi | Ryohei Sasano | Kentaro Inui

Speakers sometimes omit certain arguments of a predicate in a sentence; such omission is especially frequent in pro-drop languages. This study addresses a question about ellipsis—what can explain the native speakers’ ellipsis decisions?—motivated by the interest in human discourse processing and writing assistance for this choice. To this end, we first collect large-scale human annotations of whether and why a particular argument should be omitted across over 2,000 data points in the balanced corpus of Japanese, a prototypical pro-drop language. The data indicate that native speakers overall share common criteria for such judgments and further clarify their quantitative characteristics, e.g., the distribution of related linguistic factors in the balanced corpus. Furthermore, the performance of the language model–based argument ellipsis judgment model is examined, and the gap between the systems’ prediction and human judgments in specific linguistic aspects is revealed. We hope our fundamental resource encourages further studies on natural human ellipsis judgment.

pdf abs
To Err Is Human, How about Medical Large Language Models? Comparing Pre-trained Language Models for Medical Assessment Errors and Reliability
Wen-wai Yim | Yujuan Fu | Asma Ben Abacha | Meliha Yetisgen

Unpredictability, especially unpredictability with unknown error characteristics, is a highly undesirable trait, particularly in medical patient care applications. Although large pre-trained language models (LLMs) have been applied to a variety of unseen tasks with highly competitive and successful results, their sensitivity to language inputs and resulting performance variability is not well-studied. In this work, we test state-of-the-art pre-trained language models from a variety of families to characterize their error generation and reliability in medical assessment ability. Particularly, we experiment with general medical assessment multiple choice tests, as well as their open-ended and true-false alternatives. We also profile model consistency and error agreement with each other and with humans, and finally quantify their ability to recover and explain errors. The findings in this work can be used to give further information about medical models so that modelers can make better-informed decisions rather than relying on standalone performance metrics alone.

pdf abs
Token-length Bias in Minimal-pair Paradigm Datasets
Naoya Ueda | Masato Mita | Teruaki Oka | Mamoru Komachi

Minimal-pair paradigm datasets have been used as benchmarks to evaluate the linguistic knowledge of models and provide an unsupervised method of acceptability judgment. The model performances are evaluated based on the percentage of minimal pairs in the MPP dataset where the model assigns a higher sentence log-likelihood to an acceptable sentence than to an unacceptable sentence. Each minimal pair in MPP datasets is controlled to align the number of words per sentence because the sentence length affects the sentence log-likelihood. However, aligning the number of words may be insufficient because recent language models tokenize sentences with subwords. Tokenization may cause a token length difference in minimal pairs, introducing token-length bias that skews the evaluation results. This study demonstrates that MPP datasets suffer from token-length bias and fail to evaluate the linguistic knowledge of a language model correctly. The results proved that sentences with a shorter token length would likely be assigned a higher log-likelihood regardless of their acceptability, which becomes problematic when comparing models with different tokenizers. To address this issue, we propose a debiased minimal pair generation method, allowing MPP datasets to measure language ability correctly and provide comparable results for all models.
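
The evaluation rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy tokenizer, and the toy scorer are hypothetical stand-ins chosen only to show how the accuracy is computed and how a pair can match in word count yet differ in token count.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (acceptable sentence, unacceptable sentence)


def mpp_accuracy(pairs: Sequence[Pair],
                 log_likelihood: Callable[[str], float]) -> float:
    """Fraction of pairs where the acceptable sentence scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one minimal pair")
    hits = sum(1 for good, bad in pairs
               if log_likelihood(good) > log_likelihood(bad))
    return hits / len(pairs)


def token_length_mismatches(pairs: Sequence[Pair],
                            tokenize: Callable[[str], List[str]]) -> List[Pair]:
    """Pairs whose two sentences differ in token count under `tokenize`."""
    return [(good, bad) for good, bad in pairs
            if len(tokenize(good)) != len(tokenize(bad))]


# Hypothetical toy tokenizer: words longer than six characters are split in two,
# imitating subword segmentation.
def toy_tokenize(sentence: str) -> List[str]:
    tokens: List[str] = []
    for word in sentence.split():
        if len(word) > 6:
            tokens.extend([word[:6], "##" + word[6:]])
        else:
            tokens.append(word)
    return tokens


# Hypothetical toy scorer that depends only on token count, i.e. a purely
# length-driven "model" with no linguistic knowledge.
def toy_log_likelihood(sentence: str) -> float:
    return -1.5 * len(toy_tokenize(sentence))


PAIRS: List[Pair] = [
    ("The cats sleep .", "The cats sleeps ."),
    ("The authors disagree .", "The author disagree ."),
]

accuracy = mpp_accuracy(PAIRS, toy_log_likelihood)
mismatched = token_length_mismatches(PAIRS, toy_tokenize)
```

Both pairs have equal word counts, but under the toy tokenizer the second pair differs in token count, so the length-only scorer prefers its unacceptable sentence. This is the kind of skew the abstract attributes to token-length bias.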

pdf abs
To Learn or Not to Learn: Replaced Token Detection for Learning the Meaning of Negation
Gunjan Bhattarai | Katrin Erk

State-of-the-art language models perform well on a variety of language tasks, but they continue to struggle with understanding negation cues in tasks like natural language inference (NLI). Inspired by Hossain et al. (2020), who show under-representation of negation in language model pretraining datasets, we experiment with additional pretraining with negation data for which we introduce two new datasets. We also introduce a new learning strategy for negation building on ELECTRA’s (Clark et al., 2020) replaced token detection objective. We find that continuing to pretrain ELECTRA-Small’s discriminator leads to substantial gains on a variant of RTE (Recognizing Textual Entailment) with additional negation. On SNLI (Stanford NLI) (Bowman et al., 2015), there are no gains due to the extreme under-representation of negation in the data. Finally, on MNLI (Multi-NLI) (Williams et al., 2018), we find that performance on negation cues is primarily stymied by neutral-labeled examples.

In recent years, fine-tuned generative models have proven more powerful than previous tagging-based or span-based models on the named entity recognition (NER) task. It has also been found that information related to entities, such as entity types, can prompt a model to perform NER better. However, it is not easy to determine in advance which entity types actually occur in a given sentence, and inputting too many potential entity types inevitably distracts the model. To exploit the benefit of entity types for NER, in this paper we propose a novel NER framework, ToNER, based on a generative model. In ToNER, a type matching model is first proposed to identify the entity types most likely to appear in the sentence. Then, we append a multiple binary classification task to fine-tune the generative model’s encoder, so as to generate a refined representation of the input sentence. Moreover, we add an auxiliary task for the model to discover the entity types, which further fine-tunes the model to output more accurate results. Our extensive experiments on several NER benchmarks verify the effectiveness of the proposed strategies in ToNER that are oriented towards exploiting entity types.

Tool learning aims to extend the capabilities of large language models (LLMs) with external tools. A major challenge in tool learning is how to support a large number of tools, including unseen tools. To address this challenge, previous studies have proposed retrieving suitable tools for the LLM based on the user query. However, previously proposed methods do not consider the differences between seen and unseen tools, nor do they take the hierarchy of the tool library into account, which may lead to suboptimal performance for tool retrieval. Therefore, to address the aforementioned issues, we propose ToolRerank, an adaptive and hierarchy-aware reranking method for tool retrieval to further refine the retrieval results. Specifically, our proposed ToolRerank includes Adaptive Truncation, which truncates the retrieval results related to seen and unseen tools at different positions, and Hierarchy-Aware Reranking, which makes retrieval results more concentrated for single-tool queries and more diverse for multi-tool queries. Experimental results show that ToolRerank can improve the quality of the retrieval results, leading to better execution results generated by the LLM.

pdf abs
Topic Classification and Headline Generation for Maltese Using a Public News Corpus
Amit Kumar Chaudhary | Kurt Micallef | Claudia Borg

The development of NLP tools for low-resource languages is impeded by the lack of data. While recent unsupervised pre-training approaches ease this requirement, labelled data remains crucial to progress the development of such tools. Moreover, publicly available datasets for such languages typically cover low-level syntactic tasks. In this work, we introduce new semantic datasets for Maltese generated automatically using associated metadata from a corpus in the news domain. The datasets cover two tasks: multi-label classification of news tags, and abstractive news summarisation framed as generating an article's title. We also present an evaluation using publicly available models as baselines. Our results show that current models lack the semantic knowledge required to solve such tasks, shedding light on the need to use better modelling approaches for Maltese.

pdf abs
Topic-Controllable Summarization: Topic-Aware Evaluation and Transformer Methods
Tatiana Passali | Grigorios Tsoumakas

Topic-controllable summarization is an emerging research area with a wide range of potential applications. However, existing approaches suffer from significant limitations. For example, the majority of existing methods are built upon recurrent architectures, which can significantly limit their performance compared to more recent Transformer-based architectures, while they also require modifications to the model’s architecture for controlling the topic. At the same time, there is currently no established evaluation metric designed specifically for topic-controllable summarization. This work proposes a new topic-oriented evaluation measure to automatically evaluate the generated summaries based on the topic affinity between the generated summary and the desired topic. The reliability of the proposed measure is demonstrated through appropriately designed human evaluation. In addition, we adapt topic embeddings to work with powerful Transformer architectures and propose a novel and efficient approach for guiding the summary generation through control tokens. Experimental results reveal that control tokens can achieve better performance compared to more complicated embedding-based approaches while also being significantly faster.

pdf abs
Topic Detection and Tracking with Time-Aware Document Embeddings
Hang Jiang | Doug Beeferman | Weiquan Mao | Deb Roy

The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account, but does not well capture how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it can benefit the overall performance. We conduct ablation studies on the time representation and fusion algorithm strategies, showing that our proposed model outperforms alternative strategies. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.
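
The triplet-loss fine-tuning mentioned in this abstract follows a well-known general pattern, sketched below. The abstract does not state the distance function or margin used, so the Euclidean distance, the default margin of 1.0, and the function names are assumptions for illustration only.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    In a TDT setting the anchor and positive would be embeddings of stories
    about the same event, and the negative a story about a different event
    (an assumption about how triplets are formed, not taken from the paper).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Same-event story is close, different-event story is far: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the loss is positive and pushes the embeddings apart.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the different-event story is at least `margin` farther from the anchor than the same-event story, which is the behaviour a clustering step over the embeddings relies on.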

pdf abs
TopicDiff: A Topic-enriched Diffusion Approach for Multimodal Conversational Emotion Detection
Jiamin Luo | Jingjing Wang | Guodong Zhou

Multimodal Conversational Emotion (MCE) detection, generally spanning the acoustic, vision and language modalities, has attracted increasing interest in the multimedia community. Previous studies predominantly focus on learning contextual information in conversations, with only a few considering topic information in the single language modality, while consistently neglecting acoustic and vision topic information. On this basis, we propose a model-agnostic Topic-enriched Diffusion (TopicDiff) approach for capturing multimodal topic information in MCE tasks. Particularly, we integrate the diffusion model into a neural topic model to alleviate the diversity deficiency problem of neural topic models in capturing topic information. Detailed evaluations demonstrate the significant improvements of TopicDiff over the state-of-the-art MCE baselines, justifying the importance of multimodal topic information to MCE and the effectiveness of TopicDiff in capturing such information. Furthermore, we observe that the topic information in the acoustic and vision modalities is more discriminative and robust than that in the language modality.

pdf abs
Topics as Entity Clusters: Entity-based Topics from Large Language Models and Graph Neural Networks
Manuel V. Loureiro | Steven Derby | Tri Kurniawan Wijaya

Topic models aim to reveal latent structures within a corpus of text, typically through the use of term-frequency statistics over bag-of-words representations from documents. In recent years, conceptual entities — interpretable, language-independent features linked to external knowledge resources — have been used in place of word-level tokens, as words typically require extensive language processing with a minimal assurance of interpretability. However, current literature is limited when it comes to exploring purely entity-driven neural topic modeling. For instance, despite the advantages of using entities for eliciting thematic structure, it is unclear whether current techniques are compatible with these sparsely organised, information-dense conceptual units. In this work, we explore entity-based neural topic modeling and propose a novel topic clustering approach using bimodal vector representations of entities. Concretely, we extract these latent representations from large language models and graph neural networks trained on a knowledge base of symbolic relations, in order to derive the most salient aspects of these conceptual units. Analysis of coherency metrics confirms that our approach is better suited to working with entities in comparison to state-of-the-art models, particularly when using graph-based embeddings trained on a knowledge base.

pdf abs
To Share or Not to Share: What Risks Would Laypeople Accept to Give Sensitive Data to Differentially-Private NLP Systems?
Christopher Weiss | Frauke Kreuter | Ivan Habernal

Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training or data sharing, the choice and interpretation of the key parameter, privacy budget 𝜀 that governs the strength of privacy protection, remains largely arbitrary. We argue that determining the 𝜀 value should not be solely in the hands of researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for 𝜀 of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study the behavior of people in uncertain decision-making situations with respect to privacy-threatening situations. Framing the risk perception in terms of two realistic NLP scenarios and using a vignette behavioral study help us determine what 𝜀 thresholds would lead lay people to be willing to share sensitive textual data – to our knowledge, the first study of its kind.
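
For readers unfamiliar with the parameter, the textbook definition of pure 𝜀-differential privacy is reproduced below as background. The abstract does not say whether the scenarios use this pure form or the relaxed (𝜀, δ) variant, so treat the choice as an assumption.

```latex
% A randomized mechanism M is \varepsilon-differentially private if, for all
% neighbouring datasets D and D' (differing in one individual's data) and all
% sets of outputs S:
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]

% The bound grows exponentially with \varepsilon, for example:
e^{1} \approx 2.72, \qquad e^{10} \approx 2.2 \times 10^{4}
```

Under this definition, an 𝜀 of 10 allows the probability of any given output to change by a factor of roughly 22,000 depending on whether one person's data is included.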

pdf abs
Towards a Corpus of Spoken Maltese: Korpus tal-Malti Mitkellem, KMM
Alexandra (Sandra) Vella | Sarah Agius | Aiden Williams | Claudia Borg

This paper presents the rationale for a “dedicated” corpus of spoken Maltese, Korpus tal-Malti Mitkellem, KMM, ‘Corpus of Spoken Maltese’, based on the concept of a gold-standard Core collection. The Core collection is designed to cater to as wide a variety of user needs as possible whilst respecting basic principles governing corpus design, such as representativeness and balance, and delivering high quality in terms of both audio quality and annotations. An overview is provided of the composition of the current Core corpus of around 20 hours of data and of the human annotation effort involved. We also carry out a small qualitative analysis of the output of a Maltese ASR system and compare it to the human annotators’ output. Initial results are promising, showing that the ASR is robust enough to generate first-pass texts for annotators to work on, thus reducing the human effort, and consequently, the cost involved.

pdf abs
Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models
Bolette Pedersen | Nathalie Sørensen | Sussi Olsen | Sanni Nimb | Simon Gray

We present the first version of a semantic reasoning benchmark for Danish compiled semi-automatically from a number of human-curated lexical-semantic resources, which function as our gold standard. Taken together, the datasets constitute a benchmark for assessing selected language understanding capacities of large language models (LLMs) for Danish. This first version comprises 25 datasets across 6 different tasks and includes 3,800 test instances. Although still somewhat limited in size, we go beyond comparative evaluation datasets for Danish by including both negative and contrastive examples as well as low-frequency vocabulary; aspects which tend to challenge current LLMs when based substantially on language transfer. The datasets focus on features such as semantic inference and entailment, similarity, relatedness, and ability to disambiguate words in context. We use ChatGPT to assess to which degree our datasets challenge the ceiling performance of state-of-the-art LLMs, average performance being relatively high with an average accuracy of 0.6 on ChatGPT 3.5 turbo and 0.8 on ChatGPT 4.0.

pdf abs
Towards a Framework for Evaluating Explanations in Automated Fact Verification
Neema Kotonya | Francesca Toni

As deep neural models in NLP become more complex, and as a consequence opaque, the necessity to interpret them becomes greater. A burgeoning interest has emerged in rationalizing explanations to provide short and coherent justifications for predictions. In this position paper, we advocate for a formal framework for key concepts and properties about rationalizing explanations to support their evaluation systematically. We also outline one such formal framework, tailored to rationalizing explanations of increasingly complex structures, from free-form explanations to deductive explanations, to argumentative explanations (with the richest structure). Focusing on the automated fact verification task, we provide illustrations of the use and usefulness of our formalization for evaluating explanations, tailored to their varying structures.

Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data
Shinka Mori | Oana Ignat | Andrew Lee | Rada Mihalcea

Synthetic data generation has the potential to impact applications and domains with scarce data. However, before such data is used for sensitive tasks such as mental health, we need an understanding of how different demographics are represented in it. In our paper, we analyze the potential of producing synthetic data using GPT-3 by exploring the various stressors it attributes to different race and gender combinations, to provide insight for future researchers looking into using LLMs for data generation. Using GPT-3, we develop HeadRoom, a synthetic dataset of 3,120 posts about depression-triggering stressors, by controlling for race, gender, and time frame (before and after COVID-19). Using this dataset, we conduct semantic and lexical analyses to (1) identify the predominant stressors for each demographic group; and (2) compare our synthetic data to a human-generated dataset. We present the procedures to generate queries to develop depression data using GPT-3, and conduct analyses to uncover the types of stressors it assigns to demographic groups, which could be used to test the limitations of LLMs for synthetic data generation for depression data. Our findings show that synthetic data mimics some of the human-generated data distribution for the predominant depression stressors across diverse demographics.

Towards an Ideal Tool for Learner Error Annotation
Špela Arhar Holdt | Tomaž Erjavec | Iztok Kosem | Elena Volodina

Annotation and analysis of corrections in learner corpora have always presented technical challenges, mainly on account of the fact that until now there has not been any standard tool available, and that original and corrected versions of texts have been mostly stored together rather than treated as individual texts. In this paper, we present CJVT Svala 1.0, the Slovene version of the SVALA tool, which was originally used for the annotation of Swedish learner language. The localisation into Slovene resulted in the development of several new features in SVALA such as the support for multiple annotation systems, localisation into other languages, and the support for more complex annotation systems. Adopting the parallel aligned approach to text visualisation and annotation, as well as storing the data, combined with the tool supporting this, i.e. SVALA, are proposed as new standards in Learner Corpus Research.

Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches
Deepak Gupta | Kush Attal | Dina Demner-Fushman

The increase in the availability of online videos has transformed the way we access information and knowledge. A growing number of individuals now prefer instructional videos as they offer a series of step-by-step procedures to accomplish particular tasks. Instructional videos from the medical domain may provide the best possible visual answers to first aid, medical emergency, and medical education questions. This paper focuses on answering health-related questions asked by health consumers by providing visual answers from medical videos. The scarcity of large-scale datasets in the medical domain is a key challenge that hinders the development of applications that can help the public with their health-related questions. To address this issue, we first proposed a pipelined approach to create two large-scale datasets: HealthVidQA-CRF and HealthVidQA-Prompt. Leveraging the datasets, we developed monomodal and multimodal approaches that can effectively provide visual answers from medical videos to natural language questions. We conducted a comprehensive analysis of the results and outlined the findings, focusing on the impact of the created datasets on model training and the significance of visual features in enhancing the performance of the monomodal and multimodal approaches for the medical visual answer localization task.

Towards a Unified Taxonomy of Deep Syntactic Relations
Kira Droganova | Daniel Zeman

This paper analyzes multiple deep-syntactic frameworks with the goal of creating a proposal for a set of universal semantic role labels. The proposal examines various theoretic linguistic perspectives and focuses on Meaning-Text Theory and Functional Generative Description frameworks and PropBank. The research is based on the data from four Indo-European and one Uralic language – Spanish and Catalan (Taulé et al., 2011), Czech (Hajič et al., 2017), English (Hajič et al., 2012), and Finnish (Haverinen et al., 2015). Updated datasets with the new universal semantic role labels are now publicly available as a result of our work. Nevertheless, our proposal is oriented towards Universal Dependencies (UD) (de Marneffe et al., 2021) and our ultimate goal is to apply a subset of the universal labels to the full UD data.

Towards Autonomous Tool Utilization in Language Models: A Unified, Efficient and Scalable Framework
Zhi Li | Yicheng Li | Hequan Ye | Yin Zhang

In recent research, significant advancements have been achieved in tool learning for large language models. Looking towards future advanced studies, the issue of fully autonomous tool utilization is particularly intriguing: given only a query, language models can autonomously decide whether to employ a tool, which specific tool to select, and how to utilize these tools, all without needing any tool-specific prompts within the context. To achieve this, we introduce a unified, efficient, and scalable framework for fine-tuning language models. Based on the degree of tool dependency, we initially categorize queries into three distinct types. By transforming the entire process into a sequential decision-making problem through conditional probability decomposition, our approach unifies the three types and autoregressively generates decision processes. Concurrently, we’ve introduced an “instruct, execute, and reformat” strategy specifically designed for efficient data annotation. Through end-to-end training on the annotated dataset comprising 26 diverse APIs, the model demonstrates a level of self-awareness, automatically seeking tool assistance when necessary. It significantly surpasses original instruction-tuned open-source language models and GPT-3.5/4 on multiple evaluation metrics. To address real-world scalability needs, we’ve enhanced our framework with a dynamic rehearsal strategy for continual learning, proven to require minimal new annotations to exhibit remarkable performance.

Towards a Zero-Data, Controllable, Adaptive Dialog System
Dirk Väth | Lindsey Vanderlyn | Ngoc Thang Vu

Conversational Tree Search (Väth et al., 2023) is a recent approach to controllable dialog systems, where domain experts shape the behavior of a Reinforcement Learning agent through a dialog tree. The agent learns to efficiently navigate this tree, while adapting to information needs, e.g., domain familiarity, of different users. However, the need for additional training data hinders deployment in new domains. To address this, we explore approaches to generate this data directly from dialog trees. We improve the original approach, and show that agents trained on synthetic data can achieve comparable dialog success to models trained on human data, both when using a commercial Large Language Model for generation, or when using a smaller open-source model, running on a single GPU. We further demonstrate the scalability of our approach by collecting and testing on two new datasets: ONBOARD, a new domain helping foreign residents moving to a new city, and the medical domain DIAGNOSE, a subset of Wikipedia articles related to scalp and head symptoms. Finally, we perform human testing, where no statistically significant differences were found in either objective or subjective measures between models trained on human and generated data.

Readability is a crucial characteristic of texts, greatly influencing comprehension and reading efficacy. Unfortunately, limited research is available for less-resourced languages, especially for young populations where its impact is even higher. This paper introduces a new readability tool for children’s literature in the Romanian language, explicitly targeting primary school students aged 7-11. The tool consists of a digital repository of school reading texts (self-compiled corpus) and a text analysis interface that generates automatic readability reports for uploaded short texts. The methodology involves extracting, testing, and calibrating a readability formula for Romanian using the children’s literature corpus. Related work on readability and readability tools is discussed, followed by a description of the children’s literature corpus and the platform functionalities. The first steps are presented towards validating the readability formula for children’s literature in Romanian using the ReaderBench framework, while calibration variables relevant to the Romanian language and children’s literature are examined. Currently, no existing platform integrates a research-based readability formula for the Romanian language, making this tool unique. Overall, this research contributes to applied corpus linguistics and Digital Humanities studies and offers a valuable resource for educators, parents, and children in accessing age-appropriate and readable texts.

Contemporary NLP has rapidly progressed from feature-based classification to fine-tuning and prompt-based techniques leveraging large language models. Many of these techniques remain understudied in the context of real-world, clinically enriched spontaneous dialogue. We fill this gap by systematically testing the efficacy and overall performance of a wide variety of NLP techniques ranging from feature-based to in-context learning on transcribed speech collected from patients with bipolar disorder, schizophrenia, and healthy controls taking a focused, clinically-validated language test. We observe impressive utility of a range of feature-based and language modeling techniques, finding that these approaches may provide a plethora of information capable of upholding clinical truths about these subjects. Building upon this, we establish pathways for future research directions in automated detection and understanding of psychiatric conditions.

Towards Cost-effective Multi-style Conversations: A Pilot Study in Task-oriented Dialogue Generation
Tiziano Labruna | Bernardo Magnini

Conversations exhibit significant variation when different styles are employed by participants, often leading to subpar performance when a dialogue model is exclusively trained on single-style datasets. We present a cost-effective methodology for generating multi-style conversations, which can be used in the development of conversational agents. This methodology only assumes the availability of a conversational domain, such as a knowledge base, and leverages the generative capabilities of large language models. In a pilot study focused on the generation aspect of task-oriented dialogues, we extended the well-known MultiWOZ dataset to encompass multi-style variations. Our findings highlight two key experimental outcomes: (i) these novel resources pose challenges for current single-style models, and (ii) multi-style resources enhance the dialogue model’s resilience to stylistic variations.

Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification
Artem Abzaliev | Humberto Perez-Espinosa | Rada Mihalcea

Similar to humans, animals make extensive use of verbal and non-verbal forms of communication, including a large range of audio signals. In this paper, we address dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech to address dog bark classification tasks that find parallels in human-centered tasks in speech recognition. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that using speech embedding representations significantly improves over simpler classification baselines. Further, we also find that models pre-trained on large human speech acoustics can provide additional performance boosts on several tasks.

Towards Equitable Natural Language Understanding Systems for Dialectal Cohorts: Debiasing Training Data
Khadige Abboud | Gokmen Oz

Despite being widely spoken, dialectal variants of languages are frequently considered low in resources due to a lack of writing standards and orthographic inconsistencies. As a result, training natural language understanding (NLU) systems relies primarily on standard language resources, leading to biased and inequitable NLU technology that underserves dialectal speakers. In this paper, we propose to address this problem through a framework composed of a dialect identification model that is used to obtain targeted training data augmentation for under-represented dialects, in an effort to debias the NLU model for dialectal cohorts in NLU systems. We conduct experiments on two dialect-rich non-English languages, Arabic and German, using large-scale commercial NLU datasets as well as open-source datasets. Results show that such a framework can provide insights on dialect disparity in real-world NLU systems and that targeted data augmentation can help narrow the model’s performance gap between standard language speakers and dialect speakers.

Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset
Santosh T.y.s.s. | Nina Baumgartner | Matthias Stürmer | Matthias Grabmair | Joel Niklaus

The assessment of explainability in Legal Judgement Prediction (LJP) systems is of paramount importance in building trustworthy and transparent systems, particularly considering the reliance of these systems on factors that may lack legal relevance or involve sensitive attributes. This study delves into the realm of explainability and fairness in LJP models, utilizing Swiss Judgement Prediction (SJP), the only available multilingual LJP dataset. We curate a comprehensive collection of rationales that ‘support’ and ‘oppose’ judgement from legal experts for 108 cases in German, French, and Italian. By employing an occlusion-based explainability approach, we evaluate the explainability performance of state-of-the-art monolingual and multilingual BERT-based LJP models, as well as models developed with techniques such as data augmentation and cross-lingual transfer, which demonstrated prediction performance improvement. Notably, our findings reveal that improved prediction performance does not necessarily correspond to enhanced explainability performance, underscoring the significance of evaluating models from an explainability perspective. Additionally, we introduce a novel evaluation framework, Lower Court Insertion (LCI), which allows us to quantify the influence of lower court information on model predictions, exposing current models’ biases.
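The general idea behind occlusion-based explanation can be sketched as follows: remove one unit of the input at a time (here, one sentence) and record how much the model's score changes. This is a minimal illustration under stated assumptions, not the authors' implementation; `toy_score`, its cue words, and the example sentences are hypothetical stand-ins for a real LJP model and case text.

```python
from typing import Callable, List, Tuple


def toy_score(sentences: List[str]) -> float:
    """Hypothetical stand-in for a judgement model: sums invented cue-word weights."""
    weights = {"violated": 2.0, "lawful": -1.0}
    total = 0.0
    for sentence in sentences:
        for word in sentence.lower().split():
            total += weights.get(word.strip(".,"), 0.0)
    return total


def occlusion_importance(
    sentences: List[str],
    score: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Return (sentence, score drop when that sentence is removed) pairs."""
    base = score(sentences)
    result = []
    for i, sentence in enumerate(sentences):
        occluded = sentences[:i] + sentences[i + 1:]  # input without sentence i
        result.append((sentence, base - score(occluded)))
    return result


facts = [
    "The appellant violated the contract.",
    "The hearing took place in Bern.",
    "The lower court found the conduct lawful.",
]
for sentence, delta in occlusion_importance(facts, toy_score):
    print(f"{delta:+.1f}  {sentence}")
```

A positive value means the sentence pushed the score up, a negative value means it pushed the score down, and zero means the toy model ignored it. Such per-sentence scores could then be compared against expert rationales.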

Towards Few-shot Entity Recognition in Document Images: A Graph Neural Network Approach Robust to Image Manipulation
Prashant Krishnan | Zilong Wang | Yangkun Wang | Jingbo Shang

Recent advances in incorporating layout information, typically bounding box coordinates, into pre-trained language models have achieved significant performance in entity recognition from document images. Using coordinates can easily model the position of each token, but they are sensitive to manipulations in document images (e.g., shifting, rotation or scaling) which are common in real scenarios. This limitation becomes even worse when the training data is limited in few-shot settings. In this paper, we propose a novel framework, LAGER, which leverages the topological adjacency relationship among the tokens through learning their relative layout information with graph neural networks. Specifically, we consider the tokens in the documents as nodes and formulate the edges based on the topological heuristics. Such adjacency graphs are invariant to affine transformations, making them robust to the common image manipulations. We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings. Extensive experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings and also demonstrates better robustness to manipulations.
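A small sketch may help show why an adjacency graph can be more stable than raw coordinates. It assumes a k-nearest-neighbour heuristic over token centre points, which is an illustrative choice and not necessarily the heuristic used in LAGER; the point coordinates are invented. Under this assumption the edge set is unchanged by shifting, rotating, and uniformly scaling the page, even though every coordinate changes.

```python
import math
from typing import List, Set, Tuple

Point = Tuple[float, float]


def knn_edges(points: List[Point], k: int = 2) -> Set[Tuple[int, int]]:
    """Directed edges from each token to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in neighbours[:k]:
            edges.add((i, j))
    return edges


def transform(points: List[Point], angle: float, scale: float,
              shift: Point) -> List[Point]:
    """Rotate by `angle` (radians), scale uniformly, then shift."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (scale * (c * x - s * y) + shift[0], scale * (s * x + c * y) + shift[1])
        for x, y in points
    ]


# Hypothetical token centres on a page.
tokens = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 1.0), (6.0, 4.0)]
moved = transform(tokens, angle=0.3, scale=2.5, shift=(10.0, -4.0))
print(knn_edges(tokens) == knn_edges(moved))
```

Note that this particular heuristic is only guaranteed to be stable under distance-preserving transformations combined with uniform scaling; other manipulations would need a different edge rule.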

Large language models (LLMs) have achieved significant performance in various natural language reasoning tasks. However, they still struggle with performing first-order logic reasoning over formal logical theories expressed in natural language. This is because previous LLM-based reasoning systems suffer from theoretical incompleteness. As a result, they can only address a limited set of simple reasoning problems, which significantly decreases their generalization ability. To address this issue, we propose a novel framework, named Generalizable and Faithful Reasoner (GFaiR), which introduces the paradigm of resolution refutation. Resolution refutation has the capability to solve all first-order logic reasoning problems by extending reasoning rules and employing the principle of proof by contradiction, so our system’s completeness can be improved by introducing resolution refutation. Experimental results demonstrate that our system outperforms previous works by achieving state-of-the-art performance in complex scenarios while maintaining performance in simple scenarios. Besides, we observe that GFaiR is faithful to its reasoning process.
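Resolution refutation itself is a classical proof procedure: add the negation of the goal to the premises and derive new clauses by resolution until the empty clause (a contradiction) appears. The sketch below is a simplified propositional version for illustration only; it is a symbolic toy, not GFaiR, and it omits the unification step that first-order logic requires. The `~` prefix for negation and the example facts are assumptions of this sketch.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set

Clause = FrozenSet[str]


def negate(literal: str) -> str:
    """Flip a literal: 'p' <-> '~p'."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def resolvents(a: Clause, b: Clause) -> Set[Clause]:
    """All clauses obtained by cancelling one complementary literal pair."""
    out = set()
    for literal in a:
        if negate(literal) in b:
            out.add(frozenset((a - {literal}) | (b - {negate(literal)})))
    return out


def entails(premises: Iterable[Iterable[str]], goal: str) -> bool:
    """Proof by contradiction: premises plus the negated goal are unsatisfiable."""
    clauses: Set[Clause] = {frozenset(c) for c in premises}
    clauses.add(frozenset([negate(goal)]))
    while True:
        new: Set[Clause] = set()
        for a, b in combinations(clauses, 2):
            for r in resolvents(a, b):
                if not r:          # empty clause: contradiction found
                    return True
                new.add(r)
        if new <= clauses:         # nothing new can be derived
            return False
        clauses |= new


# "If it rains, the street is wet" is the clause {~rain, wet}.
kb = [["~rain", "wet"], ["rain"]]
print(entails(kb, "wet"), entails(kb, "snow"))
```

The loop terminates because only finitely many clauses can be built from a finite set of literals.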

In textual question answering (TQA) systems, complex questions often require retrieving multiple textual fact chains with multiple reasoning steps, while existing benchmarks are limited to single-chain or single-hop retrieval scenarios. In this paper, we propose Graph-Hop, a novel multi-chain and multi-hop retrieval and reasoning paradigm for complex question answering. We construct a new benchmark called ReasonGraphQA, which provides explicit and fine-grained evidence graphs for complex questions to support comprehensive and detailed reasoning. In order to further study how graph-based evidential reasoning can be performed, we explore what form of Graph-Hop works best for generating textual evidence explanations in knowledge reasoning and question answering. We have thoroughly evaluated existing evidence retrieval and reasoning models on ReasonGraphQA. Experiments highlight that Graph-Hop is a promising direction for answering complex questions, but it still has certain limitations. We have further studied mitigation strategies to meet these challenges and discuss future directions.

Math Word Problem (MWP) solving is a crucial NLP task aimed at providing solutions for given mathematical descriptions. A notable sub-category of MWP is the Linear Programming Word Problem (LPWP), which holds significant relevance in real-world decision-making and operations research. While the recent rise of generative large language models (LLMs) has brought more advanced solutions to LPWPs, existing evaluation methodologies for this task still diverge from human judgment and face challenges in recognizing mathematically equivalent answers. In this paper, we introduce a novel evaluation metric rooted in graph edit distance, featuring benefits such as permutation invariance and more accurate program equivalence identification. Human evaluations empirically validate the superior efficacy of our proposed metric, particularly when assessing LLM-based solutions for LPWP.

Towards Human-Like Machine Comprehension: Few-Shot Relational Learning in Visually-Rich Documents
Hao Wang | Tang Li | Chenhui Chu | Rui Wang | Pinpin Zhu

Key-value relations are prevalent in Visually-Rich Documents (VRDs), often depicted in distinct spatial regions accompanied by specific color and font styles. These non-textual cues serve as important indicators that greatly enhance human comprehension and acquisition of such relation triplets. However, current document AI approaches often fail to consider this valuable prior information related to visual and spatial features, resulting in suboptimal performance, particularly when dealing with limited examples. To address this limitation, our research focuses on few-shot relational learning, specifically targeting the extraction of key-value relation triplets in VRDs. Given the absence of a suitable dataset for this task, we introduce two new few-shot benchmarks built upon existing supervised benchmark datasets. Furthermore, we propose a variational approach that incorporates relational 2D-spatial priors and prototypical rectification techniques. This approach aims to generate relation representations that are more aware of the spatial context and unseen relation in a manner similar to human perception. Experimental results demonstrate the effectiveness of our proposed method by showcasing its ability to outperform existing methods. This study also opens up new possibilities for practical applications.

Large Language Models (LLMs) hold considerable promise for artificial general intelligence, given their intrinsic abilities to accomplish a wide range of open-domain tasks either independently or in tandem with specialized expert models. However, despite these capabilities, the performance of LLMs has yet to be comprehensively evaluated in realistic scenarios. To this end, in this work, we introduce a novel task, the Realistic Chinese Spell Checking (RCSC), to evaluate the effectiveness of existing methods comprehensively. In contrast to existing works that solely address Chinese character misspellings or pinyin conversions, our task aims to convert the realistic Chinese text into the corresponding correct text. The realistic Chinese text may potentially contain both Chinese misspellings and pinyin conversions. We first present the Realistic Chinese Spell Checking Benchmark (RCSCB), which consists of two subsets and contains a total of 581,657 samples. Then, we benchmark the performance of various baselines and find that all the existing methods, including instruction-based LLMs, achieve unsatisfactory results on RCSCB. To further improve the performance on RCSCB, we propose Pinyin-Enhanced Spell Checker (PESC), which is specifically designed to address pinyin-related misspellings. Experimental results demonstrate that PESC can achieve state-of-the-art performance on RCSCB. Despite the progress made, the current state-of-the-art performance is still far from satisfactory. We expect further progress on this crucial and challenging task.

Multi-modal sarcasm detection aims to identify whether a given sample with multi-modal information (i.e., text and image) is sarcastic, which has received increasing attention due to the rapid growth of multi-modal posts on modern social media. However, mainstream models process the input of each modality in a holistic manner, resulting in redundant and unrefined information. Moreover, the representations of different modalities are entangled in one common latent space to perform complex cross-modal interactions, neglecting the heterogeneity and distribution gap of different modalities. To address these issues, we propose a novel framework DMMD (short for Disentangled Multi-grained Multi-modal Distilling) for multi-modal sarcasm detection, which conducts multi-grained knowledge distilling (i.e., intra-subspace and inter-subspace) based on the disentangled multi-modal representations. Concretely, the representations of each modality are disentangled explicitly into modality-agnostic/specific subspaces. Then we transfer cross-modal knowledge by conducting intra-subspace knowledge distilling in a self-adaptive pattern. We also apply mutual learning to regularize the underlying inter-subspace consistency. Extensive experiments on a commonly used benchmark demonstrate the efficacy of our DMMD over cutting-edge methods. More encouragingly, visualization results indicate the multi-modal representations display meaningful distributional patterns, and we hope it will be helpful for the community of multi-modal knowledge transfer.

Towards Realistic Few-Shot Relation Extraction: A New Meta Dataset and Evaluation
Fahmida Alam | Md Asiful Islam | Robert Vacareanu | Mihai Surdeanu

We introduce a meta dataset for few-shot relation extraction, which includes two datasets derived from existing supervised relation extraction datasets – NYT29 (Takanobu et al., 2019; Nayak and Ng, 2020) and WIKIDATA (Sorokin and Gurevych, 2017) – as well as a few-shot form of the TACRED dataset (Sabo et al., 2021). Importantly, all these few-shot datasets were generated under realistic assumptions such as: the test relations are different from any relations a model might have seen before, limited training data, and a preponderance of candidate relation mentions that do not correspond to any of the relations of interest. Using this large resource, we conduct a comprehensive evaluation of six recent few-shot relation extraction methods, and observe that no method comes out as a clear winner. Further, the overall performance on this task is low, indicating substantial need for future research. We release all versions of the data, i.e., both supervised and few-shot, for future research.

Evidence-aware fake news detection aims to determine the veracity of a given news item (i.e., claim) with external evidence. We find that existing methods lack sufficient semantic perception and are easily blinded by textual expressions. For example, they still make the same prediction after we flip the semantics of a claim, which makes them vulnerable to malicious attacks. In this paper, we propose a model-agnostic training framework to improve the semantic perception of evidence-aware fake news detection. Specifically, we first introduce two kinds of data augmentation to complement the original training set with synthetic data. The semantic-flipped augmentation synthesizes claims with similar textual expressions but opposite semantics, while the semantic-invariant augmentation synthesizes claims with the same semantics but different writing styles. Moreover, we design a novel module to learn a better claim representation which is more sensitive to the semantics, and further incorporate it into a multi-objective optimization paradigm. In the experiments, we also extend the original test set of benchmark datasets with the synthetic data to better evaluate the model perception of semantics. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art methods on the extended test set, while achieving competitive performance on the original one. Our source code is released at https://github.com/Xyang1998/RobustFND.

Towards Robust In-Context Learning for Machine Translation with Large Language Models
Shaolin Zhu | Menglong Cui | Deyi Xiong

Using large language models (LLMs) for machine translation via in-context learning (ICL) has become an interesting research direction of machine translation (MT) in recent years. Its main idea is to retrieve a few translation pairs as demonstrations from an additional datastore (parallel corpus) to guide translation without updating the LLMs. However, the underlying noise of retrieved demonstrations usually dramatically deteriorates the performance of LLMs. In this paper, we propose a method that enables LLMs to achieve robust translation with ICL. The method incorporates a multi-view approach, considering both sentence- and word-level information, to select demonstrations that effectively avoid noise. At the sentence level, a margin-based score is designed to avoid semantic noise. At the word level, word embeddings are utilized to evaluate the related tokens and change the weight of words in demonstrations. By considering both sentence- and word-level similarity, the proposed method provides fine-grained demonstrations that effectively prompt the translation of LLMs. Experimental results demonstrate the effectiveness of our method, particularly in domain adaptation.

This paper addresses the task of temporal activity localization (TAL). Although recent works have made significant progress in TAL research, almost all of them implicitly assume that the dense frame-level correspondences in each video-query pair are correctly annotated. However, in reality, such an assumption is extremely expensive and even impossible to satisfy due to subjective labeling. To alleviate this issue, in this paper, we explore a new TAL setting termed Noisy Temporal Activity Localization (NTAL), where a TAL model should be robust to the mixed training data with noisy moment boundaries. Inspired by the memorization effect of neural networks, we propose a novel method called Co-Teaching Regularizer (CTR) for NTAL. Specifically, we first learn a Gaussian Mixture Model to divide the mixed training data into preliminary clean and noisy subsets. Subsequently, we refine the labels of the two subsets by an adaptive prediction function so that their true positive and false positive samples can be identified. To avoid a single model being prone to the mistakes it learns from the mixed data, we adopt a co-teaching paradigm, which utilizes two models sharing the same framework to teach each other for robust learning. A curriculum strategy is further introduced to gradually learn the moment confidence from easy to hard. Experiments on three datasets demonstrate that our CTR is significantly more robust to the noisy training data compared to the existing methods.
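The first step described above, splitting training data into preliminary clean and noisy subsets with a Gaussian Mixture Model, can be sketched in a few lines. The sketch assumes that the mixture is fitted to per-sample training losses and that the low-mean component is the "clean" one; this is a common setup in noisy-label learning, but the abstract does not say which quantity CTR actually models. The loss values are invented.

```python
import math
from typing import List, Tuple


def clean_posteriors(losses: List[float], iters: int = 50) -> List[float]:
    """Fit a 2-component 1-D Gaussian mixture with EM.

    Returns, for each sample, the posterior probability of belonging to
    the low-loss component (component 0, initialised at the minimum).
    """
    n = len(losses)
    mean = sum(losses) / n
    var0 = max(sum((x - mean) ** 2 for x in losses) / n, 1e-6)
    mu, var, pi = [min(losses), max(losses)], [var0, var0], [0.5, 0.5]
    resp: List[float] = []
    for _ in range(iters):
        # E-step: responsibility of component 0 for each sample.
        resp = []
        for x in losses:
            dens = [
                pi[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            resp.append(dens[0] / (sum(dens) or 1e-300))
        # M-step: re-estimate weights, means, and (floored) variances.
        for k in range(2):
            w = [r if k == 0 else 1.0 - r for r in resp]
            wk = sum(w) or 1e-300
            mu[k] = sum(wi * x for wi, x in zip(w, losses)) / wk
            var[k] = max(
                sum(wi * (x - mu[k]) ** 2 for wi, x in zip(w, losses)) / wk, 1e-6
            )
            pi[k] = wk / n
    return resp


def split_clean_noisy(losses: List[float]) -> Tuple[List[int], List[int]]:
    """Indices with clean posterior above 0.5, and the remaining indices."""
    post = clean_posteriors(losses)
    clean = [i for i, p in enumerate(post) if p > 0.5]
    noisy = [i for i, p in enumerate(post) if p <= 0.5]
    return clean, noisy


# Hypothetical per-sample losses: four small, four large.
print(split_clean_noisy([0.10, 0.12, 0.09, 0.11, 2.0, 2.1, 1.9, 2.2]))
```

In practice a library mixture model would replace the hand-written EM loop; it is spelled out here only to keep the example dependency-free.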

Towards Semantic Tagging for Irish
Tim Czerniak | Elaine Uí Dhonnchadha

Well-annotated corpora have been shown to have great value, both in linguistic and non-linguistic research, and in supporting machine learning and many other non-research activities including language teaching. For minority languages, annotated corpora can help in understanding language usage norms among native and non-native speakers, providing valuable information both for lexicography and for teaching, and helping to combat the decline of speaker numbers. At the same time, minority languages suffer from having fewer available language resources than majority languages, and far less-developed annotation tooling. To date there is very little work in semantic annotation for Irish. In this paper we report on progress to date in the building of a standard tool-set for semantic annotation of Irish, including a novel method for evaluation of semantic annotation. A small corpus of Irish language data has been manually annotated with semantic tags, and manually checked. A semantic type tagging framework has then been developed using existing technologies, and using a semantic lexicon that has been built from a variety of sources. Semantic disambiguation methods have been added with a view to increasing accuracy. That framework has then been tested using the manually tagged corpus, resulting in over 90% lexical coverage and almost 80% tag accuracy. Development is ongoing as part of a larger corpus development project, and plans include expansion of the manually tagged corpus, expansion of the lexicon, and exploration of further disambiguation methods. As this is, to our knowledge, the first semantic tagger for Irish, it is hoped that this research will form a sound basis for semantic annotation of Irish corpora into the future.

Towards Standardized Annotation and Parsing for Korean FrameNet
Yige Chen | Jae Ihn | KyungTae Lim | Jungyeul Park

Previous research on Korean FrameNet has produced several datasets that serve as resources for FrameNet parsing in Korean. However, these datasets suffer from the problem that annotations are assigned on the word level, which is not optimally designed based on the agglutinative feature of Korean. To address this issue, we introduce a morphologically enhanced annotation strategy for Korean FrameNet datasets and parsing by leveraging the CoNLL-U format. We present the results of the FrameNet parsers trained on the Korean FrameNet data in the original format and our proposed format, respectively, and further elaborate on the linguistic rationales of our proposed scheme. We suggest the morpheme-based scheme to be the standard of Korean FrameNet data annotation.

Towards the WhAP Corpus: A Resource for the Study of Italian on WhatsApp
Ilaria Fiorentini | Marco Forlano | Nicholas Nese

Over the past two decades, the rise of new technologies and social networks has significantly shaped written language, imbuing it with characteristics akin to the spoken language. This study reports on the ongoing initiative to build the WhAP corpus, a resource featuring WhatsApp conversations in Italian, encompassing both written and spoken messages and totaling at present more than 400,000 tokens, 89 conversations, and 194 participants from diverse age groups and geographical regions of Italy. More specifically, this paper focuses on the practical steps involved in the construction of the resource. Once publicly accessible, the WhAP Corpus will enable in-depth linguistic research on the language used on WhatsApp, which shows unique features such as the blending of written and spoken elements.

Towards Understanding the Relationship between In-context Learning and Compositional Generalization
Sungjun Han | Sebastian Padó

According to the principle of compositional generalization, the meaning of a complex expression can be understood as a function of the meaning of its parts and of how they are combined. This principle is crucial for human language processing and also, arguably, for NLP models in the face of out-of-distribution data. However, many neural network models, including Transformers, have been shown to struggle with compositional generalization. In this paper, we hypothesize that forcing models to in-context learn can provide an inductive bias to promote compositional generalization. To test this hypothesis, we train a causal Transformer in a setting that renders ‘ordinary’ learning very difficult: we present it with different orderings of the training instance and shuffle instance labels. This corresponds to training the model on all possible few-shot learning problems attainable from the dataset. The model can solve the task, however, by utilizing earlier examples to generalize to later ones – i.e., in-context learning. In evaluations on the datasets, SCAN, COGS, and GeoQuery, models trained in this manner indeed show improved compositional generalization. This indicates the usefulness of in-context learning problems as an inductive bias for generalization.

Towards Universal Dependencies for Ancash Quechua
Johanna Cordova

This paper presents a brief description of some morphosyntactic features of Ancash Quechua, the majority variety of the Central Quechua language family (QI), for the purpose of building a corpus annotated according to the Universal Dependencies (UD) schema. The creation of such a corpus has two objectives. For Quechua linguistics, it opens up the possibility of more systematic linguistic studies and comparisons with other languages; it also enables the development of a syntactic parser, which would be the first NLP tool for a Quechua language of this family. For the UD project, adding Quechua, an agglutinative language with a rich morphology, makes it possible to point out some possible shortcomings of the universal annotation schema, and to fuel the discussion on adapting this schema to the specific features of languages with a similar typology. The first step towards this work was to gather and digitise the available linguistic resources, thus creating the first bilingual and sentence-aligned digital corpus in Ancash Quechua and Spanish. After identifying some linguistic features not fully described in the UD schema, we proposed annotation solutions, and built an initial corpus of around twenty sentences, which we are making freely available.

In this paper, we introduce an innovative pre-training framework TP-Link, which aims to improve context-dependent Text-to-SQL Parsing by leveraging Linking information. This enhancement is achieved through better representation of both natural language utterances and the database schema, ultimately facilitating more effective text-to-SQL conversations. We present two novel pre-training objectives: (i) utterance linking prediction (ULP) task that models intricate syntactic relationships among natural language utterances in context-dependent text-to-SQL scenarios, and (ii) schema linking prediction (SLP) task that focuses on capturing fine-grained schema linking relationships between the utterances and the database schema. Extensive experiments demonstrate that our proposed TP-Link achieves state-of-the-art performance on two leading downstream benchmarks (i.e., SParC and CoSQL).

Training BERT Models to Carry over a Coding System Developed on One Corpus to Another
Dalma Galambos | Pal Zsamboki

This paper describes how we train BERT models to carry over a coding system developed on the paragraphs of a Hungarian literary journal to another. The aim of the coding system is to track trends in the perception of literary translation around the political transformation in 1989 in Hungary. To evaluate not only task performance but also the consistency of the annotation, and to get better predictions from an ensemble, we use 10-fold cross-validation. Extensive hyperparameter tuning is used to obtain the best possible results and fair comparisons. To handle label imbalance, we use loss functions and metrics robust to it. Evaluation of the effect of domain shift is carried out by sampling a test set from the target domain. We establish the sample size by estimating the bootstrapped confidence interval via simulations. This way, we show that our models can carry over one annotation system to the target domain. Comparisons are drawn to provide insights, such as that learning multilabel correlations and applying a confidence penalty improve resistance to domain shift, and that domain adaptation on OCR-ed text on another domain improves performance almost to the same extent as that on the corpus under study. See our code at https://codeberg.org/zsamboki/bert-annotator-ensemble
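The step of sizing a test set via a bootstrapped confidence interval can be illustrated with a minimal sketch. The abstract does not say which bootstrap variant or which metric the authors use; the percentile bootstrap over plain accuracy below, and the names `bootstrap_ci` and `ci_width_for_sample_size`, are assumptions made purely for illustration.

```python
import random


def bootstrap_ci(correct, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 flags, one per test item (hypothetical input).
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and record its accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def ci_width_for_sample_size(n, assumed_accuracy=0.8, seed=0):
    """Simulate a test set of size n and return the width of its interval.

    `assumed_accuracy` is an illustrative guess, not a value from the paper.
    """
    rng = random.Random(seed)
    simulated = [1 if rng.random() < assumed_accuracy else 0 for _ in range(n)]
    lo, hi = bootstrap_ci(simulated, seed=seed)
    return hi - lo
```

Calling `ci_width_for_sample_size` for a range of candidate sizes shows how the interval narrows as the sample grows, which is one way to pick a size that meets a target precision.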

TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills
Qiushi Sun | Nuo Chen | Jianing Wang | Ming Gao | Xiang Li

Code pre-trained models (CodePTMs) have recently demonstrated a solid capacity to process various code intelligence tasks, e.g., code clone detection, code translation, and code summarization. The current mainstream method that deploys these models to downstream tasks is to fine-tune them on individual tasks, which is generally costly and needs sufficient data for large models. To tackle the issue, in this paper, we present TransCoder, a unified Transferable fine-tuning strategy for Code representation learning. Inspired by human inherent skills of knowledge generalization, TransCoder drives the model to learn better code-related knowledge like human programmers. Specifically, we employ a tunable prefix encoder to first capture cross-task and cross-language transferable knowledge, subsequently applying the acquired knowledge for optimized downstream adaptation. Besides, our approach confers benefits for tasks with minor training sample sizes and languages with smaller corpora, underscoring versatility and efficacy. Extensive experiments conducted on representative datasets clearly demonstrate that our method can lead to superior performance on various code-related tasks and encourage mutual reinforcement, especially in low-resource scenarios. Our codes are available at https://github.com/QiushiSun/TransCoder.

TransERR: Translation-based Knowledge Graph Embedding via Efficient Relation Rotation
Jiang Li | Xiangdong Su | Fujun Zhang | Guanglai Gao

This paper presents a translation-based knowledge graph embedding method via efficient relation rotation (TransERR), a straightforward yet effective alternative to traditional translation-based knowledge graph embedding models. Different from the previous translation-based models, TransERR encodes knowledge graphs in the hypercomplex-valued space, thus enabling it to possess a higher degree of translation freedom in mining latent information between the head and tail entities. To further minimize the translation distance, TransERR adaptively rotates the head entity and the tail entity with their corresponding unit quaternions, which are learnable in model training. We also provide mathematical proofs to demonstrate the ability of TransERR in modeling various relation patterns, including symmetry, antisymmetry, inversion, composition, and subrelation patterns. The experiments on 10 benchmark datasets validate the effectiveness and the generalization of TransERR. The results also indicate that TransERR can better encode large-scale datasets with fewer parameters than the previous translation-based models. Our code and datasets are available at https://github.com/dellixx/TransERR.
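To make the idea of rotating head and tail entities with unit quaternions before measuring a translation distance concrete, here is a minimal sketch. The abstract does not give the scoring function, so the form used here (rotated head plus relation translation minus rotated tail) is one plausible reading and an assumption; the function names are hypothetical, and each embedding is reduced to a single quaternion for brevity.

```python
import math


def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def normalize(q):
    """Scale a quaternion to unit norm so that it acts as a pure rotation."""
    norm = math.sqrt(sum(x * x for x in q))
    return tuple(x / norm for x in q)


def translation_distance(head, rot_head, relation, tail, rot_tail):
    """Distance between the rotated, translated head and the rotated tail.

    Assumed form: || head * rot_head + relation - tail * rot_tail ||.
    """
    rotated_head = hamilton(head, normalize(rot_head))
    rotated_tail = hamilton(tail, normalize(rot_tail))
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(rotated_head, relation, rotated_tail))
    )
```

With identity rotations the distance reduces to the familiar translation distance between head plus relation and tail, which is why a model of this shape can be seen as a generalization of plain translation-based scoring.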

Transfer Fine-tuning for Quality Estimation of Text Simplification
Yuki Hironaka | Tomoyuki Kajiwara | Takashi Ninomiya

To efficiently train quality estimation of text simplification on a small-scale labeled corpus, we train sentence difficulty estimation prior to fine-tuning the pre-trained language models. Our proposed method improves the quality estimation of text simplification in the framework of transfer fine-tuning, in which pre-trained language models can improve the performance of the target task by additional training on the relevant task prior to fine-tuning. Since the labeled corpus for quality estimation of text simplification is small (600 sentence pairs), an efficient training method is desired. Therefore, we propose a training method for pseudo quality estimation that does not require labels for quality estimation. As a relevant task for quality estimation of text simplification, we train the estimation of sentence difficulty. This is a binary classification task that identifies which sentence is simpler using an existing parallel corpus for text simplification. Experimental results on quality estimation of English text simplification showed that not only the quality estimation performance on simplicity that was trained, but also the quality estimation performance on fluency and meaning preservation could be improved in some cases.
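The relevant task described above, deciding which of two sentences is simpler using an existing parallel simplification corpus, needs no quality labels because the corpus itself tells us which side is the simplified one. A minimal sketch of building such training examples follows; the function name, the random ordering, and the label convention (the label is the index of the simpler sentence) are assumptions for illustration, not details taken from the paper.

```python
import random


def make_difficulty_pairs(parallel_corpus, seed=0):
    """Turn (complex, simple) sentence pairs into binary classification examples.

    Each example is (sentence_a, sentence_b, label), where label is the index
    (0 or 1) of the simpler sentence. The order is randomized so that a model
    cannot rely on position alone.
    """
    rng = random.Random(seed)
    examples = []
    for complex_sent, simple_sent in parallel_corpus:
        if rng.random() < 0.5:
            examples.append((complex_sent, simple_sent, 1))
        else:
            examples.append((simple_sent, complex_sent, 0))
    return examples
```

A pre-trained model trained on such examples first could then be fine-tuned on the small labeled quality estimation corpus, which is the transfer fine-tuning setup the abstract describes.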

Transferring BERT Capabilities from High-Resource to Low-Resource Languages Using Vocabulary Matching
Piotr Rybak

Pre-trained language models have revolutionized the natural language understanding landscape, most notably BERT (Bidirectional Encoder Representations from Transformers). However, a significant challenge remains for low-resource languages, where limited data hinders the effective training of such models. This work presents a novel approach to bridge this gap by transferring BERT capabilities from high-resource to low-resource languages using vocabulary matching. We conduct experiments on the Silesian and Kashubian languages and demonstrate the effectiveness of our approach to improve the performance of BERT models even when the target language has minimal training data. Our results highlight the potential of the proposed technique to effectively train BERT models for low-resource languages, thus democratizing access to advanced language understanding models.

Transformer-based Joint Modelling for Automatic Essay Scoring and Off-Topic Detection
Sourya Dipta Das | Yash A. Vadi | Kuldeep Yadav

Automated Essay Scoring (AES) systems are widely popular in the market as they constitute a cost-effective and time-effective option for grading systems. Nevertheless, many studies have demonstrated that the AES system fails to assign lower grades to irrelevant responses. Thus, detecting the off-topic response in automated essay scoring is crucial in practical tasks where candidates write unrelated text responses to the given task in the question. In this paper, we are proposing an unsupervised technique that jointly scores essays and detects off-topic essays. The proposed Automated Open Essay Scoring (AOES) model uses a novel topic regularization module (TRM), which can be attached on top of a transformer model, and is trained using a proposed hybrid loss function. After training, the AOES model is further used to calculate the Mahalanobis distance score for off-topic essay detection. Our proposed method outperforms the baseline we created and earlier conventional methods on two essay-scoring datasets in off-topic detection as well as on-topic scoring. Experimental evaluation results on different adversarial strategies also show how the suggested method is robust for detecting possible human-level perturbations.

Transformer-based Swedish Semantic Role Labeling through Transfer Learning
Dana Dannélls | Richard Johansson | Lucy Yang Buhr

Semantic Role Labeling (SRL) is a task in natural language understanding where the goal is to extract semantic roles for a given sentence. English SRL has achieved state-of-the-art performance using Transformer techniques and supervised learning. However, this technique is not a viable choice for smaller languages like Swedish due to the limited amount of training data. In this paper, we present the first effort in building a Transformer-based SRL system for Swedish by exploring multilingual and cross-lingual transfer learning methods and leveraging the Swedish FrameNet resource. We demonstrate that multilingual transfer learning outperforms two different cross-lingual transfer models. We also found some differences between frames in FrameNet that can either hinder or enhance the model’s performance. The resulting end-to-end model is freely available and will be made accessible through Språkbanken Text’s research infrastructure.

Transformers for Bridging Persian Dialects: Transliteration Model for Tajiki and Iranian Scripts
MohammadAli SadraeiJavaheri | Ehsaneddin Asgari | Hamid Reza Rabiee

In this study, we address the linguistic challenges posed by Tajiki Persian, a distinct variant of the Persian language that utilizes the Cyrillic script due to historical “Russification”. This distinguishes it from other Persian dialects that adopt the Arabic script. Despite its profound linguistic and cultural significance, Tajiki Persian remains a low-resource language with scant digitized datasets for computational applications. To address this deficiency, we created a parallel corpus using Shahnameh, a seminal Persian epic poem. Employing optical character recognition, we extracted Tajiki Persian verses from primary sources and applied a heuristic method to align them with their Iranian Persian counterparts. We then trained and assessed transliteration models using two prominent sequence-to-sequence architectures: GRU with attention and transformer. Our results underscore the enhanced performance of our models, particularly in contrast to pre-trained large multilingual models like GPT-3.5, emphasizing the value of dedicated datasets in advancing computational approaches for underrepresented languages. With the publication of this work, we are disseminating, for the first time, a vast collection of Persian poetry spanning 1000 years, transcribed in Tajiki scripts for the benefit of the Tajiki-speaking communities. The dataset, along with the model’s code and checkpoints, is accessible at https://github.com/language-ml/Tajiki-Shahname, marking a significant contribution to computational linguistic resources for Tajiki Persian.

Training large language models (LLMs) with open-domain instruction data has yielded remarkable success in aligning to end tasks and human preferences. Extensive research has highlighted the importance of the quality and diversity of instruction data. However, the impact of data complexity, as a crucial metric, remains relatively unexplored from three aspects: (1) where the sustainability of performance improvements with increasing complexity is uncertain; (2) whether the improvement brought by complexity merely comes from introducing more training tokens; and (3) where the potential benefits of incorporating instructions from easy to difficult are not yet fully understood. In this paper, we propose Tree-Instruct to systematically enhance the instruction complexity in a controllable manner. By adding a specified number of nodes to instructions’ semantic trees, this approach not only yields new instruction data from the modified tree but also allows us to control the difficulty level of modified instructions. Our preliminary experiments reveal the following insights: (1) Increasing complexity consistently leads to sustained performance improvements of LLMs. (2) Under the same token budget, a few complex instructions outperform diverse yet simple instructions. (3) Curriculum instruction tuning might not yield the anticipated results; focusing on increasing complexity appears to be the key.

KEPLMs are pre-trained models that utilize external knowledge to enhance language understanding. Previous language models facilitated knowledge acquisition by incorporating knowledge-related pre-training tasks learned from relation triples in knowledge graphs. However, these models do not prioritize learning embeddings for entity-related tokens. Updating all parameters in KEPLM is computationally demanding. This paper introduces TRELM, a Robust and Efficient Pre-training framework for Knowledge-Enhanced Language Models. We observe that text corpora contain entities that follow a long-tail distribution, where some are suboptimally optimized and hinder the pre-training process. To tackle this, we employ a robust approach to inject knowledge triples and employ a knowledge-augmented memory bank to capture valuable information. Moreover, updating a small subset of neurons in the feed-forward networks (FFNs) that store factual knowledge is both sufficient and efficient. Specifically, we utilize dynamic knowledge routing to identify knowledge paths in FFNs and selectively update parameters during pre-training. Experimental results show that TRELM achieves at least a 50% reduction in pre-training time and outperforms other KEPLMs in knowledge probing tasks and multiple knowledge-aware language understanding tasks.

Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks
Abhinav Sukumar Rao | Atharva Roshan Naik | Sachin Vashistha | Somak Aditya | Monojit Choudhury

Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating their prompts; resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. Limited studies have been conducted to formalize and analyze these attacks and their mitigations. We bridge this gap by proposing a formalism and a taxonomy of known (and possible) jailbreaks. We survey existing jailbreak methods and their effectiveness on open-source and commercial LLMs (such as GPT-based models, OPT, BLOOM, and FLAN-T5-XXL). We further discuss the challenges of jailbreak detection in terms of their effectiveness against known attacks. For further analysis, we release a dataset of model outputs across 3700 jailbreak prompts over 4 tasks.

Triple-R: Automatic Reasoning for Fact Verification Using Language Models
Mohammadamin Kanaani | Sajjad Dadkhah | Ali A. Ghorbani

The rise of online social media platforms has made them a popular source of news. However, they are also prone to misinformation and fake news. To combat this, fact-checking is essential to verify the accuracy of claims made on these platforms. However, the existing methods in this field often lack the use of external sources and human-understandable explanations for system decisions. In this paper, we introduce a framework called Triple-R (Retriever, Ranker, Reasoner) that addresses these challenges. The framework uses the Web as an external knowledge source to retrieve relevant evidence for claims and includes a method to generate reasons based on the retrieved evidence for datasets lacking explanations. We then use this modified dataset to fine-tune a causal language model that generates natural language explanations and labels for pairs of retrieved evidence and claims. Our approach aims to improve the transparency and interpretability of fact-checking systems by providing understandable explanations for decision-making processes. We evaluated our method on a popular dataset and demonstrated its performance through an ablation study. The modified dataset is available on the Canadian Institute for Cybersecurity datasets webpage at https://www.unb.ca/cic/datasets/index.html.

Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation
Francois Meyer | Jan Buys

Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data. This enables future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods - dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for 2 agglutinative languages (isiXhosa and Finnish). We investigate pretrained solutions for T2X, which reveals that standard PLMs come up short. Fine-tuning machine translation models emerges as the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.

Trustworthiness and Self-awareness in Large Language Models: An Exploration through the Think-Solve-Verify Framework
Zhendong Liu | Changhong Xia | Wei He | Chongjun Wang

As Large Language Models (LLMs) become increasingly influential in reasoning tasks, ensuring their trustworthiness and introspective self-awareness is critical. This research introduces the Think-Solve-Verify (TSV) framework, an innovative strategy tailored to explore LLMs’ trustworthiness, introspective self-awareness, and collaborative reasoning. This method accentuates a model’s capability to construct introspective reasoning processes from answers and ensure their trustworthiness. The reasoning with TSV consistently performs at or near the top across the majority of datasets with a single interaction with LLM. Moreover, we refine the voting process of self-consistency within the Chain-of-Thought (CoT) approach, leading to notable accuracy enhancements. In our evaluations, this approach improved performance from 67.3% to 72.8% on the AQuA dataset. Furthermore, we delve into the model’s ability to explain the given answers, highlighting the significance of discerning genuine comprehension from mere guesswork.
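For readers unfamiliar with self-consistency voting, the baseline it starts from can be sketched in a few lines: sample several reasoning chains, extract each final answer, and return the most frequent one. The abstract does not describe how the authors refine this vote, so the sketch shows only the plain majority-vote baseline; `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled reasoning chains.

    `answers` holds one extracted answer per sampled chain; None marks a chain
    from which no answer could be parsed and is ignored in the vote.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For example, `majority_vote(["B", "A", "B", "C", "B"])` selects "B", the answer supported by most of the sampled chains.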

Retrieval-augmented language models (RALMs) have demonstrated significant potential in refining and expanding their internal memory by retrieving evidence from external sources. However, RALMs will inevitably encounter knowledge conflicts when integrating their internal memory with external sources. Knowledge conflicts can ensnare RALMs in a tug-of-war between knowledge, limiting their practical applicability. In this paper, we focus on exploring and resolving knowledge conflicts in RALMs. First, we present an evaluation framework for assessing knowledge conflicts across various dimensions. Then, we investigate the behavior and preference of RALMs from the following two perspectives: (1) Conflicts between internal memory and external sources: We find that stronger RALMs emerge with the Dunning-Kruger effect, persistently favoring their faulty internal memory even when correct evidence is provided. Besides, RALMs exhibit an availability bias towards common knowledge; (2) Conflicts between truthful, irrelevant and misleading evidence: We reveal that RALMs follow the principle of majority rule, leaning towards placing trust in evidence that appears more frequently. Moreover, we find that RALMs exhibit confirmation bias, and are more willing to choose evidence that is consistent with their internal memory. To solve the challenge of knowledge conflicts, we propose a method called Conflict-Disentangle Contrastive Decoding (CD2) to better calibrate the model’s confidence. Experimental results demonstrate that our CD2 can effectively resolve knowledge conflicts in RALMs.

TunArTTS: Tunisian Arabic Text-To-Speech Corpus
Imen Laouirine | Rami Kammoun | Fethi Bougares

Being labeled as a low-resource language, the Tunisian dialect has no existing prior TTS research. In this paper, we present a speech corpus for Tunisian Arabic Text-to-Speech (TunArTTS) to initiate the development of end-to-end TTS systems for the Tunisian dialect. Our speech corpus is extracted from an online English and Tunisian Arabic dictionary. We were able to extract a mono-speaker speech corpus of more than 3 hours from a male speaker, sampled at 44.1 kHz. The corpus is processed and manually diacritized. Furthermore, we develop various TTS systems based on two approaches: training from scratch and transfer learning. Both Tacotron2 and FastSpeech2 were used and evaluated using subjective and objective metrics. The experimental results show that our best results are obtained with transfer learning from a model pre-trained on the English LJSpeech dataset. This model obtained a mean opinion score (MOS) of 3.88. TunArTTS will be publicly available for research purposes along with the baseline TTS system demo. Keywords: Tunisian Dialect, Text-To-Speech, Low-resource, Transfer Learning, TunArTTS

TweetTER: A Benchmark for Target Entity Retrieval on Twitter without Knowledge Bases
Kiamehr Rezaee | Jose Camacho-Collados | Mohammad Taher Pilehvar

Entity linking is a well-established task in NLP consisting of associating entity mentions with entries in a knowledge base. Current models have demonstrated competitive performance in standard text settings. However, when it comes to noisy domains such as social media, certain challenges still persist. Typically, to evaluate entity linking on existing benchmarks, a comprehensive knowledge base is necessary and models are expected to possess an understanding of all the entities contained within the knowledge base. However, in practical scenarios where the objective is to retrieve sentences specifically related to a particular entity, strict adherence to a complete understanding of all entities in the knowledge base may not be necessary. To address this gap, we introduce TweetTER (Tweet Target Entity Retrieval), a novel benchmark that aims to bridge the challenges in entity linking. The distinguishing feature of this benchmark is its approach of re-framing entity linking as a binary entity retrieval task. This enables the evaluation of language models’ performance without relying on a conventional knowledge base, providing a more practical and versatile evaluation framework for assessing the effectiveness of language models in entity retrieval tasks.

Two Counterexamples to Tokenization and the Noiseless Channel
Marco Cognetta | Vilém Zouhar | Sangwhan Moon | Naoaki Okazaki

In Tokenization and the Noiseless Channel (Zouhar et al., 2023), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen. The Rényi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task), without the expensive step of training multiple models with different tokenizers. Although useful, the predictive power of this metric is not perfect, and the authors note there are additional qualities of a good tokenization scheme that Rényi efficiency alone cannot capture. We describe two variants of BPE tokenization which can arbitrarily increase Rényi efficiency while decreasing the downstream model performance. These counterexamples expose cases where Rényi efficiency fails as an intrinsic tokenization metric and thus give insight for building more accurate predictors.
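As a rough illustration of the metric under discussion, the Rényi efficiency of a tokenized corpus can be computed as the Rényi entropy of the unigram token distribution divided by the log of the vocabulary size. The sketch below makes several assumptions that are not stated in the abstract: the default order `alpha=2.5` is only an illustrative choice, the vocabulary size defaults to the number of distinct observed tokens, and the function names are hypothetical.

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha):
    """Rényi entropy of order alpha (natural log); alpha = 1 gives Shannon entropy."""
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


def renyi_efficiency(tokens, alpha=2.5, vocab_size=None):
    """Rényi entropy of the unigram distribution, normalized by log vocabulary size.

    `vocab_size` defaults to the number of distinct tokens seen (an assumption);
    pass the tokenizer's real vocabulary size when it is known.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    size = vocab_size if vocab_size is not None else len(counts)
    if size < 2:
        raise ValueError("vocabulary must contain at least two tokens")
    return renyi_entropy(probs, alpha) / math.log(size)
```

A perfectly uniform unigram distribution has efficiency 1, and heavily skewed distributions score lower. A tokenizer change that flattens the distribution therefore raises the score whether or not it helps the downstream model, which is the kind of gap the counterexamples exploit.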

Typos Correction Training against Misspellings from Text-to-Text Transformers
Guicai Xie | Ke Zhang | Lei Duan | Wei Zhang | Zeqian Huang

Dense retrieval (DR) has become a mainstream approach to information seeking, where a system is required to return relevant information to a user query. In real-life applications, typoed queries resulting from the users’ mistyping words or phonetic typing errors exist widely in search behaviors. Current dense retrievers experience a significant drop in retrieval effectiveness when they encounter typoed queries. Therefore, the search system requires the extra introduction of spell-checkers to deal with typos and then applies the DR model to perform robust matching. Herein, we argue that directly conducting the typos correction training would be beneficial to make an end-to-end retriever against misspellings. To this end, we propose a novel approach that can facilitate the incorporation of the spelling correction objective into the DR model using the encoder-decoder architecture. During typos correction training, we also develop a prompt-based augmentation technique to enhance the DR space alignment of the typoed query and its original query. Extensive experiments demonstrate that the effectiveness of our proposed end-to-end retriever significantly outperforms existing typos-aware training approaches and sophisticated training advanced retrievers. Our code is available at https://github.com/striver314/ToCoTR.

The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements—for example, interrogative sentences with special markers and/or word orders—are not labeled holistically. We argue for (i) augmenting UD annotations with a ‘UCxn’ annotation layer for such meaning-bearing grammatical constructions, and (ii) approaching this in a typologically informed way so that morphosyntactic strategies can be compared across languages. As a case study, we consider five construction families in ten languages, identifying instances of each construction in UD treebanks through the use of morphosyntactic patterns. In addition to findings regarding these particular constructions, our study yields important insights on methodology for describing and identifying constructions in language-general and language-particular ways, and lays the foundation for future constructional enrichment of UD treebanks.

UDMorph: Morphosyntactically Tagged UD Corpora
Maarten Janssen

UDMorph provides an infrastructure parallel to that provided by UD for annotated corpus data that follow the UD guidelines, but do not provide dependency relations: a place where new annotated data-sets can be deposited, and existing data-sets can be found and downloaded. It also provides a corpus creation environment to easily create annotated data for additional languages. And it provides a REST and GUI interface to a growing collection of taggers with a CoNLL-U output, currently for around 150 different languages.

UkraiNER: A New Corpus and Annotation Scheme towards Comprehensive Entity Recognition
Lauriane Aufrant | Lucie Chasseur

Named entity recognition as it is traditionally envisioned excludes in practice a significant part of the entities of potential interest for real-world applications: nested, discontinuous, non-named entities. Despite various attempts to broaden their coverage, subsequent annotation schemes have achieved little adoption in the literature and the most restrictive variant of NER remains the default. This is partly due to the complexity of those annotations and their format. In this paper, we introduce a new annotation scheme that offers higher comprehensiveness while preserving simplicity, together with an annotation tool to implement that scheme. We also release the corpus UkraiNER, comprised of 10,000 French sentences in the geopolitical news domain and manually annotated with comprehensive entity recognition. Our baseline experiments on UkraiNER provide a first point of comparison to facilitate future research (82 F1 for comprehensive entity recognition, 87 F1 when focusing on traditional nested NER), as well as various insights on the composition and challenges that this corpus presents for state-of-the-art named entity recognition models.

UMTIT: Unifying Recognition, Translation, and Generation for Multimodal Text Image Translation
Liqiang Niu | Fandong Meng | Jie Zhou

Prior research in Image Machine Translation (IMT) has focused on either translating the source image solely into the target language text or exclusively into the target image. As a result, the former approach lacked the capacity to generate target images, while the latter was insufficient in producing target text. In this paper, we present a Unified Multimodal Text Image Translation (UMTIT) model that not only translates text images into the target language but also generates consistent target images. The UMTIT model consists of two image-text modality conversion steps: the first step converts images to text to recognize the source text and generate translations, while the second step transforms text to images to create target images based on the translations. Due to the limited availability of public datasets, we have constructed two multimodal image translation datasets. Experimental results show that our UMTIT model is versatile enough to handle tasks across multiple modalities and outperforms previous methods. Notably, UMTIT surpasses the state-of-the-art TrOCR in text recognition tasks, achieving a lower Character Error Rate (CER); it also outperforms cascading methods in text translation tasks, obtaining a higher BLEU score; and, most importantly, UMTIT can generate high-quality target text images.

Uncertainty-Aware Cross-Modal Alignment for Hate Speech Detection
Chuanpeng Yang | Fuqing Zhu | Yaxin Liu | Jizhong Han | Songlin Hu

Hate speech detection has become an urgent task with the emergence of huge amounts of multimodal harmful content (e.g., memes) on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from memes. However, these methods ignore two key points: 1) the misalignment of image and text in memes caused by the modality gap, and 2) the uncertainty between modalities caused by the contribution degree of each modality to hate sentiment. To this end, this paper proposes an uncertainty-aware cross-modal alignment (UCA) framework for modeling the misalignment and uncertainty in multimodal hate speech detection. Specifically, we first utilize the cross-modal feature encoder to capture image and text feature representations in memes. Then, a cross-modal alignment module is applied to reduce semantic gaps between modalities by aligning the feature representations. Next, a cross-modal fusion module is designed to learn semantic interactions between modalities to capture cross-modal correlations, providing complementary features for memes. Finally, a cross-modal uncertainty learning module is proposed, which evaluates the divergence between unimodal feature distributions to balance unimodal and cross-modal fusion features. Extensive experiments on five publicly available datasets show that the proposed UCA produces a competitive performance compared with the existing multimodal hate speech detection methods.

Uncovering Agendas: A Novel French & English Dataset for Agenda Detection on Social Media
Gregorios Katsios | Ning Sa | Ankita Bhaumik | Tomek Strzalkowski

The behavior and decision making of groups or communities can be dramatically influenced by individuals pushing particular agendas, e.g., to promote or disparage a person or an activity, to call for action, etc. In the examination of online influence campaigns, particularly those related to important political and social events, scholars often concentrate on identifying the sources responsible for setting and controlling the agenda (e.g., public media). In this article we present a methodology for detecting specific instances of agenda control through social media where annotated data is limited or non-existent. By using a modest corpus of Twitter messages centered on the 2022 French Presidential Elections, we carry out a comprehensive evaluation of various approaches and techniques that can be applied to this problem. Our findings demonstrate that by treating the task as a textual entailment problem, it is possible to overcome the requirement for a large annotated training dataset.

Uncovering the Potential of ChatGPT for Discourse Analysis in Dialogue: An Empirical Study
Yaxin Fan | Feng Jiang | Peifeng Li | Haizhou Li

Large language models, like ChatGPT, have shown remarkable capability in many downstream tasks, yet their ability to understand discourse structures of dialogues remains less explored, as it requires higher-level capabilities of understanding and reasoning. In this paper, we aim to systematically inspect ChatGPT’s performance in two discourse analysis tasks: topic segmentation and discourse parsing, focusing on its deep semantic understanding of linear and hierarchical discourse structures underlying dialogue. To instruct ChatGPT to complete these tasks, we initially craft a prompt template consisting of the task description, output format, and structured input. Then, we conduct experiments on four popular topic segmentation datasets and two discourse parsing datasets. The experimental results show that ChatGPT demonstrates proficiency in identifying topic structures in general-domain conversations yet struggles considerably in specific-domain conversations. We also found that ChatGPT hardly understands rhetorical structures that are more complex than topic structures. Our deeper investigation indicates that ChatGPT can give more reasonable topic structures than human annotations but only linearly parses the hierarchical rhetorical structures. In addition, we delve into the impact of in-context learning (e.g., chain-of-thought) on ChatGPT and conduct an ablation study on various prompt components, which can provide a research foundation for future work. The code is available at https://github.com/yxfanSuda/GPTforDDA.

Understanding How Positional Encodings Work in Transformer Model
Taro Miyazaki | Hideya Mino | Hiroyuki Kaneko

A transformer model is used in general tasks such as pre-trained language models and specific tasks including machine translation. Such a model mainly relies on positional encodings (PEs) to handle the sequential order of input vectors. There are variations of PEs, such as absolute and relative, and several studies have reported on the superiority of relative PEs. In this paper, we focus on analyzing in which part of a transformer model PEs work and the different characteristics between absolute and relative PEs through a series of experiments. Experimental results indicate that PEs work in both self- and cross-attention blocks in a transformer model, and PEs should be added only to the query and key of an attention mechanism, not to the value. We also found that applying two PEs in combination, a relative PE in the self-attention block and an absolute PE in the cross-attention block, can improve translation quality.
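
To make the query/key finding concrete, the following minimal sketch computes single-head self-attention in which a sinusoidal absolute PE is added to the inputs of the query and key only, while the value uses the raw token vectors. Identity projections, the sinusoidal PE, and all function names are simplifying assumptions for illustration; this is not the experimental setup of the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sinusoidal_pe(position, dim):
    """Sinusoidal absolute positional encoding for one position."""
    pe = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


def attention_pe_on_qk(x):
    """Self-attention where PEs enter the query and key but not the value.

    x is a list of token vectors. Projections are the identity (an assumption
    that keeps the example short). Returns (outputs, attention_weights).
    """
    n, d = len(x), len(x[0])
    # Query/key inputs: token vector plus positional encoding.
    qk = [
        [v + p for v, p in zip(row, sinusoidal_pe(pos, d))]
        for pos, row in enumerate(x)
    ]
    outputs, weights = [], []
    for q in qk:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in qk]
        w = softmax(scores)
        weights.append(w)
        # Values are the position-free token vectors.
        outputs.append([sum(w[j] * x[j][c] for j in range(n)) for c in range(d)])
    return outputs, weights


if __name__ == "__main__":
    outs, attn = attention_pe_on_qk([[1.0, 0.0, 0.0, 0.0]] * 3)
    print(outs[0], attn[0])
```

With identical token vectors, the attention weights still differ by position, yet every output equals the shared token vector, because position information never enters the value path.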

Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units. A prominent feature of these languages is that these complex grapheme units comprise consonants/consonant conjuncts, vowel diacritics, and consonant diacritics, which together make up a unique language. Unicode-based writing schemes of these languages often disregard this feature and encode words as linear sequences of Unicode characters using an intricate scheme of connector characters and font interpreters. Due to this way of using a few dozen Unicode glyphs to write thousands of different unique glyphs (complex graphemes), there are serious ambiguities that lead to malformed words. In this paper, we propose two libraries: i) a normalizer for normalizing inconsistencies caused by a Unicode-based encoding scheme for Indic languages and ii) a grapheme parser for Abugida text, which deconstructs words into visually distinct orthographic syllables or complex graphemes and their constituents. Our proposed normalizer is a more efficient and effective tool than the previously used IndicNLP normalizer. Moreover, our parser and normalizer are also suitable tools for general Abugida text processing as they performed well in our robust word-based and NLP experiments. We report the pipeline for the scripts of 7 languages in this work and develop the framework for the integration of more scripts.

In video-text retrieval, most existing methods adopt the dual-encoder architecture for fast retrieval, which employs two individual encoders to extract global latent representations for videos and texts. However, they face challenges in capturing fine-grained semantic concepts. In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval. Specifically, we map videos and texts into a pre-defined lexicon space, where each dimension corresponds to a semantic concept. A two-stage semantics grounding approach is proposed to activate semantically relevant dimensions and suppress irrelevant dimensions. The learned lexicon representations can thus reflect fine-grained semantics of videos and texts. Furthermore, to leverage the complementarity between latent and lexicon representations, we propose a unified learning scheme to facilitate mutual learning via structure sharing and self-distillation. Experimental results show our UNIFY framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.

Recent research has shown that multi-task instruction tuning after pre-training greatly improves the model’s robustness and transfer ability, which is crucial for building a high-quality dialog system. However, most previous works on multi-task instruction tuning rely heavily on human-defined input format or prompt, which is not optimal in quality and quantity. In this work, we propose to use Task-aware Automatic Prompt generation (TAP) to automatically generate high-quality prompts. Using the high-quality prompts generated, we scale the corpus of the pre-trained conversation model to 122 datasets from 15 dialog-related tasks, resulting in Universal Pre-trained Conversation Model (UniPCM), a powerful foundation model for various conversational tasks and different dialog systems. Extensive experiments have shown that UniPCM is robust to input prompts and capable of various dialog-related tasks. Moreover, UniPCM has strong transfer ability and excels at low-resource scenarios, achieving SOTA results on 9 different datasets ranging from task-oriented dialog to open-domain conversation. Furthermore, we are amazed to find that TAP can generate prompts on par with those collected with crowdsourcing.

Cross-lingual representation learning transfers knowledge from resource-rich data to resource-scarce ones to improve the semantic understanding abilities of different languages. However, previous works rely on shallow unsupervised data generated by token surface matching, regardless of the global context-aware semantics of the surrounding text tokens. In this paper, we propose an Unsupervised Pseudo Semantic Data Augmentation (UniPSDA) mechanism for cross-lingual natural language understanding to enrich the training data without human interventions. Specifically, to retrieve the tokens with similar meanings for the semantic data augmentation across different languages, we propose a sequential clustering process in 3 stages: within a single language, across multiple languages of a language family, and across languages from multiple language families. Meanwhile, considering the multi-lingual knowledge infusion with context-aware semantics while alleviating computation burden, we directly replace the key constituents of the sentences with the above-learned multi-lingual family knowledge, viewed as pseudo-semantic. The infusion process is further optimized via three de-biasing techniques without introducing any neural parameters. Extensive experiments demonstrate that our model consistently improves the performance on general zero-shot cross-lingual natural language understanding tasks, including sequence classification, information extraction, and question answering.

Conversational retrieval refers to an information retrieval system that operates in an iterative and interactive manner, requiring the retrieval of various external resources, such as persona, knowledge, and even response, to effectively engage with the user and successfully complete the dialogue. However, most previous work trained independent retrievers for each specific resource, resulting in sub-optimal performance and low efficiency. Thus, we propose a multi-task framework that functions as a universal retriever for three dominant retrieval tasks during the conversation: persona selection, knowledge selection, and response selection. To this end, we design a dual-encoder architecture consisting of a context-adaptive dialogue encoder and a candidate encoder, aiming to attend to the relevant context from the long dialogue and retrieve suitable candidates with a simple dot product. Furthermore, we introduce two loss constraints to capture the subtle relationship between dialogue context and different candidates by regarding historically selected candidates as hard negatives. Extensive experiments and analysis establish state-of-the-art retrieval quality both within and outside its training domain, revealing the promising potential and generalization capability of our model to serve as a universal retriever for different candidate selection tasks simultaneously.

The aim of the Universal Anaphora initiative is to push forward the state of the art in anaphora and anaphora resolution by expanding the aspects of anaphoric interpretation which are or can be reliably annotated in anaphoric corpora, producing unified standards to annotate and encode these annotations, delivering datasets encoded according to these standards, and developing methods for evaluating models that carry out this type of interpretation. Although several papers on aspects of the initiative have appeared, no overall description of the initiative’s goals, proposals and achievements has been published yet except as an online draft. This paper aims to fill this gap, as well as to discuss its progress so far.

Universal Dependencies: Extensions for Modern and Historical German
Stefanie Dipper | Cora Haiber | Anna Maria Schröter | Alexandra Wiemann | Maike Brinkschulte

In this paper we present extensions of the UD scheme for modern and historical German. The extensions relate in part to fundamental differences such as those between different kinds of arguments and modifiers. We illustrate the extensions with examples from the MHG data and discuss a number of MHG-specific constructions. At the current time, we have annotated a corpus of Middle High German with almost 29K tokens using this scheme, which to our knowledge is the first UD treebank for Middle High German. Inter-annotator agreement is very high: the annotators achieve a score of α = 0.85. A statistical analysis of the annotations shows some interesting differences in the distribution of labels between modern and historical German.

Universal Dependencies for Learner Russian
Alla Rozovskaya

We introduce a pilot annotation of Russian learner data with syntactic dependency relations. The annotation is performed on a subset of sentences from RULEC-GEC and RU-Lang8, two error-corrected Russian learner datasets. We provide manually labeled Universal Dependency (UD) trees for 500 sentence pairs, annotating both the original (source) and the corrected (target) version of each sentence. Further, we outline guidelines for annotating learner Russian data containing non-standard erroneous text and analyze the effect that the individual errors have on the resulting dependency trees. This study should contribute to a wide range of computational and theoretical research directions in second language learning and grammatical error correction.

Unleashing the Power of Imbalanced Modality Information for Multi-modal Knowledge Graph Completion
Yichi Zhang | Zhuo Chen | Lei Liang | Huajun Chen | Wen Zhang

Multi-modal knowledge graph completion (MMKGC) aims to predict the missing triples in the multi-modal knowledge graphs by incorporating structural, visual, and textual information of entities into the discriminant models. The information from different modalities will work together to measure the triple plausibility. Existing MMKGC methods overlook the imbalance problem of modality information among entities, resulting in inadequate modal fusion and inefficient utilization of the raw modality information. To address the mentioned problems, we propose Adaptive Multi-modal Fusion and Modality Adversarial Training (AdaMF-MAT) to unleash the power of imbalanced modality information for MMKGC. AdaMF-MAT achieves multi-modal fusion with adaptive modality weights and further generates adversarial samples by modality-adversarial training to enhance the imbalanced modality information. Our approach is a co-design of the MMKGC model and training strategy which can outperform 19 recent MMKGC methods and achieve new state-of-the-art results on three public MMKGC benchmarks. Our code and data have been released at https://github.com/zjukg/AdaMF-MAT.

The in-context learning (ICL) for relational triple extraction (RTE) has achieved promising performance, but still encounters two key challenges: (1) how to design effective prompts and (2) how to select proper demonstrations. Existing methods, however, fail to address these challenges appropriately. On the one hand, they usually recast RTE task to text-to-text prompting formats, which is unnatural and results in a mismatch between the output format at the pre-training time and the inference time for large language models (LLMs). On the other hand, they only utilize surface natural language features and lack consideration of triple semantics in sample selection. These issues are blocking improved performance in ICL for RTE, thus we aim to tackle prompt designing and sample selection challenges simultaneously. To this end, we devise a tabular prompting for RTE (TableIE) which frames RTE task into a table generation task to incorporate explicit structured information into ICL, facilitating conversion of outputs to RTE structures. Then we propose instructive in-context learning (I²CL) which only selects and annotates a few samples considering internal triple semantics in massive unlabeled samples. Specifically, we first adopt off-the-shelf LLMs to perform schema-agnostic pre-extraction of triples in unlabeled samples using TableIE. Then we propose a novel triple-level similarity metric considering triple semantics between these samples and train a sample retrieval model based on calculated similarities in pre-extracted unlabeled data. We also devise three different sample annotation strategies for various scenarios. Finally, the annotated samples are considered as few-shot demonstrations in ICL for RTE. Experimental results on two RTE benchmarks show that I²CL with TableIE achieves state-of-the-art performance compared to other methods under various few-shot RTE settings.
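
The idea of framing triple extraction as table generation can be sketched as follows: demonstrations are rendered as small tables, the query ends with an open table header for the model to continue, and the generated rows are parsed back into triples. The column names, the pipe-delimited layout, and the function names are assumptions for this illustration and may differ from the actual TableIE prompt format.

```python
HEADER = "| subject | relation | object |"
SEPARATOR = "| --- | --- | --- |"


def triples_to_table(triples):
    """Render (subject, relation, object) triples as a pipe-delimited table."""
    rows = [HEADER, SEPARATOR]
    rows += [f"| {s} | {r} | {o} |" for s, r, o in triples]
    return "\n".join(rows)


def table_to_triples(table_text):
    """Parse table rows back into triples, skipping header and separator rows."""
    triples = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 3:
            continue  # malformed row
        if cells == ["subject", "relation", "object"]:
            continue  # header row
        if all(c and set(c) <= set("-:") for c in cells):
            continue  # separator row
        triples.append(tuple(cells))
    return triples


def build_prompt(demonstrations, query_text):
    """Build a few-shot prompt; the final table is left open for the model."""
    parts = [
        f"Text: {text}\n{triples_to_table(triples)}"
        for text, triples in demonstrations
    ]
    parts.append(f"Text: {query_text}\n{HEADER}\n{SEPARATOR}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [("Marie Curie was born in Warsaw.",
              [("Marie Curie", "born in", "Warsaw")])]
    print(build_prompt(demos, "Alan Turing worked at Bletchley Park."))
```

Because each row has a fixed number of cells, converting model output into triple structures reduces to splitting rows, which is the conversion benefit the abstract refers to.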

Unmasking Biases: Exploring Gender Bias in English-Catalan Machine Translation through Tokenization Analysis and Novel Dataset
Audrey Mash | Carlos Escolano | Aleix Sant | Maite Melero | Francesca de Luca Fornaciari

This paper presents a comprehensive evaluation of gender bias in English-Catalan machine translation, encompassing the creation of a novel language resource and an analysis of translation quality across four different tokenization models. The study introduces a new dataset derived from the MuST-SHE corpus, focusing on gender-neutral terms that necessitate gendered translations in Catalan. The results reveal noteworthy gender bias across all translation models, with a consistent preference for masculine forms. Notably, the study finds that when context is available, BPE and Sentencepiece Unigram tokenization methods outperform others, achieving higher accuracy in gender translation. However, when no context is provided, Morfessor outputs more feminine forms than other tokenization methods, albeit still a small percentage. The study also reflects that stereotypes present in the data are amplified in the translation output. Ultimately, this work serves as a valuable resource for addressing and mitigating gender bias in machine translation, emphasizing the need for improved awareness and sensitivity to gender issues in natural language processing applications.

Unpacking Bias: An Empirical Study of Bias Measurement Metrics, Mitigation Algorithms, and Their Interactions
Felipe Bravo-Marquez | Maria Jose Zambrano

Word embeddings (WE) have been shown to capture biases from the text they are trained on, which has led to the development of several bias measurement metrics and bias mitigation algorithms (i.e., methods that transform the embedding space to reduce bias). This study identifies three confounding factors that hinder the comparison of bias mitigation algorithms with bias measurement metrics: (1) reliance on different word sets when applying bias mitigation algorithms, (2) leakage between training words employed by mitigation methods and evaluation words used by metrics, and (3) inconsistencies in normalization transformations between mitigation algorithms. We propose a very simple comparison methodology that carefully controls for word sets and vector normalization to address these factors. We conduct a component isolation experiment to assess how each component of our methodology impacts bias measurement. After comparing the bias mitigation algorithms using our comparison methodology, we observe increased consistency between different debiasing algorithms when evaluated using our approach.

Unraveling Spontaneous Speech Dimensions for Cross-Corpus ASR System Evaluation for French
Solene Virginie Evain | Solange Rossato | François Portet

Many papers on speech processing use the term ‘spontaneous speech’ as a catch-all term for situations like speaking with a friend, being interviewed on radio/TV or giving a lecture. However, Automatic Speech Recognition (ASR) systems performance seems to exhibit variation on this type of speech: the more spontaneous the speech, the higher the WER (Word Error Rate). Our study focuses on better understanding the elements influencing the levels of spontaneity in order to evaluate the relation between categories of spontaneity and ASR systems performance and improve the recognition on those categories. We first analyzed the literature, listed and unraveled those elements, and finally identified four axes: the situation of communication, the level of intimacy between speakers, the channel and the type of communication. Then, we trained ASR systems and measured the impact of instances of face-to-face interaction labeled with the previous dimensions (different levels of spontaneity) on WER. We made two axes vary and found that both dimensions have an impact on the WER. The situation of communication seems to have the biggest impact on spontaneity: ASR systems give better results for situations like an interview than for friends having a conversation at home.
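
For reference, WER is conventionally computed as the word-level edit distance between the reference transcript and the ASR hypothesis (substitutions + deletions + insertions), divided by the number of reference words. The sketch below implements that standard definition; whitespace tokenization and the absence of text normalization (casing, punctuation) are simplifying assumptions, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Since insertions are counted against the reference length, WER can exceed 1.0 on very noisy hypotheses, which is worth keeping in mind when comparing highly spontaneous speech across corpora.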

In public procurement, establishing reference prices is essential to guide competitors in setting product prices. Group-purchased products, which are not standardized by default, are necessary to estimate reference prices. Text clustering techniques can be used to group similar items based on their descriptions, enabling the definition of reference prices for specific products or services. However, selecting an appropriate representation for text is challenging. This paper introduces a framework for text cleaning, extraction, and representation. We test eight distinct sentence representations tailored for public procurement item descriptions. Among these representations, we propose an approach that captures the most important components of item descriptions. Through extensive evaluation of a dataset comprising over 2 million items, our findings show that using sophisticated supervised methods to derive vectors for unsupervised tasks offers little advantage over leveraging unsupervised methods. Our results also highlight that domain-specific contextual knowledge is crucial for representation improvement.

Providing knowledge documents for large language models (LLMs) has emerged as a promising solution to update the static knowledge inherent in their parameters. However, knowledge in the document may conflict with the memory of LLMs due to outdated or incorrect knowledge in the LLMs’ parameters. This leads to the necessity of examining the capability of LLMs to assimilate supplemental external knowledge that conflicts with their memory. While previous studies have explained to what extent LLMs extract conflicting knowledge from the provided text, they neglect the necessity to reason with conflicting knowledge. Furthermore, there is a lack of detailed analysis of strategies to enable LLMs to resolve conflicting knowledge via prompting, decoding strategy, and supervised fine-tuning. To address these limitations, we construct a new dataset, dubbed KNOT, for knowledge conflict resolution examination in the form of question answering. KNOT facilitates in-depth analysis by dividing reasoning with conflicting knowledge into three levels: (1) Direct Extraction, which directly extracts conflicting knowledge to answer questions. (2) Explicit Reasoning, which reasons with conflicting knowledge when the reasoning path is explicitly provided in the question. (3) Implicit Reasoning, where reasoning with conflicting knowledge requires LLMs to infer the reasoning path independently to answer questions. We also conduct extensive experiments on KNOT to establish empirical guidelines for LLMs to utilize conflicting knowledge in complex circumstances. The dataset and associated code can be accessed at https://github.com/THU-KEG/KNOT.

Deep learning has introduced significant improvements in many software analysis tasks. Although the Large Language Models (LLMs) based neural code models demonstrate commendable performance when trained and tested within the intra-project independent and identically distributed (IID) setting, they often struggle to generalize effectively to real-world inter-project out-of-distribution (OOD) data. In this work, we show that this phenomenon is caused by the heavy reliance on project-specific shortcuts for prediction instead of ground-truth evidence. We propose a Cond-Idf measurement to interpret this behavior, which quantifies the relatedness of a token with a label and its project-specificness. The strong correlation between model behavior and the proposed measurement indicates that without proper regularization, models tend to leverage spurious statistical cues for prediction. Equipped with these observations, we propose a novel bias mitigation mechanism that regularizes the model’s learning behavior by leveraging latent logic relations among samples. Experimental results on two representative program analysis tasks indicate that our mitigation framework can improve both inter-project OOD generalization and adversarial robustness, while not sacrificing accuracy on intra-project IID data.

Unveiling Strengths and Weaknesses of NLP Systems Based on a Rich Evaluation Corpus: The Case of NER in French
Alice Millour | Yoann Dupont | Karen Fort | Liam Duignan

Named Entity Recognition (NER) is an applicative task for which annotation schemes vary. To compare the performance of systems whose tagsets differ in precision and coverage, it is necessary to assess (i) the comparability of their annotation schemes and (ii) the individual adequacy of the latter to a common annotation scheme. What is more, given the lack of robustness of some tools towards textual variation, we cannot expect an evaluation conducted on a homogeneous corpus with low coverage to provide a reliable prediction of the tools’ actual performance. To tackle both these limitations in evaluation, we provide a gold corpus for French covering 6 textual genres and annotated with a rich tagset that enables comparison with multiple annotation schemes. We use the flexibility of this gold corpus to provide both (i) an individual evaluation of four heterogeneous NER systems on their target tagsets and (ii) a comparison of their performance on a common scheme. This rich evaluation framework enables a fair comparison of NER systems across textual genres and annotation schemes.

Unveiling Vulnerability of Self-Attention
Khai Jiet Liong | Hongqiu Wu | Hai Zhao

Pre-trained language models (PLMs) are shown to be vulnerable to minor word changes, which poses a significant threat to real-world systems. While previous studies directly focus on manipulating word inputs, they are limited by their means of generating adversarial samples, lacking generalization to versatile real-world attacks. This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism. (1) We propose a powerful perturbation technique named ‘HackAttend,’ which perturbs the attention scores within the SA matrices via meticulously crafted attention masks. We show that state-of-the-art PLMs fall into heavy vulnerability, with minor attention perturbations (1%) resulting in a very high attack success rate (98%). Our paper extends the conventional text attack of word perturbations to more general structural perturbations. (2) We introduce ‘S-Attend,’ a novel smoothing technique that effectively makes SA robust via structural perturbations. We empirically demonstrate that this simple yet effective technique achieves robust performance on par with adversarial training when facing various text attackers.
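
The mechanism of perturbing attention through a mask can be illustrated with a small masked softmax: selected (query, key) positions are excluded before normalization, so their weight becomes zero and the remaining weights in that row are redistributed. How the positions are chosen is the core of the attack and is not reproduced here; in this sketch they are simply supplied by the caller, and the function name is hypothetical.

```python
import math


def masked_attention_weights(scores, masked_positions):
    """Row-wise softmax over attention scores with some positions masked out.

    scores: list of rows of raw attention scores (one row per query).
    masked_positions: set of (query_index, key_index) pairs to suppress.
    """
    weights = []
    for i, row in enumerate(scores):
        kept = [j for j in range(len(row)) if (i, j) not in masked_positions]
        if not kept:
            raise ValueError(f"row {i} has every position masked")
        top = max(row[j] for j in kept)
        exps = {j: math.exp(row[j] - top) for j in kept}
        total = sum(exps.values())
        # Masked entries receive exactly zero attention.
        weights.append([exps[j] / total if j in exps else 0.0
                        for j in range(len(row))])
    return weights


if __name__ == "__main__":
    raw = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
    print(masked_attention_weights(raw, set()))
    print(masked_attention_weights(raw, {(0, 0)}))
```

In the example, masking the single highest-scoring entry of the first row shifts all of that row's attention to the other two keys while leaving the second row untouched, which shows how a small number of masked units can change what a token attends to.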

UQA: Corpus for Urdu Question Answering
Samee Arif | Sualeha Farid | Awais Athar | Agha Ali Raza

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we obtain an F1 score of 85.99 and an EM of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.
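
One plausible reading of the three EATS steps is sketched below: the answer span is enclosed in marker strings, the marked paragraph is passed through a translation function, and the markers are then located in the output to recover the translated answer and its new character offset. The marker strings, the function signature, and the toy stand-in translator are assumptions for illustration only; the actual implementation is in the authors' repository.

```python
def enclose_translate_seek(context, answer_start, answer_text, translate,
                           open_mark="[[", close_mark="]]"):
    """Carry an answer span through translation using enclosing markers.

    translate is any callable str -> str (hypothetical here; a real system
    would call a machine translation model). Returns a tuple
    (translated_context, new_answer_start, translated_answer), or None if
    the markers did not survive translation.
    """
    answer_end = answer_start + len(answer_text)
    if context[answer_start:answer_end] != answer_text:
        raise ValueError("answer_text does not match context at answer_start")

    # Enclose: wrap the answer span so it can be found again later.
    enclosed = (context[:answer_start] + open_mark + answer_text
                + close_mark + context[answer_end:])

    # Translate the whole paragraph, markers included.
    translated = translate(enclosed)

    # Seek: locate the markers in the translated text.
    i = translated.find(open_mark)
    if i == -1:
        return None
    j = translated.find(close_mark, i + len(open_mark))
    if j == -1:
        return None
    answer = translated[i + len(open_mark):j]
    clean = translated[:i] + answer + translated[j + len(close_mark):]
    return clean, i, answer


def toy_translate(text):
    """Stand-in for a translator: changes the text and shifts offsets."""
    return "TRANSLATED: " + text.upper()


if __name__ == "__main__":
    ctx = "Paris is the capital of France."
    print(enclose_translate_seek(ctx, ctx.index("France"), "France", toy_translate))
```

Returning None when a marker is lost lets the caller drop or manually review such examples instead of storing a misaligned answer span.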

UrduMASD: A Multimodal Abstractive Summarization Dataset for Urdu
Ali Faheem | Faizad Ullah | Muhammad Sohaib Ayub | Asim Karim

In this era of multimedia dominance, the surge of multimodal content on social media has transformed our methods of communication and information exchange. With the widespread use of multimedia content, the ability to effectively summarize this multimodal content is crucial for enhancing consumption, searchability, and retrieval. The scarcity of such training datasets has been a barrier to research in this area, especially for low-resource languages like Urdu. To address this gap, this paper introduces “UrduMASD”, a video-based Urdu multimodal abstractive text summarization dataset. The dataset contains 15,374 collections of videos, audio, titles, transcripts, and corresponding text summaries. To ensure the quality of the dataset, intrinsic evaluation metrics such as Abstractivity, Compression, Redundancy, and Semantic coherence have been employed. It was observed that our dataset surpasses existing datasets on numerous key quality metrics. Additionally, we present baseline results achieved using both text-based and state-of-the-art multimodal summarization models. On adding visual information, an improvement of 2.6% was observed in the ROUGE scores, highlighting the efficacy of utilizing multimodal inputs for summarization. To the best of our knowledge, this is the first dataset in Urdu that provides video-based multimodal data for abstractive text summarization, making it a valuable resource for advancing research in this field.

User Guide for KOTE: Korean Online That-gul Emotions Dataset
Duyoung Jeon | Junho Lee | Cheongtag Kim

Despite the lack of comprehensive exploration of emotional connotations, sentiment analysis, which categorizes data as positive or negative, has been widely employed to identify emotional aspects in texts. Recently, corpora labeled with more than just valence or polarity have been built to surpass this limitation. However, most Korean emotion corpora are limited by their small size and narrow range of emotions covered. In this paper, we introduce the KOTE dataset. The KOTE dataset comprises 50,000 Korean online comments, totaling 250,000 cases, each manually labeled for 43 emotions and NO EMOTION through crowdsourcing. The taxonomy for the 43 emotions was systematically derived through cluster analysis of Korean emotion concepts within the word embedding space. After detailing the development of KOTE, we further discuss the results of fine-tuning, as well as analysis for social discrimination within the corpus.

Using Bibliodata LODification to Create Metadata-Enriched Literary Corpora in Line with FAIR Principles
Agnieszka Karlinska | Cezary Rosiński | Marek Kubis | Patryk Hubar | Jan Wieczorek

This paper discusses the design principles and procedures for creating a balanced corpus for research in computational literary studies, building on the experience of computational linguistics but adapting it to the specificities of the digital humanities. It showcases the development of the Metadata-enriched Polish Novel Corpus from the 19th and 20th centuries (19/20MetaPNC), consisting of 1,000 novels from 1854–1939, as an illustrative case and proposes a comprehensive workflow for the creation and reuse of literary corpora. What sets 19/20MetaPNC apart is its approach to balance, which considers the spatial dimension, the inclusion of non-canonical texts previously overlooked by other corpora, and the use of a complex, multi-stage metadata enrichment and verification process. Emphasis is placed on research-oriented metadata design, efficient data collection and data sharing according to the FAIR principles as well as 5- and 7-star data standards to increase the visibility and reusability of the corpus. A knowledge graph-based solution for the creation of exchangeable and machine-readable metadata describing corpora has been developed. For this purpose, metadata from bibliographic catalogs and other sources were transformed into Linked Data following the bibliodata LODification approach.

Nowadays, the spread of misinformation is a prominent problem in society. Our research focuses on aiding the automatic identification of misinformation by analyzing the persuasive strategies employed in textual documents. We introduce a novel annotation scheme encompassing common persuasive writing tactics to achieve our objective. Additionally, we provide a dataset on health misinformation, thoroughly annotated by experts utilizing our proposed scheme. Our contribution includes proposing a new task of annotating pieces of text with their persuasive writing strategy types. We evaluate fine-tuning and prompt-engineering techniques with pre-trained language models of the BERT family and the generative large language models of the GPT family using persuasive strategies as an additional source of information. We evaluate the effects of employing persuasive strategies as intermediate labels in the context of misinformation detection. Our results show that those strategies enhance accuracy and improve the explainability of misinformation detection models. The persuasive strategies can serve as valuable insights and explanations, enabling other models or even humans to make more informed decisions regarding the trustworthiness of the information.

Using Pre-Trained Language Models in an End-to-End Pipeline for Antithesis Detection
Ramona Kühn | Khouloud Saadi | Jelena Mitrović | Michael Granitzer

Rhetorical figures play an important role in influencing readers and listeners. Some of these word constructs that deviate from the usual language structure are known to be persuasive – antithesis is one of them. This figure combines parallel phrases with opposite ideas or words to highlight a contradiction. By identifying this figure, persuasive actors can be better identified. For this task, we create an annotated German dataset for antithesis detection. The dataset consists of posts from a Telegram channel criticizing the COVID-19 politics in Germany. Furthermore, we propose a three-block pipeline approach to detect the figure antithesis using large language models. Our pipeline splits the text into phrases, identifies phrases with a syntactically parallel structure, and detects if these parallel phrase pairs present opposing ideas by fine-tuning the German ELECTRA model, a state-of-the-art deep learning model for the German language. Furthermore, we compare the results with multilingual BERT and German BERT. Our novel approach outperforms the state-of-the-art methods (F1-score of 50.43 %) for antithesis detection by achieving an F1-score of 65.11 %.

Using Speech Technology to Test Theories of Phonetic and Phonological Typology
Anisia Popescu | Lori Lamel | Ioana Vasilescu

The present paper uses tools and methodologies derived from speech technology to test theories about phonetic typology. We specifically look at how the two-way laryngeal contrast (voiced /b, d, g, v, z/ vs. voiceless /p, t, k, f, s/ obstruents) is implemented in European Portuguese, a language that has been suggested to exhibit a different voicing system than its sister Romance languages, more similar to the one found for Germanic languages. A large European Portuguese corpus was force-aligned using (1) different combinations of parallel Portuguese (original), Italian (Romance language) and German (Germanic language) acoustic phone models, letting an ASR system choose the best-fitting one, and (2) pronunciation variants (/b, d, g, v, z/ produced as either [b, d, g, v, z] or [p, t, k, f, s]) for obstruent consonants. Results support previous accounts in the literature that European Portuguese is diverging from the traditional voicing system known for Romance languages, towards a hybrid system where stops and fricatives are specified for different voicing features.

Utilizing Local Hierarchy with Adversarial Training for Hierarchical Text Classification
Zihan Wang | Peiyi Wang | Houfeng Wang

Hierarchical text classification (HTC) is a challenging subtask of multi-label classification due to its complex taxonomic structure. Nearly all recent HTC works focus on how the labels are structured but ignore the sub-structure of the ground-truth labels for each input text, which contains rich label co-occurrence information. In this work, we introduce this local hierarchy with an adversarial framework. We propose a HiAdv framework that can fit in nearly all HTC models and optimize them with the local hierarchy as auxiliary information. We test on two typical HTC models and find that HiAdv is effective in all scenarios and is adept at dealing with complex taxonomic hierarchies. Further experiments demonstrate that the improvement from our framework indeed comes from the local hierarchy and that the local hierarchy is beneficial for rare classes which have insufficient training data.

This paper focuses on improving the performance of machine translation for manga (Japanese-style comics). In manga machine translation, text consists of a sequence of speech bubbles and each speech bubble is translated individually. However, each speech bubble itself does not contain sufficient information for translation. Therefore, previous work has proposed methods to use contextual information, such as the previous speech bubble, speech bubbles within the same scene, and corresponding scene images. In this research, we propose two new approaches to capture broader contextual information. Our first approach involves scene-based translation that considers the previous scene. The second approach considers broader context information, including details about the work, author, and manga genre. Through our experiments, we confirm that each of our methods improves translation quality, with the combination of both methods achieving the highest quality. Additionally, detailed analysis reveals the effect of zero-anaphora resolution in translation, such as supplying missing subjects not mentioned within a scene, highlighting the usefulness of longer contextual information in manga machine translation.

UzbekVerbDetection: Rule-based Detection of Verbs in Uzbek Texts
Maksud Sharipov | Elmurod Kuriyozov | Ollabergan Yuldashev | Ogabek Sobirov

Verb detection is a fundamental task in natural language processing that involves identifying the action or state expressed by a verb in a sentence. However, in the Uzbek language, verb detection is challenging due to the complexity of its morphology and the agglutinative nature of the language. In this paper, we propose a rule-based approach for verb detection in Uzbek texts based on affixes. Our method is based on a set of rules that capture the morphological patterns of verb forms in Uzbek. We evaluate the proposed approach on a dataset of Uzbek texts and report an F1-score of 0.97, which outperforms existing methods for verb detection in Uzbek. Our results suggest that rule-based approaches can be effective for verb detection in Uzbek texts and have potential applications in various natural language processing tasks.
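
To make the affix-based idea concrete, a deliberately tiny sketch of suffix matching is shown here. The four suffixes are only illustrative examples of Uzbek verb endings chosen for this sketch; they are an assumption and not the rule set from the paper, and matching on suffixes alone will overgenerate on real text.

```python
# Illustrative verb endings only; NOT the rules used in UzbekVerbDetection.
VERB_SUFFIXES = ("moq", "yapti", "di", "gan")

def normalize(token: str) -> str:
    """Lower-case a token and strip surrounding punctuation."""
    return token.lower().strip(".,!?;:\"'")

def looks_like_verb(word: str) -> bool:
    """True if the word ends in a listed suffix and has a non-empty stem."""
    return any(word.endswith(s) and len(word) > len(s) for s in VERB_SUFFIXES)

def detect_verbs(sentence: str) -> list:
    """Return the normalized tokens that match one of the suffix rules."""
    words = [normalize(t) for t in sentence.split()]
    return [w for w in words if looks_like_verb(w)]

verbs = detect_verbs("U kitob yozyapti.")
```

A fuller rule-based system would need ordered rules and exceptions, since an agglutinative language stacks several affixes on one stem.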

Validating and Exploring Large Geographic Corpora
Jonathan Dunn

This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427 billion word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English: (i) the agreement of independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each of these steps is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus with baseline data sets. The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations. This result shows how standard corpus creation techniques can accidentally exclude under-represented populations.
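
As an illustration of step (ii), hash-based deduplication can be sketched as below. The normalization (lower-casing and whitespace collapsing) and the choice of SHA-256 are assumptions made for this example; the abstract does not specify the hashing scheme actually used.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash a lightly normalized copy of the text (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first occurrence of each distinct fingerprint, in order."""
    seen = set()
    kept = []
    for doc in docs:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

unique_docs = deduplicate(["Kia ora  world", "kia ora world", "Something else"])
```

Storing only fingerprints keeps memory use proportional to the number of distinct documents rather than to their total length, which matters at web-corpus scale.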

Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs
David R. Mortensen | Valentina Izrailevitch | Yunze Xiao | Hinrich Schütze | Leonie Weissweiler

Lexical-syntactic flexibility, in the form of conversion (or zero-derivation), is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility: the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models: two proprietary models (GPT-3.5 and GPT-4) and three open-source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open-source language models are also able to perform it, and that the 7-billion-parameter Mistral displays as little difference between its baseline performance on the natural language inference task and its performance on the non-prototypical syntactic category task as the massive GPT-4.

VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain
Khai Le-Duc

Due to privacy restrictions, there’s a shortage of publicly available speech recognition datasets in the medical domain. In this work, we present VietMed, a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country. Moreover, we release the first public large-scale pre-trained models for Vietnamese ASR, w2v2-Viet and XLSR-53-Viet, along with the first public large-scale fine-tuned models for medical ASR. Even without any medical data in unsupervised pre-training, our best pre-trained model XLSR-53-Viet generalizes very well to the medical domain, outperforming the state-of-the-art XLSR-53 by reducing WER on the test set from 51.8% to 29.6% (a relative reduction of more than 40%). All code, data and models are made publicly available.
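
The "more than 40%" figure follows directly from the two WER values quoted in the abstract. The short check below uses a helper name chosen for this example.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline

# WER values quoted in the abstract, in percent.
reduction = relative_reduction(51.8, 29.6)
print(f"relative WER reduction: {reduction:.1%}")
```

The absolute drop is 22.2 points, and dividing by the 51.8% baseline gives roughly 43%, consistent with the claim.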

VI-OOD: A Unified Framework of Representation Learning for Textual Out-of-distribution Detection
Li-Ming Zhan | Bo Liu | Xiao-Ming Wu

Out-of-distribution (OOD) detection plays a crucial role in ensuring the safety and reliability of deep neural networks in various applications. While there has been a growing focus on OOD detection in visual data, the field of textual OOD detection has received less attention. Only a few attempts have been made to directly apply general OOD detection methods to natural language processing (NLP) tasks, without adequately considering the characteristics of textual data. In this paper, we delve into textual OOD detection with Transformers. We first identify a key problem prevalent in existing OOD detection methods: the biased representation learned through the maximization of the conditional likelihood p(y|x) can potentially result in subpar performance. We then propose a novel variational inference framework for OOD detection (VI-OOD), which maximizes the likelihood of the joint distribution p(x, y) instead of p(y|x). VI-OOD is tailored for textual OOD detection by efficiently exploiting the representations of pre-trained Transformers. Through comprehensive experiments on various text classification tasks, VI-OOD demonstrates its effectiveness and wide applicability. Our code has been released at https://github.com/liam0949/LLM-OOD.
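
The relationship between the two objectives mentioned in the abstract can be made explicit with the chain rule of probability. This is a standard identity, not the paper's variational derivation:

```latex
\log p(x, y) \;=\; \log p(y \mid x) \;+\; \log p(x)
```

Maximizing only the first term on the right trains a discriminative classifier. The joint objective additionally rewards modeling the input distribution p(x), which is the quantity most directly relevant to deciding whether an input is out of distribution.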

Visual-Linguistic Dependency Encoding for Image-Text Retrieval
Wenxin Guo | Lei Zhang | Kun Zhang | Yi Liu | Zhendong Mao

Image-text retrieval is a fundamental task to bridge the semantic gap between natural language and vision. Recent works primarily focus on aligning textual meanings with visual appearance. However, they often overlook the semantic discrepancy caused by syntactic structure in natural language expressions and relationships among visual entities. This oversight leads to sub-optimal alignment and degraded retrieval performance, since the underlying semantic dependencies and object interactions remain inadequately encoded in both textual and visual embeddings. In this paper, we propose a novel Visual-Linguistic Dependency Encoding (VL-DE) framework, which explicitly models the dependency information among textual words and interaction patterns between image regions, improving the discriminative power of cross-modal representations for more accurate image-text retrieval. Specifically, VL-DE enhances textual representations by considering syntactic relationships and dependency types, and enhances the visual representation of each region by attending to its spatially neighboring regions. A cross-attention mechanism is then introduced to aggregate aligned region-word pairs into image-text similarities. Analysis on Winoground, a dataset specially designed to measure vision-linguistic compositional structure reasoning, shows that VL-DE outperforms existing methods, demonstrating its effectiveness at this task. Comprehensive experiments on two benchmarks, Flickr30K and MS-COCO, further validate the competitiveness of our approach.

Visual-Textual Entailment with Quantities Using Model Checking and Knowledge Injection
Nobuyuki Iokawa | Hitomi Yanaka

In recent years, there has been great interest in multimodal inference. We concentrate on visual-textual entailment (VTE), a critical task in multimodal inference. VTE is the task of determining entailment relations between an image and a sentence. Several deep learning-based approaches have been proposed for VTE, but current approaches struggle with accurately handling quantities. On the other hand, a promising approach based on logical inference, which can successfully deal with large quantities, has also been proposed. However, that approach uses automated theorem provers, increasing the computational cost for problems involving many entities. In addition, that approach cannot deal well with lexical differences between the semantic representations of images and sentences. In this paper, we present a logic-based VTE system that overcomes these drawbacks, using model checking for inference to increase efficiency and knowledge injection to perform more robust inference. We create a VTE dataset containing quantities and negation to assess how well VTE systems understand such phenomena. Using this dataset, we demonstrate that our system solves VTE tasks with quantities and negation more robustly than previous approaches.

Vygotsky Distance: Measure for Benchmark Task Similarity
Maxim K. Surkov | Ivan P. Yamshchikov

Evaluation plays a significant role in modern natural language processing. Most modern NLP benchmarks consist of arbitrary sets of tasks that neither guarantee any generalization potential for the model once applied outside the test set nor try to minimize the resource consumption needed for model evaluation. This paper presents a theoretical instrument and a practical algorithm to calculate similarity between benchmark tasks; we call this similarity measure “Vygotsky distance”. The core idea of this similarity measure is that it is based on the relative performance of the “students” on a given task, rather than on the properties of the task itself. If two tasks are close to each other in terms of Vygotsky distance, the models tend to have similar relative performance on them. Thus, knowing the Vygotsky distance between tasks, one can significantly reduce the number of evaluation tasks while maintaining a high validation quality. Experiments on various benchmarks, including GLUE, SuperGLUE, CLUE, and RussianSuperGLUE, demonstrate that a vast majority of NLP benchmarks could be at least 40% smaller in terms of the tasks included. Most importantly, Vygotsky distance could also be used for the validation of new tasks, thus increasing the generalization potential of future NLP models.
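
One way to picture a distance based on the relative performance of models is to count how often two tasks disagree about which of two models is better. The sketch below computes that fraction (a normalized Kendall tau distance). It is an assumption made for illustration; the paper's exact definition of Vygotsky distance may differ, and the function and variable names are hypothetical.

```python
from itertools import combinations

def ranking_distance(scores_a: dict, scores_b: dict) -> float:
    """Fraction of model pairs ordered differently by two tasks.

    Both dicts map the same model names to scores; ties count as agreement.
    """
    pairs = list(combinations(sorted(scores_a), 2))
    discordant = sum(
        1 for m, n in pairs
        if (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n]) < 0
    )
    return discordant / len(pairs)

# Hypothetical scores of three models on three tasks.
task_1 = {"m1": 0.9, "m2": 0.8, "m3": 0.7}
task_2 = {"m1": 0.6, "m2": 0.5, "m3": 0.4}  # same ranking as task_1
task_3 = {"m1": 0.1, "m2": 0.2, "m3": 0.3}  # reversed ranking

same = ranking_distance(task_1, task_2)
opposite = ranking_distance(task_1, task_3)
```

Under such a measure, a task at distance zero from another ranks every pair of models identically, so dropping it from a benchmark would not change any model comparison.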

WaCadie: Towards an Acadian French Corpus
Jeremy Robichaud | Paul Cook

Corpora are important assets within the natural language processing (NLP) and linguistics communities, as they allow the training of models and corpus-based studies of languages. However, corpora do not exist for many languages and language varieties, such as Acadian French. In this paper, we first show that off-the-shelf NLP systems perform more poorly on Acadian French than on standard French. An Acadian French corpus could, therefore, potentially be used to improve NLP models for this dialect. Then, leveraging web-as-corpus methodologies, specifically BootCaT, domain crawling, and social media scraping, we create three corpora of Acadian French. To evaluate these corpora, drawing on the linguistic literature on Acadian French, we propose 22 statistical corpus-based measures of the extent to which a corpus is Acadian French. We use these measures to compare these newly built corpora to known Acadian French text and find that all three corpora include some traces of Acadian French.

Well Begun Is Half Done: An Implicitly Augmented Generative Framework with Distribution Modification for Hierarchical Text Classification
Huawen Feng | Jingsong Yan | Junlong Liu | Junhao Zheng | Qianli Ma

Hierarchical Text Classification (HTC) is a challenging task which aims to extract the labels in a tree structure corresponding to a given text. Discriminative methods usually incorporate the hierarchical structure information into the encoding process, while generative methods decode the features according to it. However, the data distribution varies widely among different categories of samples, and current methods ignore this data imbalance, making the predictions biased and susceptible to error propagation. In this paper, we propose an **IM**plicitly **A**ugmented **G**enerativ**E** framework with distribution modification for hierarchical text classification (**IMAGE**). Specifically, we translate the distributions of original samples along various directions through implicit augmentation to get more diverse data. Furthermore, given the scarcity of the samples of tail classes, we adjust their distributions by transferring knowledge from other classes in label space. In this way, the generative framework learns a better beginning of the feature sequence without a prediction bias and avoids being misled by its wrong predictions for head classes. Experimental results show that **IMAGE** obtains competitive results compared with state-of-the-art methods and proves its superiority on unbalanced data.

What Are the Implications of Your Question? Non-Information Seeking Question-Type Identification in CNN Transcripts
Yao Sun | Anastasiia Tatlubaeva | Zhihan Li | Chester Palen-Michel

Non-information seeking questions (NISQs) capture the subtle dynamics of human discourse. In this work, we utilize a dataset of over 1,500 information-seeking questions (ISQs) and NISQs to evaluate human and machine performance on classifying fine-grained NISQ types. We introduce the first publicly available corpus focused on annotating both ISQs and NISQs as an initial benchmark. Additionally, we establish competitive baselines by assessing diverse systems, including Generative Pre-Trained Transformer language models, on a new question classification task. Our results demonstrate the inherent complexity of making nuanced NISQ distinctions. The dataset is publicly available at https://github.com/YaoSun0422/NISQ_dataset.git

What Can Diachronic Contexts and Topics Tell Us about the Present-Day Compositionality of English Noun Compounds?
Samin Mahdizadeh Sani | Malak Rassem | Chris Jenkins | Filip Miletić | Sabine Schulte im Walde

Predicting the compositionality of noun compounds such as climate change and tennis elbow is a vital component in natural language understanding. While most previous computational methods that automatically determine the semantic relatedness between compounds and their constituents have applied a synchronic perspective, the current study investigates what diachronic changes in contexts and semantic topics of compounds and constituents reveal about the compounds’ present-day degrees of compositionality. We define a binary classification task that utilizes two diachronic vector spaces based on contextual co-occurrences and semantic topics, and demonstrate that diachronic changes in cosine similarities – measured over context or topic distributions – uncover patterns that distinguish between compounds with low and high present-day compositionality. Despite fewer dimensions in the topic models, the topic space performs on par with the co-occurrence space and captures rather similar information. Temporal similarities between compounds and modifiers as well as between compounds and their prepositional paraphrases predict the compounds’ present-day compositionality with accuracy >0.7.
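
The kind of feature described here, a cosine similarity between a compound and one of its constituents tracked across time slices, can be sketched as follows. The vectors, period labels, and function names are hypothetical; the actual vector spaces and the classifier used in the study are not reproduced.

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_over_time(compound_vecs: dict, constituent_vecs: dict) -> dict:
    """Cosine similarity per time slice, for slices present in both inputs."""
    return {
        period: cosine(compound_vecs[period], constituent_vecs[period])
        for period in sorted(compound_vecs)
        if period in constituent_vecs
    }

# Toy two-dimensional vectors for two time slices (illustrative values only).
compound = {"1900": [1.0, 0.0], "2000": [0.0, 1.0]}
modifier = {"1900": [2.0, 0.0], "2000": [1.0, 0.0]}
trajectory = similarity_over_time(compound, modifier)
```

A sequence of such similarities, one per time slice, is the sort of temporal signal a binary classifier could take as input to separate compounds with low and high present-day compositionality.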

This paper investigates what insights about linguistic features and what knowledge about the structure of natural language can be obtained from the encodings in transformer language models. In particular, we explore how BERT encodes the government relation between constituents in a sentence. We use several probing classifiers, and data from two morphologically rich languages. Our experiments show that information about government is encoded across all transformer layers, but predominantly in the early layers of the model. We find that, for both languages, a small number of attention heads encode enough information about the government relations to enable us to train a classifier capable of discovering new, previously unknown types of government, never seen in the training data. Currently, data is lacking for the research community working on grammatical constructions, and government in particular. We release the Government Bank—a dataset defining the government relations for thousands of lemmas in the languages in our experiments.

Large Language Models (LLMs) are now being considered as highly efficient judges for evaluating the quality of answers generated by candidate models. However, their judgments may be influenced by complex scenarios and inherent biases, raising concerns about their reliability. This study aims to bridge this gap by introducing four unexplored factors, namely answer quantity, inducing statements, judging strategy, and judging style, and examining their effect on the performance of LLMs as judges. Additionally, we introduce a new dimension of question difficulty to provide a more comprehensive understanding of LLMs’ judgments across varying question intricacies. We employ ChatGPT, GPT-4, Gemini, and Claude-2 as judges and conduct experiments on Vicuna Benchmark and MT-bench. Our study reveals that LLMs’ judging abilities are susceptible to the influence of these four factors, and that analysis along the newly proposed dimension of question difficulty is highly necessary. We also provide valuable insights into optimizing LLMs’ performance as judges, enhancing their reliability and adaptability across diverse evaluation scenarios.

What Happens to a Dataset Transformed by a Projection-based Concept Removal Method?
Richard Johansson

We investigate the behavior of methods using linear projections to remove information about a concept from a language representation, and we consider the question of what happens to a dataset transformed by such a method. A theoretical analysis and experiments on real-world and synthetic data show that these methods inject strong statistical dependencies into the transformed datasets. After applying such a method, the representation space is highly structured: in the transformed space, an instance tends to be located near instances of the opposite label. As a consequence, the original labeling can in some cases be reconstructed by applying an anti-clustering method.
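
For readers unfamiliar with projection-based removal, its basic building block is projecting each representation onto the subspace orthogonal to a direction associated with the concept. The sketch below removes a single direction in plain Python. The direction `w` is a hypothetical stand-in (in practice it might come from a linear classifier), and the methods analyzed in the paper may remove several directions.

```python
import math

def remove_direction(x, w):
    """Project x onto the hyperplane orthogonal to w: x - (x . u) u, with u = w / |w|."""
    norm = math.sqrt(sum(c * c for c in w))
    u = [c / norm for c in w]
    coefficient = sum(a * b for a, b in zip(x, u))
    return [a - coefficient * b for a, b in zip(x, u)]

# Hypothetical concept direction and one representation vector.
w = [0.0, 2.0]
x = [3.0, 4.0]
projected = remove_direction(x, w)
```

After the projection, the vector has no component along `w`, so no linear classifier can read the concept from that direction. The abstract's point is that the transformed data still carries structure that can reveal the original labels.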

What Has LeBenchmark Learnt about French Syntax?
Zdravko Dugonjić | Adrien Pupier | Benjamin Lecouteux | Maximin Coavoux

The paper reports on a series of experiments aiming at probing LeBenchmark, a pretrained acoustic model trained on 7k hours of spoken French, for syntactic information. Pretrained acoustic models are increasingly used for downstream speech tasks such as automatic speech recognition, speech translation, spoken language understanding or speech parsing. They are trained on very low-level information (the raw speech signal), and do not have explicit lexical knowledge. Despite that, they obtain reasonable results on tasks that require higher-level linguistic knowledge. As a result, an emerging question is whether these models encode syntactic information. We probe each representation layer of LeBenchmark for syntax, using the Orféo treebank, and observe that it has learnt some syntactic information. Our results show that syntactic information is more easily extractable from the middle layers of the network, after which a very sharp decrease is observed.

What Is Needed for Intra-document Disambiguation of Math Identifiers?
Takuto Asakura | Yusuke Miyao

In automated scientific document analysis, accurately interpreting math formulae is imperative alongside comprehending natural language. Ambiguity in math identifiers within a single document poses significant challenges to understanding math formulae. While disambiguating math identifiers across documents has seen some progress, resolving ambiguity within a document remains inadequately researched due to complexity and insufficient datasets. The level of difficulty and information required to accomplish this task was uncertain. This study aims to determine which information is necessary for the intra-document disambiguation of math identifiers. Our findings indicate that the position data and local formula structure surrounding the identifiers, including modifiers, are particularly critical. For our study, we expanded a dataset for formula grounding and doubled its size to include annotations for 27,655 math identifier occurrences. We have created a multi-layer perceptron model that performs similarly to humans, with an 85% accuracy and a kappa value of 0.73, outperforming rule-based baselines. We trained and evaluated the model with papers in natural language processing (NLP). Our findings were also confirmed valid in fields other than NLP by applying the trained models to papers from various fields. These results will aid in improving mathematical language processing, such as mathematical information retrieval.

When Argumentation Meets Cohesion: Enhancing Automatic Feedback in Student Writing
Yuning Ding | Omid Kashefi | Swapna Somasundaran | Andrea Horbach

In this paper, we investigate the role of arguments in the automatic scoring of cohesion in argumentative essays. The feature analysis reveals that in argumentative essays, the lexical cohesion between claims is more important to the overall cohesion, while the evidence is expected to be diverse and divergent. Our results show that combining features related to argument segments and cohesion features improves the performance of the automatic cohesion scoring model trained on a transformer. The cohesion score is also learned more accurately in a multi-task learning process by adding the automatic segmentation of argumentative elements as an auxiliary task. Our findings contribute to both the understanding of cohesion in argumentative writing and the development of automatic feedback.

When Cohesion Lies in the Embedding Space: Embedding-Based Reference-Free Metrics for Topic Segmentation
Iacopo Ghinassi | Lin Wang | Chris Newell | Matthew Purver

In this paper we propose a new framework and new methods for the reference-free evaluation of topic segmentation systems directly in the embedding space. Specifically, we define a common framework for reference-free, embedding-based topic segmentation metrics, and show how this applies to an existing metric. We then define new metrics, based on a previously defined cohesion score, Average Relative Proximity. Using this approach, we show that Large Language Models (LLMs) yield features that, if used correctly, can strongly correlate with traditional topic segmentation metrics based on costly and rare human annotations, while outperforming existing reference-free metrics borrowed from clustering evaluation in most domains. We then show that smaller language models specifically fine-tuned for different sentence-level tasks can outperform LLMs several orders of magnitude larger. Via a thorough comparison of our metric’s performance across different datasets, we see that conversational data present the biggest challenge in this framework. Finally, we analyse the behaviour of our metrics in specific error cases, such as those of under-generation and moving of ground truth topic boundaries, and show that our metrics behave more consistently than other reference-free methods.

When Do “More Contexts” Help with Sarcasm Recognition?
Ojas Nimase | Sanghyun Hong

Sarcasm recognition is challenging because it needs an understanding of the true intention, which is opposite to or different from the literal meaning of the words. Prior work has addressed this challenge by developing a series of methods that provide richer contexts, e.g., sentiment or cultural nuances, to models. While shown to be effective individually, no study has systematically evaluated their collective effectiveness. As a result, it remains unclear to what extent additional contexts can improve sarcasm recognition. In this work, we explore the improvements that existing methods bring by incorporating more contexts into a model. To this end, we develop a framework where we can integrate multiple contextual cues and test different approaches. In evaluation with four approaches on three sarcasm recognition benchmarks, we achieve existing state-of-the-art performances and also demonstrate the benefits of sequentially adding more contexts. We also identify inherent drawbacks of using more contexts, highlighting that in the pursuit of even better results, the model may need to adopt societal biases.

When Your Cousin Has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages
Niyati Bafna | Cristina España-Bonet | Josef van Genabith | Benoît Sagot | Rachel Bawden

Most existing approaches for unsupervised bilingual lexicon induction (BLI) depend on good quality static or contextual embeddings requiring large monolingual corpora for both languages. However, unsupervised BLI is most likely to be useful for low-resource languages (LRLs), where large datasets are not available. Often we are interested in building bilingual resources for LRLs against related high-resource languages (HRLs), resulting in severely imbalanced data settings for BLI. We first show that state-of-the-art BLI methods in the literature exhibit near-zero performance for severely data-imbalanced language pairs, indicating that these settings require more robust techniques. We then present a new method for unsupervised BLI between a related LRL and HRL that only requires inference on a masked language model of the HRL, and demonstrate its effectiveness on truly low-resource languages Bhojpuri and Magahi (with <5M monolingual tokens each), against Hindi. We further present experiments on (mid-resource) Marathi and Nepali to compare approach performances by resource range, and release our resulting lexicons for five low-resource Indic languages: Bhojpuri, Magahi, Awadhi, Braj, and Maithili, against Hindi.

Which Sense Dominates Multisensory Semantic Understanding? A Brain Decoding Study
Dandan Huang | Lu Cao | Zhenting Li | Yue Zhang

Decoding semantic meanings from brain activity has attracted increasing attention. Neurolinguists have found that semantic perception is open to multisensory stimulation, as word meanings can be delivered by both auditory and visual inputs. Prior work which decodes semantic meanings from neuroimaging data largely exploits brain activation patterns triggered by stimulation in cross-modality (i.e. text-audio pairs, text-picture pairs). Their goal is to develop a more sophisticated computational model to probe what information from the act of language understanding is represented in the human brain. However, how the way the brain receives such information influences decoding performance is underestimated. This study dissociates multisensory integration of word understanding into written text, spoken text and image perception respectively, exploring the decoding efficiency and reliability of unisensory information in the brain representation. The findings suggest that, in the unisensory setting, decoding is most successful when semantics is represented in pictures, but the effect disappears in the case of congeneric words which share a related meaning. These results reveal the modality dependence and multisensory enhancement in the brain decoding methodology.

Who Did You Blame When Your Project Failed? Designing a Corpus for Presupposition Generation in Cross-Examination Dialogues
Maria Francis | Julius Steuer | Dietrich Klakow | Volha Petukhova

This paper introduces a corpus for the novel task of presupposition generation, a natural language generation problem in which a model produces a list of presuppositions carried by a given input sentence; in the presented research, the input is a cross-examination question. Two datasets, PECaN (Presupposition, Entailment, Contradiction and Neutral) and PGen (Presupposition Generation), are designed to fine-tune existing BERT (CITATION) and T5 (CITATION) models for classification and generation tasks. Various corpora construction methods are proposed, ranging from manual annotations and prompting the GPT 3.0 model to augmenting data from the existing corpora. The fine-tuned models achieved high accuracy on the novel Presupposition as Natural Language Inference (PNLI) task, which extends traditional Natural Language Inference (NLI) by incorporating instances of presupposition into classification. T5 outperforms BERT by a broad margin, achieving an overall accuracy of 84.35% compared to 71.85% for BERT, and specifically when classifying presuppositions (93% vs 73% respectively). Regarding presupposition generation, we observed that despite the limited amount of data used for fine-tuning, the model displays an emerging proficiency in generating presuppositions, reaching ROUGE scores of 43.47 and adhering to systematic patterns that mirror valid strategies for presupposition generation, although it failed to generate the complete lists.

Who Is Bragging More Online? A Large Scale Analysis of Bragging in Social Media
Mali Jin | Daniel Preotiuc-Pietro | A. Seza Doğruöz | Nikolaos Aletras

Bragging is the act of uttering statements that are likely to be positively viewed by others, and it is extensively employed in human communication with the aim of building a positive self-image. Social media is a natural platform for users to employ bragging in order to gain admiration, respect, attention and followers from their audiences. Yet, little is known about the scale of bragging online and its characteristics. This paper employs computational sociolinguistics methods to conduct the first large scale study of bragging behavior on Twitter (U.S.) by focusing on its overall prevalence, temporal dynamics and impact of demographic factors. Our study shows that the prevalence of bragging decreases over time within the same population of users. In addition, younger, more educated and popular users in the U.S. are more likely to brag. Finally, we conduct an extensive linguistic analysis to unveil specific bragging themes associated with different user traits.

Who Said What: Formalization and Benchmarks for the Task of Quote Attribution
Wenjie Zhong | Jason Naradowsky | Hiroya Takamura | Ichiro Kobayashi | Yusuke Miyao

The task of quote attribution seeks to pair textual utterances with the name of their speakers. Despite continuing research efforts on the task, models are rarely evaluated systematically against previous models in comparable settings on the same datasets. This has resulted in a poor understanding of the relative strengths and weaknesses of various approaches. In this work we formalize the task of quote attribution, and in doing so, establish a basis of comparison across existing models. We present an exhaustive benchmark of known models, including natural extensions to larger LLM base models, on all available datasets in both English and Chinese. Our benchmarking results reveal that the CEQA model attains state-of-the-art performance among all supervised methods, and ChatGPT, operating in a four-shot setting, demonstrates performance on par with or surpassing that of supervised methods on some datasets. Detailed error analysis identifies several key factors contributing to prediction errors.

Why Voice Biomarkers of Psychiatric Disorders Are Not Used in Clinical Practice? Deconstructing the Myth of the Need for Objective Diagnosis
Vincent P. Martin | Jean-Luc Rouas

Given the high prevalence of mental disorders and the significant diagnostic delays and difficulties in patient follow-up, voice biomarkers hold the promise of improving access to care and therapeutic follow-up for people with psychiatric disorders. Yet, despite many years of successful research in the field, none of these voice biomarkers are implemented in clinical practice. Beyond the reductive explanation of the lack of explainability of the involved machine learning systems, we look for arguments in the epistemology and sociology of psychiatry. We show that the estimation of diagnoses, the major task in the literature, is of little interest to both clinicians and patients. After tackling the common misbeliefs about diagnosis in psychiatry in a didactic way, we propose a paradigm shift towards the estimation of clinical symptoms and signs, which not only address the limitations raised against diagnosis estimation but also enable the formulation of new machine learning tasks. We hope that this paradigm shift will empower the use of vocal biomarkers in clinical practice. It is however conditional on a change in database labeling practices, but also on a profound change in the speech processing community’s practices towards psychiatry.

WikiFactDiff: A Large, Realistic, and Temporally Adaptable Dataset for Atomic Factual Knowledge Update in Causal Language Models
Hichem Ammar Khodja | Frederic Bechet | Quentin Brabant | Alexis Nasr | Gwénolé Lecorvé

The factuality of large language models (LLMs) tends to decay over time since events posterior to their training are “unknown” to them. One way to keep models up-to-date could be factual update: the task of inserting, replacing, or removing certain simple (atomic) facts within the model. To study this task, we present WikiFactDiff, a dataset that describes the evolution of factual knowledge between two dates as a collection of simple facts divided into three categories: new, obsolete, and static. We describe several update scenarios arising from various combinations of these three types of basic update. The facts are represented by subject-relation-object triples; indeed, WikiFactDiff was constructed by comparing the state of the Wikidata knowledge base at 4 January 2021 and 27 February 2023. Those facts are accompanied by verbalization templates and cloze tests that enable running update algorithms and their evaluation metrics. Contrary to other datasets, such as zsRE and CounterFact, WikiFactDiff constitutes a realistic update setting that involves various update scenarios, including replacements, archival, and new entity insertions. We also present an evaluation of existing update algorithms on WikiFactDiff.
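
As a rough illustration of the three categories named in the abstract above, the sketch below classifies subject-relation-object triples by comparing two snapshots with plain set operations. The `Triple` type, the `diff_snapshots` function, and the placeholder facts are hypothetical and are not taken from the WikiFactDiff release; the dataset's actual construction is presumably more involved.

```python
from typing import Dict, NamedTuple, Set


class Triple(NamedTuple):
    """A subject-relation-object fact (hypothetical representation)."""
    subject: str
    relation: str
    obj: str


def diff_snapshots(old: Set[Triple], new: Set[Triple]) -> Dict[str, Set[Triple]]:
    """Split facts into new, obsolete, and static by comparing two snapshots."""
    return {
        "new": new - old,        # present only in the later snapshot
        "obsolete": old - new,   # present only in the earlier snapshot
        "static": old & new,     # present in both snapshots
    }


# Placeholder facts, invented purely for illustration.
snapshot_old = {
    Triple("Country X", "head of government", "Person A"),
    Triple("Country X", "capital", "City C"),
}
snapshot_new = {
    Triple("Country X", "head of government", "Person B"),
    Triple("Country X", "capital", "City C"),
}

categories = diff_snapshots(snapshot_old, snapshot_new)
for name, facts in categories.items():
    print(name, sorted(facts))
```

Under this simplification, a replacement shows up as one obsolete triple and one new triple that share the same subject and relation.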

The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still suffers from hallucinations and under-splitting. To address these issues, this paper presents a simple and strong data refinement approach. Here, we create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences and reversing the order of reference simple sentences. Experimental results show that training with WikiSplit++ leads to better performance than training with WikiSplit, even with fewer training instances. In particular, our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.

We present a comprehensive computational study of the under-investigated phenomenon of personal name compounds (PNCs) in German such as Willkommens-Merkel (‘Welcome-Merkel’). Prevalent in news, social media, and political discourse, PNCs are hypothesized to exhibit an evaluative function that is reflected in a more positive or negative perception as compared to the respective personal full name (such as Angela Merkel). We model 321 PNCs and their corresponding full names at discourse level, and show that PNCs bear an evaluative nature that can be captured through a variety of computational methods. Specifically, we assess through valence information whether a PNC is more positively or negatively evaluative than the person’s name, by applying and comparing two approaches using (i) valence norms and (ii) pre-trained language models (PLMs). We further enrich our data with personal, domain-specific, and extra-linguistic information and perform a range of regression analyses revealing that factors including compound and modifier valence, domain, and political party membership influence how a PNC is evaluated.

WkNER: Enhancing Named Entity Recognition with Word Segmentation Constraints and kNN Retrieval
Yanchun Li | Senlin Deng | Dongsu Shen | Shujuan Tian | Saiqin Long

Fine-tuning Pre-trained Language Models (PLMs) is a popular Natural Language Processing (NLP) paradigm for addressing Named Entity Recognition (NER) tasks. However, neural network models often demonstrate poor generalization capabilities due to significant disparities between the knowledge learned by PLMs and the distribution of the target dataset, as well as data scarcity issues. In addition, token omission in predictions due to insufficient learning remains a challenge in NER. In this paper, we propose a kNN retrieval enhancement algorithm (WkNER) that incorporates word segmentation information to enhance the model’s generalization ability and alleviate the problem of missing entity tokens in prediction. First, word segmentation information is used to preliminarily determine the boundaries of entities and alleviate the common prediction errors of missing tokens within entities made by the fine-tuned model. Second, we find that non-entities in the retrieval table contain a large amount of redundant information, and explore the effects of introducing non-entity information of different scales on the model. Experimental results show that our proposed method significantly improves the performance of baseline models, and achieves better or comparable recognition accuracy relative to previous state-of-the-art models on multiple public Chinese and English datasets. Especially in low-resource scenarios, our method achieves higher accuracy on 20% of the dataset than the original method on the full dataset.

Word-Aware Modality Stimulation for Multimodal Fusion
Shuhei Tateishi | Yasuhito Osugi | Makoto Nakatsuji

Multimodal learning is generally expected to make more accurate predictions than text-only analysis. Here, although various methods for fusing multimodal inputs have been proposed for sentiment analysis tasks, we found that they may be inhibiting their fusion methods, which are based on attention-based language models, from learning non-verbal modalities, because non-verbal ones are isolated from the linguistic semantics and contexts and do not include them, meaning that they are unsuitable for applying attention to text modalities during the fusion phase. To address this issue, we propose Word-aware Modality Stimulation Fusion (WA-MSF) for facilitating integration of non-verbal modalities with the text modality. The Modality Stimulation Unit layer (MSU-layer) is the core concept of WA-MSF; it integrates language contexts and semantics into non-verbal modalities, thereby instilling linguistic essence into these modalities. Moreover, WA-MSF uses aMLP in the fusion phase in order to utilize spatial and temporal representations of non-verbal modalities more effectively than transformer fusion. In our experiments, WA-MSF set a new state-of-the-art level of performance on sentiment prediction tasks.

Word-level Commonsense Knowledge Selection for Event Detection
Shuai Yang | Yu Hong | Shiming He | Qingting Xu | Jianmin Yao

Event Detection (ED) is the task of automatically extracting multi-class trigger words. The understanding of word sense is crucial for ED. In this paper, we utilize context-specific commonsense knowledge to strengthen word sense modeling. Specifically, we leverage a Context-specific Knowledge Selector (CKS) to select the exact commonsense knowledge of words from a large knowledge base, i.e., ConceptNet. Context-specific selection is made in terms of the relevance of knowledge to the living contexts. On this basis, we incorporate the commonsense knowledge into the word-level representations before decoding. ChatGPT is an ideal generative CKS when the prompts are deliberately designed, though it is cost-prohibitive. To avoid the heavy reliance on ChatGPT, we train an offline CKS using the predictions of ChatGPT over a small number of examples (about 9% of all). We experiment on the benchmark ACE-2005 dataset. The test results show that our approach yields substantial improvements compared to the BERT baseline, achieving an F1-score of about 78.3%. All models, source codes and data will be made publicly available.

WordNet under Scrutiny: Dictionary Examples in the Era of Large Language Models
Fatemah Yousef Almeman | Steven Schockaert | Luis Espinosa Anke

Dictionary definitions play a prominent role in a wide range of NLP tasks, for instance by providing additional context about the meaning of rare and emerging terms. Many dictionaries also provide examples to illustrate the prototypical usage of words, which brings further opportunities for training or enriching NLP models. The intrinsic qualities of dictionaries, and related lexical resources such as glossaries and encyclopedias, are however still not well-understood. While there has been significant work on developing best practices, such guidance has been aimed at traditional usages of dictionaries (e.g. supporting language learners), and it is currently unclear how different quality aspects affect the NLP systems that rely on them. To address this issue, we compare WordNet, the most commonly used lexical resource in NLP, with a variety of dictionaries, as well as with examples that were generated by ChatGPT. Our analysis involves human judgments as well as automatic metrics. We furthermore study the quality of word embeddings derived from dictionary examples, as a proxy for downstream performance. We find that WordNet’s examples lead to lower-quality embeddings than those from the Oxford dictionary. Surprisingly, however, the ChatGPT generated examples were found to be most effective overall.

The awareness of multi-cultural human values is critical to the ability of language models (LMs) to generate safe and personalized responses. However, this awareness of LMs has been insufficiently studied, since the computer science community lacks access to the large-scale real-world data about multi-cultural values. In this paper, we present WorldValuesBench, a globally diverse, large-scale benchmark dataset for the multi-cultural value prediction task, which requires a model to generate a rating response to a value question based on demographic contexts. Our dataset is derived from an influential social science project, World Values Survey (WVS), that has collected answers to hundreds of value questions (e.g., social, economic, ethical) from 94,728 participants worldwide. We have constructed more than 20 million examples of the type “(demographic attributes, value question) → answer” from the WVS responses. We perform a case study using our dataset and show that the task is challenging for strong open and closed-source models. On merely 11.1%, 25.0%, 72.2%, and 75.0% of the questions, Alpaca-7B, Vicuna-7B-v1.5, Mixtral-8x7B-Instruct-v0.1, and GPT-3.5 Turbo can respectively achieve <0.2 Wasserstein 1-distance from the human normalized answer distributions. WorldValuesBench opens up new research avenues in studying limitations and opportunities in multi-cultural value awareness of LMs.
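
To make the reported threshold concrete, here is a minimal sketch of the Wasserstein 1-distance between two discrete answer distributions on an ordered rating scale, computed as the area between their cumulative distribution functions. It assumes that ratings are rescaled to the interval [0, 1], which is one possible reading of "normalized" in the abstract above; the function name `wasserstein_1` and the example distributions are hypothetical and do not come from the benchmark.

```python
from itertools import accumulate
from typing import Sequence


def wasserstein_1(p: Sequence[float], q: Sequence[float],
                  support: Sequence[float]) -> float:
    """Wasserstein 1-distance between two distributions on a sorted 1-D support.

    p and q hold the probability mass at each support point. In one dimension
    the distance equals the area between the two CDFs.
    """
    if not (len(p) == len(q) == len(support)):
        raise ValueError("p, q and support must have the same length")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    total = 0.0
    for i in range(len(support) - 1):
        gap = support[i + 1] - support[i]          # width of this CDF step
        total += abs(cdf_p[i] - cdf_q[i]) * gap    # area between the CDFs
    return total


# Hypothetical 4-point rating scale rescaled to [0, 1] (an assumption).
scale = [0.0, 1 / 3, 2 / 3, 1.0]
human = [0.1, 0.2, 0.3, 0.4]   # invented human answer distribution
model = [0.4, 0.3, 0.2, 0.1]   # invented model answer distribution

distance = wasserstein_1(human, model, scale)
print(distance, distance < 0.2)
```

With the support rescaled this way, the distance ranges from 0 for identical distributions to 1 when all mass sits at opposite ends of the scale.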

Would You Like to Make a Donation? A Dialogue System to Persuade You to Donate
Yuhan Song | Houfeng Wang

Persuasive dialogue is a type of dialogue commonly used in human daily life in scenarios such as promotion and sales. Its purpose is to influence the decision, attitude or behavior of another person through the dialogue process. Persuasive automated dialogue systems can be applied in a variety of fields such as charity, business, education, and healthcare. Despite their amazing abilities, Large Language Models (LLMs) such as ChatGPT still have limitations in persuasion. Little of the current research on automated dialogue systems is dedicated to persuasive dialogue. In this paper, we introduce a persuasive automated dialogue system. In the system, a context-aware persuasion strategy selection module lets the dialogue system flexibly use different persuasion strategies to persuade users; then a natural language generation module is used to output a response. We also propose a persuasiveness prediction model to automatically evaluate the persuasiveness of generated text. Experimental results show that our dialogue system can achieve better performance on several automated evaluation metrics than baseline models.

WW-CSL: A New Dataset for Word-Based Wearable Chinese Sign Language Detection
Fan Xu | Kai Liu | Yifeng Yang | Keyu Yan

Sign language is an effective non-verbal communication mode for hearing-impaired people. Whereas video-based sign language detection models require sufficient lighting and a clear background, current wearable glove-based sign language models are robust to poor lighting and occlusion. In this paper, we annotate a new dataset of Word-based Wearable Chinese Sign Language (WW-CSL) gestures. Specifically, we propose a three-form (i.e., sequential sensor data, gesture video, and gesture text) scheme to represent dynamic CSL gestures. Guided by the scheme, a total of 3,000 samples were collected, corresponding to 100 word-based CSL gestures. Furthermore, we present a transformer-based baseline model to fuse 2 inertial measurement units (IMUs) and 10 flex sensors for wearable CSL detection. In order to integrate the advantages of video-based and wearable glove-based CSL gestures, we also propose a transformer-based Multi-Modal CSL Detection (MM-CSLD) framework which integrates the local sequential sensor data derived from wearable-based CSL gestures with the global, fine-grained skeleton representations captured from video-based CSL gestures simultaneously.

XAI-Attack: Utilizing Explainable AI to Find Incorrectly Learned Patterns for Black-Box Adversarial Example Creation
Markus Bayer | Markus Neiczer | Maximilian Samsinger | Björn Buchhold | Christian Reuter

Adversarial examples, capable of misleading machine learning models into making erroneous predictions, pose significant risks in safety-critical domains such as crisis informatics, medicine, and autonomous driving. To counter this, we introduce a novel textual adversarial example method that identifies falsely learned word indicators by leveraging explainable AI methods as importance functions on incorrectly predicted instances, thus revealing and understanding the weaknesses of a model. To evaluate the effectiveness of our approach, we conduct a human and a transfer evaluation and propose a novel adversarial training evaluation setting for better robustness assessment. While outperforming current adversarial example and training methods, the results also show our method’s potential in facilitating the development of more resilient transformer models by detecting and rectifying biases and patterns in training data, showing baseline improvements of up to 23 percentage points in accuracy on adversarial tasks. The code of our approach is freely available for further exploration and use.

XATU: A Fine-grained Instruction-based Benchmark for Explainable Text Updates
Haopeng Zhang | Hayate Iso | Sairam Gurajada | Nikita Bhutani

Text editing is a crucial task of modifying text to better align with user intents. However, existing text editing benchmark datasets contain only coarse-grained instructions and lack explainability, thus resulting in outputs that deviate from the intended changes outlined in the gold reference. To comprehensively investigate the text editing capabilities of large language models (LLMs), this paper introduces XATU, the first benchmark specifically designed for fine-grained instruction-based explainable text editing. XATU considers finer-grained text editing tasks of varying difficulty (simplification, grammar check, fact-check, etc.), incorporating lexical, syntactic, semantic, and knowledge-intensive edit aspects. To enhance interpretability, we combine LLM-based annotation and human annotation, resulting in a benchmark that includes fine-grained instructions and gold-standard edit explanations. By evaluating existing LLMs against our benchmark, we demonstrate the effectiveness of instruction tuning and the impact of underlying architecture across various editing tasks. Furthermore, extensive experimentation reveals the significant role of explanations in fine-tuning language models for text editing tasks. The benchmark will be open-sourced to support reproduction and facilitate future research at https://github.com/megagonlabs/xatu.

XVD: Cross-Vocabulary Differentiable Training for Generative Adversarial Attacks
Tom Roth | Inigo Jauregi Unanue | Alsharif Abuadbba | Massimo Piccardi

An adversarial attack to a text classifier consists of an input that induces the classifier into an incorrect class prediction, while retaining all the linguistic properties of correctly-classified examples. A popular class of adversarial attacks exploits the gradients of the victim classifier to train a dedicated generative model to produce effective adversarial examples. However, this training signal alone is not sufficient to ensure other desirable properties of the adversarial attacks, such as similarity to non-adversarial examples, linguistic fluency, grammaticality, and so forth. For this reason, in this paper we propose a novel training objective which leverages a set of pretrained language models to promote such properties in the adversarial generation. A core component of our approach is a set of vocabulary-mapping matrices which allow cascading the generative model to any victim or component model of choice, while retaining differentiability end-to-end. The proposed approach has been tested in an ample set of experiments covering six text classification datasets, two victim models, and four baselines. The results show that it has been able to produce effective adversarial attacks, outperforming the compared generative approaches in a majority of cases and proving highly competitive against established token-replacement approaches.

Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting. The study of bias, fairness and social impact in Natural Language Processing (NLP) lacks resources in languages other than English. Our objective is to support the evaluation of bias in language models in a multilingual setting. We use stereotypes across nine types of biases to build a corpus containing contrasting sentence pairs: one sentence that presents a stereotype concerning an underadvantaged group and another minimally changed sentence concerning a matching advantaged group. We build on the French CrowS-Pairs corpus and guidelines to provide translations of the existing material into seven additional languages. In total, we produce 11,139 new sentence pairs that cover stereotypes dealing with nine types of biases in seven cultural contexts. We use the final resource for the evaluation of relevant monolingual and multilingual masked language models. We find that language models in all languages favor sentences that express stereotypes in most bias categories. The process of creating a resource that covers a wide range of language types and cultural settings highlights the difficulty of bias evaluation, in particular comparability across languages and contexts.

ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus
Injy Hamed | Fadhl Eryani | David Palfreyman | Nizar Habash

We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus. The corpus comprises twelve hours of Zoom meetings involving multiple speakers role-playing a work situation where Students brainstorm ideas for a certain topic and then discuss it with an Interlocutor. The meetings cover different topics and are divided into phases with different language setups. The corpus presents a challenging set for automatic speech recognition (ASR), including two languages (Arabic and English) with Arabic spoken in multiple variants (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic) and English used with various accents. Adding to the complexity of the corpus, there is also code-switching between these languages and dialects. As part of our work, we take inspiration from established sets of transcription guidelines to present a set of guidelines handling issues of conversational speech, code-switching and orthography of both languages. We further enrich the corpus with two layers of annotations: (1) dialectness level annotation for the portion of the corpus where mixing occurs between different variants of Arabic, and (2) automatic morphological annotations, including tokenization, lemmatization, and part-of-speech tagging.

ZeLa: Advancing Zero-Shot Multilingual Semantic Parsing with Large Language Models and Chain-of-Thought Strategies
Truong Dinh Do | Phuong Minh Nguyen | Minh Nguyen

In recent years, there have been significant advancements in semantic parsing tasks, thanks to the introduction of pre-trained language models. However, a substantial gap persists between English and other languages due to the scarcity of annotated data. One promising strategy to bridge this gap involves augmenting multilingual datasets using labeled English data and subsequently leveraging this augmented dataset for training semantic parsers (known as zero-shot multilingual semantic parsing). In our study, we propose a novel framework to effectively perform zero-shot multilingual semantic parsing with the support of large language models (LLMs). Given annotated data pairs (sentence, semantic representation) in English, our proposed framework automatically augments data in other languages via multilingual chain-of-thought (CoT) prompting techniques that progressively construct the semantic form in these languages. By breaking down the entire semantic representation into sub-semantic fragments, our CoT prompting technique simplifies the intricate semantic structure at each step, thereby helping the LLMs generate accurate outputs more efficiently. Notably, this entire augmentation process is achieved without the need for any demonstration samples in the target languages (zero-shot learning). In our experiments, we demonstrate the effectiveness of our method by evaluating it on two well-known multilingual semantic parsing datasets: MTOP and MASSIVE.

pdf abs
ZenPropaganda: A Comprehensive Study on Identifying Propaganda Techniques in Russian Coronavirus-Related Media
Anton Chernyavskiy | Svetlana Shomova | Irina Dushakova | Ilya Kiriya | Dmitry Ilvovsky

The topic of automatic detection of manipulation and propaganda in the media is not a novel issue; however, it remains an urgent concern that necessitates continuous research focus. The topic is studied within the framework of various papers, competitions and shared tasks, which provide different definitions of techniques and include the analysis of text data, images, as well as multi-lingual sources. In this study, we propose a novel multi-level classification scheme for identifying propaganda techniques. We introduce a new Russian dataset ZenPropaganda consisting of coronavirus-related texts collected from Vkontakte and Yandex.Zen platforms, which have been expertly annotated with fine-grained labeling of manipulative spans. We further conduct a comprehensive analysis by comparing our dataset with existing related ones and evaluate the performance of state-of-the-art approaches that have been proposed for them. Furthermore, we provide a detailed discussion of our findings, which can serve as a valuable resource for future research in this field.

The rapid expansion of the digital world has propelled sentiment analysis into a critical tool across diverse sectors such as marketing, politics, customer service, and healthcare. While there have been significant advancements in sentiment analysis for widely spoken languages, low-resource languages, such as Bangla, remain largely under-researched due to resource constraints. Furthermore, the recent unprecedented performance of Large Language Models (LLMs) in various applications highlights the need to evaluate them in the context of low-resource languages. In this study, we present a sizeable manually annotated dataset encompassing 33,606 Bangla news tweets and Facebook comments. We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz, offering a comparative analysis against fine-tuned models. Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero and few-shot scenarios. To foster continued exploration, we intend to make this dataset and our research tools publicly available to the broader research community.

pdf abs
Zero-shot Cross-lingual Automated Essay Scoring
Junyi He | Xia Li

Due to the difficulty of creating high-quality labelled training data for different languages, the low-resource problem is crucial yet challenging for automated essay scoring (AES). However, little attention has been paid to addressing this challenge. In this paper, we propose a novel zero-shot cross-lingual scoring method from the perspectives of pretrained multilingual representation and writing quality alignment to score essays in unseen languages. Specifically, we adopt multilingual pretrained language models as the encoder backbone to deeply and comprehensively represent multilingual essays. Motivated by the fact that the scoring knowledge for evaluating writing quality is comparable across different languages, we introduce an innovative strategy for aligning essays in a language-independent manner. The proposed strategy aims to capture shared knowledge from diverse languages, thereby enhancing the representation of essays written in unseen languages with respect to their quality. We conduct extensive experiments on essay datasets in six languages (Czech, German, English, Spanish, Italian and Portuguese), and the results demonstrate that our method achieves state-of-the-art cross-lingual scoring performance.

Event Causality Identification (ECI) refers to the detection of causal relations between events in texts. However, most existing studies focus on sentence-level ECI with high-resource languages, leaving more challenging document-level ECI (DECI) with low-resource languages under-explored. In this paper, we propose a Heterogeneous Graph Interaction Model with Multi-granularity Contrastive Transfer Learning (GIMC) for zero-shot cross-lingual document-level ECI. Specifically, we introduce a heterogeneous graph interaction network to model the long-distance dependencies between events that are scattered over a document. Then, to improve cross-lingual transferability of causal knowledge learned from the source language, we propose a multi-granularity contrastive transfer learning module to align the causal representations across languages. Extensive experiments show our framework outperforms the previous state-of-the-art model by 9.4% and 8.2% of average F1 score on monolingual and multilingual scenarios respectively. Notably, in the multilingual scenario, our zero-shot framework even exceeds GPT-3.5 with few-shot learning by 24.3% in overall performance.

Zero-shot event detection is a challenging task. Recent research work proposed to use a pre-trained textual entailment (TE) model on this task. However, those methods treated the TE model as a frozen annotator. We treat the TE model as an annotator that can be enhanced. We propose to use TE models to annotate large-scale unlabeled text and use annotated data to finetune the TE model, yielding an improved TE model. Finally, the improved TE model is used for inference on the test set. To improve the efficiency, we propose to use keywords to filter out sentences with a low probability of expressing event(s). To improve the coverage of keywords, we expand a limited number of seed keywords using WordNet, so that we can use the TE model to annotate unlabeled text efficiently. The experimental results show that our method can outperform other baselines by 15% on the ACE05 dataset.

pdf abs
Zero-shot Learning for Multilingual Discourse Relation Classification
Eleni Metheniti | Philippe Muller | Chloé Braud | Margarita Hernández Casas

Classifying discourse relations is known to be a hard task, relying on complex indices. On the other hand, discourse-annotated data is scarce, especially for languages other than English: many corpora, of limited size, exist for several languages but the domain is split between different theoretical frameworks that have a huge impact on the nature of the textual spans to be linked, and the label set used. Moreover, each annotation project implements modifications compared to the theoretical background and other projects. These discrepancies hinder the development of systems taking advantage of all the available data to tackle data sparsity, and work on transfer between languages is very limited, almost nonexistent between frameworks, while it could improve our understanding of some theoretical aspects and enhance many applications. In this paper, we propose the first experiments on zero-shot learning for discourse relation classification and investigate several paths in the way source data can be combined, either based on languages, frameworks, or similarity measures. We demonstrate how difficult transfer is for the task at hand, and that the most impactful factor is label set divergence, where the notion of underlying framework possibly conceals crucial disagreements.

Zero-shot Spoken Language Understanding (SLU) aims to enable task-oriented dialogue systems to understand user needs without training data. Challenging but worthwhile, zero-shot SLU reduces the time and effort that data labeling takes. Recent advancements in large language models (LLMs), such as GPT3.5 and ChatGPT, have shown promising results in zero-shot settings, which motivates us to explore prompt-based methods. In this study, we investigate whether strong SLU models can be constructed by directly prompting LLMs. Specifically, we propose a simple yet effective two-stage framework dubbed GPT-SLU, which transforms the SLU task into a question-answering problem. Powered by multi-stage mutual guided prompts, GPT-SLU can leverage the correlations between two subtasks in SLU to achieve better predictions, which is greatly explored in the traditional fine-tuning paradigm. Experimental results on three SLU benchmark datasets demonstrate the significant potential of LLMs for zero-shot SLU. Comprehensive analyses validate the effectiveness of our proposed framework and also indicate that there is still room for further improvement of LLMs in SLU scenarios.

pdf (full)
bib (full) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries

pdf bib
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries
Roman Klinger | Naozaki Okazaki | Nicoletta Calzolari | Min-Yen Kan

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. As a multidisciplinary research field, multimodal large language models (MLLMs) have recently garnered growing interest in both academia and industry, showing an unprecedented trend to achieve human-level AI via MLLMs. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on four key areas: MLLM architecture design, instructional learning, multimodal reasoning, and the efficiency of MLLMs. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.

pdf bib abs
Geo-Cultural Representation and Inclusion in Language Technologies
Sunipa Dev | Rida Qadri

Training and evaluation of language models are increasingly relying on semi-structured data that is annotated by humans, along with techniques such as RLHF growing in usage across the board. As a result, both the data and the human perspectives involved in this process play a key role in what is taken as ground truth by our models. As annotation tasks are becoming increasingly more subjective and culturally complex, it is unclear how much of their socio-cultural identity annotators use to respond to tasks. We also currently do not have ways to integrate rich and diverse community perspectives into our language technologies. Accounting for such cross-cultural differences in interacting with technology is an increasingly crucial step for evaluating AI harms holistically. Without this, the state of the art of the AI models being deployed is at risk of causing unprecedented biases at a global scale. In this tutorial, we will take an interactive approach by utilizing some different types of annotation tasks to investigate together how our different socio-cultural perspectives and lived experiences influence what we consider as appropriate representations of global concepts.

This tutorial reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Presented by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We propose a cutting-edge, full-day tutorial for all stakeholders in the AI community, including NLP researchers, domain-specific practitioners, and students.

pdf abs
Navigating the Modern Evaluation Landscape: Considerations in Benchmarks and Frameworks for Large Language Models (LLMs)
Leshem Choshen | Ariel Gera | Yotam Perlitz | Michal Shmueli-Scheuer | Gabriel Stanovsky

General-Purpose Language Models have changed the world of Natural Language Processing, if not the world itself. The evaluation of such versatile models, while supposedly similar to evaluation of generation models before them, in fact presents a host of new evaluation challenges and opportunities. In this Tutorial, we will start from the building blocks of evaluation. The tutorial welcomes people from diverse backgrounds and assumes little familiarity with metrics, datasets, prompts and benchmarks. It will lay the foundations and explain the basics and their importance, while touching on the major points and breakthroughs of the recent era of evaluation. It will also compare traditional evaluation methods – which are still widely used – to newly developed methods. We will contrast new to old approaches, from evaluating on many-task benchmarks rather than on dedicated datasets to efficiency constraints, and from testing stability and prompts on in-context learning to using the models themselves as evaluation metrics. Finally, the tutorial will cover practical issues, ranging from reviewing widely-used benchmarks and prompt banks to efficient evaluation.

pdf abs
Mining, Assessing, and Improving Arguments in NLP and the Social Sciences
Gabriella Lapesa | Eva Maria Vecchi | Serena Villata | Henning Wachsmuth

Computational argumentation is an interdisciplinary research field, connecting Natural Language Processing (NLP) to other disciplines such as the social sciences. The focus of recent research has concentrated on argument quality assessment: what makes an argument good or bad? We present a tutorial which is an updated edition of the EACL 2023 tutorial presented by the same authors. As in the previous version, the tutorial will have a strong interdisciplinary and interactive nature, and will be structured along three main coordinates: (1) the notions of argument quality (AQ) across disciplines (how do we recognize good and bad arguments?), with a particular focus on the interface between Argument Mining (AM) and Deliberation Theory; (2) the modeling of subjectivity (who argues to whom; what are their beliefs?); and (3) the generation of improved arguments (what makes an argument better?). The tutorial will also touch upon a series of topics that are particularly relevant for the LREC-COLING audience (the issue of resource quality for the assessment of AQ; the interdisciplinary application of AM and AQ in a text-as-data approach to Political Science), in line with the developments in NLP (LLMs for AQ assessment), and relevant for the societal applications of AQ assessment (bias and debiasing). We will involve the participants in two annotation studies on the assessment and the improvement of quality.

pdf abs
Knowledge Editing for Large Language Models
Ningyu Zhang | Yunzhi Yao | Shumin Deng

Even with their impressive abilities, Large Language Models (LLMs) such as ChatGPT are not immune to issues of factual accuracy or logical consistency. Concretely, the key concern is how to seamlessly update those LLMs to correct mistakes without resorting to an exhaustive retraining or continuous training procedure, both of which can demand significant computational resources and time. Thus, the capability to edit LLMs offers an efficient solution to alter a model’s behavior, notably within a distinct area of interest, without negatively impacting its performance on other tasks. Through this tutorial, we strive to acquaint interested NLP researchers with recent and emerging techniques for editing LLMs. Specifically, we aim to present a systematic and current overview of cutting-edge methods, supplemented with practical tools, and unveil new research opportunities for our audiences. All the valuable resources can be accessed at https://github.com/zjunlp/KnowledgeEditingPapers.

pdf abs
The DBpedia Databus Tutorial: Increase the Visibility and Usability of Your Data
Milan Dojchinovski

This tutorial introduces DBpedia Databus (https://databus.dbpedia.org), a FAIR data publishing platform, to address challenges faced by data producers and consumers. It covers data organization, publishing, and consumption on the DBpedia Databus, with an exclusive focus on Linguistic Knowledge Graphs. The tutorial offers practical insights for knowledge graph stakeholders, aiding data integration and accessibility in the Linked Open Data community. Designed for a diverse audience, it fosters hands-on learning to familiarize participants with the DBpedia Databus technology.

pdf abs
NLP for Chemistry – Introduction and Recent Advances
Camilo Thorne | Saber Akhondi

In this half-day tutorial we will give an introductory overview of a number of recent applications of natural language processing to a relatively underrepresented application domain: chemistry. Specifically, we will see how neural language models (transformers) can be applied (oftentimes with near-human performance) to chemical text mining, reaction extraction, or more importantly computational chemistry (forward and backward synthesis of chemical compounds). At the same time, a number of gold standards for experimentation have been made available to the research community, academic and otherwise. Theoretical results will be, whenever possible, supported by system demonstrations in the form of Jupyter notebooks. This tutorial targets an audience interested in bioinformatics and biomedical applications, but presupposes no advanced knowledge of either.

pdf abs
Formal Semantic Controls over Language Models
Danilo Silva de Carvalho | Yingji Zhang | André Freitas

Text embeddings provide a concise representation of the semantics of sentences and larger spans of text, rather than individual words, capturing a wide range of linguistic features. They have found increasing application to a variety of NLP tasks, including machine translation and natural language inference. While most recent breakthroughs in task performance are being achieved by large scale distributional models, there is a growing disconnection between their knowledge representation and traditional semantics, which hinders efforts to capture such knowledge in human interpretable form or explain model inference behaviour. In this tutorial, we cover research on the analysis and control of text representations, from the basics to the cutting edge, aiming to narrow the gap between deep latent semantics and formal symbolics. This includes considerations on knowledge formalisation, the linguistic information that can be extracted and measured from distributional models, and intervention techniques that enable explainable reasoning and controllable text generation, covering methods from pooling to LLM-based approaches.

pdf abs
Towards a Human-Computer Collaborative Scientific Paper Lifecycle: A Pilot Study and Hands-On Tutorial
Qingyun Wang | Carl Edwards | Heng Ji | Tom Hope

Due to the rapid growth of publications varying in quality, there exists a pressing need to help scientists digest and evaluate relevant papers, thereby facilitating scientific discovery. This creates a number of urgent questions; however, computer-human collaboration in the scientific paper lifecycle is still in the exploratory stage and lacks a unified framework for analyzing the relevant tasks. Additionally, with the recent significant success of large language models (LLMs), they have increasingly played an important role in academic writing. In this cutting-edge tutorial, we aim to provide an all-encompassing overview of the paper lifecycle, detailing how machines can augment every stage of the research process for the scientist, including scientific literature understanding, experiment development, manuscript draft writing, and finally draft evaluation. This tutorial is devised for researchers interested in this rapidly-developing field of NLP-augmented paper writing. The tutorial will also feature a session of hands-on exercises during which participants can guide machines in generating ideas and automatically composing key paper elements. Furthermore, we will address current challenges, explore future directions, and discuss potential ethical issues. A toolkit designed for human-computer collaboration throughout the paper lifecycle will also be made publicly available.

pdf abs
Tutorial Proposal: Hallucination in Large Language Models
Vipula Rawte | Aman Chadha | Amit Sheth | Amitava Das

In the fast-paced domain of Large Language Models (LLMs), the issue of hallucination is a prominent challenge. Despite continuous endeavors to address this concern, it remains a highly active area of research within the LLM landscape. Grasping the intricacies of this problem can be daunting, especially for those new to the field. This tutorial aims to bridge this knowledge gap by introducing the emerging realm of hallucination in LLMs. It will comprehensively explore the key aspects of hallucination, including benchmarking, detection, and mitigation techniques. Furthermore, we will delve into the specific constraints and shortcomings of current approaches, providing valuable insights to guide future research efforts for participants.

In the landscape of natural language processing (NLP), addressing the challenges of bias and hallucination is paramount to ensuring the ethical and unbiased development of Large Language Models (LLMs). This tutorial delves into the intricate dimensions of LLMs, shedding light on the critical importance of understanding and mitigating the profound impacts of bias and hallucination. Divided into two parts, the first part delves deep into the complexity of bias propagation in LLM development, where we dissect its origins and far-reaching impacts. We then present innovative methodologies for mitigating diverse forms of bias, including dynamic word embeddings and robust benchmarking strategies. The second part of the tutorial discusses hallucination - a prevalent issue in generative AI systems such as LLMs. Through advanced data-driven techniques, we decode its intricate effects and complexities, followed by factually-driven mitigation strategies. Furthermore, we shed light on the pivotal role of human cognitive behavior in the context of hallucination, drawing insights from cognitive data, including human eye-tracking data. Ultimately, this cutting-edge tutorial serves as a guiding light, equipping participants with indispensable tools and insights to navigate the ethical complexities of LLMs, thus paving the way for the development of unbiased and ethically robust NLP systems.

pdf abs
Knowledge-enhanced Response Generation in Dialogue Systems: Current Advancements and Emerging Horizons
Priyanshu Priya | Deeksha Varshney | Mauajama Firdaus | Asif Ekbal

This tutorial provides an in-depth exploration of Knowledge-enhanced Dialogue Systems (KEDS), diving into their foundational aspects, methodologies, advantages, and practical applications. Topics include the distinction between internal and external knowledge integration, diverse methodologies employed in grounding dialogues, and innovative approaches to leveraging knowledge graphs for enhanced conversation quality. Furthermore, the tutorial touches upon the rise of biomedical text mining, the advent of domain-specific language models, and the challenges and strategies specific to medical dialogue generation. The primary objective is to give attendees a comprehensive understanding of KEDS. By delineating the nuances of these systems, the tutorial aims to elucidate their significance, highlight advancements made using deep learning, and pinpoint the current challenges. Special emphasis is placed on showcasing how KEDS can be fine-tuned for domain-specific requirements, with a spotlight on the healthcare sector. The tutorial is crafted for both beginners and intermediate researchers in the dialogue systems domain, with a focus on those keen on advancing research in KEDS. It will also be valuable for practitioners in sectors like healthcare, seeking to integrate advanced dialogue systems.

pdf (full)
bib (full) Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

pdf bib
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024
Pierre Zweigenbaum | Reinhard Rapp | Serge Sharoff

pdf bib
On a Novel Application of Wasserstein-Procrustes for Unsupervised Cross-Lingual Alignment of Embeddings
Guillem Ramírez | Rumen Dangovski | Preslav Nakov | Marin Soljacic

pdf
Invited Talk: The Way Towards Massively Multilingual Language Models
François Yvon

pdf
Exploring the Potential of Large Language Models in Adaptive Machine Translation for Generic Text and Subtitles
Abdelhadi Soudi | Mohamed Hannani | Kristof Van Laerhoven | Eleftherios Avramidis

pdf
INCLURE: a Dataset and Toolkit for Inclusive French Translation
Paul Lerner | Cyril Grouin

pdf
Creating Clustered Comparable Corpora from Wikipedia with Different Fuzziness Levels and Language Representativity
Anna Laskina | Eric Gaussier | Gaelle Calvary

pdf
EuReCo: Not Building and Yet Using Federated Comparable Corpora for Cross-Linguistic Research
Marc Kupietz | Piotr Banski | Nils Diewald | Beata Trawinski | Andreas Witt

pdf
Building Annotated Parallel Corpora Using the ATIS Dataset: Two UD-style treebanks in English and Turkish
Neslihan Cesur | Aslı Kuzgun | Mehmet Kose | Olcay Taner Yıldız

pdf
Bootstrapping the Annotation of UD Learner Treebanks
Arianna Masciolini

pdf
SweDiagnostics: A Diagnostics Natural Language Inference Dataset for Swedish
Felix Morger

pdf
Multiple Discourse Relations in English TED Talks and Their Translation into Lithuanian, Portuguese and Turkish
Deniz Zeyrek | Giedrė Valūnaitė Oleškevičienė | Amalia Mendes

pdf
mini-CIEP+ : A Shareable Parallel Corpus of Prose
Annemarie Verkerk | Luigi Talamo

pdf (full)
bib (full) Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024

pdf bib
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024
Kyle Gorman | Emily Prud'hommeaux | Brian Roark | Richard Sproat

pdf bib abs
ParsText: A Digraphic Corpus for Tajik-Farsi Transliteration
Rayyan Merchant | Kevin Tang

Despite speaking dialects of the same language, Persian speakers from Tajikistan cannot read Persian texts from Iran and Afghanistan. This is due to the fact that Tajik Persian is written in the Tajik-Cyrillic script, while Iranian and Afghan Persian are written in the Perso-Arabic script. As the formal registers of these dialects all maintain high levels of mutual intelligibility with each other, machine transliteration has been proposed as a more practical and appropriate solution than machine translation. Unfortunately, Persian texts written in both scripts are much more common in print in Tajikistan than online. This paper introduces a novel corpus meant to remedy that gap: ParsText. ParsText contains 2,813 Persian sentences written in both Tajik-Cyrillic and Perso-Arabic manually collected from blog pages and news articles online. This paper presents the need for such a corpus, previous and related work, data collection and alignment procedures, and corpus statistics, and discusses directions for future work.

pdf bib abs
A Joint Approach for Automatic Analysis of Reading and Writing Errors
Wieke Harmsen | Catia Cucchiarini | Roeland van Hout | Helmer Strik

Analyzing the errors that children make on their way to becoming fluent readers and writers can provide invaluable scientific insights into the processes that underlie literacy acquisition. To this end, we present in this paper an extension of an earlier developed spelling error detection and classification algorithm for Dutch, so that reading errors can also be automatically detected from their phonetic transcription. The strength of this algorithm lies in its ability to detect errors at Phoneme-Corresponding Unit (PCU) level, where a PCU is a sequence of letters corresponding to one phoneme. We validated this algorithm and found good agreement between manual and automatic reading error classifications. We also used the algorithm to analyze written words by second graders and phonetic transcriptions of read words by first graders. With respect to the writing data, we found that the PCUs ‘ei’, ‘eu’, ‘g’, ‘ij’ and ‘ch’ were most frequently written incorrectly; for the reading data, these were the PCUs ‘v’, ‘ui’, ‘ng’, ‘a’ and ‘g’. This study presents a first attempt at developing a joint method for detecting reading and writing errors. In future research this algorithm can be used to analyze corpora containing reading and writing data from the same children.
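To make the notion of a PCU concrete, the following is a minimal sketch of splitting a written word into PCUs and comparing a child's attempt with the target, unit by unit. It is an illustration only and not the algorithm described in the abstract: the inventory `MULTI_LETTER_PCUS`, the greedy longest-match rule, and the function names are all assumptions made for this example, and the comparison assumes both words split into the same number of units.

```python
# Hypothetical, deliberately incomplete inventory of multi-letter Dutch PCUs.
# A real inventory would be larger and context-sensitive.
MULTI_LETTER_PCUS = {"ei", "eu", "ij", "ch", "ui", "ng", "aa", "oe"}


def segment_into_pcus(word, inventory=MULTI_LETTER_PCUS):
    """Greedy longest-match split of a word into PCUs (single letters as fallback)."""
    max_len = max(len(unit) for unit in inventory)
    units = []
    i = 0
    while i < len(word):
        for size in range(max_len, 0, -1):
            chunk = word[i:i + size]
            # Accept a known multi-letter PCU, or fall back to one letter.
            if size == 1 or chunk in inventory:
                units.append(chunk)
                i += len(chunk)
                break
    return units


def mismatched_pcus(target, attempt):
    """Return (target PCU, attempted PCU) pairs that differ.

    Assumes both words split into the same number of PCUs; a real system
    would need a proper alignment step for insertions and deletions.
    """
    target_units = segment_into_pcus(target)
    attempt_units = segment_into_pcus(attempt)
    if len(target_units) != len(attempt_units):
        raise ValueError("this sketch only handles equal-length PCU sequences")
    return [(t, a) for t, a in zip(target_units, attempt_units) if t != a]


print(segment_into_pcus("trein"))
print(mismatched_pcus("trein", "trijn"))
```

Counting errors per PCU rather than per letter is what allows a misspelling such as ‘trijn’ for ‘trein’ to be recorded as a single ‘ei’ error instead of two letter-level errors.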

pdf abs
Tool for Constructing a Large-Scale Corpus of Code Comments and Other Source Code Annotations
Luna Peck | Susan Brown

The sublanguage of source code annotations—explanatory natural language writing that accompanies programming source code—is little-studied in linguistics. To facilitate research into this domain, we have developed a program prototype that can extract code comments and changelogs (i.e. commit messages) from public, open-source code repositories, with automatic tokenization and part-of-speech tagging on the extracted text. The program can also automatically detect and discard “commented-out” source code in data from Python repositories, to prevent it from polluting the corpus, demonstrating that such sanitization is likely feasible for other programming languages as well. With the current tool, we have produced a 6-million word corpus of English-language comments extracted from three different programming languages: Python, C, and C++.
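As a rough illustration of what detecting “commented-out” code in Python sources can involve, here is a minimal sketch using only the standard library. It is not the tool described in the abstract: the heuristic (a comment counts as code if it parses as Python and is more than a bare name or literal) and the function names `extract_comments` and `looks_like_code` are assumptions made for this example.

```python
import ast
import io
import tokenize


def extract_comments(source):
    """Return the text of every comment in a Python source string."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string.lstrip("#").strip())
    return comments


def looks_like_code(text):
    """Heuristic: the comment parses as Python and is not just a bare word."""
    try:
        tree = ast.parse(text)
    except SyntaxError:
        return False  # ordinary prose usually fails to parse
    for node in tree.body:
        # A lone name or literal such as "TODO" parses, but is treated as prose.
        if isinstance(node, ast.Expr) and isinstance(node.value, (ast.Name, ast.Constant)):
            continue
        return True
    return False


source = (
    "x = 1  # set the counter\n"
    "# y = compute(x)\n"
    "# TODO\n"
    "print(x)\n"
)
prose = [c for c in extract_comments(source) if not looks_like_code(c)]
print(prose)
```

A heuristic of this kind will misclassify some comments, for instance prose that happens to be valid Python, so any corpus built this way would still need spot checks.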

pdf abs
Tokenization via Language Modeling: the Role of Preceding Text
Rastislav Hronsky | Emmanuel Keuleers

While language models benefit immensely from their capacity to model large context (i.e., sequence of preceding tokens), the role of context is unclear in text tokenization, which is, in many cases, language model-driven to begin with. In this paper, we attempt to explore this role in three different writing systems and using three different text tokenization strategies (word-based, Morfessor, and BPE). In the first experiment, we examined how the size of context used for predicting the next token affects the ranking of the segmentation strategies in terms of language model surprisal. This effect was very writing system specific: minimal in case of English, and rank-reversing due to increased context size and token granularity in case of Turkish and Chinese. In the second experiment, we examined how context alters segmentation hypotheses when using language models to identify word boundaries. In this case, the effect was subtle: using context-aware, rather than context-free segment scores improved boundary recognition accuracy by up to 0.5%, once baseline effects were exploited.
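The idea of scoring a token sequence by surprisal under different context sizes can be sketched with toy count-based models. This is an illustration under strong assumptions, not the experimental setup of the paper: the models are maximum-likelihood estimates fitted on the very sequence they score (no held-out data, no smoothing), the context is at most one token, and the function names and the `<s>` start symbol are invented for this example.

```python
import math
from collections import Counter


def unigram_surprisal(tokens):
    """Total surprisal in bits with no context: sum of -log2 p(token)."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum(-math.log2(counts[t] / total) for t in tokens)


def bigram_surprisal(tokens, start="<s>"):
    """Total surprisal in bits with one token of context: sum of -log2 p(token | previous)."""
    padded = [start] + list(tokens)
    pairs = list(zip(padded, padded[1:]))
    pair_counts = Counter(pairs)
    context_counts = Counter(padded[:-1])
    return sum(-math.log2(pair_counts[(p, t)] / context_counts[p]) for p, t in pairs)


tokens = ["a", "b"] * 3  # a perfectly alternating toy sequence
print(unigram_surprisal(tokens))  # 6 tokens at 1 bit each
print(bigram_surprisal(tokens))   # each token is fully predictable from the previous one
```

In this toy case one token of context removes all uncertainty. Applying the same two functions to different segmentations of the same text is one simple way to see how a ranking of segmentation strategies could change with context size.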

pdf abs
Abbreviation Across the World’s Languages and Scripts
Kyle Gorman | Brian Roark

Detailed taxonomies for non-standard words, including abbreviations, have been developed for speech and language processing, though mostly with reference to English. In this paper, we examine abbreviation formation strategies in a diverse sample of more than 50 languages, dialects and scripts. The resulting taxonomy, together with data about which strategies are attested in which languages, provides key information needed to create multilingual systems for abbreviation expansion, an essential component for speech processing and text understanding.

pdf abs
Now You See Me, Now You Don’t: ‘Poverty of the Stimulus’ Problems and Arbitrary Correspondences in End-to-End Speech Models
Daan van Esch

End-to-end models for speech recognition and speech synthesis have many benefits, but we argue they also face a unique set of challenges not encountered in conventional multi-stage hybrid systems, which relied on the explicit injection of linguistic knowledge through resources such as phonemic dictionaries and verbalization grammars. These challenges include handling words with unusual grapheme-to-phoneme correspondences, converting between written forms like ‘12’ and spoken forms such as ‘twelve’, and contextual disambiguation of homophones or homographs. We describe the mitigation strategies that have been used for these problems in end-to-end systems, either implicitly or explicitly, and call out that the most commonly used mitigation techniques are likely incompatible with newly emerging approaches that use minimal amounts of supervised audio training data. We review best-of-both-world approaches that allow the use of end-to-end models combined with traditional linguistic resources, which we show are increasingly straightforward to create at scale, and close with an optimistic outlook for bringing speech technologies to many more languages by combining these strands of research.

pdf abs
Towards Fast Cognate Alignment on Imbalanced Data
Logan Born | M. Willis Monroe | Kathryn Kelley | Anoop Sarkar

Cognate alignment models purport to enable decipherment, but their speed and need for clean data can make them unsuitable for realistic decipherment problems. We seek to draw attention to these shortcomings in the hopes that future work may avoid them, and we outline two techniques which begin to overcome the described problems.

pdf abs
Simplified Chinese Character Distance Based on Ideographic Description Sequences
Yixia Wang | Emmanuel Keuleers

Character encoding systems have long overlooked the internal structure of characters. Ideographic Description Sequences, which explicitly represent spatial relations between character components, are a potential solution to this problem. In this paper, we illustrate the utility of Ideographic Description Sequences in computing edit distance and finding orthographic neighbors for Simplified Chinese characters. In addition, we explore the possibility of using Ideographic Description Sequences to encode spatial relations between components in other scripts.

pdf (full)
bib (full) Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024

pdf bib
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024
Dina Demner-Fushman | Sophia Ananiadou | Paul Thompson | Brian Ondov

This paper presents a study on Swiss-French sign language production in the medical domain. In emergency care settings, a lack of clear communication can interfere with accurate delivery of health related services. For patients communicating with sign language, equal access to healthcare remains an issue. While previous work has explored producing sign language gloss from a source text, we propose to extend this approach to produce a multichannel sign language output given a written French input. Furthermore, we extend our approach with a multi-task framework allowing us to include the Unified Medical Language System (UMLS) in our model. Results show that the introduction of UMLS in the training data improves model accuracy by 13.64 points.

Sentiment analysis is an important tool for aggregating patient voices, in order to provide targeted improvements in healthcare services. A prerequisite for this is the availability of in-domain data annotated for sentiment. This article documents an effort to add sentiment annotations to free-text comments in patient surveys collected by the Norwegian Institute of Public Health (NIPH). However, annotation can be a time-consuming and resource-intensive process, particularly when it requires domain expertise. We therefore also evaluate a possible alternative to human annotation, using large language models (LLMs) as annotators. We perform an extensive evaluation of the approach for two openly available pretrained LLMs for Norwegian, experimenting with different configurations of prompts and in-context learning, comparing their performance to human annotators. We find that even for zero-shot runs, models perform well above the baseline for binary sentiment, but still cannot compete with human annotators on the full dataset.

pdf abs
Simulating Diverse Patient Populations Using Patient Vignettes and Large Language Models
Daniel Reichenpfader | Kerstin Denecke

Ensuring equitable access to digital therapeutics (DTx) is essential to avoid healthcare inequalities in an era of increasing digitization. This requires DTx to be tested with users from diverse populations, which is often not realistic due to time and resource constraints. In this paper, we propose the use of large language models (LLMs) to simulate diverse patients. Specifically, we manually create a patient vignette that characterizes a specific population group. Variations of this vignette are used for role-prompting a commercial LLM, GPT-4, instructing the LLM to take on the role described in the patient vignette and act accordingly. We investigate if the LLM stays in its given role. To do this, we simulate a medical anamnesis interview with the role-prompted LLM and analyze its responses for compliance, coherence, correctness, containment, and clarification. Our results show that GPT-4 generates compliant, coherent and clinically valid responses, including information that is not explicitly stated in the provided patient vignette.

In this article, we aim to measure patients’ progress in recognizing and naming emotions by capturing a variety of phenomena that express emotion in discourse. To do so, we introduce an emotion annotation scheme adapted for Acquired Brain Injury (ABI) patients’ narratives. We draw on recent research outcomes in line with linguistic and psychological theories of emotion in the development of French resources for Natural Language Processing (NLP). From this perspective and following the guidelines of Battistelli et al. (2022), our protocol considers several means of expressing emotions, including prototypical expressions as well as implicit means. Its originality lies in the methodology adopted for its creation, as we combined, adapted, and tested several previous annotation schemes to create a tool tailored to our spoken clinical French corpus and its unique characteristics and challenges.

In recent years, it has become common for patients to get full access to their Electronic Health Records (EHRs), thanks to advancements in the EHR systems of many healthcare providers. While this access empowers patients and doctors with comprehensive and real-time health information, it also introduces new challenges, in particular due to the unstructured nature of much of the information within EHRs. To address this, we propose a pipeline to structure clinical notes, providing patients with a clear and concise overview of their health data and its longitudinal evolution, while also allowing clinicians to focus more on patient care during consultations. In this paper, we present preliminary results on extracting structured information from anamneses of patients diagnosed with ST-Elevation Myocardial Infarction from an Italian hospital. Our pipeline exploits text classification models to extract relevant clinical variables, comparing rule-based, recurrent neural network and BERT-based models. While various approaches utilized ontologies or knowledge graphs for Italian data, our work represents the first attempt to develop this type of pipeline. The results for the extraction of most variables are satisfactory (F1-score > 0.80), with the exception of the rarest values of certain variables, for which we propose future research directions to investigate.

In this paper, we describe results of a study on evaluation of intralingual machine translation. The study focuses on machine translations of medical texts into Plain German. The automatically simplified texts were compared with manually simplified texts (i.e., simplified by human experts) as well as with the underlying, unsimplified source texts. We analyse the quality of the translations based on different criteria, such as correctness, readability, and syntactic complexity. The study revealed that the machine translations were easier to read than the source texts, but contained a higher number of complex syntactic relations than the human translations. Furthermore, we identified various types of mistakes. These included not only grammatical mistakes but also content-related mistakes that resulted, for example, from mistranslations of grammatical structures, ambiguous words or numbers, omissions of relevant prefixes or negation, and incorrect explanations of technical terms.

pdf abs
Large Language Models as Drug Information Providers for Patients
Luca Giordano | Maria Pia di Buono

Recently, significant interest has arisen in the application of Large Language Models (LLMs) in medical settings to enhance various aspects of healthcare. In particular, the application of such models to improve knowledge access for both clinicians and patients seems very promising but is still far from perfect. In this paper, we present a preliminary evaluation of LLMs as drug information providers to support patients in drug administration. We focus on posology (namely dosage quantity and prescription), contraindications and adverse drug reactions, and run an experiment on the Italian language to assess both the trustworthiness of the outputs and their readability. The results show that different types of errors affect the LLM answers. In some cases, the model does not recognize the drug name, due to the presence of synonymous words, or it provides untrustworthy information, caused by intrinsic hallucinations. Overall, the complexity of the language is lower and this could contribute to making medical information more accessible to lay people.

pdf abs
Towards Generation of Personalised Health Intervention Messages
Clara Wan Ching Ho | Volha Petukhova

Self-care is essential in managing chronic diseases when patients cannot always be monitored by medical staff. It therefore fills the gap by providing patients with advice on improving their conditions in day-to-day practice. However, the effectiveness of self-interventions in encouraging healthy behaviour is limited, as they are often delivered in the same manner to all patients regardless of their demographics, personality and individual preferences. In this paper, we propose strategies to generate personalised health intervention messages departing from assumptions made by theories of social cognition and learning, planned behaviour and information processing. The main task is then defined as a personalised argument generation task. Specifically, an existing well-performing Natural Language Generation (NLG) pipeline model is extended to modulate linguistic features by ranking texts generated based on individuals’ predicted preferences for persuasive messages. Results show that the model is capable of generating diverse intervention messages while preserving the original intended meaning. The modulated interventions were approved by human evaluators as being more understandable and maintaining the same level of convincingness as human-written texts. However, the generated personalised interventions did not show significant improvements in the power to change health-related attitudes and/or behaviour compared to their non-personalised counterparts. This is attributed to the fact that the human data collected for the model’s training was rather limited in size and variation.

pdf abs
Analysing Emotions in Cancer Narratives: A Corpus-Driven Approach
Daisy Monika Lal | Paul Rayson | Sheila A. Payne | Yufeng Liu

Cancer not only affects a patient’s physical health, but it can also elicit a wide spectrum of intense emotions in patients, friends, and family members. People with cancer and their carers (family member, partner, or friend) are increasingly turning to the web for information and support. Despite the expansion of sentiment analysis in the context of social media and healthcare, there is relatively little research on patient narratives, which are longer, more complex texts, and difficult to assess. In this exploratory work, we examine how patients and carers express their feelings about various aspects of cancer (treatments and stages). The objective of this paper is to illustrate with examples the nature of language in the clinical domain, as well as the complexities of language when performing automatic sentiment and emotion analysis. We perform a linguistic analysis of a corpus of cancer narratives collected from Reddit. We examine the performance of five state-of-the-art models (T5, DistilBERT, Roberta, RobertaGo, and NRCLex) to see how well they match with human comparisons separated by linguistic and medical background. The corpus yielded several surprising results that could be useful to sentiment analysis NLP experts. The linguistic issues encountered were classified into four categories: statements expressing a variety of emotions, ambiguous or conflicting statements with contradictory emotions, statements requiring additional context, and statements in which sentiment and emotions can be inferred but are not explicitly mentioned.

pdf abs
Study of Medical Text Reading and Comprehension through Eye-Tracking Fixations
Oksana Ivchenko | Natalia Grabar

Reading plays a crucial role in cognitive processes, acting as the primary way in which people access and assimilate information. However, the ability to comprehend text effectively is significantly influenced by various factors related to people and text types. We propose to study the ease of reading and comprehension of texts through eye-tracking technology, which tracks gaze and records eye movement during reading. We concentrate on the study of eye-tracking measures related to fixations (average duration of fixations and number of fixations). The experiments are performed on several types of texts (clinical cases, encyclopedia articles related to the medical area, general-language texts, and simplified clinical cases). Eye-tracking measures are analysed quantitatively and qualitatively to identify reading patterns and analyse how reading differs across the text types.

We propose a dialogue system that enables heart failure patients to inquire about salt content in foods and helps them monitor and reduce salt intake. Addressing the lack of specific datasets for food-based salt content inquiries, we develop a template-based conversational dataset. The dataset is structured to ask clarification questions to identify food items and their salt content. Our findings indicate that while fine-tuning transformer-based models on the dataset yields limited performance, the integration of neuro-symbolic rules significantly enhances the system’s performance. Our experiments show that by integrating neuro-symbolic rules, our system achieves an improvement in joint goal accuracy of over 20% across different data sizes compared to naively fine-tuning transformer-based models.

pdf abs
On Simplification of Discharge Summaries in Serbian: Facing the Challenges
Anđelka Zečević | Milica Ćulafić | Stefan Stojković

The simplified information page (SIP) is a simplified discharge summary created to mitigate health risks caused by low medical comprehension. One of the most critical aspects of medical comprehension concerns interpreting medication instructions such as proper dosing, frequency, and duration. In our work, we examine the capacities of mainstream Large Language Models (LLMs) such as ChatGPT and Gemini to generate SIP-like medication-oriented pages based on the provided discharge summaries. We are sharing the initial qualitative assessments of our study based on a small collection of discharge summaries in Serbian, pointing to noticed inaccuracies, unfaithful content, and language quality. Hopefully, these findings might be helpful in addressing the multilingual perspective of patient-oriented language.

Metaphors shape the way we think by enabling the expression of one concept in terms of another one. For instance, cancer can be understood as a place from which one can go in and out, as a journey that one can traverse, or as a battle. Giving patients awareness of the way they refer to cancer and different narratives in which they can reframe it has been proven to be a key aspect when experiencing the disease. In this work, we propose a preliminary identification and representation of Spanish cancer metaphors using MIP (Metaphor Identification Procedure) and MetaNet. The created resource is the first openly available dataset for medical metaphors in Spanish. Thus, in the future, we expect to use it as the gold standard in automatic metaphor processing tasks, which will also serve to further populate the resource and understand how cancer is experienced and narrated.

pdf abs
Generating Synthetic Documents with Clinical Keywords: A Privacy-Sensitive Methodology
Simon Meoni | Éric De la Clergerie | Théo Ryffel

Electronic Health Records store valuable patient-staff interaction data. These notes, often unstructured to save healthcare personnel time, can be challenging to analyze manually. Proprietary online Large Language Models have demonstrated impressive results in analyzing EHR notes. However, Clinical NLP faces unique challenges due to the sensitive and specialized nature of the data. Sending patient information via external APIs poses privacy risks, and hospitals require customized NLP systems to align with their unique practices. To address these challenges, developing customized LLMs using specific training datasets is crucial. To this end, we propose generating synthetic training data using keywords extracted without confidential information. Furthermore, we introduce a reward mechanism that iteratively refines the quality of synthetic documents. This involves scoring synthetic candidates against real clinical reports using a semantic textual similarity score and performing an alignment step to align the model with its best-scored utterances.

Creating a certified conversational agent poses several issues. The need to manage fine-grained information delivery and the necessity to provide reliable medical information require a notable effort, especially in dataset preparation. In this paper, we investigate the challenges of building a certified medical chatbot in Italian that provides information about pregnancy and early childhood. We show some negative initial results regarding the possibility of creating a certified conversational agent within the RASA framework starting from unstructured data. Finally, we propose a modular RAG model to implement a Large Language Model in a certified context, overcoming data limitations and enabling data collection on actual conversations.

pdf abs
Towards Using Automatically Enhanced Knowledge Graphs to Aid Temporal Relation Extraction
Timotej Knez | Slavko Žitnik

Temporal relation extraction in medical document analysis is crucial for understanding patient histories and treatment outcomes. This paper introduces a novel approach that leverages a bimodal model integrating textual content and a knowledge graph to enhance temporal relation extraction. The paper presents ongoing research in constructing an optimal knowledge graph by augmenting PrimeKG with dynamically expanded information using a language model-generated knowledge graph, and further personalizing the information with patient-specific graphs tailored for relation prediction. The pipeline for constructing this enriched knowledge graph is detailed, aiming to improve the capabilities of temporal relation extraction models. The preliminary results show that adding a simple knowledge graph to the temporal relation extraction model can significantly increase the performance, achieving new state-of-the-art results. While the research in using enhanced knowledge graphs is still ongoing, this paper lays the groundwork for leveraging common knowledge to advance temporal relation extraction in medical contexts. This approach holds promise for enhancing the understanding of patient histories and treatment outcomes, potentially leading to improved healthcare decision-making and patient care.

Hospital discharge letters are a fundamental component of patient management, as they provide the crucial information needed for patient post-hospital care. However, their creation is very demanding and resource intensive, as it requires consultation of several reports documenting the patient’s journey throughout their hospital stay. Given the increasing pressures on doctors’ time, tools that can draft a reasonable discharge summary, to be then reviewed and finalized by the experts, would be welcome. In this paper, we present a comparative study exploring the possibility of automatic generation of discharge summaries within the context of a hospital in an Italian-speaking region and discuss quantitative and qualitative results. Despite some shortcomings, the obtained results show that a generic generative system such as ChatGPT is capable of producing discharge summaries which are relatively close to the human-generated ones, even in Italian.

pdf abs
Evaluating LLMs for Temporal Entity Extraction from Pediatric Clinical Text in Rare Diseases Context
Judith Jeyafreeda Andrew | Marc Vincent | Anita Burgun | Nicolas Garcelon

The aim of this work is to extract temporal entities from patients’ EHRs from a pediatric hospital specialising in rare diseases, thus allowing the creation of a patient timeline relative to diagnosis. We aim to perform an evaluation of NLP tools and Large Language Models (LLMs) to test their application in the field of clinical study, where data is limited and sensitive. We present a short annotation guideline for temporal entity identification. We then use the tool EDS-NLP, the language model CamemBERT-with-Dates and the LLM Vicuna to extract temporal entities. We perform experiments using three different prompting techniques on the LLM Vicuna to evaluate the model thoroughly. We use a small dataset of 50 EHRs describing the evolution of rare diseases in patients to perform our experiments. We show that among the different methods to prompt an LLM, using a decomposed prompting structure on the LLM Vicuna produces the best results for temporal entity recognition. The LLM learns from examples in the prompt, and decomposing one prompt into several prompts allows the model to avoid confusion between the different entity types. Identifying the temporal entities in EHRs helps to build the timeline of a patient and to learn the evolution of a disease. This is especially important in the case of rare diseases due to the limited availability of examples. In this paper, we show that this can be made possible with the use of language models and LLMs in a secure environment, thus preserving the privacy of the patient.

pdf abs
Generating Distributable Surrogate Corpus for Medical Multi-label Classification
Seiji Shimizu | Shuntaro Yada | Shoko Wakamiya | Eiji Aramaki

In medical and social media domains, annotated corpora are often hard to distribute due to copyright and privacy issues. To overcome this situation, we propose a new method to generate a surrogate corpus for a downstream task by using a text generation model. We chose a medical multi-label classification task, MedWeb, in which patient-generated short messages express multiple symptoms. We first fine-tuned text generation models with different prompting designs on the original corpus to obtain synthetic versions of that corpus. To assess the viability of the generated corpora for the downstream task, we compared the performance of multi-label classification models trained either on the original or the surrogate corpora. The results and the error analysis showed the difficulty of generating a surrogate corpus in multi-label settings, suggesting text generation under complex conditions is not trivial. On the other hand, our experiment demonstrates that the corpus generated with sentinel-based prompting is comparatively viable in a single-label (multiclass) classification setting.

pdf abs
CliniRes: Publicly Available Mapping of Clinical Lexical Resources
Elena Zotova | Montse Cuadros | German Rigau

This paper presents a human-readable resource for mapping identifiers from various clinical knowledge bases. This resource is a version of the UMLS Metathesaurus enriched with WordNet 3.0 and 3.1 synsets, Wikidata items with their clinical identifiers, a SNOMED CT to ICD-10 mapping and Spanish ICD-10 code descriptions. The main goal of the presented resource is to provide semantic interoperability across the clinical concepts from various knowledge bases and facilitate its integration into mapping tools. As a side effect, the mapping enriches already annotated medical corpora for entity recognition or entity linking tasks with new labels. We experiment with an entity linking task, using a corpus annotated both manually and with the mapping method, and demonstrate that a semi-automatic way of annotation may be used to create new labels. The resource is available in English and Spanish, although all languages of UMLS may be extracted. The new lexical resource is publicly available.

This article presents MedDialog-FR, a large publicly available corpus of French medical conversations for the medical domain. Motivated by the lack of French dialogue corpora for data-driven dialogue systems and the paucity of available information related to women’s intimate health, we introduce an annotated corpus of question-and-answer dialogues between a real patient and a real doctor concerning women’s intimate health. The corpus is composed of about 20,000 dialogues automatically translated from the English version of MedDialog-EN. The corpus test set is composed of 1,400 dialogues that have been manually post-edited and annotated with 22 categories from the UMLS ontology. We also fine-tuned state-of-the-art reference models to automatically perform multi-label classification and response generation to give an initial performance benchmark and highlight the difficulty of the tasks.

Mental health peer support forums have become widely used in recent years. The emerging mental health crisis and the COVID-19 pandemic have meant that finding a place online for support and advice when dealing with mental health issues is more critical than ever. The need to examine, understand and find ways to improve the support provided by mental health forums is vital in the current climate. As part of this, we present our initial explorations in using modern transformer models to detect four key concepts (connectedness, lived experience, empathy and gratitude), which we believe are essential to understanding how people use mental health forums and will serve as a basis for testing more expansive realist theories about mental health forums in the future. As part of this work, we also replicate previously published results on empathy utilising an existing annotated dataset and test the other concepts on our manually annotated mental health forum posts dataset. These results serve as a basis for future research examining peer support forums.

pdf abs
Revisiting the MIMIC-IV Benchmark: Experiments Using Language Models for Electronic Health Records
Jesus Lovon-Melgarejo | Thouria Ben-Haddi | Jules Di Scala | Jose G. Moreno | Lynda Tamine

The lack of standardized evaluation benchmarks in the medical domain for text inputs can be a barrier to widely adopting and leveraging the potential of natural language models for health-related downstream tasks. This paper revisits an openly available MIMIC-IV benchmark for electronic health records (EHRs) to address this issue. First, we integrate the MIMIC-IV data within the Hugging Face datasets library to allow easy sharing and use of this collection. Second, we investigate the application of templates to convert EHR tabular data to text. Experiments using fine-tuned and zero-shot LLMs on the patient mortality task show that fine-tuned text-based models are competitive against robust tabular classifiers. In contrast, zero-shot LLMs struggle to leverage EHR representations. This study underlines the potential of text-based approaches in the medical field and highlights areas for further improvement.

pdf abs
Unraveling Clinical Insights: A Lightweight and Interpretable Approach for Multimodal and Multilingual Knowledge Integration
Kanimozhi Uma | Marie-Francine Moens

In recent years, the analysis of clinical texts has evolved significantly, driven by the emergence of BERT-like language models such as PubMedBERT and ClinicalBERT, which have been tailored for the (bio)medical domain and rely on extensive archives of medical documents. While they boast high accuracy, their lack of interpretability and language transfer limitations restrict their clinical utility. To address this, we propose a new, lightweight graph-based embedding method designed specifically for radiology reports. This approach considers the report’s structure and content, connecting medical terms through the multilingual SNOMED Clinical Terms knowledge base. The resulting graph embedding reveals intricate relationships among clinical terms, enhancing both clinician comprehension and clinical accuracy without the need for large pre-training datasets. Demonstrating the versatility of our method, we apply this embedding to two tasks: disease and image classification in X-ray reports. In disease classification, our model competes effectively with BERT-based approaches, yet it is significantly smaller and requires less training data. Additionally, in image classification, we illustrate the efficacy of the graph embedding by leveraging cross-modal knowledge transfer, highlighting its applicability across diverse languages.

In this research, we propose a framework to automatically generate human-like question-answer pairs with long or factoid answers and, based on them, automatically evaluate the quality of Retrieval-Augmented Generation (RAG). Our framework can also create datasets that assess hallucination levels of Large Language Models (LLMs) by simulating unanswerable questions. We then apply the framework to create a dataset of question-answer (QA) pairs based on more than 1,000 leaflets about the medical and administrative procedures of a hospital. The dataset was evaluated by hospital specialists, who confirmed that more than 50% of the QA pairs are applicable. Finally, we show that our framework can be used to evaluate LLM performance by using Llama-2-13B fine-tuned in Dutch (Vanroy, 2023) with the generated dataset, and show that the method’s use in testing models with regard to answering unanswerable and factoid questions appears promising.

Many people in the US use more than one language at home, yet English remains the dominant (L1) language in US society, which can complicate medical encounters. In this study we ask in what ways effective communication can be ensured in health care settings when speakers differ in language proficiency. One strategy people use is second language (L2) speech accommodation, which is characterized by slowed speech, less complex words, and clearer enunciation. We employ a mixed-reality platform called MURSION to document how a group of Physician Assistant students use speech accommodation during a healthcare encounter. MURSION is a computer-based virtual environment where participants interact with an Avatar controlled by a human interactor in a standardized environment. We record 5-minute interactions between the student and a high or low English proficiency Avatar. Our analyses evaluate lexical choices in L1-L2 interactions with SCOPE (South Carolina Psycholinguistic Metabase) and acoustic properties with PRAAT. Results show that clinical students use slower speech and high frequency words when speaking to a low proficiency virtual patient, indicating a sensitivity for the communicative needs of L2 English users. Speech accommodation results will contribute to communication training modules for clinicians to interact efficiently with linguistically diverse populations.

pdf abs
Enhancing Consumer Health Question Reformulation: Chain-of-Thought Prompting Integrating Focus, Type, and User Knowledge Level
Jooyeon Lee | Luan Huy Pham | Özlem Uzuner

In this paper, we explore consumer health question (CHQ) reformulation, focusing on enhancing the quality of question reformulation without considering interest shifts. Our study introduces the use of the NIH GARD website as a gold standard dataset for this specific task, emphasizing its relevance and applicability. Additionally, we developed other datasets consisting of related questions scraped from Google, Bing, and Yahoo. We augmented, evaluated and analyzed the various datasets, demonstrating that the reformulation task closely resembles the question entailment generation task. Our approach, which integrates the Focus and Type of consumer inquiries, represents a significant advancement in the field of question reformulation. We provide a comprehensive analysis of different methodologies, offering insights into the development of more effective and user-centric AI systems for consumer health support.

pdf abs
Exploring the Challenges of Behaviour Change Language Classification: A Study on Semi-Supervised Learning and the Impact of Pseudo-Labelled Data
Selina Meyer | Marcos Fernandez-Pichel | David Elsweiler | David E. Losada

Automatic classification of behaviour change language can enhance conversational agents’ capabilities to adjust their behaviour based on users’ current situations and to encourage individuals to make positive changes. However, the lack of annotated language data of change-seekers hampers the performance of existing classifiers. In this study, we investigate the use of semi-supervised learning (SSL) to classify highly imbalanced texts around behaviour change. We assess the impact of including pseudo-labelled data from various sources and examine the balance between the amount of added pseudo-labelled data and the strictness of the inclusion criteria. Our findings indicate that while adding pseudo-labelled samples to the training data has limited classification impact, it does not significantly reduce performance regardless of the source of these new samples. This reinforces previous findings on the feasibility of applying classifiers trained on behaviour change language to diverse contexts.
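
The trade-off between the amount of pseudo-labelled data and the strictness of the inclusion criteria can be illustrated with a minimal sketch. It assumes that inclusion is decided by a classifier confidence threshold, which the abstract does not state; the function name `select_pseudo_labels`, the example texts, the labels, and the confidence values are all hypothetical.

```python
def select_pseudo_labels(predictions, threshold):
    """Keep (text, label) pairs whose confidence meets the threshold.

    predictions: list of (text, predicted_label, confidence) triples.
    The confidence threshold as inclusion criterion is an assumption.
    """
    return [(text, label) for text, label, conf in predictions if conf >= threshold]


# Hypothetical classifier outputs on unlabelled texts.
predictions = [
    ("I want to cut down on sugar", "change", 0.95),
    ("I went for a walk today", "neutral", 0.80),
    ("Not sure I can keep this up", "sustain", 0.55),
]

lenient = select_pseudo_labels(predictions, threshold=0.5)
strict = select_pseudo_labels(predictions, threshold=0.9)
```

A stricter threshold admits fewer, presumably cleaner, pseudo-labelled samples; a lenient one admits more samples at the cost of more label noise.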

pdf abs
Development of a Benchmark Corpus for Medical Device Adverse Event Detection
Susmitha Wunnava | David A. Harris | Florence T. Bourgeois | Timothy A. Miller

The U.S. Food and Drug Administration (FDA) collects real-world adverse events, including device-associated deaths, injuries, and malfunctions, through passive reporting to the agency’s Manufacturer and User Facility Device Experience (MAUDE) database. However, this system’s full potential remains untapped given the extensive use of unstructured text in medical device adverse event reports and lack of FDA resources and expertise to properly analyze all available data. In this work, we focus on addressing this limitation through the development of an annotated benchmark corpus to support the design and development of state-of-the-art NLP approaches towards automatic extraction of device-related adverse event information from FDA Medical Device Adverse Event Reports. We develop a dataset of labeled medical device reports from a diverse set of high-risk device types, that can be used for supervised machine learning. We develop annotation guidelines and manually annotate for nine entity types. The resulting dataset contains 935 annotated adverse event reports, containing 12252 annotated spans across the nine entity types. The dataset developed in this work will be made publicly available upon publication.

pdf abs
Using BART to Automatically Generate Discharge Summaries from Swedish Clinical Text
Nils Berg | Hercules Dalianis

Documentation is a regular part of contemporary healthcare practices and one such documentation task is the creation of a discharge summary, which summarizes a care episode. However, manually writing discharge summaries is a time-consuming task, and research has shown that discharge summaries are often lacking quality in various respects. To alleviate this problem, text summarization methods could be applied to text from electronic health records, such as patient notes, to automatically create a discharge summary. Previous research has been conducted on this topic on text in various languages and with various methods, but no such research has been conducted on Swedish text. In this paper, four datasets extracted from a Swedish clinical corpus were used to fine-tune four BART language models to perform the task of summarizing Swedish patient notes into a discharge summary. Out of these models, the best performing model was manually evaluated by a senior, now retired, nurse and clinical coder. The evaluation results show that the best performing model produces discharge summaries of overall low quality. This is possibly due to issues in the data extracted from the Health Bank research infrastructure, which warrants further work on this topic.

pdf abs
Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus
Fons Hartendorp | Tom Seinen | Erik van Mulligen | Suzan Verberne

Biomedical entity linking, a main component in automatic information extraction from health-related texts, plays a pivotal role in connecting textual entities (such as diseases, drugs and body parts mentioned by patients) to their corresponding concepts in a structured biomedical knowledge base. The task remains challenging despite recent developments in natural language processing. This report presents the first evaluated biomedical entity linking model for the Dutch language. We use MedRoBERTa.nl as the base model and perform second-phase pretraining through self-alignment on a Dutch biomedical ontology extracted from the UMLS and Dutch SNOMED. We derive a corpus from Wikipedia of ontology-linked Dutch biomedical entities in context and fine-tune our model on this dataset. We evaluate our model on the Dutch portion of the Mantra GSC-corpus and achieve 54.7% classification accuracy and 69.8% 1-distance accuracy. We then perform a case study on a collection of unlabeled, patient-support forum data and show that our model is hampered by the limited quality of the preceding entity recognition step. Manual evaluation of a small sample indicates that of the correctly extracted entities, around 65% are linked to the correct concept in the ontology. Our results indicate that biomedical entity linking in a language other than English remains challenging, but our Dutch model can be used for high-level analysis of patient-generated text.

pdf abs
Unveiling Voices: Identification of Concerns in a Social Media Breast Cancer Cohort via Natural Language Processing
Swati Rajwal | Avinash Kumar Pandey | Zhishuo Han | Abeed Sarker

We leveraged a dataset of ∼1.5 million Twitter (now X) posts to develop a framework for analyzing breast cancer (BC) patients’ concerns and possible reasons for treatment discontinuation. Our primary objectives were threefold: (1) to curate and collect data from a BC cohort; (2) to identify topics related to uncertainty/concerns in BC-related posts; and (3) to conduct a sentiment intensity analysis of posts to identify and analyze negatively polarized posts. RoBERTa outperformed other models with a micro-averaged F1 score of 0.894 and a macro-averaged F1 score of 0.853 for (1). For (2), we used GPT-4 and BERTopic, and qualitatively analyzed posts under relevant topics. For (3), sentiment intensity analysis of posts followed by qualitative analyses shed light on potential reasons behind treatment discontinuation. Our work demonstrates the utility of social media mining to discover BC patient concerns. Information derived from the cohort data may help design strategies in the future for increasing treatment compliance.

pdf abs
Intent Detection and Entity Extraction from Biomedical Literature
Ankan Mullick | Mukur Gupta | Pawan Goyal

Biomedical queries have become increasingly prevalent in web searches, reflecting the growing interest in accessing biomedical literature. Despite recent research on large language models (LLMs) motivated by endeavors to attain generalized intelligence, their efficacy in replacing task and domain-specific natural language understanding approaches remains questionable. In this paper, we address this question by conducting a comprehensive empirical evaluation of intent detection and named entity recognition (NER) tasks from biomedical text. We show that Supervised Fine Tuned approaches are still relevant and more effective than general-purpose LLMs. Biomedical transformer models such as PubMedBERT can surpass ChatGPT on the NER task with only 5 supervised examples.

pdf (full)
bib (full) Proceedings of the Workshop on Cognitive Aspects of the Lexicon @ LREC-COLING 2024

pdf bib
Proceedings of the Workshop on Cognitive Aspects of the Lexicon @ LREC-COLING 2024
Michael Zock | Emmanuele Chersoni | Yu-Yin Hsu | Simon de Deyne

pdf bib abs
CLAVELL - Cognitive Linguistic Annotation and Visualization Environment for Language Learning
Werner Winiwarter

In this paper we introduce a novel sentence annotation based on radical construction grammar and Uniform Meaning Representation, which covers all levels of linguistic analysis, from interlinear morphemic glossing to PropBank rolesets, WordNet synsets, and Wikipedia page titles as concept identifiers. We visually enhance our annotation by using images to represent concepts, emojis for thematic roles, and color-coding for constructions. The meaning representation is embedded into the syntactic parse by aligning all concepts with the surface tokens in the sentence. The main motivation for developing this type of representation was its use in second language acquisition as part of a Web-based language learning environment. In entertaining and engaging annotation tasks, language students assemble the representation step-by-step following a bottom-up strategy. Based on language exposure while performing these exercises, we populate personal idiolectal constructicons representing the students’ current status of second language comprehension. As a first use case, we have implemented a solution for Japanese due to its soaring popularity in our language education program and the particular challenges involved with trying to master this language.

pdf bib abs
Individual Text Corpora Predict Openness, Interests, Knowledge and Level of Education
Markus J. Hofmann | Markus T. Jansen | Christoph Wigbels | Benny Briesemeister | Arthur M. Jacobs

Here we examine whether the personality dimension of openness to experience can be predicted from the individual Google search history. By web scraping, individual text corpora (ICs) were generated from 214 participants with a mean number of 5 million word tokens. We trained word2vec models and used the similarities of each IC to label words, which were derived from a lexical approach of personality. These IC-label-word similarities were utilized as predictive features in neural models. For training and validation, we relied on 179 participants and held out a test sample of 35 participants. A grid search with varying number of predictive features, hidden units and boost factor was performed. As model selection criterion, we used R² in the validation samples penalized by the absolute R² difference between training and validation. The selected neural model explained 35% of the openness variance in the test sample, while an ensemble model with the same architecture often provided slightly more stable predictions for intellectual interests, knowledge in humanities and level of education. Finally, a learning curve analysis suggested that around 500 training participants are required for generalizable predictions. We discuss ICs as a complement or replacement of survey-based psychodiagnostics.
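
The model selection criterion described above can be sketched as follows. The abstract does not give the exact form of the penalty, so subtracting the absolute training/validation gap with unit weight is an assumption, and the function names and candidate scores are hypothetical.

```python
def penalized_r2(r2_train: float, r2_val: float) -> float:
    """Validation R^2 minus the absolute train/validation R^2 gap.

    Assumption: the penalty is subtracted with weight 1.
    """
    return r2_val - abs(r2_train - r2_val)


def select_model(candidates):
    """Return the name of the candidate with the highest penalized score.

    candidates: dict mapping a model name to (r2_train, r2_val).
    """
    return max(candidates, key=lambda name: penalized_r2(*candidates[name]))


# Hypothetical grid-search results.
candidates = {
    "small": (0.40, 0.35),  # modest fit, small gap
    "large": (0.90, 0.45),  # better validation fit, but strongly overfitted
}
best = select_model(candidates)
```

Under this reading, a model with a slightly lower validation score can still be preferred when its training and validation scores agree closely, which discourages selecting overfitted configurations.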

pdf abs
An Empirical Study on Vague Deictic Temporal Adverbials
Svenja Kenneweg | Brendan Balcerak Jackson | Joerg Deigmoeller | Julian Eggert | Philipp Cimiano

Temporal adverbial phrases such as recently and some time ago have a special function in communication and temporal cognition. These adverbials are deictic, in that their meaning is tied to their time of utterance; and they are vague, in that the time periods to which they apply are under-specified in comparison to expressions such as yesterday, which precisely indicates the day before the day of utterance. Despite their vagueness, conversational participants have a mental image of when events described using these adverbials take place. We present a study that aims to quantify this mental model in terms of fuzzy or graded membership. To achieve this, we investigated the four English temporal adverbials recently, just, some time ago and long time ago as applied to types of events with different durations and frequencies, by conducting surveys to measure how speakers judge the different adverbials to apply in different time ranges. Our results suggest that it is possible to represent the meanings of deictic vague temporal adverbials geometrically in terms of graded membership within a temporal conceptual space.

pdf abs
Symbolic Learning of Rules for Semantic Relation Types Identification in French Genitive Postnominal Prepositional Phrases
Hani Guenoune | Mathieu Lafourcade

We are interested in the semantic relations conveyed by polylexical entities in postnominal prepositional noun phrases of the form “A de B” (A of B). After identifying a relevant set of semantic relation types, we proceed, using generative AI, to build a collection of phrases for each semantic relation type identified. We propose an algorithm for creating rules that allow the selection of the relation between A and B in noun phrases of each type. These rules correspond to selecting from a knowledge base the appropriate neighborhood of a given term. For the phrase “désert d’Algérie” carrying the location relation, the term “désert” is identified as a geographical location, and “Algérie” as a country. These constraints are used to automatically learn a set of rules for selecting the location relation for this type of example. Rules are not exclusive as there may be instances that fall under multiple relations. In the phrase “portrait de sa mère - the portrait of his/her mother”, the depiction, possession, and producer types are all possible matches.

pdf abs
How Human-Like Are Word Associations in Generative Models? An Experiment in Slovene
Špela Vintar | Mojca Brglez | Aleš Žagar

Large language models (LLMs) show extraordinary performance in a broad range of cognitive tasks, yet their capability to reproduce human semantic similarity judgements remains disputed. We report an experiment in which we fine-tune two LLMs for Slovene, a monolingual SloT5 and a multilingual mT5, as well as an mT5 for English, to generate word associations. The models are fine-tuned on human word association norms created within the Small World of Words project, which recently started to collect data for Slovene. Since our aim was to explore differences between human and model-generated outputs, the model parameters were minimally adjusted to fit the association task. We perform automatic evaluation using a set of methods to measure the overlap and ranking, and in addition a subset of human and model-generated responses were manually classified into four categories (meaning-, position- and form-based, and erratic). Results show that human-machine overlap is very small, but that the models produce a similar distribution of association categories as humans.

pdf abs
Idiom Complexity in Apple-Pie Order: The Disentanglement of Decomposability and Transparency
Irene Pagliai

Both decomposability and transparency investigate the interplay between literality and figurativity in idioms. For this reason, they have often been merged. This study argues that idiom decomposability and transparency are related but conceptually different constructs, thus advocating for their distinction. Leveraging a normed lexicon of Italian and English idioms, the respective effects of decomposability and transparency on idiom meaning recognition are explored via statistical modeling. Results show the two variables contribute differently to idiom meaning recognition in the two languages, while the absence of collinearity underscores their distinct contributions. Based on this empirical evidence, the study finally proposes FrameNet and MetaNet as computational tools for modeling idiom decomposability and transparency. This study thus not only substantiates the separation of idiom decomposability and transparency, but also sets a foundation for future interdisciplinary research to bridge the gap in idiom research between empirical psycholinguistics, cognitive linguistics and computational applications.

pdf abs
What GPT-4 Knows about Aspectual Coercion: Focused on “Begin the Book”
Seohyun Im | Chungmin Lee

This paper explores whether Pre-trained Large Language Models (PLLMs) like GPT-4 can grasp profound linguistic insights into language phenomena such as Aspectual Coercion through interaction with Microsoft’s Copilot, which integrates GPT-4. Firstly, we examined Copilot’s understanding of the co-occurrence constraints of the aspectual verb “begin” and the complex-type noun “book” using the classic illustration of Aspectual Coercion, “begin the book.” Secondly, we verified Copilot’s awareness of both the default interpretation of “begin the book” with no specific context and the contextually preferred interpretation. Ultimately, Copilot provided appropriate responses regarding potential interpretations of “begin the book” based on its distributional properties and context-dependent preferred interpretations. However, it did not furnish sophisticated explanations concerning these interpretations from a linguistic theoretical perspective. On the other hand, by offering diverse interpretations grounded in distributional properties, language models like GPT-4 demonstrated their potential contribution to the refinement of linguistic theories. Furthermore, we suggested the feasibility of employing Language Models to construct language resources associated with language phenomena including Aspectual Coercion.

pdf abs
Can GPT-4 Recover Latent Semantic Relational Information from Word Associations? A Detailed Analysis of Agreement with Human-annotated Semantic Ontologies.
Simon De Deyne | Chunhua Liu | Lea Frermann

Word associations, i.e., spontaneous responses to a cue word, provide not only a window into the human mental lexicon but have also been shown to be a repository of common-sense knowledge and can underpin efforts in lexicography and the construction of dictionaries. Especially the latter tasks require knowledge about the relations underlying the associations (e.g., Taxonomic vs. Situational); however, to date, there is neither an established ontology of relations nor an effective labelling paradigm. Here, we test GPT-4’s ability to infer semantic relations for human-produced word associations. We use four human-labelled data sets of word associations and semantic features, with differing relation inventories and various levels of annotator agreement. We directly prompt GPT-4 with detailed relation definitions without further fine-tuning or training. Our results show that while GPT-4 provided a good account of higher-level classifications (e.g. Taxonomic vs Situational), prompting instructions alone cannot obtain similar performance for detailed classifications (e.g. superordinate, subordinate or coordinate relations) despite high agreement among human annotators. This suggests that latent relations can at least be partially recovered from word associations and highlights ways in which LLMs could be improved and human annotation protocols could be adapted to reduce coding ambiguity.

pdf abs
What’s in a Name? Electrophysiological Differences in Processing Proper Nouns in Mandarin Chinese
Bernard A. J. Jap | Yu-Yin Hsu | Lavinia Salicchi | Yu Xi Li

The current study examines how proper names and common nouns in Chinese are cognitively processed during sentence comprehension. EEG data were recorded while participants were presented with neutral contexts followed by either a proper name or a common noun. Proper names in Chinese often consist of characters that can function independently as words or be combined with other characters to form words, potentially benefiting from the semantic features carried by each character. Using cluster-based permutation tests, we found a larger N400 for common nouns when compared to proper names. Our results suggest that the semantics of characters do play a role in facilitating the processing of proper names. This is consistent with previous behavioral findings on noun processing in Chinese, indicating that common nouns require more cognitive resources to process than proper names. Moreover, our results suggest that proper names are processed differently in alphabetic languages and in Chinese.

pdf abs
Cross-Linguistic Processing of Non-Compositional Expressions in Slavic Languages
Iuliia Zaitova | Irina Stenger | Muhammad Umer Butt | Tania Avgustinova

This study focuses on evaluating and predicting the intelligibility of non-compositional expressions within the context of five closely related Slavic languages: Belarusian, Bulgarian, Czech, Polish, and Ukrainian, as perceived by native speakers of Russian. Our investigation employs a web-based experiment where native Russian respondents take part in free-response and multiple-choice translation tasks. Based on the previous studies in mutual intelligibility and non-compositionality, we propose two predictive factors for reading comprehension of unknown but closely related languages: 1) linguistic distances, which include orthographic and phonological distances; 2) surprisal scores obtained from monolingual Language Models (LMs). Our primary objective is to explore the relationship of these two factors with the intelligibility scores and response times of our web-based experiment. Our findings reveal that, while intelligibility scores from the experimental tasks exhibit a stronger correlation with phonological distances, LM surprisal scores appear to be better predictors of the time participants invest in completing the translation tasks.

pdf abs
Using Language Models to Unravel Semantic Development in Children’s Use of Perception Verbs
Bram van Dijk | Max J. van Duijn | Li Kloostra | Marco Spruit | Barend Beekhuizen

In this short paper we employ a Language Model (LM) to gain insight into how complex semantics of a Perception Verb (PV) emerge in children. Using a Dutch LM as representation of mature language use, we find that for all ages 1) the LM accurately predicts PV use in children’s freely-told narratives; 2) children’s PV use is close to mature use; 3) complex PV meanings with attentional and cognitive aspects can be found. Our approach illustrates how LMs can be meaningfully employed in studying language development, hence takes a constructive position in the debate on the relevance of LMs in this context.

pdf abs
Representing Abstract Concepts with Images: An Investigation with Large Language Models
Ludovica Cerini | Alessandro Bondielli | Alessandro Lenci

Multimodal metaphorical interpretation of abstract concepts has always been a debated problem in many research fields, including cognitive linguistics and NLP. With the dramatic improvements of Large Language Models (LLMs) and the increasing attention toward multimodal Vision-Language Models (VLMs), there has been pronounced attention on the conceptualization of abstracts. Nevertheless, a systematic scientific investigation is still lacking. This work introduces a framework designed to shed light on the indirect grounding mechanisms that anchor the meaning of abstract concepts to concrete situations (e.g. ability - a person skating), following the idea that abstracts acquire meaning from embodied and situated simulation. We assess human and LLM performance with a situation generation task. Moreover, we assess the figurative richness of images depicting concrete scenarios, via a text-to-image retrieval task performed on LAION-400M.

pdf abs
Big-Five Backstage: A Dramatic Dataset for Characters Personality Traits & Gender Analysis
Marina Tiuleneva | Vadim A. Porvatov | Carlo Strapparava

This paper introduces a novel textual dataset comprising fictional characters’ lines with annotations based on their gender and Big-Five personality traits. Using psycholinguistic findings, we compared texts attributed to fictional characters and real people with respect to their genders and personality traits. Our results indicate that imagined personae mirror most of the language categories observed in real people while demonstrating them in a more expressive manner.

pdf abs
Interaction of Semantics and Morphology in Russian Word Vectors
Yulia Zinova | Ruben van de Vijver | Anastasia Yablokova

In this paper we explore how morphological information can be extracted from fastText embeddings for Russian nouns. We investigate the negative effects of syncretism and propose ways of modifying the vectors that can help to find better representations for morphological functions and thus for out of vocabulary words. In particular, we look at the effect of analysing shift vectors instead of original vectors, discuss various possibilities of finding base forms to create shift vectors, and show that using only the high frequency data is beneficial when looking for structure with respect to the morphosyntactic functions in the embeddings.

pdf abs
Listen, Repeat, Decide: Investigating Pronunciation Variation in Spoken Word Recognition among Russian Speakers
Vladislav Ivanovich Zubov | Elena Riekhakaynen

Variability is one of the important features of natural speech and a challenge for spoken word recognition models and automatic speech recognition systems. We conducted two preliminary experiments aimed at finding out whether native Russian speakers regard certain types of pronunciation variation differently when the variants are equally possible according to orthoepic norms. In the first experiment, the participants had to repeat the words with three different types of pronunciation variability. In the second experiment, we focused on the assessment of words with variable and only one standard stress. Our results support the hypothesis that listeners pay the most attention to words with variable stress, less to the variability of soft and hard consonants, and even less to the presence / absence of /j/. Assessing the correct pronunciation of words with variable stress takes significantly more time than assessing words which have only one correct pronunciation variant. These preliminary results show that pronunciation variants can provide new evidence on how a listener accesses the mental lexicon during natural speech processing and chooses among the variants stored in it.

pdf abs
The Mental Lexicon of Communicative Fragments and Contours: The Remix N-gram Method
Emese K. Molnár | Andrea Dömötör

The classical mental lexicon models represented the lexicon as a list of words. Usage-based models describe the mental lexicon more dynamically, but they do not capture the real-time operation of speech production. In the linguistic model of Boris Gasparov, the notions of communicative fragment and contour can provide a comprehensive description of the diversity of linguistic experience. Fragments and contours form larger linguistic structures than words and they are recognized as a whole unit by speakers through their communicative profile. Fragments are prefabricated units that can be added to or merged with each other during speech production. The contours serve as templates for the utterances by combining specific and abstract linguistic elements. Based on this theoretical framework, our tool applies remix n-grams (combination of word forms, lemmas and POS-tags) to identify similar linguistic structures in different texts that form the basic units of the mental lexicon.
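
One plausible reading of remix n-grams can be sketched as below: every position in an n-gram window is represented by either its word form, its lemma, or its POS tag, and all such mixtures are enumerated. The abstract does not define the procedure in detail, so this enumeration, the function name `remix_ngrams`, and the toy annotation are assumptions for illustration only.

```python
from itertools import product


def remix_ngrams(tokens, n):
    """Enumerate mixed-level n-grams over annotated tokens.

    tokens: list of (form, lemma, pos) triples.
    Each n-gram position is filled with one of the three levels;
    duplicates (e.g. when form == lemma) are collapsed by the set.
    """
    result = set()
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # product(*window) picks one annotation level per position.
        for combo in product(*window):
            result.add(combo)
    return result


# Hypothetical annotated two-token utterance.
tokens = [("dogs", "dog", "NOUN"), ("bark", "bark", "VERB")]
bigrams = remix_ngrams(tokens, 2)
```

Patterns such as `("dog", "VERB")` combine a specific lemma with an abstract category, which is the kind of partly specific, partly abstract template the abstract attributes to contours.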

pdf abs
Three Studies on Predicting Word Concreteness with Embedding Vectors
Michael Flor

Human-assigned concreteness ratings for words are commonly used in psycholinguistic and computational linguistic studies. Previous research has shown that such ratings can be modeled and extrapolated by using dense word-embedding representations. However, due to rater disagreement, considerable amounts of human ratings in published datasets are not reliable. We investigate how such unreliable data influences modeling of concreteness with word embeddings. Study 1 compares fourteen embedding models over three datasets of concreteness ratings, showing that most models achieve high correlations with human ratings, and exhibit low error rates on predictions. Study 2 investigates how exclusion of the less reliable ratings influences the modeling results. It indicates that improved results can be achieved when data is cleaned. Study 3 adds additional conditions over those of study 2 and indicates that the improved results hold only for the cleaned data, and that in the general case removing the less reliable data points is not useful.

pdf abs
Combining Neo-Structuralist and Cognitive Approaches to Semantics to Build Wordnets for Ancient Languages: Challenges and Perspectives
Erica Biagetti | Martina Giuliani | Silvia Zampetta | Silvia Luraghi | Chiara Zanchi

This paper addresses challenges encountered in constructing lexical databases, specifically WordNets, for three ancient Indo-European languages: Ancient Greek, Latin, and Sanskrit. The difficulties partly arise from adapting concepts and methodologies designed for modern languages to the construction of lexical resources for ancient ones. A further significant challenge arises from the goal of creating WordNets that not only adhere to a neo-structuralist relational view of meaning but also integrate Cognitive Semantics concepts, aiming for a more realistic representation of meaning. This integration is crucial for facilitating studies in diachronic semantics and lexicology, and representing meaning in such a nuanced manner becomes paramount when constructing language resources for theoretical research, rather than for applied tasks, as is the case with lexical resources for ancient languages. The paper delves into these challenges through a case study focused on the TEMPERATURE conceptual domain in the three languages. It outlines difficulties in distinguishing prototypical and non-prototypical senses, literal and non-literal ones, and, within non-literal meanings, between metaphorical and metonymic ones. Solutions adopted to address these challenges are presented, highlighting the necessity of achieving maximum granularity in meaning representation while maintaining a sustainable workflow for annotators.

pdf abs
SensoryT5: Infusing Sensorimotor Norms into T5 for Enhanced Fine-grained Emotion Classification
Yuhan Xia | Qingqing Zhao | Yunfei Long | Ge Xu | Jia Wang

Sensory perception and emotion classification have traditionally been considered separate domains. Yet, the significant influence of sensory experiences on emotional responses is undeniable. The natural language processing (NLP) community has often missed the opportunity to merge sensory knowledge with emotion classification. To address this gap, we propose SensoryT5, a neurocognitive approach that integrates sensory information into the T5 (Text-to-Text Transfer Transformer) model, designed specifically for fine-grained emotion classification. This methodology incorporates sensory cues into the T5’s attention mechanism, enabling a harmonious balance between contextual understanding and sensory awareness. The resulting model amplifies the richness of emotional representations. In rigorous tests across various detailed emotion classification datasets, SensoryT5 showcases improved performance, surpassing both the foundational T5 model and current state-of-the-art works. Notably, SensoryT5’s success signifies a pivotal change in the NLP domain, highlighting the potential influence of neurocognitive data in refining machine learning models’ emotional sensitivity.

pdf (full)
bib (full) Proceedings of the First Workshop on Language-driven Deliberation Technology (DELITE) @ LREC-COLING 2024

Measuring the quality of contributions in political online discussions is crucial in deliberation research and computer science. Research has identified various indicators to assess online discussion quality, and with deep learning advancements, automating these measures has become feasible. While some studies focus on analyzing specific quality indicators, a comprehensive quality score incorporating various deliberative aspects is often preferred. In this work, we introduce AQuA, an additive score that calculates a unified deliberative quality score from multiple indices for each discussion post. Unlike other singular scores, AQuA preserves information on the deliberative aspects present in comments, enhancing model transparency. We develop adapter models for 20 deliberative indices, and calculate correlation coefficients between experts’ annotations and the perceived deliberativeness by non-experts to weigh the individual indices into a single deliberative score. We demonstrate that the AQuA score can be computed easily from pre-trained adapters and aligns well with annotations on other datasets that have not been seen during training. The analysis of experts’ vs. non-experts’ annotations confirms theoretical findings in the social science literature.
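
The idea of an additive score built from weighted indices can be sketched as a plain weighted sum. This is only an illustration of the general form: the index names, the predicted values, and the weights below are hypothetical, and the actual indices and correlation-derived weights are those reported in the paper, not these.

```python
def additive_quality_score(index_predictions, weights):
    """Weighted sum of per-index predictions for one discussion post.

    index_predictions: dict mapping an index name to a predicted value.
    weights: dict mapping the same index names to signed weights
             (assumed here to come from correlation coefficients).
    """
    return sum(weights[name] * value for name, value in index_predictions.items())


# Hypothetical predictions for one comment and hypothetical weights.
predictions = {"justification": 1.0, "sarcasm": 1.0}
weights = {"justification": 0.5, "sarcasm": -0.25}

score = additive_quality_score(predictions, weights)
```

Because the score is a sum of separately predicted terms, the contribution of each deliberative aspect remains inspectable, which is the transparency property the abstract highlights.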

pdf bib abs
A Unified LLM-KG Framework to Assist Fact-Checking in Public Deliberation
Nikolaos Giarelis | Charalampos Mastrokostas | Nikos Karacapilidis

Fact-checking plays a crucial role in public deliberation by promoting transparency, accuracy, credibility, and accountability. Aiming to augment the efficiency and adoption of current public deliberation platforms, which mostly rely on the abilities of participants to meaningfully process and interpret the associated content, this paper explores the combination of deep learning and symbolic reasoning. Specifically, it proposes a framework that unifies the capabilities of Large Language Models (LLMs) and Knowledge Graphs (KGs), and reports on an experimental evaluation. This evaluation is conducted through a questionnaire asking users to assess a baseline LLM against the proposed framework, using a series of fact-checking metrics, namely readability, coverage, non-redundancy, and quality. The experimentation results are promising and confirm the potential of combining the capabilities of these two technologies in the context of public deliberation and digital democracy.

pdf abs
Can Text Simplification Help to Increase the Acceptance of E-participation?
Regina Stodden | Phillip Nguyen

This study investigated the effect of text simplification (with and without artificial intelligence support) and the role of participants (author or reader) on the acceptance of e-participation processes. To this end, a near-realistic experimental study with 276 participants was conducted, simulating a participatory budgeting process. The results of our study show, on the one hand, that text simplification and the role of participants have no direct influence on the intention to use e-participation. Although a higher level of participation cannot be achieved by text simplification, our results also show that no negative consequences for usage intention can be expected from text simplification. On the other hand, the results show that people with reading and writing difficulties prefer text simplification for proposals in e-participation.

pdf abs
Pitfalls of Conversational LLMs on News Debiasing
Ipek Baris Schlicht | Defne Altiok | Maryanne Taouk | Lucie Flek

This paper addresses debiasing in news editing and evaluates the effectiveness of conversational Large Language Models in this task. We designed an evaluation checklist tailored to news editors’ perspectives, obtained generated texts from three popular conversational models using a subset of a publicly available dataset in media bias, and evaluated the texts according to the designed checklist. Furthermore, we examined the models as evaluators for checking the quality of debiased model outputs. Our findings indicate that none of the LLMs are perfect in debiasing. Notably, some models, including ChatGPT, introduced unnecessary changes that may impact the author’s style and create misinformation. Lastly, we show that the models do not perform as proficiently as domain experts in evaluating the quality of debiased outputs.

pdf abs
Integrating conflict prevention tools into deliberative democracy online platforms
Sara Greco | Chiara Jermini

This paper presents a set of preliminary guidelines for conflict prevention developed within the EU-funded research project ORBIS (“Augmenting participation, co-creation, trust and transparency in Deliberative Democracy at all scales”), whose goal is developing online platforms that enable citizens to enhance their participation in democratic processes, through open discussions around important political topics. Based on previous research on communication and argumentation in conflict resolution discourse and on the empirical analysis of discussions around deliberative democracy topics, this paper highlights recurrent interpersonal communication problems that might occur in group discussions around complex topics and that, if not handled well, can lead to conflicts. It also introduces a first proposal for solutions that help participants in such discussions, both through technology and with the assistance of human moderators, to avoid the development and escalation of conflicts.

pdf abs
A Hybrid Human-AI Approach for Argument Map Creation From Transcripts
Lucas Anastasiou | Anna De Liddo

To overcome the challenges of traditional deliberation approaches, which often silo information exchange between synchronous and asynchronous modes and thereby hinder effective deliberation, we present a hybrid framework combining Large Language Models (LLMs) and human-in-the-loop curation to generate argument maps from deliberation transcripts. This approach aims to enhance the efficiency and quality of the generated argument maps, promote transparency, and connect the asynchronous and synchronous deliberation modes. Finally, we outline a realistic deliberation scenario where this process can be successfully integrated.

pdf abs
Leveraging High-Precision Corpus Queries for Text Classification via Large Language Models
Nathan Dykes | Stephanie Evert | Philipp Heinrich | Merlin Humml | Lutz Schröder

We use query results from manually designed corpus queries for fine-tuning an LLM to identify argumentative fragments as a text mining task. The resulting model outperforms both an LLM fine-tuned on a relatively large manually annotated gold standard of tweets and a rule-based approach. This proof-of-concept study demonstrates the usefulness of corpus queries to generate training data for complex text categorisation tasks, especially if the targeted category has low prevalence (so that a manually annotated gold standard contains only a small number of positive examples).

pdf (full)
bib (full) Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

pdf bib
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024
Giorgio Maria Di Nunzio | Federica Vezzani | Liana Ermakova | Hosein Azarbonyad | Jaap Kamps

pdf bib abs
Reproduction of German Text Simplification Systems
Regina Stodden

The paper investigates the reproducibility of various approaches to automatically simplify German texts and identifies key challenges in the process. We reproduce eight sentence simplification systems including rules-based models, fine-tuned models, and prompting of autoregressive models. We highlight three main issues of reproducibility: the impossibility of reproduction due to missing details, code, or restricted access to data/models; variations in reproduction, hindering meaningful comparisons; and discrepancies in evaluation scores between reported and reproduced models. To enhance reproducibility and facilitate model comparison, we recommend the publication of model-related details, including checkpoints, code, and training methodologies. Our study also emphasizes the importance of releasing system generations, when possible, for thorough analysis and better understanding of original works. In our effort to compare reproduced models, we also create a German sentence simplification benchmark of the eight models across six test sets. Overall, the study underscores the significance of transparency, documentation, and diverse training data for advancing reproducibility and meaningful model comparison in automated German text simplification.

pdf bib abs
Complexity-Aware Scientific Literature Search: Searching for Relevant and Accessible Scientific Text
Liana Ermakova | Jaap Kamps

Abstract: We conduct a series of experiments on ranking scientific abstracts in response to popular science queries issued by non-expert users. We show that standard IR ranking models optimized on topical relevance are indeed ignoring the individual user’s context and background knowledge. We also demonstrate the viability of complexity-aware retrieval models that retrieve more accessible relevant documents or ensure these are ranked prior to more advanced documents on the topic. More generally, our results help remove some of the barriers to consulting scientific literature by non-experts and hold the potential to promote science literacy in the general public. Lay Summary: In a world of misinformation and disinformation, access to objective evidence-based scientific information is crucial. The general public ignores scientific information due to its perceived complexity, resorting to shallow information on the web or in social media. We analyze the complexity of scientific texts retrieved for a lay person’s topic, and find a great variation in text complexity. A proof of concept complexity-aware search engine is able to retrieve both relevant and accessible scientific information for a layperson’s information need.

pdf abs
Beyond Sentence-level Text Simplification: Reproducibility Study of Context-Aware Document Simplification
Jan Bakker | Jaap Kamps

Previous research on automatic text simplification has focused almost exclusively on sentence-level inputs. However, the simplification of full documents cannot be tackled by naively simplifying each sentence in isolation, as this approach fails to preserve the discourse structure of the document. Recent Context-Aware Document Simplification approaches explore various models whose input goes beyond the sentence level. These models achieve state-of-the-art performance on the Newsela-auto dataset, which requires a difficult-to-obtain license to use. We replicate these experiments on an open-source dataset, namely Wiki-auto, and share all training details to make future reproductions easy. Our results validate the claim that models guided by a document-level plan outperform their standard counterparts. However, they do not support the claim that simplification models perform better when they have access to a local document context. We also find that planning models do not generalize well to out-of-domain settings. Lay Summary: We have access to unprecedented amounts of information, yet the most authoritative sources may exceed a user’s language proficiency level. Text simplification technology can change the writing style while preserving the main content. Recent paragraph-level and document-level text simplification approaches outcompete traditional sentence-level approaches, and increase the understandability of complex texts.

pdf abs
Towards Automatic Finnish Text Simplification
Anna Dmitrieva | Jörg Tiedemann

Automatic text simplification (ATS/TS) models typically require substantial parallel training data. This paper describes our work on expanding the Finnish-Easy Finnish parallel corpus and making baseline simplification models. We discuss different approaches to document and sentence alignment. After finding the optimal alignment methodologies, we increase the amount of document-aligned data 6.5 times and add a sentence-aligned version of the dataset consisting of more than twelve thousand sentence pairs. Using sentence-aligned data, we fine-tune two models for text simplification. The first is mBART, a sequence-to-sequence translation architecture proven to show good results for monolingual translation tasks. The second is the Finnish GPT model, for which we utilize instruction fine-tuning. This work is the first attempt to create simplification models for Finnish using monolingual parallel data in this language. The data has been deposited in the Finnish Language Bank (Kielipankki) and is available for non-commercial use, and the models will be made accessible through either Kielipankki or public repositories such as Huggingface or GitHub.

pdf abs
Multilingual Resources for Lexical Complexity Prediction: A Review
Matthew Shardlow | Kai North | Marcos Zampieri

Lexical complexity prediction is the NLP task aimed at using machine learning to predict the difficulty of a target word in context for a given user or user group. Multiple datasets exist for lexical complexity prediction, many of which have been published recently in diverse languages. In this survey, we discuss nine recent datasets (2018-2024) all of which provide lexical complexity prediction annotations. Particularly, we identified eight languages (French, Spanish, Chinese, German, Russian, Japanese, Turkish and Portuguese) with at least one lexical complexity dataset. We do not consider the English datasets, which have already received significant treatment elsewhere in the literature. To survey these datasets, we use the recommendations of the Complex 2.0 Framework (Shardlow et al., 2022), identifying how the datasets differ along the following dimensions: annotation scale, context, multiple token instances, multiple token annotations, diverse annotators. We conclude with future research challenges arising from our survey of existing lexical complexity prediction datasets.

pdf abs
Plain Language Summarization of Clinical Trials
Polydoros Giannouris | Theodoros Myridis | Tatiana Passali | Grigorios Tsoumakas

Plain language summarization, or lay summarization, is an emerging natural language processing task, aiming to make scientific articles accessible to an audience of non-scientific backgrounds. The healthcare domain can greatly benefit from applications of automatic plain language summarization, as results that concern a large portion of the population are reported in large documents with complex terminology. However, existing corpora for this task are limited in scope, usually regarding conference or journal article abstracts. In this paper, we introduce the task of automated generation of plain language summaries for clinical trials, and construct CARES (Clinical Abstractive Result Extraction and Simplification), the first corresponding dataset. CARES consists of publicly available, human-written summaries of clinical trials conducted by Pfizer. Source text is identified from documents released throughout the life-cycle of the trial, and steps are taken to remove noise and select the appropriate sections. Experiments show that state-of-the-art models achieve satisfactory results in most evaluation metrics.

pdf abs
Enhancing Lexical Complexity Prediction through Few-shot Learning with Gpt-3
Jenny Alexandra Ortiz-Zambrano | César Humberto Espín-Riofrío | Arturo Montejo-Ráez

This paper describes an experiment to evaluate the ability of the GPT-3 language model to classify terms regarding their lexical complexity. This was achieved through the creation and evaluation of different versions of the model, text-Davinci-002 and text-Davinci-003, and of prompts for few-shot learning to determine the complexity of the words. The results obtained on the CompLex dataset achieve a minimum average error of 0.0856. Although this is not better than the state of the art (which is 0.0609), it is a well-performing and promising approach to lexical complexity prediction without the need for model fine-tuning.

pdf abs
An Approach towards Unsupervised Text Simplification on Paragraph-Level for German Texts
Leon Fruth | Robin Jegan | Andreas Henrich

Text simplification as a research field has received attention in recent years for English and other languages; however, German text simplification techniques are lacking thus far. We present an unsupervised simplification approach for German texts using reinforcement learning (self-critical sequence training). Our main contributions are the adaptation of an existing method for English, the selection and creation of German corpora for this task, and the customization of rewards for particular aspects of the German language. In our paper, we describe our system and an evaluation, including remaining issues and problems due to the complexity of the German language, as well as directions for future research.

pdf abs
Simplification Strategies in French Spontaneous Speech
Lucía Ormaechea | Nikos Tsourakis | Didier Schwab | Pierrette Bouillon | Benjamin Lecouteux

Automatic Text Simplification (ATS) aims at rewriting texts into simpler variants while preserving their original meaning, so they can be more easily understood by different audiences. While ATS has been widely used for written texts, its application to spoken language remains unexplored, even if it is not exempt from difficulty. This study aims to characterize the edit operations performed in order to simplify French transcripts for non-native speakers. To do so, we relied on a data sample randomly extracted from the Orféo-CEFC French spontaneous speech dataset. In the absence of guidelines to direct this process, we adopted an intuitive simplification approach, so as to investigate the crafted simplifications based on expert linguists’ criteria, and to compare them with those produced by a generative AI (namely, ChatGPT). The results, analyzed quantitatively and qualitatively, reveal that the most common edits are deletions, and affect oral production aspects, like restarts or hesitations. Consequently, candidate simplifications are typically register-standardized sentences that solely include the propositional content of the input. The study also examines the alignment between human- and machine-based simplifications, revealing a moderate level of agreement, and highlighting the subjective nature of the task. The findings contribute to understanding the intricacies of simplifying spontaneous spoken language. In addition, the provision of a small-scale parallel dataset derived from such expert simplifications, Propicto-Orféo-Simple, can facilitate the evaluation of speech simplification solutions.

pdf abs
DARES: Dataset for Arabic Readability Estimation of School Materials
Mo El-Haj | Sultan Almujaiwel | Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov

This research introduces DARES, a dataset for assessing the readability of Arabic text in Saudi school materials. DARES comprises 13,335 instances from textbooks used in 2021 and contains two subtasks: (a) coarse-grained readability assessment, where the text is classified into different educational levels such as primary and secondary, and (b) fine-grained readability assessment, where the text is classified into individual grades. We fine-tuned five transformer models that support Arabic and found that CAMeLBERTmix performed the best in all input settings. Evaluation results showed high performance for the coarse-grained readability assessment task, achieving a weighted F1 score of 0.91 and a macro F1 score of 0.79. The fine-grained task achieved a weighted F1 score of 0.68 and a macro F1 score of 0.55. These findings demonstrate the potential of our approach for advancing Arabic text readability assessment in education, with implications for future innovations in the field.
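
The gap between the weighted and macro F1 scores reported above is typical of imbalanced label distributions. The following is a minimal sketch of the two averages, assuming the standard definitions (macro: unweighted mean of per-class F1; weighted: mean of per-class F1 weighted by class support). The labels and predictions are invented for illustration and are not DARES data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute F1 separately for every label seen in the gold or predicted labels."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true: List[str], y_pred: List[str]) -> Tuple[float, float]:
    """Return (macro F1, support-weighted F1)."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)  # number of gold instances per label
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[label] * support[label] for label in f1) / len(y_true)
    return macro, weighted


# Invented, imbalanced toy data: the frequent class is predicted well,
# the rare class poorly.
y_true = ["primary"] * 8 + ["secondary"] * 2
y_pred = ["primary"] * 8 + ["primary", "secondary"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

In this toy setting the weighted average is higher than the macro average because the well-predicted class dominates the support.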

Reading movements and times are a precious cue to follow a reader’s strategy, and to track the underlying effort in text processing. To date, many approaches are being devised to simplify texts to overcome difficulties stemming from sentences that are obscure, ambiguous, or in need of clarification. In the legal domain, ensuring the clarity of norms and regulations is of the utmost importance, as the full understanding of such documents lies at the foundation of core social obligations and rights. This task requires determining which utterances and text excerpts are difficult for which (sort of) reader. This investigation is the aim of the present work. We propose a preliminary study based on eye-tracking data of 61 readers, with a focus on identifying different reader profiles and on predicting the reading times of our readers.

pdf abs
The Simplification of the Language of Public Administration: The Case of Ombudsman Institutions
Gabriel Gonzalez-Delgado | Borja Navarro-Colorado

Language produced by Public Administrations has crucial implications in citizens’ lives. However, its syntactic complexity and the use of legal jargon, among other factors, make it difficult for laypeople and certain target audiences to understand. The NLP task of Automatic Text Simplification (ATS) can help with the necessary simplification of this technical language. For that purpose, specialized parallel datasets of complex-simple pairs need to be developed for the training of these ATS systems. In this position paper, an on-going project is presented, whose main objectives are (a) to extensively analyze the syntactical, lexical, and discursive features of the language of English-speaking ombudsmen, as samples of public administrative language, with special attention to those characteristics that pose a threat to comprehension, and (b) to develop the OmbudsCorpus, a parallel corpus of complex-simple supra-sentential fragments from ombudsmen’s case reports that have been manually simplified by professionals and annotated with standardized simplification operations. This research endeavor aims to provide a deeper understanding of the simplification process and to enhance the training of ATS systems specialized in administrative texts.

pdf abs
Term Variation in Institutional Languages: Degrees of Specialization in Municipal Waste Management Terminology
Nicola Cirillo | Daniela Vellutino

Institutional Italian is a variety of Italian used in the official communications of institutions, especially in public administrations. Besides legal and administrative languages, it comprises the language used in websites, social media and advertising material produced by public administrations. To understand the lexical profile of institutional languages completely, standard measures of lexical complexity, like the type-token ratio and the percentage of basic vocabulary, should be complemented with the examination of the terminological variation. This study compares the terminology of three types of institutional texts: administrative acts, technical-operational texts, and informative texts. In particular, we collected 86 terms with various degrees of specialization and analysed their distribution within the subcorpora of ItaIst-DdAC_GRU, a corpus composed of institutional texts drafted by Italian municipalities about municipal waste management. Results suggest that administrative acts employ high-specialization terms compliant with the law, often in the form of acronyms. Conversely, informative texts contain more low-specialization terms, privileging single-word terms to remain self-contained. Finally, the terminology of technical-operational texts is characterised by standardized and formulaic phrases.
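
For reference, the type-token ratio mentioned above is the number of distinct word forms divided by the total number of tokens. The following is a minimal sketch, assuming lowercase folding and a simple regex tokenizer; both choices are simplifications made for illustration and are not the study's preprocessing.

```python
import re


def type_token_ratio(text: str) -> float:
    """Distinct word forms (types) divided by running words (tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented example sentence: 8 tokens, 5 distinct types.
sample = "the waste and the recycling of the waste"
ttr = type_token_ratio(sample)
print(ttr)
```

The ratio is sensitive to text length, which is one reason it is usually complemented by other measures such as those the study proposes.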

pdf abs
LARGEMED: A Resource for Identifying and Generating Paraphrases for French Medical Terms
Ioana Buhnila | Amalia Todirascu

This article presents a method extending an existing French corpus of paraphrases of medical terms ANONYMOUS with new data from Web archives created during the Covid-19 pandemic. Our method semi-automatically detects new terms and paraphrase markers introducing paraphrases from these Web archives, followed by a manual annotation step to identify paraphrases and their lexical and semantic properties. The extended large corpus LARGEMED could be used for automatic medical text simplification for patients and their families. To automatise data collection, we propose two experiments. The first experiment uses the new LARGEMED dataset to train a binary classifier aiming to detect new sentences containing possible paraphrases. The second experiment aims to use correct paraphrases to train a model for paraphrase generation, by adapting T5 Language Model to the paraphrase generation task using an adversarial algorithm.

pdf abs
Clearer Governmental Communication: Text Simplification with ChatGPT Evaluated by Quantitative and Qualitative Research
Nadine Beks van Raaij | Daan Kolkman | Ksenia Podoynitsyna

This research investigates the application of ChatGPT for the simplification of Dutch government letters, aiming to enhance their comprehensibility without compromising legal accuracy. We use a three-stage mixed-methods evaluation procedure to compare the performance of a naive approach, RoBERTa, and ChatGPT. We select the six most complicated letters from a corpus of 200 letters and use the three approaches to simplify them. First, we compare their scores on four evaluation metrics (ROUGE, BLEU, BLEURT, and LiNT); then, we assess the simplifications with a legal and linguistic expert. Finally, we investigate the performance of ChatGPT in a randomized controlled trial with 72 participants. Our findings reveal that ChatGPT significantly improves the readability of government letters, demonstrating over a 20% increase in comprehensibility scores and a 19% increase in correct question answering among participants. We also demonstrate the importance of a robust evaluation procedure.

pdf abs
Legal Science and Compute Science: A Preliminary Discussions on How to Represent the “Penumbra” Cone with AI
Angela Condello | Giorgio Maria Di Nunzio

Legal science encounters significant challenges with the widespread integration of AI software across various legal operations. The distinction between signs, senses, and references from a linguistic point of view, as drawn by Gottlob Frege, underscores the complexity of legal language, especially in multilingual contexts like the European Union. In this paper, we describe the problems of legal terminology, examining the “penumbra” problem through Herbert Hart’s legal theory of meaning. We also analyze the feasibility of training automatic systems to handle conflicts between different interpretations of legal norms, particularly in multilingual legal systems. By examining the transformative impact of Artificial Intelligence on traditional legal practices, this research contributes to the theoretical discussion about the exploration of innovative methodologies for simplifying complex terminologies without compromising meaning.

pdf abs
Simpler Becomes Harder: Do LLMs Exhibit a Coherent Behavior on Simplified Corpora?
Miriam Anschütz | Edoardo Mosca | Georg Groh

Text simplification seeks to improve readability while retaining the original content and meaning. Our study investigates whether pre-trained classifiers also maintain such coherence by comparing their predictions on both original and simplified inputs. We conduct experiments using 11 pre-trained models, including BERT and OpenAI’s GPT 3.5, across six datasets spanning three languages. Additionally, we conduct a detailed analysis of the correlation between prediction change rates and simplification types/strengths. Our findings reveal alarming inconsistencies across all languages and models. If not promptly addressed, simplified inputs can be easily exploited to craft zero-iteration model-agnostic adversarial attacks with success rates of up to 50%.
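
A minimal sketch of a prediction change rate, assuming it is defined as the fraction of samples whose predicted label differs between the original input and its simplified version. The function name and the toy predictions are hypothetical; the paper's exact definition may differ.

```python
from typing import List, Sequence


def prediction_change_rate(orig_preds: Sequence[str], simp_preds: Sequence[str]) -> float:
    """Fraction of samples whose label changes after simplification."""
    if len(orig_preds) != len(simp_preds):
        raise ValueError("prediction lists must be aligned sample by sample")
    if not orig_preds:
        return 0.0
    changed = sum(o != s for o, s in zip(orig_preds, simp_preds))
    return changed / len(orig_preds)


# Invented classifier outputs for four original texts and their simplifications.
orig: List[str] = ["pos", "neg", "pos", "neg"]
simp: List[str] = ["pos", "pos", "pos", "neg"]

rate = prediction_change_rate(orig, simp)
print(rate)
```

A rate of zero would mean the classifier behaves coherently across both versions of every text.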

pdf abs
Pre-Gamus: Reducing Complexity of Scientific Literature as a Support against Misinformation
Nico Colic | Jin-Dong Kim | Fabio Rinaldi

Scientific literature encodes a wealth of knowledge relevant to various users. However, the complexity of scientific jargon makes it inaccessible to all but domain specialists. It would be helpful for different types of people to be able to get at least a gist of a paper. Biomedical practitioners often find it difficult to keep up with the information load; but even lay people would benefit from scientific information, for example to dispel medical misconceptions. Besides, in many countries, familiarity with English is limited, let alone scientific English, even among professionals. All this points to the need for simplified access to the scientific literature. We thus present an application aimed at solving this problem, which is capable of summarising scientific text in a way that is tailored to specific types of users, and in their native language. For this objective, we used an LLM that our system queries using user-selected parameters. We conducted an informal evaluation of this prototype using a questionnaire in 3 different languages.

pdf (full)
bib (full) Proceedings of the Workshop on Deep Learning and Linked Data (DLnLD) @ LREC-COLING 2024

pdf bib
Proceedings of the Workshop on Deep Learning and Linked Data (DLnLD) @ LREC-COLING 2024
Gilles Sérasset | Hugo Gonçalo Oliveira | Giedre Valunaite Oleskeviciene

pdf bib abs
Investigating the Impact of Different Graph Representations for Relation Extraction with Graph Neural Networks
Moritz Blum | Gennaro Nolano | Basil Ell | Philipp Cimiano

Graph Neural Networks (GNNs) have been applied successfully to various NLP tasks, particularly Relation Extraction (RE). Even though most of these approaches rely on the syntactic dependency tree of a sentence to derive a graph representation, the impact of this choice compared to other possible graph representations has not been evaluated. We examine the effect of representing text through different graph representations for GNNs that are applied to RE, considering, e.g., a fully connected graph of tokens, of semantic role structures, and combinations thereof. We further examine the impact of background knowledge injection from Knowledge Graphs (KGs) into the graph representation to achieve enhanced graph representations. Our results show that combining multiple graph representations can improve the model’s predictions. Moreover, the integration of background knowledge positively impacts scores, as enhancing the text graphs with Wikidata features or WordNet features can lead to an improvement of close to 0.1 points in F1.

pdf bib abs
TaxoCritic: Exploring Credit Assignment in Taxonomy Induction with Multi-Critic Reinforcement Learning
Injy Sarhan | Bendegúz Toth | Pablo Mosteiro | Shihan Wang

Taxonomies can serve as a vital foundation for several downstream tasks such as information retrieval and question answering, yet manual construction limits coverage and full potential. Automatic taxonomy induction, particularly using deep Reinforcement Learning (RL), is underexplored in Natural Language Processing (NLP). To address this gap, we present TaxoCritic, a novel approach that leverages deep multi-critic RL agents for taxonomy induction while incorporating credit assignment mechanisms. Our system uniquely assesses different sub-actions within the induction process, providing a granular analysis that aids in the precise attribution of credit and blame. We evaluate the effectiveness of multi-critic algorithms in experiments regarding both accuracy and robustness performance in edge identification. By providing a detailed comparison with state-of-the-art models and highlighting the strengths and limitations of our method, we aim to contribute to the ongoing

pdf abs
Combining Deep Learning Models and Lexical Linked Data: Some Insights from the Development of a Multilingual News Named Entity Recognition and Linking Dataset
Emmanuel Cartier | Emile Peetermans

This paper presents the methodology and outcomes of a multilingual news benchmark for Named Entity Recognition and Linking that leverages both deep learning, by using a fine-tuned transformer model to detect mentions of persons, locations and organisations in text, and Linguistic Linked Open Data, through the use of Wikidata to disambiguate mentions and link them to ontology entries. It shows the advantages of combining both approaches, not only for building the benchmark but also for fine-tuning detection models. We also highlight several research directions for improving the accuracy of a combined system and for further leveraging the complementary approaches.

pdf abs
Deductive Verification of LLM Generated SPARQL Queries
Alexandre Rademaker | Guilherme Lima | Sandro Rama Fiorini | Viviane Torres da Silva

Considering the increasing applications of Large Language Models (LLMs) to many natural language tasks, this paper presents preliminary findings on developing a verification component for detecting hallucinations of an LLM that produces SPARQL queries from natural language questions. We suggest a logic-based deductive verification of the generated SPARQL query by checking if the original NL question’s deep semantic representation entails the SPARQL’s semantic representation.

pdf abs
How to Turn Card Catalogs into LLM Fodder
Mary Ann Tan | Shufan Jiang | Harald Sack

Bibliographical metadata collections describing pre-modern objects suffer from incompleteness and inaccuracies. This hampers the identification of literary works. In addition, titles often contain voluminous descriptive texts that do not adhere to contemporary title conventions. This paper explores several NLP approaches where greater textual length in titles is leveraged to enhance descriptive information.

pdf abs
Evaluating Large Language Models for Linguistic Linked Data Generation
Maria Pia di Buono | Blerina Spahiu | Verginica Barbu Mititelu

Large language models (LLMs) have revolutionized human-machine interaction with their ability to converse and perform various language tasks. This study investigates the potential of LLMs for knowledge formalization using well-defined vocabularies, specifically focusing on OntoLex-Lemon. As a preliminary exploration, we test four languages (English, Italian, Albanian, Romanian) and analyze the formalization quality of nine words with varying characteristics applying a multidimensional evaluation approach. While manual validation provided initial insights, it highlights the need for developing scalable evaluation methods for future large-scale experiments. This research aims to initiate a discussion on the potential and challenges of utilizing LLMs for knowledge formalization within the Semantic Web framework.

pdf abs
Towards Automated Evaluation of Knowledge Encoded in Large Language Models
Bruno Carlos Luís Ferreira | Catarina Silva | Hugo Gonçalo Oliveira

Large Language Models (LLMs) have a significant user base and are gaining increasing interest and impact across various domains. Given their expanding influence, it is crucial to implement appropriate guardrails or controls to ensure ethical and responsible use. In this paper, we propose to automate the evaluation of the knowledge stored in LLMs. This is achieved by generating datasets tailored for this specific purpose, in any selected domain. Our approach consists of four major steps: (i) extraction of relevant entities; (ii) gathering of domain properties; (iii) dataset generation; and (iv) model evaluation. In order to materialize this vision, we experimented with tools and resources for entity linking, knowledge acquisition, classification and prompt generation, yielding valuable insights and lessons. The generation of datasets for domain-specific model evaluation has shown that the approach can be a future tool for evaluating LLMs and moving them from “black boxes” to human-interpretable knowledge bases.

pdf abs
Self-Evaluation of Generative AI Prompts for Linguistic Linked Open Data Modelling in Diachronic Analysis
Florentina Armaselu | Chaya Liebeskind | Giedre Valunaite Oleskeviciene

This article addresses the question of evaluating generative AI prompts designed for specific tasks such as linguistic linked open data modelling and refining of word embedding results. The prompts were created to assist the pre-modelling phase in the construction of LLODIA, a linguistic linked open data model for diachronic analysis. We present a self-evaluation framework based on the method known in literature as LLM-Eval. The discussion includes prompts related to the RDF-XML conception of the model, and neighbour list refinement, dictionary alignment and contextualisation for the term revolution in French, Hebrew and Lithuanian, as a proof of concept.

pdf (full)
bib (full) Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024

pdf bib
Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024
Claire Bonial | Julia Bonn | Jena D. Hwang

For human-robot dialogue in a search-and-rescue scenario, a strong knowledge of the conditions and objects a robot will face is essential for effective interpretation of natural language instructions. In order to utilize the power of large language models without overwhelming the limited storage capacity of a robot, we propose PropBank-Powered Data Creation. PropBank-Powered Data Creation is an expert-in-the-loop data generation pipeline which creates training data for disaster-specific language models. We leverage semantic role labeling and Rich Event Ontology resources to efficiently develop seed sentences for fine-tuning a smaller, targeted model that could operate onboard a robot for disaster relief. We developed 32 sentence templates, which we used to make 2 seed datasets of 175 instructions for earthquake search and rescue and train derailment response. We further leverage our seed datasets as evaluation data to test our baseline fine-tuned models.

pdf bib abs
Aspect Variability and the Annotation of Aspect in the IMAGACT Ontology of Action
Massimo Moneglia | Rossella Varvara

This paper highlights some theoretical and quantitative issues related to the representation and annotation of aspectual meaning in the IMAGACT corpus-based multimodal ontology of action. Given the multimodal nature of this ontology, in which actions are represented through both prototypical visual scenes and linguistic captions, the annotation of aspect in this resource allows us to draw some important considerations about the relation between aspectual meaning and eventualities. The annotation procedure is reported and quantitative data show that, both in the English and Italian corpora, many verbs present aspectual variation, and many eventualities can be represented by locally equivalent verbs with different aspect. The reason why verb aspectual class may vary is investigated. Our analysis makes once more evident that verbs may vary their aspectual properties with respect not only to their argument structure but, more precisely, to the inner qualities of the eventualities they express. Crucially, when eventualities are expressed by equivalent verbs with different aspectual properties, the verbs put on focus different parts of the structure of the eventuality.

pdf abs
NoVRol: A semantic role lexicon of Norwegian verbs
Henrik Torgersen | Erlend Ø. Ravnanger | Lars Hellan | Dag Haug

In this paper, we describe NoVRol, a semantic role lexicon of Norwegian verbs. We start from the NorVal valency lexicon, which describes the syntactic frames of 7.400 verbs. We then enrich each of these frames by annotating, based on the VerbNet annotation scheme, each argument of the verb with the semantic role that it gets. We also encode the syntactic roles of the arguments based on the UD annotation scheme. Our resource will facilitate future research on Norwegian verbs, and can at a future stage be expanded to a full VerbNet

Semantic role labeling (SRL) resources, such as Proposition Bank (PropBank), provide useful input to downstream applications. In this paper we present some challenges and insights we learned while expanding the previously developed Russian PropBank. This new effort involved annotation and adjudication of all predicates within a subset of the prior work in order to provide a test corpus for future applications. We discuss a number of new issues that arose while developing our PropBank for Russian as well as our solutions. Framing issues include: distinguishing between morphological processes that warrant new frames, differentiating between modal verbs and predicate verbs, and maintaining accurate representations of a given language’s semantics. Annotation issues include disagreements derived from variability in Universal Dependency parses and semantic ambiguity within the text. Finally, we demonstrate how Russian sentence structures reveal inherent limitations to PropBank’s ability to capture semantic data. These discussions should prove useful to anyone developing a PropBank or similar SRL resources for a new language.

pdf abs
Unveiling Semantic Information in Sentence Embeddings
Leixin Zhang | David Burian | Vojtěch John | Ondřej Bojar

This study evaluates the extent to which semantic information is preserved within sentence embeddings generated from state-of-the-art sentence embedding models: SBERT and LaBSE. Specifically, we analyzed 13 semantic attributes in sentence embeddings. Our findings indicate that some semantic features (such as tense-related classes) can be decoded from the representation of sentence embeddings. Additionally, we discover a limitation of the current sentence embedding models: inferring meaning beyond the lexical level has proven to be difficult.

pdf abs
A Quantum Theory of Terms and New Challenges to Meaning Representation of Quanterms
Diego Burgos

This article discusses the challenges to meaning representation of terms posed by a quantum theory of terms (QTT) that was recently reported. We first summarize this theory and then highlight the difficulties of representing quanterms, which is the name we coined for the view that the QTT has of terms as quantum systems by analogy with quantum objects in quantum mechanics. We briefly summarize the representation practices followed to date to record and represent terminology. We use findings reported in the literature to model both terms and quanterms and found that current representations of terms in specialized repositories are collapsed quanterms at the expense of other states of the original quanterm. In this work, both quanterms and collapsed quanterms are mathematically modelled following formulations used in quantum mechanics. These formulations suggest that representations of quanterms need to include information about the probabilities of quanterm states and the role they play in the entanglement of terms for phenomena such as specialized collocations.

pdf abs
VOLARE - Visual Ontological LAnguage REpresentation
Werner Winiwarter

In this paper, we introduce a novel meaning representation, which is based on AMR but extends it towards a visual ontological representation. We visualize concepts by representative images, and roles by emojis. All concepts are identified either by PropBank rolesets, Wikipedia page titles, WordNet synsets, or Wikidata lexeme senses. We have developed a Web-based annotation environment enabled by augmented browsing and interactive diagramming. As a first application, we have implemented a multilingual annotation solution by using English as anchor language and comparing it with French and Japanese language versions. To this end, we have extended our representation by a translation deviation annotation to document the differences between the language versions. The intended user groups are, besides professional translators and interpreters, students of translation, language, and literary studies. We describe a first use case in which we use novels by French authors and compare them with their English and Japanese translations. The main motivation for choosing Japanese is the soaring popularity of Japanese courses at our university and the particular challenges involved with trying to master this language.

pdf abs
YARN is All You Knit: Encoding Multiple Semantic Phenomena with Layers
Siyana Pavlova | Maxime Amblard | Bruno Guillaume

In this paper, we present the first version of YARN, a new semantic representation formalism. We propose this new formalism to unify the advantages of logic-based formalisms while retaining direct interpretation, making it widely usable. YARN is rooted in the encoding of different semantic phenomena as separate layers. We begin by presenting a formal definition of the mathematical structure that constitutes YARN. We then illustrate with concrete examples how this structure can be used in the context of semantic representation for encoding multiple phenomena (such as modality, negation and quantification) as layers built on top of a central predicate-argument structure. The benefit of YARN is that it allows for the independent annotation and analysis of different phenomena as they are easy to “switch off”. Furthermore, we have explored YARN’s ability to encode simple interactions between phenomena. We wrap up the work presented by a discussion of some of the interesting observations made during the development of YARN so far and outline our extensive future plans for this formalism.

pdf abs
Argument Sharing in Meaning Representation Parsing
Maja Buljan | Stephan Oepen | Lilja Øvrelid

We present a contrastive study of argument sharing across three graph-based meaning representation frameworks, where semantically shared arguments manifest as reentrant graph nodes. For a state-of-the-art graph parser, we observe how parser performance – in terms of output quality – covaries with overall graph complexity, on the one hand, and presence of different types of reentrancies, on the other hand. We identify common linguistic phenomena that give rise to shared arguments, and therefore node reentrancies, through a small-case and partially automated annotation study and parallel error analysis of actual parser outputs. Our results provide new insights into the distribution of different types of reentrancies in meaning representation graphs for three distinct frameworks, as well as on the effects that these structures have on parser performance, thus suggesting both novel cross-framework generalisations as well as avenues for focussed parser development.
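
As an illustration of what "reentrant graph nodes" means here, the following minimal Python sketch counts nodes with more than one incoming edge in a graph given as (source, role, target) triples. The triple format, the function name `reentrant_nodes` and the example graph for "The boy wants to go" are assumptions made for illustration; they are not taken from the paper or from any particular framework's tooling.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical edge format: (source variable, role label, target variable).
Edge = Tuple[str, str, str]


def reentrant_nodes(edges: List[Edge]) -> Dict[str, int]:
    """Return every node with more than one incoming edge, with its in-degree."""
    indegree = Counter(target for _, _, target in edges)
    return {node: count for node, count in indegree.items() if count > 1}


# "The boy wants to go": the boy (b) is an argument of both want (w) and go (g).
example = [
    ("w", ":ARG0", "b"),
    ("w", ":ARG1", "g"),
    ("g", ":ARG0", "b"),
]

shared = reentrant_nodes(example)
```

In this toy graph only `b` is shared, which is the control-verb pattern that typically produces a reentrancy.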

pdf abs
Mapping PropBank Argument Labels to Czech Verbal Valency
Jan Hajič | Eva Fučíková | Marketa Lopatkova | Zdeňka Urešová

For many years, there have been attempts to compare predicate-argument labeling schemas between formalisms, typically under the dependency assumptions (even if the annotation by these schemas could have been performed on either constituent-based specifications or dependency ones). Given the growing number of resources that link various lexical resources to one another, as well as thanks to parallel annotated corpora (with or without annotation), it is now possible to do more in-depth studies of those correspondences. We present here a high-coverage pilot study of mapping the labeling system used in PropBank (for English) to Czech, which has so far used mainly valency lexicons (in several closely related forms) for annotation projects, under a different level of specification and different theoretical assumptions. The purpose of this study is both theoretical (comparing the argument labeling schemes) and practical (to be able to annotate Czech under the standard UMR specifications).

pdf abs
Lexicalized Meaning Representation (LMR)
Jorge Baptista | Sónia Reis | João Dias | Pedro Santos

This paper presents an adaptation of the Abstract Meaning Representation (AMR) framework for European Portuguese. This adaptation, referred to as Lexicalized Meaning Representation (LMR), was deemed necessary to address specific challenges posed by the grammar of the language, as well as various linguistic issues raised by the current version of AMR annotation guidelines. Some of these aspects stemmed from the use of a notation similar to AMR to represent real texts from the legal domain, enabling its use in Natural Language Processing (NLP) applications. In this context, several aspects of AMR were significantly simplified (e.g., the representation of multi-word expressions, named entities, and temporal expressions), while others were introduced, with efforts made to maintain the representation scheme as compatible as possible with standard AMR notation.

pdf abs
Adjudicating LLMs as PropBank Adjudicators
Julia Bonn | Harish Tayyar Madabushi | Jena D. Hwang | Claire Bonial

We evaluate the ability of large language models (LLMs) to provide PropBank semantic role label annotations across different realizations of the same verbs in transitive, intransitive, and middle voice constructions. In order to assess the meta-linguistic capabilities of LLMs as well as their ability to glean such capabilities through in-context learning, we evaluate the models in a zero-shot setting, in a setting where it is given three examples of another verb used in transitive, intransitive, and middle voice constructions, and finally in a setting where it is given the examples as well as the correct sense and roleset information. We find that zero-shot knowledge of PropBank annotation is almost nonexistent. The largest model evaluated, GPT-4, achieves the best performance in the setting where it is given both examples and the correct roleset in the prompt, demonstrating that larger models can ascertain some meta-linguistic capabilities through in-context learning. However, even in this setting, which is simpler than the task of a human in PropBank annotation, the model achieves only 48% accuracy in marking numbered arguments correctly. To ensure transparency and reproducibility, we publicly release our dataset and model responses.

pdf abs
Extending VerbNet’s Verb-Specific Features to Enhance Selectional Preferences of Semantic Roles
Susan Windisch Brown

This work proposes expanding the thematic role selectional preferences used in the lexical resource VerbNet as a way to increase the available semantic information in the resource, induce semantically-based subclasses for the more generic VerbNet classes, and create new links across classes. The addition of verb-specific features in the latest version of VerbNet provides a means for adding more specific selectional preferences based on the meaning of a class’s individual member verbs. These features could refine both the instantiated class roles and the new implicit roles introduced in VerbNet version 4. We suggest 49 classes that would benefit from 111 verb-specific selectional preferences and explain how they would enhance VerbNet’s semantic representations.

We explore using LLMs, GPT-4 specifically, to generate draft sentence-level Chinese Uniform Meaning Representations (UMRs) that human annotators can revise to speed up the UMR annotation process. In this study, we use few-shot learning and Think-Aloud prompting to guide GPT-4 to generate sentence-level graphs of UMR. Our experimental results show that compared with annotating UMRs from scratch, using LLMs as a preprocessing step reduces the annotation time by two thirds on average. This indicates that there is great potential for integrating LLMs into the pipeline for complicated semantic annotation tasks.

pdf abs
Accelerating UMR Adoption: Neuro-Symbolic Conversion from AMR-to-UMR with Low Supervision
Claire Benet Post | Marie C. McGregor | Maria Leonor Pacheco | Alexis Palmer

Despite Uniform Meaning Representation’s (UMR) potential for cross-lingual semantics, limited annotated data has hindered its adoption. There are large datasets of English AMRs (Abstract Meaning Representations), but the process of converting AMR graphs to UMR graphs is non-trivial. In this paper we address a complex piece of that conversion process, namely cases where one AMR role can be mapped to multiple UMR roles through a non-deterministic process. We propose a neuro-symbolic method for role conversion, integrating animacy parsing and logic rules to guide a neural network, and minimizing human intervention. On test data, the model achieves promising accuracy, highlighting its potential to accelerate AMR-to-UMR conversion. Future work includes expanding animacy parsing, incorporating human feedback, and applying the method to broader aspects of conversion. This research demonstrates the benefits of combining symbolic and neural approaches for complex semantic tasks.

pdf abs
The Relative Clauses AMR Parsers Hate Most
Xiulin Yang | Nathan Schneider

This paper evaluates how well English Abstract Meaning Representation parsers process an important and frequent kind of Long-Distance Dependency construction, namely, relative clauses (RCs). On two syntactically parsed datasets, we evaluate five AMR parsers at recovering the semantic reentrancies triggered by different syntactic subtypes of relative clauses. Our findings reveal a general difficulty among parsers at predicting such reentrancies, with recall below 64% on the EWT corpus. The sequence-to-sequence models (regardless of whether structural biases were included in training) outperform the compositional model. An analysis by relative clause subtype shows that passive subject RCs are the easiest, and oblique and reduced RCs the most challenging, for AMR parsers.

pdf abs
Gaining More Insight into Neural Semantic Parsing with Challenging Benchmarks
Xiao Zhang | Chunliu Wang | Rik van Noord | Johan Bos

The Parallel Meaning Bank (PMB) serves as a corpus for semantic processing with a focus on semantic parsing and text generation. Currently, we witness an excellent performance of neural parsers and generators on the PMB. This might suggest that such semantic processing tasks have by and large been solved. We argue that this is not the case and that performance scores from the past on the PMB are inflated by non-optimal data splits and test sets that are too easy. In response, we introduce several changes. First, instead of the prior random split, we propose a more systematic splitting approach to improve the reliability of the standard test data. Second, except for the standard test set, we also propose two challenge sets: one with longer texts including discourse structure, and one that addresses compositional generalization. We evaluate five neural models for semantic parsing and meaning-to-text generation. Our results show that model performance declines (in some cases dramatically) on the challenge sets, revealing the limitations of neural models when confronting such challenges.

pdf (full)
bib (full) Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024

pdf bib abs
Learning Reasons for Product Returns on E-Commerce
Miriam Farber | Slava Novgorodov | Ido Guy

In the rapidly evolving landscape of e-commerce, product returns have become a significant economic burden for businesses, where the reasons for returns may vary from wrong sizing and defective products to simply no longer needing the purchased product. This paper presents, to the best of our knowledge, the first comprehensive study of the complexities of product returns across a variety of e-commerce domains, focusing on the task of predicting the return reason. We propose a supervised approach for predicting return likelihood and the underlying return reason. We test our approach over a real-world dataset from a large e-commerce platform.

The context of modern smart voice assistants is often multi-modal, where images, audio and video content are consumed by users simultaneously. In such a setup, co-reference resolution is especially challenging, and runs across modalities and dialogue turns. We explore the problem of multi-modal co-reference resolution in multi-turn dialogues and quantify the performance of multi-modal LLMs on a specially curated dataset of long, image-interleaved conversations between a voice assistant and human in a shopping use case. We propose a custom architecture for multi-modal embedding alignment using a novel parameter augmentation technique. Our proposed Parameter Augmented LLM approach shows a 4.9% absolute F1 improvement above a cross-attention baseline while reducing the number of parameters being trained by 4x.

pdf abs
Efficient and Interpretable Information Retrieval for Product Question Answering with Heterogeneous Data
Biplob Biswas | Rajiv Ramnath

Expansion-enhanced sparse lexical representation improves information retrieval (IR) by minimizing vocabulary mismatch problems during lexical matching. In this paper, we explore the potential of jointly learning dense semantic representation and combining it with the lexical one for ranking candidate information. We present a hybrid information retrieval mechanism that maximizes lexical and semantic matching while minimizing their shortcomings. Our architecture consists of dual hybrid encoders that independently encode queries and information elements. Each encoder jointly learns a dense semantic representation and a sparse lexical representation augmented by a learnable term expansion of the corresponding text through contrastive learning. We demonstrate the efficacy of our model in single-stage ranking of a benchmark product question-answering dataset containing the typical heterogeneous information available on online product pages. Our evaluation demonstrates that our hybrid approach outperforms independently trained retrievers by 10.95% (sparse) and 2.7% (dense) in MRR@5 score. Moreover, our model offers better interpretability and performs comparably to state-of-the-art cross-encoders while reducing response time by 30% (latency) and cutting computational load by approximately 38% (FLOPs).
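
To make the reported quantities concrete, here is a minimal Python sketch of one way a dense score and a sparse lexical score could be combined, together with MRR@k as it is usually defined. The linear interpolation and its `alpha` weight are assumptions chosen for illustration; the abstract does not state how the two representations are fused, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dense similarity: dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sparse_score(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Lexical similarity: sum of weight products over terms both sides share."""
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


def hybrid_score(q_dense: Sequence[float], d_dense: Sequence[float],
                 q_sparse: Dict[str, float], d_sparse: Dict[str, float],
                 alpha: float = 0.5) -> float:
    """Assumed fusion rule: linear interpolation of dense and sparse scores."""
    return alpha * dot(q_dense, d_dense) + (1.0 - alpha) * sparse_score(q_sparse, d_sparse)


def mrr_at_k(rankings: List[List[str]], gold: List[str], k: int = 5) -> float:
    """Mean reciprocal rank; a query scores 0 if its answer is not in the top k."""
    total = 0.0
    for ranked, answer in zip(rankings, gold):
        for rank, candidate in enumerate(ranked[:k], start=1):
            if candidate == answer:
                total += 1.0 / rank
                break
    return total / len(rankings)


score = hybrid_score([1.0, 0.0], [1.0, 0.0], {"usb": 1.0}, {"usb": 2.0, "cable": 1.0})
mrr = mrr_at_k([["a", "b"], ["x", "y", "z"]], ["b", "z"], k=5)
```

Here the sparse dictionaries stand in for term-weight vectors after expansion; terms absent from either side contribute nothing, which is what makes the lexical part of the score easy to inspect.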

pdf abs
Hallucination Detection in LLM-enriched Product Listings
Ling Jiang | Keer Jiang | Xiaoyu Chu | Saaransh Gulati | Pulkit Garg

E-commerce faces persistent challenges with data quality issues in product listings. Recent advances in Large Language Models (LLMs) offer a promising avenue for automated product listing enrichment. However, LLMs are prone to hallucinations, which we define as the generation of content that is unfaithful to the source input. This poses significant risks in customer-facing applications. Hallucination detection is particularly challenging in the vast e-commerce domain, where billions of products are sold. In this paper, we propose a two-phase approach for detecting hallucinations in LLM-enriched product listings. The first phase prioritizes recall through cost-effective unsupervised techniques. The second phase maximizes precision by leveraging LLMs to validate candidate hallucinations detected in phase one. The first phase significantly reduces the inference space and enables the resource-intensive methods in the second phase to scale effectively. Experiments on two real-world datasets demonstrated that our approach achieved satisfactory recall on unstructured product attributes with suboptimal precision, primarily due to the inherent ambiguity of unstructured attributes and the presence of common sense reasoning. This highlights the necessity for a refined approach to distinguish between common sense and hallucination. On structured attributes with clearly defined hallucinations, our approach effectively detected hallucinations with precision and recall surpassing the targeted level.
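
The two-phase idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: phase one is approximated by a token-overlap check (the abstract says only "cost-effective unsupervised techniques"), and phase two takes a caller-supplied `validator` function as a stand-in for an LLM call. All names and the example listing are hypothetical.

```python
import re
from typing import Callable, Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def phase_one(source: str, enriched: Dict[str, str]) -> List[Tuple[str, str]]:
    """Recall-first filter: flag attributes whose value has a token unseen in the source."""
    source_tokens = tokens(source)
    return [(name, value) for name, value in enriched.items()
            if not tokens(value) <= source_tokens]


def phase_two(source: str, candidates: List[Tuple[str, str]],
              validator: Callable[[str, str, str], bool]) -> List[Tuple[str, str]]:
    """Precision step: keep only candidates the validator does NOT find supported."""
    return [(name, value) for name, value in candidates
            if not validator(source, name, value)]


def toy_validator(source: str, name: str, value: str) -> bool:
    """Stand-in for an LLM judgment: ignore the filler word 'by' when checking support."""
    return tokens(value) - {"by"} <= tokens(source)


source_text = "Blue cotton t-shirt, machine washable"
enriched_listing = {
    "color": "blue",
    "material": "cotton",
    "sleeve": "long sleeve",
    "care": "washable by machine",
}

candidates = phase_one(source_text, enriched_listing)
confirmed = phase_two(source_text, candidates, toy_validator)
```

Phase one flags both `sleeve` and `care`; the validator then clears `care` as a paraphrase of the source, so only the unsupported `sleeve` value remains. The expensive step therefore runs on two attributes rather than four.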

pdf abs
Self-Improving Customer Review Response Generation Based on LLMs
Guy Azov | Tatiana Pelc | Adi Fledel Alon | Gila Kamhi

Previous studies have demonstrated that proactive interaction with user reviews has a positive impact on the perception of app users and encourages them to submit revised ratings. Nevertheless, developers encounter challenges in managing a high volume of reviews, particularly in the case of popular apps with a substantial influx of daily reviews. Consequently, there is a demand for automated solutions aimed at streamlining the process of responding to user reviews. To address this, we have developed a new system for generating automatic responses by leveraging user-contributed documents with the help of retrieval-augmented generation (RAG) and advanced Large Language Models (LLMs). Our solution, named SCRABLE, represents an adaptive customer review response automation that enhances itself with self-optimizing prompts and a judging mechanism based on LLMs. Additionally, we introduce an automatic scoring mechanism that mimics the role of a human evaluator to assess the quality of responses generated in customer review domains. Extensive experiments and analyses conducted on real-world datasets reveal that our method is effective in producing high-quality responses, yielding an improvement of more than 8.5% compared to the baseline. Further validation through manual examination of the generated responses underscores the efficacy of our proposed system.

pdf abs
Don’t Just Translate, Summarize Too: Cross-lingual Product Title Generation in E-commerce
Bryan Zhang | Taichi Nakatani | Daniel Vidal Hussey | Stephan Walter | Liling Tan

Making product titles informative and concise is vital to delighting e-commerce customers. Recent advances have successfully applied monolingual product title summarization to shorten lengthy product titles. This paper explores the cross-lingual product title generation task that summarizes and translates the source language product title to a shortened product title in the target language. Our main contributions are as follows, (i) we investigate the optimal product title length within the scope of e-commerce localization, (ii) we introduce a simple yet effective data filtering technique to train a length-aware machine translation system and compare it to a publicly available LLM, (iii) we propose an automatic approach to validate experimental results using an open-source LLM without human input and show that these evaluation results are consistent with human preferences.

pdf abs
Turkish Typo Correction for E-Commerce Search Engines
Elif Oral | Koray Mancuhan | Hüseyin Varol Erdem | Pınar Ece Hatipoglu

Typo correction is a challenging problem when it is developed for morphologically rich languages. The existing approaches in the literature are successful mainly for English, leaving the problem open for such languages. This creates an issue, because typo correction is a critical component in practice for many systems such as search engines. In particular, the search engines of e-commerce platforms rely heavily on typo correction for product relevancy. A poorly performing typo corrector could return very few relevant products when a user is looking for a product on an e-commerce platform, resulting in a significant revenue decrease. For the first time in the literature, this paper proposes a modern typo corrector for a morphologically rich language, Turkish, which is integrated into the search engine of one of the leading e-commerce platforms in Turkey, Hepsiburada. Our thorough experiments show that this new typo corrector performs very successfully in practice, outperforming the existing Turkish-specific proposals in the literature, even if it is applied outside the context of search engines.

pdf abs
Detecting AI-enhanced Opinion Spambots: a study on LLM-generated Hotel Reviews
Vijini Liyanage | Davide Buscaldi | Penelope Forcioli

Opinion spamming is the posting of fake opinions or reviews to promote or discredit target products, services, or individuals. The concern surrounding this activity has grown steadily, especially because of the development of automated bots for this purpose (“spambots”). Nowadays, Large Language Models (LLMs) have proved their ability to generate text that is almost indistinguishable from human-written text. Therefore, there is a growing concern regarding the use of these models for malicious purposes, among them opinion spamming. In this paper, we carry out a study on LLM-generated reviews, in particular hotel reviews, as we chose the well-known Opinion Spam corpus by Myle Ott as the seed for our dataset. We generated a set of fake reviews with various models and applied different classification algorithms to verify how difficult it is to detect this kind of generated content. The results show that by providing enough training data, it is not difficult to detect the fake reviews generated by such models, as they tend to associate the aspects in the reviews with the same attributes.

pdf abs
Assessing Image-Captioning Models: A Novel Framework Integrating Statistical Analysis and Metric Patterns
Qiaomu Li | Ying Xie | Nina Grundlingh | Varsha Rani Chawan | Cody Wang

In this study, we present a novel evaluation framework for image-captioning models that integrates statistical analysis with common evaluation metrics, utilizing two popular datasets, FashionGen and Amazon, with contrasting dataset variation to evaluate four models: Video-LLaVa, BLIP, CoCa and ViT-GPT2. Our approach not only reveals the comparative strengths of models, offering insights into their adaptability and applicability in real-world scenarios, but also contributes to the field by providing a comprehensive evaluation method that considers both statistical significance and practical relevance to guide the selection of models for specific applications. Specifically, we propose Rank Score as a new evaluation metric that is designed for e-commerce image search applications and employ CLIP Score to quantify dataset variation to offer a holistic view of model performance.

pdf abs
Frogs into princes: A generative model to understand the success of product descriptions
Takehiro Takayanagi | Bruno Charron | Marco Visentini-Scarzanella

In the dynamic marketplace, vendors continuously seek innovative ideas for new products and ways to improve existing ones. These ideas can be uncovered by analyzing text data, such as product descriptions and customer reviews. However, the ever-increasing volume of text data poses a challenge in extracting meaningful insights. Therefore, this study addresses the challenge of extracting actionable insights from the growing volume of text data, with a specific focus on product descriptions. To this end, we investigate two primary research questions: the predictive power of product descriptions for product success, and the capability of style transfer to highlight the successful factors of these descriptions. In response to the first question, our findings validate that product descriptions are indeed reliable indicators of product success. Addressing our second question, we propose a Successful Style Transfer Variational Autoencoder (SST-VAE), a VAE-based language model designed for effective successful style transfer. Qualitative analysis indicates that the SST-VAE effectively enables successful style transfer conditional on a given label. In addition, case studies suggest that the proposed approach could be useful in gaining insights about product success, by highlighting key factors that may contribute to their success. On the other hand, our approach confronts issues such as hallucinations and the need for factual accuracy. These challenges underscore the necessity for continued research in the field of e-commerce natural language processing.

pdf abs
STA: Self-controlled Text Augmentation for Improving Text Classifications
Congcong Wang | Gonzalo Fiz Pontiveros | Steven Derby | Tri Kurniawan Wijaya

Despite recent advancements in Machine Learning, many tasks still involve working in low-data regimes which can make solving natural language problems difficult. Recently, a number of text augmentation techniques have emerged in the field of Natural Language Processing (NLP) which can enrich the training data with new examples, though they are not without their caveats. For instance, simple rule-based heuristic methods are effective, but lack variation in semantic content and syntactic structure with respect to the original text. On the other hand, more complex deep learning approaches can cause extreme shifts in the intrinsic meaning of the text and introduce unwanted noise into the training data. To more reliably control the quality of the augmented examples, we introduce a state-of-the-art approach for Self-Controlled Text Augmentation (STA). Our approach tightly controls the generation process by introducing a self-checking procedure to ensure that generated examples retain the semantic content of the original text. Experimental results on multiple benchmarking datasets demonstrate that STA substantially outperforms existing state-of-the-art techniques, whilst qualitative analysis reveals that the generated examples are both lexically diverse and semantically reliable.

pdf abs
Multi-word Term Embeddings Improve Lexical Product Retrieval
Viktor Shcherbakov | Fedor Krasnov

Product search differs fundamentally from search for documents, Internet resources or vacancies, and therefore requires the development of specialized search systems. The present work describes the H1 embedding model, designed for offline term indexing of product descriptions on e-commerce platforms. The model is compared to other state-of-the-art (SoTA) embedding models within the framework of a hybrid product search system that combines the advantages of lexical methods for product retrieval with those of semantic embedding-based methods. We propose an approach to building semantically rich term vocabularies for search indexes. Compared to other production semantic models, H1 paired with the proposed approach stands out due to its ability to process multi-word product terms as one token. For example, in the search queries “new balance shoes” and “gloria jeans kids wear”, the brand entities “new balance” and “gloria jeans” are each represented as one token. This results in an increased precision of the system without affecting the recall. The hybrid search system with the proposed model scores mAP@12 = 56.1% and R@1k = 86.6% on the WANDS public dataset, beating other SoTA analogues.
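
To make the idea of treating a multi-word product term as one token concrete, here is a minimal sketch. It is not the H1 model: it assumes a hand-written set of known multi-word terms (`TERMS`, hypothetical) and uses a simple greedy longest-match pass over whitespace-separated words.

```python
def tokenize_with_terms(query, multiword_terms, max_len=3):
    """Greedy longest-match tokenizer that keeps known multi-word terms whole."""
    words = query.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, then shrink down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in multiword_terms:
                tokens.append(candidate)
                i += n
                break
        else:
            # No multi-word term starts here, so emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


# Hypothetical term vocabulary, for illustration only.
TERMS = {"new balance", "gloria jeans"}

print(tokenize_with_terms("new balance shoes", TERMS))
print(tokenize_with_terms("gloria jeans kids wear", TERMS))
```

With such tokens in a lexical index, a query for the brand would match the brand term itself rather than unrelated products that merely contain "new" or "balance".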

This paper presents a model architecture and training pipeline for attribute value extraction from search queries. The model uses weak labels generated from customer interactions to train a transformer-based NER model. A two-stage normalization process is then applied to deal with the problem of a large label space: first, the model output is normalized onto common generic attribute values, then it is mapped onto a larger range of actual product attribute values. This approach lets us successfully apply a transformer-based NER model to the extraction of a broad range of attribute values in a real-time production environment for e-commerce applications, contrary to previous research. In an online test, we demonstrate business value by integrating the model into a system for semantic product retrieval and ranking.
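
The two-stage normalization described above can be pictured as two lookups applied in sequence. The sketch below is only an illustration under assumed data: the mapping tables `GENERIC_VALUES` and `CATALOG_VALUES`, and the function name, are hypothetical and not taken from the paper.

```python
# Stage 1 table (hypothetical): raw NER span -> common generic attribute value.
GENERIC_VALUES = {
    "navy": "blue",
    "dark blue": "blue",
    "crimson": "red",
}

# Stage 2 table (hypothetical): generic value -> actual product attribute values.
CATALOG_VALUES = {
    "blue": ["blue", "navy blue", "royal blue"],
    "red": ["red", "crimson red"],
}


def normalize_attribute(span):
    """Map an extracted span to a generic value, then expand to catalog values."""
    key = span.lower().strip()
    generic = GENERIC_VALUES.get(key, key)       # stage 1: small label space
    expanded = CATALOG_VALUES.get(generic, [])   # stage 2: larger value range
    return generic, expanded


print(normalize_attribute("Navy"))
```

Keeping the first stage small is what lets the tagger work with a manageable label space, while the second stage recovers the breadth of values used in the catalog.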

Pool-based active learning techniques have had success producing multi-class classifiers that achieve high accuracy with fewer labels compared to random labeling. However, in an industrial setting where we often have class-level business targets to achieve (e.g., 95% recall at 95% precision for each class), active learning techniques continue to acquire labels for classes that have already met their targets, thus consuming unnecessary manual annotations. We address this problem by proposing a framework called Target-Aware Active Learning that converts any active learning query strategy into its target-aware variant by leveraging the gap between each class’s current estimated accuracy and its corresponding business target. We show empirically that target-aware variants of state-of-the-art active learning techniques achieve business targets faster on 2 open-source image classification datasets and 2 proprietary product classification datasets.
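
One simple way to picture a target-aware variant of a query strategy is to scale each example's base score by how far its predicted class still is from its target. The sketch below makes that assumption explicitly; the abstract does not give the exact conversion, and all names and numbers here are hypothetical.

```python
def target_aware_scores(base_scores, predicted_classes, estimated, targets):
    """Scale base query scores by the remaining gap to each class's target."""
    # A class that already meets its target has gap 0 and attracts no labels.
    gaps = {c: max(0.0, targets[c] - estimated[c]) for c in targets}
    return [s * gaps[c] for s, c in zip(base_scores, predicted_classes)]


def select_batch(scores, k):
    """Return the indices of the k highest-scoring pool examples."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical pool: uncertainty scores and predicted classes for 3 examples.
base = [0.9, 0.4, 0.6]
predicted = ["shoes", "bags", "bags"]
targets = {"shoes": 0.95, "bags": 0.95}
estimated = {"shoes": 0.97, "bags": 0.80}   # "shoes" has already met its target

scores = target_aware_scores(base, predicted, estimated, targets)
print(select_batch(scores, k=2))
```

In this toy pool the most uncertain example belongs to a class that has met its target, so the annotation budget goes to the two "bags" examples instead.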

pdf abs
Cluster Language Model for Improved E-Commerce Retrieval and Ranking: Leveraging Query Similarity and Fine-Tuning for Personalized Results
Duleep Rathgamage Don | Ying Xie | Le Yu | Simon Hughes | Yun Zhu

This paper proposes a novel method to improve the accuracy of product search in e-commerce by utilizing a cluster language model. The method aims to address the limitations of the bi-encoder architecture while maintaining a minimal additional training burden. The approach involves labeling top products for each query, generating semantically similar query clusters using the K-Means clustering algorithm, and fine-tuning a global language model into cluster language models on individual clusters. The parameters of each cluster language model are fine-tuned to learn local manifolds in the feature space efficiently, capturing the nuances of various query types within each cluster. The inference is performed by assigning a new query to its respective cluster and utilizing the corresponding cluster language model for retrieval. The proposed method results in more accurate and personalized retrieval results, offering a superior alternative to the popular bi-encoder based retrieval models in semantic search.
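
The inference step (assign a new query to its cluster, then use that cluster's model) can be sketched as nearest-centroid routing. This is an illustration only: the centroids would come from K-Means over query embeddings, which is omitted here, and the two-dimensional vectors and model names are hypothetical placeholders.

```python
import math


def nearest_cluster(query_vec, centroids):
    """Return the index of the centroid closest to the query embedding."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)), key=lambda k: dist(query_vec, centroids[k]))


# Hypothetical centroids from a previously fitted K-Means model.
CENTROIDS = [[0.0, 1.0], [1.0, 0.0]]

# Hypothetical handles for the fine-tuned cluster language models.
CLUSTER_MODELS = {0: "cluster-model-0", 1: "cluster-model-1"}


def route(query_vec):
    """Pick the cluster language model responsible for this query."""
    return CLUSTER_MODELS[nearest_cluster(query_vec, CENTROIDS)]


print(route([0.9, 0.2]))
```

Routing adds only a distance computation per centroid at query time, which is consistent with the stated goal of keeping the extra burden small.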

Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024
Atul Kr. Ojha | Sina Ahmadi | Silvie Cinková | Theodorus Fransen | Chao-Hong Liu | John P. McCrae

pdf bib abs
Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language
Raphaël Merx | Aso Mahmudi | Katrina Langford | Leo Alberto de Araujo | Ekaterina Vylomova

This study explores the use of large language models (LLMs) for translating English into Mambai, a low-resource Austronesian language spoken in Timor-Leste, with approximately 200,000 native speakers. Leveraging a novel corpus derived from a Mambai language manual and additional sentences translated by a native speaker, we examine the efficacy of few-shot LLM prompting for machine translation (MT) in this low-resource context. Our methodology involves the strategic selection of parallel sentences and dictionary entries for prompting, aiming to enhance translation accuracy, using open-source and proprietary LLMs (LlaMa 2 70b, Mixtral 8x7B, GPT-4). We find that including dictionary entries in prompts and a mix of sentences retrieved through TF-IDF and semantic embeddings significantly improves translation quality. However, our findings reveal stark disparities in translation performance across test sets, with BLEU scores reaching as high as 21.2 on materials from the language manual, in contrast to a maximum of 4.4 on a test set provided by a native speaker. These results underscore the importance of diverse and representative corpora in assessing MT for low-resource languages. Our research provides insights into few-shot LLM prompting for low-resource MT, and makes available an initial corpus for the Mambai language.
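
The retrieval step behind few-shot prompting can be sketched with a small TF-IDF ranker that selects the parallel sentences most similar to the input and places them in the prompt. This is a simplified illustration, not the study's pipeline: it covers only TF-IDF (no semantic embeddings or dictionary entries), the prompt wording is an assumption, and the Mambai sides are placeholders rather than real translations.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for whitespace-tokenized documents."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, docs, k=2):
    """Return indices of the k documents most similar to the query."""
    vecs, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vecs[i]), reverse=True)
    return ranked[:k]


def build_prompt(source, pairs, k=2):
    """Assemble a few-shot translation prompt from retrieved sentence pairs."""
    idx = retrieve(source, [en for en, _ in pairs], k)
    parts = ["Translate from English to Mambai."]
    parts += [f"English: {pairs[i][0]}\nMambai: {pairs[i][1]}" for i in idx]
    parts.append(f"English: {source}\nMambai:")
    return "\n\n".join(parts)


# Placeholder corpus: the Mambai sides are stand-ins, not real sentences.
PAIRS = [
    ("where is the market", "<Mambai sentence 1>"),
    ("the child is sleeping", "<Mambai sentence 2>"),
    ("I am going to the market", "<Mambai sentence 3>"),
]

print(build_prompt("the market is far", PAIRS))
```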

pdf bib abs
Improved Neural Word Segmentation for Standard Tibetan
Collin J. Brown

As Tibetan is traditionally not written with word delimiters, various means of word segmentation are necessary to prepare data for downstream tasks. Neural word segmentation has proven a successful means of parsing Tibetan text, but current performance lags behind that of neural word segmenters in other languages, such as Chinese or Japanese, and even behind languages with relatively similar orthographic structures, such as Vietnamese or Thai. We apply methods that have proven useful for these latter two languages, in addition to Classical Tibetan, toward the development of a neural word segmenter with the goal of raising the peak performance of Tibetan neural word segmentation to a level comparable to that reached for orthographically similar languages.

pdf abs
Open Text Collections as a Resource for Doing NLP with Eurasian Languages
Sebastian Nordhoff | Christian Döhler | Mandana Seyfeddinipur

The Open Text Collections project establishes a high-quality publication channel for interlinear glossed text from endangered languages. Text collections will be made available in an open interoperable format and as a more traditional book publication. The project addresses a variety of audiences, e.g., community members, typological linguists, anthropologists, and NLP practitioners.

pdf abs
The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection
Chaak-ming Lau | Mingfei Lau | Ann Wai Huen To

This paper presents a linguistically-informed, non-machine-learning tool for classifying Written Cantonese, Standard Written Chinese, and the intermediate varieties used by Cantonese-speaking users from Hong Kong, which are often grouped into a single “Traditional Chinese” label. Our approach addresses the lack of textual materials for Cantonese NLP, a consequence of a lower sociolinguistic status of Written Cantonese and the interchangeable use of these varieties by users without sufficient language labeling. The tool utilizes key strings and quotation markers, which can be reduced to string operations, to effectively extract Written Cantonese sentences and documents from materials mixed with Standard Written Chinese. This allows for the flexible and efficient extraction of high-quality Cantonese data from large datasets, catering to specific classification needs. This implementation ensures that the tool can process large amounts of data at a low cost by bypassing model-inferencing, which is particularly significant for marginalized languages. The tool also aims to provide a baseline measure for future classification systems, and the approach may be applicable to other low-resource regional or diglossic languages.
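
Because the approach reduces to string operations, its core can be sketched in a few lines. The marker lists below are a small illustrative selection of characters commonly associated with Written Cantonese and with Standard Written Chinese; they are assumptions for this sketch and not the tool's actual key strings, and the handling of quotation markers is omitted.

```python
# Illustrative marker sets (assumed, not the tool's real lists).
CANTONESE_MARKERS = ("嘅", "咗", "喺", "唔", "哋", "冇")
STANDARD_MARKERS = ("的", "了", "在", "是", "們", "沒")


def classify_sentence(sentence):
    """Label a sentence by which variety's marker strings it contains."""
    yue = sum(sentence.count(m) for m in CANTONESE_MARKERS)
    swc = sum(sentence.count(m) for m in STANDARD_MARKERS)
    if yue and not swc:
        return "cantonese"
    if swc and not yue:
        return "standard_written_chinese"
    if yue and swc:
        return "mixed"
    return "neutral"  # no distinguishing markers found


print(classify_sentence("我哋唔喺屋企"))
print(classify_sentence("我們不在家"))
```

A pass like this needs no model inference, which is why it stays cheap on large collections; the finer-grained categories in the paper would need richer rules than this four-way split.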

pdf abs
Neural Mining of Persian Short Argumentative Texts
Mohammad Yeghaneh Abkenar | Manfred Stede

Argumentation mining (AM) is concerned with extracting arguments from texts and classifying the elements (e.g., claim and premise) and relations between them, as well as creating an argumentative structure. A significant hurdle to research in this area for the Persian language is the lack of annotated Persian language corpora. This paper introduces the first argument-annotated corpus in Persian and thereby the possibility of expanding argumentation mining to this low-resource language. The starting point is the English argumentative microtext corpus (AMT) (Peldszus and Stede, 2015), and we built the Persian variant by machine translation (MT) and careful post-editing of the output. We call this corpus Persian argumentative microtext (PAMT). Moreover, we present the first results for Argumentative Discourse Unit (ADU) classification for Persian, which is considered to be one of the fundamental subtasks of argumentation mining. We adopted span categorization using the deep learning model of spaCy Version 3.0 (a CNN model on top of Bloom embeddings with attention) on the corpus for determining argumentative units and their type (claim vs. premise).

pdf abs
Endangered Language Preservation: A Model for Automatic Speech Recognition Based on Khroskyabs Data
Ruiyao Li | Yunfan Lai

This is a report on an Automatic Speech Recognition (ASR) experiment conducted using Khroskyabs data. With the impact of information technology development and globalization challenges on linguistic diversity, this study focuses on the preservation crisis of the endangered Gyalrongic languages, particularly the Khroskyabs language. We used Automatic Speech Recognition technology and the Wav2Vec2 model to transcribe the Khroskyabs language. Despite challenges such as data scarcity and the language’s complex morphology, preliminary results show promising character accuracy from the model. Additionally, a linguist gave relatively high evaluations to the transcription results of our model. Therefore, the experimental and evaluation results demonstrate the high practicality of our model. At the same time, the results also reveal issues with high word error rates, so we plan to augment our existing dataset with additional Khroskyabs data in further studies. This study provides insights and methodologies for using Automatic Speech Recognition to transcribe and protect Khroskyabs, and we hope that this can contribute to the preservation efforts of other endangered languages.
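
Good character accuracy and a high word error rate can coexist, because one wrong character is enough to make a whole word count as an error. The sketch below computes both rates with a standard Levenshtein distance; the strings are arbitrary placeholders, not Khroskyabs data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Placeholder strings: two single-character slips in three words.
reference = "abcd efgh ijkl"
hypothesis = "abcx efgy ijkl"

print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"WER = {wer(reference, hypothesis):.3f}")
```

Here two character substitutions out of fourteen characters make two of the three words wrong, so the word-level rate is several times the character-level rate.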

pdf abs
This Word Mean What: Constructing a Singlish Dictionary with ChatGPT
Siew Yeng Chow | Chang-Uk Shin | Francis Bond

Despite the magnitude of recent progress in natural language processing and multilingual language modeling research, the vast majority of NLP research is focused on English and other major languages. This is because recent NLP research is mainly data-driven, and there is more data for resource-rich languages. In particular, Large Language Models (LLMs) make use of large unlabeled datasets, a resource that many languages do not have. In this project, we built a new, open-sourced dictionary of Singlish, a contact variety that contains features from English and other local languages and is syntactically, phonologically and lexically distinct from Standard English (Tan, 2010). First, a list of Singlish words was extracted from various online sources. Then, using an open ChatGPT LLM API, the description, including the definition, part of speech, pronunciation and examples, was produced. These were then refined through post-processing carried out by a native speaker. The dictionary currently has 1,783 entries and is published under the CC-BY-SA license. The project was carried out with the intention of facilitating future Singlish research and other applications, as the accumulation and management of language resources will be of great help in promoting research on the language in the future.

Large Language Models (LLMs) have shown significant promise in various tasks, including identifying the political beliefs of English-speaking social media users from their posts. However, assessing LLMs for this task in non-English languages remains unexplored. In this work, we ask to what extent LLMs can predict the political ideologies of users in Persian social media. To answer this question, we first acknowledge that political parties are not well-defined among Persian users, and therefore, we reduce the task to the much simpler one of hyperpartisan ideology detection. We create a new benchmark and show the potential and limitations of both open-source and commercial LLMs in classifying the hyperpartisan ideologies of users. We compare these models with smaller fine-tuned models, both on the Persian language (ParsBERT) and translated data (RoBERTa), showing that they considerably outperform generative LLMs in this task. We further demonstrate that the performance of the generative LLMs degrades when classifying users based on their tweets instead of their bios and even when tweets are added as additional information, whereas the smaller fine-tuned models are robust and achieve similar performance for all classes. This study is a first step toward political ideology detection in Persian Twitter, with implications for future research to understand the dynamics of ideologies in Persian social media.

Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing

pdf bib abs
Construction of a Japanese Financial Benchmark for Large Language Models
Masanori Hirano

With the recent development of large language models (LLMs), the necessity of models that focus on certain domains and languages has been discussed. There is also a growing need for benchmarks to evaluate the performance of current LLMs in each domain. Therefore, in this study, we constructed a benchmark comprising multiple tasks specific to the Japanese and financial domains and performed benchmark measurements on some models. Consequently, we confirmed that GPT-4 is currently outstanding, and that the constructed benchmarks function effectively. According to our analysis, our benchmark can differentiate benchmark scores among models in all performance ranges by combining tasks with different difficulties.

pdf bib abs
KRX Bench: Automating Financial Benchmark Creation via Large Language Models
Guijin Son | Hyunjun Jeon | Chami Hwang | Hanearl Jung

In this work, we introduce KRX-Bench, an automated pipeline for creating financial benchmarks via GPT-4. To demonstrate the effectiveness of the pipeline, we create KRX-Bench-POC, a benchmark assessing the knowledge of LLMs in real-world companies. This dataset comprises 1,002 questions, each focusing on companies across the U.S., Japanese, and Korean stock markets. We make our pipeline and dataset publicly available and integrate the evaluation code into EleutherAI’s Language Model Evaluation Harness.

pdf abs
BLU-SynTra: Distinguish Synergies and Trade-offs between Sustainable Development Goals Using Small Language Models
Loris Bergeron | Jerome Francois | Radu State | Jean Hilger

Since the United Nations defined the Sustainable Development Goals (SDGs), studies have shown that these goals are interlinked in different ways. The concept of SDG interlinkages refers to the complex network of interactions existing within and between the SDGs themselves. These interactions are referred to as synergies and trade-offs. Synergies represent positive interactions where the progress of one SDG contributes positively to the progress of another. On the other hand, trade-offs are negative interactions where the progress of one SDG has a negative impact on another. However, evaluating such interlinkages is a complex task, not only because of the multidimensional nature of SDGs, but also because it is highly exposed to personal interpretation bias and technical limitations. Recent studies are mainly based on expert judgements, literature reviews, sentiment or data analysis. To remedy these limitations, we propose the use of Small Language Models in addition to advanced Retrieval-Augmented Generation to distinguish synergies and trade-offs between SDGs. In order to validate our results, we have drawn on the study carried out by the European Commission’s Joint Research Centre, which provides a database of interlinkages labelled according to the presence of synergies or trade-offs.

pdf abs
Assessing the Impact of ESG-Related News on Stock Trading in the Indonesian Market: A Text Similarity Framework Approach
Okiriza Wibisono | Ali Akbar Septiandri | Reinhard Denis Najogie

Environmental, Social, and Governance (ESG) perspectives have become integral to corporate decision-making and investment, with global regulatory mandates for ESG disclosure. The reliability of ESG ratings, crucial for assessing corporate sustainability practices, is compromised by inconsistencies and discrepancies across and within rating agencies, casting doubt on their effectiveness in reflecting true ESG performance and impact on firm valuations. While there have been studies using ESG-related news articles to measure their effect on stock trading, none have studied the Indonesian stock market. To address this gap, we developed a text similarity framework to identify ESG-related news articles based on Sustainability Accounting Standards Board (SASB) Standards without the need for manual annotations. Using news articles from one of the prominent business media outlets in Indonesia and an event study method, we found that 17.9% of 18,431 environment-related news articles are followed by increased stock trading in the firms mentioned in the news, compared to 16.0% on random-dates datasets of the same size and firm composition. This approach is intended as a simpler alternative to building an ESG-specific news labeling model or using third-party data providers, although further analyses may be required to evaluate its robustness.

pdf abs
Development and Evaluation of a German Language Model for the Financial Domain
Nata Kozaeva | Serhii Hamotskyi | Christian Hanig

Recent advancements in self-supervised pre-training of Language Models (LMs) have significantly improved their performance across a wide range of Natural Language Processing (NLP) tasks. Yet, the adaptation of these models to specialized domains remains a critical endeavor, as it enables the models to grasp domain-specific nuances, terminology, and patterns more effectively, thereby enhancing their utility in specialized contexts. This paper presents an in-depth investigation into the training and fine-tuning of German language models specifically for the financial sector. We construct various datasets for training and fine-tuning to examine the impact of different data construction strategies on the models’ performance. Our study provides detailed insights into essential pre-processing steps, including text extraction from PDF documents and language identification, to evaluate their influence on the performance of the language models. Addressing the scarcity of resources in the German financial domain, we also introduce a German Text Classification benchmark dataset, aimed at fostering further research and development in this area. The performance of the trained models is evaluated on two domain-specific tasks, demonstrating that fine-tuning with domain-specific data improves model outcomes, even with limited amounts of domain-specific data.

pdf abs
Evaluating Multilingual Language Models for Cross-Lingual ESG Issue Identification
Wing Yan Li | Emmanuele Chersoni | Cindy Sing Bik Ngai

The automation of information extraction from ESG reports has recently become a topic of increasing interest in the Natural Language Processing community. While such information is highly relevant for socially responsible investments, identifying the specific issues discussed in a corporate social responsibility report is one of the first steps in an information extraction pipeline. In this paper, we evaluate methods for tackling the Multilingual Environmental, Social and Governance (ESG) Issue Identification Task. Our experiments use existing datasets in English, French and Chinese with a unified label set. Leveraging multilingual language models, we compare two approaches that are commonly adopted for the given task: off-the-shelf and fine-tuning. We show that fine-tuning models end-to-end is more robust than off-the-shelf methods. Additionally, translating text into the same language has negligible performance benefits.

Financial prediction from Monetary Policy Conference (MPC) calls is a new yet challenging task, which aims to predict the price movement and volatility of specific financial assets by analyzing multimodal information including text, video, and audio. Although existing work has achieved great success using cross-modal transformer blocks, it overlooks the potential external financial knowledge, the varying contributions of different modalities to financial prediction, as well as the innate relations among different financial assets. To tackle these limitations, we propose a novel Modal-Adaptive kNowledge-enhAnced Graph-basEd financial pRediction scheme, named MANAGER. Specifically, MANAGER resorts to FinDKG to obtain the external related knowledge for the input text. Meanwhile, MANAGER adopts BEiT-3 and Hidden-unit BERT (HuBERT) to extract the video and audio features, respectively. Thereafter, MANAGER introduces a novel knowledge-enhanced cross-modal graph that fully characterizes the semantic relations among text, external knowledge, video and audio, to adaptively utilize the information in different modalities, with ChatGLM2 as the backbone. Extensive experiments on the publicly available Monopoly dataset verify the superiority of our model over cutting-edge methods.

pdf abs
NetZeroFacts: Two-Stage Emission Information Extraction from Company Reports
Marco Wrzalik | Florian Faust | Simon Sieber | Adrian Ulges

We address the challenge of efficiently extracting structured emission information, specifically emission goals, from company reports. Leveraging the potential of Large Language Models (LLMs), we propose a two-stage pipeline that first filters and retrieves potentially relevant passages and then extracts structured information from them using a generative model. We contribute an annotated dataset covering over 14,000 text passages, from which we extracted 739 expert-annotated facts. On this dataset, we investigate the accuracy, efficiency and limitations of LLM-based emission information extraction, evaluate different retrieval techniques, and assess efficiency gains for human analysts by using the proposed pipeline. Our research demonstrates the promise of LLM technology in addressing the intricate task of sustainable emission data extraction from company reports.

pdf abs
FB-GAN: A Novel Neural Sentiment-Enhanced Model for Stock Price Prediction
Jainendra Kumar Jain | Ruchit Agrawal

Predicting stock prices remains a significant challenge in financial markets. This study explores existing stock price prediction systems, identifies their strengths and weaknesses, and proposes a novel method for stock price prediction that leverages a state-of-the-art neural network framework, combining the BERT language model for sentiment analysis on news articles and the GAN model for stock price prediction. We introduce the FB-GAN model, an ensemble model that leverages stock price history and market sentiment score for more accurate stock price prediction, and propose effective strategies to capture the market sentiment. We conduct experiments on stock price prediction for five major equities (Amazon, Apple, Microsoft, Nvidia, and Adobe), and compare the performance obtained by our proposed model against the existing state-of-the-art baseline model. The results demonstrate that our proposed model outperforms existing models across the five major equities. We demonstrate that the strategic incorporation of market sentiment using both headlines and summaries of news articles significantly enhances the accuracy and robustness of stock price prediction.

pdf abs
Unveiling Currency Market Dynamics: Leveraging Federal Reserve Communications for Strategic Investment Insights
Martina Menzio | Davide Paris | Elisabetta Fersini

The purpose of this paper is to extract market signals for the major currencies (EUR, USD, GBP, JPY, CNY) by analyzing the Federal Reserve System (FED) minutes and speeches and, consequently, to make suggestions to investors about going long/short or remaining neutral, based on the causal relationships between FED sentiment and currency exchange rates. To this end, we aim to verify the hypothesis that the currency market dynamics follow a trend that is subject to the sentiment of FED minutes and speeches related to specific relevant currencies. The paper highlights two main findings: (1) the sentiment expressed in the FED minutes has a strong influence on the predictability of major currency trends and (2) the sentiment over time Granger-causes the exchange rate of currencies not only immediately but also at increasing lags, with a monotonically decreasing impact.

Material facts (MF) are crucial and obligatory disclosures that can significantly influence asset values. Following their release, financial analysts embark on the meticulous and highly specialized task of crafting analyses to shed light on their impact on company assets, a challenge elevated by the daily amount of MFs released. Generative AI, with its demonstrated power of crafting coherent text, emerges as a promising solution to this task. However, while these analyses must incorporate the MF, they must also transcend it, enhancing it with vital background information, valuable and grounded recommendations, prospects, potential risks, and their underlying reasoning. In this paper, we approach this task as an instance of controllable text generation, aiming to ensure adherence to the MF and other pivotal attributes as control elements. We first explore language models’ capacity to manage this task by embedding those elements into prompts and engaging popular chatbots. A bilingual proof of concept underscores both the potential and the challenges of applying generative AI techniques to this task.

pdf abs
Exploring Large Language Models in Financial Argument Relation Identification
Yasser Otiefy | Alaa Alhamzeh

In the dynamic landscape of financial analytics, the argumentation within Earnings Conference Calls (ECCs) provides valuable insights for investors and market participants. This paper delves into the automatic relation identification between argument components in this type of data, a poorly studied task in the literature. To tackle this challenge, we empirically examined and analysed a wide range of open-source models, as well as the Generative Pre-trained Transformer GPT-4. On the one hand, our experiments with open-source models spanned general-purpose models, debate-fine-tuned models, and financial-fine-tuned models. On the other hand, we assessed the performance of GPT-4 zero-shot learning on a financial argumentation dataset (FinArg). Our findings show that a smaller open-source model, fine-tuned on relevant data, can perform as well as a much larger general-purpose one, showing the value of enriching the local embeddings with the semantic context of data. However, GPT-4 demonstrated superior performance with an F1-score of 0.81, even with no given samples or shots. In this paper, we detail our data, models and experimental setup. We also provide further performance analysis from different aspects.

In the banking and finance sectors, members of the business units focused on Trend and Risk Analysis process, on a daily basis, internal and external visually-rich documents that include text, images, and tables. Given a facet (i.e., topic) of interest, they are particularly interested in retrieving the top trending keywords related to it and then using them to annotate the most relevant document elements (e.g., text paragraphs, images or tables). In this paper, we explore the use of both open-source and proprietary Large Language Models to automatically generate lists of facet-relevant keywords, automatically produce free-text descriptions of both keywords and multimedia document content, and then annotate documents by leveraging textual similarity approaches. The preliminary results, achieved on English and Italian documents, show that OpenAI GPT-4 achieves superior performance in keyword description generation and multimedia content annotation, while the open-source Meta AI Llama2 model turns out to be highly competitive in generating additional keywords.

ESG-FTSE: A Corpus of News Articles with ESG Relevance Labels and Use Cases
Mariya Pavlova | Bernard Casey | Miaosen Wang

We present ESG-FTSE, the first corpus comprised of news articles with Environmental, Social and Governance (ESG) relevance annotations. In recent years, investors and regulators have pushed ESG investing to the mainstream due to the urgency of climate change. This has led to the rise of ESG scores to evaluate an investment’s credentials as socially responsible. While demand for ESG scores is high, their quality varies wildly. Quantitative techniques can be applied to improve ESG scores and thus responsible investing. To contribute to resource building for ESG and financial text mining, we pioneer the ESG-FTSE corpus. We further present the first-of-its-kind ESG annotation schema. It has three levels: a binary classification (relevant versus irrelevant news articles), ESG classification (ESG-related news articles), and target company. Both supervised and unsupervised learning experiments for ESG relevance detection were conducted to demonstrate that the corpus can be used in different settings to derive accurate ESG predictions.

We present BBRC, a collection of 25 corpora of banking regulatory risk from different departments of Banco do Brasil (BB). These are individual corpora about investments, insurance, human resources, security, technology, treasury, loans, accounting, fraud, credit cards, payment methods, agribusiness, risks, etc. They were annotated in binary form by experts indicating whether or not each regulatory document contains regulatory risk that may require changes to products, processes, services, and channels of a bank department. The corpora in Portuguese contain documents from 26 Brazilian regulatory authorities in the financial sector. In total, there are 61,650 annotated documents, mostly between half and three pages long. The corpora belong to a Natural Language Processing (NLP) application that has been in production since 2020. In this work, we also performed binary classification benchmarks with some of the corpora. Experiments were carried out with different sampling techniques and in one of them we sought to solve an intraclass imbalance problem present in each corpus of the collection. For the benchmarks, we used the following classifiers: Multinomial Naive Bayes, Random Forest, SVM, XGBoost, and BERTimbau (a version of BERT for Portuguese). The BBRC can be downloaded through a link in the article.

Stock Price Prediction with Sentiment Analysis for Chinese Market
Yuchen Luan | Haiyang Zhang | Chenlei Zhang | Yida Mu | Wei Wang

Accurate prediction of stock prices is considered a significant practical challenge and has been a longstanding topic of debate within the economic domain. In recent years, sentiment analysis on social media comments has been considered an important data source for stock prediction. However, most of these works focus on exploring stocks with high market values or from specific industries. The extent to which sentiments affect a broader range of stocks and their overall performance remains uncertain. In this paper, we study the influence of sentiment analysis on stock price prediction with respect to (1) different market value groups and (2) different Book-to-Market ratio groups in the Chinese stock market. To this end, we create a new dataset that consists of 24 stocks across different market value groups and Book-to-Market ratio categories, along with 12,000 associated comments that have been collected and manually annotated. We then utilized this dataset to train a variety of sentiment classifiers, which were subsequently integrated into sequential neural-based models for stock price prediction. Experimental findings indicate that while sentiment integration generally improves the predictive performance for price prediction, it may not consistently lead to better results for individual stocks. Moreover, these outcomes are notably influenced by varying market values and Book-to-Market ratios, with stocks of higher market values and B/M ratios often exhibiting more accurate predictions. Among all the models tested, the Bi-LSTM model incorporating sentiment analysis achieves the best prediction performance.

Topic Taxonomy Construction from ESG Reports
Saif Majdi AlNajjar | Xinyu Wang | Yulan He

The surge in Environmental, Societal, and Governance (ESG) reports, essential for corporate transparency and modern investments, presents a challenge for investors due to their varying lengths and sheer volume. We present a novel methodology, called MultiTaxoGen, for creating topic taxonomies designed specifically for analysing ESG reports. Topic taxonomies serve to illustrate topics covered in a corpus of ESG reports while also highlighting the hierarchical relationships between them. Unfortunately, current state-of-the-art approaches for constructing topic taxonomies are designed for more general datasets, resulting in ambiguous topics and the omission of many latent topics present in ESG-focused corpora. This makes them unsuitable for the specificity required by investors. Our method instead adapts topic modelling techniques by employing them recursively on each topic’s local neighbourhood, the subcorpus of documents assigned to that topic. This iterative approach allows us to identify the child topics and offers a better understanding of topic hierarchies in a fine-grained paradigm. Our findings reveal that our method captures more latent topics in our ESG report corpus than the leading method and provides more coherent topics with comparable relational accuracy.

Duration Dynamics: Fin-Turbo’s Rapid Route to ESG Impact Insight
Weijie Yang | Xinyun Rong

This study introduces “Duration Dynamics: Fin-Turbo’s Rapid Route to ESG Impact Insight”, an innovative approach employing advanced Natural Language Processing (NLP) techniques to assess the impact duration of ESG events on corporations. Leveraging a unique dataset comprising multilingual news articles, the research explores the utility of machine translation for language uniformity, text segmentation for contextual understanding, data augmentation for dataset balance, and an ensemble learning method integrating models like ESG-BERT, RoBERTa, DeBERTa, and Flan-T5 for nuanced analysis. Yielding excellent results, our research showcases the potential of using language models to improve ESG-oriented decision-making, contributing valuable insights to the FinNLP community.

Multilingual ESG News Impact Identification Using an Augmented Ensemble Approach
Harika Abburi | Ajay Kumar | Edward Bowen | Balaji Veeramani

Determining the duration and length of a news event’s impact on a company’s performance remains elusive for financial analysts. The complexity arises from the fact that the effects of these news articles are influenced by various extraneous factors and can change over time. As a result, in this work, we investigate our ability to predict 1) the duration (length) of a news event’s impact, and 2) the level of impact on companies. The datasets used in this study are provided as part of the Multi-Lingual ESG Impact Duration Inference (ML-ESG-3) shared task. To handle the data scarcity, we explored data augmentation techniques to augment our training data. To address each of the research objectives stated above, we employ an ensemble approach combining a transformer model, a variant of Convolutional Neural Networks (CNNs), specifically the KimCNN model, and contextual embeddings. The model’s performance is assessed across a multilingual dataset encompassing English, French, Japanese, and Korean news articles. For the first task of determining impact duration, our model ranked in first, fifth, seventh, and eighth place for Japanese, French, Korean and English texts respectively (with respective macro F1 scores of 0.256, 0.458, 0.552, 0.441). For the second task of assessing impact level, our model ranked in sixth and eighth place for French and English texts, respectively (with respective macro F1 scores of 0.488 and 0.550).
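
The abstracts in this volume report both macro and micro F1. As a reading aid, the following is a minimal sketch of how the two averages differ for single-label classification. The function name `f1_scores` and the toy labels are illustrative assumptions, not taken from any of the papers or from the shared-task scorer.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for non-empty single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, so rare classes count equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical impact-duration labels, for illustration only.
gold = ["short", "short", "medium", "long"]
pred = ["short", "medium", "medium", "long"]
macro_f1, micro_f1 = f1_scores(gold, pred)
```

With one label per item, micro F1 equals accuracy, whereas macro F1 weights every class equally regardless of its frequency.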

Numerous firms advertise action around corporate social responsibility (CSR) on social media. Using a Twitter corpus from S&P 500 companies and topic modeling, we investigate how companies talk about their social and sustainability efforts and whether CSR-related speech predicts Environmental, Social, and Governance (ESG) risk scores. As part of our work in progress, we present early findings suggesting a possible distinction in language between authentic discussion of positive practices and corporate posturing.

LLaMA-2-Econ: Enhancing Title Generation, Abstract Classification, and Academic Q&A in Economic Research
Onur Keles | Omer Turan Bayraklı

Using Quantized Low Rank Adaptation and Parameter Efficient Fine Tuning, we fine-tuned Meta AI’s LLaMA-2-7B large language model as a research assistant in the field of economics for three different types of tasks: title generation, abstract classification, and question and answer. The model was fine-tuned on economics paper abstracts and synthetically created question-answer dialogues based on the abstracts. For the title generation, the results of the experiment demonstrated that LLaMA-2-Econ (the fine-tuned model) surpassed the base model (7B and 13B) with few shot learning, and comparable models of similar size like Mistral-7B and Bloom-7B in the BLEU and ROUGE metrics. For abstract categorization, LLaMA-2-Econ outperformed different machine and deep learning algorithms in addition to state-of-the-art models like GPT 3.5 and GPT 4 with both single and representative few shot learning. We tested the fine-tuned Q&A model by comparing its output with the base LLaMA-2-7B-chat with a Retrieval Augmented Generation (RAG) pipeline with semantic search and dense vector indexing, and found that LLaMA-2 performed on a par with the base model with RAG.

To accurately assess the dynamic impact of a company’s activities on its Environmental, Social, and Governance (ESG) scores, we have initiated a series of shared tasks, named ML-ESG. These tasks adhere to the MSCI guidelines for annotating news articles across various languages. This paper details the third iteration of our series, ML-ESG-3, with a focus on impact duration inference—a task that poses significant challenges in estimating the enduring influence of events, even for human analysts. In ML-ESG-3, we provide datasets in five languages (Chinese, English, French, Korean, and Japanese) and share insights from our experience in compiling such subjective datasets. Additionally, this paper reviews the methodologies proposed by ML-ESG-3 participants and offers a comparative analysis of the models’ performances. Concluding the paper, we introduce the concept for the forthcoming series of shared tasks, namely multi-lingual ESG promise verification, and discuss its potential contributions to the field.

Our team participated in the multi-lingual Environmental, Social, and Governance (ESG) classification task, focusing on datasets in three languages: English, French, and Japanese. This study leverages Pre-trained Language Models (PLMs), with a particular emphasis on the Bidirectional Encoder Representations from Transformers (BERT) framework, to analyze sentence and document structures across these varied linguistic datasets. The team’s experimentation with diverse PLM-based network designs facilitated a nuanced comparative analysis within this multi-lingual context. For each language-specific dataset, different BERT-based transformer models were trained and evaluated. Notably, in the experimental results, the RoBERTa-Base model emerged as the most effective in official evaluation, particularly in the English dataset, achieving a micro-F1 score of 58.82 %, thereby demonstrating superior performance in classifying ESG impact levels. This research highlights the adaptability and effectiveness of PLMs in tackling the complexities of multi-lingual ESG classification tasks, underscoring the exceptional performance of the RoBERTa-Base model in processing English-language data.

DICE @ ML-ESG-3: ESG Impact Level and Duration Inference Using LLMs for Augmentation and Contrastive Learning
Konstantinos Bougiatiotis | Andreas Sideras | Elias Zavitsanos | Georgios Paliouras

We present the submission of team DICE for ML-ESG-3, the 3rd Shared Task on Multilingual ESG impact duration inference in the context of the joint FinNLP-KDF workshop series. The task provides news articles and seeks to determine the impact level and duration that an event in a news article may have on a company. We experiment with various baselines and discuss the results of our best-performing submissions based on contrastive pre-training and a stacked model based on the bag-of-words assumption and sentence embeddings. We also explored the label correlations among events stemming from the same news article and the correlations between impact level and impact length. Our analysis shows that even simple classifiers trained in this task can achieve comparable performance with more complex models, under certain conditions.

Fine-tuning Language Models for Predicting the Impact of Events Associated to Financial News Articles
Neelabha Banerjee | Anubhav Sarkar | Swagata Chakraborty | Sohom Ghosh | Sudip Kumar Naskar

Investors and other stakeholders, like consumers and employees, increasingly consider ESG factors when making decisions about investments or engaging with companies. Taking into account the importance of ESG today, FinNLP-KDF introduced the ML-ESG-3 shared task, which seeks to determine the duration of the impact of financial news articles in four languages: English, French, Korean, and Japanese. This paper describes the approach of our team, LIPI, towards solving the above-mentioned task. Our final systems consist of translation, paraphrasing and fine-tuning language models like BERT, Fin-BERT and RoBERTa for classification. We ranked first in the impact duration prediction subtask for the French language.

CriticalMinds: Enhancing ML Models for ESG Impact Analysis Categorisation Using Linguistic Resources and Aspect-Based Sentiment Analysis
Iana Atanassova | Marine Potier | Maya Mathie | Marc Bertin | Panggih Kusuma Ningrum

This paper presents our method and findings for the ML-ESG-3 shared task for categorising Environmental, Social, and Governance (ESG) impact level and duration. We introduce a comprehensive machine learning framework incorporating linguistic and semantic features to predict ESG impact levels and durations in English and French. Our methodology uses features that are derived from FastText embeddings, TF-IDF vectors, manually crafted linguistic resources, the ESG taxonomy, and aspect-based sentiment analysis (ABSA). We detail our approach, feature engineering process, model selection via grid search, and results. The best performance for this task was achieved by the Random Forest and XGBoost classifiers, with micro-F1 scores of 47.06 % and 65.44 % for English Impact level and Impact length, and 39.04 % and 54.79 % for French Impact level and Impact length respectively.

In this paper, we describe the different approaches explored by the Jetsons team for the Multi-Lingual ESG Impact Duration Inference (ML-ESG-3) shared task. The shared task focuses on predicting the duration and type of the ESG impact of a news article. The shared task dataset consists of 2,059 news titles and articles in English, French, Korean, and Japanese languages. For the impact duration classification task, we fine-tuned XLM-RoBERTa with a custom fine-tuning strategy and using self-training and DeBERTa-v3 using only English translations. These models individually ranked first on the leaderboard for Korean and Japanese and in an ensemble for the English language, respectively. For the impact type classification task, our XLM-RoBERTa model fine-tuned using a custom fine-tuning strategy ranked first for the English language.

ESG Classification by Implicit Rule Learning via GPT-4
Yun Hyojeong | Kim Chanyoung | Moonjeong Hahm | Kyuri Kim | Guijin Son

In this work, we adopt multiple prompting, chain-of-thought reasoning, and in-context learning strategies to guide GPT-4 in solving ESG classification tasks. We rank second in the Korean subset for Shared Task ML-ESG-3 in Impact Type prediction. Furthermore, we adopt open models to explain their calibration and robustness to different prompting strategies. The longer general pre-training correlates with enhanced performance in financial downstream tasks.

Leveraging Semi-Supervised Learning on a Financial-Specialized Pre-trained Language Model for Multilingual ESG Impact Duration and Type Classification
Jungdae Kim | Eunkwang Jeon | Jeon Sang Hyun

This paper presents the results of our participation in the Multilingual ESG Impact Duration Inference (ML-ESG-3) shared task organized by FinNLP-KDF@LREC-COLING-2024. The objective of this challenge is to leverage natural language processing (NLP) techniques to identify the impact duration or impact type of events that may affect a company based on news articles written in various languages. Our approach employs semi-supervised learning methods on a finance-specialized pre-trained language model. Our methodology demonstrates strong performance, achieving 1st place in the Korean - Impact Type subtask and 2nd place in the Korean - Impact Duration subtask. These results showcase the efficacy of our approach in detecting ESG-related issues from news articles. Our research shows the potential to improve existing ESG ratings by quickly reflecting the latest events of companies.

Adapting LLM to Multi-lingual ESG Impact and Length Prediction Using In-context Learning and Fine-Tuning with Rationale
Pawan Kumar Rajpoot | Ashvini Jindal | Ankur Parikh

The prediction of Environmental, Social, and Governance (ESG) impact and duration (length) of impact from company events, as reported in news articles, holds immense significance for investors, policymakers, and various stakeholders. In this paper, we describe solutions from our team “Upaya” to ESG impact and length prediction tasks on one such dataset, ML-ESG-3. The ML-ESG-3 dataset was released along with the shared task as a part of the Fifth Workshop on Knowledge Discovery from Unstructured Data in Financial Services, co-located with LREC-COLING 2024. We employed two different paradigms to adapt Large Language Models (LLMs) to predict both the ESG impact and length of events. In the first approach, we leverage GPT-4 within the In-context learning (ICL) framework. A learning-free dense retriever identifies the top-K relevant In-context learning examples from the training data for a given test example. The second approach involves instruction-tuning Mistral (7B) LLM to predict impact and duration, supplemented with rationale generated using GPT-4. Our models secured second place in French tasks and achieved reasonable results (fifth and ninth rank) in English tasks. These results demonstrate the potential of different LLM-based paradigms for delivering valuable insights within the ESG investing landscape.
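
The retrieval step described in this abstract, picking the top-K training examples most similar to a test example, can be illustrated with a minimal sketch. It assumes that dense embeddings have already been computed by some off-the-shelf encoder; the function names, the toy vectors, and the choice of cosine similarity are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_examples(query_vec, train_vecs, k):
    """Return indices of the k training vectors most similar to the query."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d embeddings standing in for real sentence embeddings (assumption).
train_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]
selected = top_k_examples(query_vec, train_vecs, k=2)
```

The selected indices would then be used to look up the corresponding labelled training articles and place them in the prompt as in-context demonstrations.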

ESG-GPT:GPT4-Based Few-Shot Prompt Learning for Multi-lingual ESG News Text Classification
Ke Tian | Hua Chen

Environmental, Social, and Governance (ESG) factors for company assessment have gained great attention from finance investors to identify companies’ risks and growth opportunities. ESG text data regarding the company, like sustainability reports, media news text, and social media text, are important data sources for ESG analysis such as ESG factor classification. Recently, FinNLP has proposed several ESG-related tasks. One of the tasks is Multi-Lingual ESG Issue Identification 3 (ML-ESG-3), which is to determine the duration or impact level of the impact of an event in the news article regarding the company. In this paper, we mainly discuss the solution of our team, KaKa, to this ML-ESG-3 task. We propose a GPT4 model based on few-shot prompt learning to predict the impact level or duration of the impact of multi-lingual ESG news for the company. The experiment result demonstrates that GPT4-based few-shot prompt learning achieved good performance in leaderboard quantitative evaluations of ML-ESG-3 tasks across different languages.

Shared Task for Cross-lingual Classification of Corporate Social Responsibility (CSR) Themes and Topics
Yola Nayekoo | Sophia Katrenko | Veronique Hoste | Aaron Maladry | Els Lefever

This paper provides an overview of the Shared Task for Cross-lingual Classification of CSR Themes and Topics. We framed the task as two separate sub-tasks: one cross-lingual multi-class CSR theme recognition task for English, French and simplified Chinese and one multi-label fine-grained classification task of CSR topics for Environment (ENV) and Labor and Human Rights (LAB) themes in English. The participants were provided with URLs and annotations for both tasks. Several teams downloaded the data, of which two teams submitted a system for both sub-tasks. In this overview paper, we discuss the set-up of the task and our main findings.

Advancing CSR Theme and Topic Classification: LLMs and Training Enhancement Insights
Jens Van Nooten | Andriy Kosar

In this paper, we present our results of the classification of Corporate Social Responsibility (CSR) Themes and Topics shared task, which encompasses cross-lingual multi-class classification and monolingual multi-label classification. We examine the performance of multiple machine learning (ML) models, ranging from classical models to pre-trained large language models (LLMs), and assess the effectiveness of Data Augmentation (DA), Data Translation (DT), and Contrastive Learning (CL). We find that state-of-the-art generative LLMs in a zero-shot setup still fall behind on more complex classification tasks compared to fine-tuning local models with enhanced datasets and additional training objectives. Our work provides a wide array of comparisons and highlights the relevance of utilizing smaller language models for more complex classification tasks.

Improving Cross-Lingual CSR Classification Using Pretrained Transformers with Variable Selection Networks and Data Augmentation
Shubham Sharma | Himanshu Janbandhu | Ankush Chopra

This paper describes our submission to the Cross-Lingual Classification of Corporate Social Responsibility (CSR) Themes and Topics shared task, aiming to identify themes and fine-grained topics present in news articles. Classifying news articles poses several challenges, including limited training data, noisy articles, and longer context length. In this paper, we explore the potential of using pretrained transformer models to classify news articles into CSR themes and fine-grained topics. We propose two different approaches for these tasks. For multi-class classification of CSR themes, we suggest using a pretrained multi-lingual encoder-based model like microsoft/mDeBERTa-v3-base, along with a variable selection network to classify the article into CSR themes. To identify all fine-grained topics in each article, we propose using a pretrained encoder-based model like Longformer, which offers a higher context length. We employ chunking-based inference to avoid information loss in inference and experiment with using different parts and manifestations of the original article for training and inference.
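
Chunking-based inference, as mentioned in this abstract, can be sketched roughly as follows: a long article is split into overlapping windows, each window is scored separately, and the per-label scores are merged. The window size, the stride, the max-pooling merge rule, and the `score_chunk` callable are all assumptions made for illustration; the abstract does not specify how the authors aggregate chunk predictions.

```python
def chunk_tokens(tokens, size, stride):
    """Split a token list into windows of `size`, advancing by `stride`.

    Assumes 0 < stride <= size so that no token is skipped.
    """
    if size <= 0 or stride <= 0 or stride > size:
        raise ValueError("need 0 < stride <= size")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
        start += stride
    return chunks


def predict_long(tokens, score_chunk, size=512, stride=256):
    """Max-pool per-label scores over all chunks of a long document.

    `score_chunk` is a hypothetical callable mapping a token list to a
    dict of {label: score}; any multi-label classifier could play this role.
    """
    merged = {}
    for chunk in chunk_tokens(tokens, size, stride):
        for label, score in score_chunk(chunk).items():
            merged[label] = max(score, merged.get(label, 0.0))
    return merged


# Toy scorer: the label fires only in chunks containing token 7 (assumption).
def toy_scorer(chunk):
    return {"ENV": 1.0 if 7 in chunk else 0.1}


windows = chunk_tokens(list(range(10)), size=4, stride=3)
scores = predict_long(list(range(10)), toy_scorer, size=4, stride=3)
```

Max-pooling suits multi-label topic detection because a topic mentioned in any single chunk should still be reported for the whole article; averaging would be a reasonable alternative for single-label tasks.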

Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024

Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024
Chris Madge | Jon Chamberlain | Karen Fort | Udo Kruschwitz | Stephanie Lukin

“Actors Challenge”: Collecting Data to Study Prosodic Patterns and Their Mappings to Meanings Across Languages
Sia V. Sepanta

In this paper we describe “Actors Challenge”: a web-based interactive game designed to collect massively multi-speaker, multi-lingual oral data on the connection between prosody and various aspects of meaning. Game participants take on the two roles of auditioners and casting directors. Auditioners are asked to record certain target phrases modulated according to the emotional or attitudinal profiles that correspond to contexts or stage cues given to them. They then switch roles and become Casting Directors. Now they have to listen to other participants’ recordings, guess the corresponding context/stage cue that the auditioner tried to convey, and evaluate how good the performance was. By having the players alternate between these two roles we obtain both data creation and data validation from the same set of participants. We expect that the final dataset of labeled recordings will be valuable for a range of applications: training multilingual Speech Emotion Recognition classifiers; discovering correlations and variations in prosodic patterns among unrelated languages; examining correlations between prosodic patterns and emotion recognizability; probing the possibility that some prosodic patterns are universal.

Empowering Adaptive Digital Game-Based Language Learning for Under-Resourced Languages Through Text Analysis
Elaine Uí Dhonnchadha | Sally Bruen | Liang Xu | Monica Ward

This study explores Cipher, an adaptive language learning game tailored for the under-resourced Irish language, aimed mainly at primary school students. By integrating text analysis techniques, Cipher dynamically adjusts its difficulty based on the player’s language proficiency, offering a customised learning experience. The game’s narrative involves decoding spells to access Irish myths and stories, combining language learning with cultural elements. Development involved collaboration with educators to align the game content with curriculum standards and incorporate culturally relevant materials. This paper outlines the game’s development process, emphasising the use of text analysis for difficulty adjustment and the importance of engaging, educational gameplay. Preliminary results indicate that adaptive games like Cipher can enhance language learning by providing immersive, personalised experiences that maintain player motivation and engagement.

This paper presents the creation of Hostomytho, a game with a purpose intended for evaluating the quality of synthetic biomedical texts through multiple mini-games. Hostomytho was developed entirely using open source technologies both for internet browser and mobile platforms (iOS & Android). The code and the annotations created for synthetic clinical cases in French will be made freely available.

Using In-context Learning to Automate AI Image Generation for a Gamified Text Labelling Task
Fatima Althani | Chris Madge | Massimo Poesio

This paper explores a novel automated method to produce AI-generated images for a text-labelling gamified task. By leveraging the in-context learning capabilities of GPT-4, we automate the optimisation of text-to-image prompts to align with the text being labelled in the part-of-speech tagging task. As an initial evaluation, we compare the optimised prompts to the original sentences based on imageability and concreteness scores. Our results revealed that optimised prompts had significantly higher imageability and concreteness scores. Moreover, to evaluate text-to-image outputs, we generate images using Stable Diffusion XL based on the two prompt types, optimised prompts and the original sentences. Using the automated LAION-Aesthetic predictor model, we assigned aesthetic scores for the generated images. This resulted in the outputs using optimised prompts scoring significantly higher in predicted aesthetics than those using original sentences as prompts. Our preliminary findings suggest that this methodology provides significantly more aesthetic text-to-image outputs than using the original sentence as a prompt. While the initial results are promising, the text labelling task and AI-generated images presented in this paper have yet to undergo human evaluation.

Aspect-based Sentiment Evaluation of Chess Moves (ASSESS): an NLP-based Method for Evaluating Chess Strategies from Textbooks
Haifa Alrdahi | Riza Batista-Navarro

The chess domain is well-suited for creating an artificial intelligence (AI) system that mimics real-world challenges, including decision-making. Throughout the years, minimal attention has been paid to investigating insights derived from unstructured chess data sources. In this study, we examine the complicated relationships between multiple referenced moves in a chess-teaching textbook, and propose a novel method designed to encapsulate chess knowledge derived from move-action phrases. This study investigates the feasibility of using a modified sentiment analysis method as a means for evaluating chess moves based on text. Our proposed Aspect-Based Sentiment Analysis (ABSA) method represents an advancement in evaluating the sentiment associated with referenced chess moves. By extracting insights from move-action phrases, our approach aims to provide a more fine-grained and contextually aware ‘chess move’-based sentiment classification. Through empirical experiments and analysis, we evaluate the performance of our fine-tuned ABSA model, presenting results that confirm the efficiency of our approach in advancing aspect-based sentiment classification within the chess domain. This research contributes to the area of game-playing by machines and shows the practical applicability of leveraging NLP techniques to understand the context of strategic games. Keywords: Natural Language Processing, Chess, Aspect-based Sentiment Analysis (ABSA), Chess Move Evaluation.

Generating Converging Narratives for Games with Large Language Models
Douglas Summers-Stay | Clare R. Voss

We explore methods of combining the probability distributions generated by two LLM prompts in order to generate a continuation that is appropriate for both prompts at once. This is a new capability that extends the possibilities for branching and rejoining narratives in games.
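
One simple way to combine two next-token distributions is to multiply them and renormalise, so that only tokens plausible under both prompts keep substantial probability. The sketch below illustrates that idea on toy distributions. It is an assumption made for illustration and may differ from the combination methods the authors actually study; the function name `combine` and the example tokens are hypothetical.

```python
def combine(p, q):
    """Multiply two token distributions and renormalise (product of experts).

    `p` and `q` map tokens to probabilities; tokens missing from either
    distribution are treated as having zero probability.
    """
    shared = set(p) & set(q)
    scores = {tok: p[tok] * q[tok] for tok in shared}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("the two distributions share no probability mass")
    return {tok: s / total for tok, s in scores.items()}


# Toy next-token distributions from two story branches (illustrative only).
branch_a = {"door": 0.6, "sword": 0.3, "river": 0.1}
branch_b = {"door": 0.5, "river": 0.4, "sword": 0.1}
joint = combine(branch_a, branch_b)
best = max(joint, key=joint.get)
```

Sampling each next token from `joint` rather than from either branch alone favours continuations that read naturally after both prompts, which is the property needed for a narrative that rejoins after branching.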

Leveraging Large Language Models for Spell-Generation in Dungeons & Dragons
Elio Musacchio | Lucia Siciliani | Pierpaolo Basile | Giovanni Semeraro

Dungeons & Dragons (D&D) is a classic tabletop game with a 50-year history. Its intricate and customizable gameplay allows players to create endless worlds and stories. Due to the highly narrative component of this game, D&D and many other interactive games represent a challenging setting for the Natural Language Generation (NLG) capabilities of LLMs. This paper explores using LLMs to generate new spells, which are one of the most captivating aspects of D&D gameplay. Due to the scarcity of resources available for such a specific task, we build a dataset of 3,259 instances by combining official and fan-made D&D spells. We considered several LLMs in generating spells, which underwent a quantitative and qualitative evaluation. Metrics including BLEU and BERTScore were computed for quantitative assessments. Subsequently, we also conducted an in-vivo evaluation with a survey involving D&D players, which could assess the quality of the generated spells as well as their adherence to the rules. Furthermore, the paper emphasizes the open-sourcing of all models, datasets, and findings, aiming to catalyze further research on this topic.

pdf abs
Branching Narratives: Character Decision Points Detection
Alexey Tikhonov

This paper presents the Character Decision Points Detection (CHADPOD) task, the task of identifying points within narratives where characters make decisions that may significantly influence the story’s direction. We propose a novel dataset based on Choose Your Own Adventure (a registered trademark of Chooseco LLC) game graphs to be used as a benchmark for this task. We provide a comparative analysis of different models’ performance on this task, including a couple of LLMs and several MLMs as baselines, achieving up to 89% accuracy. This underscores the complexity of narrative analysis, showing the challenges associated with understanding character-driven story dynamics. Additionally, we show how such a model can be applied to existing text to produce linear segments divided by potential branching points, demonstrating the practical application of our findings in narrative analysis.

pdf abs
Utilizing GPT-4 to Solve TextWorld Commonsense Games Efficiently
Binggang Zhuo | Masaki Murata

Most artificial intelligence agents in interactive fiction games are implemented using reinforcement learning. Considering the recent rapid development of large language models, we propose an approach that utilizes a large language model to tackle interactive fiction game tasks. The chosen test dataset is TextWorld Commonsense, an interactive fiction game environment designed for artificial intelligence agents. In these games, the AI agent’s task is to organize rooms and place items in appropriate locations. To achieve a high score in the game, common sense knowledge about “which items belong to which locations” is important. Our approach is based on GPT-4 and a carefully designed prompt. Experimental results demonstrate that our approach outperforms prior research. Specifically, GPT-4 with a feedback-augmented prompt successfully completed all tasks in both simple and medium level game environments without fine-tuning. In hard level game environments, our approach achieved a normalized score of 0.70, surpassing the best baseline score of 0.57.

pdf abs
Linguistic Acceptability and Usability Enhancement: A Case Study of GWAP Evaluation and Redesign
Wateen Abdullah Aliady | Massimo Poesio

Collecting high-quality annotations for Natural Language Processing (NLP) tasks poses challenges. Gamified annotation systems, like Games-with-a-Purpose (GWAP), have become popular tools for data annotation. For GWAPs to be effective, they must be user-friendly and produce high-quality annotations to ensure the collected data’s usefulness. This paper investigates the effectiveness of a gamified approach through two specific studies on an existing GWAP designed for collecting NLP coreference judgments. The first study involved preliminary usability testing using the concurrent think-aloud method to gather open-ended feedback. This feedback was crucial in pinpointing design issues. Following this, we conducted semi-structured interviews with our participants, and the insights collected from these interviews were instrumental in crafting player personas, which informed design improvements aimed at enhancing user experience. The outcomes of our research have been generalized to benefit other GWAP implementations. The second study evaluated the linguistic acceptability and reliability of the data collected through our GWAP. Our findings indicate that our GWAP produced reliable corpora with 91.49% accuracy and 0.787 Cohen’s kappa.
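For readers unfamiliar with the two figures reported above, the sketch below shows how accuracy against a gold standard and Cohen's kappa between two label sequences are usually computed. The labels are invented toy data, and the abstract does not say which label sequences were compared, so the pairing here is an assumption.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of items where the predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[l] * counts_b[l] for l in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)

# Toy coreference judgments (hypothetical): gold labels vs. game-collected labels.
gold = ["coref", "coref", "new", "new", "coref", "new"]
game = ["coref", "coref", "new", "coref", "coref", "new"]

acc = accuracy(gold, game)
kappa = cohens_kappa(gold, game)
print(round(acc, 4), round(kappa, 4))
```

Kappa is lower than raw agreement because it discounts the matches two annotators would produce by chance given their label frequencies, which is why the two figures are typically reported together.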

pdf abs
Riddle Me This: Evaluating Large Language Models in Solving Word-Based Games
Raffaele Manna | Maria Pia di Buono | Johanna Monti

In this contribution, we examine the proficiency of Large Language Models (LLMs) in solving the linguistic game “La Ghigliottina,” the final game of the popular Italian TV quiz show “L’Eredità”. This game is particularly challenging as it requires LLMs to engage in semantic inference reasoning for identifying the solutions of the game. Our experiment draws inspiration from Ghigliottin-AI, a task of EVALITA 2020, an evaluation campaign focusing on Natural Language Processing (NLP) and speech tools designed for the Italian language. To benchmark our experiment, we use the results of the most successful artificial player in this task, namely Il Mago della Ghigliottina. The paper describes the experimental setting and the results which show that LLMs perform poorly.

Human language interactions involve complex processes beyond pure information exchange, for example, actions aimed at influencing beliefs and behaviors within a communicative context. In this paper, we propose to investigate the dialogue understanding capabilities of large language models (LLMs), particularly in multi-party settings, where challenges like speaker identification and turn-taking are common. Through experiments on the game-based STAC dataset, we explore zero- and few-shot learning approaches for dialogue act classification in a multi-party game setting. Our intuition is that LLMs may excel in tasks framed through examples rather than formal descriptions, influenced by a range of pragmatic features like information presentation order in prompts and others. We also explore the models’ predictive abilities regarding future dialogue acts and study integrating information on dialogue act sequences to improve predictions. Our findings suggest that ChatGPT can keep up with baseline models trained from scratch for classification of certain dialogue act types but also reveal biases and limitations associated with the approach. These insights can be valuable for the development of multi-party chatbots and we try to point out directions for future research towards nuanced understanding and adaptation in diverse conversational contexts.

pdf (full)
bib (full) Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024

pdf bib
Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024
Isuri Anuradha | Martin Wynne | Francesca Frontini | Alistair Plum

pdf bib abs
The Impact of Digital Editing on the Study of Holocaust Survivors’ Testimonies in the context of Voci dall’Inferno Project
Angelo Mario Del Grosso | Marina Riccucci | Elvira Mercatanti

In Nazi concentration camps, approximately 20 million people perished. This included young and old, men and women, Jews, dissidents, and homosexuals. Only 10% of those deported survived. This paper introduces the “Voci dall’Inferno” project, which aims to achieve two key objectives: a) Create a comprehensive digital archive: by encoding a corpus of non-literary testimonies including both written and oral sources. b) Analyze the use of Dante’s language: by identifying the presence of Dante’s lexicon and allusions. Currently, the project holds 47 testimonies, with 29 transcribed in full text and 18 encoded using the XML-TEI format. This project is propelled by a multidisciplinary and educational context with experts in humanities and computer science. The project’s findings will be disseminated through a user-friendly web application built on an XML foundation. Though currently in its prototyping phase, the application boasts several features, including a search engine for testimonies, terms, or phrases within the corpus. Additionally, a browsing interface allows users to read and listen to the original testimonies, while a visualization tool enables deeper exploration of the corpus’s content. Adhering to the Text Encoding Initiative (TEI) guidelines, the project ensures a structured digital archive, aligned with the FAIR principles for data accessibility and reusability.

pdf bib abs
TEI Specifications for a Sustainable Management of Digitized Holocaust Testimonies
Sarah Bénière | Floriane Chiffoleau | Laurent Romary

Data modeling and standardization are central issues in the field of Digital Humanities, and all the more so when dealing with Holocaust testimonies, where stable preservation and long-term accessibility are key. The EHRI Online Editions are composed of documents of diverse nature (testimonies, letters, diplomatic reports, etc.), held by EHRI’s partnering institutions, and selected, gathered thematically and encoded according to the TEI Guidelines by the editors within the EHRI Consortium. Standardization is essential in order to make sure that the editions are consistent with one another. The issue of consistency also encourages a broader reflection on the usage of standards when processing data, and on the standardization of digital scholarly editions of textual documents in general. In this paper, we present the normalization work we carried out on the EHRI Online Editions. It includes a customization of the TEI adapted to Holocaust-related documents, and a focus on the implementation of controlled vocabulary. We recommend the use of these encoding specifications as a tool for researchers and/or non-TEI experts to ensure their encoding is valid and consistent across editions, but also as a mechanism for integrating the edition work smoothly within a wider workflow leading from image digitization to publication.

pdf abs
Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools
Maria Dermentzi | Hugo Scheithauer

The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.

pdf abs
Dates and places as points of attachment for memorial contents in the ISW corpus: 1938 as a turning point
Carolina Flinz | Simona Leonardi

The aim of this paper is the identification and subsequent analysis of crisis years in the narrative biographical interviews with German-speaking Jews from the corpus ISW (Emigrantendeutsch in Israel: Wiener in Jerusalem/ Migrant German in Israel: Viennese in Jerusalem). The possible “chronological landmarks” within a year are also tackled, investigating how a certain year – 1938 – represents a turning point in the life story of the narrators, as it clusters most traumatic events linked to the Shoah. The transcripts were analysed using the tool Sketch Engine. An alternation of corpus-driven and corpus-based steps characterizes this study, which uses a quantitative-qualitative approach (see Lemnitzer and Zinsmeister, 2015) and also integrates approaches from narrative analysis. The research questions that guide our investigation are as follows: Are there any special dates that recur as chronological landmarks of crisis situations (Leonardi 2023a)? Which are they? Do they recur in connection with special places? Which ones?

pdf abs
Creating a Typology of Places to Annotate Holocaust Testimonies Through Machine Learning
Christine Liu | William J.B. Mattingly

The Holocaust was not only experienced in iconic places like Auschwitz or the Warsaw ghetto. Ordinary places, such as city streets, forests, hills, and homes, were transformed by occupation and systematic violence. While most of these places are unnamed and locationally ambiguous, their omnipresence throughout post-war testimonies from witnesses and survivors of the Holocaust emphasizes their undeniable importance. This paper shares a methodology for developing a typology of places in order to annotate both named and unnamed places within interview transcripts from the United States Holocaust Memorial Museum (USHMM) through a machine learning model. The approach underscores the benefits of hybrid analysis through both automated extraction and manual review to create distinct categories of places. This paper also reviews how testimony transcripts were converted into structured data for annotation and previews ongoing work to design a search engine for users to dynamically query this place-based approach to studying the Holocaust.

pdf abs
Speech Technology Services for Oral History Research
Christoph Draxler | Henk van den Heuvel | Arjan van Hessen | Pavel Ircing | Jan Lehečka

Oral history is about oral sources of witnesses and commentators on historical events. Speech technology is an important instrument to process such recordings in order to obtain transcription and further enhancements to structure the oral account. In this contribution we address the transcription portal and the webservices associated with speech processing at BAS, speech solutions developed at LINDAT, how to do it yourself with Whisper, remaining challenges, and future developments.

pdf abs
Identifying Narrative Patterns and Outliers in Holocaust Testimonies Using Topic Modeling
Maxim Ifergan | Omri Abend | Renana Keydar | Amit Pinchevski

The vast collection of Holocaust survivor testimonies presents invaluable historical insights but poses challenges for manual analysis. This paper leverages advanced Natural Language Processing (NLP) techniques to explore the USC Shoah Foundation Holocaust testimony corpus. By treating testimonies as structured question-and-answer sections, we apply topic modeling to identify key themes. We experiment with BERTopic, which leverages recent advances in language modeling technology. We align testimony sections into fixed parts, revealing the evolution of topics across the corpus of testimonies. This highlights both a common narrative schema and divergences between subgroups based on age and gender. We introduce a novel method to identify testimonies within groups that exhibit atypical topic distributions resembling those of other groups. This study offers unique insights into the complex narratives of Holocaust survivors, demonstrating the power of NLP to illuminate historical discourse and identify potential deviations in survivor experiences.

pdf abs
Tracing the deportation to define Holocaust geometries. The exploratory case of Milan
Giovanni Pietro Vitali | Laura Brazzo

This paper presents a pilot project conducted in collaboration with the Fondazione CDEC to shed light on the historical dynamics of the arrests and deportations of Jews from Italy to foreign concentration camps between 1943 and 1945. Led by a multidisciplinary team, including a Digital Humanities expert, an archivist, a GIS developer, and an education manager, the project aimed to rework archival information into data visualisation models utilising a subset of data from the CDEC LOD dataset of the victims of the Holocaust in Italy to construct detailed visual representations of deportation routes. Drawing inspiration from previous projects like the Atlas of Nazi-Fascist Massacres and research on Holocaust testimonies, this project sought to create interactive maps, networks, and graphs illustrating the paths of forced transfers endured by arrested Jews, particularly focusing on those born or arrested in Milan. Despite challenges such as incomplete or imprecise data, the team managed to reconstruct deportation routes and classify transport convoys, enhancing the understanding of this dark period in history. The visualisations, along with detailed repositories and links provided on GitHub, serve as valuable research tools for both scholarly and educational purposes, offering users varying levels of granularity to explore historical events and timelines. Through meticulous data analysis and visualisation techniques, this project contributes to ongoing efforts to preserve and understand the tragic events of the Holocaust, emphasizing the importance of archival work and interdisciplinary collaboration in historical research.

pdf abs
Zero-shot Trajectory Mapping in Holocaust Testimonies
Eitan Wagner | Renana Keydar | Omri Abend

This work presents the task of Zero-shot Trajectory Mapping, which focuses on the spatial dimension of narratives. The task consists of two parts: (1) creating a “map” with all the locations mentioned in a set of texts, and (2) extracting a trajectory from a single testimony and positioning it within the map. Following recent advances in context length capabilities of large language models, we propose a pipeline for this task in a completely unsupervised manner, without the requirement of any type of labels. We demonstrate the pipeline on a set of ≈ 75 testimonies and present the resulting map and samples of the trajectory. We conclude that current long-range models succeed in generating meaningful maps and trajectories. Other than the visualization and indexing, we propose future directions for adaptation of the task as a step for dividing testimony sets into clusters and for alignment between parallel parts of different testimonies.

pdf (full)
bib (full) Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

pdf bib abs
Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar | Ondřej Bojar

Automatic machine translation metrics typically rely on human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment level. Having up to 7 references per segment and taking their average (or maximum) helps all metrics. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher-quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success? These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.
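To make the "average (or maximum)" over several references concrete, the sketch below scores one hypothesis against multiple references and aggregates the per-reference scores. The unigram F1 used here is only a stand-in for a real metric, and the function names and example sentences are hypothetical.

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Toy segment-level metric: F1 over whitespace-token overlap."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def multi_reference_score(hypothesis, references, aggregate="average"):
    """Score against each reference, then take the average or the maximum."""
    if not references:
        raise ValueError("at least one reference is required")
    scores = [unigram_f1(hypothesis, ref) for ref in references]
    if aggregate == "average":
        return sum(scores) / len(scores)
    if aggregate == "max":
        return max(scores)
    raise ValueError(f"unknown aggregate: {aggregate}")

hypothesis = "the cat sat on the mat"
references = ["the cat sat on the mat", "a cat is on the mat"]

avg_score = multi_reference_score(hypothesis, references, "average")
max_score = multi_reference_score(hypothesis, references, "max")
print(round(avg_score, 4), round(max_score, 4))
```

Taking the maximum rewards a hypothesis for matching any one acceptable translation, while the average rewards being close to all of them; which works better for a given metric is an empirical question of the kind the paper examines.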

pdf bib abs
Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French
Ayla Rigouts Terryn | Miryam de Lhoneux

The most widely used LLMs like GPT-4 and Llama 2 are trained on large amounts of data, mostly in English, but are still able to deal with non-English languages. This English bias leads to lower performance in other languages, especially low-resource ones. This paper studies the linguistic quality of LLMs in two non-English high-resource languages: Dutch and French, with a focus on the influence of English. We first construct a comparable corpus of text generated by humans versus LLMs (GPT-4, Zephyr, and GEITje) in the news domain. We proceed to annotate linguistic issues in the LLM-generated texts, obtaining high inter-annotator agreement, and analyse these annotated issues. We find a substantial influence of English for all models under all conditions: on average, 16% of all annotations of linguistic errors or peculiarities had a clear link to English. Fine-tuning an LLM to a target language (GEITje is fine-tuned on Dutch) reduces the number of linguistic issues and probably also the influence of English. We further find that using a more elaborate prompt leads to linguistically better results than a concise prompt. Finally, increasing the temperature for one of the models leads to lower linguistic quality but does not alter the influence of English.

pdf abs
Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions
Mohamed Elaraby | Huihui Xu | Morgan Gray | Kevin Ashley | Diane Litman

Human evaluation remains the gold standard for assessing abstractive summarization. However, current practices often prioritize constructing evaluation guidelines for fluency, coherence, and factual accuracy, overlooking other critical dimensions. In this paper, we investigate argument coverage in abstractive summarization by focusing on long legal opinions, where summaries must effectively encapsulate the document’s argumentative nature. We introduce a set of human-evaluation guidelines to evaluate generated summaries based on argumentative coverage. These guidelines enable us to assess three distinct summarization models, studying the influence of including argument roles in summarization. Furthermore, we utilize these evaluation scores to benchmark automatic summarization metrics against argument coverage, providing insights into the effectiveness of automated evaluation methods.

pdf abs
A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian
Aleksandra Miletić | Filip Miletić

Bosnian, Croatian, Montenegrin and Serbian are the official standard linguistic varieties in Bosnia and Herzegovina, Croatia, Montenegro, and Serbia, respectively. When these four countries were part of the former Yugoslavia, the varieties were considered to share a single linguistic standard. After the individual countries were established, the national standards emerged. Today, a central question about these varieties remains the following: How different are they from each other? How hard is it to distinguish them? While this has been addressed in NLP as part of the task on Distinguishing Between Similar Languages (DSL), little is known about human performance, making it difficult to contextualize system results. We tackle this question by reannotating the existing BCMS dataset for DSL with annotators from all target regions. We release a new gold standard, replacing the original single-annotator, single-label annotation by a multi-annotator, multi-label one, thus improving annotation reliability and explicitly coding the existence of ambiguous instances. We reassess a previously proposed DSL system on the new gold standard and establish the human upper bound on the task. Finally, we identify sources of annotation difficulties and provide linguistic insights into the BCMS dialect continuum, with multiple indicators highlighting an intermediate position of Bosnian and Montenegrin.

pdf abs
Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging
Ashley Lewis | Lingbo Mo | Marie-Catherine de Marneffe | Huan Sun | Michael White

We present our findings from a usability study of an interactive semantic parsing system for knowledge based question answering (KBQA). The system is designed to help users access information within a knowledge base without having to know its query language. The system translates the user’s question into the query language, retrieves an answer, then presents an English explanation of the process so that the user can make corrections if necessary. To our knowledge, our work is the most thorough usability study conducted for such a system and the only one that uses crowdworkers as participants to verify that the system is usable for average users. Our crowdworkers participate in KBQA dialogues using 4 versions of a system based on the framework by Mo et al. (2022) and answer surveys about their experiences. Some key takeaways from this work are: 1) we provide evidence for the benefits of interactivity in semantic parsing with human users and using generated questions in lieu of templated representations, 2) we identify limitations of simulations and provide contrasting evidence from actual system use, and 3) we provide an examination of crowdsourcing methodology, in particular the trade-offs of using crowdworkers vs. a specially trained group of evaluators.

There is often a significant disparity between the performance of Natural Language Processing (NLP) tools as evaluated on benchmark datasets using metrics like ROUGE or BLEU, and the actual user experience encountered when employing these tools in real-world scenarios. This highlights the critical necessity for user-oriented studies aimed at evaluating user experience concerning the effectiveness of developed methodologies. A primary challenge in such “ecological” user studies is their assessment of specific configurations of NLP tools, making replication under identical conditions impractical. Consequently, their utility is limited for the automated evaluation and comparison of different configurations of the same tool. The objective of this study is to conduct an “ecological” evaluation of question generation within the context of an external task involving document linking. To do this, we conducted an “ecological” evaluation of a document linking tool in the context of the exploration of a Social Science archive. From this evaluation, we aim to derive a form of “reference corpus” that can be used offline for the automated comparison of models and quantitative tool assessment. This corpus is available at the following link: https://gitlab.lis-lab.fr/archival-public/autogestion-qa-linking

pdf abs
Towards Holistic Human Evaluation of Automatic Text Simplification
Luisa Carrer | Andreas Säuberli | Martin Kappus | Sarah Ebling

Text simplification refers to the process of rewording within a single language, moving from a standard form into an easy-to-understand one. Easy Language and Plain Language are two examples of simplified varieties aimed at improving readability and understanding for a wide-ranging audience. Human evaluation of automatic text simplification is usually done by employing experts or crowdworkers to rate the generated texts. However, this approach does not include the target readers of simplified texts and does not reflect actual comprehensibility. In this paper, we explore different ways of measuring the quality of automatically simplified texts. We conducted a multi-faceted evaluation study involving end users, post-editors, and Easy Language experts and applied a variety of qualitative and quantitative methods. We found differences in the perception and actual comprehension of the texts by different user groups. In addition, qualitative surveys and behavioral observations proved to be essential in interpreting the results.

pdf abs
Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks
Alexander Frummet | David Elsweiler

Conversational systems are widely used for various tasks, from answering general questions to domain-specific procedural tasks, such as cooking. While the effectiveness of metrics for evaluating general question answering (QA) tasks has been extensively studied, the evaluation of procedural QA remains a challenge as we do not know what answer types users prefer in such tasks. Existing studies on metrics evaluation often focus on general QA tasks and typically limit assessments to one answer type, such as short, SQuAD-like responses or longer passages. This research aims to achieve two objectives. First, it seeks to identify the desired traits of conversational QA systems in procedural tasks, particularly in the context of cooking (RQ1). Second, it assesses how commonly used conversational QA metrics align with these traits and perform across various categories of correct and incorrect answers (RQ2). Our findings reveal that users generally favour concise conversational responses, except in time-sensitive scenarios where brief, clear answers hold more value (e.g. when heating in oil). While metrics effectively identify inaccuracies in short responses, several commonly employed metrics tend to assign higher scores to incorrect conversational answers when compared to correct ones. We provide a selection of metrics that reliably detect correct and incorrect information in short and conversational answers.

pdf abs
The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz | Craig Thomson

This paper presents an overview of, and the results from, the 2024 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’24), following on from three previous shared tasks on reproducibility of evaluations in NLP, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the importance of reproducibility across the two fields. We describe the ReproNLP’24 shared task, summarise results from the reproduction studies submitted, and provide additional comparative analysis of their results.

pdf abs
Once Upon a Replication: It is Humans’ Turn to Evaluate AI’s Understanding of Children’s Stories for QA Generation
Andra-Maria Florescu | Marius Micluta-Campeanu | Liviu P. Dinu

This paper presents the outcomes of a collaborative experiment on human evaluation from the ReproNLP 2024 shared task, track B, part of the ReproHum project. For this paper, we evaluated a QAG (question-answer generation) system centered on English children’s storybooks that was presented in previous research, by using human evaluators for the study. The system generated relevant QA (Question-Answer) pairs based on a dataset with storybooks for early education (kindergarten up to middle school) called FairytaleQA. In the framework of the ReproHum project, we first outline the previous paper and the reproduction strategy that has been decided upon. The complete setup of the first human evaluation is then described, along with the modifications required to replicate it. We also add other relevant related works on this subject. In conclusion, we juxtapose the replication outcomes with those documented in the cited publication. Additionally, we explore the general features of this endeavor as well as its shortcomings.

pdf abs
Exploring Reproducibility of Human-Labelled Data for Code-Mixed Sentiment Analysis
Sachin Sasidharan Nair | Tanvi Dinkar | Gavin Abercrombie

Growing awareness of a ‘Reproducibility Crisis’ in natural language processing (NLP) has focused on human evaluations of generative systems. While labelling for supervised classification tasks makes up a large part of human input to systems, the reproduction of such efforts has thus far not been explored. In this paper, we re-implement a human data collection study for sentiment analysis of code-mixed Malayalam movie reviews, as well as automated classification experiments. We find that missing and under-specified information makes reproduction challenging, and we observe potentially consequential differences between the original labels and those we collect. Classification results indicate that the reliability of the labels is important for stable performance.

pdf abs
Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Michela Lorandi | Anya Belz

Rerunning a metric-based evaluation should be more straightforward and results should be closer than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this brief report of our efforts to rerun a metric-based evaluation of a set of multi-aspect controllable text generation (CTG) techniques shows, however, such reruns of evaluations do not always produce results that are the same as the original results, and can reveal errors in the original work.

In earlier work, August et al. (2022) evaluated three different Natural Language Generation systems on their ability to generate fluent, relevant, and factual scientific definitions. As part of the ReproHum project (Belz et al., 2023), we carried out a partial reproduction study of their human evaluation procedure, focusing on human fluency ratings. Following the standardised ReproHum procedure, our reproduction study follows the original study as closely as possible, with two raters providing 300 ratings each. In addition to this, we carried out a second study where we collected ratings from eight additional raters and analysed the variability of the ratings. We successfully reproduced the inferential statistics from the original study (i.e. the same hypotheses were supported), albeit with a lower inter-annotator agreement. The remainder of our paper shows significant variation between different raters, raising questions about what it really means to reproduce human evaluation studies.

pdf abs
ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text
Tanvi Dinkar | Gavin Abercrombie | Verena Rieser

ReproHum is a large multi-institution project designed to examine the reproducibility of human evaluations of natural language processing. As part of the second phase of the project, we attempt to reproduce an evaluation of the fluency of continuations generated by a pre-trained language model compared to a range of baselines. Working within the constraints of the project, with limited information about the original study, and without access to their participant pool, or the responses of individual participants, we find that we are not able to reproduce the original results. Our participants display a greater tendency to prefer one of the system responses, avoiding a judgement of ‘equal fluency’ more than in the original study. We also conduct further evaluations: we elicit ratings (1) from a broader range of participants; (2) from the same participants at different times; and (3) with an altered definition of fluency. Results of these experiments suggest that the original evaluation collected too few ratings, and that the task formulation may be quite ambiguous. Overall, although we were able to conduct a re-evaluation study, we conclude that the original evaluation was not comprehensive enough to make truly meaningful comparisons.

pdf abs
ReproHum #0927-3: Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method
Javier González Corbelle | Ainhoa Vivel Couso | Jose Maria Alonso-Moral | Alberto Bugarín-Diz

This paper presents a reproduction study aimed at reproducing and validating a human NLP evaluation performed for the DExperts text generation method. The original study introduces DExperts, a controlled text generation method, evaluated using non-toxic prompts from the RealToxicityPrompts dataset. Our reproduction study aims to reproduce the human evaluation of the continuations generated by DExperts in comparison with four baseline methods, in terms of toxicity, topicality, and fluency. We first describe the agreed approach for reproduction within the ReproHum project and detail the configuration of the original evaluation, including necessary adaptations for reproduction. Then, we make a comparison of our reproduction results with those reported in the reproduced paper. Interestingly, we observe how the human evaluators in our experiment appreciate higher quality in the texts generated by DExperts in terms of less toxicity and better fluency. All in all, new scores are higher, also for the baseline methods. This study contributes to ongoing efforts in ensuring the reproducibility and reliability of findings in NLP evaluation and emphasizes the critical role of robust methodologies in advancing the field.

pdf abs
ReproHum #1018-09: Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems
Filip Klubička | John D. Kelleher

This paper describes a reproduction of a human evaluation study evaluating redundancies in automatically generated text from a data-to-text system. While the scope of the original study is broader, a human evaluation (a manual error analysis) is included as part of the system evaluation. We attempt a reproduction of this human evaluation; however, while the authors annotate multiple properties of the generated text, we focus exclusively on a single quality criterion, that of redundancy. In focusing our study on a single minimal reproducible experimental unit, with the experiment being fairly straightforward and all data made available by the authors, we encountered no challenges with our reproduction and were able to reproduce the trend found in the original experiment. However, while still confirming the general trend, we found that both our annotators identified twice as many errors in the dataset as the original authors.

pdf abs
ReproHum#0043: Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset
Vivian Fresen | Mei-Shin Wu-Urbanek | Steffen Eger

This study, conducted as part of the ReproHum project, aimed to replicate and evaluate the experiment presented in “Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization” by Feng et al. (2021). By employing DialoGPT, BART, and PGN models, the study assessed dialogue summarization’s informativeness. Based on the ReproHum project’s baselines, we conducted a human evaluation for the AMI dataset, aiming to compare the results of the original study with our own experiments. Our objective is to contribute to the research on human evaluation and the reproducibility of the original study’s findings in the field of Natural Language Processing (NLP). Through this endeavor, we seek to enhance understanding and establish reliable benchmarks in human evaluation methodologies within the NLP domain.

pdf abs
ReproHum #0712-01: Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation”
Mohammad Arvan | Natalie Parde

Human evaluations are indispensable in the development of NLP systems because they provide direct insights into how effectively these systems meet real-world needs and expectations. Ensuring the reproducibility of these evaluations is vital for maintaining credibility in natural language processing research. This paper presents our reproduction of the human evaluation experiments conducted by Hosking et al. (2022) for their paraphrase generation approach. Through careful replication we found that our results closely align with those in the original study, indicating a high degree of reproducibility.

pdf abs
ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
Lewis N. Watson | Dimitra Gkatzia

Reproducibility is a cornerstone of scientific research, ensuring the reliability and generalisability of findings. The ReproNLP Shared Task on Reproducibility of Evaluations in NLP aims to assess the reproducibility of human evaluation studies. This paper presents a reproduction study of the human evaluation experiment presented in “Hierarchical Sketch Induction for Paraphrase Generation” by Hosking et al. (2022). The original study employed a human evaluation on Amazon Mechanical Turk, assessing the quality of paraphrases generated by their proposed model using three criteria: meaning preservation, fluency, and dissimilarity. In our reproduction study, we focus on the meaning preservation criterion and utilise the Prolific platform for participant recruitment, following the ReproNLP challenge’s common approach to reproduction. We discuss the methodology, results, and implications of our reproduction study, comparing them to the original findings. Our findings contribute to the understanding of reproducibility in NLP research and highlight the potential impact of platform changes and evaluation criteria on the reproducibility of human evaluation studies.

pdf abs
ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
Mateusz Lango | Patricia Schmidtova | Simone Balloccu | Ondrej Dusek

In this paper, we describe several reproductions of a human evaluation experiment measuring the quality of automatic dialogue summarization (Feng et al., 2021). We investigate the impact of the annotators’ highest level of education, field of study, and native language on the evaluation of the informativeness of the summary. We find that the evaluation is relatively consistent regardless of these factors, but the biggest impact seems to be a prior specific background in natural language processing (as opposed to, e.g. a background in computer science). We also find that the experiment setup (asking for single vs. multiple criteria) may have an impact on the results.

pdf abs
ReproHum #0033-3: Comparable Relative Results with Lower Absolute Values in a Reproduction Study
Yiru Li | Huiyuan Lai | Antonio Toral | Malvina Nissim

In the context of the ReproHum project aimed at assessing the reliability of human evaluation, we replicated the human evaluation conducted in “Generating Scientific Definitions with Controllable Complexity” by August et al. (2022). Specifically, humans were asked to assess the fluency of automatically generated scientific definitions by three different models, with output complexity varying according to target audience. Evaluation conditions were kept as close as possible to the original study, except for necessary minor adjustments. Our results, despite yielding lower absolute performance, show that relative performance across the three tested systems remains comparable to what was observed in the original paper. On the basis of lower inter-annotator agreement and feedback received from annotators in our experiment, we also observe that the ambiguity of the concept being evaluated may play a substantial role in human assessment.

pdf abs
ReproHum #0124-03: Reproducing Human Evaluations of end-to-end approaches for Referring Expression Generation
Saad Mahamood

In this paper we describe our attempt to reproduce a single quality criterion of the human evaluation that was conducted in the paper “NeuralREG: An end-to-end approach to referring expression generation”. In particular, this paper describes the approach and challenges involved in reproducing the human evaluation as done by the original authors of the paper, the results obtained, and what insights we have gained from attempting this particular reproduction. We hope these insights will enable refinements to how human evaluations are documented by author(s) and enable better reproductions of NLP experiments in the future.

pdf abs
ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
Tyler Loakman | Chenghua Lin

This paper describes a partial reproduction of the work titled “Generating Fact Checking Explanations” by Atanasova et al. (2020) as part of the ReproHum element within the ReproNLP shared task, aimed at reproducing findings in NLP research related to human evaluation. The task investigates whether NLP research is becoming more or less reproducible over time. Following instructions from the task organizers and the original authors, we gathered relative rankings for three fact-checking explanations (including a gold standard and outputs from two models) for 40 inputs based on the criterion of Coverage. Our reproduction and reanalysis of the original study’s raw results support the initial findings, showing similar patterns between the original work and our reproduction. Though we observed slight variations from the original results, our findings align with the main conclusions drawn by the original authors regarding the effectiveness of their proposed models.

pdf abs
ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG
Irene Mondella | Huiyuan Lai | Malvina Nissim

In spite of the core role human judgement plays in evaluating the performance of NLP systems, the way human assessments are elicited in NLP experiments, and to some extent the nature of human judgement itself, pose challenges to the reliability and validity of human evaluation. In the context of the larger ReproHum project, aimed at running large scale multi-lab reproductions of human judgement, we replicated the understandability assessment by humans on several generated outputs of simplified text described in the paper “Neural Text Simplification of Clinical Letters with a Domain Specific Phrase Table” by Shardlow and Nawaz, which appeared in the Proceedings of ACL 2019. Although we had to implement a series of modifications compared to the original study, which were necessary to run our human evaluation on exactly the same data, we managed to collect assessments and compare results with the original study. We obtained results consistent with those of the reference study, confirming their findings. The paper is complete with as much information as possible to foster and facilitate future reproduction.

pdf abs
ReproHum #0087-01: A Reproduction Study of the Human Evaluation of the Coverage of Fact Checking Explanations
Mingqi Gao | Jie Ruan | Xiaojun Wan

We present a reproduction study of the human evaluation of the coverage of fact checking explanations conducted by Atanasova et al. (2020), as a team in Track B of ReproNLP 2024. The setup of our reproduction study is almost the same as the original study, with some necessary modifications to the evaluation guideline and annotation interface. Our reproduction achieves a higher IAA of 0.20 compared to the original study’s 0.12, but discovers a mismatch between the IAA calculated by us with the raw annotation in the original study and the IAA reported in the original paper. Additionally, our reproduction results on the ranks of three types of explanations are drastically different from the original experiment, meaning that one important conclusion in the original paper cannot be confirmed at all. The case study illustrates that the annotators in the reproduction study may understand the quality criterion differently from the annotators in the original study.

pdf abs
ReproHum #0866-04: Another Evaluation of Readers’ Reactions to News Headlines
Zola Mahlaza | Toky Hajatiana Raboanary | Kyle Seakgwa | C. Maria Keet

The reproduction of Natural Language Processing (NLP) studies is important in establishing their reliability. Nonetheless, many papers in NLP have never been reproduced. This paper presents a reproduction of Gabriel et al. (2022)’s work to establish the extent to which their findings, pertaining to the utility of large language models (T5 and GPT2) to automatically generate writer’s intents when given headlines to curb misinformation, can be confirmed. Our results show no evidence to support two of their four findings and they partially support the rest of the original findings. Specifically, while we confirmed that all the models are judged to be capable of influencing readers’ trust or distrust, there was a difference in T5’s capability to reduce trust. Our results show that its generations are more likely to have greater influence in reducing trust while Gabriel et al. (2022) found more cases where they had no impact at all. In addition, most of the model generations are considered socially acceptable only if we relax the criteria for determining a majority to mean more than chance rather than the apparent > 70% of the original study. Overall, while they found that “machine-generated MRF implications alongside news headlines to readers can increase their trust in real news while decreasing their trust in misinformation”, we found that they are more likely to decrease trust in both cases vs. having no impact at all.

pdf (full)
bib (full) Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024

pdf bib abs
The MEET Corpus: Collocated, Distant and Hybrid Three-party Meetings with a Ranking Task
Ghazaleh Esfandiari-Baiat | Jens Edlund

We introduce the MEET corpus. The corpus was collected with the aim of systematically studying the effects of collocated (physical), remote (digital) and hybrid work meetings on collaborative decision-making. It consists of 10 sessions, where each session contains three recordings: a collocated, a remote and a hybrid meeting between three participants. The participants work on a different survival ranking task during each meeting. The duration of each meeting ranges from 10 to 18 minutes, resulting in 380 minutes of conversation altogether. We also present the annotation scheme designed specifically to target our research questions. The recordings are currently being transcribed and annotated in accordance with this scheme.

pdf bib abs
MSNER: A Multilingual Speech Dataset for Named Entity Recognition
Quentin Meeus | Marie-Francine Moens | Hugo Van hamme

While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It provides annotations to the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We are also releasing an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. This results in 590 and 15 hours of silver-annotated speech for training and validation, alongside a 17-hour, manually-annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.

pdf abs
Attitudes in Diplomatic Speeches: Introducing the CoDipA UNSC 1.0
Mariia Anisimova | Šárka Zikánová

This paper presents CoDipA UNSC 1.0, a Corpus of Diplomatic Attitudes of the United Nations Security Council annotated with the attitude-part of the Appraisal theory. The speeches were manually selected according to topic-related and temporal criteria. The texts were then annotated according to the predefined annotation scenario. The distinguishing features of the diplomatic texts require a modified approach to attitude evaluation, which was implemented and presented in the current work. The corpus analysis has proven diplomatic speeches to be consistently evaluative, offered an overview of the most prominent means of expressing subjectivity in the corpus, and provided the results of the inter-annotator agreement evaluation.

pdf abs
Automatic Alignment of Discourse Relations of Different Discourse Annotation Frameworks
Yingxue Fu

Existing discourse corpora are annotated based on different frameworks, which show significant dissimilarities in definitions of arguments and relations and structural constraints. Despite surface differences, these frameworks share basic understandings of discourse relations. The relationship between these frameworks has been an open research question, especially the correlation between relation inventories utilized in different frameworks. Better understanding of this question is helpful for integrating discourse theories and enabling interoperability of discourse corpora annotated under different frameworks. However, studies that explore correlations between discourse relation inventories are hindered by different criteria of discourse segmentation, and expert knowledge and manual examination are typically needed. Some semi-automatic methods have been proposed, but they rely on corpora annotated in multiple frameworks in parallel. In this paper, we introduce a fully automatic approach to address the challenges. Specifically, we extend the label-anchored contrastive learning method introduced by Zhang et al. (2022b) to learn label embeddings during discourse relation classification. These embeddings are then utilized to map discourse relations from different frameworks. We show experimental results on RST-DT (Carlson et al., 2001) and PDTB 3.0 (Prasad et al., 2018).

pdf abs
A New Annotation Scheme for the Semantics of Taste
Teresa Paccosi | Sara Tonelli

This paper introduces a new annotation scheme for the semantics of gustatory language in English, which builds upon a previous framework for olfactory language based on frame semantics. The purpose of this annotation framework is to be used for annotating comparable resources for the study of sensory language and to create training datasets for supervised systems aimed at extracting sensory information. Furthermore, our approach incorporates words from specific historical periods, thereby enhancing the framework’s utility for studying language from a diachronic perspective.

pdf abs
What to Annotate: Retrieving Lexical Markers of Conspiracy Discourse from an Italian-English Corpus of Telegram Data
Costanza Marini | Elisabetta Jezek

In this age of social media, Conspiracy Theories (CTs) have become an issue that can no longer be ignored. After providing an overview of CT literature and corpus studies, we describe the creation of a 40,000-token English-Italian bilingual corpus of conspiracy-oriented Telegram comments – the Complotto corpus – and the linguistic analysis we performed using the Sketch Engine online platform (Kilgarriff et al., 2010) on our annotated data to identify statistically relevant linguistic markers of CT discourse. Thanks to the platform’s keywords and key terms extraction functions, we were able to assess the statistical significance of the following lexical and semantic phenomena, both cross-linguistically and cross-CT, namely: (1) evidentiality and epistemic modality markers; (2) debunking vocabulary referring to another version of the truth lying behind the official one; (3) the conceptual metaphor INSTITUTIONS ARE ABUSERS. All these features qualify as markers of CT discourse and have the potential to be effectively used for future semantic annotation tasks to develop automatic systems for CT identification.

pdf abs
Lightweight Connective Detection Using Gradient Boosting
Mustafa Erolcan Er | Murathan Kurfalı | Deniz Zeyrek

In this work, we introduce a lightweight discourse connective detection system. Employing gradient boosting trained on straightforward, low-complexity features, this proposed approach sidesteps the computational demands of the current approaches that rely on deep neural networks. Considering its simplicity, our approach achieves competitive results while offering significant gains in terms of time even on CPU. Furthermore, the stable performance across two unrelated languages suggests the robustness of our system in the multilingual scenario. The model is designed to support the annotation of discourse relations, particularly in scenarios with limited resources, while minimizing performance loss.

pdf abs
Shallow Discourse Parsing on Twitter Conversations
Berfin Aktas | Burak Özmen

We present our PDTB-style annotations on conversational Twitter data, which was initially annotated by Scheffler et al. (2019). We introduced 1,043 new annotations to the dataset, nearly doubling the number of previously annotated discourse relations. Subsequently, we applied a neural Shallow Discourse Parsing (SDP) model to the resulting corpus, improving its performance through retraining with in-domain data. The most substantial improvement was observed in the sense identification task (+19%). Our experiments with diverse training data combinations underline the potential benefits of exploring various data combinations in domain adaptation efforts for SDP. To the best of our knowledge, this is the first application of Shallow Discourse Parsing on Twitter data.

pdf abs
Search tool for An Event-Type Ontology
Nataliia Petliak | Cristina Fernandéz Alcaina | Eva Fučíková | Jan Hajič | Zdeňka Urešová

This short demo description paper presents a new tool designed for searching an event-type ontology with rich information, demonstrated on the SynSemClass ontology resource. The tool complements a web browser previously created by the authors of the SynSemClass ontology. Due to the complexity of the resource, the search tool offers possibilities both for linguistically oriented researchers and for teams working with the resource from a technical point of view, such as those building role labeling tools, automatic annotation tools, etc.

pdf abs
Tiny But Mighty: A Crowdsourced Benchmark Dataset for Triple Extraction from Unstructured Text
Muhammad Salman | Armin Haller | Sergio J. Rodriguez Mendez | Usman Naseem

In the context of Natural Language Processing (NLP) and Semantic Web applications, constructing Knowledge Graphs (KGs) from unstructured text plays a vital role. Several techniques have been developed for KG construction from text, but the lack of standardized datasets hinders the evaluation of triple extraction methods. The evaluation of existing KG construction approaches is based on structured data or manual investigations. To overcome this limitation, this work introduces a novel dataset specifically designed to evaluate KG construction techniques from unstructured text. Our dataset consists of a diverse collection of compound and complex sentences meticulously annotated by human annotators with potential triples (subject, verb, object). The annotations underwent further scrutiny by expert ontologists to ensure accuracy and consistency. For evaluation purposes, the proposed F-measure criterion offers a robust approach to quantify the relatedness and assess the alignment between extracted triples and the ground-truth triples, providing a valuable tool for evaluating the performance of triple extraction systems. By providing a diverse collection of high-quality triples, our proposed benchmark dataset offers a comprehensive training and evaluation set for refining the performance of state-of-the-art language models on a triple extraction task. Furthermore, this dataset encompasses various KG-related tasks, such as named entity recognition, relation extraction, and entity linking.

pdf abs
Less is Enough: Less-Resourced Multilingual AMR Parsing
Bram Vanroy | Tim Van de Cruys

This paper investigates the efficacy of multilingual models for the task of text-to-AMR parsing, focusing on English, Spanish, and Dutch. We train and evaluate models under various configurations, including monolingual and multilingual settings, both in full and reduced data scenarios. Our empirical results reveal that while monolingual models exhibit superior performance, multilingual models are competitive across all languages, offering a more resource-efficient alternative for training and deployment. Crucially, our findings demonstrate that AMR parsing benefits from transfer learning across languages even when having access to significantly smaller datasets. As a tangible contribution, we provide text-to-AMR parsing models for the aforementioned languages as well as multilingual variants, and make available the large corpora of translated data for Dutch, Spanish (and Irish) that we used for training them in order to foster AMR research in non-English languages. Additionally, we open-source the training code and offer an interactive interface for parsing AMR graphs from text.

This paper presents MoCCA, a Model of Comparative Concepts for Aligning Constructicons under development by a consortium of research groups building Constructicons of different languages including Brazilian Portuguese, English, German and Swedish. The Constructicons will be aligned by using comparative concepts (CCs) providing language-neutral definitions of linguistic properties. The CCs are drawn from typological research on grammatical categories and constructions, and from FrameNet frames, organized in a conceptual network. Language-specific constructions are linked to the CCs in accordance with general principles. MoCCA is organized into files of two types: a largely static CC Database file and multiple Linking files containing relations between constructions in a Constructicon and the CCs. Tools are planned to facilitate visualization of the CC network and linking of constructions to the CCs. All files and guidelines will be versioned, and a mechanism is set up to report cases where a language-specific construction cannot be easily linked to existing CCs.

pdf abs
ISO 24617-8 Applied: Insights from Multilingual Discourse Relations Annotation in English, Polish, and Portuguese
Aleksandra Tomaszewska | Purificação Silvano | António Leal | Evelin Amorim

The main objective of this study is to contribute to multilingual discourse research by employing ISO-24617 Part 8 (Semantic Relations in Discourse, Core Annotation Schema – DR-core) for annotating discourse relations. Centering around a parallel discourse relations corpus that includes English, Polish, and European Portuguese, we initiate one of the few ISO-based comparative analyses through a multilingual corpus that aligns discourse relations across these languages. In this paper, we discuss the project’s contributions, including the annotated corpus, research findings, and statistics related to the use of discourse relations. The paper further discusses the challenges encountered in complying with the ISO standard, such as defining the scope of arguments and annotating specific relation types like Expansion. Our findings highlight the necessity for clearer definitions of certain discourse relations and more precise guidelines for argument spans, especially concerning the inclusion of connectives. Additionally, the study underscores the importance of ongoing collaborative efforts to broaden the inclusion of languages and more comprehensive datasets, with the objective of widening the reach of ISO-guided multilingual discourse research.

pdf abs
Combining semantic annotation schemes through interlinking
Harry Bunt

This paper explores the possibilities of using combinations of different semantic annotation schemes. This is particularly interesting for annotation schemes developed under the umbrella of the ISO Semantic Annotation Framework (ISO 24617), since these schemes were intended to be complementary, providing ways of indicating different semantic information about the same entities. However, there are certain overlaps between the schemes of SemAF parts, due to overlaps of their semantic domains, which are a potential source of inconsistencies. The paper shows how issues relating to inconsistencies can be addressed at the levels of concrete representation, abstract syntax, and semantic interpretation.

pdf abs
Fusing ISO 24617-2 Dialogue Acts and Application-Specific Semantic Content Annotations
Andrei Malchanau | Volha Petukhova | Harry Bunt

Accurately annotated data determines whether a modern high-performing AI/ML model will present a suitable solution to a complex task/application challenge, or whether time and resources are wasted. The more adequately the structure of the incoming data is specified, the more efficiently the data is translated for use by the application. This paper presents an approach to an application-specific dialogue semantics design which integrates the dialogue act annotation standard ISO 24617-2 and various domain-specific semantic annotations. The proposed multi-scheme design offers a plausible and rather powerful strategy to integrate, validate, extend and reuse existing annotations, and automatically generate code for dialogue system modules. Advantages and possible trade-offs are discussed.

pdf abs
Annotation-Based Semantics for Dialogues in the Vox World
Kiyong Lee

This paper aims at enriching Annotation-Based Semantics (ABS) with the notion of small visual worlds, called the Vox worlds, to interpret dialogues in natural language. It attempts to implement classical set-theoretic models with these Vox worlds that serve as interpretation models. These worlds describe dialogue situations while providing background for the visualization of those situations in which these described dialogues take place interactively among dialogue participants, often triggering actions and emotions. The enriched ABS is based on VoxML, a modeling language for visual object conceptual structures (vocs or vox) that constitute the structural basis of visual worlds.

pdf abs
Annotating Evaluative Language: Challenges and Solutions in Applying Appraisal Theory
Jiamei Zeng | Min Dong | Alex Chengyu Fang

This article describes a corpus-based experiment to identify the challenges and solutions in the annotation of evaluative language according to the scheme defined in Appraisal Theory (Martin and White, 2005). Originating from systemic functional linguistics, Appraisal Theory provides a robust framework for the analysis of linguistic expressions of evaluation, stance, and interpersonal relationships. Despite its theoretical richness, the practical application of Appraisal Theory in text annotation presents significant challenges, chiefly due to the intricacies of identifying and classifying evaluative expressions within its sub-system of Attitude, which comprises Affect, Judgement, and Appreciation. This study examines these challenges through the annotation of a corpus of editorials related to the Russia-Ukraine conflict and aims to offer practical solutions to enhance the transparency and consistency of the annotation. By refining the annotation process and addressing the subjectivity inherent in the identification and classification of evaluative language, this work represents a timely effort in the annotation of pragmatic knowledge in language resources.

pdf abs
Attractive Multimodal Instructions, Describing Easy and Engaging Recipe Blogs
Ielka van der Sluis | Jarred Kiewiet de Jonge

This paper presents a corpus study that extends and generalises an existing annotation model which integrates functional content descriptions delivered via text, pictures and interactive components. The model is used to describe a new corpus with 20 online vegan recipe blogs in terms of their Attractiveness for at least two types of readers: vegan readers and readers interested in a vegan lifestyle. Arguably, these readers value a blog that shows that the target dish is Easy to Make which can be inferred from the number of ingredients, procedural steps and visualised actions, according to an Easy to Read cooking instruction that displays a coherent use of verbal and visual modalities presenting processes and results of the cooking actions involved. Moreover, added value may be attributed to invitations to Engage with the blog content and functionality through which information about the recipe, the author, diet and nutrition can be accessed. Thus, the corpus study merges generalisable annotations of verbal, visual and interaction phenomena to capture the Attractiveness of online vegan recipe blogs to inform reader and user studies and ultimately offer guidelines for authoring effective online multimodal instructions.

pdf (full)
bib (full) Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024

This article proposes a linguistic linked open data model for diachronic analysis (LLODIA) that combines data derived from diachronic analysis of multilingual corpora with dictionary-based evidence. A humanities use case was devised as a proof of concept that includes examples in five languages (French, Hebrew, Latin, Lithuanian and Romanian) related to various meanings of the term “revolution” considered at different time intervals. The examples were compiled through diachronic word embedding and dictionary alignment.

pdf bib abs
Cross-Lingual Ontology Matching using Structural and Semantic Similarity
Shubhanker Banerjee | Bharathi Raja Chakravarthi | John Philip McCrae

The development of ontologies in various languages is attracting attention as the amount of multilingual data available on the web increases. Cross-lingual ontology matching facilitates interoperability amongst ontologies in different languages. Although supervised machine learning-based methods have shown good performance on ontology matching, their application to the cross-lingual setting is limited by the availability of training data. Current state-of-the-art unsupervised methods for cross-lingual ontology matching focus on lexical similarity between entities. These approaches follow a two-stage pipeline: in the first step, the entities are translated into a common language using a translation service; in the second step, lexical similarity between the translations is computed to match the entities. In this paper we introduce a novel ontology matching method based on the fusion of structural similarity and cross-lingual semantic similarity. We carry out experiments using 3 language pairs and report substantial improvements over the performance of the lexical methods, thus showing the effectiveness of our proposed approach. To the best of our knowledge, this is the first work which tackles the problem of unsupervised ontology matching in the cross-lingual setting by leveraging both structural and semantic embeddings.

pdf abs
Querying the Lexicon der indogermanischen Verben in the LiLa Knowledge Base: Two Use Cases
Valeria Irene Boano | Marco Passarotti | Riccardo Ginevra

This paper presents two use cases of the etymological data provided by the *Lexicon der indogermanischen Verben* (LIV) after their publication as Linked Open Data and their linking to the LiLa Knowledge Base (KB) of interoperable linguistic resources for Latin. The first part of the paper briefly describes the LiLa KB and its structure. Then, the LIV and the information it contains are introduced, followed by a short description of the ontologies and the extensions used for modelling the LIV’s data and interlinking them to the LiLa ecosystem. The last section details the two use cases. The first case concerns the inflection types of the Latin verbs that reflect Proto-Indo-European stems, while the second one focusses on the Latin derivatives of the inherited stems. The results of the investigations are put in relation to current research topics in Historical Linguistics, demonstrating their relevance to the discipline.

pdf abs
Defining an Ontology for Museum Critical Cataloguing Terminology Guidelines
Erin Canning

This paper presents the proposed ontology for the project Computational Approaches for Addressing Problematic Terminology (CAAPT). This schema seeks to represent contents and structure of language guideline documents produced by cultural heritage institutions seeking to engage with critical cataloguing or reparative description work, known as terminology guidance documents. It takes the Victoria & Albert Museum’s Terminology Guidance Document as a source for the initial modelling work. Ultimately, CAAPT seeks to expand the knowledge graph beyond the V&A Museum context to incorporate additional terminology guidance documents and linked open data vocabularies. The ontology seeks to bring together scholarly communities in areas relevant to this project, most notably those in cultural heritage and linguistics linked open data, by leveraging existing linked data resources in these areas: as such, OntoLex, CIDOC CRM, and SKOS are used as a foundation for this work, along with a proposed schema from a related project, CULCO. As the CAAPT project is in early stages, this paper presents the preliminary results of work undertaken thus far in order to seek feedback from the linguistics linked open data community.

pdf abs
The MOLOR Lemma Bank: a New LLOD Resource for Old Irish
Theodorus Fransen | Cormac Anderson | Sacha Beniamine | Marco Passarotti

This paper describes the first steps in creating a Lemma Bank for Old Irish (600–900 CE) within the Linked Data paradigm, taking inspiration from a similar resource for Latin built as part of the LiLa project (2018–2023). The focus is on the extraction and RDF conversion of nouns from Goidelex, a novel and highly structured morphological resource for Old Irish. The aim is to strike a good balance between retaining a representative level of morphological granularity and at the same time keeping the number of lemma variants within workable limits, to facilitate straightforward resource interlinking for Old Irish, planned as future work.

This paper presents the development of CHAMUÇA, a novel lexical resource designed to document the influence of the Portuguese language on various Asian languages, with an initial focus on the languages of South Asia. Through the utilization of linked open data and the OntoLex vocabulary, CHAMUÇA offers structured insights into the linguistic characteristics and cultural ramifications of Portuguese borrowings across multiple languages. The article outlines CHAMUÇA’s potential contributions to the linguistic linked data community, emphasising its role in addressing the scarcity of resources for lesser-resourced languages and serving as a test case for organising etymological data in a queryable format. CHAMUÇA emerges as an initiative, based on linked data technology, towards the comprehensive cataloguing and analysis of Portuguese borrowings, offering valuable insights into language contact dynamics, historical evolution, and cultural exchange in Asia.

We are presenting LODinG – Linked Open Data in the Humanities (abbreviated from Linked Open Data in den Geisteswissenschaften), a recently launched research initiative exploring the intersection of Linked Open Data (LOD) and a range of areas of work within the Humanities. We focus on effective methods of collecting, modeling, linking, releasing and analyzing machine-readable information relevant to (digital) humanities research in the form of LOD. LODinG combines the sources and methods of digital humanities, general and computational linguistics, digital lexicography, German and Romance philology, translatology, cultural and literary studies, media studies, information science and law to explore and expand the potential of the LOD paradigm for such a diverse and multidisciplinary field. The project’s primary objectives are to improve the methods of extracting, modeling and analyzing multilingual data in the LOD paradigm; to demonstrate the application of the linguistic LOD to various methods and domains within and beyond the humanities; and to develop a modular, cross-domain data model for the humanities.

Over the past few years, the deployment of Linked Open Data (LOD) technologies has witnessed significant advancements across a myriad of sectors, linguistics included. This progression is characterized by an exponential increase in the conversion of resources to adhere to contemporary encoding standards. Such transformations are driven by the objectives outlined in “ecological” methodologies, notably the FAIR data principles, which advocate for the reuse and interoperability of resources. This paper presents the DigItAnt architecture, developed in the context of a national project funded by the Italian Ministry of Research and in the service of a recently started Italian endeavor to realize a federation of infrastructures for the humanities. It details its services, utilities and data types, and shows how it manages to produce, exploit and interlink LLOD and non-LLOD datasets in ways that are meaningful to its intended target disciplinary context, i.e. historical linguistics over epigraphy data. The paper also introduces how DigItAnt services and functionalities will contribute to the empowerment of the H2IOSC Italian infrastructures cluster project, which is devoted to the construction of a nationwide research infrastructure federation for the humanities, and it will possibly contribute to its pilot project towards an authoritative LLOD platform.

pdf abs
Teanga Data Model for Linked Corpora
John P. McCrae | Priya Rani | Adrian Doyle | Bernardo Stearns

Corpus data is the main source of data for natural language processing applications; however, no standard or model for corpus data has become predominant in the field. Linguistic linked data aims to provide methods by which data can be made findable, accessible, interoperable and reusable (FAIR). However, current attempts to create a linked data format for corpora have been unsuccessful due to the verbose and specialised formats that they use. In this work, we present the Teanga data model, which uses a layered annotation model to capture all NLP-relevant annotations. We present the YAML serialization of the model, which is concise and uses a widely-deployed format, and we describe how this can be interpreted as RDF. Finally, we demonstrate three examples of the use of the Teanga data model for syntactic annotation, literary analysis and multilingual corpora.

pdf abs
The Services of the LiLa Knowledge Base of Interoperable Linguistic Resources for Latin
Marco Passarotti | Francesco Mambrini | Giovanni Moretti

This paper describes three online services designed to ease the tasks of querying and populating the linguistic resources for Latin made interoperable through their publication as Linked Open Data in the LiLa Knowledge Base. As for querying the KB, we present an interface to search the collection of lemmas that represents the core of the Knowledge Base, and an interactive, graphical platform to run queries on the resources currently interlinked. As for populating the KB with new textual resources, we describe a tool that performs automatic tokenization, lemmatization and Part-of-Speech tagging of a raw text in Latin and links its tokens to LiLa.

pdf abs
An Annotated Dataset for Transformer-based Scholarly Information Extraction and Linguistic Linked Data Generation
Vayianos Pertsas | Marialena Kasapaki | Panos Constantopoulos

We present a manually curated and annotated, multidisciplinary dataset of 15,262 sentences from research articles (abstract and main text) that can be used for transformer-based extraction from scholarly publications of three types of entities: 1) research methods, named entities of variable length, 2) research goals, entities that appear as textual spans of variable length with mostly fixed lexico-syntactic structure, and 3) research activities, entities that appear as textual spans of variable length with complex lexico-syntactic structure. We explore the capabilities of our dataset by using it for training/fine-tuning various ML and transformer-based models. We compare our fine-tuned models as well as LLM responses (ChatGPT 3.5) based on 10-shot learning, by measuring F1 scores in token-based, entity-based strict and entity-based partial evaluations across interdisciplinary and discipline-specific datasets in order to capture any possible differences in discipline-oriented writing styles. Results show that fine-tuning of transformer-based models significantly outperforms few-shot learning of LLMs such as ChatGPT, highlighting the significance of annotation datasets in such tasks. Our dataset can also be used as a source for linguistic linked data by itself. We demonstrate this by presenting indicative queries in SPARQL, executed over such an RDF knowledge graph.

pdf abs
Linguistic LOD for Interoperable Morphological Description
Michael Rosner | Maxim Ionov

Interoperability is a characteristic of a product or system that seamlessly works with another product or system and implies a certain level of independence from the context of use. Turning to language resources, interoperability is frequently cited as one important rationale underlying the use of LLOD representations and is generally regarded as highly desirable. In this paper we further elaborate this theme, distinguishing three different kinds of interoperability and providing practical implementations with examples from morphology.

pdf abs
Modeling linking between text and lexicon with OntoLex-Lemon: a case study of computational terminology for the Babylonian Talmud
Flavia Sciolette

This paper illustrates the first steps in the creation of a computational terminology for the Babylonian Talmud. After introducing the motivation and the state of the art, the paper explains the choice of using OntoLex-Lemon and the new FrAC module for encoding the attestations and quantitative data of the terminology extraction. After that, the Talmudic terminology base is introduced and an example entry with the above-mentioned data is shown. The scheme is motivated not only by the rich representation the model allows, but also by the future management of the link between text and lexical entries.

pdf abs
OntoLex Publication Made Easy: A Dataset of Verbal Aspectual Pairs for Bosnian, Croatian and Serbian
Ranka Stanković | Maxim Ionov | Medina Bajtarević | Lorena Ninčević

This paper introduces a novel language resource for retrieving and researching verbal aspectual pairs in BCS (Bosnian, Croatian, and Serbian) created using Linguistic Linked Open Data (LLOD) principles. As there is no resource to help learners of Bosnian, Croatian, and Serbian as foreign languages to recognize the aspect of a verb or its pairs, we have created a new resource that will provide users with information about the aspect, as well as the link to a verb’s aspectual counterparts. This resource also contains external links to monolingual dictionaries, Wordnet, and BabelNet. As this is a work in progress, our resource only includes verbs and their perfective pairs formed with prefixes “pro”, “od”, “ot”, “iz”, “is” and “na”. The goal of this project is to have a complete dataset of all the aspectual pairs in these three languages. We believe it will be useful for research in the field of aspectology, as well as machine translation and other NLP tasks. Using this resource as an example, we also propose a sustainable approach to publishing small to moderate LLOD resources on the Web, both in a user-friendly way and according to the Linked Data principles.

pdf abs
Towards Semantic Interoperability: Parallel Corpora as Linked Data Incorporating Named Entity Linking
Ranka Stanković | Milica Ikonić Nešić | Olja Perisic | Mihailo Škorić | Olivera Kitanović

The paper presents the results of the research related to the preparation of parallel corpora, focusing on transformation into RDF graphs using NLP Interchange Format (NIF) for linguistic annotation. We give an overview of the parallel corpus that was used in this case study, as well as the process of POS tagging, lemmatization, named entity recognition (NER), and named entity linking (NEL), which is implemented using Wikidata. In the first phase of NEL, main characters and places mentioned in novels are stored in Wikidata, and in the second phase they are linked with the occurrences of previously annotated entities in text. Next, we describe the data conversion to RDF and the incorporation of NIF annotations. Produced NIF files were evaluated through the exploration of triplestore using SPARQL queries. Finally, the bridging of Linked Data and Digital Humanities research is discussed, as well as some drawbacks related to the verbosity of transformation. The concept of semantic interoperability in the context of linked data and parallel corpora ensures that data exchanged between systems carries shared and well-defined meanings, enabling effective communication and understanding.

pdf (full)
bib (full) Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024

pdf bib
Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024
Ingo Siegert | Khalid Choukri

pdf bib abs
Compliance by Design Methodologies in the Legal Governance Schemes of European Data Spaces
Kossay Talmoudi | Khalid Choukri | Isabelle Gavanon

Creating novel ways of sharing data to boost the digital economy has been one of the growing priorities of the European Union. In order to realise a set of data-sharing modalities, the European Union funds several projects that aim to put in place Common Data Spaces. These infrastructures are set to be a catalyser for the data economy. However, many hurdles face their implementation. Legal compliance is still one of the major ambiguities of European Common Data Spaces and many initiatives intend to proactively integrate legal compliance schemes in the architecture of sectoral Data Spaces. The various initiatives must navigate a complex web of cross-cutting legal frameworks, including contract law, data protection, intellectual property, protection of trade secrets, competition law, European sovereignty, and cybersecurity obligations. As the conceptualisation of Data Spaces evolves and shows signs of differentiation from one sector to another, it is important to showcase the legal repercussions of the options of centralisation and decentralisation that can be observed in different Data Spaces. This paper will thus delve into their legal requirements and attempt to sketch out a stepping stone for understanding legal governance in data spaces.

pdf bib abs
A Legal Framework for Natural Language Model Training in Portugal
Ruben Almeida | Evelin Amorim

Recent advances in deep learning have promoted the advent of many computational systems capable of performing intelligent actions that, until then, were restricted to the human intellect. In the particular case of human languages, these advances allowed the introduction of applications like ChatGPT that are capable of generating coherent text without being explicitly programmed to do so. Instead, these models use large volumes of textual data to learn meaningful representations of human languages. Associated with these advances, concerns about copyright and data privacy infringements caused by these applications have emerged. Despite these concerns, the pace at which new natural language processing applications continued to be developed largely outpaced the introduction of new regulations. Today, communication barriers between legal experts and computer scientists lead to many unintentional legal infringements during the development of such applications. In this paper, a multidisciplinary team intends to bridge this communication gap and promote more compliant Portuguese NLP research by presenting a series of everyday NLP use cases, while highlighting the Portuguese legislation that may apply during their development.

pdf abs
Intellectual property rights at the training, development and generation stages of Large Language Models
Christin Kirchhübel | Georgina Brown

Large Language Models (LLMs) prompt new questions around Intellectual Property (IP): what is the IP status of the datasets used to train LLMs, the resulting LLMs themselves, and their outputs? The training needs of LLMs may be at odds with current copyright law, and there are active conversations around the ownership of their outputs. A report published by the House of Lords Committee following its inquiry into LLMs and generative AI criticises, among other things, the lack of government guidance, and stresses the need for clarity (through legislation, where appropriate) in this sphere. This paper considers the little guidance and caselaw there is involving AI more broadly to allow us to anticipate legal cases and arguments involving LLMs. Given the pre-emptive nature of this paper, it is not possible to provide comprehensive answers to these questions, but we hope to equip language technology communities with a more informed understanding of the current position with respect to UK copyright and patent law.

pdf abs
Ethical Issues in Language Resources and Language Technology – New Challenges, New Perspectives
Pawel Kamocki | Andreas Witt

This article elaborates on the authors’ contribution to the previous edition of the LREC conference, in which they proposed a tentative taxonomy of ethical issues that affect Language Resources (LRs) and Language Technology (LT) at the various stages of their lifecycle (conception, creation, use and evaluation). The proposed taxonomy was built around the following ethical principles: Privacy, Property, Equality, Transparency and Freedom. In this article, the authors would like to: 1) examine whether and how this taxonomy stood the test of time, in light of the recent developments in the legal framework and popularisation of Large Language Models (LLMs); 2) provide some details and a tentative checklist on how the taxonomy can be applied in practice; and 3) develop the taxonomy by adding new principles (Accountability; Risk Anticipation and Limitation; Reliability and Limited Confidence), to address the technological developments in LLMs and the upcoming Artificial Intelligence Act.

pdf abs
Legal and Ethical Considerations that Hinder the Use of LLMs in a Finnish Institution of Higher Education
Mika Hämäläinen

Large language models (LLMs) make it possible to solve many business problems more easily than ever before. However, embracing LLMs in an organization may be slowed down by ethical and legal considerations. In this paper, we describe some of the issues we have faced at our university while developing university-level NLP tools to empower teaching and study planning. The identified issues touch upon topics such as GDPR, copyright, user account management and fear of the new technology.

With the rise of Large Generative AI Models (LGAIMs), disinformation online has become more concerning than ever before. Within the super-election year 2024, mis- and disinformation can severely influence public opinion. To combat the increasing amount of disinformation online, humans need to be supported by AI-based tools to increase the effectiveness of detecting false content. This paper examines the critical intersection of the AI Act with the deployment of LGAIMs for disinformation detection and the implications from the researcher’s, deployer’s, and user’s perspectives. The utilization of LGAIMs for disinformation detection falls under the high-risk category defined in the AI Act, leading to several obligations that need to be followed after the enforcement of the AI Act. Among others, the obligations include risk management, transparency, and human oversight, which pose the challenge of finding adequate technical interpretations. Furthermore, the paper articulates the necessity for clear guidelines and standards that enable the effective, ethical, and legally compliant use of AI. The paper contributes to the discourse on balancing technological advancement with ethical and legal imperatives, advocating for a collaborative approach to utilizing LGAIMs in safeguarding information integrity and fostering trust in digital ecosystems.

pdf abs
Selling Personal Information: Data Brokers and the Limits of US Regulation
Denise DiPersio

A principal pillar of the US Blueprint for an AI Bill of Rights is data privacy, specifically, that individuals should be protected from abusive practices by data collectors and data aggregators, and that users should have control over how their personal information is collected and used. An area that spotlights the need for such protections is found in the common practices of data brokers who scrape, purchase, process and reassemble personal information in bulk and sell it for a variety of downstream uses. Such activities almost always occur in the absence of users’ knowledge or meaningful consent, yet they are legal under US law. This paper examines how data brokers operate, provides some examples of recent US regulatory actions taken against them, summarizes federal efforts to redress data broker practices and concludes that as long as there continues to be no comprehensive federal data protection and privacy scheme, efforts to control such behavior will have only a limited effect. This paper also addresses the limits of informed consent on the use of personal information in language resources and suggests a solution in a holistic approach to data protection and privacy across the data/development life cycle.

pdf abs
What Can I Do with this Data Point? Towards Modeling Legal and Ethical Aspects of Linguistic Data Collection and (Re-)use
Annett Jorschick | Paul T. Schrader | Hendrik Buschmeier

Linguistic data often inherits characteristics that limit open science practices such as data publication, sharing, and reuse. Part of the problem is researchers’ uncertainty about the legal requirements, which need to be considered at the beginning of study planning, when consent forms for participants, ethics applications, and data management plans need to be written. This paper presents a newly funded project that will develop a research data management infrastructure that will provide automated support to researchers in the planning, collection, storage, use, reuse, and sharing of data, taking into account ethical and legal aspects to encourage open science practices.

pdf abs
Data-Envelopes for Cultural Heritage: Going beyond Datasheets
Mrinalini Luthra | Maria Eskevich

Cultural heritage data is a rich source of information about historical and cultural development. When used with due understanding of its intrinsic complexity, it can both support research in the social sciences and humanities and become input for machine learning and artificial intelligence algorithms. In all cases, ethical and contextual considerations can be encouraged when the relevant information is provided in a clear and well-structured form to potential users before they begin to interact with the data. The proposed data-envelopes, based on existing documentation frameworks, address the particular needs and challenges of the cultural heritage field while combining machine-readability and user-friendliness. We develop and test the usability of data-envelopes on data from the Huygens Institute for History and Culture of the Netherlands. This paper presents the following contributions: i) we highlight the complexity of cultural heritage (CH) data, featuring the unique ethical and contextual considerations they entail; ii) we evaluate and compare existing dataset documentation frameworks, examining their suitability for CH datasets; iii) we introduce the “data-envelope”, a machine-readable adaptation of existing dataset documentation frameworks, to tackle the specificities of CH datasets. Its modular form is designed to serve not only the needs of machine learning (ML), but also and especially broader user groups varying from humanities scholars and governmental monitoring authorities to citizen scientists and the general public. Importantly, the data-envelope framework emphasises the legal and ethical dimensions of dataset documentation, facilitating compliance with evolving data protection regulations and enhancing the accountability of data stewardship in the cultural heritage sector. We discuss, and invite readers to further conversation on, the topic of ethical considerations and how different audiences should be informed about the importance of dataset documentation management and its context.

pdf abs
Emotional Toll and Coping Strategies: Navigating the Effects of Annotating Hate Speech Data
Maryam M. AlEmadi | Wajdi Zaghouani

Freedom of speech on online social media platforms often comes at the cost of hate speech production. Hate speech can be very harmful to the peace and development of societies, as it brings about conflict and encourages crime. To regulate hate speech content, moderators and annotators are employed. In our research, we look at the effects of prolonged exposure to hate speech on the mental and physical health of these annotators, as well as of researchers whose work revolves around the topic of hate speech. Through a review of the literature, we found that prolonged exposure to hate speech does mentally and physically impact annotators and researchers in this field. We also propose solutions to reduce these negative impacts, such as providing mental health services, fair labor practices, psychological assessments and interventions, as well as developing AI to assist in the process of hate speech detection.

pdf abs
User Perspective on Anonymity in Voice Assistants – A comparison between Germany and Finland
Ingo Siegert | Silas Rech | Tom Bäckström | Matthias Haase

This study investigates the growing importance of voice assistants, particularly focusing on their usage patterns and associated user characteristics, trust perceptions, and concerns about data security. While previous research has identified correlations between the use of voice assistants and trust in these technologies, as well as data security concerns, little evidence exists regarding the relationship between individual user traits and perceived trust and security concerns. The study design involves surveying various user attributes, including technical proficiency, personality traits, and experience with digital technologies, alongside attitudes toward and usage of voice assistants. A comparison between Germany and Finland is conducted to explore potential cultural differences. The findings aim to inform strategies for enhancing voice assistant acceptance, including the implementation of anonymization methods.

pdf (full)
bib (full) Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

pdf bib
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
Rachele Sprugnoli | Marco Passarotti

pdf bib abs
Goidelex: A Lexical Resource for Old Irish
Cormac Anderson | Sacha Beniamine | Theodorus Fransen

We introduce Goidelex, a new lexical database resource for Old Irish. Goidelex is an openly accessible relational database in CSV format, linked by formal relationships. The launch version documents 695 headwords with extensive linguistic annotations, including orthographic forms using a normalised orthography, automatically generated phonemic transcriptions, and information about morphosyntactic features, such as gender, inflectional class, etc. Metadata in JSON format, following the Frictionless standard, provides detailed descriptions of the tables and dataset. The database is designed to be fully compatible with the Paralex and CLDF standards and is interoperable with existing lexical resources for Old Irish such as CorPH and eDIL. It is suited to both qualitative and quantitative investigation into Old Irish morphology and lexicon, as well as to comparative research. This paper outlines the creation process, rationale, and resulting structure of the database.

pdf bib abs
Developing a Part-of-speech Tagger for Diplomatically Edited Old Irish Text
Adrian Doyle | John P. McCrae

POS-tagging is typically considered a fundamental text preprocessing task, with a variety of downstream NLP tasks and techniques being dependent on the availability of POS-tagged corpora. As such, POS-taggers are important precursors to further NLP tasks, and their accuracy can impact the potential accuracy of these dependent tasks. While a variety of POS-tagging methods have been developed which work well with modern languages, historical languages present orthographic and editorial challenges which require special attention. The effectiveness of POS-taggers developed for modern languages is reduced when applied to Old Irish, with its comparatively complex orthography and morphology. This paper examines some of the obstacles to POS-tagging Old Irish text, and shows that inconsistencies between extant annotated corpora reduce the quantity of data available for use in training POS-taggers. The development of a multi-layer neural network model for POS-tagging Old Irish text is described, and an experiment is detailed which demonstrates that this model outperforms a variety of off-the-shelf POS-taggers. Moreover, this model sets a new benchmark for POS-tagging diplomatically edited Old Irish text.

pdf abs
From YCOE to UD: Rule-based Root Identification in Old English
Luca Brigada Villa | Martina Giarda

In this paper we apply a set of rules to identify the root of a dependency tree, following the Universal Dependencies formalism and starting from the constituency annotation of the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE). This rule-based root-identification task represents the first step towards a rule-based automatic conversion of this valuable resource into the UD format. After presenting Old English and the annotated resources available for this language, we describe the different rules we applied and then we discuss the results and the errors.

pdf abs
Too Young to NER: Improving Entity Recognition on Dutch Historical Documents
Vera Provatorova | Marieke van Erp | Evangelos Kanoulas

Named entity recognition (NER) on historical texts is beneficial for the field of digital humanities, as it allows users to easily search for the names of people, places and other entities in digitised archives. While the task of historical NER in different languages has been gaining popularity in recent years, Dutch historical NER remains an underexplored topic. Using a recently released historical dataset from the Dutch Language Institute, we train three BERT-based models and analyse the errors to identify the main challenges. All three models outperform a contemporary multilingual baseline by a large margin on historical test data.

pdf abs
Towards Named-Entity and Coreference Annotation of the Hebrew Bible
Daniel G. Swanson | Bryce D. Bussert | Francis Tyers

Named-entity annotation refers to the process of specifying what real-world (or, at least, external-to-the-text) entities various names and descriptions within a text refer to. Coreference annotation, meanwhile, specifies what context-dependent words or phrases, such as pronouns, refer to. This paper describes an ongoing project to apply both of these to the Hebrew Bible, so far covering most of the book of Genesis, fully marking every person, place, object, and point in time which occurs in the text. The annotation process and possible future uses for the data are covered, along with the challenges involved in applying existing annotation guidelines to the Hebrew text.

The Latin language has received attention from the computational linguistics research community, which has built, over the years, several valuable resources, ranging from detailed annotated corpora to sophisticated tools for linguistic analysis. With the recent advent of large language models, researchers have also started developing models capable of generating vector representations of Latin texts. The performance of such models still lags behind that of models for modern languages, given the disparity in available data. In this paper, we present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani and thoroughly annotated by experts, intended for use in masked language modelling as well as in supervised natural language processing tasks.

pdf abs
The Rise and Fall of Dependency Parsing in Dante Alighieri’s Divine Comedy
Claudia Corbetta | Marco Passarotti | Giovanni Moretti

In this paper, we conduct parsing experiments on Dante Alighieri’s Divine Comedy, an Old Italian poem composed between 1306 and 1321 and organized into three Cantiche: Inferno, Purgatorio, and Paradiso. We perform parsing on subsets of the poem using both a Modern Italian training set and sections of the Divine Comedy itself to evaluate under which scenarios parsers achieve higher scores. We find that employing in-domain training data supports better results, leading to an increase of approximately +17% in Unlabeled Attachment Score (UAS) and +25-30% in Labeled Attachment Score (LAS). Subsequently, we provide brief commentary on the differences in scores achieved among subsections of Cantiche, and we conduct experimental parsing on a text from the same period and style as the Divine Comedy.

pdf abs
Unsupervised Authorship Attribution for Medieval Latin Using Transformer-Based Embeddings
Loic De Langhe | Orphee De Clercq | Veronique Hoste

We explore the potential of employing transformer-based embeddings in an unsupervised authorship attribution task for medieval Latin. The development of Large Language Models (LLMs) and recent advances in transfer learning alleviate many of the traditional issues associated with authorship attribution in lower-resourced (ancient) languages. Despite this, these methods remain heavily understudied within this domain. Concretely, we generate strong contextual embeddings using a variety of mono- and multilingual transformer models and use these as input for two unsupervised clustering methods: a standard agglomerative clustering algorithm and a self-organizing map. We show that these transformer-based embeddings can be used to generate high-quality and interpretable clusterings, resulting in an attractive alternative to the traditional feature-based methods.

pdf abs
“To Have the ‘Million’ Readers Yet”: Building a Digitally Enhanced Edition of the Bilingual Irish-English Newspaper an Gaodhal (1881-1898)
Oksana Dereza | Deirdre Ní Chonghaile | Nicholas Wolf

This paper introduces the ‘An Gaodhal’ project, which aims to serve the historically under-resourced and endangered language of Irish (known as Gaeilge) by providing new digital tools and resources. The initial goal of the project was the extraction of full text of ‘An Gaodhal’, a monthly bilingual Irish-English newspaper produced from 1881 to 1898, to the highest possible degree of accuracy via Optical Character Recognition (OCR), with a view to making its printed content searchable. The methodology applied toward achieving this goal yielded additional digital outputs including: 1. a new OCR model for the Irish language as printed in Cló Gaelach type; 2. a new OCR model for bilingual Irish-English content printed in Cló Gaelach and Roman types respectively; 3. a BART-based OCR post-correction model for historical bilingual Irish-English data; 4. a historical Irish training set for Named Entity Recognition (NER). All but the first of these four additional outputs appear to be the first of their kind. Each of the project outputs, including the full-text OCR outputs in ALTO XML format, is set for public release to enable open-access research. The paper also identifies the challenges historical Irish data poses to Natural Language Processing (NLP) in general and OCR in particular, and reports on project results and outputs to date. Finally, it contextualises the project within the wider field of NLP and considers its potential impact on under-resourced languages worldwide.

pdf abs
Introducing PaVeDa – Pavia Verbs Database: Valency Patterns and Pattern Comparison in Ancient Indo-European Languages
Silvia Luraghi | Alessio Palmero Aprosio | Chiara Zanchi | Martina Giuliani

The paper introduces PaVeDa, a resource that builds on the ValPaL database of verbs’ valency patterns and alternations by adding a number of ancient languages (completely absent from ValPaL) and a number of new features that enable direct comparison, both diachronic and synchronic. For each verb, ValPaL contains the basic frame and ideally all possible valency alternations allowed by the verb (e.g. passive, causative, reflexive etc.). In order to enable comparison among alternations, an additional level has been added, the alternation class, which overcomes the issue of comparing language-specific alternations added by individual contributors to ValPaL. ValPaL had typological comparison as its main aim, and data collection was variously carried out using questionnaires and secondary sources, drawing largely on native speakers’ intuition as reported by contributors. Working with ancient languages entails a methodological change, as the data is extracted from corpora. This has led to re-thinking the notion of valency as a usage-based feature of verbs and to planning future addition of corpus data to modern languages in the database. It further shows the impact of ancient languages on theoretical reflection.

pdf abs
Development of Robust NER Models and Named Entity Tagsets for Ancient Greek
Chiara Palladino | Tariq Yousef

This contribution presents a novel approach to the development and evaluation of transformer-based models for Named Entity Recognition and Classification in Ancient Greek texts. We trained two models with annotated datasets by consolidating potentially ambiguous entity types under a harmonized set of classes. Then, we tested their performance with out-of-domain texts, reproducing a real-world use case. Both models performed very well under these conditions, with the multilingual model being slightly superior to the monolingual one. In the conclusion, we emphasize current limitations due to the scarcity of high-quality annotated corpora and to the lack of cohesive annotation strategies for ancient languages.

pdf abs
Analysis of Glyph and Writing System Similarities Using Siamese Neural Networks
Claire Roman | Philippe Meyer

In this paper we use siamese neural networks to compare glyphs and writing systems. These deep learning models define distance-like functions and are used to explore and visualize the space of scripts by performing multidimensional scaling and clustering analyses. From 51 historical European, Mediterranean and Middle Eastern alphabets, we use a Ward-linkage hierarchical clustering and obtain 10 clusters of scripts including three isolated writing systems. To collect the glyph database we use the Noto family fonts that encode in a standard form the Unicode character repertoire. This approach has the potential to reveal connections among scripts and civilizations and to help the deciphering of ancient scripts.

pdf abs
How to Annotate Emotions in Historical Italian Novels: A Case Study on I Promessi Sposi
Rachele Sprugnoli | Arianna Redaelli

This paper describes the annotation of a chapter taken from I Promessi Sposi, the most famous Italian novel of the 19th century, written by Alessandro Manzoni, following 3 emotion classifications. The aim of this methodological paper is to understand: i) how the annotation procedure changes depending on the granularity of the classification, ii) how the different granularities impact the inter-annotator agreement, iii) which granularity allows good coverage of emotions, iv) whether the chosen classifications are missing emotions that are important for historical literary texts. The opinion of non-experts is integrated in the present study through an online questionnaire. In addition, preliminary experiments are carried out using the new dataset as a test set to evaluate the performance of different approaches to emotion polarity detection and emotion classification respectively. Annotated data are released both as an aggregated gold standard and with non-aggregated labels (that is, labels before reconciliation between annotators) so as to align with the perspectivist approach, which is an established practice in the Humanities and, more recently, also in NLP.

pdf abs
Leveraging LLMs for Post-OCR Correction of Historical Newspapers
Alan Thomas | Robert Gaizauskas | Haiping Lu

Poor OCR quality continues to be a major obstacle for humanities scholars seeking to make use of digitised primary sources such as historical newspapers. Typical approaches to post-OCR correction employ sequence-to-sequence models for a neural machine translation task, mapping erroneous OCR texts to accurate reference texts. We shift our focus towards the adaptation of generative LLMs for a prompt-based approach. By instruction-tuning Llama 2 and comparing it to a fine-tuned BART on BLN600, a parallel corpus of 19th century British newspaper articles, we demonstrate the potential of a prompt-based approach in detecting and correcting OCR errors, even with limited training data. We achieve a significant enhancement in OCR quality with Llama 2 outperforming BART, achieving a 54.51% reduction in the character error rate against BART’s 23.30%. This paves the way for future work leveraging generative LLMs to improve the accessibility and unlock the full potential of historical texts for humanities research.
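
The character error rate (CER) reductions quoted above can be made concrete with a small sketch. It assumes CER is the Levenshtein edit distance divided by the reference length, and that "reduction" is measured relative to the CER of the uncorrected OCR text; the abstract does not spell out either definition. The function names and sample strings are hypothetical and are not taken from the paper or from BLN600.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


def cer_reduction(ocr: str, corrected: str, reference: str) -> float:
    """Relative CER reduction achieved by a corrector (assumed definition)."""
    before = cer(ocr, reference)
    if before == 0:
        raise ValueError("OCR text already matches the reference")
    return (before - cer(corrected, reference)) / before


if __name__ == "__main__":
    reference = "the quick brown fox"
    ocr = "tbe qnick brovvn fox"        # invented OCR noise
    corrected = "the quick brovvn fox"  # invented partial correction
    print(f"{cer_reduction(ocr, corrected, reference):.2%}")
```

Under this definition a negative value would mean the corrector made the text worse, which is worth checking for when a generative model is free to rewrite its input.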

pdf abs
LLM-based Machine Translation and Summarization for Latin
Martin Volk | Dominic Philipp Fischer | Lukas Fischer | Patricia Scheurer | Phillip Benjamin Ströbel

This paper presents an evaluation of machine translation for Latin. We tested multilingual Large Language Models, in particular GPT-4, on letters from the 16th century that are in Latin and Early New High German. Our experiments include translation and cross-language summarization for the two historical languages into modern English and German. We show that LLM-based translation for Latin is clearly superior to previous approaches. We also show that LLM-based paraphrasing of Latin paragraphs from the historical letters produces English and German summaries that are close to human summaries published in the edition.

pdf abs
Exploring Aspect-Based Sentiment Analysis Methodologies for Literary-Historical Research Purposes
Tess Dejaeghere | Pranaydeep Singh | Els Lefever | Julie Birkholz

This study explores aspect-based sentiment analysis (ABSA) methodologies for literary-historical research, aiming to address the limitations of traditional sentiment analysis in understanding nuanced aspects of literature. It evaluates three ABSA toolchains: rule-based, machine learning-based (utilizing BERT and MacBERTh embeddings), and a prompt-based workflow with Mixtral 8x7B. Findings highlight challenges and potentials of ABSA for literary-historical analysis, emphasizing the need for context-aware annotation strategies and technical skills. The research contributes by curating a multilingual corpus of travelogues, publishing an annotated dataset for ABSA, creating openly available Jupyter Notebooks with Python code for each modeling approach, conducting pilot experiments on literary-historical texts, and proposing future endeavors to advance ABSA methodologies in this domain.

pdf abs
Early Modern Dutch Comedies and Farces in the Spotlight: Introducing EmDComF and Its Emotion Framework
Florian Debaene | Kornee van der Haven | Veronique Hoste

Although computational drama studies are developing rapidly, the Dutch dramatic tradition still needs centralisation before it can benefit from state-of-the-art methodologies. This paper presents and evaluates EmDComF, a historical corpus of both manually curated and automatically digitised early modern Dutch comedies and farces authored between 1650 and 1725, and describes the refinement of a historically motivated annotation framework exploring sentiment and emotions in these two dramatic subgenres. Originating from Lodewijk Meyer’s philosophical writings on passions in the dramatic genre (±1670), published in Naauwkeurig onderwys in de tooneel-poëzy (Thorough instruction in the Poetics of Drama) by the literary society Nil Volentibus Arduum in 1765, a historical and genre-specific emotion framework is tested and operationalised for annotating emotions in the domain of early modern Dutch comedies and farces. Based on a frequency and cluster analysis of 782 sentences annotated by 2 expert annotators, the initial 38 emotion labels were restructured into a hierarchical label set of the 5 emotions Hatred, Anxiety, Sadness, Joy and Desire.

pdf abs
When Hieroglyphs Meet Technology: A Linguistic Journey through Ancient Egypt Using Natural Language Processing
Ricardo Muñoz Sánchez

Knowing our past can help us better understand our future. The explosive development of NLP in these past few decades has allowed us to study ancient languages and cultures in ways that we couldn’t have done in the past. However, not all languages have received the same level of attention. Despite their popularity in pop culture, the languages spoken in Ancient Egypt have been somewhat overlooked in terms of NLP research. In this paper we give an overview of how NLP has been used to study different variations of the Ancient Egyptian languages. This includes not only Old, Middle, and Late Egyptian but also Demotic and Coptic. We begin our survey paper by giving a short introduction to these languages and their writing systems, before talking about the corpora and lexical resources that are available digitally. We then show the different NLP tasks that have been tackled for different variations of Ancient Egyptian, as well as the approaches that have been used. We hope that our work can stoke interest in the study of these languages within the NLP community.

pdf abs
Towards a Readability Formula for Latin
Thomas Laurs

This research focuses on the development of a readability formula for Latin texts, a much-needed tool to assess the difficulty of Latin texts in educational settings. This study takes a comprehensive approach, exploring more than 100 linguistic variables, including lexical, morphological, syntactical, and discourse-related factors, to capture the multifaceted nature of text difficulty. The study incorporates a corpus of Latin texts that were assessed for difficulty, and their evaluations were used to establish the basis for the model. The research utilizes natural language processing tools to derive linguistic predictors, resulting in a multiple linear regression model that explains about 70% of the variance in text difficulty. While the model’s precision can be enhanced by adding further variables and a larger corpus, it already provides valuable insights into the readability of Latin texts and offers the opportunity to examine how different text genres and contents influence text accessibility. Additionally, the formula’s focus on objective text difficulty paves the way for future research on personal predictors, particularly in educational contexts.

pdf abs
Automatic Normalisation of Middle French and Its Impact on Productivity
Raphael Rubino | Sandra Coram-Mekkey | Johanna Gerlach | Jonathan David Mutal | Pierrette Bouillon

This paper presents a study on automatic normalisation of 16th century documents written in Middle French. These documents present a large variety of wordforms which require spelling normalisation to facilitate downstream linguistic and historical studies. We frame the normalisation process as a machine translation task starting with a strong baseline leveraging a pre-trained encoder–decoder model. We propose to improve this baseline by combining synthetic data generation methods and producing artificial training data, thus tackling the lack of parallel corpora relevant to our task. The evaluation of our approach is twofold: in addition to automatic metrics relying on gold references, we evaluate our models through post-editing of their outputs. This evaluation method directly measures the productivity gain brought by our models to experts conducting the normalisation task manually. Results show a 20+ token per minute increase in productivity when using automatic normalisation compared to normalising text from scratch. The manually post-edited dataset resulting from our study is the first parallel corpus of normalised 16th century Middle French to be publicly released, along with the synthetic data and the automatic normalisation models used and trained in the presented work.

pdf abs
Overview of the EvaLatin 2024 Evaluation Campaign
Rachele Sprugnoli | Federica Iurescia | Marco Passarotti

This paper describes the organization and the results of the third edition of EvaLatin, the campaign for the evaluation of Natural Language Processing tools for Latin. The two shared tasks proposed in EvaLatin 2024, i.e., Dependency Parsing and Emotion Polarity Detection, aim to foster research in the field of language technologies for Classical languages. The shared datasets are described and the results obtained by the participants for each task are presented and discussed.

pdf abs
Behr at EvaLatin 2024: Latin Dependency Parsing Using Historical Sentence Embeddings
Rufus Behr

This paper identifies the system used for my submission to EvaLatin’s shared dependency parsing task as part of the LT4HALA 2024 workshop. EvaLatin presented new Latin prose and poetry dependency test data from potentially different time periods, and imposed no restriction on training data or model selection for the task. This paper, therefore, sought to build a general Latin dependency parser that would perform accurately regardless of the Latin age to which the test data belongs. To train a general parser, all of the available Universal Dependencies treebanks were used, but in order to address the changes in the Latin language over time, this paper introduces historical sentence embeddings. A model was trained to encode sentences of the same Latin age into vectors of high cosine similarity, which are referred to as historical sentence embeddings. The system introduces these historical sentence embeddings into a biaffine dependency parser with the hopes of enabling training across the Latin treebanks in a more efficacious manner, but their inclusion shows no improvement over the base model.

pdf abs
KU Leuven / Brepols-CTLO at EvaLatin 2024: Span Extraction Approaches for Latin Dependency Parsing
Wouter Mercelis

This report describes the KU Leuven / Brepols-CTLO submission to EvaLatin 2024. We present the results of two runs, both of which try to implement a span extraction approach. The first run implements span-span prediction, rooted in Machine Reading Comprehension, while making use of LaBERTa, a RoBERTa model pretrained on Latin texts. The first run produces meaningful results. The second, more experimental run operates on the token-level with a span-extraction approach based on the Question Answering task. This model finetuned a DeBERTa model, pretrained on Latin texts. The finetuning was set up in the form of a Multitask Model, with classification heads for each token’s part-of-speech tag and dependency relation label, while a question answering head handled the dependency head predictions. Due to the shared loss function, this paper tried to capture the link between part-of-speech tag, dependency relation and dependency heads, that follows the human intuition. The second run did not perform well.

pdf abs
ÚFAL LatinPipe at EvaLatin 2024: Morphosyntactic Analysis of Latin
Milan Straka | Jana Straková | Federica Gamba

We present LatinPipe, the winning submission to the EvaLatin 2024 Dependency Parsing shared task. Our system consists of a fine-tuned concatenation of base and large pre-trained LMs, with a dot-product attention head for parsing and softmax classification heads for morphology to jointly learn both dependency parsing and morphological analysis. It is trained by sampling from seven publicly available Latin corpora, utilizing additional harmonization of annotations to achieve a more unified annotation style. Before fine-tuning, we train the system for a few initial epochs with frozen weights. We also add additional local relative contextualization by stacking the BiLSTM layers on top of the Transformer(s). Finally, we ensemble output probability distributions from seven randomly instantiated networks for the final submission. The code is available at https://github.com/ufal/evalatin2024-latinpipe.

pdf abs
Nostra Domina at EvaLatin 2024: Improving Latin Polarity Detection through Data Augmentation
Stephen Bothwell | Abigail Swenor | David Chiang

This paper describes submissions from the team Nostra Domina to the EvaLatin 2024 shared task of emotion polarity detection. Given the low-resource environment of Latin and the complexity of sentiment in rhetorical genres like poetry, we augmented the available data through automatic polarity annotation. We present two methods for doing so on the basis of the k-means algorithm, and we employ a variety of Latin large language models (LLMs) in a neural architecture to better capture the underlying contextual sentiment representations. Our best approach achieved the second highest macro-averaged Macro-F1 score on the shared task’s test set.

pdf abs
TartuNLP at EvaLatin 2024: Emotion Polarity Detection
Aleksei Dorkin | Kairit Sirts

This is the technical report for our submission to the EvaLatin 2024 shared task. We apply knowledge transfer techniques and two distinct approaches to data annotation: one based on heuristics and one based on LLMs.

Ancient Chinese texts have no sentence boundaries or punctuation. Adding modern Chinese punctuation to these texts requires expertise, time and effort. Automatic sentence segmentation and punctuation is considered a basic task for Ancient Chinese processing, but there has been no shared task to evaluate the performance of different systems. This paper presents the results of the first ancient Chinese sentence segmentation and punctuation bakeoff, held at the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) 2024. The contest uses metrics for detailed evaluations of 4 genres of unpublished texts with 11 punctuation types. Six teams submitted 32 running results. In the closed modality, where participants are only allowed to use the training data, the highest F1 scores obtained are 88.47% for sentence segmentation and 75.29% for sentence punctuation. Performance on the unseen data is 10 percent lower than on the published common data, which means there is still room for improvement. The large language models outperform the traditional models, but LLMs change around 1-2% of the original characters due to over-generation. Thus, post-processing is needed to keep the text consistent.

pdf abs
Two Sequence Labeling Approaches to Sentence Segmentation and Punctuation Prediction for Classic Chinese Texts
Xuebin Wang | Zhenghua Li

This paper describes our system for the EvaHan2024 shared task. We design and experiment with two sequence labeling approaches, i.e., one-stage and two-stage approaches. The one-stage approach directly predicts a label for each character, and the label may contain multiple punctuation marks. The two-stage approach divides punctuation marks into two classes, i.e., pause and non-pause, and handles them separately via two sequence labeling processes. In this case, each label contains at most one punctuation mark. We use pre-trained SikuRoBERTa as a key component of the encoder and employ a conditional random field (CRF) layer on top. According to the evaluation metrics adopted by the organizers, the two-stage approach is superior to the one-stage approach, and our system achieves second place among all participating systems.
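
The one-stage labeling idea described above, one label per character where a label may hold several punctuation marks, can be pictured with a minimal sketch. The mark inventory, the "O" label for "no punctuation follows", and the function names are assumptions made for illustration; they are not the authors' tag set or code, and the encoder and CRF layer are left out entirely.

```python
# Hypothetical inventory of marks that attach to the preceding character.
TRAILING_MARKS = set("，。、；：？！」』")


def to_labels(punctuated: str) -> tuple[list[str], list[str]]:
    """Split punctuated text into characters and per-character labels.

    A label is "O" when nothing follows the character, otherwise the string
    of marks that follow it (possibly more than one, e.g. "。」").
    """
    chars: list[str] = []
    labels: list[str] = []
    for ch in punctuated:
        if ch in TRAILING_MARKS:
            if not chars:
                continue  # a mark with no preceding character is dropped here
            labels[-1] = ch if labels[-1] == "O" else labels[-1] + ch
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def from_labels(chars: list[str], labels: list[str]) -> str:
    """Rebuild punctuated text from characters and (predicted) labels."""
    return "".join(c + ("" if lab == "O" else lab) for c, lab in zip(chars, labels))


if __name__ == "__main__":
    chars, labels = to_labels("學而時習之，不亦說乎？")
    print(list(zip(chars, labels)))
    print(from_labels(chars, labels))
```

A two-stage variant would run two such labelers, each with a smaller label set holding at most one mark, and merge their outputs afterwards.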

pdf abs
Ancient Chinese Sentence Segmentation and Punctuation on Xunzi LLM
Shitu Huo | Wenhui Chen

This paper describes the system submitted for the EvaHan 2024 Task on ancient Chinese sentence segmentation and punctuation. Our study utilizes the Xunzi large language model as the base model to evaluate the overall performance and the performance by record type. The applied methodologies and the prompts utilized in our study have proven helpful and effective in aiding the model’s performance evaluation.

pdf abs
Sentence Segmentation and Sentence Punctuation Based on XunziALLM
Zihong Chen

In ancient Chinese books, punctuation marks are typically absent in engraved texts. Sentence segmentation and punctuation rely heavily on the meticulous efforts of experts and scholars. Therefore, automatic punctuation and sentence segmentation plays a very important role in promoting ancient books, as well as in the inheritance of Chinese culture. In this paper, we present a method for fine-tuning a large language model on downstream tasks using the LoRA approach, leveraging the EvaHan2024 dataset. This method ensures robust output and high accuracy while inheriting the knowledge from the large pre-trained language model Xunzi.

This paper describes the participation of team “TeleAI” in the third International Chinese Ancient Chinese Language Information Processing Evaluation (EvalHan24). The competition comprises a joint task of sentence segmentation and punctuation, categorized into open and closed tracks based on the models and data used. In the final evaluation, our system achieved significantly better results than the baseline. Specifically, in the closed-track sentence segmentation task, we obtained an F1 score of 0.8885, while in the sentence punctuation task, we achieved an F1 score of 0.7129.

pdf abs
SPEADO: Segmentation and Punctuation for Ancient Chinese Texts via Example Augmentation and Decoding Optimization
Tian Xia | Kai Yu | Qianrong Yu | Xinran Peng

The SPEADO model for sentence segmentation and punctuation tasks in ancient Chinese texts is proposed, which incorporates text chunking and MinHash indexing techniques to realise example augmentation. Additionally, decoding optimization strategies are introduced to direct the attention of the LLM towards punctuation errors and address the issue of uncontrollable output. Experimental results show that the F1 score of the proposed method exceeds the baseline model by 14.18%, indicating a significant improvement in performance.
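
As an illustration of how MinHash can be used to retrieve similar examples, here is a minimal sketch in plain Python. The character bigram chunking, the number of hash functions, the helper names and the toy example pool are all assumptions for illustration; the abstract does not specify SPEADO's actual chunking or indexing setup.

```python
import hashlib


def shingles(text, k=2):
    """Character k-grams of a text (the granularity is an assumption)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value for each seeded hash function."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The share of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def most_similar(query, pool, k=2):
    """Return the pool entry whose signature is closest to the query's."""
    q_sig = minhash_signature(shingles(query, k))
    return max(
        pool,
        key=lambda ex: estimated_jaccard(q_sig, minhash_signature(shingles(ex, k))),
    )


# Toy pool of unpunctuated examples, invented for illustration.
pool = ["學而時習之不亦樂乎", "天地玄黃宇宙洪荒"]
print(most_similar("學而時習之不亦說乎", pool))
```

In a larger system the signatures would normally be precomputed and indexed rather than recomputed per query as done here.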

pdf abs
Ancient Chinese Punctuation via In-Context Learning
Jie Huang

EvaHan2024 focuses on sentence punctuation in ancient Chinese. The Xunzi large language base model, which is specifically trained for ancient Chinese processing, is recommended in the campaign. We adopted the in-context learning (ICL) paradigm for this task and designed a post-processing scheme to ensure standardized final results. When constructing ICL prompts, we performed feature extraction by LLM QA and selected demonstrations based on non-parametric metrics. We used Xunzi in two stages without further training in either, so the model remained generic and its other fundamental abilities were unaffected. Moreover, newly acquired training data can be directly utilized after identical feature extraction, showcasing the scalability of our system. As for the result, we achieved an F1-score of 67.7% on a complex test dataset consisting of multiple types of documents and 77.98% on Zuozhuan data.

pdf (full)
bib (full) Proceedings of the 2nd Workshop on Mathematical Natural Language Processing @ LREC-COLING 2024

pdf bib
Proceedings of the 2nd Workshop on Mathematical Natural Language Processing @ LREC-COLING 2024
Marco Valentino | Deborah Ferreira | Mokanarangan Thayaparan | Andre Freitas

pdf bib abs
An Approach to Co-reference Resolution and Formula Grounding for Mathematical Identifiers Using Large Language Models
Aamin Dev | Takuto Asakura | Rune Sætre

This paper outlines an automated approach to annotate mathematical identifiers in scientific papers, a process historically laborious and costly. We employ state-of-the-art LLMs, including GPT-3.5 and GPT-4, and open-source alternatives to generate a dictionary for annotating mathematical identifiers, linking each identifier to its conceivable descriptions and then assigning these definitions to the respective identifier instances based on context. Evaluation metrics include the CoNLL score for co-reference cluster quality and semantic correctness of the annotations.

pdf bib abs
Fluid Dynamics-Inspired Emotional Analysis in Shakespearean Tragedies: A Novel Computational Linguistics Methodology
Davide Picca

This study introduces an innovative method for analyzing emotions in texts, drawing inspiration from the principles of fluid dynamics, particularly the Navier-Stokes equations. It applies this framework to analyze Shakespeare’s tragedies “Hamlet” and “Romeo and Juliet”, treating emotional expressions as entities akin to fluids. By mapping linguistic characteristics onto fluid dynamics components, this approach provides a dynamic perspective on how emotions are expressed and evolve in narrative texts. The results, when compared with conventional sentiment analysis methods, reveal a more detailed and subtle grasp of the emotional arcs within these works. This interdisciplinary strategy not only enriches emotion analysis in computational linguistics but also paves the way for potential integrations with machine learning in NLP.

pdf abs
Math Problem Solving: Enhancing Large Language Models with Semantically Rich Symbolic Variables
Ali Emre Narin

The advent of Large Language Models (LLMs) based on the Transformer architecture has led to remarkable advancements in various domains, including reasoning tasks. However, accurately assessing the performance of Large Language Models, particularly in the reasoning domain, remains a challenge. In this paper, we propose the Semantically Rich Variable Substitution Method (SemRiVas) as an enhancement to existing symbolic methodologies for evaluating LLMs on Mathematical Word Problems (MWPs). Unlike previous approaches that utilize generic symbols for variable substitution, SemRiVas employs descriptive variable names, aiming to improve the problem-solving abilities of LLMs. Our method aims to eliminate the need for LLMs to possess programming proficiency and perform arithmetic operations, to be universally applicable. Our experimental results demonstrate the superior accuracy of SemRiVas compared to prior symbolic methods, particularly in resolving longer and more complex MWP questions. However, LLMs’ performance with SemRiVas and symbolic methods that utilize one-character variables still falls short compared to notable techniques like CoT and PaL.

pdf abs
Data Driven Approach for Mathematical Problem Solving
Byungju Kim | Wonseok Lee | Jaehong Kim | Jungbin Im

In this paper, we investigate and introduce a novel Llama-2 based model, fine-tuned with an original dataset designed to mirror real-world mathematical challenges. The dataset was collected through a question-answering platform, incorporating solutions generated by both a rule-based solver and question answering, to cover a broad spectrum of mathematical concepts and problem-solving techniques. Experimental results demonstrate significant performance improvements when the models are fine-tuned with our dataset. The results suggest that the integration of contextually rich and diverse problem sets into the training substantially enhances the problem-solving capability of language models across various mathematical domains. This study showcases the critical role of curated educational content in advancing AI research.

pdf abs
Exploring Internal Numeracy in Language Models: A Case Study on ALBERT
Ulme Wennberg | Gustav Eje Henter

It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.

pdf (full)
bib (full) Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

pdf bib
Every Time We Hire an LLM, the Reasoning Performance of the Linguists Goes Up
Harish Tayyar Madabushi

pdf bib
Using Universal Dependencies for testing hypotheses about communicative efficiency
Natalia Levshina

pdf abs
Automatic Manipulation of Training Corpora to Make Parsers Accept Real-world Text
Hiroshi Kanayama | Ran Iwamoto | Masayasu Muraoka | Takuya Ohko | Kohtaroh Miyamoto

This paper discusses how to build a practical syntactic analyzer, and addresses the distributional differences between existing corpora and actual documents in applications. As a case study we focus on noun phrases that are not headed by a main verb and sentences without punctuation at the end, which are rare in a number of Universal Dependencies corpora but frequently appear in the real-world use cases of syntactic parsers. We converted the training corpora so that their distribution is closer to that in realistic inputs, and obtained better scores in both general syntax benchmarking and a sentiment detection task, a typical application of dependency analysis.

pdf abs
Assessing BERT’s sensitivity to idiomaticity
Li Liu | Francois Lareau

BERT-like language models have been demonstrated to capture the idiomatic meaning of multiword expressions. Linguists have also shown that idioms have varying degrees of idiomaticity. In this paper, we assess CamemBERT’s sensitivity to the degree of idiomaticity within idioms, as well as the dependency of this sensitivity on part of speech and idiom length. We used a demasking task on tokens from 3127 idioms and 22551 tokens corresponding to simple lexemes taken from the French Lexical Network (LN-fr), and observed that CamemBERT performs distinctly on tokens embedded within idioms compared to simple ones. When demasking tokens within idioms, the model is not proficient in discerning their level of idiomaticity. Moreover, regardless of idiomaticity, CamemBERT excels at handling function words. The length of idioms also impacts CamemBERT’s performance to a certain extent. The last two observations partly explain the difference between the model’s performance on idioms versus simple lexemes. We conclude that the model treats idioms differently from simple lexemes, but that it does not capture the difference in compositionality between subclasses of idioms.

pdf abs
Identification and Annotation of Body Part Multiword Expressions in Old Egyptian
Roberto Díaz Hernández

This paper presents the preliminary results of an ongoing study on the diachronic and synchronic use of multiword expressions (MWEs) in Egyptian, begun when I joined the COST Action Universality, Diversity and Idiosyncrasy in Language Technology (UniDive, CA21167). It analyzes, as a case study, Old Egyptian body part MWEs based on lexicographic and textual resources, and its aim is both to open up a research line in Egyptology, where the study of MWEs has been neglected, and to contribute to Natural Language Processing studies by determining the rules governing the morpho-syntactic formation of Old Egyptian body part MWEs in order to facilitate the identification of other types of MWEs.

pdf abs
Fitting Fixed Expressions into the UD Mould: Swedish as a Use Case
Lars Ahrenberg

Fixed multiword expressions are common in many, if not all, natural languages. In the Universal Dependencies framework, UD, a subset of these expressions are modelled with the dependency relation ‘fixed’ targeting the most grammaticalized cases of functional multiword items. In this paper we perform a detailed analysis of 439 expressions modelled with ‘fixed’ in two Swedish UD treebanks in order to reduce their numbers and fit the definition better. We identify a large number of dimensions of variation for fixed multiword expressions that can be used for the purpose. We also point out several problematic aspects of the current UD approach to multiword expressions and discuss different alternative solutions for modelling fixed expressions. We suggest that insights from Constructional Grammar (CxG) can help with a more systematic treatment of fixed expressions in UD.

pdf abs
Synthetic-Error Augmented Parsing of Swedish as a Second Language: Experiments with Word Order
Arianna Masciolini | Emilie Francis | Maria Irena Szawerna

Ungrammatical text poses significant challenges for off-the-shelf dependency parsers. In this paper, we explore the effectiveness of using synthetic data to improve performance on essays written by learners of Swedish as a second language. Due to their relevance and ease of annotation, we restrict our initial experiments to word order errors. To do that, we build a corrupted version of the standard Swedish Universal Dependencies (UD) treebank Talbanken, mimicking the error patterns and frequency distributions observed in the Swedish Learner Language (SweLL) corpus. We then use the MaChAmp (Massive Choice, Ample tasks) toolkit to train an array of BERT-based dependency parsers, fine-tuning on different combinations of original and corrupted data. We evaluate the resulting models not only on their respective test sets but also, most importantly, on a smaller collection of sentence-correction pairs derived from SweLL. Results show small but significant performance improvements on the target domain, with minimal decline on normative data.

pdf abs
The Vedic Compound Dataset
Sven Sellmer | Oliver Hellwig

This paper introduces the Vedic Compound Dataset (VCD), the first resource providing annotated compounds from Vedic Sanskrit, a South Asian Indo-European language used from ca. 1500 to 500 BCE. The VCD aims at facilitating the study of language change in early Indo-Iranian and offers comparative material for quantitative cross-linguistic research on compounds. The process of annotating Vedic compounds is complex as they contain five of the six basic types of compounds defined by Scalise & Bisetto (2005), which are, however, not consistently marked in morphosyntax, making their automatic classification a significant challenge. The paper details the process of collecting and preprocessing the relevant data, with a particular focus on the question of how to distinguish exocentric from endocentric usage. It further discusses experiments with a simple ML classifier that uses compound internal syntactic relations, outlines the composition of the dataset, and sketches directions for future research.

pdf abs
A Universal Dependencies Treebank for Gujarati
Mayank Jobanputra | Maitrey Mehta | Çağrı Çöltekin

The Universal Dependencies (UD) project has presented itself as a valuable platform to develop various resources for the languages of the world. We present and release a sample treebank for the Indo-Aryan language of Gujarati – a widely spoken language with few linguistic resources. This treebank is the first labeled dataset for dependency parsing in the language and the script (the Gujarati script). The treebank contains 187 part-of-speech and dependency annotated sentences from diverse genres. We discuss various idiosyncratic examples and annotation choices, and present an elaborate corpus along with agreement statistics. We see this work as a valuable resource and a stepping stone for research in Gujarati Computational Linguistics.

UDify is a multilingual and multi-task parser fine-tuned on mBERT that achieves remarkable performance in high-resource languages. However, the performance saturates early and decreases gradually in low-resource languages as training proceeds. This work applies a data augmentation method and conducts experiments on seven few-shot and four zero-shot languages. The unlabeled attachment scores were improved on the zero-shot languages dependency parsing tasks, with the average score rising from 67.1% to 68.7%. Meanwhile, dependency parsing tasks for high-resource languages and other tasks were hardly affected. Experimental results indicate the data augmentation method is effective for low-resource languages in a multilingual dependency parsing.
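
The unlabeled attachment score (UAS) reported above is the share of tokens whose predicted head matches the gold head. A minimal sketch follows; the head indices are invented for illustration and the function name is hypothetical.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head index equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Toy three-token sentence; 0 marks the root, other values are head positions.
gold = [2, 0, 2]
pred = [2, 0, 1]
print(unlabeled_attachment_score(gold, pred))
```

Corpus-level UAS is usually computed over all tokens of a test set rather than averaged per sentence.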

pdf abs
Part-of-Speech Tagging for Northern Kurdish
Peshmerge Morad | Sina Ahmadi | Lorenzo Gatti

In the growing domain of natural language processing, low-resourced languages like Northern Kurdish remain largely unexplored due to the lack of resources needed to be part of this growth. In particular, the tasks of part-of-speech tagging and tokenization for Northern Kurdish are still insufficiently addressed. In this study, we aim to bridge this gap by evaluating a range of statistical, neural, and fine-tuned models specifically tailored for Northern Kurdish. Leveraging limited but valuable datasets, including the Universal Dependency Kurmanji treebank and a novel manually annotated and tokenized gold-standard dataset consisting of 136 sentences (2,937 tokens), we evaluate several POS tagging models and report that the fine-tuned transformer-based model outperforms others, achieving an accuracy of 0.87 and a macro-averaged F1 score of 0.77. Data and models are publicly available under an open license at https://github.com/peshmerge/northern-kurdish-pos-tagging

pdf abs
Diachronic Analysis of Multi-word Expression Functional Categories in Scientific English
Diego Alves | Stefania Degaetano-Ortlieb | Elena Schmidt | Elke Teich

We present a diachronic analysis of multi-word expressions (MWEs) in English based on the Royal Society Corpus, a dataset containing 300+ years of the scientific publications of the Royal Society of London. Specifically, we investigate the functions of MWEs, such as stance markers (“it is interesting”) or discourse organizers (“in this section”), and their development over time. Our approach is multi-disciplinary: to detect MWEs we use Universal Dependencies, to classify them functionally we use an approach from register linguistics, and to assess their role in diachronic development we use an information-theoretic measure, relative entropy.
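
Relative entropy (Kullback-Leibler divergence) between two time periods can be sketched as follows. The counts are invented toy data and the add-alpha smoothing is an assumption for illustration; the abstract does not state how the authors estimate the distributions.

```python
import math


def relative_entropy(counts_p, counts_q, alpha=1.0):
    """D(P || Q) in bits, with add-alpha smoothing over the joint vocabulary."""
    vocab = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + alpha * len(vocab)
    total_q = sum(counts_q.values()) + alpha * len(vocab)
    divergence = 0.0
    for item in vocab:
        p = (counts_p.get(item, 0) + alpha) / total_p
        q = (counts_q.get(item, 0) + alpha) / total_q
        divergence += p * math.log2(p / q)
    return divergence


# Hypothetical MWE frequencies in an earlier and a later period.
early = {"it is interesting": 40, "in this section": 5}
late = {"it is interesting": 10, "in this section": 60}
print(relative_entropy(late, early))
```

The measure is asymmetric, so comparing a later period against an earlier one generally gives a different value than the reverse comparison.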

This paper highlights the importance of integrating MWE identification with the development of syntactic MWE lexicons. It suggests that lexicons with minimal morphosyntactic information can amplify current MWE-annotated datasets and refine identification strategies. To our knowledge, this work represents the first attempt to focus on both seen and unseen VMWEs for Arabic. It also deals with the challenge of differentiating between literal and figurative interpretations of idiomatic expressions. The approach involves a dual-phase procedure: first projecting a VMWE lexicon onto a corpus to identify candidate occurrences, then disambiguating these occurrences to distinguish idiomatic from literal instances. Experiments outlined in the paper aim to assess the efficacy of this technique, utilizing a lexicon known as LEXAR and the “parseme-ar” corpus. The findings suggest that lexicon-driven strategies have the potential to refine MWE identification, particularly for unseen occurrences.

pdf abs
Revisiting VMWEs in Hindi: Annotating Layers of Predication
Kanishka Jain | Ashwini Vaidya

Multiword expressions in languages like Hindi are both productive and challenging. Hindi not only uses a variety of verbal multiword expressions (VMWEs) but also employs different combinatorial strategies to create new types of multiword expressions. In this paper we investigate two such strategies that are quite common in the language. First, we show that VMWEs in Hindi are not just lexical but also morphological, since causatives are formed morphologically in Hindi. Second, we examine stacked VMWEs, i.e., cases where at least two VMWEs occur together. We suggest that the existing PARSEME annotation framework can be extended to these two phenomena without changing the existing guidelines. We also propose rule-based heuristics using existing Universal Dependency annotations to automatically identify and annotate some of the VMWEs in the language. The goal of this paper is to refine the existing PARSEME corpus of Hindi for VMWEs while expanding its scope, giving a more comprehensive picture of VMWEs in Hindi.

pdf abs
Towards the semantic annotation of SR-ELEXIS corpus: Insights into Multiword Expressions and Named Entities
Cvetana Krstev | Ranka Stanković | Aleksandra M. Marković | Teodora Sofija Mihajlov

This paper presents the work in progress on ELEXIS-sr corpus, the Serbian addition to the ELEXIS multilingual annotated corpus ElexisWSD, comprising semantic annotations and word sense repositories. The ELEXIS corpus has parallel annotations in ten European languages, serving as a cross-lingual benchmark for evaluating low and medium-resourced European languages. The focus in this paper is on multiword expressions (MWEs) and named entities (NEs), their recognition in the ELEXIS-sr sentence set, and comparison with annotations in other languages. The first steps in building the Serbian sense inventory are discussed, and some results concerning MWEs and NEs are analysed. Once completed, the ELEXIS-sr corpus will be the first sense annotated corpus using the Serbian WordNet (SrpWN). Finally, ideas to represent MWE lexicon entries as Linguistic Linked-Open Data (LLOD) and connect them with occurrences in the corpus are presented.

pdf abs
To Leave No Stone Unturned: Annotating Verbal Idioms in the Parallel Meaning Bank
Rafael Ehren | Kilian Evang | Laura Kallmeyer

Idioms present many challenges to semantic annotation in a lexicalized framework, which leads to them being underrepresented or inadequately annotated in sembanks. In this work, we address this problem with respect to verbal idioms in the Parallel Meaning Bank (PMB), specifically in its German part, where only some idiomatic expressions have been annotated correctly. We first select candidate idiomatic expressions, then determine their idiomaticity status and whether they are decomposable or not, and then we annotate their semantics using WordNet senses and VerbNet semantic roles. Overall, inter-annotator agreement is very encouraging. A difficulty, however, is to choose the correct word sense. This is not surprising, given that English synsets are many and there is often no unique mapping from German idioms and words to them. Besides this, there are many subtle differences and interesting challenging cases. We discuss some of them in this paper.

pdf abs
Universal Feature-based Morphological Trees
Federica Gamba | Abishek Stephen | Zdeněk Žabokrtský

The paper proposes a novel data representation inspired by Universal Dependencies (UD) syntactic trees, which are extended to capture the internal morphological structure of word forms. As a result, morphological segmentation is incorporated within the UD representation of syntactic dependencies. To derive the proposed data structure we leverage existing annotation of UD treebanks as well as available resources for segmentation, and we select 10 languages to work with in the presented case study. Additionally, statistical analysis reveals a robust correlation between morphs and sets of morphological features of words. We thus align the morphs to the observed feature inventories capturing the morphological meaning of morphs. Through the beneficial exploitation of cross-lingual correspondence of morphs, the proposed syntactic representation based on morphological segmentation proves to enhance the comparability of sentence structures across languages.

pdf abs
Combining Grammatical and Relational Approaches. A Hybrid Method for the Identification of Candidate Collocations from Corpora
Damiano Perri | Irene Fioravanti | Osvaldo Gervasi | Stefania Spina

We present an evaluation of three different methods for the automatic identification of candidate collocations in corpora, part of a research project focused on the development of a learner dictionary of Italian collocations. We compare the commonly used POS-based method and the syntactic dependency-based method with a hybrid method integrating both approaches. We conduct a statistical analysis on a sample corpus of written and spoken texts of different registers. Results show that the hybrid method can correctly detect more candidate collocations against a human annotated benchmark. The scores are particularly high in adjectival modifier relations. A hybrid approach to candidate collocation identification seems to lead to an improvement in the quality of results.

We present ongoing work towards defining a lexicon-corpus interface to serve as a benchmark in the representation of multiword expressions (of various parts of speech) in dedicated lexica and the linking of these entries to their corpus occurrences. The final aim is the harnessing of such resources for the automatic identification of multiword expressions in a text. The involvement of several natural languages aims at the universality of a solution not centered on a particular language, and also accommodating idiosyncrasies. Challenges in the lexicographic description of multiword expressions are discussed, the current status of lexica dedicated to this linguistic phenomenon is outlined, as well as the solution we envisage for creating an ecosystem of interlinked lexica and corpora containing and, respectively, annotated with multiword expressions.

pdf abs
Annotation of Multiword Expressions in the SUK 1.0 Training Corpus of Slovene: Lessons Learned and Future Steps
Jaka Čibej | Polona Gantar | Mija Bon

Recent progress within the UniDive COST Action on the compilation of universal guidelines for the annotation of non-verbal multiword expressions (MWEs) has provided an opportunity to improve and expand the work previously done within the PARSEME COST Action on the annotation of verbal multiword expressions in the SUK 1.0 Training Corpus of Slovene. A segment of the training corpus had already been annotated with verbal MWEs during PARSEME. As a follow-up and part of the New Grammar of Modern Standard Slovene (NSSSS) project, the same segment was annotated with non-verbal MWEs, resulting in approximately 6,500 sentences annotated by at least three annotators (described in Gantar et al., 2019). Since then, the entire SUK 1.0 was also manually annotated with UD part-of-speech tags. In the paper, we present an analysis of the MWE annotations exported from the corpus along with their part-of-speech structures through the lens of Universal Dependencies. We discuss the usefulness of the data in terms of potential insight for the further compilation and fine-tuning of guidelines particularly for non-verbal MWEs, and conclude with our plans for future work.

pdf abs
Light Verb Constructions in Universal Dependencies for South Asian Languages
Abishek Stephen | Daniel Zeman

We conduct a morphosyntactic investigation into the light verb constructions (LVCs) or the verbo-nominal predicates in South Asian languages. This work spans the Indo-Aryan and Dravidian language families in treebanks based on Universal Dependencies (UD). For the selected languages we show how well the existing annotation guidelines fare for the LVCs. We also reiterate the importance of the core and oblique distinction in UD and how informative it is for making accurate morphosyntactic annotation judgments for such predicates.

pdf abs
Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection
Dylan Phelps | Thomas M. R. Pickard | Maggie Mi | Edward Gow-Smith | Aline Villavicencio

Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.

pdf abs
Universal Dependencies for Saraiki
Meesum Alam | Francis Tyers | Emily Hanink | Sandra Kübler

We present the first treebank of the Saraiki/Siraiki [ISO 639-3 skr] language, using the Universal Dependency annotation scheme (de Marneffe et al., 2021). The treebank currently comprises 587 annotated sentences and 7597 tokens. We explain the most relevant syntactic and morphological features of Saraiki, along with the decisions we have made for a range of language-specific constructions, namely compounds, verbal structures including light verb and serial verb constructions, and relative clauses.

pdf abs
Domain-Weighted Batch Sampling for Neural Dependency Parsing
Jacob Striebel | Daniel Dakota | Sandra Kübler

In neural dependency parsing, as well as in the broader field of NLP, domain adaptation remains a challenging problem. When adapting a parser to a target domain, there is a fundamental tension between the need to make use of out-of-domain data and the need to ensure that syntactic characteristics of the target domain are learned. In this work we explore a way to balance these two competing concerns, namely using domain-weighted batch sampling, which allows us to use all available training data, while controlling the probability of sampling in- and out-of-domain data when constructing training batches. We conduct experiments using ten natural language domains and find that domain-weighted batch sampling yields substantial performance improvements in all ten domains compared to a baseline of conventional randomized batch sampling.
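
The idea of controlling the probability of drawing in-domain versus out-of-domain data per batch can be sketched as below. This is an illustrative assumption about one way to implement such sampling (per-slot Bernoulli choice, sampling with replacement, hypothetical names), not the authors' procedure.

```python
import random


def sample_batch(in_domain, out_domain, batch_size, p_in, rng):
    """Fill each batch slot from the in-domain pool with probability p_in,
    otherwise from the out-of-domain pool (sampling with replacement)."""
    batch = []
    for _ in range(batch_size):
        pool = in_domain if rng.random() < p_in else out_domain
        batch.append(rng.choice(pool))
    return batch


# Toy sentence identifiers, invented for illustration.
in_domain = ["target-1", "target-2"]
out_domain = ["news-1", "news-2", "wiki-1", "wiki-2", "wiki-3"]

rng = random.Random(0)  # fixed seed so that runs are repeatable
print(sample_batch(in_domain, out_domain, batch_size=8, p_in=0.5, rng=rng))
```

With uniform sampling over the pooled data, the small in-domain set above would fill roughly two of every seven slots; raising `p_in` increases its share without discarding any out-of-domain data.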

As part of our efforts to develop unified Universal Dependencies (UD) guidelines for Turkic languages, we evaluate multiple approaches to a difficult morphosyntactic phenomenon, pronominal locative expressions formed by a suffix -ki. These forms result in multiple syntactic words, with potentially conflicting morphological features, and participating in different dependency relations. We describe multiple approaches to the problem in current (and upcoming) Turkic UD treebanks, and show that none of them offers a solution that satisfies a number of constraints we consider (including constraints imposed by UD guidelines). This calls for a compromise with the ‘least damage’ that should be adopted by most, if not all, Turkic treebanks. Our discussion of the phenomenon and various annotation approaches may also help treebanking efforts for other languages or language families with similar constructions.

pdf abs
BERT-based Idiom Identification using Language Translation and Word Cohesion
Arnav Yayavaram | Siddharth Yayavaram | Prajna Devi Upadhyay | Apurba Das

An idiom refers to a special type of multi-word expression whose meaning is figurative and cannot be deduced from the literal interpretation of its components. Idioms are prevalent in almost all languages and text genres, necessitating explicit handling by comprehensive NLP systems. Such phrases are referred to as Potentially Idiomatic Expressions (PIEs) and automatically identifying them in text is a challenging task. In this paper, we propose using a BERT-based model fine-tuned with custom objectives, to improve the accuracy of detecting PIEs in text. Our custom loss functions capture two important properties (word cohesion and language translation) to distinguish PIEs from non-PIEs. We conducted several experiments on 7 datasets and showed that incorporating custom objectives while training the model leads to substantial gains. Our models trained using this approach also have better sequence accuracy over DISC, a state-of-the-art PIE detection technique, along with good transfer capabilities.

In this paper we focus on a subclass of multi-word expressions, namely compound formation in German. The automatic detection of compounds is a known problem and we argue that its resolution should be given more urgency in light of a new role we uncovered with respect to ad hoc compound formation: the systematic expression of attitudinal meaning and its potential importance for the down-stream NLP task of stance detection. We demonstrate that ad hoc compounds in German indeed systematically express attitudinal meaning by adducing corpus linguistic and psycholinguistic experimental data. However, an investigation of state-of-the-art dependency parsers and Universal Dependency treebanks shows that German compounds are parsed and annotated very unevenly, so that currently one cannot reliably identify or access ad hoc compounds with attitudinal meaning in texts. Moreover, we report initial experiments with large language models underlining the challenges in capturing attitudinal meanings conveyed by ad hoc compounds. We consequently suggest a systematized way of annotating (and thereby also parsing) ad hoc compounds that is based on positive experiences from within the multilingual ParGram grammar development effort.

pdf (full)
bib (full) Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024

pdf bib abs
Probing Large Language Models from a Human Behavioral Perspective
Xintong Wang | Xiaoyu Li | Xingshan Li | Chris Biemann

Large Language Models (LLMs) have emerged as dominant foundational models in modern NLP. However, the understanding of their prediction processes and internal mechanisms, such as feed-forward networks (FFN) and multi-head self-attention (MHSA), remains largely unexplored. In this work, we probe LLMs from a human behavioral perspective, correlating values from LLMs with eye-tracking measures, which are widely recognized as meaningful indicators of human reading patterns. Our findings reveal that LLMs exhibit a prediction pattern similar to that of humans but distinct from that of Shallow Language Models (SLMs). Moreover, as layer depth increases from the middle layers onward, the correlation coefficients also increase in FFN and MHSA, indicating that the logits within FFN increasingly encapsulate word semantics suitable for predicting tokens from the vocabulary.

pdf bib abs
The Semantic Relations in LLMs: An Information-theoretic Compression Approach
Yu-Hsiang Tseng | Pin-Er Chen | Da-Chen Lian | Shu-Kai Hsieh

Compressibility is closely related to the predictability of the texts from the information theory viewpoint. As large language models (LLMs) are trained to maximize the conditional probabilities of upcoming words, they may capture the subtlety and nuances of the semantic constraints underlying the texts, and texts aligning with the encoded semantic constraints are more compressible than those that do not. This paper systematically tests whether and how LLMs can act as compressors of semantic pairs. Using semantic relations from English and Chinese Wordnet, we empirically demonstrate that texts with correct semantic pairings are more compressible than incorrect ones, measured by the proposed compression advantages index. We also show that, with the Pythia model suite and a fine-tuned model on Chinese Wordnet, compression capacities are modulated by the model’s seen data. These findings are consistent with the view that LLMs encode the semantic knowledge as underlying constraints learned from texts and can act as compressors of semantic information or potentially other structured knowledge.
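The link between predictability and compressibility mentioned above can be illustrated with a small calculation. The sketch below relies only on the standard information-theoretic fact that a token with model probability p costs about -log2(p) bits under an ideal entropy coder. The token probabilities are made-up toy values, and the simple difference in code length is a stand-in for illustration, not the paper's compression advantages index.

```python
import math
from typing import Sequence


def code_length_bits(token_probs: Sequence[float]) -> float:
    """Ideal code length of a text, given the model probability of each token."""
    return -sum(math.log2(p) for p in token_probs)


# Toy probabilities (assumed for illustration): the last token is less
# predictable when the semantic pairing is incorrect.
correct_pair = [0.5, 0.25, 0.5]
incorrect_pair = [0.5, 0.25, 0.125]

# Bits saved when the text matches the semantic constraints the model encodes.
saving = code_length_bits(incorrect_pair) - code_length_bits(correct_pair)
print(code_length_bits(correct_pair), code_length_bits(incorrect_pair), saving)
```

In practice the probabilities would come from a language model's next-token distribution; here they are hard-coded so the example stays self-contained.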

pdf abs
Word Sense Disambiguation as a Game of Neurosymbolic Darts
Tiansi Dong | Rafet Sifa

Word Sense Disambiguation (WSD) is one of the hardest tasks in natural language understanding and knowledge engineering. The glass ceiling of the 80% F1 score was recently reached through supervised learning, enriched by knowledge graphs. Here, we propose a novel neurosymbolic methodology that may push the F1 score above 90%. The core of our methodology is a neurosymbolic sense embedding, in terms of a configuration of nested n-dimensional balls. The central point of a ball preserves well the pre-trained word embeddings learned from data, which partially fixes the locations of balls. Inclusion relations among balls precisely encode symbolic hypernym relations among senses, and enable simple logic deduction among sense embeddings. We trained a Transformer to learn the mapping from a contextualized word embedding to its sense ball embedding, just like playing the game of darts (a game of shooting darts into a dartboard). A series of experiments are carried out using pre-trained n-ball embeddings, which cover around 70% of the training data and 75% of the testing data in the benchmark WSD corpus. Euclidean distance and cosine similarity functions are used as objective functions, separately, and each reaches >95.0% F1 score in the ALL-n-ball dataset. This substantially breaks the glass ceiling of deep learning methods. Future work is discussed to develop a full-fledged neurosymbolic WSD system that substantially outperforms deep learning approaches.
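The inclusion relation among balls reduces to a simple geometric test: one ball lies inside another when the distance between the centers plus the inner radius does not exceed the outer radius. A minimal sketch, assuming each sense is represented by a center vector and a radius; the `Ball` class, the sense names, and the toy values are illustrative and not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Ball:
    """Hypothetical sense embedding: an n-dimensional ball."""
    center: Sequence[float]
    radius: float


def contains(outer: Ball, inner: Ball) -> bool:
    """True if `inner` lies entirely inside `outer`.

    Ball A is inside ball B iff dist(center_A, center_B) + radius_A <= radius_B.
    """
    return math.dist(outer.center, inner.center) + inner.radius <= outer.radius


# Toy 2-D values chosen for illustration only.
animal = Ball(center=(0.0, 0.0), radius=5.0)
bird = Ball(center=(1.0, 1.0), radius=2.0)
stone = Ball(center=(9.0, 0.0), radius=1.0)

print(contains(animal, bird))   # bird is nested in animal (hypernym holds)
print(contains(animal, stone))  # stone is not nested in animal
```

Because containment is transitive, a chain of nested balls encodes a chain of hypernym relations without any extra bookkeeping.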

pdf abs
Open Event Causality Extraction by the Assistance of LLM in Task Annotation, Dataset, and Method
Kun Luo | Tong Zhou | Yubo Chen | Jun Zhao | Kang Liu

Event Causality Extraction (ECE) aims to extract explicit causal relations between event pairs from the text. However, the event boundary deviation and the causal event pair mismatching are two crucial challenges that remain unaddressed. To address the above issues, we propose a paradigm to utilize LLM to optimize the task definition, evolve the datasets, and strengthen our proposed customized Contextual Highlighting Event Causality Extraction framework (CHECE). Specifically in CHECE, we propose an Event Highlighter and an Event Concretization Module, guiding the model to represent the event by a higher-level cluster and consider its causal counterpart in event boundary prediction to deal with event boundary deviation. To target causal event pair mismatching, we propose a Contextual Event Causality Matching mechanism and apply LLM to diversify the content templates, forcing the model to learn causality from context. Experimental results on two ECE datasets demonstrate the effectiveness of our method.

pdf abs
The Need for Grounding in LLM-based Dialogue Systems
Kristiina Jokinen

Grounding is a pertinent part of the design of LLM-based dialogue systems. Although research on grounding has a long tradition, the paradigm shift caused by LLMs has brought the concept to the foreground, in particular in the context of cognitive robotics. To avoid generation of irrelevant or false information, the system needs to ground its utterances in real-world events, and to avoid the statistical parrot effect, the system needs to construct shared understanding of the dialogue context and of the partner’s intents. Grounding and construction of the shared context enable cooperation between the participants, and thus support trustworthy interaction. This paper discusses grounding using neural LLM technology. It aims to bridge neural and symbolic computing on the cognitive architecture level, so as to contribute to a better understanding of how conversational reasoning and collaboration can be linked to LLM implementations to support trustworthy and flexible interaction.

pdf (full)
bib (full) Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024

pdf bib abs
Is a picture of a bird a bird? A mixed-methods approach to understanding diverse human perspectives and ambiguity in machine vision models
Alicia Parrish | Susan Hao | Sarah Laszlo | Lora Aroyo

Human experiences are complex and subjective. This subjectivity is reflected in the way people label images for machine vision models. While annotation tasks are often assumed to deliver objective results, this assumption does not allow for the subjectivity of human experience. This paper examines the implications of subjective human judgments in the behavioral task of labeling images used to train machine vision models. We identify three primary sources of ambiguity: (1) depictions of labels in the images can be simply ambiguous, (2) raters’ backgrounds and experiences can influence their judgments and (3) the way the labeling task is defined can also influence raters’ judgments. By taking steps to address these sources of ambiguity, we can create more robust and reliable machine vision models.

pdf bib abs
Wisdom of Instruction-Tuned Language Model Crowds. Exploring Model Label Variation
Flor Miriam Plaza-del-Arco | Debora Nozza | Dirk Hovy

Large Language Models (LLMs) exhibit remarkable text classification capabilities, excelling in zero- and few-shot learning (ZSL and FSL) scenarios. However, since they are trained on different datasets, performance varies widely across tasks between those models. Recent studies emphasize the importance of considering human label variation in data annotation. However, how this human label variation also applies to LLMs remains unexplored. Given this likely model specialization, we ask: Do aggregate LLM labels improve over individual models (as for human annotators)? We evaluate four recent instruction-tuned LLMs as “annotators” on five subjective tasks across four languages. We use ZSL and FSL setups and label aggregation from human annotation. Aggregations are indeed substantially better than any individual model, benefiting from specialization in diverse tasks or languages. Surprisingly, FSL does not surpass ZSL, as it depends on the quality of the selected examples. However, there seems to be no good information-theoretical strategy to select those. We find that no LLM method rivals even simple supervised models. We also discuss the tradeoffs in accuracy, cost, and moral/ethical considerations between LLM and human annotation.
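Treating several models as "annotators" and aggregating their labels can be sketched with plain majority voting, one common aggregation scheme borrowed from human annotation. The model names and labels below are invented for illustration, and the abstract does not state which aggregation methods the authors actually used.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(labels: List[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def aggregate(predictions: Dict[str, List[str]]) -> List[str]:
    """`predictions` maps a model name to its labels, one label per item."""
    per_item = zip(*predictions.values())
    return [majority_vote(list(votes)) for votes in per_item]


# Hypothetical zero-shot outputs of three models on three items.
predictions = {
    "model_a": ["hate", "ok", "ok"],
    "model_b": ["hate", "hate", "ok"],
    "model_c": ["ok", "hate", "ok"],
}
print(aggregate(predictions))
```

With an even number of models, ties become possible, so a real setup would need an explicit tie-breaking rule rather than the first-seen default used here.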

pdf abs
Revisiting Annotation of Online Gender-Based Violence
Gavin Abercrombie | Nikolas Vitsakis | Aiqi Jiang | Ioannis Konstas

Online Gender-Based Violence is an increasing problem, but existing datasets fail to capture the plurality of possible annotator perspectives or ensure representation of affected groups. In a pilot study, we revisit the annotation of a widely used dataset to investigate the relationship between annotator identities and underlying attitudes and the responses they give to a sexism labelling task. We collect demographic and attitudinal information about crowd-sourced annotators using two validated surveys from Social Psychology. While we do not find any correlation between underlying attitudes and annotation behaviour, ethnicity does appear to be related to annotator responses for this pool of crowd-workers. We also conduct initial classification experiments using Large Language Models, finding that a state-of-the-art model trained with human feedback benefits from our broad data collection to perform better on the new labels. This study represents the initial stages of a wider data collection project, in which we aim to develop a taxonomy of GBV in partnership with affected stakeholders.

pdf abs
A Perspectivist Corpus of Numbers in Social Judgements
Marlon May | Lucie Flek | Charles Welch

With growing interest in the use of large language models, it is becoming increasingly important to understand whose views they express. These models tend to generate output that conforms to majority opinion and are not representative of diverse views. As a step toward building models that can take differing views into consideration, we build a novel corpus of social judgements. We crowdsourced annotations of a subset of the Commonsense Norm Bank that contained numbers in the situation descriptions and asked annotators to replace the number with a range defined by a start and end value that, in their view, correspond to the given verdict. Our corpus contains unaggregated annotations and annotator demographics. We describe our annotation process for social judgements and will release our dataset to support future work on numerical reasoning and perspectivist approaches to natural language processing.

pdf abs
An Overview of Recent Approaches to Enable Diversity in Large Language Models through Aligning with Human Perspectives
Benedetta Muscato | Chandana Sree Mala | Marta Marchiori Manerba | Gizem Gezici | Fosca Giannotti

The varied backgrounds and experiences of human annotators inject different opinions and potential biases into the data, inevitably leading to disagreements. Yet, traditional aggregation methods fail to capture individual judgments since they rely on the notion of a single ground truth. Our aim is to review prior contributions to pinpoint the shortcomings that might cause stereotypical content generation. As a preliminary study, our purpose is to investigate state-of-the-art approaches, primarily focusing on the following two research directions. First, we investigate how adding subjectivity aspects to LLMs might guarantee diversity. We then look into the alignment between humans and LLMs and discuss how to measure it. Considering existing gaps, our review explores possible methods to mitigate the perpetuation of biases targeting specific communities. However, we recognize the potential risk of disseminating sensitive information due to the utilization of socio-demographic data in the training process. These considerations underscore the inclusion of diverse perspectives while taking into account the critical importance of implementing robust safeguards to protect individuals’ privacy and prevent the inadvertent propagation of sensitive information.

pdf abs
Disagreement in Argumentation Annotation
Anna Lindahl

Disagreement, perspective or error? There is a growing discussion against the idea of a unified ground truth in annotated data, as well as the usefulness of such a ground truth and resulting gold standard. In data perspectivism, this issue is exemplified with tasks such as hate speech or sentiment classification in which annotators’ different perspectives are important to include. In this paper we turn to argumentation, a related field which has had less focus from this point of view. Argumentation is difficult to annotate for several reasons, from the more practical parts of deciding where the argumentation begins and ends to questions of how argumentation is defined and what it consists of. Learning more about disagreement is therefore important in order to improve argument annotation and to better utilize argument annotated data. Because of this, we examine disagreement in two corpora annotated with argumentation both manually and computationally. We find that disagreement is often due not to annotation errors or mistakes but to the possibility of multiple interpretations. More specifically, these interpretations can differ over the boundaries, the label, or the existence of argumentation. These results emphasize the need for more thorough analysis of disagreement in data, outside of the more common inter-annotator agreement measures.

pdf abs
Moral Disagreement over Serious Matters: Discovering the Knowledge Hidden in the Perspectives
Anny D. Alvarez Nogales | Oscar Araque

Moral values significantly define decision-making processes, notably on contentious issues like global warming. The Moral Foundations Theory (MFT) delineates morality and aims to reconcile moral expressions across cultures, yet different interpretations arise, posing challenges for computational modeling. This paper addresses the need to incorporate diverse moral perspectives into the learning systems used to estimate morality in text. To do so, it explores how training language models with varied annotator perspectives affects the performance of the learners. Building on top of this, this work also proposes an ensemble method that exploits the diverse perspectives of annotators to construct a more robust moral estimation model. Additionally, we investigate the automated identification of texts that pose annotation challenges, enhancing the understanding of linguistic cues towards annotator disagreement. To evaluate the proposed models we use the Moral Foundations Twitter Corpus (MFTC), a resource that is currently the reference for modeling moral values in computational social sciences. We observe that incorporating the diverse perspectives of annotators into an ensemble model benefits the learning process, showing large improvements in the classification performance. Finally, the results also indicate that instances that convey strong moral meaning are more challenging to annotate.

pdf abs
Perspectives on Hate: General vs. Domain-Specific Models
Giulia Rizzi | Michele Fontana | Elisabetta Fersini

The rise of online hostility, combined with broad social media use, makes it necessary to understand its human impact. However, the process of hate identification is challenging because, on the one hand, the line between healthy disagreement and poisonous speech is not well defined, and, on the other hand, multiple socio-cultural factors or prior beliefs shape people’s perceptions of potentially harmful text. To address disagreements in hate speech identification, Natural Language Processing (NLP) models must capture several perspectives. This paper introduces a strategy based on the Contrastive Learning paradigm for detecting disagreements in hate speech using pre-trained language models. Two approaches are proposed: the General Model, a comprehensive framework, and the Domain-Specific Model, which focuses on more specific hate-related tasks. The source code is available at ://anonymous.4open.science/r/Disagreement-530C.

The move towards preserving judgement disagreements in NLP requires the identification of adequate evaluation metrics. We identify a set of key properties that such metrics should have, and assess the extent to which natural candidates for soft evaluation such as Cross Entropy satisfy such properties. We employ a theoretical framework, supported by a visual approach, by practical examples, and by the analysis of a real-case scenario. Our results indicate that Cross Entropy can produce fairly paradoxical results in some cases, whereas other measures, such as Manhattan distance and Euclidean distance, exhibit more intuitive behavior, at least for the case of binary classification.
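The three measures named above are easy to compare on a binary soft label. The sketch below shows one well-known property: the cross entropy of a prediction that matches the target exactly equals the entropy of the target, so it is not zero, whereas both distances are zero. The distributions are toy values, and this is an illustration of the measures in general, not a reproduction of the specific cases analysed in the paper.

```python
import math
from typing import Sequence


def cross_entropy(target: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross entropy in nats; `eps` guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def manhattan(target: Sequence[float], predicted: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(target, predicted))


def euclidean(target: Sequence[float], predicted: Sequence[float]) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(target, predicted)))


# Toy soft labels: 60% of annotators chose class 0, 40% chose class 1.
target = (0.6, 0.4)
exact = (0.6, 0.4)          # prediction that matches the annotators exactly
overconfident = (0.9, 0.1)  # prediction that exaggerates the majority view

for name, pred in [("exact", exact), ("overconfident", overconfident)]:
    print(name, cross_entropy(target, pred), manhattan(target, pred),
          euclidean(target, pred))
```

A consequence is that cross-entropy scores are only comparable across items after accounting for each item's own label entropy, while the two distances share a common zero point.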

pdf abs
Designing NLP Systems That Adapt to Diverse Worldviews
Claudiu Creanga | Liviu P. Dinu

Natural Language Inference (NLI) is foundational for evaluating language understanding in AI. However, progress has plateaued, with models failing on ambiguous examples and exhibiting poor generalization. We argue that this stems from disregarding the subjective nature of meaning, which is intrinsically tied to an individual’s weltanschauung (which roughly translates to worldview). Existing NLP datasets often obscure this by aggregating labels or filtering out disagreement. We propose a perspectivist approach: building datasets that capture annotator demographics, values, and justifications for their labels. Such datasets would explicitly model diverse worldviews. Our initial experiments with a subset of the SBIC dataset demonstrate that even limited annotator metadata can improve model performance.

pdf abs
The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation
Maja Pavlovic | Massimo Poesio

Recent studies focus on exploring the capability of Large Language Models (LLMs) for data annotation. Our work, firstly, offers a comparative overview of twelve such studies that investigate labelling with LLMs, particularly focusing on classification tasks. Secondly, we present an empirical analysis that examines the degree of alignment between the opinion distributions returned by GPT and those provided by human annotators across four subjective datasets. Our analysis supports a minority of studies that are considering diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.

pdf abs
What Does Perspectivism Mean? An Ethical and Methodological Countercriticism
Mathieu Valette

In this paper, we address the epistemological and ethical break of perspectivism in NLP. First, we propose to consider data annotation from the point of view of the scientific management of annotation work, which is part of the automation process inherent in NLP, in order to ideologically situate the perspectivist paradigm. We then analyze some of the concepts of perspectivism (in particular, truth). Finally, based on this analysis, we formulate a set of proposals aimed at overcoming the observed limitations of corpus annotation in general and perspectivism in particular.

pdf abs
OrigamIM: A Dataset of Ambiguous Sentence Interpretations for Social Grounding and Implicit Language Understanding
Liesbeth Allein | Marie-Francine Moens

Sentences elicit different interpretations and reactions among readers, especially when there is ambiguity in their implicit layers. We present a first-of-its-kind dataset of sentences from Reddit, where each sentence is annotated with multiple interpretations of its meanings, understandings of implicit moral judgments about mentioned people, and reader impressions of its author. Scrutiny of the dataset proves the evoked variability and polarity in reactions. It further shows that readers strongly disagree on both the presence of implied judgments and the social acceptability of the behaviors they evaluate. In all, the dataset offers a valuable resource for socially grounding language and modeling the intricacies of implicit language understanding from multiple reader perspectives.

pdf abs
Linguistic Fingerprint in Transformer Models: How Language Variation Influences Parameter Selection in Irony Detection
Michele Mastromattei | Fabio Massimo Zanzotto

This paper explores the correlation between linguistic diversity, sentiment analysis and transformer model architectures. We aim to investigate how different English variations impact transformer-based models for irony detection. To conduct our study, we used the EPIC corpus to extract five diverse English variation-specific datasets and applied the KEN pruning algorithm on five different architectures. Our results reveal several similarities between optimal subnetworks, which provide insights into the linguistic variations that share strong resemblances and those that exhibit greater dissimilarities. We discovered that optimal subnetworks across models share at least 60% of their parameters, emphasizing the significance of parameter values in capturing and interpreting linguistic variations. This study highlights the inherent structural similarities between models trained on different variants of the same language and also the critical role of parameter values in capturing these nuances.

State-of-the-art conversational AI exhibits a level of sophistication that promises to have profound impacts on many aspects of daily life, including how people seek information, create content, and find emotional support. It has also shown a propensity for bias, offensive language, and false information. Consequently, understanding and moderating safety risks posed by interacting with AI chatbots is a critical technical and social challenge. Safety annotation is an intrinsically subjective task, where many factors—often intersecting—determine why people may express different opinions on whether a conversation is safe. We apply Bayesian multilevel models to surface factors that best predict rater behavior to a dataset of 101,286 annotations of conversations between humans and an AI chatbot, stratified by rater gender, age, race/ethnicity, and education level. We show that intersectional effects involving these factors play significant roles in validating safety in conversational AI data. For example, race/ethnicity and gender show strong intersectional effects, particularly among South Asian and East Asian women. We also find that conversational degree of harm impacts raters of all race/ethnicity groups, but that Indigenous and South Asian raters are particularly sensitive. Finally, we discover that the effect of education is uniquely intersectional for Indigenous raters. Our results underscore the utility of multilevel frameworks for uncovering underrepresented social perspectives.

pdf abs
A Dataset for Multi-Scale Film Rating Inference from Reviews
Frankie Robertson | Stefano Leone

This resource paper introduces a dataset for multi-scale rating inference of film review scores based upon review summaries. The dataset and task are unique in pairing a text regression problem with ratings given on multiple scales, e.g. the A-F letter scale and the 4-point star scale. It retains entity identifiers such as film and reviewer names. The paper describes the construction of the dataset before exploring potential baseline architectures for the task, and evaluating their performance. Baselines based on classifier-per-scale, affine-per-scale, and ordinal regression models are presented and evaluated with the BERT-base backbone. Additional experiments are used to ground a discussion of the different architectures’ merits and drawbacks with regard to explainability and model interpretation.

pdf (full)
bib (full) Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

pdf bib
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
Hend Al-Khalifa | Kareem Darwish | Hamdy Mubarak | Mona Ali | Tamer Elsayed

pdf bib abs
AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language
Seham Alghamdi | Youcef Benkhedda | Basma Alharbi | Riza Batista-Navarro

We are currently witnessing a concerning surge in the spread of hate speech across various social media platforms, targeting individuals or groups based on their protected characteristics such as race, religion, nationality and gender. This paper focuses on the detection of hate type (Task 1) and hate target (Task 2) in the Arabic language. To comprehensively address this problem, we have combined and re-annotated hate speech tweets from existing publicly available corpora, resulting in the creation of AraTar, the first and largest Arabic corpus annotated with support for multi-label classification for both hate speech types and target detection with a high inter-annotator agreement. Additionally, we sought to determine the most effective machine learning-based approach for addressing this issue. To achieve this, we compare and evaluate different approaches, including: (1) traditional machine learning-based models, (2) deep learning-based models fed with contextual embeddings, and (3) fine-tuning language models (LMs). Our results demonstrate that fine-tuning LMs, specifically using AraBERTv0.2-twitter (base), achieved the highest performance, with a micro-averaged F1-score of 84.5% and 85.03%, and a macro-averaged F1-score of 77.46% and 73.15%, for Tasks 1 and 2, respectively.

pdf bib abs
CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Mashael AlDuwais | Hend Al-Khalifa | Abdulmalik AlSalman

Label errors are a common issue in machine learning datasets, particularly for tasks such as Named Entity Recognition. Such label errors might hurt model training, affect evaluation results, and lead to an inaccurate assessment of model performance. In this study, we dived deep into one of the widely adopted Arabic NER benchmark datasets (ANERcorp) and found a significant number of annotation errors, missing labels, and inconsistencies. We therefore conducted empirical research to understand these errors, correct them, and propose a cleaner version of the dataset named CLEANANERCorp. CLEANANERCorp will serve the research community as a more accurate and consistent benchmark.

pdf abs
Munazarat 1.0: A Corpus of Arabic Competitive Debates
Mohammad M. Khader | AbdulGabbar Al-Sharafi | Mohamad Hamza Al-Sioufy | Wajdi Zaghouani | Ali Al-Zawqari

This paper introduces the Corpus of Arabic Competitive Debates (Munazarat). Despite the significance of competitive debating as an activity for fostering critical thinking and promoting dialogue, researchers within the fields of Arabic Natural Language Processing (NLP), linguistics, argumentation studies, and education have access to very limited datasets about competitive debating. At this stage of the study, we introduce Munazarat 1.0, which combines approximately 50 hours of recordings collected from 73 debates at QatarDebate-recognized tournaments, all of which were available on YouTube. Munazarat is a novel specialized Arabic speech corpus, mostly in Modern Standard Arabic (MSA), consisting of diverse debating topics and providing rich metadata for each debate. The transcription of debates was done using Fenek, a speech-to-text Kanari AI tool, and three native Arabic speakers reviewed each transcription file to enhance the quality provided by the machine. The Munazarat 1.0 dataset can be used to train Arabic NLP tools, develop an argumentation mining machine, and analyze Arabic argumentation and rhetoric styles. Keywords: Arabic Speech Corpus, Modern Standard Arabic, Debates

pdf abs
Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
Saied Alshahrani | Hesham Haroon Mohammed | Ali Elfilali | Mariama Njie | Jeanna Matthews

Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially in low-resource languages other than English. Yet, a few research studies have examined the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY), and documented issues in the Egyptian Arabic Wikipedia edition regarding the massive automatic creation of its articles using template-based translation from English to Arabic without human involvement, overwhelming the Egyptian Arabic Wikipedia with articles that not only have low-quality content but also do not represent the Egyptian people, their culture, and their dialect. In this paper, we aim to mitigate the problem of template translation that occurred in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics through exploratory analysis and building automatic detection systems. We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions and utilize the resulting insights to build multivariate machine learning classifiers leveraging articles’ metadata to detect the template-translated articles automatically. We then publicly deploy and host the best-performing classifier as an online application called ‘Egyptian Wikipedia Scanner’ and release the extracted, filtered, labeled, and preprocessed datasets to the research community to benefit from our datasets and the online, web-based detection system.

pdf abs
A Novel Approach for Root Selection in the Dependency Parsing
Sharefah Ahmed Al-Ghamdi | Hend Al-Khalifa | Abdulmalik AlSalman

Although syntactic analysis using the sequence labeling method is promising, it can be problematic when the label sequence does not contain a root label. This can result in errors in the final parse tree when the postprocessing method assumes that the first word is the root. In this paper, we present a novel postprocessing method for BERT-based dependency parsing as sequence labeling. Our method leverages the root’s part-of-speech tag to select a more suitable root for the dependency tree, instead of using the default first token. We conducted experiments on nine dependency treebanks from different languages and domains, and demonstrated that our technique consistently improves the labeled attachment score (LAS) on most of them.
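The postprocessing idea can be sketched as follows: when the predicted label sequence contains no root, choose the token whose part-of-speech tag is most root-like instead of defaulting to the first token. This is only one possible reading of the abstract. The function name, the tag priority order, and the example sentence are assumptions made for illustration; the paper's actual selection rule may differ.

```python
from typing import List, Sequence

# Hypothetical priority order over Universal Dependencies POS tags.
PREFERRED_ROOT_POS = ("VERB", "AUX", "NOUN", "ADJ")


def select_root(relations: List[str], pos_tags: List[str],
                preferred: Sequence[str] = PREFERRED_ROOT_POS) -> int:
    """Return the 0-based index of the token to treat as the root."""
    # If the tagger already predicted a root label, keep the first one.
    if "root" in relations:
        return relations.index("root")
    # Otherwise pick the first token carrying the most root-like POS tag.
    for pos in preferred:
        if pos in pos_tags:
            return pos_tags.index(pos)
    # Fall back to the default behavior: the first token.
    return 0


# "The cat sleeps", with a predicted label sequence that lacks a root.
relations = ["det", "nsubj", "dep"]
pos_tags = ["DET", "NOUN", "VERB"]
print(select_root(relations, pos_tags))
```

The fallback to index 0 keeps the behavior identical to the default strategy whenever no preferred tag is present, so the heuristic can only change the outcome in sentences where it has evidence to do so.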

pdf abs
AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models
Ashwag Alasmari | Sarah Alhumoud | Waad Alshammari

Medical Question Answering (MQA) systems have gained significant attention in recent years due to their potential to enhance medical decision-making and improve patient care. However, most of the research in this field has focused on English-language datasets, limiting the generalizability of MQA systems to non-English speaking regions. This study introduces AraMed, a large-scale Arabic Medical Question Answering dataset addressing the limited resources available for Arabic medical question answering. AraMed comprises 270k question-answer pairs based on health consumer questions submitted to an online medical forum. Experiments using various deep learning models showcase the dataset’s effectiveness, with AraBERT models achieving the highest results; specifically, AraBERTv2 obtained an F1 score of 96.73% in the answer selection task. The comparative analysis of different deep learning models provides insights into their strengths and limitations. These findings highlight the potential of AraMed for advancing Arabic medical question answering research and development.

The Multilingual Corpus of World’s Constitutions (MCWC)
Mo El-Haj | Saad Ezzini

The “Multilingual Corpus of World’s Constitutions” (MCWC) serves as a valuable resource for the NLP community, offering a comprehensive collection of constitutions from around the world. Its focus on data quality and breadth of coverage enables advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. The MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on the MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. The MCWC’s rich multilingual content and rigorous data quality standards raise the bar for legal text analysis and inspire innovation in the NLP community, opening new avenues for studying constitutional texts and multilingual data analysis.

TafsirExtractor: Text Preprocessing Pipeline preparing Classical Arabic Literature for Machine Learning Applications
Carl Kruse | Sajawel Ahmed

In this paper, we present a comprehensive tool for preprocessing Classical Arabic (CA) literature in the field of historical exegetical studies for machine learning (ML) evaluations. Most recent ML models require the training data to be in a specific format (e.g. XML, TEI, CoNLL) before it can be used for ML applications such as Named Entity Recognition (NER) or Topic Modeling (TM). We report on how our method works and how it can be applied by other researchers with similar endeavors. Thereby, the importance of this comprehensive preprocessing tool is demonstrated, as this novel approach has no predecessors for CA yet. We achieve results that enable the training of current ML models, leading to state-of-the-art performance for NER and TM on CA literature. We make our tool, along with its source code and data, freely available for the Natural Language Processing (NLP) research community.

Advancing the Arabic WordNet: Elevating Content Quality
Abed Alhakim Freihat | Hadi Mahmoud Khalilia | Gábor Bella | Fausto Giunchiglia

High-quality WordNets are crucial for achieving high-quality results in NLP applications that rely on such resources. However, the wordnets of most languages suffer from serious issues of correctness and completeness with respect to the words and word meanings they define, such as incorrect lemmas, missing glosses and example sentences, or an inadequate, Western-centric representation of the morphology and the semantics of the language. Previous efforts have largely focused on increasing lexical coverage while ignoring other qualitative aspects. In this paper, we focus on the Arabic language and introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality. As a result, we updated more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors. In order to address issues of language diversity and untranslatability, we also extended the wordnet structure with new elements: phrasets and lexical gaps.

Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah A. Alrashoudi | Omar Said Alshahri | Hend Al-Khalifa

Many under-resourced languages lack computational resources for automatic speech recognition (ASR) due to data scarcity issues. This makes developing accurate ASR models challenging. Shehri or Jibbali, spoken in Oman, lacks extensive annotated speech data. This paper aims to improve an ASR model for this under-resourced language. We collected a Shehri (Jibbali) speech corpus and utilized transfer learning by fine-tuning pre-trained ASR models on this dataset. Specifically, models like Wav2Vec2.0, HuBERT and Whisper were fine-tuned using techniques like parameter-efficient fine-tuning. Evaluation using word error rate (WER) and character error rate (CER) showed that the Whisper model, fine-tuned on the Shehri (Jibbali) dataset, significantly outperformed other models, with the best results from Whisper-medium achieving 3.5% WER. This demonstrates the effectiveness of transfer learning for resource-constrained tasks, showing high zero-shot performance of pre-trained models.

OSACT6 Dialect to MSA Translation Shared Task Overview
Ashraf Hatim Elneima | AhmedElmogtaba Abdelmoniem Ali Abdelaziz | Kareem Darwish

This paper presents the Dialectal Arabic (DA) to Modern Standard Arabic (MSA) Machine Translation (MT) shared task in the sixth Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6). The paper describes the creation of the validation and test data and the metrics used, and provides a brief overview of the submissions to the shared task. In all, 29 teams signed up and 6 teams made actual submissions. The teams used a variety of datasets and approaches to build their MT systems. The most successful submission involved using zero-shot and n-shot prompting of ChatGPT.

OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany | Nour Rabih | Ibrahim Mohammed | Abdul Waheed | Bhiksha Raj

We present the results of Shared Task “Dialect to MSA Translation”, which tackles challenges posed by the diverse Arabic dialects in machine translation. Covering Gulf, Egyptian, Levantine, Iraqi and Maghrebi dialects, the task offers 1001 sentences in both MSA and dialects for fine-tuning, alongside 1888 blind test sentences. Leveraging GPT-3.5, a state-of-the-art language model, our method achieved a BLEU score of 29.61. This endeavor holds significant implications for Neural Machine Translation (NMT) systems targeting low-resource languages with linguistic variation. Additionally, negative experiments involving fine-tuning AraT5 and No Language Left Behind (NLLB) using the MADAR Dataset resulted in BLEU scores of 10.41 and 11.96, respectively. Future directions include expanding the dataset to incorporate more Arabic dialects and exploring alternative NMT architectures to further enhance translation capabilities.

The translation between Modern Standard Arabic (MSA) and the various Arabic dialects presents unique challenges due to the significant linguistic, cultural, and contextual variations across the regions where Arabic is spoken. This paper presents a system description of our participation in the OSACT 2024 Dialect to MSA Translation Shared Task. We explain our comprehensive approach which combines data augmentation techniques using generative pre-trained transformer models (GPT-3.5 and GPT-4) with fine-tuning of AraT5 V2, a model specifically designed for Arabic translation tasks. Our methodology has significantly expanded the training dataset, thus improving the model’s performance across five major Arabic dialects, namely Gulf, Egyptian, Levantine, Iraqi, and Maghrebi. We have rigorously evaluated our approach, using BLEU score, to ensure translation accuracy, fluency, and the preservation of meaning. Our results showcase the effectiveness of our refined models in addressing the challenges posed by diverse Arabic dialects and Modern Standard Arabic (MSA), achieving a BLEU score of 80% on the validation test set and 22.25% on the blind test set. However, it’s important to note that while utilizing a larger dataset, such as Madar + Dev, resulted in significantly higher evaluation BLEU scores, the performance on the blind test set was relatively lower. This observation underscores the importance of dataset size in model training, revealing potential limitations in generalization to unseen data due to variations in data distribution and domain mismatches.

LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task
AhmedElmogtaba Abdelmoniem Ali Abdelaziz | Ashraf Hatim Elneima | Kareem Darwish

This paper presents our approach to the Dialect to Modern Standard Arabic (MSA) Machine Translation shared task, conducted as part of the sixth Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6). Our primary contribution is the development of a novel dataset derived from the Saudi Audio Dataset for Arabic (SADA), an Arabic audio corpus. By employing an automated method utilizing ChatGPT 3.5, we translated the dialectal Arabic texts to their MSA equivalents. This process not only yielded a unique and valuable dataset but also showcased an efficient method for leveraging language models in dataset generation. Utilizing this dataset, alongside additional resources, we trained a machine translation model based on the Transformer architecture. Through systematic experimentation with model configurations, we achieved notable improvements in translation quality. Our findings highlight the significance of LLM-assisted dataset creation methodologies and their impact on advancing machine translation systems, particularly for languages with considerable dialectal diversity like Arabic.

Sirius_Translators at OSACT6 2024 Shared Task: Fin-tuning Ara-T5 Models for Translating Arabic Dialectal Text to Modern Standard Arabic
Salwa Saad Alahmari

This paper presents the findings from our participation in the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6) in 2024. Our specific focus was on the second task (Task 2), which involved translating text at the sentence level from five distinct Dialectal Arabic (DA) varieties (Gulf, Egyptian, Levantine, Iraqi, and Maghrebi) into Modern Standard Arabic (MSA). Our team, Sirius_Translators, fine-tuned four AraT5 models, namely AraT5 base, AraT5v2-base-1024, AraT5-MSA-Small, and AraT5-MSA-Base, for the Arabic machine translation (MT) task. These models were fine-tuned using a variety of parallel corpora containing Dialectal Arabic and Modern Standard Arabic. Based on the evaluation results of OSACT6 2024 Shared Task 2, our fine-tuned AraT5v2-base-1024 model achieved an overall BLEU score of 21.0 on the development (Dev) set and 9.57 on the test set.

AraT5-MSAizer: Translating Dialectal Arabic to MSA
Murhaf Fares

This paper outlines the process of training the AraT5-MSAizer model, a transformer-based neural machine translation model aimed at translating five regional Arabic dialects into Modern Standard Arabic (MSA). Developed for Task 2 of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools, the model attained a BLEU score of 21.79% on the test set associated with this task.

This research delves into the issue of hallucination detection in Large Language Models (LLMs) using Arabic language datasets. As LLMs are increasingly being used in various applications, the phenomenon of hallucination, which refers to generating factually inaccurate content despite grammatical coherence, poses significant challenges. We participate in the OSACT 2024 Shared-task (Detection of Hallucination in Arabic Factual Claims Generated by ChatGPT and GPT4). We explore various approaches for detecting and mitigating hallucination, using models such as GPT-4, Mistral, and Gemini within a novel experimental framework. Our research findings reveal that the effectiveness of these models in classifying claims into Fact-Claim, Fact-Improvement, and Non-Fact categories varies greatly, underscoring the complexities of addressing hallucination in morphologically rich languages. The study emphasizes the need for advanced modelling and training strategies to enhance the reliability and factual accuracy of LLM-generated content, laying the groundwork for future explorations in mitigating hallucination risks. In our experiments, we achieved an F1 score of 0.54 with GPT-4.

Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024
Darja Fiser | Maria Eskevich | David Bordon

Parliamentary Discourse Research in Political Science: Literature Review
Jure Skubic | Darja Fišer

One of the major research interests for political science has always been the study of political discourse and parliamentary debates. This literature review offers an overview of the most prominent research methods used in political science when studying political discourse. We identify the commonalities and the differences of the political science and corpus-driven approaches and show how parliamentary corpora and corpus-based approaches could be successfully integrated in political science research.

Compiling and Exploring a Portuguese Parliamentary Corpus: ParlaMint-PT
José Aires | Aida Cardoso | Rui Pereira | Amalia Mendes

As part of the project ParlaMint II, a new corpus of the sessions of the Portuguese Parliament from 2015 to 2022 has been compiled, encoded and annotated following the ParlaMint guidelines. We report on the contents of the corpus and on the specific nature of the political settings in Portugal during the time period covered. Two subcorpora were designed to enable comparisons of political speeches before and after the COVID-19 pandemic. We discuss the pipeline applied to download the original texts, ensure their preprocessing and encoding in XML, and the final step of annotation. This new resource covers a period of changes in the political system in Portugal and will be an important source of data for political and social studies. Finally, we have explored the political stance on immigration in the ParlaMint-PT corpus.

Gender, Speech, and Representation in the Galician Parliament: An Analysis Based on the ParlaMint-ES-GA Dataset
Adina I. Vladu | Elisa Fernández Rei | Carmen Magariños | Noelia García Díaz

This paper employs the ParlaMint-ES-GA dataset to scrutinize the intersection of gender, speech, and representation within the Parliament of Galicia, an autonomous region located in North-western Spain. The research questions center around the dynamics of women’s participation in parliamentary proceedings. Contrary to numerical parity, we explore whether increased female presence in the parliament correlates with equitable access to the floor. Analyzing parliamentary proceedings from 2015 to 2022, our quantitative study investigates the relationship between the legislative body’s composition, the number of speeches by Members of Parliament (MPs), and references made by MPs in their speeches. The findings reveal nuances in gender representation and participation, challenging assumptions about proportional access to parliamentary discourse.

Bulgarian ParlaMint 4.0 corpus as a testset for Part-of-speech tagging and Named Entity Recognition
Petya Osenova | Kiril Simov

The paper discusses some fine-tuned models for the tasks of part-of-speech tagging and named entity recognition. The fine-tuning was performed on the basis of an existing BERT pre-trained model and two newly pre-trained BERT models for Bulgarian that are cross-tested on the domain of the Bulgarian part of the ParlaMint corpora as a new domain. In addition, a comparison has been made between the performance of the new fine-tuned BERT models and the available results from the Stanza-based model which the Bulgarian part of the ParlaMint corpora has been annotated with. The observations show the weaknesses in each model as well as the common challenges.

Resources and Methods for Analysing Political Rhetoric and Framing in Parliamentary Debates
Ines Rehbein

Recent work in political science has made extensive use of NLP methods to produce evidential support for a variety of analyses, for example, inferring an actor’s ideological positions from textual data or identifying the polarisation of the political discourse over the last decades. Most work has employed variations of lexical features extracted from text or has learned latent representations in a mostly unsupervised manner. While such approaches have the potential to enable political analyses at scale, they are often limited by their lack of interpretability. In the talk, I will instead look at semantic and pragmatic representations of political rhetoric and ideological framing and present several case studies that showcase how linguistic annotation and the use of NLP methods can help to investigate different framing strategies in parliamentary debates. The first part of the talk investigates populist framing strategies, specifically, the use of pronouns to create in- and out-groups and the identification of people-centric messages. The second part of the presentation focusses on framing strategies on the pragmatic level.

PTPARL-V: Portuguese Parliamentary Debates for Voting Behaviour Study
Afonso Sousa | Henrique Lopes Cardoso

We present a new dataset, PTPARL-V, that provides valuable insight for advancing discourse analysis of parliamentary debates in Portuguese. This is achieved by processing the open-access information available at the official Portuguese Parliament website and scraping the information from the debate minutes’ PDFs contained therein. Our dataset includes interventions from 547 different deputies of all major Portuguese parties, from 736 legislative initiatives spanning five legislatures from 2005 to 2021. We present a statistical analysis of the dataset compared to other publicly available Portuguese parliamentary debate corpora. Finally, we provide baseline performance analysis for voting behaviour classification.

Polish Round Table Corpus
Maciej Ogrodniczuk | Ryszard Tuora | Beata Wójtowicz

The paper describes the process of preparation of the Polish Round Table Corpus (Pol. Korpus Okrągłego Stołu), a new resource documenting negotiations taking place in 1989 between the representatives of the communist government of the People’s Republic of Poland and the Solidarity opposition. The process consisted of OCR of graphical transcripts of the talks stored in the form of parliament-like stenographic transcripts, carrying out their manual correction and making them available for search in a concordancer currently used for standard parliamentary transcripts.

In this paper, we use automatic language identification to investigate the usage of different languages in the plenary sessions of the Parliament of Finland. Finland has two national languages, Finnish and Swedish. The plenary sessions are published as transcriptions of speeches in Parliament, reflecting the language the speaker used. In addition to charting out language use, we demonstrate how language identification can be used to audit the quality of the dataset. On the one hand, we made slight improvements to our language identifier; on the other hand, we made a list of improvement suggestions for the next version of the dataset.

Exploring Word Formation Trends in Written, Spoken, Translated and Interpreted European Parliament Data – A Case Study on Initialisms in English and German
Katrin Menzel

This paper demonstrates the research potential of a unique European Parliament dataset for register studies, contrastive linguistics, translation and interpreting studies. The dataset consists of parallel data for several European languages, including written source texts and their translations as well as spoken source texts and the transcripts of their simultaneously interpreted versions. The paper presents a cross-linguistic, corpus-based case study on a word formation phenomenon in these European Parliament data that are enriched with various linguistic annotations and metadata as well as with information-theoretic surprisal scores. It addresses the questions of how initialisms are used across languages and production modes in the English and German corpus sections of these European Parliament data, whether there is a correlation between the use of initialisms and the use of their corresponding multiword full forms in the analysed corpus sections and what insights on the informativity and possible processing difficulties of initialisms we can gain from an analysis of information-theoretic surprisal values. The results show that English written originals and German translations are the corpus sections with the highest frequencies of initialisms. The majority of cross-language transfer situations lead to fewer initialisms in the target texts than in the source texts. In the English data, there is a positive correlation between the frequency of initialisms and the frequency of the respective full forms. There is a similar correlation in the German data, apart from the interpreted data. Additionally, the results show that initialisms represent peaks of information with regard to their surprisal values within their segments. Particularly the German data show higher surprisal values of initialisms in mediated language than in non-mediated discourse types, which indicates that in German mediated discourse, initialisms tend to be used in less conventionalised textual contexts than in English.

Quantitative Analysis of Editing in Transcription Process in Japanese and European Parliaments and its Diachronic Changes
Tatsuya Kawahara

In making official transcripts for meeting records in Parliament, some edits are made from faithful transcripts of utterances for linguistic correction and formality. Classification of these edits is provided in this paper, and quantitative analysis is conducted for Japanese and European Parliamentary meetings by comparing the faithful transcripts of audio recordings against the official meeting records. Different trends are observed between the two Parliaments due to the nature of the language used and the meeting style. Moreover, its diachronic changes in the Japanese transcripts are presented, showing a significant decrease in the edits over the past decades. It was found that a majority of edits in the Japanese Parliament (Diet) simply remove fillers and redundant words, keeping the transcripts as verbatim as possible. This property is useful for the evaluation of the automatic speech transcription system, which was developed by us and has been used in the Japanese Parliament.

In this paper, we test the efficacy of using GPT-4 to annotate a dataset that is then used to train a BERT classifier for emotion analysis. Manual data annotation is often a laborious and expensive task, and emotion annotation, specifically, has proved difficult even for expert annotators. We show that using GPT-4 can produce results as good as manual data annotation while saving a lot of time and money. We train a BERT classifier on our automatically annotated dataset and get results that outperform a BERT classifier that is trained on machine translated data. Our paper shows how Large Language Models can be used to work with and analyse parliamentary corpora.

Making Parliamentary Debates More Accessible: Aligning Video Recordings with Text Proceedings in Open Parliament TV
Olivier Aubert | Joscha Jäger

We describe the Open Parliament TV project and, more specifically, the work we have done on the alignment of video recordings with text proceedings of the German Bundestag. This has allowed us to create a comprehensive and accessible platform for citizens and journalists to engage with parliamentary proceedings. Through our diligent work, we have ensured that the video recordings accurately correspond to the matching text, providing a seamless and synchronised experience for users. In this article, we describe the issues we were faced with and the methods we used to solve them, along with the visualisations we developed to investigate and assess the content.

Russia and Ukraine through the Eyes of ParlaMint 4.0: A Collocational CADS Profile of Spanish and British Parliamentary Discourses
Maria Calzada Perez

This article resorts to mixed methods to examine British and Spanish parliamentary discourse. The quantitative corpus-assisted (lexical priming) theory and data are complemented by the qualitative discourse historical approach. Two CLARIN ParlaMint corpora – ParlaMint-GB and ParlaMint-ES – are queried in the analysis, which focuses on English (“Russia” and “Ukraine”) and Spanish (“Rusia” and “Ucrania”) nodes and collocations. In sum, the analysis sketches a brief profile of each corpus. The British House of Commons is more homogenous, strongly associating “Russia” and “Ukraine” with their participation in the war. Furthermore, this chamber shows a greater interest in “Russia”. The Spanish Congreso de los Diputados indicates greater quantitative differences (heterogeneity). Here, “Russia” clearly transcends its role as a military contender and is also portrayed as an economic competitor for the West. Unlike in Britain, the Spanish lower house shows more mentions of “Ucrania”, which is assigned just one role – as an invasion victim. In conclusion, the productivity of corpus-assisted mixed methods is confirmed along with the precious value of the ParlaMint constellation.

We introduce a dataset on political orientation and power position identification. The dataset is derived from ParlaMint, a set of comparable corpora of transcribed parliamentary speeches from 29 national and regional parliaments. We introduce the dataset, provide the reasoning behind some of the choices during its creation, present statistics on the dataset, and, using a simple classifier, some baseline results on predicting political orientation on the left-to-right axis, and on power position identification, i.e., distinguishing between the speeches delivered by governing coalition party members from those of opposition party members.

IMPAQTS: a multimodal corpus of parliamentary and other political speeches in Italy (1946-2023), annotated with implicit strategies
Federica Cominetti | Lorenzo Gregori | Edoardo Lombardi Vallauri | Alessandro Panunzi

The paper introduces the IMPAQTS corpus of Italian political discourse, a multimodal corpus of around 2.65 million tokens including 1,500 speeches uttered by 150 prominent politicians spanning from 1946 to 2023. Covering the entire history of the Italian Republic, the collection exhibits a non-homogeneous consistency that progressively increases in quantity towards the present. The corpus is balanced according to textual and socio-linguistic criteria and includes different types of speeches. The sociolinguistic features of the speakers are carefully considered to ensure representation of Republican Italian politicians. For each speaker, the corpus contains 4 parliamentary speeches, 2 rallies, 1 party assembly, and 3 statements (in person or broadcasted). Parliamentary speeches therefore constitute the largest section of the corpus (40% of the total), enabling direct comparison with other types of political speeches. The collection procedure, including details relevant to the transcription protocols, and the processing pipeline are described. The corpus has been pragmatically annotated to include information about the implicitly conveyed questionable contents, paired with their explicit paraphrasis, providing the largest Italian collection of ecologic examples of linguistic implicit strategies. The adopted ontology of linguistic implicitness and the fine-grained annotation scheme are presented in detail.

ParlaMint Ngram viewer: Multilingual Comparative Diachronic Search Across 26 Parliaments
Asher de Jong | Taja Kuzman | Maik Larooij | Maarten Marx

We demonstrate the multilingual search engine and Ngram viewer that was built on top of the ParlaMint dataset using the recently available translations. The user interface and search engine results page (SERP) are carefully designed for querying parliamentary proceedings and for the intended use by citizens, journalists and political scholars. Demo at https://debateabase.wooverheid.nl. Keywords: Multilingual Search, Parliamentary Proceedings, Ngram Viewer, Machine Translation

Investigating Political Ideologies through the Greek ParlaMint corpus
Maria Gavriilidou | Dimitris Gkoumas | Stelios Piperidis | Prokopis Prokopidis

This paper has two objectives: to present (a) the creation of ParlaMint-GR, the Greek part of the ParlaMint corpora of debates in the parliaments of Europe, and (b) preliminary results on its comparison with a corpus of Greek party manifestos, aiming at the investigation of the ideologies of the Greek political parties and members of the Parliament. Additionally, a gender-related comparison is explored. The creation of the ParlaMint-GR corpus is discussed, together with the solutions adopted for various challenges faced. The corpus of party manifestos, available through CLARIN:EL, serves for a comparative study with the corpus of speeches delivered by the members of the Greek Parliament, with the aim of identifying the ideological positions of parties and politicians.

ParlaMint in TEITOK
Maarten Janssen | Matyáš Kopp

This paper describes the ParlaMint 4.0 parliamentary corpora as made available in TEITOK at LINDAT. The TEITOK interface makes it possible to search through the corpus, to view each session in a readable manner, and to explore the names in the corpus. The interface does not present any new data, but provides an access point to the ParlaMint corpus that is less oriented to linguistic use only, and more accessible for the general public or researchers from other fields.

Historical Parliamentary Corpora Viewer
Alenka Kavčič | Martin Stojanoski | Matija Marolt

Historical parliamentary debates offer a window into the past and provide valuable insights for academic research and historical analysis. This paper presents a novel web application tailored to the exploration of historical parliamentary corpora in the context of Slovenian national identity. The developed web viewer enables advanced search functions within collections of historical parliamentary records and has an intuitive and user-friendly interface. Users can enter search terms and apply filters to refine their search results. The search function allows keyword and phrase searching, including the ability to search by delegate and place names. It is also possible to search for translations of the text by selecting the desired languages. The search results are displayed with a preview of the proceedings and highlighted phrases that match the search query. To review a specific record, the full PDF document can be displayed in a separate view, allowing the user to scroll through the PDF document and search the content. In addition, the two corpora of Slovenian historical records integrated into the viewer—the Carniolan Provincial Assembly Corpus and the Parliamentary Corpus of the First Yugoslavia—are described and an insight into the corresponding preparation processes is provided.

The dbpedia R Package: An Integrated Workflow for Entity Linking (for ParlaMint Corpora)
Christoph Leonhardt | Andreas Blaette

Entity Linking is a powerful approach for linking textual data to established structured data such as survey data or administrative data. However, in the realm of social science, the approach is not widely adopted. We argue that this is, at least in part, due to specific setup requirements which constitute high barriers for usage and workflows which are not well integrated into analytical scenarios commonly deployed in social science research. We introduce the dbpedia R package to make the approach more accessible. It has a focus on functionality that is easily adaptable to the needs of social scientists working with textual data, including the support of different input formats, limited setup costs and various output formats. Using a ParlaMint corpus, we show the applicability and flexibility of the approach for parliamentary debates.

pdf abs
Video Retrieval System Using Automatic Speech Recognition for the Japanese Diet
Mikitaka Masuyama | Tatsuya Kawahara | Kenjiro Matsuda

The Japanese House of Representatives, one of the two houses of the Diet, has adopted an Automatic Speech Recognition (ASR) system, which directly transcribes parliamentary speech with an accuracy of 95 percent. The ASR system also provides a timestamp for every word, which enables retrieval of the video segments of the parliamentary meetings. The video retrieval system we have developed allows one to pinpoint and play the parliamentary video clips corresponding to the meeting minutes by keyword search. In this paper, we provide an overview of the system and suggest various ways it can be used. The system is currently being extended to cover meetings of local governments, which will allow us to investigate dialectal linguistic variations.

pdf abs
One Year of Continuous and Automatic Data Gathering from Parliaments of European Union Member States
Ota Mikušek

This paper provides insight into automatic parliamentary corpora development. One year ago, I created a simple set of tools designed to continuously and automatically download, process, and create corpora from speeches in the parliaments of European Union member states. Despite the existence of numerous corpora providing speeches from European Union parliaments, the tools are more focused on collecting and building such corpora with minimal human interaction. These tools have been operating continuously for over a year, gathering parliamentary data and extending corpora, which together have more than one billion words. However, the process of maintaining these tools has brought unforeseen challenges, including issues such as being blocked by some parliaments due to overloading the parliament with requests, the inability to access the most recent data of a parliament, and effectively managing interrupted connections. Additionally, potential problems that may arise in the future are provided, along with possible solutions. These include problems with data loss prevention and adaptation to changes in the sources from which speeches are downloaded.

pdf abs
Government and Opposition in Danish Parliamentary Debates
Costanza Navarretta | Dorte Haltrup Hansen

In this paper, we address government and opposition speeches made by the Danish Parliament’s members from 2014 to 2022. We use the linguistic annotations and metadata in ParlaMint-DK, one of the ParlaMint corpora, to investigate some characteristics of the transcribed speeches made by government and opposition and test how well classifiers can identify the speeches delivered by these groups. Our analyses confirm that there are differences in the speeches made by government and opposition, e.g., in the frequency of some modality expressions. In our study, we also include parties that neither directly support nor oppose the government, the “other” group. The best-performing classifier for identifying speeches made by parties in government, in opposition or in “other” is a transformer with a pre-trained Danish BERT model, which gave an F1-score of 0.64. The same classifier obtained an F1-score of 0.77 on the binary identification of speeches made by government or opposition parties.

pdf abs
A new Resource and Baselines for Opinion Role Labelling in German Parliamentary Debates
Ines Rehbein | Simone Paolo Ponzetto

Detecting opinions, their holders and targets in parliamentary debates provides an interesting layer of analysis, for example, to identify frequent targets of opinions for specific topics, actors or parties. In the paper, we present GePaDe-ORL, a new dataset for German parliamentary debates where subjective expressions, their opinion holders and targets have been annotated. We describe the annotation process and report baselines for predicting those annotations in our new dataset.

pdf abs
ParlaMint Widened: a European Dataset of Freedom of Information Act Documents (Position Paper)
Gerda Viira | Maarten Marx | Maik Larooij

This position paper makes an argument for creating a corpus similar to that of ParlaMint, not consisting of parliamentary proceedings, but of documents released under Freedom of Information Acts. Over 100 countries, including almost all European countries, have such an act. Bringing these now dispersed document collections together in a uniform format into one portal will result in a valuable language resource. Besides that, our Dutch experience shows that such new, larger exposure of these documents leads to efforts to improve their quality at the sources. Keywords: Freedom of Information Act, ParlaMint, Government Data

pdf (full)
bib (full) Proceedings of the Second Workshop on Natural Language Processing for Political Sciences @ LREC-COLING 2024

pdf bib
Proceedings of the Second Workshop on Natural Language Processing for Political Sciences @ LREC-COLING 2024
Haithem Afli | Houda Bouamor | Cristina Blasi Casagran | Sahar Ghannay

pdf bib abs
Deciphering Political Entity Sentiment in News with Large Language Models: Zero-Shot and Few-Shot Strategies
Alapan Kuila | Sudeshna Sarkar

Sentiment analysis plays a pivotal role in understanding public opinion, particularly in the political domain where the portrayal of entities in news articles influences public perception. In this paper, we investigate the effectiveness of Large Language Models (LLMs) in predicting entity-specific sentiment from political news articles. Leveraging zero-shot and few-shot strategies, we explore the capability of LLMs to discern sentiment towards political entities in news content. Employing a chain-of-thought (COT) approach augmented with rationale in few-shot in-context learning, we assess whether this method enhances sentiment prediction accuracy. Our evaluation on sentiment-labeled datasets demonstrates that LLMs outperform fine-tuned BERT models in capturing entity-specific sentiment. We find that learning in-context significantly improves model performance, while the self-consistency mechanism enhances consistency in sentiment prediction. Despite the promising results, we observe inconsistencies in the effectiveness of the COT prompting method. Overall, our findings underscore the potential of LLMs in entity-centric sentiment analysis within the political news domain and highlight the importance of suitable prompting strategies and model architectures.

pdf bib abs
Event Detection in the Socio Political Domain
Emmanuel Cartier | Hristo Tanev

In this paper we present two approaches for the detection of socio-political events: the first is based on manually crafted keyword combinations and the second is based on a BERT classifier. We compare the performance of the two systems on a dataset of socio-political events. Interestingly, the systems demonstrate complementary performance: each shows its best accuracy on non-overlapping sets of event types. In the evaluation section we provide insights on the effect of taxonomy mapping on the event detection evaluation. In the related work section, we also review the most important resources and approaches for event extraction in recent years.

pdf abs
Multi-Dimensional Insights: Annotated Dataset of Stance, Sentiment, and Emotion in Facebook Comments on Tunisia’s July 25 Measures
Sanaa Laabar | Wajdi Zaghouani

On July 25, 2021, Tunisian President Kais Saied announced the suspension of parliament and dismissal of Prime Minister Hichem Mechichi, a move that sparked intense public debate. This study investigates Tunisian public opinion regarding these events by analyzing a corpus of 7,535 Facebook comments collected from the official Tunisian presidency page, specifically the post announcing the July 25 measures. A team of three annotators labeled a subset of 5,000 comments, categorizing each comment’s political stance (supportive, opposing, or neutral), sentiment (positive, negative, or neutral), emotions, presence of hate speech, aggressive tone, and racism. The inter-annotator agreement, measured by Cohen’s kappa, was 0.61, indicating substantial consensus. The analysis reveals that a majority of commenters supported President Saied’s actions, outnumbering those who opposed or took a neutral stance. Moreover, the overall sentiment expressed in the comments was predominantly positive. This study provides valuable insights into the complex landscape of public opinion in Tunisia during a crucial moment in the country’s ongoing political transformation, highlighting the role of social media as a platform for political discourse and engagement.
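The agreement statistic mentioned above can be illustrated with a minimal sketch of Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The abstract does not say how scores were combined across the three annotators, so the two-annotator setup, the function name `cohens_kappa`, and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[l] * count_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy stance labels (hypothetical, not taken from the dataset).
annotator_1 = ["support", "support", "oppose", "oppose"]
annotator_2 = ["support", "support", "oppose", "support"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

With more than two annotators, one common option is to average the pairwise values; whether the study did so is not stated in the abstract.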

pdf abs
Masking Explicit Pro-Con Expressions for Development of a Stance Classification Dataset on Assembly Minutes
Tomoyosi Akiba | Yuki Gato | Yasutomo Kimura | Yuzu Uchida | Keiichi Takamaru

In this paper, a new dataset for Stance Classification based on assembly minutes is introduced. We develop it by using publicly available minutes taken from diverse Japanese local governments including prefectural, city, and town assemblies. So that the task is to predict a stance from the content of a politician’s utterance without explicit stance expressions, predefined words that directly convey the speaker’s stance in the utterance are replaced by a special token. Those masked words are also used to assign a gold label, either agreement or disagreement, to the utterance. In total, we automatically constructed 15,018 instances from 47 Japanese local governments. The dataset is used in the shared Stance Classification task evaluated in the NTCIR-17 QA-Lab-PoliInfo-4, and is now publicly available. Since the construction method of the dataset is automatic, we can still apply it to obtain more instances from other Japanese local governments.
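The masking-and-labelling idea described in this abstract can be sketched roughly as follows. The cue words, the `[STANCE]` token, and the function name `mask_and_label` are hypothetical stand-ins (in English rather than Japanese); the abstract does not list the predefined words or name the special token.

```python
import re

# Hypothetical cue lists; the dataset's actual predefined words are not given here.
AGREE_CUES = ["agree", "support"]
DISAGREE_CUES = ["disagree", "oppose"]
MASK = "[STANCE]"  # assumed name for the special token

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, AGREE_CUES + DISAGREE_CUES)) + r")\b",
    re.IGNORECASE,
)


def mask_and_label(utterance):
    """Replace explicit stance cues with MASK and derive a label from them."""
    found = set()

    def _replace(match):
        word = match.group(0).lower()
        found.add("agreement" if word in AGREE_CUES else "disagreement")
        return MASK

    masked = _PATTERN.sub(_replace, utterance)
    # Keep a label only when the cues point one way; otherwise leave it unset.
    label = found.pop() if len(found) == 1 else None
    return masked, label


masked, label = mask_and_label("I agree with the proposed budget.")
```

Returning no label for utterances with conflicting or missing cues is a design choice of this sketch, not something the abstract specifies.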

pdf abs
Analysing Pathos in User-Generated Argumentative Text
Natalia Evgrafova | Veronique Hoste | Els Lefever

While persuasion has been extensively examined in the context of politicians’ speeches, there exists a notable gap in the understanding of the role of pathos in user-generated argumentation. This paper presents an exploratory study into the pathos dimension of user-generated arguments and formulates ideas on how pathos could be incorporated in argument mining. Using existing sentiment and emotion detection tools, this research aims to obtain insights into the role of emotion in argumentative public discussion on controversial topics, explores the connection between sentiment and stance, and detects frequent emotion-related words for a given topic.

pdf abs
Knowledge Graph Representation for Political Information Sources
Tinatin Osmonova | Alexey Tikhonov | Ivan P. Yamshchikov

With the rise of computational social science, many scholars utilize data analysis and natural language processing tools to analyze social media, news articles, and other accessible data sources for examining political and social discourse. In particular, the study of the emergence of echo-chambers due to the dissemination of specific information has become a topic of interest in mixed methods research areas. In this paper, we analyze data collected from two news portals, Breitbart News (BN) and New York Times (NYT), to prove the hypothesis that the formation of echo-chambers can be partially explained at the level of individual information consumption rather than a collective topology of individuals’ social networks. Our research findings are presented through knowledge graphs, utilizing a dataset spanning 11.5 years gathered from BN and NYT media portals. We demonstrate that the application of knowledge representation techniques to the aforementioned news streams reveals, contrary to common assumptions, relative “internal” neutrality of both sources and a polarizing attitude towards a small fraction of entities. Additionally, we argue that such characteristics in information sources lead to fundamental disparities in audience worldviews, potentially acting as a catalyst for the formation of echo-chambers.

pdf abs
Analyzing Conflict Through Data: A Dataset on the Digital Framing of Sheikh Jarrah Evictions
Anatolii Shestakov | Wajdi Zaghouani

This study empirically investigates the role of social media in tracing the evolution of the May 2021 Israeli-Palestinian crisis, centered on the Sheikh Jarrah evictions. Analyzing a dataset of 370,747 English tweets from 120,173 users from May 9-21, 2021, the research employs a mixed-methods approach combining computational techniques and qualitative content analysis. Findings support the hypothesis that social media interactions reliably map crisis dynamics, as evidenced by hashtags like #SaveSheikhJarrah corresponding to critical shifts, though virality did not correlate with hashtag use. In contrast to prior sentiment-focused studies, the context-driven analysis reveals influencers and state actors shaping polarized narratives along geopolitical lines, with high-profile voices backing Palestinian solidarity while Israeli state accounts endorsed military operations. Evidence of a transcontinental cybercampaign emerged, albeit with limitations due to the English language scope and potential biases from data collection and keyword choices. The study contributes empirical insights into the mediatization of armed conflicts through social media’s competing narratives and information flows within the Israeli-Palestinian context. Recommendations for future multilingual, multi-platform analyses are provided to address limitations.

This paper introduces a novel framework to harness Large Language Models (LLMs) for Epidemic Intelligence, focusing on identifying and categorizing emergent socio-political phenomena within health crises, with a spotlight on the COVID-19 pandemic. Our approach diverges from traditional methods, such as Topic Models, by providing explicit support to analysts through the identification of distinct thematic areas and the generation of clear, actionable statements for each topic. This supports a Zero-shot Classification mechanism, enabling effective matching of news articles to fine-grain topics without the need for model fine-tuning. The framework is designed to be as transparent as possible, producing linguistically informed insights to make the analysis more accessible to analysts who may not be familiar with every subject matter of inherently emerging phenomena. This process not only enhances the precision and relevance of the extracted Epidemic Intelligence but also fosters a collaborative environment where system linguistic abilities and the analyst’s domain expertise are integrated.

We aim to develop a metric of politicization by investigating whether this concept can be operationalized computationally using document embeddings. We are interested in measuring the extent to which foreign aid is politicized. Textual reports of foreign aid projects are often made available by donor governments, but these are large and unstructured. By embedding them in vector space, we can compute similarities between sets of known politicized keywords and the foreign aid reports. We present a pilot study where we apply this metric to USAID reports.
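The similarity computation outlined in this abstract might look something like the sketch below, which scores a report by its mean cosine similarity to a set of keyword vectors. The abstract does not state which embedding model or aggregation is used, so the mean aggregation, the function names, and the tiny two-dimensional vectors are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as having no similarity to anything.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def politicization_score(report_vec, keyword_vecs):
    """Mean cosine similarity between one report and the keyword vectors (assumed aggregation)."""
    return sum(cosine(report_vec, k) for k in keyword_vecs) / len(keyword_vecs)


# Toy 2-d embeddings (hypothetical); real document embeddings have far more dimensions.
report = [1.0, 0.0]
keywords = [[1.0, 0.0], [0.0, 1.0]]
score = politicization_score(report, keywords)
```

Other aggregations, such as taking the maximum similarity or comparing against a keyword centroid, would fit the same description equally well.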

pdf abs
Echo-chambers and Idea Labs: Communication Styles on Twitter
Aleksandra Sorokovikova | Michael Becker | Ivan P. Yamshchikov

This paper investigates the communication styles and structures of Twitter (X) communities within the vaccination context. While mainstream research primarily focuses on the echo-chamber phenomenon, wherein certain ideas are reinforced and participants are isolated from opposing opinions, this study reveals the presence of diverse communication styles across various communities. In addition to the communities exhibiting echo-chamber behavior, this research uncovers communities with distinct communication patterns. By shedding light on the nuanced nature of communication within social networks, this study emphasizes the significance of understanding the diversity of perspectives within online communities.

pdf (full)
bib (full) Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024

pdf bib
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024
Rooweither Mabuya | Muzi Matfunjwa | Mmasibidi Setaka | Menno van Zaanen

pdf bib abs
Doing Phonetics in the Rift Valley: Sound Systems of Maasai, Iraqw and Hadza
Alain Ghio | Didier Demolin | Michael Karani | Yohann Meynadier

This article discusses the contribution of experimental techniques to recording phonetic data in the field. Only a small part of the phonological systems of African languages is described with precision. This is why it is important to collect empirical data in the form of sound, video and physiological recordings. This allows research questions such as patterns of variation to be addressed. Analytical methods show how to interpret data from physical principles and integrate them into appropriate models. The question of linguistic contact between different language families is also addressed. To achieve these general objectives, we present the way we design corpora, and the different ways of recording data with crucial technical considerations during fieldwork. Finally, we focus on 3 languages spoken in the Great African Rift Zone, which includes several linguistic areas belonging to the four major linguistic families of the continent. (1) Hadza is a click language with a very complex consonant system. (2) Iraqw is a Cushitic language with ejective consonants. (3) Maasai is a Nilotic language with implosive consonants and a very elaborate set of interjections, ideophones and animal calls that include sounds not described in the International Phonetic Alphabet.

pdf bib abs
Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal
Elodie Gauthier | Aminata Ndiaye | Abdoulaye Guissé

This work is part of the Kallaama project, whose objective is to produce and disseminate national language corpora for the development of speech technologies in the field of agriculture. Except for Wolof, which benefits from some language data for natural language processing, the national languages of Senegal are largely ignored by language technology providers. However, such technologies are key to the protection, promotion and teaching of these languages. Kallaama focuses on the three languages most widely spoken by Senegalese people: Wolof, Pulaar and Sereer. These languages are widely spoken by the population, with around 10 million native Senegalese speakers, not to mention those outside the country. However, they remain under-resourced in terms of machine-readable data that can be used for automatic processing and language technologies, all the more so in the agricultural sector. We release a transcribed speech dataset containing 125 hours of recordings, about agriculture, in each of the above-mentioned languages. These resources are specifically designed for Automatic Speech Recognition purposes, including traditional approaches. To build such technologies, we provide textual corpora in Wolof and Pulaar, and a pronunciation lexicon containing 49,132 entries from the Wolof dataset.

pdf abs
Long-Form Recordings to Study Children’s Language Input and Output in Under-Resourced Contexts
Joseph R. Coffey | Alejandrina Cristia

A growing body of research suggests that young children’s early speech and language exposure is associated with later language development (including delays and diagnoses), school readiness, and academic performance. The last decade has seen increasing use of child-worn devices to collect long-form audio recordings by educators, economists, and developmental psychologists. The most commonly used system for analyzing this data is LENA, which was trained on North American English child-centered data and generates estimates of children’s speech-like vocalization counts, adult word counts, and child-adult turn counts. Recently, cheaper and open-source non-LENA alternatives with multilingual training have been proposed. Both kinds of systems have been employed in under-resourced, sometimes multilingual contexts, including Africa where access to printed or digital linguistic resources may be limited. In this paper, we describe each kind of system (LENA, non-LENA), provide information on audio data collected with them that is available for reuse, review evidence of the accuracy of extant automated analyses, and note potential strengths and shortcomings of their use in African communities.

pdf abs
Developing Bilingual English-Setswana Datasets for Space Domain
Tebatso G. Moape | Sunday Olusegun Ojo | Oludayo O. Olugbara

In the current digital age, languages lacking digital presence face an imminent risk of extinction. In addition, the absence of digital resources poses a significant obstacle to the development of Natural Language Processing (NLP) applications for such languages. Therefore, the development of digital language resources contributes to the preservation of these languages and enables application development. This paper contributes to the ongoing efforts of developing language resources for South African languages with a specific focus on Setswana and presents a new English-Setswana bilingual dataset that focuses on the space domain. The dataset was constructed using the expansion method. A subset of space domain English synsets from Princeton WordNet was professionally translated to Setswana. The initial submission of translations demonstrated an accuracy rate of 99% before validation. After validation, continuous revisions and discussions between translators and validators resulted in a unanimous agreement, ultimately achieving a 100% accuracy rate. The final version of the resource was converted into an XML format due to its machine-readable framework, providing a structured hierarchy for the organization of linguistic data.

pdf abs
Compiling a List of Frequently Used Setswana Words for Developing Readability Measures
Johannes Sibeko

This paper addresses the pressing need for improved readability assessment in Setswana through the creation of a list of frequently used words in Setswana. The end goal is to integrate this list into the adaptation of traditional readability measures in Setswana, such as the Dale-Chall index, which relies on frequently used words. Our initial list is developed using corpus-based methods utilising frequency lists obtained from five sets of corpora. It is then refined using manual methods. The analysis section delves into the challenges encountered during the development of the final list, encompassing issues like the inclusion of non-Setswana words, proper names, unexpected terms, and spelling variations. The decision-making process is clarified, highlighting crucial choices such as the retention of contemporary terms and the acceptance of diverse spelling variations. These decisions reflect a nuanced balance between linguistic authenticity and readability. This paper contributes to the discourse on text readability in indigenous Southern African languages. Moreover, it establishes a foundation for tailored literacy initiatives and serves as a starting point for adapting traditional frequency-list-based readability measures to Setswana.

pdf abs
A Qualitative Inquiry into the South African Language Identifier’s Performance on YouTube Comments.
Nkazimlo N. Ngcungca | Johannes Sibeko | Sharon Rudman

The South African Language Identifier (SA-LID) has proven to be a valuable tool for data analysis in the multilingual context of South Africa, particularly in governmental texts. However, its suitability for broader projects has yet to be determined. This paper aims to assess the performance of the SA-LID in identifying isiXhosa in YouTube comments as part of the methodology for research on the expression of cultural identity through linguistic strategies. We curated a selection of 10 videos which focused on the isiXhosa culture in terms of theatre, poetry, language learning, culture, or music. The videos were predominantly in English as were most of the comments, but the latter were interspersed with elements of isiXhosa, identifying the commentators as speakers of isiXhosa. The SA-LID was used to identify all instances of the use of isiXhosa to facilitate the analysis of the relevant items. Following the application of the SA-LID to this data, a manual evaluation was conducted to gauge the effectiveness of this tool in selecting all isiXhosa items. Our findings reveal significant limitations in the use of the SA-LID, encompassing the oversight of unconventional spellings in indigenous languages and misclassification of closely related languages within the Nguni group. Although proficient in identifying the use of Nguni languages, differentiating within this language group proved challenging for the SA-LID. These results underscore the necessity for manual checks to complement the use of the SA-LID when other Nguni languages may be present in the comment texts.

pdf abs
The First Universal Dependency Treebank for Tswana: Tswana-Popapolelo
Tanja Gaustad | Ansu Berg | Rigardt Pretorius | Roald Eiselen

This paper presents the first publicly available UD treebank for Tswana, Tswana-Popapolelo. The data used consists of the 20 Cairo CICLing sentences translated to Tswana. After pre-processing these sentences with detailed POS (XPOS) and converting them to universal POS (UPOS), we proceeded to annotate the data with dependency relations, documenting decisions for the language specific constructions. Linguistic issues encountered are described in detail as this is the first application of the UD framework to produce a dependency treebank for the Bantu language family in general and for Tswana specifically.

pdf abs
Adapting Nine Traditional Text Readability Measures into Sesotho
Johannes Sibeko | Menno van Zaanen

This article discusses the adaptation of traditional English readability measures into Sesotho, a Southern African indigenous low-resource language. We employ the use of a translated readability corpus to extract textual features from the Sesotho texts and readability levels from the English translations. We look at the correlation between the different features to ensure that non-competing features are used in the readability metrics. Next, through linear regression analyses, we examine the impact of the text features from the Sesotho texts on the overall readability levels (which are gauged from the English translations). Starting from the structure of the traditional English readability measures, linear regression models identify coefficients and intercepts for the different variables considered in the readability formulas for Sesotho. In the end, we propose ten readability formulas for Sesotho (one more than the initial nine; we provide two formulas based on the structure of the Gunning Fog index). We also introduce intercepts for the Gunning Fog index, the Läsbarhets index and the Readability index (which do not have intercepts in the English variants) in the Sesotho formulas.
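The coefficient-and-intercept fitting mentioned in this abstract can be illustrated with a minimal ordinary least squares sketch. It uses a single text feature for brevity, whereas the formulas in question combine several; the function name `fit_line`, the feature, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: average sentence length per text and a readability level.
avg_sentence_length = [1.0, 2.0, 3.0]
readability_level = [3.0, 5.0, 7.0]
slope, intercept = fit_line(avg_sentence_length, readability_level)
```

In a multi-feature setting the same idea extends to one coefficient per text feature plus a shared intercept.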

pdf abs
Bootstrapping Syntactic Resources from isiZulu to Siswati
Laurette Marais | Laurette Pretorius | Lionel Clive Posthumus

IsiZulu and Siswati are mutually intelligible languages that are considered under-resourced despite their status as official languages. Even so, the available digital and computational language resources for isiZulu significantly outstrip those for Siswati, such that it is worth investigating to what degree bootstrapping approaches can be leveraged to develop resources for Siswati. In this paper, we present the development of a computational grammar and parallel treebank, based on parallel linguistic descriptions of the two languages.

pdf abs
Early Child Language Resources and Corpora Developed in Nine African Languages by the SADiLaR Child Language Development Node
Michelle J. White | Frenette Southwood | Sefela Londiwe Yalala

Prior to the initiation of the project reported on in this paper, there were no instruments available with which to measure the language skills of young speakers of nine official African languages of South Africa. This limited the kind of research that could be conducted, and the rate at which knowledge creation on child language development could progress. Not only does this result in a dearth of knowledge needed to inform child language interventions but it also hinders the development of child language theories that would have good predictive power across languages. This paper reports on (i) the development of a questionnaire that caregivers complete about their infant’s communicative gestures and vocabulary or about their toddler’s vocabulary and grammar skills, in isiNdebele, isiXhosa, isiZulu, Sesotho, Sesotho sa Leboa, Setswana, Siswati, Tshivenda, and Xitsonga; and (ii) the 24 child language corpora thus far developed with these instruments. The potential research avenues opened by the 18 instruments and 24 corpora are discussed.

pdf abs
Morphological Synthesizer for Ge’ez Language: Addressing Morphological Complexity and Resource Limitations
Gebrearegawi Gebremariam Gidey | Hailay Kidu Teklehaymanot | Gebregewergs Mezgebe Atsbha

Ge’ez is an ancient Semitic language renowned for its unique alphabet. It serves as the script for numerous languages, including Tigrinya and Amharic, and played a pivotal role in Ethiopia’s cultural and religious development during the Aksumite kingdom era. Ge’ez remains significant as a liturgical language in Ethiopia and Eritrea, with much of the national identity documentation recorded in Ge’ez. These written materials are invaluable primary sources for studying Ethiopian and Eritrean philosophy, creativity, knowledge, and civilization. Ge’ez has a complex morphological structure with rich inflectional and derivational morphology, and no usable NLP tools have been developed and published until now due to the scarcity of annotated linguistic data, corpora, labeled datasets, and lexicons. Therefore, we proposed a rule-based Ge’ez morphological synthesis to generate surface words from root words according to the morphological structures of the language. Consequently, we proposed an automatic morphological synthesizer for Ge’ez using TLM. We used 1,102 sample verbs, representing all verb morphological structures, to test and evaluate the system. Finally, we get a performance of 97.4%. This result outperforms the baseline model, suggesting that other scholars build a comprehensive system considering morphological variations of the language. Keywords: Ge’ez, NLP, morphology, morphological synthesizer, rule-based

pdf abs
EthioMT: Parallel Corpus for Low-resource Ethiopian Languages
Atnafu Lambebo Tonja | Olga Kolesnikova | Alexander Gelbukh | Jugal Kalita

Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT – a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.

pdf abs
Resources for Annotating Hate Speech in Social Media Platforms Used in Ethiopia: A Novel Lexicon and Labelling Scheme
Nuhu Ibrahim | Felicity Mulford | Matt Lawrence | Riza Batista-Navarro

Hate speech on social media has proliferated in Ethiopia. To support studies aimed at investigating the targets and types of hate speech circulating in the Ethiopian context, we developed a new fine-grained annotation scheme that captures three elements of hate speech: the target (i.e., any groups with protected characteristics), type (i.e., the method of abuse) and nature (i.e., the style of the language used). We also developed a new lexicon of hate speech-related keywords in the four most prominent languages found on Ethiopian social media: Amharic, Afaan Oromo, English and Tigrigna. These keywords enabled us to retrieve social media posts (also in the same four languages) from three platforms (i.e., X, Telegram and Facebook), that are likely to contain hate speech. Experts in the Ethiopian context then manually annotated a sample of those retrieved posts, obtaining fair to moderate inter-annotator agreement. The resulting annotations formed the basis of a case study of which groups tend to be targeted by particular types of hate speech or by particular styles of hate speech language.

pdf abs
Low Resource Question Answering: An Amharic Benchmarking Dataset
Tilahun Abedissa Taffa | Ricardo Usbeck | Yaregal Assabie

Question Answering (QA) systems return concise answers or answer lists based on natural language text, which uses a given context document. Many resources go into curating QA datasets to advance the development of robust QA models. There is a surge in QA datasets for languages such as English; this is different for low-resource languages like Amharic. Indeed, there is no published or publicly available Amharic QA dataset. Hence, to foster further research in low-resource QA, we present the first publicly available benchmarking Amharic Question Answering Dataset (Amh-QuAD). We crowdsource 2,628 question-answer pairs from over 378 Amharic Wikipedia articles. Using the training set, we fine-tune an XLM-R-based language model and introduce a new reader model. Leveraging our newly fine-tuned reader, we run a baseline model to spark open-domain Amharic QA research interest. The best-performing baseline QA achieves an F-score of 80.3 and 81.34 in retriever-reader and reading comprehension settings, respectively.
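As an illustration of how an F-score of this kind is commonly computed for extractive QA, the following is a minimal sketch of token-overlap F1 between a predicted and a reference answer. It assumes whitespace tokenization and SQuAD-style scoring; the abstract does not state which F-score variant was used, and the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings (assumed SQuAD-style scoring)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A corpus-level score would typically average this value over all questions, taking the maximum over reference answers when several are available.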

A significant number of research studies have been presented for detecting hate speech in social media during the last few years. However, the majority of these studies are in English. Only a few studies focus on Arabic and its dialects (especially the Algerian dialect), with a smaller number of them targeting sexism detection (or hate speech against women). Even the works that have been proposed on Arabic sexism detection consider two classes only (hateful and non-hateful), and three classes (adding the neutral class) in the best scenario. This paper aims to propose the first fine-grained corpus focusing on 13 classes. However, given the challenges related to hate speech and fine-grained annotation, the Kappa metric is relatively low among the annotators (i.e. 35%). This work in progress proposes three main contributions: 1) Annotation of different categories related to hate speech such as insults, vulgar words or hate in general. 2) Annotation of 10,000 comments, in Arabic and Algerian dialects, automatically extracted from YouTube. 3) Highlighting the challenges related to manual annotation such as subjectivity, risk of bias, lack of annotation guidelines, etc.
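For readers unfamiliar with the Kappa metric, the following is a minimal sketch of Cohen's kappa for two annotators. It is an assumption that a pairwise Cohen's kappa was used; the abstract does not say whether the reported value is Cohen's, Fleiss' or another variant, and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Because chance agreement is subtracted, kappa can be low even when raw agreement looks reasonable, which is typical for fine-grained schemes with many classes.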

pdf abs
Advancing Language Diversity and Inclusion: Towards a Neural Network-based Spell Checker and Correction for Wolof
Thierno Ibrahima Cissé | Fatiha Sadat

This paper introduces a novel approach to spell checking and correction for low-resource and under-represented languages, with a specific focus on an African language, Wolof. By leveraging the capabilities of transformer models and neural networks, we propose an efficient and practical system capable of correcting typos and improving text quality. Our proposed technique involves training a transformer model on a parallel corpus consisting of misspelled sentences and their correctly spelled counterparts, generated using a semi-automatic method. As we fine tune the model to transform misspelled text into accurate sentences, we demonstrate the immense potential of this approach to overcome the challenges faced by resource-scarce and under-represented languages in the realm of spell checking and correction. Our experimental results and evaluations exhibit promising outcomes, offering valuable insights that contribute to the ongoing endeavors aimed at enriching linguistic diversity and inclusion and thus improving digital communication accessibility for languages grappling with scarcity of resources and under-representation in the digital landscape.

pdf abs
Lateral Inversions, Word Form/Order, Unnamed Grammatical Entities and Ambiguities in the Constituency Parsing and Annotation of the Igala Syntax through the English Language
Mahmud Mohammed Momoh

The aim of this paper is to expose the structural form of the Igala language and the inherent complexity of translating it into a second language, i.e. English, through an inquiry into its word order, lateral inversions, and unnamed grammatical entities. The study finds a preponderance of subject-verb-object word order and a total absence of prepositions in the speech composition of the Igala language. The implications of this trio of topics (syntactic inversion, word ordering, unnamed entities) have remained in the dark corner of intellectual consideration and, worse still, have not been incorporated into syntax parsing and annotation in computing. Arising from the ongoing abstruseness and incongruity in machine translation of Igala, a comprehension model for the automatic identification, application and/or conversion of these structural forms into the English language is the focus of this paper.

pdf (full)
bib (full) Proceedings of the Fifth Workshop on Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments @LREC-COLING 2024

pdf bib
Proceedings of the Fifth Workshop on Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments @LREC-COLING 2024
Dimitrios Kokkinakis | Kathleen C. Fraser | Charalambos K. Themistocleous | Kristina Lundholm Fors | Athanasios Tsanas | Fredrik Ohman

pdf bib abs
Semantic-based NLP techniques discriminate schizophrenia and Wernicke’s aphasia based on spontaneous speech
Frank Tsiwah | Anas Mayya | Andreas van Cranenburgh

People with schizophrenia spectrum disorder (SSD), a psychiatric disorder, and people with Wernicke’s aphasia, an acquired neurological disorder, are both known to display semantic deficits in their spontaneous speech outputs. Very few studies have directly compared the two groups on their spontaneous speech (Gerson et al., 1977; Faber et al., 1983), and no consistent results were found. Our study uses word (based on the word2vec model with moving windows across words) and sentence (transformer-based model) embeddings as features for a machine learning classification model to differentiate between the spontaneous speech of both groups. Additionally, this study uses these measures to differentiate between people with Wernicke’s aphasia and healthy controls. The model is able to classify patients with Wernicke’s aphasia and patients with SSD with a cross-validated accuracy of 81%. Additionally, it is also able to classify patients with Wernicke’s aphasia versus healthy controls and SSD versus healthy controls with cross-validated accuracy of 93.72% and 84.36%, respectively. For the SSD individuals, sentence and/or discourse level features are deemed more informative by the model, whereas for the Wernicke group, only intra-sentential features are more informative. Overall, we show that NLP-based semantic measures are sensitive to identifying Wernicke’s aphasic and schizophrenic speech.

pdf bib abs
Speech Rate and Salient Syllables Position in Spontaneous Speech of Children with Autism Spectrum Disorder
Valentina Saccone

The study employs a semi-automatic approach to analyze speech rate in spoken Italian, aiming to identify acoustic parameters associated with perceptual atypicality in the speech of children diagnosed with Autism Spectrum Disorder (ASD). The research focuses on a dataset comprising recordings of semi-spontaneous interactions, in comparison with interviews of Typically Developing (TD) children. A detailed examination of speech rate variability is conducted, progressing from assessing overall speech rate in conversation to the analysis of individual utterances. Furthermore, salient syllables within utterances are identified using an automatic procedure through the Salient Detector Praat script and analyzed for stress position. The study highlights specific speech styles, including rapid-telegraphic and reading-performed speech. Additionally, it reveals a higher speech rate with increasing utterance length for utterances of <10 syllables; conversely, speech rate diminishes in utterances of 20-25 syllables, suggesting potential difficulty in producing longer utterances associated with increased cognitive load.

pdf abs
Cross-Lingual Examination of Language Features and Cognitive Scores From Free Speech
Hali Lindsay | Giorgia Albertin | Louisa Schwed | Nicklas Linz | Johannes Tröger

Speech analysis is gaining significance for monitoring neurodegenerative disorders, but with a view to application in clinical practice, solid evidence of the association of language features with cognitive scores is still needed. A cross-linguistic investigation has been pursued to examine whether language features show significant correlation with two cognitive scores, i.e. Mini-Mental State Examination and ki:e SB-C scores, in Alzheimer’s Disease patients. We explore 23 language features, representative of syntactic complexity and semantic richness, extracted from a dataset of free speech recordings of 138 participants distributed across four languages (Spanish, Catalan, German, Dutch). Data was analyzed using the speech library SIGMA; Pearson’s correlation was computed with Bonferroni correction, and a mixed effects linear regression analysis was performed on the significantly correlated results. MMSE and the SB-C are found to be correlated with no significant differences across languages. Three features were found to be significantly correlated with the SB-C scores. Among these, two features of lexical richness show consistent patterns across languages, while determiner rate showed language-specific patterns.
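The statistical step described above (Pearson’s correlation followed by Bonferroni correction) can be sketched as follows. This is a minimal illustration, not the SIGMA library's interface: the function names are hypothetical, the p-values are assumed to come from a separate significance test, and the significance level of 0.05 is an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value falls below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With 23 features tested against one score, the Bonferroni threshold under these assumptions would be 0.05 / 23, roughly 0.0022, so only strongly correlated features survive the correction.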

In the last decade, a rapidly growing body of studies has shown promising results for the automatic detection and extraction of speech and language features as biomarkers of neurodegenerative conditions such as Alzheimer’s disease. This has sparked great optimism and the development of various digital health tools, but also warnings regarding the predominance of English in the field and calls for linguistically diverse research as well as global, equitable access to novel clinical instruments. To automatically extract clinically relevant features from transcripts in low-resource languages, two approaches are possible: 1) utilizing a limited range of language-specific tools or 2) translating text to English and then extracting the features. We evaluate these approaches for part-of-speech (POS) rates in transcripts of recorded picture descriptions from a cross-sectional study of Icelandic speakers at different stages of Alzheimer’s disease and healthy controls. While the translation method merits further exploration, only a subset of the POS categories show a promising correspondence to the direct extraction from the Icelandic transcripts in our results, indicating that the translation method has to be linguistically validated at the individual POS category level.

pdf abs
Automatic Detection of Rhythmic Features in Pathological Speech of MCI and Dementia Patients
Marica Belmonte | Gloria Gagliardi | Dimitrios Kokkinakis | Fabio Tamburini

Linguistic alterations represent one of the prodromal signs of cognitive decline associated with Dementia. In recent years, a growing body of work has been devoted to the development of algorithms for the automatic linguistic analysis of both oral and written texts, for diagnostic purposes. The extraction of Digital Linguistic Biomarkers from patients’ verbal productions can indeed provide a rapid, ecological, and cost-effective system for large-scale screening of the pathology. This article contributes to the ongoing research in the field by exploring a traditionally less studied aspect of language in Dementia, namely the rhythmic characteristics of speech. In particular, the paper focuses on the automatic detection of rhythmic features in Italian connected speech. A landmark-based system was developed and evaluated to segment the speech flow into vocalic and consonantal intervals and to calculate several rhythmic metrics. Additionally, the reliability of these metrics in identifying Mild Cognitive Impairment and Dementia patients was tested.
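To make the notion of rhythmic metrics over vocalic and consonantal intervals concrete, here is a minimal sketch of three widely used measures: %V, ΔC and nPVI. The abstract does not list which metrics the system computes, so this selection is an assumption, as are the function names; durations are taken to be in seconds.

```python
import statistics


def percent_v(vocalic, consonantal):
    """%V: share of total duration taken up by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100 * sum(vocalic) / total


def delta_c(consonantal):
    """Delta-C: standard deviation of consonantal interval durations.

    Population standard deviation is used here; some studies use the
    sample standard deviation instead.
    """
    return statistics.pstdev(consonantal)


def npvi(durations):
    """Normalised Pairwise Variability Index over successive intervals."""
    pairs = list(zip(durations, durations[1:]))
    if not pairs:
        raise ValueError("need at least two intervals")
    # Each pairwise difference is normalised by the mean of the pair.
    return 100 * sum(abs(a - b) / ((a + b) / 2) for a, b in pairs) / len(pairs)
```

The inputs would be the interval durations produced by a segmentation step such as the landmark-based system described above.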

pdf abs
Open Brain AI. Automatic Language Assessment
Charalambos Themistocleous

Language assessment plays a crucial role in diagnosing and treating individuals with speech, language, and communication disorders caused by neurogenic conditions, whether developmental or acquired. To support clinical assessment and research, we developed Open Brain AI (https://openbrainai.com). This computational platform employs AI techniques, namely machine learning, natural language processing, large language models, and automatic speech-to-text transcription, to automatically analyze multilingual spoken and written productions. This paper discusses the development of Open Brain AI, the AI language processing modules, and the linguistic measurements of discourse macro-structure and micro-structure. The fast and automatic analysis of language alleviates the burden on clinicians, enabling them to streamline their workflow and allocate more time and resources to direct patient care. Open Brain AI is freely accessible, empowering clinicians to conduct critical data analyses and give more attention and resources to other critical aspects of therapy and treatment.

pdf abs
Exploring the Relationship Between Intrinsic Stigma in Masked Language Models and Training Data Using the Stereotype Content Model
Mario Mina | Júlia Falcão | Aitor Gonzalez-Agirre

Much work has gone into developing language models of increasing size, but only recently have we begun to examine them for pernicious behaviour that could lead to harming marginalised groups. Following Lin et al. (2022) in rooting our work in psychological research, we prompt two masked language models (MLMs) of different specialisations in English and Spanish with statements from a questionnaire developed to measure stigma to determine if they treat physical and mental illnesses equally. In both models we find a statistically significant difference in the treatment of physical and mental illnesses across most if not all latent constructs as measured by the questionnaire, and thus they are more likely to associate mental illnesses with stigma. We then examine their training data or data retrieved from the same domain using a computational implementation of the Stereotype Content Model (SCM) (Fiske et al., 2002; Fraser et al., 2021) to interpret the questionnaire results based on the SCM values as reflected in the data. We observe that model behaviour can largely be explained by the distribution of the mentions of illnesses according to their SCM values.

pdf abs
Establishing Control Corpora for Depression Detection in Modern Greek: Methodological Insights
Vivian Stamou | George Mikros | George Markopoulos | Spyridoula Varlokosta

This paper presents a methodological approach for establishing control corpora in the context of depression detection in the Modern Greek language. We discuss various methods used to create control corpora, focusing on the challenge of selecting representative samples from the general population when the target reference is the depressed population. Our approach includes traditional random selection among Twitter users, as well as an innovative method for creating topic-oriented control corpora. Through this study, we provide insights into the development of control corpora, offering valuable considerations for researchers working on similar projects in linguistic analysis and mental health studies. In addition, we identify several dominant topics in the depressed population such as religion, sentiments, health and digestion, which seem to align with findings consistently reported in the literature.

pdf abs
A Preliminary Evaluation of Semantic Coherence and Cohesion in Aphasic and Non-Aphasic Discourse Across Test and Retest
Snigdha Khanna | Brielle C. Stark

This paper evaluates global and local semantic coherence in aphasic and non-aphasic discourse tasks using the Tool for the Automatic Analysis of Cohesion (TAACO). The motivation for this paper stems from a lack of automatic methods to evaluate discourse-level phenomena, such as semantic cohesion, in transcripts derived from persons with aphasia. It leverages existing test-retest data to evaluate two main objectives: (1) Test-Retest Reliability, to identify if variables significantly differ across test and retest time points for either group (aphasia, control), and (2) Inter-Group Discourse Cohesion, where aphasic discourse is expected to be less cohesive than control discourse, resulting in lower cohesion scores for the aphasia group. Exploratory analysis examines correlations between variables for both groups, identifying any relationships between word-level and sentence-level semantic variables. Results verify that semantic cohesion and coherence are generally preserved in both groups, except for word-level and a few sentence-level semantic measures, which are higher for the control group. Overall, variables tend to be reliable across time points for both groups. Notably, the aphasia group demonstrates more variability in cohesion than the control group, which is to be expected after brain injury. A close relationship between word-level indices and other indices is observed, suggesting a disconnection between word-level factors and sentence-level metrics.

pdf abs
Harnessing Linguistic Analysis for ADHD Diagnosis Support: A Stylometric Approach to Self-Defining Memories
Florian Raphaël Cafiero | Juan Barrios Rudloff | Simon Gabay

This study explores the potential of stylometric analysis in identifying Self-Defining Memories (SDMs) authored by individuals with Attention-Deficit/Hyperactivity Disorder (ADHD) versus a control group. A sample of 198 SDMs were written by 66 adolescents and were then analysed using Support Vector Classifiers (SVC). The analysis included a variety of linguistic features such as character 3-grams, function words, sentence length, or lexical richness among others. It also included metadata about the participants (gender, age) and their SDMs (self-reported sentiment after recalling their memories). The results reveal a promising ability of linguistic analysis to accurately classify SDMs, with perfect prediction (F1=1.0) in the contextually simpler setup of text-by-text prediction, and satisfactory levels of precision (F1 = 0.77) when predicting individual by individual. Such results highlight the significant role that linguistic characteristics play in reflecting the distinctive cognitive patterns associated with ADHD. While not a substitute for professional diagnosis, textual analysis offers a supportive avenue for early detection and a deeper understanding of ADHD.

pdf abs
Crosslinguistic Acoustic Feature-based Dementia Classification Using Advanced Learning Architectures
Anna Seo Gyeong Choi | Jin-seo Kim | Seo-hee Kim | Min Seok Back | Sunghye Cho

In this study, we rigorously evaluated eight machine learning and deep learning classifiers for identifying Alzheimer’s Disease (AD) patients using crosslinguistic acoustic features automatically extracted from one-minute oral picture descriptions produced by speakers of American English, Korean, and Mandarin Chinese. We employed eGeMAPSv2 and ComParE feature sets on segmented and non-segmented audio data. The Multilayer Perceptron model showed the highest performance, achieving an accuracy of 83.54% and an AUC of 0.8 on the ComParE features extracted from non-segmented picture description data. Our findings suggest that classifiers trained with acoustic features extracted from one-minute picture description data in multiple languages are highly promising as a quick, language-universal, large-scale, remote screening tool for AD. However, the dataset included predominantly English-speaking participants, indicating the need for more balanced multilingual datasets in future research.

pdf (full)
bib (full) Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

pdf bib
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024
Rodrigo Wilkens | Rémi Cardon | Amalia Todirascu | Núria Gala

pdf bib abs
Evaluating Document Simplification: On the Importance of Separately Assessing Simplicity and Meaning Preservation
Liam Cripwell | Joël Legrand | Claire Gardent

Text simplification intends to make a text easier to read while preserving its core meaning. Intuitively and as shown in previous works, these two dimensions (simplification and meaning preservation) are often inversely correlated. An overly conservative text will fail to simplify sufficiently, whereas extreme simplification will degrade meaning preservation. Yet, popular evaluation metrics either aggregate meaning preservation and simplification into a single score (SARI, LENS), or target meaning preservation alone (BERTScore, QuestEval). Moreover, these metrics usually require a set of references and most previous work has only focused on sentence-level simplification. In this paper, we focus on the evaluation of document-level text simplification and compare existing models using distinct metrics for meaning preservation and simplification. We leverage existing metrics from similar tasks and introduce a reference-less metric variant for simplicity, showing that models are mostly biased towards either simplification or meaning preservation, seldom performing well on both dimensions. Making use of the fact that the metrics we use are all reference-less, we also investigate the performance of existing models when applied to unseen data (where reference simplifications are unavailable).

pdf bib abs
Malmon: A Crowd-Sourcing Platform for Simple Language
Helgi Björn Hjartarson | Steinunn Rut Friðriksdóttir

This paper presents a crowd-sourcing platform designed to address the need for parallel corpora in the field of Automatic Text Simplification (ATS). ATS aims to automatically reduce the linguistic complexity of text to aid individuals with reading difficulties, such as those with cognitive disorders, dyslexia, children, and non-native speakers. ATS does not only facilitate improved reading comprehension among these groups but can also enhance the preprocessing stage for various NLP tasks through summarization, contextual simplification, and paraphrasing. Our work introduces a language independent, openly accessible platform that crowdsources training data for ATS models, potentially benefiting low-resource languages where parallel data is scarce. The platform can efficiently aid in the collection of parallel corpora by providing a user-friendly data-collection environment. Furthermore, using human crowd-workers for the data collection process offers a potential resource for linguistic research on text simplification practices. The paper discusses the platform’s architecture, built with modern web technologies, and its user-friendly interface designed to encourage widespread participation. Through gamification and a robust admin panel, the platform incentivizes high-quality data collection and engagement from crowdworkers.

pdf abs
Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models
Andreas Säuberli | Simon Clematide

Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.

We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.

pdf abs
SIERA: An Evaluation Metric for Text Simplification using the Ranking Model and Data Augmentation by Edit Operations
Hikaru Yamanaka | Takenobu Tokunaga

Automatic evaluation metrics are indispensable for text simplification (TS) research. Past TS research has adopted three evaluation aspects: fluency, meaning preservation and simplicity. However, there is little consensus on a metric to measure simplicity, a unique aspect of TS compared with other text generation tasks. In addition, many of the existing metrics require reference simplified texts for evaluation. Thus, the cost of collecting reference texts is also an issue. This study proposes a new automatic evaluation metric, SIERA, for sentence simplification. SIERA employs a ranking model for the order relation of simplicity, which is trained on pairs of original and simplified sentences. It does not require reference sentences for either training or evaluation. The sentence pairs for training are further augmented by the proposed method, which utilizes edit operations to generate intermediate sentences whose simplicity lies between that of the original and simplified sentences. Using three evaluation datasets for text simplification, we compare SIERA with other metrics by calculating the correlations between metric values and human ratings. The results showed SIERA’s superiority over other metrics with a reservation that the quality of evaluation sentences is consistent with that of the training data.

pdf abs
Transfer Learning for Russian Legal Text Simplification
Mark Athugodage | Olga Mitrofanove | Vadim Gudkov

We present novel results in legal text simplification for Russian. We introduce the first dataset for such a task in Russian - a parallel corpus based on the data extracted from “Rossiyskaya Gazeta Legal Papers”. In this study we discuss three approaches for text simplification which involve T5 and GPT model architectures. We evaluate the proposed models on a set of metrics: ROUGE, SARI and BERTScore. We also analysed the models’ results on readability indices such as the Flesch-Kincaid Grade Level and the Gunning Fog Index. Finally, we performed a human evaluation of simplified texts generated by T5 and GPT models; the assessment was carried out by native speakers of Russian and Russian lawyers. In this research we compared T5 and GPT architectures for the text simplification task and found out that GPT handles better when it is fine-tuned on dataset of coped texts. Our research makes a big step in improving Russian legal text readability and accessibility for common people.
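For reference, the two readability indices named above can be sketched from raw counts as follows. The coefficients are those of the original English-language formulas; whether the paper used these or coefficients adapted to Russian is not stated in the abstract, so this is an assumption. The function names are hypothetical, and counting syllables and complex words is left to the caller.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (original English coefficients)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index; complex words are those with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))
```

Both indices grow with average sentence length and with the share of long words, so a successful simplification would be expected to lower them relative to the source text.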

pdf abs
Accessible Communication: a systematic review and comparative analysis of official English Easy-to-Understand (E2U) language guidelines
Andreea Maria Deleanu | Constantin Orasan | Sabine Braun

Easy-to-Understand (E2U) language varieties have been recognized by the United Nations Convention on the Rights of Persons with Disabilities (2006) as a means to guarantee the fundamental right to Accessible Communication. Increased awareness has driven changes in European (European Commission, 2015, 2021; European Parliament, 2016) and International legislation (ODI, 2010), prompting public-sector and other institutions to offer domain-specific content in E2U language to prevent communicative exclusion of those facing cognitive barriers (COGA, 2017; Maaß, 2020; Perego, 2020). However, guidance on what it is that makes language actually ‘easier to understand’ is still fragmented and vague. For this reason, we carried out a systematic review of official guidelines for English Plain Language and Easy Language to identify the most effective lexical, syntactic and adaptation strategies that can reduce complexity in verbal discourse according to official bodies. This article will present the methods and preliminary results of the guidelines analysis.

pdf abs
LanguageTool as a CAT tool for Easy-to-Read in Spanish
Margot Madina | Itziar Gonzalez-Dios | Melanie Siegel

Easy-to-Read (E2R) is an approach to content creation that emphasizes simplicity and clarity in language to make texts more accessible to readers with cognitive challenges or learning disabilities. The Spanish version of E2R is called Lectura Fácil (LF). E2R and its variants, such as LF, focus on straightforward language and structure to enhance readability. The manual production of such texts is both time- and resource-intensive. In this work, we have developed LFWriteAssist, an authoring support tool that aligns with the guidelines of LF. It is underpinned by the functionalities of LanguageTool, a free and open source grammar, style and spelling checker. Our tool assists in ensuring compliance with the LF standard, and provides definitions for complex, polysemic, or infrequently used terms as well as acronym expansions. The tool is primarily targeted at LF creators, as it serves as an authoring aid, identifying any rule infringements and assisting with language simplifications. However, it can be used by anyone who seeks to enhance text readability and inclusivity. The tool’s code is made available as open source, thereby contributing to the wider effort of creating inclusive and comprehensible content.

pdf abs
Paying attention to the words: explaining readability prediction for French as a foreign language
Rodrigo Wilkens | Patrick Watrin | Thomas François

Automatic text Readability Assessment (ARA) has been seen as a way of helping people with reading difficulties. Recent advancements in Natural Language Processing have shifted ARA from linguistic-based models to more precise black-box models. However, this shift has weakened the alignment between ARA models and the reading literature, potentially leading to inaccurate predictions based on unintended factors. In this paper, we investigate the explainability of ARA models, inspecting the relationship between attention mechanism scores, ARA features, and CEFR level predictions made by the model. We propose a method for identifying features associated with the predictions made by a model through the use of the attention mechanism. Exploring three feature families (i.e., psycho-linguistic, word frequency and graded lexicon), we associated features with the model’s attention heads. Finally, while not fully explanatory of the model’s performance, the correlations of these associations surpass those between features and text readability levels.

pdf (full)
bib (full) Proceedings of the First Workshop on Reference, Framing, and Perspective @ LREC-COLING 2024

pdf bib
Proceedings of the First Workshop on Reference, Framing, and Perspective @ LREC-COLING 2024
Pia Sommerauer | Tommaso Caselli | Malvina Nissim | Levi Remijnse | Piek Vossen

Tracking Perspectives on Event Participants: a Structural Analysis of the Framing of Real-World Events in Co-Referential Corpora
Levi Remijnse | Pia Sommerauer | Antske Fokkens | Piek T.J.M. Vossen

In this paper, we present the outcome of a structural linguistic analysis performed on a referentially grounded FrameNet dataset. In this dataset, multiple Dutch events are referenced by multiple co-referential Dutch news texts. Mentions in those documents are annotated with respect to their referential grounding (i.e., links to structured Wikidata), and their conceptual representation (i.e., frames). Provided with each document’s temporal reporting distance, we selected documents for two events - the Utrecht shooting and MH17 - and performed an analysis in which we tracked the events’ participants over time in both their focalization (number of mentions) and their framing (distribution of frame element labels). This way, we use the carefully collected and annotated data to schematize shifts in focalization and perspectivization of the participants as a result of the constantly developing narrative surrounding the events. This novel type of linguistic research involves reference to the real-world referents and takes into account storytelling in news streams.

TimeFrame: Querying and Visualizing Event Semantic Frames in Time
Davide Lamorte | Marco Rovera | Alfio Ferrara | Sara Tonelli

In this work we introduce TimeFrame, an online platform to easily query and visualize events and participants extracted from document collections in Italian following a frame-based approach. The system allows users to select one or more events (frames) or event categories and to display their occurrences on a timeline. Different query types, from coarse to fine-grained, are available through the interface, enabling a time-bound analysis of large historical corpora. We present three use cases based on the full archive of news published in 1948 by the newspaper “Corriere della Sera”. We show that different crucial events can be explored, providing interesting insights into the narratives around such events, the main participants and their points of view.

We present an experiment on classifying news frames in a language unseen by the learner, using zero-shot cross-lingual transfer learning. We used two pre-trained multilingual Transformer Encoder neural network models and tested with four specific news frames, investigating two approaches to the resulting multi-label task: Binary Relevance (treating each frame independently) and Label Power-set (predicting each possible combination of frames). We train our classifiers on an available annotated multilingual migration news dataset and test on an unseen Slovene language migration news corpus, first evaluating performance and then using the classifiers to analyse how media framed the news during the periods of Syria and Ukraine conflict-related migrations.
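
The two multi-label formulations named in this abstract can be illustrated with a minimal sketch. The frame names and toy annotations below are hypothetical (the abstract does not list the four frames), and the helper functions are illustrative only, not the authors' implementation. Binary Relevance yields one independent 0/1 target per frame, while Label Power-set turns each distinct combination of frames into a single class.

```python
# Hypothetical frame inventory and toy annotations, for illustration only.
FRAMES = ["economic", "security", "humanitarian", "political"]
articles = [
    {"economic", "political"},
    {"security"},
    {"economic", "political"},
    set(),
]

def binary_relevance_targets(label_sets, frames):
    """Binary Relevance: one independent 0/1 target vector per frame."""
    return {f: [int(f in labels) for labels in label_sets] for f in frames}

def label_powerset_targets(label_sets):
    """Label Power-set: each distinct combination of frames becomes one class id."""
    classes = {}
    targets = []
    for labels in label_sets:
        key = tuple(sorted(labels))
        if key not in classes:
            classes[key] = len(classes)  # assign ids in order of first appearance
        targets.append(classes[key])
    return targets, classes

br = binary_relevance_targets(articles, FRAMES)
lp_targets, lp_classes = label_powerset_targets(articles)
print(br["economic"])  # targets for a single binary classifier
print(lp_targets)      # targets for a single multi-class classifier
```

Under Binary Relevance, four binary classifiers would be trained, one per frame; under Label Power-set, a single classifier would choose among the combinations observed in the training data.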

Manosphrames: exploring an Italian incel community through the lens of NLP and Frame Semantics
Sara Gemelli | Gosse Minnema

We introduce a large corpus of comments extracted from an Italian online incel (‘involuntary celibate’) forum, a community of men who build a collective identity and anti-feminist ideology centered around their inability to find a sexual or romantic partner and who frequently use explicitly misogynistic language. Our corpus consists of 2.4K comments that have been manually collected, analyzed and annotated with topic labels, and a further 32K threads (300K comments) that have been automatically scraped and automatically annotated with FrameNet annotations. We show how large-scale frame semantic analysis can shed light on what is discussed in the community, and introduce incel topic classification as a new NLP task and benchmark.

Broadening the coverage of computational representations of metaphor through Dynamic Metaphor Theory
Xiaojuan Tan | Jelke Bloem

Current approaches to computational metaphor processing typically incorporate static representations of metaphor. We aim to show that this limits the coverage of such systems. We take insights from dynamic metaphor theory and discuss how existing computational models of metaphor might benefit from representing the dynamics of metaphor when applied to the analysis of conflicting discourse. We propose that a frame-based approach to metaphor representation based on the model of YinYang Dynamics of Metaphoricity (YYDM) would pave the way to more comprehensive modeling of metaphor. In particular, the metaphoricity cues of the YYDM model could be used to address the task of dynamic metaphor identification. Frame-based modeling of dynamic metaphor would facilitate the computational analysis of perspectives in conflicting discourse, with potential applications in analyzing political discourse.

Proceedings of Safety4ConvAI: The Third Workshop on Safety for Conversational AI @ LREC-COLING 2024

Grounding LLMs to In-prompt Instructions: Reducing Hallucinations Caused by Static Pre-training Knowledge
Angus Addlesee

When deploying LLMs in certain commercial or research settings, domain-specific knowledge must be explicitly provided within the prompt. This in-prompt knowledge can conflict with an LLM’s static world knowledge learned at pre-training, causing model hallucination (see examples in Table 1). In safety-critical settings, like healthcare and finance, these hallucinations can harm vulnerable users. We have curated a QA corpus containing information that LLMs could not have seen at pre-training. Using our corpus, we have probed various LLMs, manipulating both the prompt and the knowledge representation. We have found that our ‘Jodie’ prompt consistently improves the model’s textual grounding to the given knowledge, and in turn the overall answer accuracy. This is true in both the healthcare and finance domains - improving accuracy by up to 28% (mean: 12%). We have also identified that hierarchical and direct node-property graph structures could lead to more interpretable and controllable systems that provide a natural language interface with real-time in-domain knowledge. Our corpus will enable further work on this critical challenge.

How people interpret content is deeply influenced by their socio-cultural backgrounds and lived experiences. This is especially crucial when evaluating AI systems for safety, where accounting for such diversity in interpretations and potential impacts on human users will make them both more successful and inclusive. While recent work has demonstrated the importance of diversity in human ratings that underlie AI pipelines, effective and efficient ways to incorporate diverse perspectives in human data annotation pipelines are still largely elusive. In this paper, we discuss the primary challenges faced in incorporating diversity into model evaluations, and propose a practical diversity-aware annotation approach. Using an existing dataset with highly parallel safety annotations, we take as a test case a policy that prioritizes recall of safety issues, and demonstrate that our diversity-aware approach can efficiently obtain a higher recall of safety issues flagged by minoritized rater groups without hurting overall precision.

Using Information Retrieval Techniques to Automatically Repurpose Existing Dialogue Datasets for Safe Chatbot Development
Tunde Oluwaseyi Ajayi | Gaurav Negi | Mihael Arcan | Paul Buitelaar

There has been notable progress in the development of open-domain dialogue systems (chatbots), especially with the rapid advancement of the capabilities of Large Language Models. Chatbots excel at holding conversations in a manner that keeps a user interested and engaged. However, their responses can be unsafe, as they can respond in an offensive manner or offer harmful professional advice. To mitigate this issue, recent work crowdsources datasets with exemplary responses or annotates dialogue safety datasets, which are relatively scarce compared to casual dialogues. Despite the quality of the data obtained, crowdsourcing can be expensive and time-consuming. This work proposes an effective pipeline, using information retrieval, to automatically repurpose existing dialogue datasets for safe chatbot development, as a way to address the aforementioned challenges. We select an existing dialogue dataset and revise its unsafe responses to obtain a dataset with safer responses to unsafe user inputs. We then fine-tune dialogue models on the original and revised datasets and generate responses to evaluate the safeness of the models.

FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations
Jane Dwivedi-Yu

The accurate evaluation of differential treatment of specific groups by language models is critical to ensuring a positive and safe user experience. An ideal evaluation should be robust, extendable to new groups or attributes, and able to capture biases that appear in typical usage (rather than just extreme, rare cases). Relatedly, bias evaluation should surface not only egregious biases but also ones that are subtle and commonplace, such as a tendency to talk about appearances with regard to women. We present FairPair, an evaluation framework for assessing differential treatment that occurs during ordinary usage. FairPair operates through counterfactual pairs, but crucially, the paired continuations are grounded in the same demographic group, which ensures equivalent comparison. Additionally, unlike prior work, our method factors in the inherent variability that comes from the generation process itself by measuring the sampling variability. We present an evaluation of several commonly used generative models and a qualitative analysis that indicates a preference for discussing family and hobbies with regard to women.

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks
Georgios Pantazopoulos | Amit Parekh | Malvina Nikandrou | Alessandro Suglia

Augmenting Large Language Models (LLMs) with image-understanding capabilities has resulted in a boom of high-performing Vision-Language models (VLMs). While studying the alignment of LLMs to human values has received widespread attention, the safety of VLMs has not received the same attention. In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach. By comparing each VLM to its respective LLM backbone, we find that each VLM is more susceptible to jailbreaking. We consider this an undesirable outcome of visual instruction tuning, which imposes a forgetting effect on an LLM’s safety guardrails. Therefore, we provide recommendations for future work based on evaluation strategies that aim to highlight the weaknesses of a VLM, as well as to take safety measures into account during visual instruction tuning.

Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources

Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources
Eleni Efthimiou | Stavroula-Evita Fotinea | Thomas Hanke | Julie A. Hochgesang | Johanna Mesch | Marc Schulder

Advancing Annotation for Continuous Data in Swiss German Sign Language
Alessia Battisti | Katja Tissi | Sandra Sidler-Miserez | Sarah Ebling

Person Identification from Pose Estimates in Sign Language
Alessia Battisti | Emma van den Bold | Anne Göhring | Franz Holzknecht | Sarah Ebling

Data Integration, Annotation, and Transcription Methods for Sign Language Dialogue with Latency in Videoconferencing
Mayumi Bono | Tomohiro Okada | Victor Skobov | Robert Adam

Evaluating the Alignment of Utterances in the Swedish Sign Language Corpus
Carl Börstell

How to Approach Lexical Variation in Sign Language Corpora
Carl Börstell

Systemic Biases in Sign Language AI Research: A Deaf-Led Call to Reevaluate Research Agendas
Aashaka Desai | Maartje De Meulder | Julie A. Hochgesang | Annemarie Kocab | Alex X. Lu

Evaluating Inter-Annotator Agreement for Non-Manual Markers in Sign Languages
Lyke D. Esselink | Marloes Oomen | Floris Roelofsen

A software editor for the AZVD graphical Sign Language representation system
Michael Filhol | Thomas von Ascheberg

Content Questions in Sign Language – From theory to language description via corpus, experiments, and fieldwork
Robert Gavrilescu | Carlo Geraci | Johanna Mesch

Enhancing Syllabic Component Classification in Japanese Sign Language by Pre-training on Non-Japanese Sign Language Data
Jundai Inoue | Makoto Miwa | Yutaka Sasaki | Daisuke Hara

Building Your Query Step by Step: A Query Wizard for the MY DGS – ANNIS Portal of the DGS Corpus
Amy Isard

Investigating Motion History Images and Convolutional Neural Networks for Isolated Irish Sign Language Fingerspelling Recognition
Sarmad Khan | Irene Murtagh | Simon D. McLoughlin

Shedding Light on the Underexplored: Tackling the Minor Sign Language Research Topics
Jung-Ho Kim | Changyong Ko | Mathew Huerta-Enochian | Seung Yong Ko

Headshakes in NGT: Relation between Phonetic Properties & Linguistic Functions
Vadim Kimmelman | Marloes Oomen | Roland Pfau

Nonmanual Marking of Questions in Balinese Homesign Interactions: a Computer-Vision Assisted Analysis
Vadim Kimmelman | Ari Price | Josefina Safar | Connie de Vos | Jan Bulla

Annotation of LSF subtitled videos without a pre-existing dictionary
Julie Lascar | Michèle Gouiffès | Annelies Braffort | Claire Danet

Capturing Motion: Using Radar to Build Better Sign Language Corpora
Evie Malaia | Joshua Borneman | Sevgi Gurbuz

Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data
Fredrik Malmberg | Anna Klezovich | Johanna Mesch | Jonas Beskow

Quantitative Analysis of Hand Locations in both Sign Language and Non-linguistic Gesture Videos
Niels Martínez-Guevara | Arturo Curiel

Formal Representation of Interrogation in French Sign Language
Emmanuella Martinod | Michael Filhol

Multilingual Synthesis of Depictions through Structured Descriptions of Sign: An Initial Case Study
John McDonald | Eleni Efthimiou | Stavroula-Evita Fotinea | Rosalee Wolfe

Swedish Sign Language Resources from a User’s Perspective
Johanna Mesch | Thomas Björkstrand | Eira Balkstam | Patrick Hansson | Nikolaus Riemer Kankkonen

Sign Language Translation with Gloss Pair Encoding
Taro Miyazaki | Sihan Tan | Tsubasa Uchida | Hiroyuki Kaneko

SignCollect: A ‘Touchless’ Pipeline for Constructing Large-scale Sign Language Repositories
Gomèr Otterspeer | Ulrika Klomp | Floris Roelofsen

3D-LEX v1.0 – 3D Lexicons for American Sign Language and Sign Language of the Netherlands
Oline Ranum | Gomèr Otterspeer | Jari I. Andersen | Robert G. Belleman | Floris Roelofsen

Signbank 2.0 of Sign Languages: Easy to Administer, Easy to Use, Easy to Share
Ronice Muller de Quadros | Christian Rathmann | Peter Zalán Romanek | Francisco Fernandes | Sther Condé

STK LSF: A Motion Capture Dataset in LSF for SignToKids
Clément Reverdy | Sylvie Gibet | Thibaut Le Naour

Preprocessing Mediapipe Keypoints with Keypoint Reconstruction and Anchors for Isolated Sign Language Recognition
Kyunggeun Roh | Huije Lee | Eui Jun Hwang | Sukmin Cho | Jong C. Park

Decoding Sign Languages: The SL-FE Framework for Phonological Analysis and Automated Annotation
Karahan Şahin | Kadir Gökgöz

Facial Expressions for Sign Language Synthesis using FACSHuman and AZee
Paritosh Sharma | Camille Challant | Michael Filhol

Eye Blink Detection in Sign Language Data Using CNNs and Rule-Based Methods
Margaux Susman | Vadim Kimmelman

SEDA: Simple and Effective Data Augmentation for Sign Language Understanding
Sihan Tan | Taro Miyazaki | Katsutoshi Itoyama | Kazuhiro Nakadai

HamNoSys-based Motion Editing Method for Sign Language
Tsubasa Uchida | Taro Miyazaki | Hiroyuki Kaneko

SignaMed: a Cooperative Bilingual LSE-Spanish Dictionary in the Healthcare Domain
Manuel Vázquez-Enríquez | José Luis Alba-Castro | Ania Pérez-Pérez | Carmen Cabeza-Pereiro | Laura Docío-Fernández

Diffusion Models for Sign Language Video Anonymization
Zhaoyang Xia | Yang Zhou | Ligong Han | Carol Neidle | Dimitris N. Metaxas

A Multimodal Spatio-Temporal GCN Model with Enhancements for Isolated Sign Recognition
Yang Zhou | Zhaoyang Xia | Yuxiao Chen | Carol Neidle | Dimitris N. Metaxas

Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Maite Melero | Sakriani Sakti | Claudia Soria

A Bit of a Problem: Measurement Disparities in Dataset Sizes across Languages
Catherine Arnett | Tyler A. Chang | Benjamin Bergen

How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
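
The byte premium defined in this abstract can be illustrated with a minimal sketch: the ratio of UTF-8 bytes needed to encode content-matched text in two languages. The function names and the single sentence pair below are assumptions made for illustration; the paper computes premiums over parallel corpora, and this is not the authors' released tool.

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes needed to encode the text in UTF-8."""
    return len(text.encode("utf-8"))

def byte_premium(text_a: str, text_b: str) -> float:
    """Ratio of UTF-8 bytes for content-matched text in language A versus language B."""
    denom = utf8_bytes(text_b)
    if denom == 0:
        raise ValueError("text_b must not be empty")
    return utf8_bytes(text_a) / denom

# Toy content-matched pair (a real estimate would use a large parallel corpus).
english = "Hello, world"
russian = "Привет, мир"

premium = byte_premium(russian, english)
print(utf8_bytes(english), utf8_bytes(russian), round(premium, 3))
```

Cyrillic letters take two bytes each in UTF-8 while ASCII letters take one, which is why the same content costs more bytes in the Russian string here.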

A Novel Corpus for Automated Sexism Identification on Social Media
Lutfiye Seda Mut Altin | Horacio Saggion

In this paper, we present a novel dataset for the study of automated sexism identification and categorization on social media in Turkish. For this purpose, we have collected, following a well-established methodology, a set of Tweets and YouTube comments. Relying on expert organizations in the area of gender equality, each text has been annotated based on a two-level labelling schema derived from previous research. Our resulting dataset consists of around 7,000 annotated instances useful for the study of expressions of sexism and misogyny on the Web. To the best of our knowledge, this is the first two-level manually annotated comprehensive Turkish dataset for sexism identification. In order to fuel research in this relevant area, we also present the results of our benchmarking experiments in the area of sexism identification in Turkish.

Advancing Generative AI for Portuguese with Open Decoder Gervásio PT*
Rodrigo Santos | João Ricardo Silva | Luís Gomes | João Rodrigues | António Branco

To advance the neural decoding of Portuguese, in this paper we present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in this respect. To develop this decoder, which we named Gervásio PT*, a strong LLaMA 2 7B model was used as a starting point, and its further improvement through additional training was done over language resources that include new instruction data sets of Portuguese prepared for this purpose, which are also contributed in this paper. All versions of Gervásio are open source and distributed for free under an open license, including for either research or commercial usage, and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese.

Assessing Pre-Built Speaker Recognition Models for Endangered Language Data
Gina-Anne Levow

Significant research has focused on speaker recognition, determining which speaker is speaking in a segment of audio. However, few experiments have investigated speaker recognition for very low-resource or endangered languages. Furthermore, speaker recognition has the potential to support language documentation and revitalization efforts, making recordings more accessible to researchers and communities. Since endangered language datasets are too small to build competitive speaker representations from scratch, we investigate the application of large-scale pre-built speaker recognition models to bridge this gap. This paper compares four speaker recognition models on six diverse endangered language data sets. Comparisons contrast three recent neural network-based x-vector models and an earlier baseline i-vector model. Experiments demonstrate significantly stronger performance for some of the studied models. Further analysis highlights differences in effectiveness tied to the lengths of test audio segments and amount of data used for speaker modeling.

BERTbek: A Pretrained Language Model for Uzbek
Elmurod Kuriyozov | David Vilares | Carlos Gómez-Rodríguez

Recent advances in neural network-based language representation have made it possible for pretrained language models to outperform previous models in many downstream natural language processing (NLP) tasks. These pretrained language models have also shown that, if large enough, they exhibit good few-shot abilities, which is especially beneficial for low-resource scenarios. In this respect, although there are some large-scale multilingual pretrained language models available, language-specific pretrained models have been shown to be more accurate in monolingual evaluation setups. In this work, we present BERTbek - pretrained language models based on the BERT (Bidirectional Encoder Representations from Transformers) architecture for the low-resource Uzbek language. We also provide a comprehensive evaluation of the models on a number of NLP tasks: sentiment analysis, multi-label topic classification, and named entity recognition, comparing the models with various machine learning methods as well as multilingual BERT (mBERT). Experimental results indicate that our models outperform mBERT and other task-specific baseline models in all three tasks. Additionally, we show the impact of training data size and quality on the downstream performance of BERT models by training three different models with different text sources and corpus sizes.

Automatic spell and grammar checking can be done using various system architectures, and large language models have recently been used to solve the task with promising results. Here we describe a new method of creating test data to measure the performance of spell and grammar checkers, including large language models. Three types of test data represent different approaches to evaluation, from basic error detection to error correction with natural language explanations of the corrections made and error severity scores, which is the main novelty of this approach. These additions are especially useful when evaluating large language models. We present a spell and grammar checking test set for Icelandic in which the described approach is applied. The data consists of whole texts instead of discrete sentences, which facilitates evaluating context awareness of models. The resulting test set can be used to compare different spell and grammar checkers and is published under permissive licenses.

Bidirectional English-Nepali Machine Translation(MT) System for Legal Domain
Shabdapurush Poudel | Bal Krishna Bal | Praveen Acharya

Nepali, a low-resource language belonging to the Indo-Aryan language family and spoken in Nepal, India, Sikkim, and Burma, has comparatively very little digital content and resources, particularly in the legal domain. However, the need to translate legal documents is ever-increasing in the context of growing volumes of legal cases and a large population seeking to go abroad for higher education or employment. This underscores the need for developing an English-Nepali Machine Translation system for the legal domain. We attempt to address this problem by utilizing a Neural Machine Translation (NMT) System with an encoder-decoder architecture, specifically designed for legal Nepali-English translation. Leveraging a custom-built legal corpus of 125,000 parallel sentences, our system achieves encouraging BLEU scores of 7.98 in the Nepali → English direction and 6.63 in the English → Nepali direction.

Bangsamoro languages are among the under-resourced languages in the Mindanao region in the Philippines. Moreover, there is currently no publicly available children’s speech data for most of these languages. The BK3AT children’s speech corpus is designed for creating speech technologies that could help facilitators and teachers in K-3 education. The corpus consists of 122 hours of children’s speech data across 10 languages: Bahasa Sug, Chavacano, English, Filipino, Iranun, Maguindanaon, Meranaw, Sinama, Teduray, and Yakan. In preliminary experiments, the Wav2Vec-XLSR architecture was fine-tuned on the Tagalog and L2 English corpus subsets to develop an automatic speech recognition backend for literacy assessment. Results from the experiments show low word error rates (WERs) for small-vocabulary and targeted domains.
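
For readers unfamiliar with the metric reported in this abstract, word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name and the sample sentences are illustrative assumptions, and the abstract does not describe the authors' scoring setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical transcripts: one reference word is missing from the hypothesis.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 3))
```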

CorpusArièja: Building an Annotated Corpus with Variation in Occitan
Clamenca Poujade | Myriam Bras | Assaf Urieli

The Occitan language is a less-resourced language and is classified as ‘in danger’ by UNESCO. It is therefore important to build resources and tools that can help to safeguard and develop the digitisation of the language. CorpusArièja is a collection of 72 texts (just over 41,000 tokens) in the Occitan language of the French department of Ariège. The majority of the texts needed to be digitised and passed through Optical Character Recognition. This corpus contains dialectal and spelling variation, but is limited to prose, without diachronic variation or genre variation. It is an annotated corpus with two levels of lemmatisation, POS tags and verbal inflection. One of the main aims of the corpus is to enable the design of tools that can automatically annotate all Occitan texts, regardless of the dialect or spelling used. The Ariège territory is interesting because it includes the two kinds of variation that we focus on, dialectal and spelling. It has many authors who write in their native language, their variety of Occitan.

For many of the world’s small languages, few resources are available. In this project, a written online accessible corpus was created for the minority language variant Gronings, which serves both researchers interested in language change and variation and a general audience of (new) speakers interested in finding real-life examples of language use. The corpus was created using a combination of volunteer work and automation, which together formed an efficient pipeline for converting printed text to Key Words in Context (KWICs), annotated with lemmas and part-of-speech tags. In the creation of the corpus, we have taken into account several of the challenges that can occur when creating resources for minority languages, such as a lack of standardisation and limited (financial) resources. As the solutions we offer are applicable to other small languages as well, each step of the corpus creation process is discussed and resources will be made available benefiting future projects on other low-resource languages.
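
The Key Words in Context (KWIC) format mentioned in this abstract shows each occurrence of a search word together with a fixed window of surrounding tokens. A minimal sketch follows, assuming whitespace-tokenized input; the function name, the window size and the English sample sentence are illustrative assumptions, not part of the Gronings corpus pipeline, which also adds lemmas and part-of-speech tags.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows

# Hypothetical sample text; any tokenized corpus line would work the same way.
tokens = "the cat sat on the mat near the door".split()
rows = kwic(tokens, "the", window=2)
for left, key, right in rows:
    print(f"{left:>10} | {key} | {right}")
```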

Evaluating Icelandic Sentiment Analysis Models Trained on Translated Data
Ólafur A. Jóhannsson | Birkir H. Arndal | Eysteinn Ö. Jónsson | Stefan Olafsson | Hrafn Loftsson

We experiment with sentiment classification models for Icelandic that leverage machine-translated data for training. Since no large sentiment dataset exists for Icelandic, we translate 50,000 English IMDb reviews, classified either as positive or negative, into Icelandic using two services: Google Translate and GreynirTranslate. After machine translation, we assess whether the sentiment of the source language text is retained in the target language. Moreover, we evaluate the accuracy of the sentiment classifiers on non-translated Icelandic text. The performance of three types of baseline classifiers is compared, i.e., Support Vector Machines, Logistic Regression and Naive Bayes, when trained on translated data generated by either translation service. Furthermore, we fine-tune and evaluate three pre-trained transformer-based models, RoBERTa, IceBERT and ELECTRA, on both the original English texts and the translated texts. Our results indicate that the transformer models perform better than the baseline classifiers on all datasets. Moreover, our evaluation shows that the transformer models trained on data translated from English reviews can be used to effectively classify sentiment on non-translated Icelandic movie reviews.

Digital game-based language learning (DGBLL) can help with the language learning process. DGBLL applications can make learning more enjoyable and engaging, but they are difficult to develop. A DGBLL app that relies on target language texts needs to be able to use texts of the appropriate level for the individual learners. This implies that text classification tools should be available to DGBLL developers, who may not be familiar with the target language, in order to incorporate suitable texts into their games. While text difficulty classifiers exist for many of the most commonly spoken languages, this is not the case for under-resourced languages, such as Irish. In this paper, we explore approaches to the development of text classifiers for Irish. In the first approach to text analysis and grading, we apply linguistic analysis to assess text complexity. Features from this approach are then used in machine learning-based text classification, which explores the application of a number of machine learning algorithms to the problem. Although the development of these text classifiers is at an early stage, they show promise, particularly in a low-resourced scenario.

Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish
Fred Philippy | Shohreh Haddadan | Siwen Guo

In NLP, zero-shot classification (ZSC) is the task of assigning labels to textual data without any labeled examples for the target classes. A common method for ZSC is to fine-tune a language model on a Natural Language Inference (NLI) dataset and then use it to infer the entailment between the input document and the target labels. However, this approach faces certain challenges, particularly for languages with limited resources. In this paper, we propose an alternative solution that leverages dictionaries as a source of data for ZSC. We focus on Luxembourgish, a low-resource language spoken in Luxembourg, and construct two new topic relevance classification datasets based on a dictionary that provides various synonyms, word translations and example sentences. We evaluate the usability of our dataset and compare it with the NLI-based approach on two topic classification tasks in a zero-shot manner. Our results show that by using the dictionary-based dataset, the trained models outperform the ones following the NLI-based approach for ZSC. While we focus on a single low-resource language in this study, we believe that the efficacy of our approach can also transfer to other languages where such a dictionary is available.

To foster the neural encoding of Portuguese, this paper contributes foundation encoder models that represent an expansion of the still very scarce ecosystem of large language models specifically developed for this language that are fully open, in the sense that they are open source and openly distributed for free under an open license for any purpose, thus including research and commercial usages. Like most languages other than English, Portuguese is low-resourced in terms of these foundational language resources, there being the inaugural 900 million parameter Albertina and 335 million Bertimbau. Taking this couple of models as an inaugural set, we present the extension of the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters. While achieving this primary goal, further results that are relevant for this ecosystem were obtained as well, namely new datasets for Portuguese based on the SuperGLUE benchmark, which we also distribute openly.

pdf abs
Improving Language Coverage on HeLI-OTS
Tommi Jauhiainen | Krister Lindén

In this paper, we add under-resourced languages into the language repertoire of an existing off-the-shelf language identifier, HeLI-OTS. Adding more languages to a language identifier often comes with the drawback of lessened accuracy for the languages already part of the repertoire. We aim to minimize this effect. As sources for training and development data in the new languages, we use the OpenLID and FLORES-200 datasets. They are openly available high-quality datasets that are especially well-suited for language identifier development. By carefully inspecting the effect of each added language and the quality of their training and development data, we managed to add support for 20 new under-resourced languages to HeLI-OTS without affecting the performance of any existing languages to a noticeable extent.

pdf abs
Improving Legal Judgement Prediction in Romanian with Long Text Encoders
Mihai Masala | Traian Rebedea | Horia Velicu

In recent years, the entire field of Natural Language Processing (NLP) has enjoyed amazing novel results, achieving almost human-like performance on a variety of tasks. The Legal NLP domain has also been part of this process, as it has seen impressive growth. However, general-purpose models are not readily applicable to the legal domain. Due to the nature of the domain (e.g. specialized vocabulary, long documents), specific models and methods are often needed for Legal NLP. In this work we investigate both specialized and general models for predicting the final ruling of a legal case, a task known as Legal Judgment Prediction (LJP). We particularly focus on methods to extend the sequence length of Transformer-based models to better understand the long documents present in legal corpora. Extensive experiments on 4 LJP datasets in Romanian, originating from 2 sources with significantly different sizes and document lengths, show that specialized models and handling long texts are critical for good performance.

pdf abs
Improving Noisy Student Training for Low-resource Languages in End-to-End ASR Using CycleGAN and Inter-domain Losses
Chia-Yu Li | Ngoc Thang Vu

Training a semi-supervised end-to-end speech recognition system using noisy student training has significantly improved performance. However, this approach requires a substantial amount of paired speech-text and unlabeled speech, which is costly for low-resource languages. Therefore, this paper considers a more extreme case of semi-supervised end-to-end automatic speech recognition where there are limited paired speech-text, unlabeled speech (less than five hours), and abundant external text. Firstly, we observe improved performance by training the model using our previous work on semi-supervised learning “CycleGAN and inter-domain losses” solely with external text. Secondly, we enhance “CycleGAN and inter-domain losses” by incorporating automatic hyperparameter tuning, calling the result “enhanced CycleGAN inter-domain losses.” Thirdly, we integrate it into the noisy student training approach pipeline for low-resource scenarios. Our experimental results, conducted on six non-English languages from Voxforge and Common Voice, show a 20% word error rate reduction compared to the baseline teacher model and a 10% word error rate reduction compared to the baseline best student model, highlighting the significant improvements achieved through our proposed method.

Indonesia is home to a diverse linguistic landscape, where individuals seamlessly transition between Indonesian, English, and local dialects in their everyday conversations—a phenomenon known as code-switching. Understanding and accommodating this linguistic fluidity is essential, particularly in the development of accurate speech recognition systems. However, tackling code-switching in Indonesian poses a challenge due to the scarcity of paired code-switching data. Thus, this study endeavors to address Indonesian-English code-switching in speech recognition, leveraging unlabeled data and employing a semi-supervised technique known as the machine speech chain. Our findings demonstrate that the machine speech chain method effectively enhances Automatic Speech Recognition (ASR) performance in recognizing code-switching between Indonesian and English, utilizing previously untapped resources of unlabeled data.

pdf abs
Inter-language Transfer Learning for Visual Speech Recognition toward Under-resourced Environments
Fumiya Kondo | Satoshi Tamura

In this study, we introduce a method of inter-language transfer learning for under-resourced visual speech recognition. Deploying speech-related technology to all languages is a quite important activity. However, applying state-of-the-art deep-learning techniques requires huge-size labeled corpora, which makes it hard for under-resourced languages. Our approach leverages a small amount of labeled video data of the target language, and employs inter-language transfer learning using a pre-trained English lip-reading model. By applying the proposed scheme, we build a Japanese lip-reading model, using the ROHAN corpus, the size of which is about one 450th of the size of English datasets. The front-end encoder part of the pre-trained model is fine-tuned to improve the acquisition of pronunciation and lip movement patterns unique to Japanese. On the other hand, the back-end encoder and the decoder are built using the Japanese dataset. Although English and Japanese have different language structures, evaluation experiments show that it is possible to build the Japanese lip-reading model efficiently. Comparison with competitive schemes demonstrates the effectiveness of our method.

pdf abs
Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study
Wan-hua Her | Udo Kruschwitz

Machine Translation has made impressive progress in recent years offering close to human-level performance on many languages, but studies have primarily focused on high-resource languages with broad online presence and resources. With the help of growing Large Language Models, more and more low-resource languages achieve better results through the presence of other languages. However, studies have shown that not all low-resource languages can benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions of low-resource languages such as data scarcity and parameter sensitivity and focus on refined solutions that combat low-resource difficulties and creative solutions such as harnessing language similarity. Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate noisiness in the data and present our approach to carry out text preprocessing extensively. Evaluation was conducted using combined metrics: BLEU, chrF and TER. Statistical significance results with Bonferroni correction show surprisingly high baseline systems, and that Back-translation leads to significant improvement. Furthermore, we present a qualitative analysis of translation errors and system limitations.
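
The abstract above mentions statistical significance testing with Bonferroni correction. As a minimal sketch of that correction in general (not the authors' evaluation code), each of m comparisons is tested against the threshold alpha / m instead of alpha. The function name and the p-values in the usage note are hypothetical and chosen only for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return, for each p-value, whether it stays significant after
    Bonferroni correction (threshold = alpha / number of comparisons)."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values for three system comparisons.
decisions = bonferroni_significant([0.01, 0.02, 0.04], alpha=0.05)
```

With three comparisons the per-test threshold is 0.05 / 3, so only the first hypothetical p-value (0.01) remains significant.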

pdf abs
Italian-Ligurian Machine Translation in Its Cultural Context
Christopher R. Haberland | Jean Maillard | Stefano Lusito

Large multilingual machine translation efforts are driving improved access and performance for under-resourced languages, but often fail to translate culturally specific and local concepts. Additionally, translation from practically relevant input languages may lag behind those that are comparatively over-represented in the training dataset. In this work, we release a new corpus, ZenaMT, containing 7,561 parallel Ligurian-Italian sentences, nearly a fifth of which are also translated into English. This corpus spans five domains: local and international news, Ligurian literature, Genoese Ligurian linguistics concepts, traditional card game rules, and Ligurian geographic expressions. We find that a translation model augmented with ZenaMT improves a baseline by 20%, and by over 25% (BLEU) compared to NLLB-3.3B, which is over 50 times the size. Our results demonstrate the utility of creating data sets for MT that are specifically tailored for the cultural context of Ligurian speakers. We freely release ZenaMT and expect to periodically update the corpus to improve MT performance and domain coverage.

pdf abs
Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset
Gabriel de Jesus | Sérgio Nunes

This paper introduces Labadain-30k+, a monolingual dataset comprising 33.6k documents in Tetun, a low-resource language spoken in Timor-Leste. The dataset was acquired through web crawling and augmented with Wikipedia documents released by Wikimedia. Both sets of documents underwent thorough manual audits at the document level by native Tetun speakers, resulting in the construction of a Tetun text dataset well-suited for a variety of natural language processing and information retrieval tasks. This dataset was employed to conduct a comprehensive content analysis aimed at providing a nuanced understanding of document composition and the evolution of Tetun documents on the web. The analysis revealed that news articles constitute the predominant documents within the dataset, accounting for 89.87% of the total, followed by Wikipedia documents at 4.34%, and legal and governmental documents at 3.65%, among others. Notably, there was a substantial increase in the number of documents in 2020, indicating an 11.75 percentage point rise in document quantity, compared to an average of 4.76 percentage points per year from 2001 to 2023. Moreover, the year 2017, marked by the increased popularity of online news in Tetun, served as a threshold for analyzing the evolution of document writing on the web pre- and post-2017, specifically regarding vocabulary usage. Surprisingly, this analysis showed a significant increase of 6.12 percentage points in Tetun text written in adherence to the official Tetun standard. Additionally, the persistence of Portuguese loanwords in that trajectory remained evident, reflecting an increase of 5.09 percentage points.

pdf abs
Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining
Nikola Ljubešić | Vít Suchomel | Peter Rupnik | Taja Kuzman | Rik van Noord

The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being in enriching large collections of data with metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models on the set of very closely related languages - Croatian, Serbian, Bosnian and Montenegrin, by setting up a diverse benchmark for these languages, and comparing the trained-from-scratch models with the new models constructed via additional pretraining of existing multilingual models. We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.

pdf abs
Man or Machine: Evaluating Spelling Error Detection in Danish Newspaper Corpora
Eckhard Bick | Jonas Nygaard Blom | Marianne Rathje | Jørgen Schack

This paper evaluates frequency and detection performance for both spelling and grammatical errors in a corpus of published Danish newspaper texts, comparing the results of three human proofreaders with those of an automatic system, DanProof. Adopting the error categorization scheme of the latter, we look at the accuracy of individual error types and their relative distribution over time, as well as the adequacy of suggested corrections. Finally, we discuss so-called artefact errors introduced by corpus processing, and the potential of DanProof as a corpus cleaning tool for identifying and correcting format conversion, OCR or other compilation errors. In the evaluation, with balanced F1-scores of 77.6 and 67.6 for 1999 texts and 2019 texts, respectively, DanProof achieved a higher recall and accuracy than the individual human annotators, and contributed the largest share of errors not detected by others (16.4% for 1999 and 23.6% for 2019). However, the human annotators had a significantly higher precision. Not counting artifacts, the overall error frequency in the corpus was low ( 0.5%), and less than half in the newer texts compared to the older ones, a change that mostly concerned orthographical errors, with a correspondingly higher relative share of grammatical errors.

Metadata are key components of language resources and facilitate their exploitation and re-use. Their creation is a labour-intensive process and requires a modeling step, which identifies resource-specific information as well as standards and controlled vocabularies that can be reused. In this article, we focus on metadata for documenting text bases for regional languages of France characterised by several levels of variation (space, time, usage, social status), based on a survey of existing metadata schemas. Moreover, we implement our metadata model as a database structure for the Heurist data management system, which combines the ease of use of spreadsheets with the ability of relational databases to model complex relationships between entities. The Heurist template is made freely available and was used to describe metadata for text bases in Alsatian and Poitevin-Saintongeais. We also propose tools to automatically generate XML metadata header files from the database.

pdf abs
Mixat: A Data Set of Bilingual Emirati-English Speech
Maryam Khalifa Al Ali | Hanan Aldarmaki

This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. Mixat was developed to address the shortcomings of current speech recognition resources when applied to Emirati speech, and in particular, to bilingual Emirati speakers who often mix and switch between their local dialect and English. The data set consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers, one of which is in the form of conversations between the host and a guest. Therefore, the collection contains examples of Emirati-English code-switching in both formal and natural conversational contexts. In this paper, we describe the process of data collection and annotation, and describe some of the features and statistics of the resulting data set. In addition, we evaluate the performance of pre-trained Arabic and multi-lingual ASR systems on our dataset, demonstrating the shortcomings of existing models on this low-resource dialectal Arabic, and the additional challenge of recognizing code-switching in ASR. The dataset will be made publicly available for research use.

pdf abs
Bi-dialectal ASR of Armenian from Naturalistic and Read Speech
Malajyan Arthur | Victoria Khurshudyan | Karen Avetisyan | Hossep Dolatian | Damien Nouvel

The paper explores the development of Automatic Speech Recognition (ASR) models for Armenian, by using data from two standard dialects (Eastern Armenian and Western Armenian). The goal is to develop a joint bi-variational model. We achieve state-of-the-art results. Results from our ASR experiments demonstrate the impact of dataset selection and data volume on model performance. The study reveals limited transferability between dialects, although integrating datasets from both dialects enhances overall performance. The paper underscores the importance of dataset diversity and volume in ASR model training for under-resourced languages like Armenian.

pdf abs
Multilingual Self-supervised Visually Grounded Speech Models
Huynh Phuong Thanh Nguyen | Sakriani Sakti

Developing a multilingual speech-to-speech translation system poses challenges due to the scarcity of paired speech data in various languages, particularly when dealing with unknown and untranscribed languages. However, the shared semantic representation across multiple languages presents an opportunity to build a translation system based on images. Recently, researchers have explored methods for aligning bilingual speech as a novel approach to discovering speech pairs using semantic images from unknown and untranscribed speech. These aligned speech pairs can then be utilized to train speech-to-speech translation systems. Our research builds upon these approaches by expanding into multiple languages and focusing on achieving multimodal multilingual pair alignment, with a key component being multilingual visually grounded speech models. The objectives of our research are twofold: (1) to create visually grounded speech datasets for English, Japanese, Indonesian, and Vietnamese, and (2) to develop self-supervised visually grounded speech models for these languages. Our experiments have demonstrated the feasibility of this approach, showcasing the ability to retrieve associations between speech and images. The results indicate that our multilingual visually grounded speech models yield promising outcomes in representing speech using semantic images across multiple languages.

pdf abs
Nepal Script Text Recognition Using CRNN CTC Architecture
Swornim Nakarmi | Sarin Sthapit | Arya Shakya | Rajani Chulyadyo | Bal Krishna Bal

Nepal Script (also known as Prachalit Script) is the widely used script of Nepal Bhasa, the native language of the Kathmandu Valley in Nepal. Derived from the Brahmi Script, the Nepal Script was developed in the 9th century and was extensively used till the 20th century, before being replaced by the Devanagari script. Numerous ancient manuscripts, inscriptions, and documents written in the Nepal Script are still available containing immense knowledge on architecture, arts, astrology, ayurveda, literature, music, tantrism, etc. To preserve and revive Nepal Bhasa, digitizing such documents plays a crucial role. This paper presents our work on text recognition for the Nepal Script. The implementation includes the Nepal Script text recognizer based on CRNN CTC architecture aided by line and word segmentations. Leveraging a carefully curated dataset that encompasses handwritten and printed texts in the Nepal Script, our work has achieved CER of 6.65% and WER of 13.11%. The dataset used for this work is available as Nepal Script Text Dataset on Kaggle. The paper further explores the associated challenges due to the complex nature of the script such as conjuncts, modifiers and variations; and the current state of the script.
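
The abstract above reports a character error rate (CER) and a word error rate (WER). As a minimal sketch of how these two metrics are commonly defined (edit distance divided by reference length), and not the authors' evaluation script, the following uses hypothetical function names and plain Python only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, counting
    insertions, deletions and substitutions with unit cost."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words (whitespace tokens)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Whitespace tokenization is an assumption here; a recognizer for a script with conjuncts and modifiers may need a different unit of segmentation before the same formulas are applied.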

pdf abs
NLP for Arbëresh: How an Endangered Language Learns to Write in the 21st Century
Giulio Cusenza | Çağrı Çöltekin

Societies are becoming more and more connected, and minority languages often find themselves helpless against the advent of the digital age, with their speakers having to regularly turn to other languages for written communication. This work introduces the case of Arbëresh, a southern Italian language related to Albanian. It presents the very first machine-readable Arbëresh data, collected through a web campaign, and describes a set of tools developed to enable the Arbëresh people to learn how to write their language, including a spellchecker, a conjugator, a numeral generator, and an interactive platform to learn Arbëresh spelling. A comprehensive web application was set up to make these tools available to the public, as well as to collect further data through them. This method can be replicated to help revive other minority languages in a situation similar to Arbëresh’s. The main challenges of the process were the extremely low-resource setting and the variability of Arbëresh dialects.

pdf abs
PersianEmo: Enhancing Farsi-Dari Emotion Analysis with a Hybrid Transformer and Recurrent Neural Network Model
Mohammad Ali Hussiny | Mohammad Arif Payenda | Lilja Øvrelid

Emotion analysis is a critical research domain within the field of natural language processing (NLP). While substantial progress has been made in this area for the Persian language, there is still a need for more precise models and larger datasets specifically focusing on the Farsi and Dari dialects. In this research, we introduce “LearnArmanEmo” as a new dataset and a superior ensemble approach for Persian text emotion classification. Our proposed model, which combines XLM-RoBERTa-large and BiGRU, undergoes evaluation on LetHerLearn for the Dari dialect, ARMANEMO for the Farsi dialect, and LearnArmanEmo for both Dari and Farsi dialects. The empirical results substantiate the efficacy of our approach with the combined model demonstrating superior performance. Specifically, our model achieves an F1 score of 72.9% on LetHerLearn, an F1 score of 77.1% on ARMANEMO, and an F1 score of 78.8% on the LearnArmanEmo dataset, establishing it as a better ensemble model for these datasets. These findings underscore the potential of this hybrid model as a useful tool for enhancing the performance of emotion analysis in Persian language processing.

pdf abs
Philippine Languages Database: A Multilingual Speech Corpora for Developing Systems for Low-Resource Languages
Rowena Cristina L. Guevara | Rhandley D. Cajote | Michael Gringo Angelo R. Bayona | Crisron Rudolf G. Lucas

Previous efforts to collect Filipino speech were made in the development of the Filipino-Speech Corpus, TAGCO, and the Filipino-Bisaya speech corpus. These corpora, however, are either domain-specific, non-parallel, non-multilingual or relatively insufficient for the development of state-of-the-art Automatic Speech Recognizers (ASR) and Text-To-Speech Systems (TTS), which usually require hundreds of hours of speech data. This paper presents a multilingual corpus for the Philippine languages, namely: Filipino, English, Cebuano, Kapampangan, Hiligaynon, Ilokano, Bikolano, Waray, and Tausug. The Philippine Languages Database (PLD) includes over 454 hours of recordings from speakers of the ten languages, covering multiple domains in news, medical, education, tourism and spontaneous speech. The applicability of the corpus has also been demonstrated in adult and children ASR, phoneme transcriber, voice conversion, and TTS applications.

pdf abs
Prompting towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot
Michelle Terblanche | Kayode Olaleye | Vukosi Marivate

Many multilingual communities, including numerous in Africa, frequently engage in code-switching during conversations. This behaviour stresses the need for natural language processing technologies adept at processing code-switched text. However, data scarcity, particularly in African languages, poses a significant challenge, as many are low-resourced and under-represented. In this study, we prompted GPT 3.5 to generate Afrikaans–English and Yoruba–English code-switched sentences, enhancing diversity using topic-keyword pairs, linguistic guidelines, and few-shot examples. Our findings indicate that the quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans–English success rate. There is therefore a notable opportunity to refine prompting guidelines to yield sentences suitable for the fine-tuning of language models. We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT and propose leveraging this technology to mitigate data scarcity in low-resourced languages, underscoring the essential role of native speakers in this process.

pdf abs
Quantifying the Ethical Dilemma of Using Culturally Toxic Training Data in AI Tools for Indigenous Languages
Pedro Henrique Domingues | Claudio Santos Pinhanez | Paulo Cavalin | Julio Nogima

This paper tries to quantify the ethical dilemma of using culturally toxic training data to improve the performance of AI tools for ultra low-resource languages such as Indigenous languages. Our case study explores the use of Bible data, which is both a commonly available source of training pairs for translators of Indigenous languages and a text which has a trail of physical and cultural violence for many Indigenous communities. In the context of fine-tuning a WMT19 German-to-English model into a Guarani Mbya-to-English translator, we first show, with two commonly-used Machine Translation metrics, that using only Bible data is not enough to create successful translators for everyday sentences gathered from a dictionary. Indeed, even fine-tuning with only 3,000 pairs of data from the dictionary produces significant increases in accuracy compared to Bible-only models. We then show that simultaneously fine-tuning with dictionary and Bible data achieves a substantial increase over the accuracy of a dictionary-only trained translator, and that a similar increase occurs when using two-step methods of fine-tuning. However, we also observed a measurable amount of contaminated text from the Bible in the outputs of the best translator, creating concerns about its release to an Indigenous community. We end by discussing mechanisms to mitigate the negative impacts of this contamination.

pdf abs
Residual Dropout: A Simple Approach to Improve Transformer’s Data Efficiency
Carlos Escolano | Francesca De Luca Fornaciari | Maite Melero

Transformer models often demand a vast amount of training data to achieve the desired level of performance. However, this data requirement poses a major challenge for low-resource languages seeking access to high-quality systems, particularly in tasks like Machine Translation. To address this issue, we propose adding Dropout to Transformer’s Residual Connections. Our experimental results demonstrate that this modification effectively mitigates overfitting during training, resulting in substantial performance gains of over 4 BLEU points on a dataset consisting of merely 10 thousand examples.
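
The abstract above proposes adding dropout to the Transformer's residual connections. The sketch below shows one possible reading of that idea, in which dropout is applied to the skip path itself, so that y = Dropout(x) + Sublayer(x). This placement, the function names, and the use of inverted dropout are assumptions made for illustration; the abstract does not specify them.

```python
import random


def dropout(vector, rate, rng, training=True):
    """Inverted dropout: zero each element with probability `rate` and
    rescale the survivors by 1 / (1 - rate). Identity at inference time."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def residual_block(x, sublayer, rate, rng, training=True):
    """Assumed placement: dropout on the skip connection, then add the
    sublayer output computed from the undropped input."""
    skip = dropout(x, rate, rng, training)
    return [s + f for s, f in zip(skip, sublayer(x))]


# Toy sublayer standing in for attention or a feed-forward layer.
rng = random.Random(0)
output = residual_block([1.0, 2.0, 3.0], lambda v: [2.0 * t for t in v], 0.1, rng)
```

At inference time (`training=False`) the block reduces to the ordinary residual sum x + Sublayer(x).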

pdf abs
Resource Acquisition for Understudied Languages: Extracting Wordlists from Dictionaries for Computer-assisted Language Comparison
Frederic Blum | Johannes Englisch | Alba Hermida Rodriguez | Rik van Gijn | Johann-Mattis List

Comparative wordlists play a crucial role in historical language comparison. They are regularly used for the identification of related words and languages, or for the reconstruction of language phylogenies and proto-languages. While automated solutions exist for the majority of methods used for this purpose, no standardized computational or computer-assisted approaches for the compilation of comparative wordlists have been proposed so far. To date, scholars compile wordlists by sifting manually through dictionaries or similar language resources and typing them into spreadsheets. In this study we present a semi-automatic approach to extract wordlists from machine-readable dictionaries. The transparent workflow allows users to build user-defined wordlists for individual languages in a standardized format. By automating the search for translation equivalents in dictionaries, our approach greatly facilitates the aggregation of individual resources into multilingual comparative wordlists that can be used for a variety of purposes.

pdf abs
Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation
Seunghyun Ji | Hagai Raja Sinulingga | Darongsae Kwon

Low-resourced data presents a significant challenge for neural machine translation. In most cases, the low-resourced environment is caused by high costs due to the need for domain experts or the lack of language experts. Therefore, identifying the most training-efficient data within an unsupervised setting emerges as a practical strategy. Recent research suggests that such effective data can be identified by selecting ‘appropriately complex data’ based on its volume, providing strong intuition for unsupervised data selection. However, we have discovered that establishing criteria for unsupervised data selection remains a challenge, as the ‘appropriate level of difficulty’ may vary depending on the data domain. We introduce a novel unsupervised data selection method named ‘Capturing Perplexing Named Entities,’ which leverages the maximum inference entropy in translated named entities as a metric for selection. When tested with the ‘Korean-English Parallel Corpus of Specialized Domains,’ our method served as robust guidance for identifying training-efficient data across different domains, in contrast to existing methods.
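
The abstract above describes a selection metric based on the maximum inference entropy in translated named entities. The sketch below illustrates one way such a score could be computed: the Shannon entropy of the model's predictive distribution is taken at each named-entity token, and the sentence score is the maximum of those values. The input format (per-token probability lists plus entity positions), the function names, and the fallback score of 0.0 for sentences without entities are all assumptions, not details from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entity_entropy(token_distributions, entity_positions):
    """Highest predictive entropy among the tokens marked as belonging
    to named entities; 0.0 if the sentence has no entity tokens."""
    return max(
        (entropy(token_distributions[i]) for i in entity_positions),
        default=0.0,
    )


# Hypothetical three-token output; tokens 1 and 2 form a named entity.
distributions = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
score = max_entity_entropy(distributions, entity_positions=[1, 2])
```

In this toy case the score comes from token 1, whose uniform two-way distribution has the highest entropy among the entity tokens. How the scores are then used to rank or filter sentences is left open here.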

pdf abs
Seeding Alignment between Language Technology and Indigenous Methodologies: A Decolonizing Framework for Endangered Language Revitalization
Craig John Carpenter | John Lyon | Miles Thorogood | Jeannette C. Armstrong

The integration of a speech technology into a digital edition to support the acquisition of a critically endangered Indigenous language is a complex task. More than simply consisting of technical challenges of working with an under-resourced language, researchers face the potential of re-enacting causes of language endangerment without rigorous adherence to qualitative methodologies. Based on reflections throughout the development process of a speech technology, this paper proposes a cross-disciplinary decolonizing framework for researchers working in the field of computational linguistics for Indigenous Language Revitalization (ILR). The authors propose a series of qualitative methodologies to ensure alignment with the language community which the technology is intended to benefit. The proposed relational framework is designed to sustain the integrity of the Four Rs: a series of principles first presented by Verna J. Kirkness and Ray Barnhardt in their 1991 article, “First Nations and Higher Education: The Four R’s - Respect, Relevance, Reciprocity, Responsibility”.

To produce high-quality Natural Language Processing (NLP) technologies for low-resource languages, authentic leadership and participation from the low-resource language community is crucial. This reduces chances of bias, surveillance and the inclusion of inaccurate data that can negatively impact output in language technologies. It also ensures that decision-making throughout the pipeline of work centres on the language community rather than only prioritising metrics. The NLP building process involves a range of steps and decisions to ensure the production of successful models and outputs. Rarely does a model perform as expected or desired the first time it is deployed for testing, resulting in the need for re-assessment and re-deployment. This paper discusses the process involved in solving failure modes for a Māori language automatic speech recognition (ASR) model. It explains how the data is curated and how language and data specialists offer unparalleled insight into the debugging process because of their knowledge of the data. This expertise has a significant influence on decision-making to ensure the entire pipeline is embedded in ethical practice and the work is culturally appropriate for the Māori language community thus creating trustworthy language technology.

pdf abs
Tandem Long-Short Duration-based Modeling for Automatic Speech Recognition
Dalai Mengke | Yan Meng | Peter Mihajlik

This study outlines our duration-dependent modeling experiments on limited-resource Hungarian speech recognition tasks. As is well known, very short utterances pose significant challenges in automatic speech recognition due to the lack of context and other phenomena. In particular, we found that the exclusion of shorter speech samples from fine-tuning for longer duration test data significantly improves the recognition rate measured on public Hungarian datasets, BEA-Base and CommonVoice (CV). Therefore, we apply a tandem modeling approach in which separate models are used for short and long duration test data. Our strategy improved the ability to recognize short utterances while maintaining recognition of long utterances efficiently, which led to a significant increase in overall recognition accuracy.

pdf abs
TELP – Text Extraction with Linguistic Patterns
João Cordeiro | Purificação Moura Silvano | António Leal | Sebastião Pais

Linguistic studies in under-resourced languages pose additional challenges at various levels, including the automatic collection of examples, cases, and corpora construction. Several sophisticated applications, such as GATE (Cunningham, 2002), can be configured/adjusted/programmed by experts to automatically collect examples from the Web in any language. However, these applications are too complex and intricate to be operated, requiring, in some cases, skills in computer science. In this work, we present TELP, a tool that allows for the simplified expression of linguistic patterns to extract case studies automatically from World Wide Web sites. It is a straightforward application with an intuitive GUI and a quick learning curve, facilitating its broad use by researchers from different domains. In this paper, we describe the operational and technical aspects of TELP and some relatively recent and relevant use cases in the field of linguistic studies.

pdf abs
The First Parallel Corpus and Neural Machine Translation Model of Western Armenian and English
Ari Nubar Boyacıoğlu | Jan Niehues

Western Armenian is a low-resource language spoken by the Armenian Diaspora residing in various places of the world. Although it has content on the internet as well as a relatively rich literary heritage for a minority language, there is no data for the machine translation task and only a very limited amount of labeled data for other NLP tasks. In this work, we build the first machine translation system between Western Armenian and English. We explore different techniques for data collection and evaluate their impact in this very low-resource scenario. Then, we build the machine translation system while focusing on the possibilities of performing knowledge transfer from Eastern Armenian. The system is finetuned with the data collected for the first Western Armenian-English parallel corpus, which contains a total of approximately 147k sentence pairs, whose shareable part of 52k examples was made open-source. The best system in our experiments achieves a BLEU score of 29.8 when translating into English and 17 when translating into Western Armenian.

pdf abs
Tracing Linguistic Heritage: Constructing a Somali-Italian Terminological Resource through Explorers’ Notebooks and Contemporary Corpus Analysis
Silvia Piccini | Giuliana Elizabeth Vilela Ruiz | Andrea Bellandi | Enrico Carniani

The aim of this contribution is to introduce the initial phases of constructing a Somali-Italian terminological resource that dates back to Italy’s colonial expansion into Africa. Specifically, the terminological data was extracted from the notebooks authored by the Italian explorer Ugo Ferrandi (1852 - 1928) and published by the Società Geografica in 1903 under the title “Lugh. Emporio Commerciale sul Giuba”. In order to develop Ferrandi’s terminological resource, we have employed Semantic Web technologies (RDF, OWL, and SPARQL) and embraced the Linked Open Data paradigm. This ensures the FAIRness of the data and enables the publication and sharing of our terminological resource within an open interconnected Web of Data, thus contributing to addressing the absence of Somali in the Linguistic Linked Data cloud. Whenever feasible, Ferrandi’s lexicon entries have been linked and enriched with information derived from a Somali lexicon included in a contemporary Somali Corpus. This approach allows the synchronic corpus-related Somali lexicon to acquire historical depth, thereby illuminating the linguistic dynamics that have transpired over time and would otherwise have remained obscure.

pdf abs
Uncovering Social Changes of the Basque Speaking Twitter Community During COVID-19 Pandemic
Joseba Fernandez de Landa | Iker García-Ferrero | Ander Salaberria | Jon Ander Campos

The aim of this work is to study the impact of the COVID-19 pandemic on the Basque speaking Twitter community by applying unsupervised Natural Language Processing techniques. In order to carry out this study, we collected and publicly released the biggest dataset of Basque tweets containing up to 8M tweets from September 2019 to February 2021. To analyze the impact of the pandemic, the variability of the content over time was studied through quantitative and qualitative analysis of words and emojis. For the quantitative analysis, the shift in the frequency of the terms was calculated using linear regression over frequencies. On the other hand, for the qualitative analysis, word embeddings were used to study the changes in the meaning of the most significant words and emojis at different periods of the pandemic. Through this multifaceted approach, we discovered noteworthy alterations in the political inclinations exhibited by Basque users throughout the course of the pandemic.
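As an illustration of the quantitative step described above, the following is a minimal sketch of estimating a term's frequency shift as the slope of an ordinary least-squares line fitted to its relative frequency per time period. The function names (`trend_slope`, `rank_shifts`), the choice of relative frequencies, and the input layout are assumptions made for this example; the abstract does not specify the authors' actual implementation.

```python
def trend_slope(relative_freqs):
    """Least-squares slope of relative frequency against the period index 0..n-1."""
    n = len(relative_freqs)
    if n < 2:
        raise ValueError("need at least two time periods")
    mean_x = (n - 1) / 2
    mean_y = sum(relative_freqs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(relative_freqs))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def rank_shifts(term_counts, period_totals):
    """Rank terms by the absolute slope of their relative frequency over time.

    term_counts: dict mapping a term to its raw count in each period (hypothetical layout).
    period_totals: total number of tokens observed in each period.
    """
    slopes = {}
    for term, counts in term_counts.items():
        rel = [c / t for c, t in zip(counts, period_totals)]
        slopes[term] = trend_slope(rel)
    # Largest absolute change first; the sign tells growth from decline.
    return sorted(slopes.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

A positive slope indicates a term that became more frequent over the observed periods, and a negative slope indicates one that declined. Normalising by the period total keeps the slope from simply reflecting changes in overall tweet volume.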

This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an open call for participation to new members and countries.

pdf abs
Unsupervised Outlier Detection for Language-Independent Text Quality Filtering
Jón Daðason | Hrafn Loftsson

Web-crawled corpora offer an abundant source of training data for language models. However, they are generally noisy and are typically filtered using heuristic rules or classifiers. These methods require careful tuning or labeling by fluent speakers. In this paper, we assess the effectiveness of commonly applied rules on TQ-IS, a manually labeled text quality dataset for Icelandic. Additionally, we advocate for the utilization of unsupervised clustering and outlier detection algorithms for filtering. These algorithms are language-independent, computationally efficient and do not require language expertise. Using grid search, we find the optimal configuration for every combination of rules, optimizing for F1 score on TQ-IS. For a rule-based approach, we discover that optimal results can be achieved with only a small subset of the full ruleset. Using five rules, we obtain an F1 score of 98.2%. We then evaluate three unsupervised algorithms, i.e., Gaussian Mixture Models (GMMs), Isolation Forests and One-Class SVMs. Our findings reveal that unsupervised algorithms perform well on the TQ-IS dataset, with GMMs obtaining the best results, comparable to those obtained with the rule-based approach. Finally, we show that unsupervised methods appear to be equally suitable for languages other than Icelandic, including Estonian and Basque.

pdf abs
UzABSA: Aspect-Based Sentiment Analysis for the Uzbek Language
Sanatbek Gayratovich Matlatipov | Jaloliddin Rajabov | Elmurod Kuriyozov | Mersaid Aripov

The objective of enhancing the availability of natural language processing technologies for low-resource languages has significant importance in facilitating technological accessibility within the populations of speakers of these languages. To our current understanding, there are no established open-source linguistic resources available to develop aspect-based sentiment analysis (ABSA) tools tailored to the Uzbek language. This work aims to address the aforementioned gap by presenting the first high-quality annotated ABSA dataset - UzABSA. The data used in this study was obtained from a compilation of online reviews of Uzbek restaurants. Consequently, the constructed dataset comprises 3500 reviews at the document level and 6100+ sentences at the sentence level. The popular approach to language resources of this kind explores four distinctive characteristics, namely Aspect Terms, Aspect Term Polarities, Aspect Category Terms, as well as Aspect Category Polarities. To the best of our knowledge, it is the first and the largest ABSA dataset for the Uzbek language. To evaluate the annotation process of our dataset, we used established statistical techniques such as Cohen’s kappa coefficient and Krippendorff’s 𝛼 to assess agreement between annotators. Subsequently, a classification model, namely K-Nearest Neighbour (KNN), was used to evaluate the performance of the created dataset. Both sets of evaluation techniques demonstrate comparable levels of accuracy. The first findings across the various tasks showed promising outcomes, with accuracy rates ranging from 72% to 88%. This study not only highlights the significance of our acquired dataset but also serves as a valuable tool for scholars interested in furthering sentiment analysis in the Uzbek language.
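For readers unfamiliar with the agreement statistic mentioned above, here is a minimal sketch of Cohen's kappa for two annotators, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label distribution. The function name `cohens_kappa` and the example labels are hypothetical; this is not the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical polarity labels from two annotators for four review sentences.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.75, expected 0.5
```

Kappa corrects raw agreement for the agreement two annotators would reach by chance, which matters when one label (for example, positive polarity) dominates the data.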

pdf abs
ViHealthNLI: A Dataset for Vietnamese Natural Language Inference in Healthcare
Huyen Nguyen | Quyen The Ngo | Thanh-Ha Do | Tuan-Anh Hoang

This paper introduces ViHealthNLI, a large dataset for the natural language inference problem for Vietnamese. Unlike the similar Vietnamese datasets, ours is specific to the healthcare domain. We conducted an exploratory analysis to characterize the dataset and evaluated the state-of-the-art methods on the dataset. Our findings indicate that the dataset poses significant challenges while also holding promise for further advanced research and the creation of practical applications.

pdf abs
Why the Unexpected? Dissecting the Political and Economic Bias in Persian Small and Large Language Models
Ehsan Barkhordar | Surendrabikram Thapa | Ashwarya Maratha | Usman Naseem

Recently, language models (LMs) like BERT and large language models (LLMs) like GPT-4 have demonstrated potential in various linguistic tasks such as text generation, translation, and sentiment analysis. However, these abilities come with the risk of perpetuating biases from their training data. Political and economic inclinations play a significant role in shaping these biases. Thus, this research aims to understand political and economic biases in Persian LMs and LLMs, addressing a significant gap in AI ethics and fairness research. Focusing on the Persian language, our research employs a two-step methodology. First, we utilize the political compass test adapted to Persian. Second, we analyze biases present in these models. Our findings indicate the presence of nuanced biases, underscoring the importance of ethical considerations in AI deployments within Persian-speaking contexts.

pdf abs
Work in Progress: Text-to-speech on Edge Devices for Te Reo Māori and ‘Ōlelo Hawaiʻi
Tūreiti Keith

Existing popular text-to-speech technologies focus on large models requiring a large corpus of recorded speech to train. The resulting models are typically run on high-resource servers where users synthesise speech from a client device requiring constant connectivity. For speakers of low-resource languages living in remote areas, this approach does not work. Corpora are typically small and synthesis needs to run on an unconnected, battery or solar-powered edge device. In this paper, we demonstrate how knowledge transfer and adversarial training can be used to create efficient models capable of running on edge devices using a corpus of only several hours. We apply these concepts to create a voice synthesiser for te reo Māori (the indigenous language of Aotearoa New Zealand) for a non-speaking user and ‘ōlelo Hawaiʻi (the indigenous language of Hawaiʻi) for a legally blind user, thus creating the first high-quality text-to-speech tools for these endangered, central-eastern Polynesian languages capable of running on a low powered edge device.

pdf (full)
bib (full) Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024

Many of the world’s languages are left behind when it comes to Language Technology applications, since most of these are available only in a limited number of languages, creating a digital divide that affects millions of users worldwide. It is crucial, therefore, to monitor and quantify the progress of technology support for individual languages, which also enables comparisons across language communities. In this way, efforts can be directed towards reducing language barriers, promoting economic and social inclusion, and ensuring that all citizens can use their preferred language in the digital age. This paper critically reviews and compares recent quantitative approaches to measuring technology support for languages. Despite using different approaches and methodologies, the findings of all analysed papers demonstrate the unequal distribution of technology support and emphasise the existence of a digital divide among languages.

pdf bib abs
Which Domains, Tasks and Languages are in the Focus of NLP Research on the Languages of Europe?
Diego Alves | Marko Tadić | Georg Rehm

This article provides a thorough mapping of NLP and Language Technology research on 39 European languages onto 46 domains. Our analysis is based on almost 50,000 papers published between 2010 and October 2022 in the ACL Anthology. We use a dictionary-based approach to identify 1) languages, 2) domains, and 3) NLP tasks in these papers; the dictionary-based method using exact terms has a precision value of 0.81. Moreover, we identify common mistakes which can be useful to fine-tune the methodology for future work. While we are only able to highlight selected results in this submitted version, the final paper will contain detailed analyses and charts on a per-language basis. We hope that this study can contribute to digital language equality in Europe by providing information to the academic and industrial research community about the opportunities for novel LT/NLP research.

pdf abs
Fine-Tuning Open Access LLMs for High-Precision NLU in Goal-Driven Dialog Systems
Lluís Padró | Roser Saurí

This paper presents a set of experiments on fine-tuning LLMs to produce high-precision semantic representations for the NLU component of a dialog system front-end. The aim of this research is threefold: First, we want to explore the capabilities of LLMs on real, industry-based use cases that involve complex data and strict requirements on results. Since the LLM output should be usable by the application back-end, the produced semantic representation must satisfy strict format and consistency requirements. Second, we want to evaluate the cost-benefit of open-source LLMs, that is, the feasibility of running this kind of model on machines affordable to small-medium enterprises (SMEs), in order to assess how far these organizations can go without depending on the large players controlling the market, and with a moderate use of computation resources. Finally, we also want to assess the language scalability of the LLMs in this kind of application; specifically, whether a multilingual model is able to cast patterns learnt from one language to other ones –with special attention to underresourced languages–, thus reducing required training data and computation costs. This work was carried out within an R&D context of assisting a real company in defining its NLU model strategy, and thus the results have a practical, industry-level focus.

pdf abs
Could We Have Had Better Multilingual LLMs if English Was Not the Central Language?
Ryandito Diandaru | Lucky Susanto | Zilu Tang | Ayu Purwarianti | Derry Tanti Wijaya

Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on. However, the impact of factors beyond training data size on translation performance remains a topic of debate, especially concerning languages not directly encountered during training. Our study delves into Llama2’s translation capabilities. By modeling a linear relationship between linguistic feature distances and machine translation scores, we ask ourselves if there are potentially better central languages for LLMs other than English. Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen, which rarely happens for languages it has not seen. Most translation improvements into unseen languages come from scaling up the model size rather than instruction tuning or increasing shot count. Furthermore, our correlation analysis reveals that syntactic similarity is not the only linguistic factor that strongly correlates with machine translation scores. Interestingly, we discovered that under specific circumstances, some languages (e.g. Swedish, Catalan), despite having significantly less training data, exhibit comparable correlation levels to English. These insights challenge the prevailing landscape of LLMs, suggesting that models centered around languages other than English could provide a more efficient foundation for multilingual applications.

pdf abs
A Language Model Trained on Uruguayan Spanish News Text
Juan Pablo Filevich | Gonzalo Marco | Santiago Castro | Luis Chiruzzo | Aiala Rosá

This paper presents a language model trained from scratch exclusively on a brand new corpus consisting of about 6 GiB of Uruguayan newspaper text. We trained the model for 30 days on a single Nvidia P100 using the RoBERTa-base architecture but with considerably fewer parameters than other standard RoBERTa models. We evaluated the model on two NLP tasks and found that it outperforms BETO, the widely used Spanish BERT pre-trained model. We also compared our model on the masked-word prediction task with two popular multilingual BERT-based models, Multilingual BERT and XLM-RoBERTa, obtaining outstanding results on sentences from the Uruguayan press domain. Our experiments show that training a language model on a domain-specific corpus can significantly improve performance even when the model is smaller and was trained with significantly less data than more standard pre-trained models.

pdf abs
Environmental Impact Measurement in the MentalRiskES Evaluation Campaign
Alba M. Mármol Romero | Adrián Moreno-Muñoz | Flor Miriam Plaza-del-Arco | M. Dolores Molina González | Arturo Montejo-Ráez

With the rise of Large Language Models (LLMs), the NLP community is increasingly aware of the environmental consequences of model development due to the energy consumed for training and running these models. This study investigates the energy consumption and environmental impact of systems participating in the MentalRiskES shared task, at the Iberian Language Evaluation Forum (IberLEF) in the year 2023, which focuses on early risk identification of mental disorders in Spanish comments. Participants were asked to submit, for each prediction, a set of efficiency metrics, carbon dioxide emissions among them. We conduct an empirical analysis of the data submitted considering model architecture, task complexity, and dataset characteristics, covering a spectrum from traditional Machine Learning (ML) models to advanced LLMs. Our findings contribute to understanding the ecological footprint of NLP systems and advocate for prioritizing environmental impact assessment in shared tasks to foster sustainability across diverse model types and approaches, since evaluation campaigns are an adequate framework for this kind of analysis.

pdf (full)
bib (full) Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

pdf bib abs
The Constant in HATE: Toxicity in Reddit across Topics and Languages
Wondimagegnhue Tsegaye Tufa | Ilia Markov | Piek T.J.M. Vossen

Toxic language remains an ongoing challenge on social media platforms, presenting significant issues for users and communities. This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations. We collect 1.5 million comment threads from 481 communities in six languages. By aligning languages with topics, we thoroughly analyze how toxicity spikes within different communities. Our analysis targets six languages spanning different communities and topics such as Culture, Politics, and News. We observe consistent patterns across languages where toxicity increases within the same topics while also identifying significant differences where specific language communities exhibit notable variations in relation to certain topics.

pdf bib abs
A Federated Learning Approach to Privacy Preserving Offensive Language Identification
Marcos Zampieri | Damith Premasiri | Tharindu Ranasinghe

The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models trained to detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most social media data originates from end users, we propose a privacy preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification. FL is a decentralized architecture that allows multiple models to be trained locally without the need for data sharing hence preserving users’ privacy. We propose a model fusion approach to perform FL. We trained multiple deep learning models on four publicly available English benchmark datasets (AHSD, HASOC, HateXplain, OLID) and evaluated their performance in detail. We also present initial cross-lingual experiments in English and Spanish. We show that the proposed model fusion approach outperforms baselines in all the datasets while preserving privacy.

pdf abs
CLTL@HarmPot-ID: Leveraging Transformer Models for Detecting Offline Harm Potential and Its Targets in Low-Resource Languages
Yeshan Wang | Ilia Markov

We present the winning approach to the TRAC 2024 Shared Task on Offline Harm Potential Identification (HarmPot-ID). The task focused on low-resource Indian languages and consisted of two sub-tasks: 1a) predicting the offline harm potential and 1b) detecting the most likely target(s) of the offline harm. We explored low-source domain specific, cross-lingual, and monolingual transformer models and submitted the aggregate predictions from the MuRIL and BERT models. Our approach achieved 0.74 micro-averaged F1-score for sub-task 1a and 0.96 for sub-task 1b, securing the 1st rank for both sub-tasks in the competition.

pdf abs
NJUST-KMG at TRAC-2024 Tasks 1 and 2: Offline Harm Potential Identification
Jingyuan Wang | Jack Depp | Yang Yang

This report provides a detailed description of the method that we proposed in the TRAC-2024 Offline Harm Potential Identification, which comprises two sub-tasks. The investigation utilized a rich dataset comprised of social media comments in several Indian languages, annotated with precision by expert judges to capture the nuanced implications for offline context harm. The objective assigned to the participants was to design algorithms capable of accurately assessing the likelihood of harm in given situations and identifying the most likely target(s) of offline harm. Our approach ranked second in two separate tracks, with F1 values of 0.73 and 0.96 respectively. Our method principally involved selecting pretrained models for finetuning, incorporating contrastive learning techniques, and culminating in an ensemble approach for the test set.

pdf abs
ScalarLab@TRAC2024: Exploring Machine Learning Techniques for Identifying Potential Offline Harm in Multilingual Commentaries
Anagha H C | Saatvik M. Krishna | Soumya Sangam Jha | Vartika T. Rao | Anand Kumar M

The objective of the shared task, Offline Harm Potential Identification (HarmPot-ID), is to build models to predict the offline harm potential of social media texts. “Harm potential” is defined as the ability of an online post or comment to incite offline physical harm such as murder, arson, riot, rape, etc. The first subtask was to predict the level of harm potential, and the second was to identify the group towards which this harm was directed. This paper details our submissions for the shared task, which include a cascaded SVM model, an XGBoost model, and a TF-IDF weighted Word2Vec embedding-supported SVM model. Several other models that were explored have also been detailed.
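To make the "TF-IDF weighted Word2Vec embedding" feature concrete, the following is a minimal sketch that averages word vectors with TF-IDF weights to obtain one fixed-length vector per comment, which could then be passed to a classifier such as an SVM. The helper names (`idf_table`, `weighted_doc_vector`), the smoothed IDF variant, and the toy two-dimensional vectors are assumptions for illustration only; the abstract does not describe the authors' exact pipeline, and a real system would load trained Word2Vec vectors.

```python
import math
from collections import Counter


def idf_table(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def weighted_doc_vector(tokens, vectors, idf):
    """TF-IDF weighted average of the word vectors of the known tokens."""
    dim = len(next(iter(vectors.values())))
    term_freq = Counter(t for t in tokens if t in vectors and t in idf)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in term_freq.items():
        weight = count * idf[tok]
        weight_sum += weight
        for i, value in enumerate(vectors[tok]):
            total[i] += weight * value
    if weight_sum == 0.0:
        # No known tokens: fall back to the zero vector.
        return total
    return [x / weight_sum for x in total]


# Toy corpus and toy "embeddings" (hypothetical values, not trained vectors).
docs = [["a", "b"], ["a"]]
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
idf = idf_table(docs)
doc_vec = weighted_doc_vector(docs[0], toy_vectors, idf)
```

Because the token "b" occurs in fewer documents than "a", it receives a higher IDF weight and pulls the document vector further towards its own embedding.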

pdf abs
LLM-Based Synthetic Datasets: Applications and Limitations in Toxicity Detection
Udo Kruschwitz | Maximilian Schmidhuber

Large Language Model (LLM)-based Synthetic Data is becoming an increasingly important field of research. One of its promising applications is in training classifiers to detect online toxicity, which is of increasing concern in today’s digital landscape. In this work, we assess the feasibility of generative models to generate synthetic data for toxic speech detection. Our experiments are conducted on six different toxicity datasets, four of which are hateful and two are toxic in the broader sense. We then employ a classifier trained on the original data for filtering. To explore the potential of this data, we conduct experiments using combinations of original and synthetic data, synthetic oversampling of the minority class, and a comparison of original vs. synthetic-only training. Results indicate that while our generative models offer benefits in certain scenarios, they do not improve hateful dataset classification. However, they do boost patronizing and condescending language detection. We find that synthetic data generated by LLMs is a promising avenue of research, but further research is needed to improve the quality of the generated data and develop better filtering methods. Code is available on GitHub; the generated dataset will be available on Zenodo in the final submission.

pdf abs
Using Sarcasm to Improve Cyberbullying Detection
Xiaoyu Guo | Susan Gauch

Cyberbullying has become more prevalent over time, especially towards minority groups, and online human moderators cannot detect cyberbullying content efficiently. Prior work has addressed this problem by detecting cyberbullying with deep learning approaches. In this project, we compare several BERT-based benchmark methods for cyberbullying detection and do a failure analysis to see where the model fails to correctly identify cyberbullying. We find that many falsely classified texts are sarcastic, so we propose a method to mitigate the false classifications by incorporating neural network-based sarcasm detection. We define a simple multilayer perceptron (MLP) that incorporates sarcasm detection in the final cyberbully classifications and demonstrate improvement over benchmark methods.

pdf abs
Analyzing Offensive Language and Hate Speech in Political Discourse: A Case Study of German Politicians
Maximilian Weissenbacher | Udo Kruschwitz

Social media platforms have become key players in political discourse. Twitter (now ‘X’), for example, is used by many German politicians to communicate their views and interact with others. Due to their nature, however, social networks suffer from a number of issues such as offensive content, toxic language and hate speech. This has attracted a lot of research interest but in the context of political discourse there is a noticeable gap with no such study specifically looking at German politicians in a systematic way. We aim to help address this gap. We first create an annotated dataset of 1,197 Twitter posts mentioning German politicians. This is the basis to explore a number of approaches to detect hate speech and offensive language (HOF) and identify an ensemble of transformer models that achieves an F1-Macro score of 0.94. This model is then used to automatically classify two much larger, longitudinal datasets: one with 520,000 tweets posted by MPs, and the other with 2,200,000 tweets which comprise posts from the public mentioning politicians. We obtain interesting insights with regard to the distribution of hate and offensive content when looking at different independent variables.

This study introduces “Ice and Fire,” a Multi-Task Learning (MTL) dataset tailored for sentiment analysis in the Icelandic language, encompassing a wide range of linguistic tasks, including sentiment and emotion detection, as well as identification of toxicity, hate speech, encouragement, sympathy, sarcasm/irony, and trolling. With 261 fully annotated blog comments and 1045 comments annotated in at least one task, this contribution marks a significant step forward in the field of Icelandic natural language processing. It provides a comprehensive dataset for understanding the nuances of online communication in Icelandic and an interface to expand the annotation effort. Despite the challenges inherent in subjective interpretation of text, our findings highlight the positive potential of this dataset to improve text analysis techniques and encourage more inclusive online discourse in Icelandic communities. With promising baseline performances, “Ice and Fire” sets the stage for future research to enhance automated text analysis and develop sophisticated language technologies, contributing to healthier online environments and advancing Icelandic language resources.

pdf abs
Detecting Hate Speech in Amharic Using Multimodal Analysis of Social Media Memes
Melese Ayichlie Jigar | Abinew Ali Ayele | Seid Muhie Yimam | Chris Biemann

In contemporary society, the proliferation of hate speech is increasingly prevalent across various social media platforms, with a notable trend of incorporating memes to amplify its visual impact and reach. The conventional text-based detection approaches frequently fail to address the complexities introduced by memes, thereby aggravating the challenges, particularly in low-resource languages such as Amharic. We develop Amharic meme hate speech detection models using 2,000 memes collected from Facebook, Twitter, and Telegram over four months. We employ native Amharic speakers to annotate each meme using a web-based tool, yielding a Fleiss’ kappa score of 0.50. We utilize different feature extraction techniques, namely VGG16 for images and word2Vec for textual content, and build unimodal and multimodal models such as LSTM, BiLSTM, and CNN. The BiLSTM model shows the best performance, achieving 63% accuracy for text and 75% for multimodal features. In image-only experiments, the CNN model achieves 69% in accuracy. Multimodal models demonstrate superior performance in detecting Amharic hate speech in memes, showcasing their potential to address the unique challenges posed by meme-based hate speech on social media.

pdf abs
Content Moderation in Online Platforms: A Study of Annotation Methods for Inappropriate Language
Baran Barbarestani | Isa Maks | Piek T.J.M. Vossen

Detecting inappropriate language in online platforms is vital for maintaining a safe and respectful digital environment, especially in the context of hate speech prevention. However, defining what constitutes inappropriate language can be highly subjective and context-dependent, varying from person to person. This study presents the outcomes of a comprehensive examination of the subjectivity involved in assessing inappropriateness within conversational contexts. Different annotation methods, including expert annotation, crowd annotation, ChatGPT-generated annotation, and lexicon-based annotation, were applied to English Reddit conversations. The analysis revealed a high level of agreement across these annotation methods, with most disagreements arising from subjective interpretations of inappropriate language. This emphasizes the importance of implementing content moderation systems that not only recognize inappropriate content but also understand and adapt to diverse user perspectives and contexts. The study contributes to the evolving field of hate speech annotation by providing a detailed analysis of annotation differences in relation to the subjective task of judging inappropriate words in conversations.

pdf abs
FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun | Vassilina Nikoulina

Large language models (LLMs) are increasingly popular but are also prone to generating biased, toxic, or harmful language, which can have detrimental effects on individuals and communities. Although considerable effort has gone into assessing and mitigating toxicity in generated content, it has concentrated primarily on English, and it is essential to consider other languages as well. To address this issue, we create and release FrenchToxicityPrompts, a dataset of 50K naturally occurring French prompts and their continuations, annotated with toxicity scores from a widely used toxicity classifier. We evaluate 14 different models from four prevalent open-source families of LLMs against our dataset to assess their potential toxicity across various dimensions. We hope that our contribution will foster future research on toxicity detection and mitigation beyond English.

pdf abs
Studying Reactions to Stereotypes in Teenagers: an Annotated Italian Dataset
Elisa Chierchiello | Tom Bourgeade | Giacomo Ricci | Cristina Bosco | Francesca D’Errico

The paper introduces a novel corpus collected in a set of experiments in Italian schools and annotated for the presence of stereotypes and related categories. It consists of comments written by teenage students in reaction to fabricated fake news designed to elicit prejudiced responses by featuring racial stereotypes. We make use of an annotation scheme which takes into account the implicit or explicit nature of different instances of stereotypes, alongside their forms of discredit. We also annotate the stance of the commenter towards the news article, using a schema inspired by rumor and fake news stance detection tasks. Through this rarely studied setting, we provide a preliminary exploration of the production of stereotypes in a more controlled context. Alongside this novel dataset, we provide both quantitative and qualitative analyses of these reactions to validate the categories used in their annotation. Through this work, we hope to increase the diversity of available data in the study of the propagation and the dynamics of negative stereotypes.

pdf abs
Offensiveness, Hate, Emotion and GPT: Benchmarking GPT3.5 and GPT4 as Classifiers on Twitter-specific Datasets
Nikolaj Bauer | Moritz Preisig | Martin Volk

In this paper, we extend the work of benchmarking GPT by turning GPT models into classifiers and applying them to three different Twitter datasets for Hate-Speech Detection, Offensive Language Detection, and Emotion Classification. We use a Zero-Shot and Few-Shot approach to evaluate the classification capabilities of the GPT models. Our results show that GPT models do not always beat fine-tuned models on the tested benchmarks. However, in Hate-Speech and Emotion Detection, using a Few-Shot approach, state-of-the-art performance can be achieved. The results also reveal that GPT-4 is more sensitive to the examples given in a Few-Shot prompt, highlighting the importance of choosing fitting examples for inference and prompt formulation.

Public figures receive disproportionate levels of abuse on social media, impacting their active participation in public life. Automated systems can identify abuse at scale but labelling training data is expensive and potentially harmful. So, it is desirable that systems are efficient and generalisable, handling shared and specific aspects of abuse. We explore the dynamics of cross-group text classification in order to understand how well models trained on one domain or demographic can transfer to others, with a view to building more generalisable abuse classifiers. We fine-tune language models to classify tweets targeted at public figures using our novel DoDo dataset, containing 28,000 entries with fine-grained labels, split equally across four Domain-Demographic pairs (male and female footballers and politicians). We find that (i) small amounts of diverse data are hugely beneficial to generalisation and adaptation; (ii) models transfer more easily across demographics but cross-domain models are more generalisable; (iii) some groups contribute more to generalisability than others; and (iv) dataset similarity is a signal of transferability.

Social media have become an integral part of our daily lives, yet they have also resulted in various negative effects on users, ranging from offensive or hateful content to the spread of misinformation. In recent years, numerous automated approaches have been proposed to identify and combat such harmful content. However, it is crucial to recognize the human aspect of users who engage with this content in designing efforts to mitigate these threats. We propose to incorporate principles of behavioral science, specifically the concept of nudging, into social media platforms. Our approach involves augmenting social media feeds with informative diagrams, which provide insights into the content that users are presented with. The goal of our work is to empower social media users to make well-informed decisions for themselves and for others within these platforms. Nudges serve as a means to gently draw users’ attention to content in an unintrusive manner, a crucial consideration in the context of social media. To evaluate the effectiveness of our approach, we conducted a user study involving 120 Italian-speaking participants who interacted with a social media interface augmented with these nudging diagrams. Participants who had used the augmented interface outperformed those using the plain interface in a subsequent harmful content detection test in which the nudging diagrams were no longer visible. Our findings demonstrate that our approach significantly improves users’ awareness of potentially harmful content, with effects lasting beyond the duration of the interaction. In this work, we provide a comprehensive overview of our experimental materials and setup, present our findings, and refer to the limitations identified during our study.

pdf abs
Exploring Boundaries and Intensities in Offensive and Hate Speech: Unveiling the Complex Spectrum of Social Media Discourse
Abinew Ali Ayele | Esubalew Alemneh Jalew | Adem Chanie Ali | Seid Muhie Yimam | Chris Biemann

The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating offensiveness and hatefulness intensities. Our study highlights that a considerable majority of tweets belong to the lower offensiveness and hate intensity levels, underscoring the need for early interventions by stakeholders. The prevalence of ethnic and political hatred targets, with significant overlaps in our dataset, emphasizes the complex relationships within Ethiopia’s sociopolitical landscape. We build classification and regression models and investigate the efficacy of models in handling these tasks. Our results reveal that hate and offensive speech cannot be addressed by a simplistic binary classification, instead manifesting as variables across a continuous range of values. The Afro-XLMR-large model exhibits the best performance, achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model indicates strong alignment.

pdf (full)
bib (full) Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024

pdf bib
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024
Mariana Romanyshyn | Nataliia Romanyshyn | Andrii Hlybovets | Oleksii Ignatenko

We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbrücken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows users to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.

pdf bib abs
Introducing the Djinni Recruitment Dataset: A Corpus of Anonymized CVs and Job Postings
Nazarii Drushchak | Mariana Romanyshyn

This paper introduces the Djinni Recruitment Dataset, a large-scale open-source corpus of candidate profiles and job descriptions. With over 150,000 jobs and 230,000 candidates, the dataset includes samples in English and Ukrainian, thereby facilitating advancements in the recruitment domain of natural language processing (NLP) for both languages. It is one of the first open-source corpora in the recruitment domain, opening up new opportunities for AI-driven recruitment technologies and related fields. Notably, the dataset is accessible under the MIT license, encouraging widespread adoption for both scientific research and commercial projects.

pdf abs
Creating Parallel Corpora for Ukrainian: A German-Ukrainian Parallel Corpus (ParaRook||DE-UK)
Maria Shvedova | Arsenii Lukashevskyi

Parallel corpora are currently a popular and vibrantly developing category of linguistic resources, used in literature and translation studies as well as in the field of NLP. For Ukrainian, though, there are still not enough significant parallel corpora compiled within a single umbrella project and made available to the research community. In this paper we present a newly developed resource, the German-Ukrainian Parallel Corpus (ParaRook||DE-UK), which is searchable online. We describe various issues related to its compilation, text selection, and annotation. The paper also features several examples of how the corpus can be used in linguistic research and translation studies. The experience gained with the German-Ukrainian parallel corpus can be used to develop parallel corpora pairing Ukrainian with other languages.

pdf abs
Introducing NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian
Dmytro Chaplynskyi | Mariana Romanyshyn

This paper presents NER-UK 2.0, a corpus of texts in the Ukrainian language manually annotated for the named entity recognition task. The corpus contains 560 texts of multiple genres, boasting 21,993 entities in total. The annotation scheme covers 13 entity types, namely location, person name, organization, artifact, document, job title, date, time, period, money, percentage, quantity, and miscellaneous. Such a rich set of entities makes the corpus valuable for training named-entity recognition models in various domains, including news, social media posts, legal documents, and procurement contracts. The paper presents an updated baseline solution for named entity recognition in Ukrainian with 0.89 F1. The corpus is the largest of its kind for the Ukrainian language and is available for download.

pdf abs
Instant Messaging Platforms News Multi-Task Classification for Stance, Sentiment, and Discrimination Detection
Taras Ustyianovych | Denilson Barbosa

In the digital age, geopolitical events frequently catalyze discussions among global web users. Platforms such as social networks and messaging applications serve as vital means for spreading and acquiring information. The Russian aggression against Ukraine has notably intensified online discourse on the matter, drawing a significant audience eager for real-time updates. This surge in online activity inevitably results in the proliferation of content, some of which may be unreliable or manipulative. Given this context, identifying content that distorts information is imperative to mitigate bias and promote fairness. However, this task presents considerable challenges, primarily due to the lack of sophisticated language models capable of understanding the nuances and context of texts in low-resource languages, and the scarcity of well-annotated datasets for training such models. To address these gaps, we introduce the TRWU dataset, a meticulously annotated collection of Telegram news about the Russian war in Ukraine gathered starting from January 1, 2022. This paper outlines our methodology for semantic analysis and classification of these messages, aiming to ascertain their bias. Such an approach enhances our ability to detect manipulative and destructive content. Through descriptive statistical analysis, we explore deviations in message sentiment, stance, and metadata across different types of channels and levels of content creation activity. Our findings indicate a predominance of negative sentiment within the dataset. Additionally, our research elucidates distinct differences in the linguistic choices and phraseology among channels, based on their stance towards the war. This study contributes to the broader effort of understanding the spread and mitigating the impact of biased and manipulative content in digital communications.

pdf abs
Setting up the Data Printer with Improved English to Ukrainian Machine Translation
Yurii Paniv | Dmytro Chaplynskyi | Nikita Trynus | Volodymyr Kyrylov

To build large language models for Ukrainian, we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so a high-quality translation system will enable our community to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning of a large pretrained language model with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences, followed by a second phase of training using 17K examples selected by k-fold perplexity filtering on another dataset of higher quality. Our decoder-only model, named Dragoman, outperforms previous state-of-the-art encoder-decoder models on the FLORES devtest set.

pdf abs
Automated Extraction of Hypo-Hypernym Relations for the Ukrainian WordNet
Nataliia Romanyshyn | Dmytro Chaplynskyi | Mariana Romanyshyn

WordNet is a crucial resource in linguistics and natural language processing, providing a detailed and expansive set of lexico-semantic relationships among words in a language. The trend toward automated construction and expansion of WordNets has become increasingly popular due to the high costs of manual development. This study aims to automate the development of the Ukrainian WordNet, explicitly concentrating on hypo-hypernym relations that are crucial building blocks of the hierarchical structure of WordNet. Utilizing the linking between Princeton WordNet, Wikidata, and multilingual resources from Wikipedia, the proposed approach successfully mapped 17% of Princeton WordNet (PWN) content to Ukrainian Wikipedia. Furthermore, the study introduces three innovative strategies for generating new entries to fill in the gaps of the Ukrainian WordNet: machine translation, the Hypernym Discovery model, and the Hypernym Instruction-Following LLaMA model. The latter model shows a high level of effectiveness, evidenced by a 41.61% performance on the Mean Overlap Coefficient (MOC) metric. With the proposed approach that combines automated techniques with expert human input, we provide a reliable basis for creating the Ukrainian WordNet.
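
The score reported above is a Mean Overlap Coefficient. As a rough illustration only, the sketch below assumes the metric averages the standard Szymkiewicz-Simpson overlap coefficient, |A ∩ B| / min(|A|, |B|), between predicted and gold hypernym sets; the paper's exact definition may differ, and the function names and sample words are hypothetical.

```python
def overlap_coefficient(predicted, gold):
    """Szymkiewicz-Simpson overlap between two collections of hypernyms."""
    a, b = set(predicted), set(gold)
    if not a or not b:
        return 0.0  # convention chosen for this sketch when one side is empty
    return len(a & b) / min(len(a), len(b))


def mean_overlap_coefficient(pairs):
    """Average the overlap coefficient over (predicted, gold) pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(overlap_coefficient(p, g) for p, g in pairs) / len(pairs)


# Hypothetical predictions and gold hypernyms for two query words
sample = [
    (["тварина", "ссавець"], ["тварина"]),  # full overlap with the smaller set
    (["рослина"], ["дерево"]),              # no overlap
]
moc = mean_overlap_coefficient(sample)
```

Because the denominator is the size of the smaller set, a prediction list that contains every gold hypernym scores 1 for that word even if it also contains extra candidates.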

This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The main goal of the Visual-WSD task is to identify, with minimal contextual information, the most appropriate representation of a given ambiguous word from a set of ten images. To construct this benchmark, we followed a methodology similar to that proposed by (CITATION), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language model performance comparisons. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal large language models using this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (CITATION) used by (CITATION) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.

pdf abs
The UNLP 2024 Shared Task on Fine-Tuning Large Language Models for Ukrainian
Mariana Romanyshyn | Oleksiy Syvokon | Roman Kyslyi

This paper presents the results of the UNLP 2024 shared task, the first Shared Task on Fine-Tuning Large Language Models for the Ukrainian language. The goal of the task was to facilitate the creation of models that have knowledge of the Ukrainian language, history, and culture, as well as common knowledge, and are capable of generating fluent and accurate responses in Ukrainian. The participants were required to use models with open weights and reasonable size to ensure the reproducibility of the solutions. The participating systems were evaluated using multiple-choice exam questions and manually crafted open questions. Three teams submitted their solutions before the deadline, and two teams submitted papers that were accepted to appear in the UNLP workshop proceedings and are referred to in this report. The Codabench leaderboard is left open for further submissions.

pdf abs
Fine-Tuning and Retrieval Augmented Generation for Question Answering Using Affordable Large Language Models
Tiberiu Boros | Radu Chivereanu | Stefan Dumitrescu | Octavian Purcaru

We present Sherlock, our system for the UNLP 2024 Shared Task on Question Answering, which won first place. We employ a mix of methods, from using automatically translated datasets to perform supervised fine-tuning and direct preference optimization on instruction-tuned models, to model weight merging and retrieval augmented generation. We present and motivate our chosen sequence of steps, as well as an ablation study to understand the effect of each additional step. The resulting model and code are made publicly available (download links provided in the paper).

In the rapidly advancing field of AI and NLP, generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation. However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology. Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs with Ukrainian datasets, aiming to improve their linguistic proficiency and benchmarking them against other existing models capable of processing Ukrainian language. This endeavor not only aims to mitigate language bias in technology but also promotes inclusivity in the digital realm. Our transparent and reproducible approach encourages further NLP research and development. Additionally, we present the Ukrainian Knowledge and Instruction Dataset (UKID) to aid future efforts in language model fine-tuning. Our research not only advances the field of NLP but also highlights the importance of linguistic diversity in AI, which is crucial for cultural preservation, education, and expanding AI’s global utility. Ultimately, we advocate for a future where technology is inclusive, enabling AI to communicate effectively across all languages, especially those currently underrepresented.

pdf abs
Spivavtor: An Instruction Tuned Ukrainian Text Editing Model
Aman Saini | Artem Chernodub | Vipul Raheja | Vivek Kulkarni

We introduce Spivavtor, a dataset and instruction-tuned models for text editing focused on the Ukrainian language. Spivavtor is the Ukrainian-focused adaptation of the English-only CoEdIT (Raheja et al., 2023) model. Similar to CoEdIT, Spivavtor performs text editing tasks by following instructions in Ukrainian like “Виправте граматику в цьому реченні” and “Спростіть це речення”, which translate to “Correct the grammar in this sentence” and “Simplify this sentence” in English, respectively. This paper describes the details of the Spivavtor-Instruct dataset and Spivavtor models. We evaluate Spivavtor on a variety of text editing tasks in Ukrainian, such as Grammatical Error Correction (GEC), Text Simplification, Coherence, and Paraphrasing, and demonstrate its superior performance on all of them. We publicly release our best performing models and data as resources to the community to advance further research in this space.

pdf abs
Eval-UA-tion 1.0: Benchmark for Evaluating Ukrainian (Large) Language Models
Serhii Hamotskyi | Anna-Izabella Levbarg | Christian Hänig

In this paper, we introduce Eval-UA-tion, a set of novel Ukrainian-language datasets aimed at evaluating the performance of language models on the Ukrainian language. The tasks include UA-CBT (inspired by the Children’s Book Test, a fill-in-the-gaps type task aimed at gauging the extent to which a story narrative is understood), UP-Titles (where the online newspaper Ukrainska Pravda’s articles have to be matched to the correct title among 10 similar ones), and LMentry-static-UA/LMES (inspired by the LMentry benchmark, a set of tasks simple to solve for humans but hard for LMs, such as ‘which of these words is longer’ and ‘what is the fifth word of this sentence’). With the exception of UP-Titles, the tasks are built in a way to minimize contamination and use material unlikely to be present in the training sets of language models, and include a split for few-shot model prompting use that minimizes contamination. For each task, human and random baselines are provided.
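
To make the LMentry-style tasks concrete, the sketch below shows how gold answers for the two example questions quoted above could be computed programmatically. It is an illustrative assumption, not the benchmark's actual generation code: the function names, the whitespace tokenization, and the handling of ties are all choices made for this example.

```python
def longer_word(first, second):
    """Return the longer of two words, or None when they have equal length."""
    if len(first) == len(second):
        return None  # a tie has no single gold answer in this sketch
    return first if len(first) > len(second) else second


def nth_word(sentence, n):
    """Return the n-th whitespace-separated word (1-based), or None if absent."""
    words = sentence.split()
    if n < 1 or n > len(words):
        return None
    return words[n - 1]


# Hypothetical items in the spirit of the two example questions
answer_longer = longer_word("кіт", "собака")
answer_fifth = nth_word("Це просте речення має шість слів", 5)
```

Because the answers follow mechanically from the input, items of this kind can be generated fresh rather than scraped, which is one way to keep them out of model training data.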

pdf abs
LiBERTa: Advancing Ukrainian Language Modeling through Pre-training from Scratch
Mykola Haltiuk | Aleksander Smywiński-Pohl

Recent advancements in Natural Language Processing (NLP) have spurred remarkable progress in language modeling, predominantly benefiting English. While Ukrainian NLP has long grappled with significant challenges due to limited data and computational resources, recent years have seen a shift with the emergence of new corpora, marking a pivotal moment in addressing these obstacles. This paper introduces LiBERTa Large, the inaugural BERT Large model pre-trained entirely from scratch only on Ukrainian texts. Leveraging extensive multilingual text corpora, including a substantial Ukrainian subset, LiBERTa Large establishes a foundational resource for Ukrainian NLU tasks. Our model outperforms existing multilingual and monolingual models pre-trained from scratch for Ukrainian, demonstrating competitive performance against those relying on cross-lingual transfer from English. This achievement underscores our ability to achieve superior performance through pre-training from scratch with additional enhancements, obviating the need to rely on decisions made for English models to efficiently transfer weights. We establish LiBERTa Large as a robust baseline, paving the way for future advancements in Ukrainian language modeling.

pdf abs
Entity Embellishment Mitigation in LLMs Output with Noisy Synthetic Dataset for Alignment
Svitlana Galeshchuk

The present work focuses on entity embellishment, in which named entities are accompanied by additional information that is not supported by the context or the source material. Our paper contributes to mitigating this problem in texts generated by large language models, summaries in particular, by proposing an approach that injects synthetic noise into generated samples, which are then used to align a finetuned LLM. We also address the scarcity of solutions for low-resource languages and test our approach with corpora in Ukrainian.

pdf abs
Language-Specific Pruning for Efficient Reduction of Large Language Models
Maksym Shamrai

Delving into pruning techniques is essential to boost the efficiency of Large Language Models (LLMs) by reducing their size and computational demands, resulting in faster and more cost-effective inference. In this work, our key contribution lies in recognizing that LLMs trained on diverse languages manifest distinct language-specific weight distributions. Exploiting this insight, we illustrate that pruning LLMs using language-specific data results in a more potent model compression. Empirical evidence underscores the critical nature of pruning on language-specific data, highlighting a noteworthy impact on the perplexity of Ukrainian texts compared to pruning on English data. The proposed methodology significantly reduces the size of LLaMA, LLaMA 2 and Mistral models while preserving competitive performance. This research underscores the significance of linguistic considerations in LLM pruning and advocates for language-specific optimization, establishing a framework for more efficient and tailored language models across diverse linguistic contexts. Additionally, all experiments were conducted using a single consumer-grade NVIDIA RTX 3090 GPU, and the code is available at https://github.com/mshamrai/language-specific-pruning.

pdf (full)
bib (full) Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

pdf bib
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation
Girish Nath Jha | Sobha L. | Kalika Bali | Atul Kr. Ojha

pdf bib abs
Towards Disfluency Annotated Corpora for Indian Languages
Chayan Kochar | Vandan Vasantlal Mujadia | Pruthwik Mishra | Dipti Misra Sharma

In the natural course of spoken language, individuals often engage in thinking and self-correction during speech production. These instances of interruption or correction are commonly referred to as disfluencies. When preparing data for subsequent downstream NLP tasks, these linguistic elements can be systematically removed, or handled as required, to enhance data quality. In this study, we present comprehensive research on disfluencies in Indian languages. Our approach involves not only annotating real-world conversation transcripts but also conducting a detailed analysis of the linguistic nuances inherent to Indian languages that must be considered during annotation. Additionally, we introduce a robust algorithm for the synthetic generation of disfluent data. This algorithm aims to facilitate more effective model training for the identification of disfluencies in real-world conversations, thereby contributing to the advancement of disfluency research in Indian languages.

pdf bib abs
EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi for Emotion Detection
Nishat Raihan | Dhiman Goswami | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri

Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.

This paper describes the structure and findings of the WILDRE 2024 shared task on Code-mixed Less-resourced Sentiment Analysis for Indo-Aryan Languages. The participants were asked to submit their final predictions on the test data on CodaLab. A total of fourteen teams registered for the shared task. Only four participants submitted a system for evaluation on CodaLab, and only two teams submitted a system description paper. All systems show rather promising performance and outperform the baseline scores.

Lack of diverse perspectives causes neutrality bias in Wikipedia content, leading to millions of readers worldwide being exposed to potentially inaccurate information. Hence, neutrality bias detection and mitigation is a critical problem. Although previous studies have proposed effective solutions for English, no work exists for Indian languages. First, we contribute two large datasets, mWIKIBIAS and mWNC, covering 8 languages, for the bias detection and mitigation tasks respectively. Next, we investigate the effectiveness of popular multilingual Transformer-based models for the two tasks by modeling detection as a binary classification problem and mitigation as a style transfer problem. We make the code and data publicly available.

pdf abs
Dharmaśāstra Informatics: Concept Mining System for Socio-Cultural Facet in Ancient India
Arooshi Nigam | Subhash Chandra

The heritage of Dharmaśāstra (DS) represents an extensive cultural legacy, spanning diverse fields such as family law, social ethics, culture and economics. In this paper, a new term, “Dharmaśāstric Informatics,” is proposed for an approach that leverages computational methods for concept mining to unravel the socio-cultural complexities of ancient India as reflected in the DS. Despite the profound significance of DS texts, their digitization and online information retrieval encounter notable challenges. Therefore, the primary aim of this paper is to combine digital accessibility and information mining techniques to enhance access to DS knowledge traditions. Using heritage computing methodologies, the work endeavours to develop a robust system for digitizing DS texts comprehensively, facilitating instant referencing and efficient retrieval, and catering to the needs of researchers and scholars across disciplines worldwide. By leveraging advanced digital technologies and the burgeoning IT landscape, it seeks to create a seamless and user-friendly platform for accessing and exploring DS texts. This experiment not only promotes scholarly engagement but also serves as an invaluable resource for individuals interested in delving into the intricate realms of archaic Indian knowledge traditions. Ultimately, our efforts aim to amplify the visibility and accessibility of DS knowledge, fostering a deeper understanding and appreciation of this profound cultural heritage.

pdf abs
Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo
Abhinaba Bala | Ashok Urlana | Rahul Mishra | Parameswari Krishnamurthy

Obtaining sufficient information in one’s mother tongue is crucial for satisfying the information needs of the users. While high-resource languages have abundant online resources, the situation is less than ideal for very low-resource languages. Moreover, the insufficient reporting of vital national and international events continues to be a worry, especially in languages with scarce resources, like Mizo. In this paper, we conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles, which leverages English-language news to supplement and enhance the information related to the corresponding news events. Furthermore, we make available 500 Mizo news articles and corresponding enriched holistic summaries. Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles.

pdf abs
Finding the Causality of an Event in News Articles
Sobha Lalitha Devi | Pattabhi RK Rao

This paper discusses finding the causality of an event in newspaper articles. The analysis of causality, otherwise known as cause and effect, is crucial for building efficient Natural Language Understanding (NLU) supported AI systems such as event tracking, and it is considered a complex semantic relation under discourse theory. A cause-effect relation consists of a linguistic marker and its two arguments. The arguments are semantic arguments, where the cause is the first argument (Arg1) and the effect is the second argument (Arg2). In this work, we consider the causal relations in Tamil newspaper articles. The analysis of causal constructions, the causal markers and their syntactic relations led to the identification of different features for developing the language model using Restricted Boltzmann Machines (RBMs). The experiments we performed gave encouraging results. The cause-effect system developed is used in a mobile app for event profiling called “Nigalazhvi”, where the cause and effect of an event are identified and presented to the user.

pdf abs
Creating Corpus of Low Resource Indian Languages for Natural Language Processing: Challenges and Opportunities
Pratibha Dongare

Addressing tasks in Natural Language Processing requires access to sufficient and high-quality data. However, working with languages that have limited resources poses a significant challenge due to the absence of established methodologies, frameworks, and collaborative efforts. This paper briefly outlines the challenges associated with standardization in data creation, focusing on Indian languages, which are often categorized as low-resource languages. Additionally, potential solutions and the importance of standardized procedures for low-resource language data are proposed. Furthermore, the critical role of standardized protocols in corpus creation and their impact on research is highlighted. Lastly, this paper concludes by defining what constitutes a corpus.

pdf abs
FZZG at WILDRE-7: Fine-tuning Pre-trained Models for Code-mixed, Less-resourced Sentiment Analysis
Gaurish Thakkar | Marko Tadić | Nives Mikelic Preradovic

This paper describes our system for a shared task on code-mixed, less-resourced sentiment analysis for Indo-Aryan languages. We use large language models (LLMs) since they have demonstrated excellent performance on classification tasks. In all tracks, we use the unsloth/mistral-7b-bnb-4bit LLM for the task of code-mixed sentiment analysis. For track 1, we used a simple fine-tuning strategy on pre-trained language models (PLMs) by combining data from multiple phases. Our trained systems secured first place in four phases out of five. In addition, we present the results achieved using several PLMs for each language.

pdf abs
MLInitiative@WILDRE7: Hybrid Approaches with Large Language Models for Enhanced Sentiment Analysis in Code-Switched and Code-Mixed Texts
Hariram Veeramani | Surendrabikram Thapa | Usman Naseem

Code-switched and code-mixed languages are prevalent in multilingual societies, reflecting the complex interplay of cultures and languages in daily communication. Understanding the sentiment embedded in such texts is crucial for a range of applications, from improving social media analytics to enhancing customer feedback systems. Despite their significance, research in code-mixed and code-switched languages remains limited, particularly in less-resourced languages. This scarcity of research creates a gap in natural language processing (NLP) technologies, hindering their ability to accurately interpret the rich linguistic diversity of global communications. To bridge this gap, this paper presents a novel methodology for sentiment analysis in code-mixed and code-switched texts. Our approach combines the power of large language models (LLMs) and the versatility of the multilingual BERT (mBERT) framework to effectively process and analyze sentiments in multilingual data. By decomposing code-mixed texts into their constituent languages, employing mBERT for named entity recognition (NER) and sentiment label prediction, and integrating these insights into a decision-making LLM, we provide a comprehensive framework for understanding sentiment in complex linguistic contexts. Our system achieves a competitive rank on all subtasks in the Code-mixed Less-Resourced Sentiment analysis (Code-mixed) shared task at WILDRE-7 (LREC-COLING).

Tamil is a relatively low-resource language in the field of Natural Language Processing (NLP). Recent years have seen a growth in Tamil NLP datasets in Natural Language Understanding (NLU) or Natural Language Generation (NLG) tasks, but high-quality linguistic resources remain scarce. In order to alleviate this gap in resources, this paper introduces Aalamaram, a treebank with rich linguistic annotations for the Tamil language. It is hitherto the largest publicly available Tamil treebank with almost 10,000 sentences from diverse sources and is annotated for the tasks of Part-of-speech (POS) tagging, Named Entity Recognition (NER), Morphological Parsing and Dependency Parsing. Close attention has also been paid to multi-word segmentation, especially in the context of Tamil clitics. Although the treebank is based largely on the Universal Dependencies (UD) specifications, significant effort has been made to adjust the annotation rules according to the idiosyncrasies and complexities of the Tamil language, thereby providing a valuable resource for linguistic research and NLP developments.