pdf
bib
Proceedings of the 5th Conference on Language, Data and Knowledge
Mehwish Alam
|
Andon Tchechmedjiev
|
Jorge Gracia
|
Dagmar Gromann
|
Maria Pia di Buono
|
Johanna Monti
|
Maxim Ionov
pdf
bib
abs
DiaSafety-CC: Annotating Dialogues with Safety Labels and Reasons for Cross-Cultural Analysis
Tunde Oluwaseyi Ajayi
|
Mihael Arcan
|
Paul Buitelaar
A dialogue dataset developed in a language can have diverse safety annotations when presented to raters from different cultures. What is considered acceptable in one culture can be perceived as offensive in another culture. Cultural differences in dialogue safety annotation are yet to be fully explored. In this work, we use the geopolitical entity, Country, as our base for cultural study. We extend DiaSafety, an existing English dialogue safety dataset that was originally annotated by raters from Western culture, to create a new dataset, DiaSafety-CC. In our work, three raters each from Nigeria and India reannotate the DiaSafety dataset and provide reasons for their choice of labels. We perform pairwise comparisons of the annotations across the cultures studied. Furthermore, we compare the representative labels of each rater group to those of an existing large language model (LLM). Due to the subjectivity of the dialogue annotation task, 32.6% of the considered dialogues achieve unanimous annotation consensus across the labels of DiaSafety and the six raters. In our analyses, we observe that the Unauthorized Expertise and Biased Opinion categories have dialogues with the highest label disagreement ratio across the cultures studied. On manual inspection of the reasons provided for the choice of labels, we observe that raters across the cultures in DiaSafety-CC are more sensitive to dialogues directed at target groups than to dialogues directed at individuals. We also observe that GPT-4o annotation shows a more positive agreement with DiaSafety labels in terms of F1 score and phi coefficient.
pdf
bib
abs
The Leibniz List as Linguistic Linked Data in the LiLa Knowledge Base
Lisa Sophie Albertelli
|
Giulia Calvi
|
Francesco Mambrini
This paper presents the integration of the Leibniz List, a concept list from the Concepticon project, into the LiLa Knowledge Base of Latin interoperable resources. The modeling experiment was conducted using W3C standards like Ontolex and SKOS. This work, which originated in a project for a university course, is limited to a short list of words, but it already enables interoperability between the Concepticon and the language resources in a LOD architecture like LiLa. The integration enriches the LiLa ecosystem, allowing users to explore the Latin lexicon from an onomasiological perspective and linking concepts to lexical entries from various dictionaries and corpus attestations. The work showcases how standard Semantic Web technologies can effectively model and connect historical concept lists within larger linguistic knowledge infrastructures and provides an example for further experiments with the Concepticon’s data.
pdf
bib
abs
Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis
Shubhanker Banerjee
|
Bharathi Raja Chakravarthi
|
John Philip McCrae
This paper introduces the HTEC Hindi Term Extraction Dataset 2.0, a resource designed to support terminology extraction and classification tasks within the education domain. HTEC 2.0 has been developed with the objective of providing a high-quality benchmark dataset for the evaluation of term recognition and classification methodologies in Hindi educational discourse. The dataset consists of 97 documents sourced from Hindi Wikipedia, covering a diverse range of topics relevant to the education sector. Within these documents, 1,702 terms have been manually annotated, where each term is defined as a single-word or multi-word expression that conveys a domain-specific meaning. The annotated terms in HTEC 2.0 are systematically categorized into seven distinct classes. Furthermore, this paper outlines the development of annotation guidelines, detailing the criteria used to determine term boundaries and category assignments. By offering a structured dataset with clearly defined term classifications, HTEC 2.0 serves as a valuable resource for researchers working on terminology extraction, domain-specific named entity recognition, and text classification in Hindi.
pdf
bib
abs
CoWoYTP1Att: A Social Media Comment Dataset on Gender Discourse with Appraisal Theory Annotations
Valentina Tretti Beckles
|
Adrian Vergara Heidke
|
Natalia Molina-Valverde
This paper presents the Corpus on Women in YouTube on Performance with Attitude Annotations (CoWoYTP1Att), developed based on Appraisal Theory (Martin & White, 2005). Between September 2020 and May 2021, 14,883 comments were extracted from a YouTube video featuring a compilation of the performance “Un violador en tu camino” (A Rapist in Your Path) by the feminist collective LasTesis, published on the channel of the Costa Rican newspaper La Nación. The extracted comments were manually and automatically classified based on several criteria to determine their relevance to the video. As a result, 5,939 comments were identified as related to the video. These comments were annotated with the three attitude subdomains (affect, judgement, and appreciation) proposed in Appraisal Theory (Martin & White, 2005), as well as their polarity, target, fragment, and whether the attitude was implicit or explicit. The statistical analysis of the corpus highlights the predominantly negative evaluation of individuals present in the comments on this social media platform.
pdf
bib
abs
Detecting Changing Culinary Trends Through Historical Recipes
Gauri Bhagwat
|
Marieke van Erp
|
Teresa Paccosi
|
Rik Hoekstra
Culinary trends evolve in response to social, economic, and cultural influences, reflecting broader historical transformations. We present an exploration into Dutch culinary trends from 1910 to 1995 by analysing recipes from housekeeping school cookbooks and newspaper recipe collections. Using computational techniques, we extract and examine ingredient frequency, recipe complexity, and shifts in recipe categories to identify trends in Dutch cuisine from a quantitative point of view. Additionally, we experimented with Large Language Models (LLMs) to structure and extract recipes’ features, demonstrating their potential for historical recipe parsing.
pdf
bib
abs
Towards Multilingual Haikus: Representing Accentuation to Build Poems
Fernando Bobillo
|
Maxim Ionov
|
Eduardo Mena
|
Carlos Bobed
The paradigm of neuro-symbolic Artificial Intelligence has been receiving increasing attention in recent years as a way to improve the results of intelligent systems by combining symbolic and subsymbolic methods. For example, existing Large Language Models (LLMs) could be enriched by taking into account background knowledge encoded using semantic technologies, such as Linguistic Linked Data (LLD). In this paper, we claim that LLD can aid Large Language Models by providing the necessary information to compute the number of poetic syllables, which would help LLMs to correctly generate poems with a valid metre. To do so, we propose an encoding for syllabic structure based on an extension of RDF vocabularies widely used in the field: POSTDATA and OntoLex-Lemon.
pdf
bib
abs
Assigning FrameNet Frames to a Croatian Verb Lexicon
Ivana Brač
|
Ana Ostroški Anić
This paper presents the Croatian verb lexicon Verbion that describes verbs on multiple levels. The semantic level includes verb senses, corresponding semantic classes according to VerbNet and WordNet, as well as semantic frames based on FrameNet. Each verb sense is linked to one or more valency frames, which include corpus-based examples accompanied by syntactic, morphological, and semantic analyses of each argument. This study focuses on assigning FrameNet frames to the verb misliti ‘think’ and its prefixed forms. Based on 170 manually annotated sentences, the paper discusses the advantages and challenges of assigning semantic frames to Croatian verbs.
pdf
bib
abs
Putting Low German on the Map (of Linguistic Linked Open Data)
Christian Chiarcos
|
Tabea Gröger
|
Christian Fäth
We describe the creation of a cross-dialectal lexical resource for Low German, a regional language spoken primarily in Germany and the Netherlands, based on the application of Linguistic Linked Open Data (LLOD) technologies. We argue that this approach is particularly well-suited for a language without a written standard, but with multiple, incompatible orthographies and considerable internal variation in phonology, spelling and grammar. A major hurdle in the preservation and documentation of this variety, and in the creation of educational materials (such as texts and dictionaries) for it, is its internal degree of linguistic and orthographic variation, intensified by mutually exclusive influences from different national languages and their respective orthographies. We thus aim to provide a “digital Rosetta stone” to unify lexical materials from different dialects by linking dictionaries and mapping corresponding words without the need for a standard variety. This involves two components: a mapping between different orthographies and phonological systems, and a technology for linking regional dictionaries maintained by different hosts and developed by or for different communities of speakers.
pdf
bib
abs
Tracing Organisation Evolution in Wikidata
Marieke van Erp
|
Jiaqi Zhu
|
Vera Provatorova
Entities change over time, and while information about entity change is contained in knowledge graphs (KGs), it is often not stated explicitly. This makes KGs less useful for investigating entities over time, or for downstream tasks such as historical entity linking. In this paper, we present an approach and experiments that make entity change in Wikidata explicit. Our contributions are a mapping between an existing change ontology and Wikidata properties to identify types of change, and a dataset of entities with explicit evolution information, along with analytics on this dataset.
pdf
bib
abs
Automated Concept Map Extraction from Text
Martina Galletti
|
Inès Blin
|
Eleni Ilkou
Concept Maps are semantic graph summary representations of relations between concepts in text. They are particularly beneficial for students with difficulty in reading comprehension, such as those with special educational needs and disabilities. Currently, the field of concept map extraction from text is outdated, relying on old baselines, limited datasets, and limited performance, with F1 scores below 20%. We propose a novel neuro-symbolic pipeline and a GPT-3.5-based method for automated concept map extraction from text, evaluated on the WIKI dataset. The pipeline is a robust, modularized, and open-source architecture, the first to use semantic and neural techniques for automatic concept map extraction while also using a preliminary summarization component to reduce processing time and optimize computational resources. Furthermore, we investigate the large language model in zero-shot, one-shot, and decomposed prompting for concept map generation. Our approaches achieve state-of-the-art results in METEOR metrics, with F1 scores of 25.7 and 28.5, respectively, and in ROUGE-2 recall, with respective scores of 24.3 and 24.3. This contribution advances the task of automated concept map extraction from text, opening doors to wider applications such as education and speech-language therapy. The code is openly available.
pdf
bib
abs
Ligt: Towards an Ecosystem for Managing Interlinear Glossed Texts with Linguistic Linked Data
Maxim Ionov
Ligt is an RDF vocabulary developed for representing Interlinear Glossed Text, a common representation of language material used in particular in field linguistics and linguistic typology. In this paper, we look at its current status and different aspects of its adoption. More specifically, we explore the questions of data conversion, storage, and exploitation. We present ligttools, a set of newly developed converters, report on a series of experiments regarding querying Ligt datasets, and analyse the performance with various infrastructure configurations.
pdf
bib
abs
A Corpus of Early Modern Decision-Making - the Resolutions of the States General of the Dutch Republic
Marijn Koolen
|
Rik Hoekstra
This paper presents a corpus of early modern Dutch resolutions made in the daily meetings of the States General, the central governing body of the Dutch Republic, over a period of 220 years, from 1576 to 1796. This corpus has been digitised from over half a million scans of mostly handwritten text, segmented into individual resolutions (decisions) and enriched with named entities and metadata extracted from the text of the resolutions. We developed a pipeline for automatic text recognition for historic Dutch, and a document segmentation approach that combines ML classifiers trained on annotated data with rule-based fuzzy matching of the highly formulaic language of the resolutions. The decisions that the States General made were often based on propositions (requests or proposals) submitted in writing, by other governing bodies and by citizens of the republic. The resolutions contain information about these submitted propositions, including the persons and organisations who submitted them. The second part of this paper includes an analysis of the information about these proposition documents that can be extracted from the resolutions, and the potential to link the resolutions to their corresponding propositions using named entities and extracted metadata. We find that for the overwhelming majority of propositions, we can identify the name of the person or organisation who submitted it, making it feasible to (semi-)automatically link the resolutions to their corresponding proposition documents. This will allow historians and genealogists to study not only the decision making of the States General in the early modern period, but also the concerns put forward by both high-ranking officials and regular citizens of the Republic.
pdf
bib
abs
Culturally Aware Content Moderation for Facebook Reels: A Cross-Modal Attention-Based Fusion Model for Bengali Code-Mixed Data
Momtazul Arefin Labib
|
Samia Rahman
|
Hasan Murad
The advancement of high-speed internet and affordable bandwidth has led to a significant increase in video content and has brought challenges in content moderation due to the rapid spread of unsafe or harmful narratives. The rise of short-form videos like “Reels”, which are easy to create and consume, has intensified these challenges even more. In the case of Bengali culture-specific content, existing content moderation systems struggle. To tackle these challenges within the culture-specific Bengali code-mixed domain, this paper introduces “UNBER”, a novel dataset of 1,111 multimodal Bengali code-mixed Facebook Reels categorized into four classes: Safe, Adult, Harmful, and Suicidal. Our contribution also involves the development of a unique annotation tool, “ReelAn”, to enable an efficient annotation process for reels. While many existing content moderation techniques have focused on resource-rich or monolingual languages, approaches for multimodal datasets in Bengali are rare. To fill this gap, we propose a culturally aware cross-modal attention-based fusion framework to enhance the analysis of these fast-paced videos, which achieved a macro F1 score of 0.75. Our contributions aim to significantly advance multimodal content moderation and lay the groundwork for future research in this area.
pdf
bib
abs
LiITA: a Knowledge Base of Interoperable Resources for Italian
Eleonora Litta
|
Marco Carlo Passarotti
|
Valerio Basile
|
Cristina Bosco
|
Andrea Di Fabio
|
Paolo Brasolin
This paper describes the LiITA Knowledge Base of interoperable linguistic resources for Italian. By adhering to the Linked Open Data principles, LiITA ensures and facilitates interoperability between distributed resources. The paper outlines the lemma-centered architecture of the Knowledge Base and details its core component: the Lemma Bank, a collection of Italian lemmas designed to interlink distributed lexical and textual resources.
pdf
bib
abs
On the Feasibility of LLM-based Automated Generation and Filtering of Competency Questions for Ontologies
Zola Mahlaza
|
C. Maria Keet
|
Nanee Chahinian
|
Batoul Haydar
Competency questions (CQs) for ontologies are used in a number of ontology development tasks. The questions’ sentence structure has been analysed to inform ontology authoring and validation. One of the obstacles to making this a seamless process is the hurdle of writing good CQs manually or offering automated assistance in writing CQs. In this paper, we propose an enhanced and automated pipeline where one can trace meticulously through each step, using a mini-corpus, T5, and the SQuAD dataset to generate questions, and the CLaRO controlled language, semantic similarity, and other steps for filtering. This was evaluated with two corpora of different genres in the same broad domain and assessed by domain experts. Across the experiments, the final output questions scored around 25% for scope and relevance and around 45% for unproblematic quality. Technically, the pipeline provided ample insight into trade-offs in generation and filtering, where relaxing filtering increased sentence structure diversity but also led to more spurious sentences that required additional processing.
pdf
bib
abs
Terminology Enhanced Retrieval Augmented Generation for Spanish Legal Corpora
Patricia Martín Chozas
|
Pablo Calleja
|
Carlos Rodríguez Limón
This paper intends to highlight the importance of reusing terminologies in the context of Large Language Models (LLMs), particularly within a Retrieval-Augmented Generation (RAG) scenario. We explore the application of query expansion techniques using a controlled terminology enriched with synonyms. Our case study focuses on the Spanish legal domain, investigating both query expansion and improvements in retrieval effectiveness within the RAG model. The experimental setup includes various LLMs, such as Mistral, LLaMA3.2, and Granite 3, along with multiple Spanish-language embedding models. The results demonstrate that integrating current neural approaches with linguistic resources enhances RAG performance, reinforcing the role of structured lexical and terminological knowledge in modern NLP pipelines.
pdf
bib
abs
Cuaċ: Fast and Small Universal Representations of Corpora
John Philip McCrae
|
Bernardo Stearns
|
Alamgir Munir Qazi
|
Shubhanker Banerjee
|
Atul Kr. Ojha
The increasing size and diversity of corpora in natural language processing require highly efficient processing frameworks. Building on the universal corpus format, Teanga, we present Cuaċ, a format for the compact representation of corpora. We describe this methodology, based on short-string compression and indexing techniques, and show that the files created with this methodology are similar to compressed human-readable serializations and can be further compressed using lossless compression. We also show that this introduces no computational penalty on the time to process files. This methodology aims to speed up natural language processing pipelines and is the basis for a fast database system for corpora.
pdf
bib
abs
Systematic Textual Availability of Manuscripts
Hadar Miller
|
Samuel Londner
|
Tsvi Kuflik
|
Daria Vasyutinsky Shapira
|
Nachum Dershowitz
|
Moshe Lavee
The digital era has made millions of manuscript images in Hebrew available to all. However, despite major advancements in handwritten text recognition over the past decade, an efficient pipeline for large-scale and accurate conversion of these manuscripts into useful machine-readable form is still sorely lacking. We propose a pipeline that significantly improves recognition models for automatic transcription of Hebrew manuscripts. Transfer learning is used to fine-tune pretrained models. For post-recognition correction, it leverages text reuse, a common phenomenon in medieval manuscripts, and state-of-the-art large language models for medieval Hebrew. The framework successfully handles noisy transcriptions and consistently suggests alternate, better readings. Initial results show that word-level accuracy increased by 10% for new readings proposed by text-reuse detection. Moreover, character-level accuracy improved by 18% by fine-tuning models on the first few pages of each manuscript.
pdf
bib
abs
Towards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task
Gaurav Negi
|
Dhairya Dalal
|
Omnia Zayed
|
Paul Buitelaar
This paper introduces the Unified Opinion Concepts (UOC) ontology to integrate opinions within their semantic context. The UOC ontology bridges the gap between the semantic representation of opinion across different formulations. It is a unified conceptualisation based on the facets of opinions studied extensively in NLP and semantic structures described through symbolic descriptions. We further propose the Unified Opinion Concept Extraction (UOCE) task of extracting opinions from text with enhanced expressivity. Additionally, we provide a manually extended and re-annotated evaluation dataset for this task and tailored evaluation metrics to assess the adherence of extracted opinions to UOC semantics. Finally, we establish baseline performance for the UOCE task using state-of-the-art generative models.
pdf
bib
abs
Creating and enriching a repository of 177k interlinearized examples in 1611 mostly lesser-resourced languages
Sebastian Nordhoff
Much of NLP is concerned with languages for which dictionaries, thesauri, wordnets or treebanks are available. This contribution focuses on languages for which all we have might be some isolated examples with word-to-word translation. We detail the collection, aggregation, storage and querying of this database of 177k examples from 1611 languages, with a special eye on enrichment via Named Entity Recognition and links to the Wikidata ontology. We also discuss pitfalls of the approach and the legal status of interlinear examples.
pdf
bib
abs
Linking the Lexicala Latin-French Dictionary to the LiLa Knowledge Base
Adriano De Paoli
|
Marco Carlo Passarotti
|
Paolo Ruffolo
|
Giovanni Moretti
|
Ilan Kernerman
This paper presents the integration of the Lexicala Latin–French Dictionary into the LiLa Knowledge Base of linguistic resources for Latin made interoperable through their publication as Linked Open Data. The entries of the dictionary are linked to the large collection of Latin lemmas of LiLa (Lemma Bank), enabling interaction with the other resources published therein. The paper details the data modelling process, the linking methodology, and a couple of practical use cases, showing how interlinking resources via LOD can support advancement in (multilingual) linguistic research.
pdf
bib
abs
DynaMorphPro: A New Diachronic and Multilingual Lexical Resource in the LLOD ecosystem
Matteo Pellegrini
|
Valeria Irene Boano
|
Francesco Gardani
|
Francesco Mambrini
|
Giovanni Moretti
|
Marco Carlo Passarotti
This paper describes the release as Linguistic Linked Open Data of DynaMorphPro, a lexical resource recording loanwords, conversions and class-shifts from Latin to Old Italian. We show how existing vocabularies are reused and integrated to allow for a rich semantic representation of these data. Our main reference is the OntoLex-lemon model for lexical information, but classes and properties from many other ontologies are also reused to express other aspects. In particular, we identify the CIDOC Conceptual Reference Model as the ideal tool to convey chronological information on historical processes of lexical innovation and change, and describe how it can be integrated with OntoLex-lemon.
pdf
bib
abs
Exploring Medium-Sized LLMs for Knowledge Base Construction
Tomás Cerveira Da Cruz Pinto
|
Hugo Gonçalo Oliveira
|
Chris-Bennet Fleger
Knowledge base construction (KBC) is one of the great challenges in Natural Language Processing (NLP) and of fundamental importance to the growth of the Semantic Web. Large Language Models (LLMs) may be useful for extracting structured knowledge, including subject-predicate-object triples. We tackle the LM-KBC 2023 Challenge by leveraging LLMs for KBC, utilizing its dataset and benchmarking our results against challenge participants. Prompt engineering and ensemble strategies are tested for object prediction with pretrained LLMs in the 0.5-2B parameter range, which is between the limits of tracks 1 and 2 of the challenge. Selected models are assessed in zero-shot and few-shot learning approaches when predicting the objects of 21 relations. Results demonstrate that instruction-tuned LLMs outperform generative baselines by up to four times, with relation-adapted prompts playing a crucial role in performance. The ensemble approach further enhances triple extraction, with a relation-based selection strategy achieving the highest F1 score. These findings highlight the potential of medium-sized LLMs and prompt engineering methods for efficient KBC.
pdf
bib
abs
Breaking Ties: Some Methods for Refactoring RST Convergences
Andrew Potter
Among the set of schemata specified by Rhetorical Structure Theory is a pattern known variously as the request schema, satellite tie, multisatellite nucleus, or convergence. The essential feature of this schema is that it permits multiple satellites to attach to a single nucleus. Although the schema has long been considered fundamental to RST, it has never been subjected to detailed evaluation. This paper provides such an assessment. Close examination shows that it results in structures that are ambiguous, disjoint, incomplete, and sometimes incoherent. Fortunately, however, further examination shows it to be unnecessary. This paper describes the difficulties with convergences and presents methods for refactoring them as explicit specifications of text structure. The study shows that convergences can be more clearly rendered not as flat relational conjunctions, but rather as organized expressions of cumulative rhetorical moves, wherein each move asserts an identifiable structural integrity and the expressions conform to specifiable scoping rules.
pdf
bib
abs
Enhancing Information Extraction with Large Language Models: A Comparison with Human Annotation and Rule-Based Methods in a Real Estate Case Study
Renzo Alva Principe
|
Marco Viviani
|
Nicola Chiarini
Information Extraction (IE) is a key task in Natural Language Processing (NLP) that transforms unstructured text into structured data. This study compares human annotation, rule-based systems, and Large Language Models (LLMs) for domain-specific IE, focusing on real estate auction documents. We assess each method in terms of accuracy, scalability, and cost-efficiency, highlighting the associated trade-offs. Our findings provide valuable insights into the effectiveness of using LLMs for the considered task and, more broadly, offer guidance on how organizations can balance automation, maintainability, and performance when selecting the most suitable IE solution.
pdf
bib
abs
When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection
Alamgir Munir Qazi
|
John Philip McCrae
|
Jamal Nasir
The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
pdf
bib
abs
Old Reviews, New Aspects: Aspect Based Sentiment Analysis and Entity Typing for Book Reviews with LLMs
Andrea Schimmenti
|
Stefano De Giorgis
|
Fabio Vitali
|
Marieke van Erp
This paper addresses the problem of the limited availability of datasets for Aspect-Based Sentiment Analysis (ABSA) in the Cultural Heritage domain. Currently, the main datasets for ABSA are product or restaurant reviews. We expand this to book reviews. Our methodology employs an LLM to maintain domain relevance while preserving the linguistic authenticity and natural variations found in genuine reviews. Entity types are annotated through the tool Text2AMR2FRED and evaluated manually. Additionally, we finetuned Llama 3.1 8B as a baseline model that not only performs ABSA, but also performs Entity Typing (ET) with a set of classes from the DOLCE foundational ontology, enabling precise categorization of target aspects within book reviews. We present three key contributions as a step towards expanding ABSA: 1) a semi-synthetic set of book reviews, 2) an evaluation of Llama-3-1-Instruct 8B on the ABSA task, and 3) a fine-tuned version of Llama-3-1-Instruct 8B for ABSA.
pdf
bib
abs
Making Sign Language Research Findable: The sign-lang@LREC Anthology and the Sign Language Dataset Compendium
Marc Schulder
|
Thomas Hanke
|
Maria Kopf
Resources and research on sign languages are sparse and can often be difficult to locate. Few centralised sources of information exist. This article presents two repositories that aim to improve the findability of such information through the implementation of open science best practices. The sign-lang@LREC Anthology is a repository of publications on sign languages in the series of sign-lang@LREC workshops and related events, enhanced with indices cataloguing what datasets, tools, languages and projects are addressed by these publications. The Sign Language Dataset Compendium provides an overview of existing linguistic corpora, lexical resources and data collection tasks. We describe the evolution of these repositories, covering topics such as supplementary information structures, rich metadata, interoperability, and dealing with the challenges of reference rot.
pdf
bib
abs
Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language
Kilian Sennrich
|
Sina Ahmadi
Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata’s lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
pdf
bib
abs
GrEma: an HTR model for automated transcriptions of the Girifalco asylum’s medical records
Grazia Serratore
|
Emanuela Nicole Donato
|
Erika Pasceri
|
Antonietta Folino
|
Maria Chiaravalloti
This paper deals with the digitization and transcription of medical records from the historical archive of the former psychiatric hospital of Girifalco (Catanzaro, Italy). The digitization is carried out in the premises where the asylum once stood and where the historical archive is stored. Using the ScanSnap SV600 flatbed scanner, a copy compliant with the original is produced for each document contained within the medical records. Subsequently, the different training phases of a Handwritten Text Recognition model built with the Transkribus tool are presented. The transcription aims to obtain texts in an interoperable format, and it was applied exclusively to the clinical documents, such as the informative form, the nosological table and the clinical diary. This paper describes the training phases of a customized model for medical record transcription, named GrEma, presenting its benefits, limitations and possible future applications. This work was carried out ensuring compliance with current legislation on the protection of personal data. It also highlights the importance of digitization and transcription for the recovery and preservation of historical archives from former psychiatric institutions, ensuring these valuable documents remain accessible for future research and potential users.
pdf
bib
abs
Constructing a liberal identity via political speech: Tracking lifespan change in the Icelandic Gigaword Corpus
Lilja Björk Stefánsdóttir
|
Johanna Mechler
|
Anton Karl Ingason
We examine individual lifespan change in the speech of an Icelandic MP, Þorgerður Gunnarsdóttir, who style-shifts after she switches parties, by becoming less formal as her political stance becomes more liberal. We make use of the resources of the Icelandic Gigaword Corpus, more specifically the Parliament section of that corpus, demonstrating how the reinvention of an identity in politics can be tracked by studying the collection of speeches given by a politician over time.
pdf
bib
abs
Towards Sense to Sense Linking across DBnary Languages
Gilles Sérasset
Since 2012, the DBnary project has extracted lexical information from different Wiktionary language editions (26 editions in 2025) and made it available to the community as queryable RDF data (modeled using the OntoLex-lemon ontology). This dataset contains more than 12M translations linking languages at the level of Lexical Entries. This paper presents an effort to automatically link the DBnary languages at the Lexical Sense level. For this, we explore different ways to compute cross-lingual semantic similarity, using multilingual language models.
pdf
bib
abs
Empowering Recommender Systems using Automatically Generated Knowledge Graphs and Reinforcement Learning
Ghanshyam Verma
|
Simanta Sarkar
|
Devishree Pillai
|
Huan Chen
|
John Philip McCrae
|
János A. Perge
|
Shovon Sengupta
|
Paul Buitelaar
Personalized recommender systems play a crucial role in direct marketing, particularly in financial services, where delivering relevant content can enhance customer engagement and promote informed decision-making. This study explores interpretable knowledge graph (KG)-based recommender systems by proposing two distinct approaches for personalized article recommendations within a multinational financial services firm. The first approach leverages Reinforcement Learning (RL) to traverse a KG constructed from both structured (tabular) and unstructured (textual) data, enabling interpretability through Path Directed Reasoning (PDR). The second approach employs the XGBoost algorithm, with post-hoc explainability techniques such as SHAP and ELI5 to enhance transparency. By integrating machine learning with automatically generated KGs, our methods not only improve recommendation accuracy but also provide interpretable insights, facilitating more informed decision-making in customer relationship management.
pdf
bib
abs
The EuroVoc Thesaurus: Management, Applications, and Future Directions
Lucy Walhain
|
Sébastien Albouze
|
Anikó Gerencsér
|
Mihai Paunescu
|
Vassilis Tzouvaras
|
Cosimo Palma
This paper provides a comprehensive overview of EuroVoc, the European Union’s multilingual thesaurus. The paper highlights EuroVoc’s significance in the legislative and publications domain, examining its applications in improving information retrieval systems and multi-label text classification methods. Various technological tools developed specifically for EuroVoc classification, including JEX, PyEuroVoc, and KEVLAR, are reviewed, demonstrating the evolution from basic classification systems to sophisticated neural architectures. Additionally, the paper addresses the management practices governing EuroVoc’s continuous updating and expansion through collaborative tools such as VocBench, emphasising the role of interinstitutional committees and specialised teams in maintaining the thesaurus’s accuracy and relevance. A substantial part of the paper is dedicated to EuroVoc’s alignment with other semantic resources like Wikidata and UNESCO, detailing the challenges and methodologies adopted to facilitate semantic interoperability across diverse information systems. Finally, the paper identifies future directions that include modular extensions of EuroVoc, federated models, linked data approaches, thematic hubs, selective integration, and collaborative governance frameworks.