Luca Dini

Also published as: L. Dini


2025

Cognitive signals, particularly eye-tracking data, offer valuable insights into human language processing. Leveraging eye-gaze data from the Ghent Eye-Tracking Corpus, we conducted a series of experiments to examine how integrating knowledge of human reading behavior impacts Neural Language Models (NLMs) across multiple dimensions: task performance, attention mechanisms, and the geometry of their embedding space. We explored several fine-tuning methodologies to inject eye-tracking features into the models. Our results reveal that incorporating these features does not degrade downstream task performance, enhances alignment between model attention and human attention patterns, and compresses the geometry of the embedding space.
We present a deep investigation of encoder-based Language Models (LMs) and their ability to detect text coherence across four languages and four text genres, using a new evaluation benchmark, TEXT-CAKE. We analyze both multilingual and monolingual LMs with varying architectures and parameter counts in different fine-tuning settings. Our findings demonstrate that identifying subtle perturbations that disrupt local coherence remains a challenging task. Furthermore, our results underline the importance of diverse text genres during pre-training, of an optimal pre-training objective, and of a large vocabulary size. When controlling for other parameters, deep LMs (i.e., with a higher number of layers) have an advantage over shallow ones, even when the total number of parameters is smaller.

2023

2018

2016

In this paper, we present an experiment to detect emotions in tweets. Unlike much previous research, we draw the important distinction between the task of emotion detection under a closed-world assumption (i.e., every tweet is emotional) and the complicated task of identifying emotional versus non-emotional tweets. Given an apparent lack of appropriately annotated data, we created two corpora for these tasks. We describe two systems, one symbolic and one based on machine learning, which we evaluated on our datasets. Our evaluation shows that a machine learning classifier performs best on emotion detection, while a symbolic approach is better for identifying relevant (i.e., emotional) tweets.

2014

In recent years we have observed two parallel trends in computational linguistics research and e-commerce development. On the research side, there has been increasing interest in algorithms and approaches able to capture the polarity of opinions expressed by users on products, institutions and services. On the other hand, almost all major e-commerce and aggregator sites now offer users the possibility of writing comments and expressing their appreciation with a numeric score (usually represented as a number of stars). This generates the impression that the work carried out in the research community is rendered partially useless (at least for economic exploitation) by an evolution in web practices. In this paper we describe an experiment on a large corpus which shows that the score judgments provided by users often conflict with the text of the opinion, to such an extent that a rule-based opinion mining system can be demonstrated to rank user opinions better than the users themselves.
The Named Entity Recognition task requires high-quality, large-scale resources. In this paper, we present RENCO, a rule-based system focused on the recognition of entities in the cosmetic domain (brand names, product names, …). RENCO has two main objectives: 1) generating resources for named entity recognition; 2) mining new named entities relying on the previously generated resources. To build lexical resources for the cosmetic domain, we propose a system based on local lexico-syntactic rules complemented by a learning module. As output, the system generates both a simple lexicon and a structured lexicon. Evaluation results show that even though RENCO outperforms a classic Conditional Random Fields algorithm, the two systems should be combined to exploit their respective strengths.

2013

2012

In order to handle the increasing amount of textual information available on the web today and exploit the knowledge latent in this mass of unstructured data, a wide variety of linguistic knowledge and resources (Language Identification, Morphological Analysis, Entity Extraction, etc.) is crucial. In the last decade, LRaaS (Language Resource as a Service) emerged as a novel paradigm for publishing and sharing these heterogeneous software resources over the Web. In this paper we present an overview of Linguagrid, a recent initiative that implements an open network of linguistic and semantic Web Services for the Italian language, as well as a new approach for enabling customizable corpus-based linguistic services on the Linguagrid LRaaS infrastructure. A corpus ingestion service allows users to upload corpora of documents and to generate classification/clustering models tailored to their needs, by means of standard machine learning techniques applied to the textual contents and metadata of the corpora. The models thus generated can then be accessed through dedicated Web Services and exploited to process and classify new textual content.

2010

This paper describes the effects of the evolution of an Italian dependency grammar on a task of multilingual FrameNet acquisition. The task is based on the creation of virtual English/Italian parallel annotation corpora, which are then aligned at the dependency level using two manually encoded grammar-based dependency parsers. We show how the evolution of the LAS (Labeled Attachment Score) metric for the considered grammar has a direct impact on the quality of the induced FrameNet, thus proving that improvements in the quality of syntactic resources are mirrored by analogous improvements in semantic ones. In particular, we show that a 30% improvement in LAS yields an improvement in precision for the induced resource ranging from 5% to 10%, depending on the type of evaluation.

2006

This paper presents an ongoing project aiming to enhance the OPAC (Online Public Access Catalog) search system of the Library of the Free University of Bozen-Bolzano with multilingual access. The multilingual search system (MUSIL) we have developed integrates advanced linguistic technologies in a user-friendly interface and bridges the gap between the world of free-text search and the world of conceptual librarian search. In this paper we present the architecture of the system, its interface, and preliminary evaluations of the precision of the search results.

2002

2001

1998

1997

1992