Horacio Saggion

2021

Syntax-aware Transformers for Neural Machine Translation: The Case of Text to Sign Gloss Translation
Santiago Egea Gómez | Euan McGill | Horacio Saggion
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

It is well established that Sign Languages (SLs) are the preferred mode of communication of the deaf and hard of hearing (DHH) community, but they are considered low-resource languages as far as natural language processing technologies are concerned. In this paper we study the problem of text to SL gloss Machine Translation (MT) using Transformer-based architectures. Despite the significant advances of MT for spoken languages in the last couple of decades, MT is in its infancy when it comes to SLs. We enrich a Transformer-based architecture by aggregating syntactic information extracted from a dependency parser with the word embeddings. We test our model on a well-known dataset, showing that the syntax-aware model obtains performance gains in terms of MT evaluation metrics.

Controllable Sentence Simplification with a Unified Text-to-Text Transfer Transformer
Kim Cheng Sheang | Horacio Saggion
Proceedings of the 14th International Conference on Natural Language Generation

Recently, a large pre-trained language model called T5 (A Unified Text-to-Text Transfer Transformer) has achieved state-of-the-art performance in many NLP tasks. However, no study has been found using this pre-trained model on Text Simplification. Therefore, in this paper we explore fine-tuning T5 on Text Simplification, combined with a controllable mechanism to regulate the system outputs, which can help generate adapted text for different target audiences. Our experiments show that our model achieves remarkable results with gains of between +0.69 and +1.41 over the current state-of-the-art (BART+ACCESS). We argue that using a pre-trained model such as T5, trained on several tasks with large amounts of data, can help improve Text Simplification.

2020

LaSTUS/TALN at TRAC - 2020 Trolling, Aggression and Cyberbullying
Lütfiye Seda Mut Altın | Alex Bravo | Horacio Saggion
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

This paper presents the participation of the LaSTUS/TALN team in the TRAC-2020 Trolling, Aggression and Cyberbullying shared task. The aim of the task is to determine whether a given text is aggressive and contains misogynistic content. Our approach is based on a bidirectional Long Short-Term Memory network (bi-LSTM). Our system performed well at sub-task A, aggression detection; however, it underperformed at sub-task B, misogyny detection.

A Multi-level Annotated Corpus of Scientific Papers for Scientific Document Summarization and Cross-document Relation Discovery
Ahmed AbuRa’ed | Horacio Saggion | Luis Chiruzzo
Proceedings of the 12th Language Resources and Evaluation Conference

Related work sections or literature reviews are an essential part of every scientific article, being crucial for paper reviewing and assessment. The automatic generation of related work sections can be considered an instance of the multi-document summarization problem. In order to allow the study of this specific problem, we have developed a manually annotated, machine-readable dataset of related work sections, cited papers (i.e., references) and sentences, together with an additional layer of papers citing the references. We additionally present experiments on the identification of cited sentences, using citation contexts as input. The corpus and the gold standard are made available for use by the scientific community.

2019

Transferring Knowledge from Discourse to Arguments: A Case Study with Scientific Abstracts
Pablo Accuosto | Horacio Saggion
Proceedings of the 6th Workshop on Argument Mining

In this work we propose to leverage resources available with discourse-level annotations to facilitate the identification of argumentative components and relations in scientific texts, which has been recognized as a particularly challenging task. In particular, we implement and evaluate a transfer learning approach in which contextualized representations learned from discourse parsing tasks are used as input of argument mining models. As a pilot application, we explore the feasibility of using automatically identified argumentative components and relations to predict the acceptance of papers in computer science venues. In order to conduct our experiments, we propose an annotation scheme for argumentative units and relations and use it to enrich an existing corpus with an argumentation layer.

LaSTUS/TALN at SemEval-2019 Task 6: Identification and Categorization of Offensive Language in Social Media with Attention-based Bi-LSTM model
Lutfiye Seda Mut Altin | Àlex Bravo Serrano | Horacio Saggion
Proceedings of the 13th International Workshop on Semantic Evaluation

We present a bidirectional Long Short-Term Memory network for identifying offensive language on Twitter. Our system has been developed in the context of SemEval 2019 Task 6, which comprises three different sub-tasks, namely A: Offensive Language Detection, B: Categorization of Offensive Language, and C: Offensive Language Target Identification. We used word embeddings pre-trained on tweet data, including information about emojis and hashtags. Our approach achieves good performance in the three sub-tasks.

2018

Interpretable Emoji Prediction via Label-Wise Attention LSTMs
Francesco Barbieri | Luis Espinosa-Anke | Jose Camacho-Collados | Steven Schockaert | Horacio Saggion
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Human language has evolved towards newer forms of communication such as social media, where emojis (i.e., ideograms bearing a visual meaning) play a key role. While there is an increasing body of work aimed at the computational modeling of emoji semantics, there is currently little understanding about what makes a computational model represent or predict a given emoji in a certain way. In this paper we propose a label-wise attention mechanism with which we attempt to better understand the nuances underlying emoji prediction. In addition to advantages in terms of interpretability, we show that our proposed architecture improves over standard baselines in emoji prediction, and does particularly well when predicting infrequent emojis.

Data-Driven Text Simplification
Sanja Štajner | Horacio Saggion
Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts

This paper describes the results of the first Shared Task on Multilingual Emoji Prediction, organized as part of SemEval 2018. Given the text of a tweet, the task consists of predicting the most likely emoji to be used along with such a tweet. Two subtasks were proposed, one for English and one for Spanish, and participants were allowed to submit a system run to one or both subtasks. In total, 49 teams participated in the English subtask and 22 teams submitted a system run to the Spanish subtask. Evaluation was carried out emoji-wise, and the final ranking was based on macro F-Score. Data and further information about this task can be found at https://competitions.codalab.org/competitions/17344.

This paper describes the SemEval 2018 Shared Task on Hypernym Discovery. We put forward this task as a complementary benchmark for modeling hypernymy, a problem which has traditionally been cast as a binary classification task, taking a pair of candidate words as input. Instead, our reformulated task is defined as follows: given an input term, retrieve (or discover) its suitable hypernyms from a target corpus. We proposed five different subtasks covering three languages (English, Spanish, and Italian), and two specific domains of knowledge in English (Medical and Music). Participants were allowed to compete in any or all of the subtasks. Overall, a total of 11 teams participated, with a total of 39 different systems submitted through all subtasks. Data, results and further information about the task can be found at https://competitions.codalab.org/competitions/17119.

Multimodal Emoji Prediction
Francesco Barbieri | Miguel Ballesteros | Francesco Ronzano | Horacio Saggion
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Emojis are small images that are commonly included in social media text messages. The combination of visual and textual content in the same message builds up a modern way of communication that automatic systems are not used to dealing with. In this paper we extend recent advances in emoji prediction by putting forward a multimodal approach that is able to predict emojis in Instagram posts. Instagram posts are composed of pictures together with texts which sometimes include emojis. We show that these emojis can be predicted by using the text, but also by using the picture. Our main finding is that incorporating the two synergistic modalities in a combined model improves accuracy in an emoji prediction task. This result demonstrates that these two modalities (text and images) encode different information on the use of emojis and therefore can complement each other.

LaSTUS/TALN at Complex Word Identification (CWI) 2018 Shared Task
Ahmed AbuRa’ed | Horacio Saggion
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents the participation of the LaSTUS/TALN team in the Complex Word Identification (CWI) Shared Task 2018 in the English monolingual track. The purpose of the task was to determine if a word in a given sentence can be judged as complex or not by a certain target audience. For the English track, task organizers provided training and development datasets of 27,299 and 3,328 words respectively, together with the sentence in which each word occurs. The words were judged as complex or not by 20 human evaluators, ten of whom are native speakers. We submitted two systems: one system modeled each word to evaluate as a numeric vector populated with a set of lexical, semantic and contextual features, while the other system relies on a word embedding representation and a distance metric. We trained two separate classifiers to automatically decide if each word is complex or not. We submitted six runs, two for each of the three subsets of the English monolingual CWI track.

Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)
Arne Jönsson | Evelina Rennes | Horacio Saggion | Sanja Stajner | Victoria Yaneva
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

PDFdigest: an Adaptable Layout-Aware PDF-to-XML Textual Content Extractor for Scientific Articles
Daniel Ferrés | Horacio Saggion | Francesco Ronzano | Àlex Bravo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

What Sentence are you Referring to and Why? Identifying Cited Sentences in Scientific Literature
Ahmed AbuRa’ed | Luis Chiruzzo | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In the current context of scientific information overload, text mining tools are of paramount importance for researchers who have to read scientific papers and assess their value. Current citation networks, which link papers by citation relationships (reference and citing paper), are useful to quantitatively understand the value of a piece of scientific work; however, they are limited in that they do not provide information about what specific part of the reference paper the citing paper is referring to. This qualitative information is very important, for example, in the context of current community-based scientific summarization activities. In this paper, relying on an annotated dataset of co-citation sentences, we carry out a number of experiments aimed at automatically identifying, given a citation sentence, the part of a reference paper being cited. Additionally, our algorithm predicts the specific reason why such a reference sentence has been cited, out of five possible reasons.

Towards the Understanding of Gaming Audiences by Modeling Twitch Emotes
Francesco Barbieri | Luis Espinosa-Anke | Miguel Ballesteros | Juan Soler-Company | Horacio Saggion
Proceedings of the 3rd Workshop on Noisy User-generated Text

Videogame streaming platforms have become a paramount example of noisy user-generated text. These are websites where gaming is broadcast and which allow interaction with viewers via integrated chatrooms. Probably the best known platform of this kind is Twitch, which has more than 100 million monthly viewers. Despite these numbers, and unlike other platforms featuring short messages (e.g. Twitter), Twitch has not received much attention from the Natural Language Processing community. In this paper we aim at bridging this gap by proposing two important tasks specific to the Twitch platform, namely (1) Emote prediction; and (2) Trolling detection. In our experiments, we evaluate three models: a BOW baseline, a logistic supervised classifier based on word embeddings, and a bidirectional long short-term memory recurrent neural network (LSTM). Our results show that the LSTM model outperforms the other two models, where explicit features with proven effectiveness for similar tasks were encoded.

An Adaptable Lexical Simplification Architecture for Major Ibero-Romance Languages
Daniel Ferrés | Horacio Saggion | Xavier Gómez Guinovart
Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems

Lexical Simplification is the task of reducing the lexical complexity of textual documents by replacing difficult words with easier to read (or understand) expressions while preserving the original meaning. The development of robust pipelined multilingual architectures able to adapt to new languages is of paramount importance in lexical simplification. This paper describes and evaluates a modular hybrid linguistic-statistical Lexical Simplifier that deals with the four major Ibero-Romance Languages: Spanish, Portuguese, Catalan, and Galician. The architecture of the system is the same for the four languages addressed; only the language resources used during simplification are language-specific.

Are Emojis Predictable?
Francesco Barbieri | Miguel Ballesteros | Horacio Saggion
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Emojis are ideograms which are naturally combined with plain text to visually complement or condense the meaning of a message. Despite being widely used in social media, their underlying semantics have received little attention from a Natural Language Processing standpoint. In this paper, we investigate the relation between words and emojis, studying the novel task of predicting which emojis are evoked by text-based tweet messages. We train several models based on Long Short-Term Memory networks (LSTMs) in this task. Our experimental results show that our neural model outperforms a baseline as well as humans solving the same task, suggesting that computational models are able to better capture the underlying semantics of emojis.

2016

Making Sense of Massive Amounts of Scientific Publications: the Scientific Knowledge Miner Project
Francesco Ronzano | Ana Freire | Diego Saez-Trumper | Horacio Saggion
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

Trainable Citation-enhanced Summarization of Scientific Articles
Horacio Saggion | Ahmed AbuRa’ed | Francesco Ronzano
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

Extending WordNet with Fine-Grained Collocational Information via Supervised Distributional Learning
Luis Espinosa-Anke | Jose Camacho-Collados | Sara Rodríguez-Fernández | Horacio Saggion | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

WordNet is probably the best known lexical resource in Natural Language Processing. While it is widely regarded as a high quality repository of concepts and semantic relations, updating and extending it manually is costly. One important type of relation which could potentially add enormous value to WordNet is the inclusion of collocational information, which is paramount in tasks such as Machine Translation, Natural Language Generation and Second Language Learning. In this paper, we present ColWordNet (CWN), an extended WordNet version with fine-grained collocational information, automatically introduced thanks to a method exploiting linear relations between analogous sense-level embeddings spaces. We perform both intrinsic and extrinsic evaluations, and release CWN for the use and scrutiny of the community.

Natural Language Processing for Intelligent Access to Scientific Information
Horacio Saggion | Francesco Ronzano
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Tutorial Abstracts

During the last decade the amount of scientific information available on-line increased at an unprecedented rate. As a consequence, nowadays researchers are overwhelmed by an enormous and continuously growing number of articles to consider when they perform research activities like the exploration of advances in specific topics, peer reviewing, writing and evaluation of proposals. Natural Language Processing Technology represents a key enabling factor in providing scientists with intelligent patterns to access scientific information. Extracting information from scientific papers, for example, can contribute to the development of rich scientific knowledge bases which can be leveraged to support intelligent knowledge access and question answering. Summarization techniques can reduce the size of long papers to their essential content or automatically generate state-of-the-art reviews. Paraphrase or textual entailment techniques can contribute to the identification of relations across different scientific textual sources. This tutorial provides an overview of the most relevant tasks related to the processing of scientific documents, including but not limited to the in-depth analysis of the structure of scientific articles, their semantic interpretation, content extraction and summarization.

Supervised Distributional Hypernym Discovery via Domain Adaptation
Luis Espinosa-Anke | Jose Camacho-Collados | Claudio Delli Bovi | Horacio Saggion
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

A Multi-Layered Annotated Corpus of Scientific Papers
Beatriz Fisas | Francesco Ronzano | Horacio Saggion
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Scientific literature records the research process with a standardized structure and provides the clues to track the progress in a scientific field. Understanding its internal structure and content is of paramount importance for natural language processing (NLP) technologies. To meet this requirement, we have developed a multi-layered annotated corpus of scientific papers in the domain of Computer Graphics. Sentences are annotated with respect to their role in the argumentative structure of the discourse. The purpose of each citation is specified. Special features of the scientific discourse such as advantages and disadvantages are identified. In addition, a grade is allocated to each sentence according to its relevance for being included in a summary. To the best of our knowledge, this complex, multi-layered collection of annotations and metadata characterizing a set of research papers had never been grouped together before in one corpus and therefore constitutes a newer, richer resource with respect to those currently available in the field.

ELMD: An Automatically Generated Entity Linking Gold Standard Dataset in the Music Domain
Sergio Oramas | Luis Espinosa Anke | Mohamed Sordo | Horacio Saggion | Xavier Serra
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a gold standard dataset for Entity Linking (EL) in the Music Domain. It contains thousands of musical named entities such as Artist, Song or Record Label, which have been automatically annotated on a set of artist biographies coming from the Music website and social network Last.fm. The annotation process relies on the analysis of the hyperlinks present in the source texts and on a voting-based algorithm for EL, which considers, for each entity mention in text, the degree of agreement across three state-of-the-art EL systems. Manual evaluation shows that EL Precision is at least 94%, and due to its tunable nature, it is possible to derive annotations favouring higher Precision or Recall, at will. We make available the annotated dataset along with evaluation data and the code.

What does this Emoji Mean? A Vector Space Skip-Gram Model for Twitter Emojis
Francesco Barbieri | Francesco Ronzano | Horacio Saggion
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Emojis allow us to describe objects, situations and even feelings with small images, providing a visual and quick way to communicate. In this paper, we analyse emojis used in Twitter with distributional semantic models. We retrieve 10 million tweets posted by USA users, and we build several skip-gram word embedding models by mapping both words and emojis into the same vector space. We test our models with semantic similarity experiments, comparing the output of our models with human assessment. We also carry out an exhaustive qualitative evaluation, showing interesting results.

TALN at SemEval-2016 Task 11: Modelling Complex Words by Contextual, Lexical and Semantic Features
Francesco Ronzano | Ahmed Abura’ed | Luis Espinosa-Anke | Horacio Saggion
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

TALN at SemEval-2016 Task 14: Semantic Taxonomy Enrichment Via Sense-Based Embeddings
Luis Espinosa-Anke | Francesco Ronzano | Horacio Saggion
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

On the Discoursive Structure of Computer Graphics Research Papers
Beatriz Fisas | Horacio Saggion | Francesco Ronzano
Proceedings of The 9th Linguistic Annotation Workshop

How Topic Biases Your Results? A Case Study of Sentiment Analysis and Irony Detection in Italian
Francesco Barbieri | Francesco Ronzano | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing

Weakly Supervised Definition Extraction
Luis Espinosa-Anke | Horacio Saggion | Francesco Ronzano
Proceedings of the International Conference Recent Advances in Natural Language Processing

Translating from Original to Simplified Sentences using Moses: When does it Actually Work?
Sanja Štajner | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing

Automatic Text Simplification for Spanish: Comparative Evaluation of Various Simplification Strategies
Sanja Štajner | Iacer Calixto | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing

A Deeper Exploration of the Standard PB-SMT Approach to Text Simplification and its Evaluation
Sanja Štajner | Hannah Béchara | Horacio Saggion
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

UPF-taln: SemEval 2015 Tasks 10 and 11. Sentiment Analysis of Literal and Figurative Language in Twitter
Francesco Barbieri | Francesco Ronzano | Horacio Saggion
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

TALN-UPF: Taxonomy Learning Exploiting CRF-Based Hypernym Extraction on Encyclopedic Definitions
Luis Espinosa-Anke | Horacio Saggion | Francesco Ronzano
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Creating Summarization Systems with SUMMA
Horacio Saggion
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Automatic text summarization, the reduction of a text to its essential content, is fundamental for an on-line information society. Although many summarization algorithms exist, there are few tools or infrastructures providing capabilities for developing summarization applications. This paper presents a new version of SUMMA, a text summarization toolkit for the development of adaptive summarization applications. SUMMA includes algorithms for computation of various sentence relevance features and functionality for single and multidocument summarization in various languages. It also offers methods for content-based evaluation of summaries.

Modelling Irony in Twitter: Feature Analysis and Evaluation
Francesco Barbieri | Horacio Saggion
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Irony, a creative use of language, has received scarce attention from the computational linguistics research point of view. We propose an automatic system capable of detecting irony with good accuracy in the social network Twitter. Twitter allows users to post short messages (140 characters) which usually do not follow the expected rules of grammar; users tend to truncate words and use particular punctuation. For these reasons, automatic detection of irony in Twitter is not trivial and requires specific linguistic tools. We propose in this paper a new set of experiments to assess the relevance of the features included in our model. Our model does not include words or sequences of words as features, aiming to detect inner characteristics of irony.

Can Numerical Expressions Be Simpler? Implementation and Demostration of a Numerical Simplification System for Spanish
Susana Bautista | Horacio Saggion
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Information in newspapers is often shown in the form of numerical expressions which present comprehension problems for many people, including people with disabilities, illiteracy or lack of access to advanced technology. The purpose of this paper is to motivate, describe, and demonstrate a rule-based lexical component that simplifies numerical expressions in Spanish texts. We propose an approach that makes news articles more accessible to certain readers by rewriting difficult numerical expressions in a simpler way. We will showcase the numerical simplification system with a live demo based on the execution of our components over different texts, which will consider both successful and unsuccessful simplification cases.

One Step Closer to Automatic Evaluation of Text Simplification Systems
Sanja Štajner | Ruslan Mitkov | Horacio Saggion
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

Keyword Highlighting Improves Comprehension for People with Dyslexia
Luz Rello | Horacio Saggion | Ricardo Baeza-Yates
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

Modelling Sarcasm in Twitter, a Novel Approach
Francesco Barbieri | Horacio Saggion | Francesco Ronzano
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Modelling Irony in Twitter
Francesco Barbieri | Horacio Saggion
Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

Unsupervised Learning Summarization Templates from Concise Summaries
Horacio Saggion
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility
Luz Rello | Horacio Saggion | Ricardo Baeza-Yates
Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility

Proceedings of the 14th European Workshop on Natural Language Generation
Albert Gatt | Horacio Saggion
Proceedings of the 14th European Workshop on Natural Language Generation

Readability Indices for Automatic Evaluation of Text Simplification Systems: A Feasibility Study for Spanish
Sanja Štajner | Horacio Saggion
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Towards Automatic Lexical Simplification in Spanish: An Empirical Study
Biljana Drndarević | Horacio Saggion
Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations

Graphical Schemes May Improve Readability but Not Understandability for People with Dyslexia
Luz Rello | Horacio Saggion | Ricardo Baeza-Yates | Eduardo Graells
Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations

A Hybrid System for Spanish Text Simplification
Stefan Bott | Horacio Saggion | David Figueroa
Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies

Unsupervised Content Discovery from Concise Summaries
Horacio Saggion
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

Can Spanish Be Simpler? LexSiS: Lexical Simplification for Spanish
Stefan Bott | Luz Rello | Biljana Drndarevic | Horacio Saggion
Proceedings of COLING 2012

The CONCISUS Corpus of Event Summaries
Horacio Saggion | Sandra Szasz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Text summarization and information extraction systems require adaptation to new domains and languages. This adaptation usually depends on the availability of language resources such as corpora. In this paper we present a comparable corpus in Spanish and English for the study of cross-lingual information extraction and summarization: the CONCISUS Corpus. It is a rich human-annotated dataset composed of comparable event summaries in Spanish and English covering four different domains: aviation accidents, rail accidents, earthquakes, and terrorist attacks. In addition to the monolingual summaries in English and Spanish, we provide automatic translations and "comparable" full event reports of the events. The human annotations are concepts marked in the textual sources representing the key event information associated with the event type. The dataset has also been annotated using text processing pipelines. It is being made freely available to the research community for research purposes.

Text Simplification Tools for Spanish
Stefan Bott | Horacio Saggion | Simon Mille
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we describe the development of a text simplification system for Spanish. Text simplification is the adaptation of a text to the special needs of certain groups of readers, such as language learners, people with cognitive difficulties and elderly people, among others. There is a clear need for simplified texts, but manual production and adaptation of existing texts is labour intensive and costly. Automatic simplification is a field which attracts growing attention in Natural Language Processing, but, to the best of our knowledge, there are no simplification tools for Spanish. We present a prototype for automatic simplification, which shows that the most important structural simplification operations can be successfully treated with an approach based on rules which can potentially be improved by statistical methods. For the development of this prototype we carried out a corpus study which aims at identifying the operations a text simplification system needs to carry out in order to produce an output similar to what human editors produce when they simplify texts.

2011

pdf bib
An Unsupervised Alignment Algorithm for Text Simplification Corpus Construction
Stefan Bott | Horacio Saggion
Proceedings of the Workshop on Monolingual Text-To-Text Generation

pdf bib
Multi-domain Cross-lingual Information Extraction from Clean and Noisy Texts
Horacio Saggion | Sandra Szasz
Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology

2010

pdf bib abs
NLP Resources for the Analysis of Patient/Therapist Interviews
Horacio Saggion | Elena Stein-Sparvieri | David Maldavsky | Sandra Szasz
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a set of tools and resources for the analysis of interviews during psychotherapy sessions. One of the main components of the work is a dictionary-based text interpretation tool for the Spanish language. The tool is designed to identify a subset of Freudian drives in patient and therapist discourse.

pdf bib abs
Interpreting SentiWordNet for Opinion Classification
Horacio Saggion | Adam Funk
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a set of tools, resources, and experiments for opinion classification in business-related data sources in two languages. In particular we concentrate on SentiWordNet text interpretation to produce word, sentence, and text-based sentiment features for opinion classification. We achieve good results in experiments using supervised machine learning over syntactic and sentiment-based features. We also show preliminary experiments where the use of summaries before opinion classification provides a competitive advantage over the use of full documents.

pdf bib
Multilingual Summarization Evaluation without Human Models
Horacio Saggion | Juan-Manuel Torres-Moreno | Iria da Cunha | Eric SanJuan | Patricia Velázquez-Morales
Coling 2010: Posters

pdf bib abs
Évaluation automatique de résumés avec et sans référence
Juan-Manuel Torres-Moreno | Horacio Saggion | Iria da Cunha | Patricia Velázquez-Morales | Eric Sanjuan
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We study different content-based methods for evaluating document summaries. We are particularly interested in the correlation between evaluation measures with and without human references. We have developed FRESA, a new content-based evaluation system that computes divergences between probability distributions. We apply our comparison system to several well-known summarization evaluation measures, such as Coverage, Responsiveness, Pyramids and Rouge, studying their associations in the tasks of generic multi-document summarization (French/English), focused summarization (English) and generic single-document summarization (French/Spanish).

pdf bib
Experiments on Summary-based Opinion Classification
Elena Lloret | Horacio Saggion | Manuel Palomar
Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text

pdf bib
Human Language Technology for Text-based Analysis of Psychotherapy Sessions in the Spanish Language
Horacio Saggion | Elena Stein-Sparvieri | David Maldavsky | Sandra Szasz
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

2009

pdf bib
A Classification Algorithm for Predicting the Structure of Summaries
Horacio Saggion
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

2008

pdf bib abs
A Framework for Identity Resolution and Merging for Multi-source Information Extraction
Milena Yankova | Horacio Saggion | Hamish Cunningham
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In the context of ontology-based information extraction, identity resolution is the process of deciding whether an instance extracted from text refers to a known entity in the target domain (e.g. the ontology). We present an ontology-based framework for identity resolution which can be customized to different application domains and extraction tasks. Rules for identity resolution, which compute similarities between target and source entities based on class information and instance properties and values, can be defined for each class in the ontology. We present a case study of the application of the framework to the problem of multi-source job vacancy extraction.

pdf bib
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization
Sivaji Bandyopadhyay | Thierry Poibeau | Horacio Saggion | Roman Yangarber
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization

pdf bib
Experiments on Semantic-based Clustering for Cross-document Coreference
Horacio Saggion
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Introduction to Text Summarization and Other Information Access Technologies
Horacio Saggion
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

2007

pdf bib
SHEF: Semantic Tagging and Summarization Techniques Applied to Cross-document Coreference
Horacio Saggion
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib abs
Multilingual Multidocument Summarization Tools and Evaluation
Horacio Saggion
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe a number of experiments carried out to address the problem of creating summaries from multiple sources in multiple languages. A centroid-based sentence extraction system has been developed which decides the content of the summary using texts in different languages and uses sentences from English sources alone to create the final output. We describe the evaluation of the system in the recent Multilingual Summarization Evaluation MSE 2005 using the pyramids and ROUGE methods.

pdf bib abs
Language Resources for Background Gathering
Horacio Saggion | Robert Gaizauskas
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe the Cubreporter information access system which allows access to news archives through the use of natural language technology. The system includes advanced text search, question answering, summarization, and entity profiling capabilities. It has been designed taking into account the characteristics of the background gathering task.
