A Review in Knowledge Extraction from Knowledge Bases
Fabio Yanez
Andrés Montoyo
Yoan Gutierrez
Rafael Muñoz
Armando Suarez
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Generative language models achieve the state of the art in many tasks within natural language processing (NLP). Although these models correctly capture syntactic information, they fail to interpret knowledge (semantics). Moreover, the lack of interpretability of these models promotes the use of other technologies as a replacement or complement to generative language models. This is the case with research focused on incorporating knowledge by resorting to knowledge bases, mainly in the form of graphs. Large knowledge graphs are generated with unsupervised or semi-supervised techniques, and, because of the size of the resulting databases, the same type of techniques is used to validate this knowledge. In this review, we explain the different techniques used to test and infer knowledge from graph structures with machine learning algorithms. The motivation for validating and inferring knowledge is to use correct knowledge in subsequent tasks with improved embeddings.
T2KG: Transforming Multimodal Document to Knowledge Graph
Santiago Galiano
Rafael Muñoz
Yoan Gutiérrez
Andrés Montoyo
Jose Ignacio Abreu
Luis Alfonso Ureña
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
The large amount of information in digital format that exists today makes it unfeasible to use manual means to acquire the knowledge contained in these documents. Therefore, it is necessary to develop tools that allow us to incorporate this knowledge into a structure that is easy to use by both machines and humans. This paper presents a system that can incorporate the relevant information from a document in any format, structured or unstructured, into a semantic network that represents the existing knowledge in the document. The system handles everything from structured documents, which it processes according to their annotation scheme, to unstructured documents written in natural language, for which it uses a set of sensors that identify the relevant information and subsequently incorporate it into the semantic network. This network is built by linking all the information based on the knowledge discovered.
Active Learning for Assisted Corpus Construction: A Case Study in Knowledge Discovery from Biomedical Text
Hian Cañizares-Díaz
Alejandro Piad-Morffis
Suilan Estevez-Velarde
Yoan Gutiérrez
Yudivián Almeida Cruz
Andres Montoyo
Rafael Muñoz-Guillena
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
This paper presents an active learning approach that aims to reduce the human effort required during the annotation of natural language corpora composed of entities and semantic relations. Our approach assists human annotators by intelligently selecting the most informative sentences to annotate and then pre-annotating them with a few highly accurate entities and semantic relations. We define an uncertainty-based query strategy with a weighted density factor, using similarity metrics based on sentence embeddings. As a case study, we evaluate our approach via simulation in a biomedical corpus and estimate the potential reduction in total annotation time. Experimental results suggest that the query strategy reduces by between 35% and 40% the number of sentences that must be manually annotated to develop systems able to reach a target F1 score, while the pre-annotation strategy produces an additional 24% reduction in the total annotation time. Overall, our preliminary experiments suggest that as much as 60% of the annotation time could be saved while producing corpora that have the same usefulness for training machine learning algorithms. An open-source computational tool that implements the aforementioned strategies is presented and published online for the research community.
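The query strategy described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes the standard information-density formulation: prediction entropy as the uncertainty term, multiplied by the mean cosine similarity to the other unlabeled sentences raised to a weight beta. The function names, the beta parameter, and the toy vectors are all hypothetical and are not taken from the authors' tool.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def entropy(probs):
    """Shannon entropy of a predicted class distribution (the uncertainty term)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def density_weighted_scores(embeddings, class_probs, beta=1.0):
    """Score each unlabeled sentence as uncertainty * density ** beta.

    Assumption: density is the mean cosine similarity between a sentence
    embedding and the embeddings of all other unlabeled sentences.
    """
    n = len(embeddings)
    scores = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        density = sum(sims) / len(sims) if sims else 1.0
        scores.append(entropy(class_probs[i]) * max(density, 0.0) ** beta)
    return scores


def select_batch(embeddings, class_probs, k=1, beta=1.0):
    """Return the indices of the k highest-scoring sentences to annotate next."""
    scores = density_weighted_scores(embeddings, class_probs, beta)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

With beta set to 0 the density term has no effect and the strategy reduces to plain uncertainty sampling. Larger values favor sentences that are representative of the unlabeled pool over uncertain outliers.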
Knowledge Discovery in COVID-19 Research Literature
Ernesto L. Estevanell-Valladares
Suilan Estevez-Velarde
Alejandro Piad-Morffis
Yoan Gutierrez
Andres Montoyo
Rafael Muñoz
Yudivián Almeida Cruz
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
This paper presents the preliminary results of an ongoing project that analyzes the growing body of scientific research published around the COVID-19 pandemic. In this research, a general-purpose semantic model is used to double annotate a batch of 500 sentences that were manually selected from the CORD-19 corpus. Afterwards, a baseline text-mining pipeline is designed and evaluated via a large batch of 100,959 sentences. We present a qualitative analysis of the most interesting facts automatically extracted and highlight possible future lines of development. The preliminary results show that general-purpose semantic models are a useful tool for discovering fine-grained knowledge in large corpora of scientific documents.
Automatic Discovery of Heterogeneous Machine Learning Pipelines: An Application to Natural Language Processing
Suilan Estevez-Velarde
Yoan Gutiérrez
Andres Montoyo
Yudivián Almeida Cruz
Proceedings of the 28th International Conference on Computational Linguistics
This paper presents AutoGOAL, a system for automatic machine learning (AutoML) that uses heterogeneous techniques. In contrast with existing AutoML approaches, our contribution can automatically build machine learning pipelines that combine techniques and algorithms from different frameworks, including shallow classifiers, natural language processing tools, and neural networks. We define the heterogeneous AutoML optimization problem as the search for the best sequence of algorithms that transforms specific input data into the desired output. This provides a novel theoretical and practical approach to AutoML. Our proposal is experimentally evaluated in diverse machine learning problems and compared with alternative approaches, showing that it is competitive with other AutoML alternatives in standard benchmarks. Furthermore, it can be applied to novel scenarios, such as several NLP tasks, where existing alternatives cannot be directly deployed. The system is freely available and includes in-built compatibility with a large number of popular machine learning frameworks, which makes our approach useful for solving practical problems with relative ease.
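The problem definition in this abstract, a search for the best sequence of algorithms that transforms the input data into the desired output, can be sketched in a few lines. This is only an illustration and does not use the AutoGOAL API: the Algorithm tuple, the registry entries, and the type names are hypothetical, and exhaustive enumeration stands in for whatever search procedure the real system uses.

```python
from collections import namedtuple

# Hypothetical description of an algorithm by the data types it consumes and produces.
Algorithm = namedtuple("Algorithm", ["name", "input_type", "output_type"])

# Toy registry mixing NLP tools, shallow classifiers, and a neural model (all assumed).
REGISTRY = [
    Algorithm("tokenizer", "Text", "Tokens"),
    Algorithm("count_vectorizer", "Tokens", "Matrix"),
    Algorithm("embedding_lookup", "Tokens", "Matrix"),
    Algorithm("logistic_regression", "Matrix", "Labels"),
    Algorithm("dense_network", "Matrix", "Labels"),
]


def enumerate_pipelines(registry, source, target, max_len=4):
    """List every type-compatible algorithm sequence that maps source to target."""
    pipelines = []

    def extend(current_type, path):
        if current_type == target and path:
            pipelines.append([algo.name for algo in path])
            return
        if len(path) == max_len:
            return
        for algo in registry:
            # An algorithm can follow only if it accepts the current data type.
            if algo.input_type == current_type and algo not in path:
                extend(algo.output_type, path + [algo])

    extend(source, [])
    return pipelines


def best_pipeline(pipelines, evaluate):
    """Pick the pipeline with the highest score under a user-supplied evaluator."""
    return max(pipelines, key=evaluate)
```

Calling enumerate_pipelines(REGISTRY, "Text", "Labels") yields the candidate pipelines, and best_pipeline selects among them once an evaluation function (for example, validation accuracy) is supplied. In practice the space is too large to enumerate, which is why a guided search is needed.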
Demo Application for the AutoGOAL Framework
Suilan Estevez-Velarde
Alejandro Piad-Morffis
Yoan Gutiérrez
Andres Montoyo
Rafael Muñoz-Guillena
Yudivián Almeida Cruz
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations
This paper introduces a web demo that showcases the main characteristics of the AutoGOAL framework. AutoGOAL is a framework in Python for automatically finding the best way to solve a given task. It has been designed mainly for automatic machine learning (AutoML) but it can be used in any scenario where several possible strategies are available to solve a given computational task. In contrast with alternative frameworks, AutoGOAL can be applied seamlessly to Natural Language Processing as well as structured classification problems. This paper presents an overview of the framework’s design and experimental evaluation in several machine learning problems, including two recent NLP challenges. The accompanying software demo is available online (https://autogoal.github.io/demo) and full source code is provided under the MIT open-source license.
Knowledge Discovery in COVID-19 Research Literature
Alejandro Piad-Morffis
Suilan Estevez-Velarde
Ernesto Luis Estevanell-Valladares
Yoan Gutiérrez
Andrés Montoyo
Rafael Muñoz
Yudivián Almeida-Cruz
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020
This paper presents the preliminary results of an ongoing project that analyzes the growing body of scientific research published around the COVID-19 pandemic. In this research, a general-purpose semantic model is used to double annotate a batch of 500 sentences that were manually selected by the researchers from the CORD-19 corpus. Afterwards, a baseline text-mining pipeline is designed and evaluated via a large batch of 100,959 sentences. We present a qualitative analysis of the most interesting facts automatically extracted and highlight possible future lines of development. The preliminary results show that general-purpose semantic models are a useful tool for discovering fine-grained knowledge in large corpora of scientific documents.
Demo Application for LETO: Learning Engine Through Ontologies
Suilan Estevez-Velarde
Andrés Montoyo
Yudivian Almeida-Cruz
Yoan Gutiérrez
Alejandro Piad-Morffis
Rafael Muñoz
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
The massive amount of multi-formatted information available on the Web necessitates the design of software systems that leverage this information to obtain knowledge that is valid and useful. The main challenge is to discover relevant information and continuously update, enrich and integrate knowledge from various sources of structured and unstructured data. This paper presents the Learning Engine Through Ontologies (LETO) framework, an architecture for the continuous and incremental discovery of knowledge from multiple sources of unstructured and structured data. We justify the main design decisions behind LETO’s architecture and evaluate the framework’s feasibility using the Internet Movie Database (IMDB) and Twitter as a practical application.
A Neural Network Component for Knowledge-Based Semantic Representations of Text
Alejandro Piad-Morffis
Rafael Muñoz
Yoan Gutiérrez
Yudivian Almeida-Cruz
Suilan Estevez-Velarde
Andrés Montoyo
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
This paper presents Semantic Neural Networks (SNNs), a knowledge-aware component based on deep learning. SNNs can be trained to encode explicit semantic knowledge from an arbitrary knowledge base, and can subsequently be combined with other deep learning architectures. At prediction time, SNNs provide a semantic encoding extracted from the input data, which can be exploited by other neural network components to build extended representation models that can face alternative problems. The SNN architecture is defined in terms of the concepts and relations present in a knowledge base. Based on this architecture, a training procedure is developed. Finally, an experimental setup is presented to illustrate the behaviour and performance of an SNN for a specific NLP problem, in this case, opinion mining for the classification of movie reviews.
AutoML Strategy Based on Grammatical Evolution: A Case Study about Knowledge Discovery from Text
Suilan Estevez-Velarde
Yoan Gutiérrez
Andrés Montoyo
Yudivián Almeida-Cruz
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
The process of extracting knowledge from natural language text poses a complex problem that requires both a combination of machine learning techniques and proper feature selection. Recent advances in Automatic Machine Learning (AutoML) provide effective tools to explore large sets of algorithms, hyper-parameters and features to find out the most suitable combination of them. This paper proposes a novel AutoML strategy based on probabilistic grammatical evolution, which is evaluated on the health domain by facing the knowledge discovery challenge in Spanish text documents. Our approach achieves state-of-the-art results and provides interesting insights into the best combination of parameters and algorithms to use when dealing with this challenge. Source code is provided for the research community.
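The idea of a probabilistic grammatical evolution search can be illustrated with a heavily simplified sketch. The abstract does not describe the grammar or the update rule, so everything below is an assumption: the grammar is flat (one choice per non-terminal, no recursion), the option names are hypothetical, and the probability update is a generic blend between the current distribution and the frequencies observed among the best individuals.

```python
import random

# Hypothetical flat grammar: each non-terminal offers a set of alternative productions.
GRAMMAR = {
    "features": ["bag_of_words", "char_ngrams", "embeddings"],
    "classifier": ["svm", "logistic_regression", "crf"],
}


def uniform_model(grammar):
    """Start with a uniform probability over the productions of each non-terminal."""
    return {nt: {opt: 1.0 / len(opts) for opt in opts} for nt, opts in grammar.items()}


def sample(model, rng):
    """Draw one individual by sampling a production for every non-terminal."""
    return {
        nt: rng.choices(list(dist), weights=list(dist.values()))[0]
        for nt, dist in model.items()
    }


def update(model, elite, learning_rate=0.5):
    """Move each distribution toward the production frequencies seen in the elite."""
    new_model = {}
    for nt, dist in model.items():
        counts = {opt: 0 for opt in dist}
        for individual in elite:
            counts[individual[nt]] += 1
        new_model[nt] = {
            opt: (1 - learning_rate) * p + learning_rate * counts[opt] / len(elite)
            for opt, p in dist.items()
        }
    return new_model


def search(grammar, fitness, generations=10, population=20, elite_size=5,
           learning_rate=0.5, seed=0):
    """Sample, evaluate, keep the best individuals, and re-estimate the model."""
    rng = random.Random(seed)
    model = uniform_model(grammar)
    best = None
    for _ in range(generations):
        individuals = sorted(
            (sample(model, rng) for _ in range(population)), key=fitness, reverse=True
        )
        if best is None or fitness(individuals[0]) > fitness(best):
            best = individuals[0]
        model = update(model, individuals[:elite_size], learning_rate)
    return best, model
```

The fitness argument is any function that scores a sampled configuration, for instance by training the corresponding pipeline and measuring F1 on held-out data. The final model also shows which productions the search came to prefer, which is one way to read off the kind of insights the abstract mentions.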
UCSC-NLP at SemEval-2017 Task 4: Sense n-grams for Sentiment Analysis in Twitter
José Abreu
Iván Castro
Claudia Martínez
Sebastián Oliva
Yoan Gutiérrez
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
This paper describes the system submitted to SemEval-2017 Task 4-A, Sentiment Analysis in Twitter, developed by the UCSC-NLP team. We studied how relationships between sense n-grams (i.e., co-occurrences of WordNet senses in the tweet) and sentiment polarities can contribute to this task. Furthermore, we evaluated the effect of discarding a large set of features based on char-grams reported in preceding works. Based on these elements, we developed an SVM system that exploits SentiWordNet as a polarity lexicon. It achieves an average F1 of 0.624. Among 39 submissions to this task, we ranked 10th.
Opinion Mining in Social Networks versus Electoral Polls
Javi Fernández
Fernando Llopis
Yoan Gutiérrez
Patricio Martínez-Barco
Álvaro Díez
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
The recent failures of traditional poll models, such as the predictions for Brexit in the United Kingdom or for the United States presidential election won by Donald Trump, have been noteworthy. With the decline of traditional poll models and the growth of social networks, automatic tools for making predictions in this context are gaining popularity. In this paper we present our approach and compare it with a real case: the 2017 French presidential election.
Natural Language Processing Technologies for Document Profiling
Antonio Guillén
Yoan Gutiérrez
Rafael Muñoz
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
Nowadays, searching for documents on the Internet is becoming increasingly difficult. The reason is the amount of content published by users (articles, comments, blogs, reviews). How can we make it easier for users to find the documents they need? What would be necessary to provide useful document meta-data for supporting search engines? In this article, we present a study of some Natural Language Processing (NLP) technologies that can be useful for facilitating the proper identification of documents according to user needs. For this purpose, we design a document profile that is able to represent semantic meta-data extracted from documents by using NLP technologies. The research focuses on the study of different NLP technologies in order to support the creation of our novel document profile proposal from a semantic perspective.
GPLSI: Supervised Sentiment Analysis in Twitter using Skipgrams
Javi Fernández
Yoan Gutiérrez
Jose Manuel Gómez
Patricio Martínez-Barco
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
UMCC_DLSI_SemSim: Multilingual System for Measuring Semantic Textual Similarity
Alexander Chávez
Héctor Dávila
Yoan Gutiérrez
Antonio Fernández-Orquín
Andrés Montoyo
Rafael Muñoz
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
UMCC_DLSI: A Probabilistic Automata for Aspect Based Sentiment Analysis
Yenier Castañeda
Armando Collazo
Elvis Crego
Jorge L. Garcia
Yoan Gutiérrez
David Tomás
Andrés Montoyo
Rafael Muñoz
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
UMCC_DLSI: Sentiment Analysis in Twitter using Polirity Lexicons and Tweet Similarity
Pedro Aniel Sánchez-Mirabal
Yarelis Ruano Torres
Suilen Hernández Alvarado
Yoan Gutiérrez
Andrés Montoyo
Rafael Muñoz
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
UO_UA: Using Latent Semantic Analysis to Build a Domain-Dependent Sentiment Resource
Reynier Ortega Bueno
Adrian Fonseca Bruzón
Carlos Muñiz Cuza
Yoan Gutiérrez
Andrés Montoyo
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
RA-SR: Using a ranking algorithm to automatically building resources for subjectivity analysis over annotated corpora
Yoan Gutiérrez
Andy González
Antonio Fernández
Andrés Montoyo
Rafael Muñoz
Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
UMCC_DLSI: Textual Similarity based on Lexical-Semantic features
Alexander Chávez
Héctor Dávila
Yoan Gutiérrez
Armando Collazo
José I. Abreu
Antonio Fernández Orquín
Andrés Montoyo
Rafael Muñoz
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity
UMCC_DLSI-(EPS): Paraphrases Detection Based on Semantic Distance
Héctor Dávila
Antonio Fernández Orquín
Alexander Chávez
Yoan Gutiérrez
Armando Collazo
José I. Abreu
Andrés Montoyo
Rafael Muñoz
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
UMCC_DLSI: Reinforcing a Ranking Algorithm with Sense Frequencies and Multidimensional Semantic Resources to solve Multilingual Word Sense Disambiguation
Yoan Gutiérrez
Yenier Castañeda
Andy González
Rainel Estrada
Dennys D. Puig
Jose I. Abreu
Roger Pérez
Antonio Fernández Orquín
Andrés Montoyo
Rafael Muñoz
Franc Camara
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
UMCC_DLSI-(SA): Using a ranking algorithm and informal features to solve Sentiment Analysis in Twitter
Yoan Gutiérrez
Andy González
Roger Pérez
José I. Abreu
Antonio Fernández Orquín
Alejandro Mosquera
Andrés Montoyo
Rafael Muñoz
Franc Camara
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
SSA-UO: Unsupervised Sentiment Analysis in Twitter
Reynier Ortega Bueno
Adrian Fonseca Bruzón
Yoan Gutiérrez
Andrés Montoyo
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
UMCC_DLSI: Semantic and Lexical features for detection and classification Drugs in biomedical texts
Armando Collazo
Alberto Ceballo
Dennys D. Puig
Yoan Gutiérrez
José I. Abreu
Roger Pérez
Antonio Fernández Orquín
Andrés Montoyo
Rafael Muñoz
Franc Camara
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
UMCC_DLSI: Multidimensional Lexical-Semantic Textual Similarity
Antonio Fernández
Yoan Gutiérrez
Héctor Dávila
Alexander Chávez
Andy González
Rainel Estrada
Yenier Castañeda
Sonia Vázquez
Andrés Montoyo
Rafael Muñoz
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
Sentiment Classification Using Semantic Features Extracted from WordNet-based Resources
Yoan Gutiérrez
Sonia Vázquez
Andrés Montoyo
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)
Improving WSD using ISR-WN with Relevant Semantic Trees and SemCor Senses Frequency
Yoan Gutiérrez
Sonia Vázquez
Andrés Montoyo
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011
UMCC-DLSI: Integrative Resource for Disambiguation Task
Yoan Gutiérrez Vázquez
Antonio Fernandez Orquín
Andrés Montoyo Guijarro
Sonia Vázquez Pérez
Proceedings of the 5th International Workshop on Semantic Evaluation