Paul Buitelaar

2022

Analysing the Correlation between Lexical Ambiguity and Translation Quality in a Multimodal Setting using WordNet
Ali Hatami | Paul Buitelaar | Mihael Arcan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

Multimodal Neural Machine Translation focuses on using visual information to translate sentences in the source language into the target language. The main idea is to utilise information from visual modalities to promote the output quality of the text-based translation model. Although the recent multimodal strategies extract the most relevant visual information in images, the effectiveness of using visual information on translation quality changes based on the text dataset. For this reason, this work studies the impact of leveraging visual information in multimodal translation models of ambiguous sentences. Our experiments analyse the Multi30k evaluation dataset and calculate ambiguity scores of sentences based on the WordNet hierarchical structure. To calculate the ambiguity of a sentence, we extract the ambiguity scores for all nouns based on the number of senses in WordNet. The main goal is to find the sentences in which visual content can improve the text-based translation model. We report the correlation between the ambiguity scores and translation quality extracted for all sentences in the English-German dataset.
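The sentence-level ambiguity score described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the per-noun scores are combined, so averaging the sense counts is an assumption, and the sense inventory below is a made-up stand-in for WordNet lookups (TOY_NOUN_SENSES and sentence_ambiguity are hypothetical names).

```python
from statistics import mean

# Made-up sense counts standing in for WordNet lookups (assumption, not real WordNet data).
TOY_NOUN_SENSES = {"bank": 10, "river": 1, "dog": 7, "man": 11}


def sentence_ambiguity(nouns, sense_counts):
    """Average number of senses over the nouns of one sentence.

    Nouns missing from the inventory are skipped; a sentence with no
    known nouns gets a score of 0.0. Averaging is an assumed aggregation.
    """
    counts = [sense_counts[noun] for noun in nouns if noun in sense_counts]
    return mean(counts) if counts else 0.0


# Nouns are assumed to have been extracted beforehand (e.g. by a POS tagger).
score = sentence_ambiguity(["bank", "river"], TOY_NOUN_SENSES)
```

Scores computed this way for every sentence could then be correlated with a per-sentence translation quality metric, which is the comparison the abstract reports.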

Linghub2: Language Resource Discovery Tool for Language Technologies
Cécile Robin | Gautham Vadakkekara Suresh | Víctor Rodriguez-Doncel | John P. McCrae | Paul Buitelaar
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Language resources are a key component of natural language processing and related research and applications. Users of language resources have different needs in terms of format, language, topics, etc. for the data they need to use. Linghub (McCrae and Cimiano, 2015) was first developed for this purpose, using the capabilities of linked data to represent metadata, and tackling the heterogeneous metadata issue. Linghub aimed at helping language resources and technology users to easily find and retrieve relevant data, and identify important information on access, topics, etc. This work describes a rejuvenation and modernisation of the 2015 platform that uses a popular open source data management system, DSpace, as its foundation. The new platform, Linghub2, contains updated and extended resources, offers more languages, and continues the work towards homogenisation of metadata through conversions and through linkage to standardisation strategies and community groups, such as the Open Digital Rights Language (ODRL) community group.

Towards Bootstrapping a Chatbot on Industrial Heritage through Term and Relation Extraction
Mihael Arcan | Rory O’Halloran | Cécile Robin | Paul Buitelaar
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

We describe initial work in developing a methodology for the automatic generation of a conversational agent or ‘chatbot’ through term and relation extraction from a relevant corpus of language data. We develop our approach in the domain of industrial heritage in the 18th and 19th centuries, and more specifically on the industrial history of canals and mills in Ireland. We collected a corpus of relevant newspaper reports and Wikipedia articles, which we deemed representative of a layman’s understanding of this topic. We used the Saffron toolkit to extract relevant terms and relations between the terms from the corpus and leveraged the extracted knowledge to query the British Library Digital Collection and the Project Gutenberg library. We leveraged the extracted terms and relations in identifying possible answers for a constructed set of questions based on the extracted terms, by matching them with sentences in the British Library Digital Collection and the Project Gutenberg library. In a final step, we then took this data set of question-answer pairs to train a chatbot. We evaluate our approach by manually assessing the appropriateness of the generated answers for a random sample, each of which is judged by four annotators.

Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion
Bharathi Raja Chakravarthi | B Bharathi | John P McCrae | Manel Zarrouk | Kalika Bali | Paul Buitelaar
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Overview of The Shared Task on Homophobia and Transphobia Detection in Social Media Comments
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Thenmozhi Durairaj | John McCrae | Paul Buitelaar | Prasanna Kumaresan | Rahul Ponnusamy
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Homophobia and Transphobia Detection is the task of identifying homophobia, transphobia, and non-anti-LGBT+ content from the given corpus. Homophobia and transphobia are both forms of toxic language directed at LGBTQ+ individuals that are described as hate speech. This paper summarizes our findings on the “Homophobia and Transphobia Detection in social media comments” shared task held at LT-EDI 2022 - ACL 2022. This shared task focused on three sub-tasks for Tamil, English, and Tamil-English (code-mixed) languages. It received 10 systems for Tamil, 13 systems for English, and 11 systems for Tamil-English. The best systems for Tamil, English, and Tamil-English scored 0.570, 0.870, and 0.610, respectively, on average macro F1-score.
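For readers unfamiliar with the metric reported above, the following is a minimal sketch of a macro-averaged F1-score: the F1 of each class is computed separately and the unweighted mean is taken. The label names and the toy data are hypothetical and are not taken from the shared task; the organisers' exact evaluation script may differ.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical three-class example (labels are placeholders).
gold_labels = ["homophobic", "none", "none", "transphobic"]
predicted = ["homophobic", "none", "homophobic", "none"]
result = macro_f1(gold_labels, predicted)
```

Because every class contributes equally, a system that ignores a rare class is penalised more heavily than it would be under accuracy or micro-averaging.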

2021

NUIG-DSI’s submission to The GEM Benchmark 2021
Nivranshu Pasricha | Mihael Arcan | Paul Buitelaar
Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

This paper describes the submission by NUIG-DSI to the GEM benchmark 2021. We participate in the modeling shared task where we submit outputs on four datasets for data-to-text generation, namely, DART, WebNLG (en), E2E and CommonGen. We follow an approach similar to the one described in the GEM benchmark paper where we use the pre-trained T5-base model for our submission. We train this model on additional monolingual data where we experiment with different masking strategies specifically focused on masking entities, predicates and concepts as well as a random masking strategy for pre-training. In our results we find that random masking performs the best in terms of automatic evaluation metrics, though the results are not statistically significantly different compared to other masking strategies.

Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion
Bharathi Raja Chakravarthi | John P. McCrae | Manel Zarrouk | Kalika Bali | Paul Buitelaar
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

Enhancing Multiple-Choice Question Answering with Causal Knowledge
Dhairya Dalal | Mihael Arcan | Paul Buitelaar
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

The task of causal question answering aims to reason about causes and effects over a provided real or hypothetical premise. Recent approaches have converged on using transformer-based language models to solve question answering tasks. However, pretrained language models often struggle when external knowledge is not present in the premise or when additional context is required to answer the question. To the best of our knowledge, no prior work has explored the efficacy of augmenting pretrained language models with external causal knowledge for multiple-choice causal question answering. In this paper, we present novel strategies for the representation of causal knowledge. Our empirical results demonstrate the efficacy of augmenting pretrained models with external causal knowledge. We show improved performance on the COPA (Choice of Plausible Alternatives) and WIQA (What If Reasoning Over Procedural Text) benchmark tasks. On the WIQA benchmark, our approach is competitive with the state-of-the-art and exceeds it within the evaluation subcategories of In-Paragraph and Out-of-Paragraph perturbations.

2020

Contextual Modulation for Relation-Level Metaphor Identification
Omnia Zayed | John P. McCrae | Paul Buitelaar
Findings of the Association for Computational Linguistics: EMNLP 2020

Identifying metaphors in text is very challenging and requires comprehending the underlying comparison. The automation of this cognitive process has gained wide attention lately. However, the majority of existing approaches concentrate on word-level identification by treating the task as either single-word classification or sequential labelling without explicitly modelling the interaction between the metaphor components. On the other hand, while existing relation-level approaches implicitly model this interaction, they ignore the context where the metaphor occurs. In this work, we address these limitations by introducing a novel architecture for identifying relation-level metaphoric expressions of certain grammatical relations based on contextual modulation. In a methodology inspired by works in visual reasoning, our approach is based on conditioning the neural network computation on the deep contextualised features of the candidate expressions using feature-wise linear modulation. We demonstrate that the proposed architecture achieves state-of-the-art results on benchmark datasets. The proposed methodology is generic and could be applied to other textual classification problems that benefit from contextual interaction.

Adaptation of Word-Level Benchmark Datasets for Relation-Level Metaphor Identification
Omnia Zayed | John Philip McCrae | Paul Buitelaar
Proceedings of the Second Workshop on Figurative Language Processing

Metaphor processing and understanding has attracted the attention of many researchers recently with an increasing number of computational approaches. A common factor among these approaches is utilising existing benchmark datasets for evaluation and comparisons. The availability, quality and size of the annotated data are among the main difficulties facing the growing research area of metaphor processing. The majority of current approaches pertaining to metaphor processing concentrate on word-level processing due to data availability. On the other hand, approaches that process metaphors on the relation-level ignore the context where the metaphoric expression occurs. This is due to the nature and format of the available data. Word-level annotation is poorly grounded theoretically and is harder to use in downstream tasks such as metaphor interpretation. The conversion from word-level to relation-level annotation is non-trivial. In this work, we attempt to fill this research gap by adapting three benchmark datasets, namely the VU Amsterdam metaphor corpus, the TroFi dataset and the TSV dataset, to suit relation-level metaphor identification. We publish the adapted datasets to facilitate future research in relation-level metaphor processing.

NUIG at SemEval-2020 Task 12: Pseudo Labelling for Offensive Content Classification
Shardul Suryawanshi | Mihael Arcan | Paul Buitelaar
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This work addresses the classification problem defined by sub-task A (English only) of the OffensEval 2020 challenge. We used a semi-supervised approach to classify given tweets into an offensive (OFF) or not-offensive (NOT) class. As the OffensEval 2020 dataset is loosely labelled with confidence scores given by unsupervised models, we used last year’s offensive language identification dataset (OLID) to label the OffensEval 2020 dataset. Our approach uses a pseudo-labelling method to annotate the current dataset. We trained four text classifiers on the OLID dataset, and the classifier with the highest macro-averaged F1-score was used to pseudo-label the OffensEval 2020 dataset. The same model, which performed best amongst the four text classifiers on the OLID dataset, was then trained on the combined dataset of OLID and pseudo-labelled OffensEval 2020. We evaluated the classifiers with precision, recall and macro-averaged F1-score as the primary evaluation metric on the OLID and OffensEval 2020 datasets. This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
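The pseudo-labelling procedure summarised in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' system: the two toy classifiers, their macro-F1 values, and the example tweets are all made up, and pseudo_label is a hypothetical helper name.

```python
def pseudo_label(classifiers, macro_f1_scores, unlabelled_texts):
    """Select the classifier with the highest macro-F1 and label new texts with it."""
    best_name = max(classifiers, key=lambda name: macro_f1_scores[name])
    predict = classifiers[best_name]
    return best_name, [(text, predict(text)) for text in unlabelled_texts]


# Hypothetical stand-ins for classifiers trained on a labelled dataset.
def keyword_classifier(text):
    return "OFF" if "idiot" in text.lower() else "NOT"


def majority_classifier(text):
    return "NOT"


classifiers = {"keyword": keyword_classifier, "majority": majority_classifier}
macro_f1_scores = {"keyword": 0.71, "majority": 0.42}  # made-up validation scores

labelled = [("What an idiot", "OFF"), ("Lovely weather", "NOT")]  # toy labelled data
unlabelled = ["You idiot", "Nice day"]  # toy unlabelled data

best_name, pseudo_labelled = pseudo_label(classifiers, macro_f1_scores, unlabelled)

# The selected model would then be retrained on the union of both sets.
combined_training_set = labelled + pseudo_labelled
```

In practice the classifiers would be trained models and the scores would come from held-out evaluation; the selection and relabelling steps stay the same.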

A Dataset for Troll Classification of TamilMemes
Shardul Suryawanshi | Bharathi Raja Chakravarthi | Pranav Verma | Mihael Arcan | John Philip McCrae | Paul Buitelaar
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among people. This exchange is not free from offensive, trolling or malicious content targeting users or communities. One way of trolling is by making memes, which in most cases combine an image with a concept or catchphrase. The challenge of dealing with memes is that they are region-specific and their meaning is often obscured in humour or sarcasm. To facilitate the computational modelling of trolling in memes for Indian languages, we created a meme dataset for Tamil (TamilMemes). We annotated and released the dataset containing suspected troll and not-troll memes. In this paper, we use image classification to address the difficulties involved in the classification of troll memes with the existing methods. We found that the identification of a troll meme with such an image classifier is not feasible, which has been corroborated with precision, recall and F1-score.

Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text
Shardul Suryawanshi | Bharathi Raja Chakravarthi | Mihael Arcan | Paul Buitelaar
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

A meme is a form of media that spreads an idea or emotion across the internet. As posting memes has become a new form of communication on the web, and due to the multimodal nature of memes, postings of hateful memes or related events like trolling and cyberbullying are increasing day by day. Hate speech, offensive content and aggression content detection have been extensively explored in a single modality such as text or image. However, combining two modalities to detect offensive content is still a developing area. Memes make it even more challenging since they express humour and sarcasm in an implicit way, because of which the meme may not be offensive if we only consider the text or the image. Therefore, it is necessary to combine both modalities to identify whether a given meme is offensive or not. Since there was no publicly available dataset for multimodal offensive meme content detection, we leveraged the memes related to the 2016 U.S. presidential election and created the MultiOFF multimodal meme dataset for offensive content detection. We subsequently developed a classifier for this task using the MultiOFF dataset. We use an early fusion technique to combine the image and text modality and compare it with a text- and an image-only baseline to investigate its effectiveness. Our results show improvements in terms of Precision, Recall, and F-Score. The code and dataset for this paper are published at https://github.com/bharathichezhiyan/Multimodal-Meme-Classification-Identifying-Offensive-Content-in-Image-and-Text

A Term Extraction Approach to Survey Analysis in Health Care
Cécile Robin | Mona Isazad Mashinchi | Fatemeh Ahmadi Zeleti | Adegboyega Ojo | Paul Buitelaar
Proceedings of the Twelfth Language Resources and Evaluation Conference

The voice of the customer has for a long time been a key focus of businesses in all domains. It has received a lot of attention from the research community in Natural Language Processing (NLP), resulting in many approaches to analyzing customer feedback ((aspect-based) sentiment analysis, topic modeling, etc.). In the health domain, public and private bodies are increasingly prioritizing patient engagement for assessing the quality of the service given at each stage of the care. Patient and customer satisfaction analysis relate in many ways. In the domain of health particularly, a more precise and insightful analysis is needed to help practitioners locate potential issues and plan actions accordingly. We introduce here an approach to patient experience with the analysis of free text questions from the 2017 Irish National Inpatient Survey campaign using term extraction as a means to highlight important and insightful subject matters raised by patients. We evaluate the results by mapping them to a manually constructed framework following the Activity, Resource, Context (ARC) methodology (Ordenes, 2014) and specific to the health care environment, and compare our results against manual annotations done on the full 2017 dataset based on those categories.

Evaluation Dataset and Methodology for Extracting Application-Specific Taxonomies from the Wikipedia Knowledge Graph
Georgeta Bordea | Stefano Faralli | Fleur Mougin | Paul Buitelaar | Gayo Diallo
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this work, we address the task of extracting application-specific taxonomies from the category hierarchy of Wikipedia. Previous work on pruning the Wikipedia knowledge graph relied on silver standard taxonomies which can only be automatically extracted for a small subset of domains rooted in relatively focused nodes, placed at an intermediate level in the knowledge graphs. In this work, we propose an iterative methodology to extract an application-specific gold standard dataset from a knowledge graph and an evaluation framework to comparatively assess the quality of noisy automatically extracted taxonomies. We employ an existing state of the art algorithm in an iterative manner and we propose several sampling strategies to reduce the amount of manual work needed for evaluation. A first gold standard dataset is released to the research community for this task along with a companion evaluation framework. This dataset addresses a real-world application from the medical domain, namely the extraction of food-drug and herb-drug interactions.

Figure Me Out: A Gold Standard Dataset for Metaphor Interpretation
Omnia Zayed | John Philip McCrae | Paul Buitelaar
Proceedings of the Twelfth Language Resources and Evaluation Conference

Metaphor comprehension and understanding is a complex cognitive task that requires interpreting metaphors by grasping the interaction between the meaning of their target and source concepts. This is very challenging for humans, let alone computers. Thus, automatic metaphor interpretation is understudied in part due to the lack of publicly available datasets. The creation and manual annotation of such datasets is a demanding task which requires huge cognitive effort and time. Moreover, there will always be a question of accuracy and consistency of the annotated data due to the subjective nature of the problem. This work addresses these issues by presenting an annotation scheme to interpret verb-noun metaphoric expressions in text. The proposed approach is designed with the goal of reducing the workload on annotators and maintaining consistency. Our methodology employs an automatic retrieval approach which utilises external lexical resources, word embeddings and semantic similarity to generate possible interpretations of identified metaphors in order to enable quick and accurate annotation. We validate our proposed approach by annotating around 1,500 metaphors in tweets which were annotated by six native English speakers. As a result of this work, we publish as linked data the first gold standard dataset for metaphor interpretation which will facilitate research in this area.

Utilising Knowledge Graph Embeddings for Data-to-Text Generation
Nivranshu Pasricha | Mihael Arcan | Paul Buitelaar
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

Data-to-text generation has recently seen a move away from modular and pipeline architectures towards end-to-end architectures based on neural networks. In this work, we employ knowledge graph embeddings and explore their utility for end-to-end approaches in a data-to-text generation task. Our experiments show that using knowledge graph embeddings can yield an improvement of up to 2–3 BLEU points for seen categories on the WebNLG corpus without modifying the underlying neural network architecture.

NUIG-DSI at the WebNLG+ challenge: Leveraging Transfer Learning for RDF-to-text generation
Nivranshu Pasricha | Mihael Arcan | Paul Buitelaar
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

This paper describes the system submitted by NUIG-DSI to the WebNLG+ challenge 2020 in the RDF-to-text generation task for the English language. For this challenge, we leverage transfer learning by adopting the T5 model architecture for our submission and fine-tune the model on the WebNLG+ corpus. Our submission ranks among the top five systems for most of the automatic evaluation metrics achieving a BLEU score of 51.74 over all categories with scores of 58.23 and 45.57 across seen and unseen categories respectively.

2019

SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums
Sapna Negi | Tobias Daudert | Paul Buitelaar
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the pilot SemEval task on Suggestion Mining. The task consists of subtasks A and B, where we created labeled data from feedback forum and hotel reviews respectively. Subtask A provides training and test data from the same domain, while Subtask B evaluates the system on a test dataset from a different domain than the available training data. 33 teams participated in the shared task, with a total of 50 members. We summarize the problem definition, benchmark dataset preparation, and methods used by the participating teams, providing details of the methods used by the top ranked systems. The dataset is made freely available to help advance the research in suggestion mining, and reproduce the systems submitted under this task.

2018

Automatic Enrichment of Terminological Resources: the IATE RDF Example
Mihael Arcan | Elena Montiel-Ponsoda | John P. McCrae | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A Comparison Of Emotion Annotation Schemes And A New Annotated Data Set
Ian Wood | John P. McCrae | Vladimir Andryushechkin | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A supervised approach to taxonomy extraction using word embeddings
Rajdeep Sarkar | John P. McCrae | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Teanga: A Linked Data based platform for Natural Language Processing
Housam Ziad | John P. McCrae | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Phrase-Level Metaphor Identification Using Distributed Representations of Word Meaning
Omnia Zayed | John Philip McCrae | Paul Buitelaar
Proceedings of the Workshop on Figurative Language Processing

Metaphor is an essential element of human cognition which is often used to express ideas and emotions that might be difficult to express using literal language. Processing metaphoric language is a challenging task for a wide range of applications ranging from text simplification to psychotherapy. Despite the variety of approaches that are trying to process metaphor, there is still a need for better models that mimic the human cognition while exploiting fewer resources. In this paper, we present an approach based on distributional semantics to identify metaphors on the phrase-level. We investigated the use of different word embeddings models to identify verb-noun pairs where the verb is used metaphorically. Several experiments are conducted to show the performance of the proposed approach on benchmark datasets.

Leveraging News Sentiment to Improve Microblog Sentiment Classification in the Financial Domain
Tobias Daudert | Paul Buitelaar | Sapna Negi
Proceedings of the First Workshop on Economics and Natural Language Processing

With the rising popularity of social media in society and in research, analysing texts short in length, such as microblogs, becomes an increasingly important task. As a medium of communication, microblogs carry people's sentiments and express them to the public. Given that sentiments are driven by multiple factors including the news media, the question arises if the sentiment expressed in news and the news articles themselves can be leveraged to detect and classify sentiment in microblogs. Prior research has highlighted the impact of sentiments and opinions on the market dynamics, making the financial domain a prime case study for this approach. Therefore, this paper describes ongoing research dealing with the exploitation of news-contained sentiment to improve microblog sentiment classification in a financial context.

Linking News Sentiment to Microblogs: A Distributional Semantics Approach to Enhance Microblog Sentiment Classification
Tobias Daudert | Paul Buitelaar
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Social media’s popularity in society and research is gaining momentum and simultaneously increasing the importance of short textual content such as microblogs. Microblogs are affected by many factors including the news media; therefore, we exploit sentiments conveyed from news to detect and classify sentiment in microblogs. Given that texts can deal with the same entity but might not be vastly related when it comes to sentiment, it becomes necessary to introduce further measures ensuring the relatedness of texts while leveraging the contained sentiments. This paper describes ongoing research introducing distributional semantics to improve the exploitation of news-contained sentiment to enhance microblog sentiment classification.

2016

Forecasting Emerging Trends from Scientific Literature
Kartik Asooja | Georgeta Bordea | Gabriela Vulcu | Paul Buitelaar
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Text analysis methods for the automatic identification of emerging technologies by analyzing scientific publications are gaining attention because of their socio-economic impact. The approaches so far have been mainly focused on retrospective analysis by mapping scientific topic evolution over time. We propose regression based approaches to predict future keyword distribution. The prediction is based on historical data of the keywords, which in our case, are LREC conference proceedings. Considering the insufficient number of data points available from LREC proceedings, we do not employ standard time series forecasting methods. We form a dataset by extracting the keywords from previous years' proceedings and quantify their yearly relevance using tf-idf scores. This dataset additionally contains ranked lists of related keywords and experts for each keyword.
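The idea of quantifying yearly keyword relevance with tf-idf can be sketched as below. This is only one plausible reading of the abstract: treating each year's proceedings as a single document and using the plain tf * log(N / df) weighting are assumptions, and the keyword lists are invented for illustration.

```python
import math
from collections import Counter


def yearly_tfidf(keywords_by_year):
    """Return {year: {keyword: tf-idf}} with each year treated as one document."""
    number_of_years = len(keywords_by_year)
    # Document frequency: in how many years does each keyword appear?
    document_frequency = Counter()
    for keywords in keywords_by_year.values():
        document_frequency.update(set(keywords))

    scores = {}
    for year, keywords in keywords_by_year.items():
        term_counts = Counter(keywords)
        total = len(keywords)
        scores[year] = {
            keyword: (count / total)
            * math.log(number_of_years / document_frequency[keyword])
            for keyword, count in term_counts.items()
        }
    return scores


# Invented keyword occurrences per conference year (not real LREC data).
toy_keywords = {
    2012: ["wordnet", "wordnet", "parsing"],
    2014: ["parsing", "embeddings"],
}
relevance = yearly_tfidf(toy_keywords)
```

A keyword that occurs in every year receives a score of zero under this weighting, so the resulting per-year series highlights keywords whose use is concentrated in particular years; such series could then serve as input to a regression model.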

IRIS: English-Irish Machine Translation System
Mihael Arcan | Caoilfhionn Lane | Eoin Ó Droighneáin | Paul Buitelaar
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe IRIS, a statistical machine translation (SMT) system for translating from English into Irish and vice versa. Since Irish is considered an under-resourced language with a limited amount of machine-readable text, building a machine translation system that produces reasonable translations is rather challenging. As translation is a difficult task, current research in SMT focuses on obtaining statistics either from a large amount of parallel, monolingual or other multilingual resources. Nevertheless, we collected available English-Irish data and developed an SMT system aimed at supporting human translators and enabling cross-lingual language technology tasks.

Generating a Large-Scale Entity Linking Dictionary from Wikipedia Link Structure and Article Text
Ravindra Harige | Paul Buitelaar
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Wikipedia has been increasingly used as a knowledge base for open-domain Named Entity Linking and Disambiguation. In this task, a dictionary with entity surface forms plays an important role in finding a set of candidate entities for the mentions in text. Existing dictionaries mostly rely on the Wikipedia link structure, like anchor texts, redirect links and disambiguation links. In this paper, we introduce a dictionary for Entity Linking that includes name variations extracted from Wikipedia article text, in addition to name variations derived from the Wikipedia link structure. With this approach, we show an increase in the coverage of entities and their mentions in the dictionary in comparison to other Wikipedia based dictionaries.

NUIG-UNLP at SemEval-2016 Task 1: Soft Alignment and Deep Learning for Semantic Textual Similarity
John Philip McCrae | Kartik Asooja | Nitish Aggarwal | Paul Buitelaar
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

SemEval-2016 Task 13: Taxonomy Extraction Evaluation (TExEval-2)
Georgeta Bordea | Els Lefever | Paul Buitelaar
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

A Study of Suggestions in Opinionated Texts and their Automatic Detection
Sapna Negi | Kartik Asooja | Shubham Mehrotra | Paul Buitelaar
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

Expanding wordnets to new languages with multilingual sense disambiguation
Mihael Arcan | John Philip McCrae | Paul Buitelaar
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Princeton WordNet is one of the most important resources for natural language processing, but is only available for English. While it has been translated using the expand approach to many other languages, this is an expensive manual process. Therefore it would be beneficial to have a high-quality automatic translation approach that would support NLP techniques, which rely on WordNet in new languages. The translation of wordnets is fundamentally complex because of the need to translate all senses of a word including low frequency senses, which is very challenging for current machine translation approaches. For this reason we leverage existing translations of WordNet in other languages to identify contextual information for wordnet senses from a large set of generic parallel corpora. We evaluate our approach using 10 translated wordnets for European languages. Our experiment shows a significant improvement over translation without any contextual information. Furthermore, we evaluate how the choice of pivot languages affects performance of multilingual word sense disambiguation.

2015

Non-Orthogonal Explicit Semantic Analysis
Nitish Aggarwal | Kartik Asooja | Georgeta Bordea | Paul Buitelaar
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

SemEval-2015 Task 17: Taxonomy Extraction Evaluation (TExEval)
Georgeta Bordea | Paul Buitelaar | Stefano Faralli | Roberto Navigli
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf
MixedEmotions: Social Semantic Emotion Analysis for Innovative Multilingual Big Data Analytics Markets
Mihael Arcan | Paul Buitelaar
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Towards the Extraction of Customer-to-Customer Suggestions from Reviews
Sapna Negi | Paul Buitelaar
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
Knowledge Portability with Semantic Expansion of Ontology Labels
Mihael Arcan | Marco Turchi | Paul Buitelaar
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
Curse or Boon? Presence of Subjunctive Mood in Opinionated Text
Sapna Negi | Paul Buitelaar
Proceedings of the 11th International Conference on Computational Semantics

2014

pdf
Exploring ESA to Improve Word Relatedness
Nitish Aggarwal | Kartik Asooja | Paul Buitelaar
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

pdf
INSIGHT Galway: Syntactic and Lexical Features for Aspect Based Sentiment Analysis
Sapna Negi | Paul Buitelaar
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf abs
Enhancing statistical machine translation with bilingual terminology in a CAT environment
Mihael Arcan | Marco Turchi | Sara Topelli | Paul Buitelaar
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

In this paper, we address the problem of extracting and integrating bilingual terminology into a Statistical Machine Translation (SMT) system for a Computer Aided Translation (CAT) tool scenario. We develop a framework that, taking as input a small amount of parallel in-domain data, gathers domain-specific bilingual terms and injects them in an SMT system to enhance the translation productivity. Therefore, we investigate several strategies to extract and align bilingual terminology, and to embed it into the SMT. We compare two embedding methods that can be easily used at run-time without altering the normal activity of an SMT system: XML markup and the cache-based model. We tested our framework on two different domains showing improvements up to 15% BLEU score points.

pdf
Using Distributional Semantics to Trace Influence and Imitation in Romantic Orientalist Poetry
Nitish Aggarwal | Justin Tonra | Paul Buitelaar
Proceedings of the First AHA!-Workshop on Information Discovery in Text

pdf
Identification of Bilingual Terms from Monolingual Documents for Statistical Machine Translation
Mihael Arcan | Claudio Giuliano | Marco Turchi | Paul Buitelaar
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)

pdf abs
Missed opportunities in translation memory matching
Friedel Wolff | Laurette Pretorius | Paul Buitelaar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

A translation memory system stores a data set of source-target pairs of translations. It attempts to respond to a query in the source language with a useful target text from the data set to assist a human translator. Such systems estimate the usefulness of a target text suggestion according to the similarity of its associated source text to the source text query. This study analyses two data sets in two language pairs each to find highly similar target texts, which would be useful mutual suggestions. We further investigate which of these useful suggestions cannot be selected through source text similarity, and we do a thorough analysis of these cases to categorise and quantify them. This analysis provides insight into areas where the recall of translation memory systems can be improved. Specifically, source texts with an omission and semantically very similar source texts are some of the more frequent cases with useful target text suggestions that are not selected with the baseline approach of simple edit distance between the source texts.

pdf abs
Hot Topics and Schisms in NLP: Community and Trend Analysis with Saffron on ACL and LREC Proceedings
Paul Buitelaar | Georgeta Bordea | Barry Coughlan
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present a comparative analysis of two series of conferences in the field of Computational Linguistics, the LREC conference and the ACL conference. Conference proceedings were analysed using Saffron by performing term extraction and topical hierarchy construction with the goal of analysing topic trends and research communities. The system aims to provide insight into a research community and to guide publication and participation strategies, especially of novice researchers.

2013

pdf bib
Linguistic Linked Data for Sentiment Analysis
Paul Buitelaar | Mihael Arcan | Carlos Iglesias | Fernando Sánchez-Rada | Carlo Strapparava
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

pdf bib
Translating the FINREP Taxonomy using a Domain-specific Corpus
Mihael Arcan | Susan Marie Thomas | Derek De Brandt | Paul Buitelaar
Proceedings of Machine Translation Summit XIV: Posters

pdf
MONNET: Multilingual Ontologies for Networked Knowledge
Mihael Arcan | Paul Buitelaar
Proceedings of Machine Translation Summit XIV: European projects

pdf
Ontology Label Translation
Mihael Arcan | Paul Buitelaar
Proceedings of the 2013 NAACL HLT Student Research Workshop

2012

pdf abs
Semi-Supervised Technical Term Tagging With Minimal User Feedback
Behrang QasemiZadeh | Paul Buitelaar | Tianqi Chen | Georgeta Bordea
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we address the problem of extracting technical terms automatically from an unannotated corpus. We introduce a technology term tagger that is based on Liblinear Support Vector Machines and employs linguistic features including Part of Speech tags and Dependency Structures, in addition to user feedback to perform the task of identification of technology related terms. Our experiments show the applicability of our approach as witnessed by acceptable results on precision and recall.

pdf abs
Expertise Mining for Enterprise Content Management
Georgeta Bordea | Sabrina Kirrane | Paul Buitelaar | Bianca Pereira
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Enterprise content analysis and platform configuration for enterprise content management is often carried out by external consultants that are not necessarily domain experts. In this paper, we propose a set of methods for automatic content analysis that allow users to gain a high level view of the enterprise content. Here, a main concern is the automatic identification of key stakeholders that should ideally be involved in analysis interviews. The proposed approach employs recent advances in term extraction, semantic term grounding, expert profiling and expert finding in an enterprise content management setting. Extracted terms are evaluated using human judges, while term grounding is evaluated using a manually created gold standard for the DBpedia datasource.

pdf
Using Domain-specific and Collaborative Resources for Term Translation
Mihael Arcan | Christian Federmann | Paul Buitelaar
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
DERI&UPM: Pushing Corpus Based Relatedness to Similarity: Shared Task System Description
Nitish Aggarwal | Kartik Asooja | Paul Buitelaar
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf
Experiments with Term Translation
Mihael Arcan | Christian Federmann | Paul Buitelaar
Proceedings of COLING 2012

OntoSelect is a dynamic web-based ontology library that harvests, analyzes and organizes ontologies published on the Semantic Web. OntoSelect allows searching as well as browsing of ontologies according to size (number of classes, properties), representation format (DAML, RDFS, OWL), connectedness (score over the number of included and referring ontologies) and human languages used for class- and object property-labels. Ontology search in OntoSelect is based on a combined measure of coverage, structure and connectedness. Further, and in contrast to other ontology search engines, OntoSelect provides ontology search based on a complete web document instead of one or more keywords only.

pdf abs
Domain-Specific English-To-Spanish Translation of FrameNet
Mario Crespo Miguel | Paul Buitelaar
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper is motivated by the demand for more linguistic resources for the study of languages and the improvement of those already existing. The first step in our work is the selection of the most significant frames in the English FrameNet according to a representative medical corpus. These frames were subsequently attached to different EuroWordNet synsets and translated into Spanish. Results show that the translation was made with high accuracy (95.9% correct words). In addition, the original English lexical units were augmented with new units by 120%.

pdf
Statistical Term Profiling for Query Pattern Mining
Paul Buitelaar | Pinar Oezden Wennerberg | Sonja Zillner
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

2006

pdf abs
Ontology-based Information Extraction with SOBA
Paul Buitelaar | Philipp Cimiano | Stefania Racioppa | Melanie Siegel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe SOBA, a sub-component of the SmartWeb multi-modal dialog system. SOBA is a component for ontology-based information extraction from soccer web pages for automatic population of a knowledge base that can be used for domain-specific question answering. SOBA realizes a tight connection between the ontology, knowledge base and the information extraction component. The originality of SOBA is in the fact that it extracts information from heterogeneous sources such as tabular structures, text and image captions in a semantically integrated way. In particular, it stores extracted information in a knowledge base, and in turn uses the knowledge base to interpret and link newly extracted information with respect to already existing entities.