Diana Inkpen

Also published as: Diana Zaiu, Diana Zaiu Inkpen

2024

Explainable Depression Detection Using Large Language Models on Social Media Data
Yuxi Wang | Diana Inkpen | Prasadith Kirinde Gamaarachchige
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

Due to the rapid growth of user interaction on different social media platforms, publicly available social media data has increased substantially. The sheer amount of data and level of personal information being shared on such platforms has made analyzing textual information to predict mental disorders such as depression a reliable preliminary step when it comes to psychometrics. In this study, we first propose a system that searches for texts related to depression symptoms from the Beck’s Depression Inventory (BDI) questionnaire and provides a ranking for further investigation in a second step. Then, in this second step, we address the even more challenging task of automatic depression level detection, using writings and voluntary answers provided by users on Reddit. Several Large Language Models (LLMs) were applied in experiments. Our proposed system based on LLMs can generate both predictions and explanations for each question. By combining two LLMs for different questions, we achieved better performance on three of four metrics compared to the state-of-the-art and remained competitive on the one remaining metric. In addition, our system is explainable on two levels: first, knowing the answers to the BDI questions provides clues about the possible symptoms that could lead to a clinical diagnosis of depression; second, our system can explain the predicted answer for each question.

2023

uOttawa at SemEval-2023 Task 6: Deep Learning for Legal Text Understanding
Intisar Almuslim | Sean Stilwell | Surya Kiran Suresh | Diana Inkpen
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

We describe the methods we used for legal text understanding, specifically Task 6 Legal-Eval at SemEval 2023. The outcomes could assist law practitioners and help automate the working process of judicial systems. The shared task defined three main sub-tasks: sub-task A, Rhetorical Roles Prediction (RR); sub-task B, Legal Named Entities Extraction (L-NER); and sub-task C, Court Judgement Prediction with Explanation (CJPE). Our team addressed all three sub-tasks by exploring various Deep Learning (DL) based models. Overall, our team’s approaches achieved promising results on all three sub-tasks, demonstrating the potential of deep learning-based models in the judicial domain.

2022

Multi-Task Learning to Capture Changes in Mood Over Time
Prasadith Kirinde Gamaarachchige | Ahmed Husseini Orabi | Mahmoud Husseini Orabi | Diana Inkpen
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

This paper investigates the impact of using Multi-Task Learning (MTL) to predict mood changes over time for each individual (social media user). The presented models were developed as a part of the Computational Linguistics and Clinical Psychology (CLPsych) 2022 shared task. Given the limited number of Reddit social media users, as well as their posts, we decided to experiment with different multi-task learning architectures to identify to what extent knowledge can be shared among similar tasks. Due to class imbalance at both post and user levels and to accommodate task alignment, we randomly sampled an equal number of instances from the respective classes and performed ensemble learning to reduce prediction variance. Faced with several constraints, we managed to produce competitive results that could provide insights into the use of multi-task learning to identify mood changes over time and suicide ideation risk.

Detecting Relevant Differences Between Similar Legal Texts
Xiang Li | Jiaxun Gao | Diana Inkpen | Wolfgang Alschner
Proceedings of the Natural Legal Language Processing Workshop 2022

Given two similar legal texts, it is useful to be able to focus only on the parts that contain relevant differences. However, because of variation in linguistic structure and terminology, it is not easy to identify true semantic differences. An accurate difference detection model between similar legal texts is therefore in demand, in order to increase the efficiency of legal research and document analysis. In this paper, we automatically label a training dataset of sentence pairs using an existing legal resource of international investment treaties that were already manually annotated with metadata. Then we propose models based on state-of-the-art deep learning techniques for the novel task of detecting relevant differences. In addition to providing solutions for this task, we include models for automatically producing metadata for the treaties that do not have it.

2021

Conditional Adversarial Networks for Multi-Domain Text Classification
Yuan Wu | Diana Inkpen | Ahmed El-Roby
Proceedings of the Second Workshop on Domain Adaptation for NLP

In this paper, we propose conditional adversarial networks (CANs), a framework that explores the relationship between the shared features and the label predictions to impose stronger discriminability to the learned features, for multi-domain text classification (MDTC). The proposed CAN introduces a conditional domain discriminator to model the domain variance in both the shared feature representations and the class-aware information simultaneously, and adopts entropy conditioning to guarantee the transferability of the shared features. We provide theoretical analysis for the CAN framework, showing that CAN’s objective is equivalent to minimizing the total divergence among multiple joint distributions of shared features and label predictions. Therefore, CAN is a theoretically sound adversarial network that discriminates over multiple distributions. Evaluation results on two MDTC benchmarks show that CAN outperforms prior methods. Further experiments demonstrate that CAN has a good ability to generalize learned knowledge to unseen domains.

Traitement Automatique des Langues, Volume 62, Numéro 2 : Nouvelles applications du TAL [New applications in NLP]
Géraldine Damnati | Diana Inkpen
Traitement Automatique des Langues, Volume 62, Numéro 2 : Nouvelles applications du TAL [New applications in NLP]

Nouvelles applications du TAL [New applications in NLP]
Géraldine Damnati | Diana Inkpen
Traitement Automatique des Langues, Volume 62, Numéro 2 : Nouvelles applications du TAL [New applications in NLP]

2019

Multi-Task, Multi-Channel, Multi-Input Learning for Mental Illness Detection using Social Media Text
Prasadith Kirinde Gamaarachchige | Diana Inkpen
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

We investigate the impact of using emotional patterns identified by clinical practitioners and computational linguists to enhance the prediction capabilities of a mental illness detection model (in our case, for depression and post-traumatic stress disorder) built using a deep neural network architecture. Over the years, deep learning methods have been successfully used in natural language processing tasks, including a few in the domain of mental illness and suicide ideation detection. We illustrate the effectiveness of using multi-task learning with a multi-channel convolutional neural network as the shared representation, and we use additional inputs identified by researchers as indicative of mental disorders to enhance the model's predictive ability. Given the limited amount of unstructured data available for training, we managed to obtain a task-specific AUC higher than 0.90. In comparison to methods such as multi-class classification, we identified multi-task learning with a multi-channel convolutional neural network and multiple inputs to be effective in detecting mental disorders.

Semantics and Homothetic Clustering of Hafez Poetry
Arya Rahgozar | Diana Inkpen
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We have created two sets of labels for Hafez (1315-1390) poems, using unsupervised learning. Our labels are the only semantic clustering alternative to the previously existing, hand-labeled, gold-standard classification of Hafez poems, to be used for literary research. We have cross-referenced, measured and analyzed the agreements of our clustering labels with Houman’s chronological classes. Our features are based on topic modeling and word embeddings. We also introduced "similarity of similarities" features, an approach we call homothetic clustering, which proved effective for Hafez’s small corpus of ghazals. Although all our experiments showed different clusters when compared with Houman’s classes, we think they were valid in their own right, as they provided further insights and proved useful as a contrasting alternative to Houman’s classes. Our homothetic clusterer and its feature design and engineering framework can be used for further semantic analysis of Hafez’s poetry and other similar literary research.

2018

Neural Natural Language Inference Models Enhanced with External Knowledge
Qian Chen | Xiaodan Zhu | Zhen-Hua Ling | Diana Inkpen | Si Wei
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Modeling natural language inference is a very challenging task. With the availability of large annotated data, it has recently become feasible to train complex models such as neural-network-based inference models, which have been shown to achieve state-of-the-art performance. Although relatively large annotated datasets exist, can machines learn all the knowledge needed to perform natural language inference (NLI) from these data? If not, how can neural-network-based NLI models benefit from external knowledge, and how can NLI models be built to leverage it? In this paper, we enrich the state-of-the-art neural natural language inference models with external knowledge. We demonstrate that the proposed models improve neural NLI models to achieve state-of-the-art performance on the SNLI and MultiNLI datasets.

Authorship Identification for Literary Book Recommendations
Haifa Alharthi | Diana Inkpen | Stan Szpakowicz
Proceedings of the 27th International Conference on Computational Linguistics

Book recommender systems can help promote the practice of reading for pleasure, which has been declining in recent years. One factor that influences reading preferences is writing style. We propose a system that recommends books after learning their authors’ style. To our knowledge, this is the first work that applies the information learned by an author-identification model to book recommendations. We evaluated the system according to a top-k recommendation scenario. Our system gives better accuracy when compared with many state-of-the-art methods. We also conducted a qualitative analysis by checking if similar books/authors were annotated similarly by experts.

Introduction to the Special Issue on Language in Social Media: Exploiting Discourse and Other Contextual Information
Farah Benamara | Diana Inkpen | Maite Taboada
Computational Linguistics, Volume 44, Issue 4 - December 2018

Social media content is changing the way people interact with each other and share information, personal messages, and opinions about situations, objects, and past experiences. Most social media texts are short online conversational posts or comments that do not contain enough information for natural language processing (NLP) tools, as they are often accompanied by non-linguistic contextual information, including meta-data (e.g., the user’s profile, the social network of the user, and their interactions with other users). Exploiting such different types of context and their interactions makes the automatic processing of social media texts a challenging research task. Indeed, simply applying traditional text mining tools is clearly sub-optimal, as, typically, these tools take into account neither the interactive dimension nor the particular nature of this data, which shares properties with both spoken and written language. This special issue contributes to a deeper understanding of the role of these interactions to process social media data from a new perspective in discourse interpretation. This introduction first provides the necessary background to understand what context is from both the linguistic and computational linguistic perspectives, then presents the most recent context-based approaches to NLP for social media. We conclude with an overview of the papers accepted in this special issue, highlighting what we believe are the future directions in processing social media texts.

Deep Learning for Depression Detection of Twitter Users
Ahmed Husseini Orabi | Prasadith Buddhitha | Mahmoud Husseini Orabi | Diana Inkpen
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

Mental illness detection in social media can be considered a complex task, mainly due to the complicated nature of mental disorders. In recent years, this research area has started to evolve with the continuous increase in popularity of social media platforms that became an integral part of people’s life. This close relationship between social media platforms and their users has made these platforms reflect the users’ personal life with different limitations. In such an environment, researchers are presented with a wealth of information regarding one’s life. In addition to the level of complexity in identifying mental illnesses through social media platforms, adopting supervised machine learning approaches such as deep neural networks has not been widely accepted due to the difficulties in obtaining sufficient amounts of annotated training data. For these reasons, we try to identify the most effective deep neural network architecture among a few selected architectures that were successfully used in natural language processing tasks. The chosen architectures are used to detect users with signs of mental illnesses (depression in our case) given limited unstructured text data extracted from the Twitter social media platform.

Cyberbullying Intervention Based on Convolutional Neural Networks
Qianjia Huang | Diana Inkpen | Jianhong Zhang | David Van Bruwaene
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

This paper describes the process of building a cyberbullying intervention interface driven by a machine-learning based text-classification service. We make two main contributions. First, we show that cyberbullying can be identified in real-time before it takes place, with available machine learning and natural language processing tools. Second, we present a mechanism that provides individuals with early feedback about how other people would feel about wording choices in their messages before they are sent out. This interface not only gives a chance for the user to revise the text, but also provides a system-level flagging/intervention in a situation related to cyberbullying.

Cyber-aggression Detection using Cross Segment-and-Concatenate Multi-Task Learning from Text
Ahmed Husseini Orabi | Mahmoud Husseini Orabi | Qianjia Huang | Diana Inkpen | David Van Bruwaene
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

In this paper, we propose a novel deep-learning architecture for text classification, named cross segment-and-concatenate multi-task learning (CSC-MTL). We use CSC-MTL to improve the performance of cyber-aggression detection from text. Our approach provides a robust shared feature representation for multi-task learning by detecting contrasts and similarities among polarity and neutral classes. We participated in the cyber-aggression shared task under the team name uOttawa. We report 59.74% F1 performance for the Facebook test set and 56.9% for the Twitter test set, for detecting aggression from text.

uOttawa at SemEval-2018 Task 1: Self-Attentive Hybrid GRU-Based Network
Ahmed Husseini Orabi | Mahmoud Husseini Orabi | Diana Inkpen | David Van Bruwaene
Proceedings of the 12th International Workshop on Semantic Evaluation

We propose a novel attentive hybrid GRU-based network (SAHGN), which we used at SemEval-2018 Task 1: Affect in Tweets. Our network has two main characteristics: 1) it has the ability to internally optimize its feature representation using attention mechanisms, and 2) it provides a hybrid representation using a character-level Convolutional Neural Network (CNN), as well as a self-attentive word-level encoder. The key advantage of our model is its ability to signify the relevant and important information that enables self-optimization. Results are reported on the valence intensity regression task.

2017

A Dataset for Multi-Target Stance Detection
Parinaz Sobhani | Diana Inkpen | Xiaodan Zhu
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Current models for stance classification often treat each target independently, but in many applications, there exist natural dependencies among targets, e.g., stance towards two or more politicians in an election or towards several brands of the same product. In this paper, we focus on the problem of multi-target stance detection. We present a new dataset that we built for this task. Furthermore, we experiment with several neural models on the dataset and show that they are more effective in jointly modeling the overall position towards two related targets compared to independent predictions and other models of joint learning, such as cascading classification. We make the new dataset publicly available, in order to facilitate further research in multi-target stance classification.

Metaphor Detection in a Poetry Corpus
Vaibhav Kesarwani | Diana Inkpen | Stan Szpakowicz | Chris Tanasescu
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Metaphor is indispensable in poetry. It showcases the poet’s creativity, and contributes to the overall emotional pertinence of the poem while honing its specific rhetorical impact. Previous work on metaphor detection relies on either rule-based or statistical models, none of them applied to poetry. Our method focuses on metaphor detection in a poetry corpus. It combines rule-based and statistical models (word embeddings) to develop a new classification system. Our system has achieved a precision of 0.759 and a recall of 0.804 in identifying one type of metaphor in poetry.

Monitoring Tweets for Depression to Detect At-risk Users
Zunaira Jamil | Diana Inkpen | Prasadith Buddhitha | Kenton White
Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology — From Linguistic Signal to Clinical Reality

We propose an automated system that can identify at-risk users from their public social media activity, more specifically, from Twitter. The data that we collected is from the #BellLetsTalk campaign, which is a wide-reaching, multi-year program designed to break the silence around mental illness and support mental health across Canada. To achieve our goal, we trained a user-level classifier that detects at-risk users with reasonable precision and recall. We also trained a tweet-level classifier that predicts if a tweet indicates depression. This task was much more difficult due to the imbalanced data. In the dataset that we labeled, we came across 5% depression tweets and 95% non-depression tweets. To handle this class imbalance, we used undersampling methods. The resulting classifier had high recall, but low precision. Therefore, we only use this classifier to compute the estimated percentage of depressed tweets and to add this value as a feature for the user-level classifier.
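
The undersampling step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say which undersampling method was used, so plain random undersampling is assumed here; the function name `undersample` and the toy data are hypothetical, with only the 5%/95% split taken from the abstract.

```python
import random


def undersample(examples, labels, seed=0):
    """Randomly drop examples so that every class ends up with the same
    number of items as the smallest class (random undersampling)."""
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_label = {}
    for example, label in zip(examples, labels):
        by_label.setdefault(label, []).append(example)

    # Every class is cut down to the size of the rarest class.
    target = min(len(items) for items in by_label.values())

    balanced = []
    for label, items in by_label.items():
        for example in rng.sample(items, target):
            balanced.append((example, label))

    rng.shuffle(balanced)
    return balanced


# Toy data with the class ratio reported in the abstract (5% vs. 95%).
tweets = [f"tweet {i}" for i in range(100)]
labels = ["depression"] * 5 + ["non-depression"] * 95
balanced = undersample(tweets, labels)
```

Balancing this way discards most majority-class examples, which is one plausible reason a classifier trained on the result could favor recall over precision on the original distribution.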

Recurrent Neural Network-Based Sentence Encoder with Gated Attention for Natural Language Inference
Qian Chen | Xiaodan Zhu | Zhen-Hua Ling | Si Wei | Hui Jiang | Diana Inkpen
Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP

The RepEval 2017 Shared Task aims to evaluate natural language understanding models for sentence representation, in which a sentence is represented as a fixed-length vector with neural networks and the quality of the representation is tested with a natural language inference task. This paper describes our system (alpha) that is ranked among the top in the Shared Task, on both the in-domain test set (obtaining a 74.9% accuracy) and on the cross-domain test set (also attaining a 74.9% accuracy), demonstrating that the model generalizes well to the cross-domain data. Our model is equipped with intra-sentence gated-attention composition which helps achieve a better performance. In addition to submitting our model to the Shared Task, we have also tested it on the Stanford Natural Language Inference (SNLI) dataset. We obtain an accuracy of 85.5%, which is the best reported result on SNLI when cross-sentence attention is not allowed, the same condition enforced in RepEval 2017.

Reasoning and inference are central to human and artificial intelligence. Modeling inference in human language is very challenging. With the availability of large annotated data (Bowman et al., 2015), it has recently become feasible to train neural network based inference models, which have been shown to be very effective. In this paper, we present a new state-of-the-art result, achieving an accuracy of 88.6% on the Stanford Natural Language Inference Dataset. Unlike the previous top models that use very complicated network architectures, we first demonstrate that carefully designing sequential inference models based on chain LSTMs can outperform all previous models. Based on this, we further show that by explicitly considering recursive architectures in both local inference modeling and inference composition, we achieve additional improvement. Particularly, incorporating syntactic parsing information contributes to our best result: it further improves the performance even when added to the already very strong model.

2016

MyAnnotator: A Tool for Technology-Mediated Written Corrective Feedback
Marie-Josée Hamel | Nikolay Slavkov | Diana Inkpen | Dingwen Xiao
Traitement Automatique des Langues, Volume 57, Numéro 3 : TALP et didactique [NLP for Learning and Teaching]

Bilingual Chronological Classification of Hafez’s Poems
Arya Rahgozar | Diana Inkpen
Proceedings of the Fifth Workshop on Computational Linguistics for Literature

Local-Global Vectors to Improve Unigram Terminology Extraction
Ehsan Amjadian | Diana Inkpen | Tahereh Paribakht | Farahnaz Faez
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)

The present paper explores a novel method that integrates efficient distributed representations with terminology extraction. We show that the information from a small number of observed instances can be combined with local and global word embeddings to remarkably improve the term extraction results on unigram terms. To do so we pass the terms extracted by other tools to a filter made of the local-global embeddings and a classifier which in turn decides whether or not a term candidate is a term. The filter can also be used as a hub to merge different term extraction tools into a single higher-performing system. We compare filters that use the skip-gram architecture and filters that employ the CBOW architecture for the task at hand.

2015

Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
Diana Inkpen | Smaranda Muresan | Shibamouli Lahiri | Karen Mazidi | Alisa Zhila
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Applications of Social Media Text Analysis
Atefeh Farzindar | Diana Inkpen
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Analyzing social media texts is a complex problem that becomes difficult to address using traditional Natural Language Processing (NLP) methods. Our tutorial focuses on presenting new methods for NLP tasks and applications that work on noisy and informal texts, such as the ones from social media. Automatic processing of large collections of social media texts is important because they contain a lot of useful information, due to the increasing popularity of all types of social media. Use of social media and messaging apps grew 203 percent year-on-year in 2013, with overall app use rising 115 percent over the same period, as reported by Statista, citing data from Flurry Analytics. This growth means that 1.61 billion people are now active in social media around the world and this is expected to advance to 2 billion users in 2016, led by India. The research shows that consumers are now spending daily 5.6 hours on digital media including social media and mobile internet usage. At the heart of this interest is the ability for users to create and share content via a variety of platforms such as blogs, micro-blogs, collaborative wikis, multimedia sharing sites, social networking sites. The unprecedented volume and variety of user-generated content, as well as the user interaction network constitute new opportunities for understanding social behavior and building socially intelligent systems. Therefore it is important to investigate methods for knowledge extraction from social media data. Furthermore, we can use this information to detect and retrieve more related content about events, such as photos and video clips that have caption texts.

From Argumentation Mining to Stance Classification
Parinaz Sobhani | Diana Inkpen | Stan Matwin
Proceedings of the 2nd Workshop on Argumentation Mining

Estimating User Location in Social Media with Stacked Denoising Auto-encoders
Ji Liu | Diana Inkpen
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

How much does word sense disambiguation help in sentiment analysis of micropost data?
Chiraag Sumanth | Diana Inkpen
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Anaphora resolution is still a challenging research field in natural language processing, lacking an algorithm that correctly resolves anaphoric pronouns. Anaphoric zero pronouns pose an even greater challenge, since this category is not lexically realised. Thus, their resolution is conditioned by their prior identification stage. This paper reports on the distribution of zero pronouns in Romanian in various genres: encyclopaedic, legal, literary, and news-wire texts. For this purpose, the RoZP corpus has been created, containing almost 50000 tokens and 800 zero pronouns which are manually annotated. The distribution patterns are compared across genres, and exceptional cases are presented in order to facilitate the methodological process of developing a future zero pronoun identification and resolution algorithm. The evaluation results emphasise that zero pronouns appear frequently in Romanian, and their distribution depends largely on the genre. Additionally, possible features are revealed for their identification, and a search scope for the antecedent has been determined, increasing the chances of correct resolution.

2009

Real-Word Spelling Correction using Google Web 1T 3-grams
Aminul Islam | Diana Inkpen
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Visual Development Process for Automatic Generation of Digital Games Narrative Content
Maria Fernanda Caropreso | Diana Inkpen | Shahzad Khan | Fazel Keshtkar
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

Inducing translations from officially published materials in Canadian government websites
Qibo Zhu | Diana Inkpen | Ash Asudeh
Proceedings of Machine Translation Summit XII: Papers

2008

Textual Information for Predicting Functional Properties of the Genes
Oana Frunza | Diana Inkpen
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

Combining Multiple Models for Speech Information Retrieval
Muath Alzghool | Diana Inkpen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this article we present a method for combining different information retrieval models in order to increase the retrieval performance in a Speech Information Retrieval task. The formulas for combining the models are tuned on training data. Then the system is evaluated on test data. The task is particularly difficult because the text collection is automatically transcribed spontaneous speech, with many recognition errors. Also, the topics are real information needs, difficult to satisfy. Information Retrieval systems are not able to obtain good results on this data set, except for the case when manual summaries are included.

Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution
Leanne Spracklin | Diana Inkpen | Amiya Nayak
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Traditional Authorship Attribution models extract normalized counts of lexical elements such as nouns, common words and punctuation and use these normalized counts or ratios as features for author fingerprinting. The text is viewed as a bag-of-words and the order of words and their position relative to other words is largely ignored. We propose a new method of feature extraction which quantifies the distribution of lexical elements within the text using Kolmogorov complexity estimates. Testing carried out on blog corpora indicates that such measures outperform ratios when used as features in an SVM authorship attribution model. Moreover, by adding complexity estimates to a model using ratios, we were able to increase the F-measure by 5.2-11.8%.
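
To make the idea of a complexity-based feature concrete, here is a minimal sketch. Kolmogorov complexity itself is not computable, and the abstract does not say how the estimates were obtained; this sketch assumes a common proxy, the compressed size of a sequence under a general-purpose compressor (`zlib`). The helper names `indicator_sequence` and `complexity_estimate` are hypothetical.

```python
import zlib


def indicator_sequence(tokens, element_set):
    """Mark each token position with '1' if the token belongs to the chosen
    class of lexical elements (e.g. common words), else '0'."""
    return "".join("1" if token.lower() in element_set else "0" for token in tokens)


def complexity_estimate(sequence):
    """Approximate the complexity of a 0/1 string by its zlib compression
    ratio: regular distributions compress well and give a low value."""
    if not sequence:
        return 0.0
    raw = sequence.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


# Example: where do the common words fall in a (very short) text?
tokens = "The cat and the dog".split()
pattern = indicator_sequence(tokens, {"the", "and"})
feature = complexity_estimate(pattern)
```

Unlike a ratio, which only counts how often the elements occur, this value depends on where they occur. On texts as short as the toy example the compressor overhead dominates, so the measure is only meaningful on longer sequences.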

2007

Near-Synonym Choice in an Intelligent Thesaurus
Diana Inkpen
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

A tool for detecting French-English cognates and false friends
Oana Frunza | Diana Inkpen
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Cognates are pairs of words in different languages similar in spelling and meaning. They can help a second-language learner on the tasks of vocabulary expansion and reading comprehension. False friends are pairs of words that have similar spelling but different meanings. Partial cognates are pairs of words in two languages that have the same meaning in some, but not all contexts. In this article we present a method to automatically classify a pair of words as cognates or false friends, by using several measures of orthographic similarity as features for classification. We use this method to create complete lists of cognates and false friends between two languages. We also disambiguate partial cognates in context. We applied all our methods to French and English, but they can be applied to other pairs of languages as well. We built a tool that takes the produced lists and annotates a French text with equivalent English cognates or false friends, in order to help second-language learners improve their reading comprehension skills and retention rate.
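
As an illustration of an orthographic similarity measure of the kind the abstract above refers to, the sketch below computes the longest common subsequence ratio (LCSR). The abstract does not list which measures were used, so LCSR is an assumed example, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b,
    computed row by row with dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio: LCS length divided by the length
    of the longer word. The value lies between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# A French-English pair with similar spelling scores close to 1.
score = lcsr("couleur", "colour")
```

A score like this captures spelling similarity only. It cannot by itself separate cognates from false friends, since both have similar spelling by definition; that is why the abstract describes such scores as features for a classifier.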

2006

Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words
Md. Aminul Islam | Diana Inkpen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents a new corpus-based method for calculating the semantic similarity of two target words. Our method, called Second Order Co-occurrence PMI (SOC-PMI), uses Pointwise Mutual Information to sort lists of important neighbor words of the two target words. Then we consider the words which are common in both lists and aggregate their PMI values (from the opposite list) to calculate the relative semantic similarity. Our method was empirically evaluated using Miller and Charles' (1991) 30 noun pair subset, Rubenstein and Goodenough's (1965) 65 noun pairs, 80 synonym test questions from the Test of English as a Foreign Language (TOEFL), and 50 synonym test questions from a collection of English as a Second Language (ESL) tests. Evaluation results show that our method outperforms several competing corpus-based methods.
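
The procedure outlined in the abstract above (rank each target word's neighbors by PMI, then aggregate the PMI values of shared neighbors taken from the opposite list) can be sketched roughly as follows. This is a simplified illustration and not the paper's exact formulation: the window size, the list length `beta`, the normalization, and all function names are assumptions, and the toy corpus is invented.

```python
import math
from collections import Counter


def pmi_table(sentences, window=2):
    """Estimate PMI(w, v) = log2(c(w, v) * N / (c(w) * c(v))) from counts,
    where c(w, v) counts co-occurrences within `window` tokens."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                pair_counts[(w, v)] += 1
                pair_counts[(v, w)] += 1
    total = sum(word_counts.values())
    return {
        (w, v): math.log2(c * total / (word_counts[w] * word_counts[v]))
        for (w, v), c in pair_counts.items()
    }


def top_neighbors(word, pmi, beta):
    """The `beta` neighbors of `word` with the highest positive PMI."""
    scored = {v: s for (w, v), s in pmi.items() if w == word and s > 0}
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:beta])


def soc_pmi_sketch(w1, w2, pmi, beta=10):
    """Sum, over neighbors shared by both lists, the PMI each shared
    neighbor has with the opposite target word; normalize by beta."""
    n1 = top_neighbors(w1, pmi, beta)
    n2 = top_neighbors(w2, pmi, beta)
    common = set(n1) & set(n2)
    if not common:
        return 0.0
    from_w1_side = sum(n2[v] for v in common)  # w1's neighbors scored against w2
    from_w2_side = sum(n1[v] for v in common)  # w2's neighbors scored against w1
    return (from_w1_side + from_w2_side) / beta


corpus = [
    ["car", "drives", "road"],
    ["automobile", "drives", "road"],
    ["banana", "grows", "tree"],
    ["car", "engine", "road"],
    ["automobile", "engine", "road"],
]
pmi = pmi_table(corpus)
similar = soc_pmi_sketch("car", "automobile", pmi)
dissimilar = soc_pmi_sketch("car", "banana", pmi)
```

In the toy corpus, "car" and "automobile" never co-occur directly, yet they share the neighbors "drives", "engine", and "road", so the score is positive. This is the second-order effect the method's name refers to.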

Semi-Supervised Learning of Partial Cognates Using Bilingual Bootstrapping
Oana Frunza | Diana Inkpen
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Building and Using a Lexical Knowledge Base of Near-Synonym Differences
Diana Inkpen | Graeme Hirst
Computational Linguistics, Volume 32, Number 2, June 2006

Investigating Cross-Language Speech Retrieval for a Spontaneous Conversational Speech Collection
Diana Inkpen | Muath Alzghool | Gareth Jones | Douglas Oard
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers