This paper proposes a new idea that uses Wikipedia categories as answer types and defines candidate sets inside Wikipedia. The focus of a given question is searched in the hierarchy of Wikipedia main pages. Our searching strategy combines head-noun matching and synonym matching provided in semantic resources. The set of answer candidates is determined by the entry hierarchy in Wikipedia and the hyponymy hierarchy in WordNet. The experimental results show that the approach can find candidate sets in a smaller size but achieve better performance especially for ARTIFACT and ORGANIZATION types, where the performance is better than state-of-the-art Chinese factoid QA systems.
Commonsense knowledge is essential for fully understanding language in many situations. We acquire large-scale commonsense knowledge from humans using a game with a purpose (GWAP) developed on a smartphone spoken dialogue system. We transform the manual knowledge acquisition process into an enjoyable quiz game and have collected over 150,000 unique commonsense facts by gathering the data of more than 70,000 players over eight months. In this paper, we present a simple method for maintaining the quality of acquired knowledge and an empirical analysis of the knowledge acquisition process. To the best of our knowledge, this is the first work to collect large-scale knowledge via a GWAP on a widely-used spoken dialogue system.
This paper describes a hierarchical neural network we propose for sentence classification to extract product information from product documents. The network classifies each sentence in a document into attribute and condition classes on the basis of word sequences and sentence sequences in the document. Experimental results showed the method using the proposed network significantly outperformed baseline methods by taking semantic representation of word and sentence sequential data into account. We also evaluated the network with two different product domains (insurance and tourism domains) and found that it was effective for both the domains.
Question answering is always an attractive and challenging task in natural language processing area. There are some open domain question answering systems, such as IBM Waston, which take the unstructured text data as input, in some ways of humanlike thinking process and a mode of artificial intelligence. At the conference on Natural Language Processing and Chinese Computing (NLPCC) 2016, China Computer Federation hosted a shared task evaluation about Open Domain Question Answering. We achieve the 2nd place at the document-based subtask. In this paper, we present our solution, which consists of feature engineering in lexical and semantic aspects and model training methods. As the result of the evaluation shows, our solution provides a valuable and brief model which could be used in modelling question answering or sentence semantic relevance. We hope our solution would contribute to this vast and significant task with some heuristic thinking.
An Entity-based approach to Answering recurrent and non-recurrent questions with Past Answers Abstract Community question answering (CQA) systems such as Yahoo! Answers allow registered-users to ask and answer questions in various question categories. However, a significant percentage of asked questions in Yahoo! Answers are unanswered. In this paper, we propose to reduce this percentage by reusing answers to past resolved questions from the site. Specifically, we propose to satisfy unanswered questions in entity rich categories by searching for and reusing the best answers to past resolved questions with shared needs. For unanswered questions that do not have a past resolved question with a shared need, we propose to use the best answer to a past resolved question with similar needs. Our experiments on a Yahoo! Answers dataset shows that our approach retrieves most of the past resolved questions that have shared and similar needs to unanswered questions.
In an era where highly accurate Question Answering (QA) systems are being built using complex Natural Language Processing (NLP) and Information Retrieval (IR) algorithms, presenting the acquired answer to the user akin to a human answer is also crucial. In this paper we present an answer presentation strategy by embedding the answer in a sentence which is developed by incorporating the linguistic structure of the source question extracted through typed dependency parsing. The evaluation using human participants proved that the methodology is human-competitive and can result in linguistically correct sentences for more that 70% of the test dataset acquired from QALD question dataset.
Question answering (QA) systems need to provide exact answers for the questions that are posed to the system. However, this can only be achieved through a precise processing of the question. During this procedure, one important step is the detection of the expected type of answer that the system should provide by extracting the headword of the questions and identifying its semantic type. We have annotated the headword and assigned UMLS semantic types to 643 factoid/list questions from the BioASQ training data. We present statistics on the corpus and a preliminary evaluation in baseline experiments. We also discuss the challenges on both the manual annotation and the automatic detection of the headwords and the semantic types. We believe that this is a valuable resource for both training and evaluation of biomedical QA systems. The corpus is available at: https://github.com/mariananeves/BioMedLAT.
The paper describes topic shifting in dialogues with a robot that provides information from Wiki-pedia. The work focuses on a double topical construction of dialogue coherence which refers to discourse coherence on two levels: the evolution of dialogue topics via the interaction between the user and the robot system, and the creation of discourse topics via the content of the Wiki-pedia article itself. The user selects topics that are of interest to her, and the system builds a list of potential topics, anticipated to be the next topic, by the links in the article and by the keywords extracted from the article. The described system deals with Wikipedia articles, but could easily be adapted to other digital information providing systems.
Wikipedia has become a reference knowledge source for scores of NLP applications. One of its invaluable features lies in its multilingual nature, where articles on a same entity or concept can have from one to more than 200 different versions. The interlinking of language versions in Wikipedia has undergone a major renewal with the advent of Wikidata, a unified scheme to identify entities and their properties using unique numbers. However, as the interlinking is still manually carried out by thousands of editors across the globe, errors may creep in the assignment of entities. In this paper, we describe an optimization technique to match automatically language versions of articles, and hence entities, that is only based on bags of words and anchors. We created a dataset of all the articles on persons we extracted from Wikipedia in six languages: English, French, German, Russian, Spanish, and Swedish. We report a correct match of at least 94.3% on each pair.
In this paper, we present an open information extraction system so-called SRDF that generates lexical knowledge graphs from unstructured texts. In semantic web, knowledge is expressed in the RDF triple form but the natural language text consist of multiple relations between arguments. For this reason, we combine open information extraction with the reification for the full text extraction to preserve meaning of sentence in our knowledge graph. And also our knowledge graph is designed to adapt for many existing semantic web applications. At the end of this paper, we introduce the result of the experiment and a Korean template generation module developed using SRDF.
Natural language questions are interpreted to a sequence of patterns to be matched with instances of patterns in a knowledge base (KB) for answering. A natural language (NL) question answering (QA) system utilizes meaningful patterns matching the syntac-tic/lexical features between the NL questions and KB. In the most of KBs, there are only binary relations in triple form to represent relation between two entities or entity and a value using the domain specific ontology. However, the binary relation representation is not enough to cover complex information in questions, and the ontology vocabulary sometimes does not cover the lexical meaning in questions. Complex meaning needs a knowledge representation to link the binary relation-type triples in KB. In this paper, we propose a frame semantics-based semantic parsing approach as KB-independent question pre-processing. We will propose requirements of question interpretation in the KBQA perspective, and a query form representation based on our proposed format QAF (Ques-tion Answering with the Frame Semantics), which is supposed to cover the requirements. In QAF, frame semantics roles as a model to represent complex information in questions and to disambiguate the lexical meaning in questions to match with the ontology vocabu-lary. Our system takes a question as an input and outputs QAF-query by the process which assigns semantic information in the question to its corresponding frame semantic structure using the semantic parsing rules.
Answering yes–no questions is more difficult than simply retrieving ranked search results. To answer yes–no questions, especially when the correct answer is no, one must find an objectionable keyword that makes the question’s answer no. Existing systems, such as factoid-based ones, cannot answer yes–no questions very well because of insufficient handling of such objectionable keywords. We suggest an algorithm that answers yes–no questions by assigning an importance to objectionable keywords. Concretely speaking, we suggest a penalized scoring method that finds and makes lower score for parts of documents that include such objectionable keywords. We check a keyword distribution for each part of a document such as a paragraph, calculating the keyword density as a basic score. Then we use an objectionable keyword penalty when a keyword does not appear in a target part but appears in other parts of the document. Our algorithm is robust for open domain problems because it requires no training. We achieved 4.45 point better results in F1 scores than the best score of the NTCIR-10 RITE2 shared task, also obtained the best score in 2014 mock university examination challenge of the Todai Robot project.
Nowadays, a question answering (QA) system is used in various areas such a quiz show, personal assistant, home device, and so on. The OKBQA framework supports developing a QA system in an intuitive and collaborative ways. To support collaborative development, the framework should be equipped with some functions, e.g., flexible system configuration, debugging supports, intuitive user interface, and so on while considering different developing groups of different domains. This paper presents OKBQA controller, a dedicated workflow manager for OKBQA framework, to boost collaborative development of a QA system.
Methods based on deep learning approaches have recently achieved state-of-the-art performance in a range of machine learning tasks and are increasingly applied to natural language processing (NLP). Despite strong results in various established NLP tasks involving general domain texts, there is only limited work applying these models to biomedical NLP. In this paper, we consider a Convolutional Neural Network (CNN) approach to biomedical text classification. Evaluation using a recently introduced cancer domain dataset involving the categorization of documents according to the well-established hallmarks of cancer shows that a basic CNN model can achieve a level of performance competitive with a Support Vector Machine (SVM) trained using complex manually engineered features optimized to the task. We further show that simple modifications to the CNN hyperparameters, initialization, and training process allow the model to notably outperform the SVM, establishing a new state of the art result at this task. We make all of the resources and tools introduced in this study available under open licenses from https://cambridgeltl.github.io/cancer-hallmark-cnn/.
End-to-end neural network models for named entity recognition (NER) have shown to achieve effective performances on general domain datasets (e.g. newswire), without requiring additional hand-crafted features. However, in biomedical domain, recent studies have shown that hand-engineered features (e.g. orthographic features) should be used to attain effective performance, due to the complexity of biomedical terminology (e.g. the use of acronyms and complex gene names). In this work, we propose a novel approach that allows a neural network model based on a long short-term memory (LSTM) to automatically learn orthographic features and incorporate them into a model for biomedical NER. Importantly, our bi-directional LSTM model learns and leverages orthographic features on an end-to-end basis. We evaluate our approach by comparing against existing neural network models for NER using three well-established biomedical datasets. Our experimental results show that the proposed approach consistently outperforms these strong baselines across all of the three datasets.
This paper proposes several network construction methods for collections of scarce scientific literature data. We define scarcity as lacking in value and in volume. Instead of using the paper’s metadata to construct several kinds of scientific networks, we use the full texts of the articles and automatically extract the entities needed to construct the networks. Specifically, we present seven kinds of networks using the proposed construction methods: co-occurrence networks for author, keyword, and biological entities, and citation networks for author, keyword, biological, and topic entities. We show two case studies that applies our proposed methods: CADASIL, a rare yet the most common form of hereditary stroke disorder, and Metformin, the first-line medication to the type 2 diabetes treatment. We apply our proposed method to four different applications for evaluation: finding prolific authors, finding important bio-entities, finding meaningful keywords, and discovering influential topics. The results show that the co-occurrence and citation networks constructed using the proposed method outperforms the traditional-based networks. We also compare our proposed networks to traditional citation networks constructed using enough data and infer that even with the same amount of enough data, our methods perform comparably or better than the traditional methods.
We propose an approach for named entity recognition in medical data, using a character-based deep bidirectional recurrent neural network. Such models can learn features and patterns based on the character sequence, and are not limited to a fixed vocabulary. This makes them very well suited for the NER task in the medical domain. Our experimental evaluation shows promising results, with a 60% improvement in F 1 score over the baseline, and our system generalizes well between different datasets.
The increasing amount of biomedical information that is available for researchers and clinicians makes it harder to quickly find the right information. Automatic summarization of multiple texts can provide summaries specific to the user’s information needs. In this paper we look into the use named-entity recognition for graph-based summarization. We extend the LexRank algorithm with information about named entities and present EntityRank, a multi-document graph-based summarization algorithm that is solely based on named entities. We evaluate our system on a datasets of 1009 human written summaries provided by BioASQ and on 1974 gene summaries, fetched from the Entrez Gene database. The results show that the addition of named-entity information increases the performance of graph-based summarizers and that the EntityRank significantly outperforms the other methods with regard to the ROUGE measures.
Electronic health records show great variability since the same concept is often expressed with different terms, either scientific latin forms, common or lay variants and even vernacular naming. Deep learning enables distributional representation of terms in a vector-space, and therefore, related terms tend to be close in the vector space. Accordingly, embedding words through these vectors opens the way towards accounting for semantic relatedness through classical algebraic operations. In this work we propose a simple though efficient unsupervised characterization of Adverse Drug Reactions (ADRs). This approach exploits the embedding representation of the terms involved in candidate ADR events, that is, drug-disease entity pairs. In brief, the ADRs are represented as vectors that link the drug with the disease in their context through a recursive additive model. We discovered that a low-dimensional representation that makes use of the modulus and argument of the embedded representation of the ADR event shows correlation with the manually annotated class. Thus, it can be derived that this characterization results in to be beneficial for further classification tasks as predictive features.
Very few datasets have been released for the evaluation of diagnosis coding with the International Classification of Diseases, and only one so far in a language other than English. This paper describes a large-scale dataset prepared from French death certificates, and the problems which needed to be solved to turn it into a dataset suitable for the application of machine learning and natural language processing methods of ICD-10 coding. The dataset includes the free-text statements written by medical doctors, the associated meta-data, the human coder-assigned codes for each statement, as well as the statement segments which supported the coder’s decision for each code. The dataset comprises 93,694 death certificates totalling 276,103 statements and 377,677 ICD-10 code assignments (3,457 unique codes). It was made available for an international automated coding shared task, which attracted five participating teams. An extended version of the dataset will be used in a new edition of the shared task.
The development of text mining techniques for biomedical research literature has received increased attention in recent times. However, most of these techniques focus on prose, while much important biomedical data reside in tables. In this paper, we present a corpus created to serve as a gold standard for the development and evaluation of techniques for the automatic extraction of information from biomedical tables. We describe the guidelines used for corpus annotation and the manner in which they were developed. The high inter-annotator agreement achieved on the corpus, and the generic nature of our annotation approach, suggest that the developed guidelines can serve as a general framework for table annotation in biomedical and other scientific domains. The annotated corpus and the guidelines are available at http://www.csse.monash.edu.au/research/umnl/data/index.shtml.
In some plain text documents, end-of-line marks may or may not mark the boundary of a text unit (e.g., of a paragraph). This vexing problem is likely to impact subsequent natural language processing components, but is seldom addressed in the literature. We propose a method which uses no manual annotation to classify whether end-of-lines must actually be seen as simple spaces (soft line breaks) or as true text unit boundaries. This method, which includes self-training and co-training steps based on token and line length features, achieves 0.943 F-measure on a corpus of short e-books with controlled format, F=0.904 on a random sample of 24 clinical texts with soft line breaks, and F=0.898 on a larger set of mixed clinical texts which may or may not contain soft line breaks, a fairly high value for a method with no manual annotation.
This paper describes a Natural language processing system developed for automatic identification of explicit connectives, its sense and arguments. Prior work has shown that the difference in usage of connectives across corpora affects the cross domain connective identification task negatively. Hence the development of domain specific discourse parser has become indispensable. Here, we present a corpus annotated with discourse relations on Medline abstracts. Kappa score is calculated to check the annotation quality of our corpus. The previous works on discourse analysis in bio-medical data have concentrated only on the identification of connectives and hence we have developed an end-end parser for connective and argument identification using Conditional Random Fields algorithm. The type and sub-type of the connective sense is also identified. The results obtained are encouraging.
Social media has emerged into a crucial resource for obtaining population-based signals for various public health monitoring and surveillance tasks, such as pharmacovigilance. There is an abundance of knowledge hidden within social media data, and the volume is growing. Drug-related chatter on social media can include user-generated information that can provide insights into public health problems such as abuse, adverse reactions, long-term effects, and multi-drug interactions. Our objective in this paper is to present to the biomedical natural language processing, data science, and public health communities data sets (annotated and unannotated), tools and resources that we have collected and created from social media. The data we present was collected from Twitter using the generic and brand names of drugs as keywords, along with their common misspellings. Following the collection of the data, annotation guidelines were created over several iterations, which detail important aspects of social media data annotation and can be used by future researchers for developing similar data sets. The annotation guidelines were followed to prepare data sets for text classification, information extraction and normalization. In this paper, we discuss the preparation of these guidelines, outline the data sets prepared, and present an overview of our state-of-the-art systems for data collection, supervised classification, and information extraction. In addition to the development of supervised systems for classification and extraction, we developed and released unlabeled data and language models. We discuss the potential uses of these language models in data mining and the large volumes of unlabeled data from which they were generated. We believe that the summaries and repositories we present here of our data, annotation guidelines, models, and tools will be beneficial to the research community as a single-point entry for all these resources, and will promote further research in this area.
Electronic Health Records (EHRs) are increasingly available in modern health care institutions either through the direct creation of electronic documents in hospitals’ health information systems, or through the digitization of historical paper records. Each EHR creation method yields the need for sophisticated text reuse detection tools in order to prepare the EHR collections for efficient secondary use relying on Natural Language Processing methods. Herein, we address the detection of two types of text reuse in French EHRs: 1) the detection of updated versions of the same document and 2) the detection of document duplicates that still bear surface differences due to OCR or de-identification processing. We present a robust text reuse detection method to automatically identify redundant document pairs in two French EHR corpora that achieves an overall macro F-measure of 0.68 and 0.60, respectively and correctly identifies all redundant document pairs of interest.
An important subtask in clinical text mining tries to identify whether a clinical finding is expressed as present, absent or unsure in a text. This work presents a system for detecting mentions of clinical findings that are negated or just speculated. The system has been applied to two different types of German clinical texts: clinical notes and discharge summaries. Our approach is built on top of NegEx, a well known algorithm for identifying non-factive mentions of medical findings. In this work, we adjust a previous adaptation of NegEx to German and evaluate the system on our data to detect negation and speculation. The results are compared to a baseline algorithm and are analyzed for both types of clinical documents. Our system achieves an F1-Score above 0.9 on both types of reports.
Effective knowledge resources are critical for developing successful clinical decision support systems that alleviate the cognitive load on physicians in patient care. In this paper, we describe two new methods for building a knowledge resource of disease to medication associations. These methods use fundamentally different content and are based on advanced natural language processing and machine learning techniques. One method uses distributional semantics on large medical text, and the other uses data mining on a large number of patient records. The methods are evaluated using 25,379 unique disease-medication pairs extracted from 100 de-identified longitudinal patient records of a large multi-provider hospital system. We measured recall (R), precision (P), and F scores for positive and negative association prediction, along with coverage and accuracy. While individual methods performed well, a combined stacked classifier achieved the best performance, indicating the limitations and unique value of each resource and method. In predicting positive associations, the stacked combination significantly outperformed the baseline (a distant semi-supervised method on large medical text), achieving F scores of 0.75 versus 0.55 on the pairs seen in the patient records, and F scores of 0.69 and 0.35 on unique pairs.
Author name disambiguation (AND) in publication and citation resources is a well-known problem. Often, information about email address and other details in the affiliation is missing. In cases where such information is not available, identifying the authorship of publications becomes very challenging. Consequently, there have been attempts to resolve such cases by utilizing external resources as references. However, such external resources are heterogeneous and are not always reliable regarding the correctness of information. To solve the AND task, especially when information about an author is not complete we suggest the use of new features such as journal descriptors (JD) and semantic types (ST). The evaluation of different feature models shows that their inclusion has an impact equivalent to that of other important features such as email address. Using such features we show that our system outperforms the state of the art.
In this paper, we extend an existing annotation scheme ISO-Space for annotating necessary spatial information for the task placing an specified object at a specified location with a specified direction according to a natural language instruction. We call such task the spatial placement problem. Our extension particularly focuses on describing the object direction, when the object is placed on the 2D plane. We conducted an annotation experiment in which a corpus of 20 situated dialogues were annotated. The annotation result showed the number of newly introduced tags by our proposal is not negligible. We also implemented an analyser that automatically assigns the proposed tags to the corpus and evaluated its performance. The result showed that the performance for entity tag was quite high ranging from 0.68 to 0.99 in F-measure, but not the case for relation tags, i.e. less than 0.4 in F-measure.
This paper proposes a methodology for building a specialized Japanese data set for recognizing temporal relations and discourse relations. In addition to temporal and discourse relations, multi-layered situational relations that distinguish generic and specific states belonging to different layers in a discourse are annotated. Our methodology has been applied to 170 text fragments taken from Wikinews articles in Japanese. The validity of our methodology is evaluated and analyzed in terms of degree of annotator agreement and frequency of errors.
This article proposes a Universal Dependency Annotation Scheme for Mandarin Chinese, including POS tags and dependency analysis. We identify cases of idiosyncrasy of Mandarin Chinese that are difficult to fit into the current schema which has mainly been based on the descriptions of various Indo-European languages. We discuss differences between our scheme and those of the Stanford Chinese Dependencies and the Chinese Dependency Treebank.
The approach which formulates the automatic text summarization as a maximum coverage problem with knapsack constraint over a set of textual units and a set of weighted conceptual units is promising. However, it is quite important and difficult to determine the appropriate granularity of conceptual units for this formulation. In order to resolve this problem, we are examining to use components of presentation slides as conceptual units to generate a summary of lecture utterances, instead of other possible conceptual units like base noun phrases or important nouns. This paper explains our developing corpus designed to evaluate our proposing approach, which consists of presentation slides and lecture utterances aligned to presentation slide components.
This paper presents VSoLSCSum, a Vietnamese linked sentence-comment dataset, which was manually created to treat the lack of standard corpora for social context summarization in Vietnamese. The dataset was collected through the keywords of 141 Web documents in 12 special events, which were mentioned on Vietnamese Web pages. Social users were asked to involve in creating standard summaries and the label of each sentence or comment. The inter-agreement calculated by Cohen’s Kappa among raters after validating is 0.685. To illustrate the potential use of our dataset, a learning to rank method was trained by using a set of local and social features. Experimental results indicate that the summary model trained on our dataset outperforms state-of-the-art baselines in both ROUGE-1 and ROUGE-2 in social context summarization.
Paratactic syntactic structures are difficult to represent in syntactic dependency tree structures. As such, we propose an annotation schema for syntactic dependency annotation of Japanese, in which coordinate structures are split from and overlaid on bunsetsu-based (base phrase unit) dependency. The schema represents nested coordinate structures, non-constituent conjuncts, and forward sharing as the set of regions. The annotation was performed on the core data of ‘Balanced Corpus of Contemporary Written Japanese’, which comprised about one million words and 1980 samples from six registers, such as newspapers, books, magazines, and web texts.
Treebanks are curial for natural language processing (NLP). In this paper, we present our work for annotating a Chinese treebank in scientific domain (SCTB), to address the problem of the lack of Chinese treebanks in this domain. Chinese analysis and machine translation experiments conducted using this treebank indicate that the annotated treebank can significantly improve the performance on both tasks. This treebank is released to promote Chinese NLP research in scientific domain.
This paper introduces the NIFTY-Serve corpus, a large data archive collected from Japanese discussion forums that operated via a Bulletin Board System (BBS) between 1987 and 2006. This corpus can be used in Artificial Intelligence researches such as Natural Language Processing, Community Analysis, and so on. The NIFTY-Serve corpus differs from data on WWW in three ways; (1) essentially spam- and duplication-free because of strict data collection procedures, (2) historic user-generated data before WWW, and (3) a complete data set because the service now shut down. We also introduce some examples of use of the corpus.
This paper describes various Indonesian language resources that Agency for the Assessment and Application of Technology (BPPT) has developed and collected since mid 80’s when we joined MMTS (Multilingual Machine Translation System), an international project coordinated by CICC-Japan to develop a machine translation system for five Asian languages (Bahasa Indonesia, Malay, Thai, Japanese, and Chinese). Since then, we have been actively doing many types of research in the field of statistical machine translation, speech recognition, and speech synthesis which requires many text and speech corpus. Most recent cooperation within ASEAN-IVO is the development of Indonesian ALT (Asian Language Treebank) has added new NLP tools.
This paper describes a Japanese political corpus created for interdisciplinary political research. The corpus contains the local assembly minutes of 47 prefectures from April 2011 to March 2015. This four-year period coincides with the term of office for assembly members in most autonomies. We analyze statistical data, such as the number of speakers, characters, and words, to clarify the characteristics of local assembly minutes. In addition, we identify problems associated with the different web services used by the autonomies to make the minutes available to the public.
Many NLP tasks involve sentence-level annotation yet the relevant information is not encoded at sentence level but at some relevant parts of the sentence. Such tasks include but are not limited to: sentiment expression annotation, product feature annotation, and template annotation for Q&A systems. However, annotation of the full corpus sentence by sentence is resource intensive. In this paper, we propose an approach that iteratively extracts frequent parts of sentences for annotating, and compresses the set of sentences after each round of annotation. Our approach can also be used in preparing training sentences for binary classification (domain-related vs. noise, subjectivity vs. objectivity, etc.), assuming that sentence-type annotation can be predicted by annotation of the most relevant sub-sentences. Two experiments are performed to test our proposal and evaluated in terms of time saved and agreement of annotation.
Summarization of multi-party conversation is one of the important tasks in natural language processing. In this paper, we explain a Japanese corpus and a topic segmentation task. To the best of our knowledge, the corpus is the first Japanese corpus annotated for summarization tasks and freely available to anyone. We call it “the Kyutech corpus.” The task of the corpus is a decision-making task with four participants and it contains utterances with time information, topic segmentation and reference summaries. As a case study for the corpus, we describe a method combined with LCSeg and TopicTiling for a topic segmentation task. We discuss the effectiveness and the problems of the combined method through the experiment with the Kyutech corpus.
In this paper we present two methods for automatic common sense knowledge evaluation for Japanese entries in ConceptNet ontology. Our proposed methods utilize text-mining approach: one with relation clue words and WordNet synonyms, and one without. Both methods were tested with a blog corpus. The system based on our proposed methods reached relatively high precision score for three relations (MadeOf, UsedFor, AtLocation), which is comparable with previous research using commercial search engines and simpler input. We analyze errors and discuss problems of common sense evaluation, both manual and automatic and propose ideas for further improvements.
Although MWE are relatively morphologically and syntactically fixed expressions, several types of flexibility can be observed in MWE, verbal MWE in particular. Identifying the degree of morphological and syntactic flexibility of MWE is very important for many Lexicographic and NLP tasks. Adding MWE variants/tokens to a dictionary resource requires characterizing the flexibility among other morphosyntactic features. Carrying out the task manually faces several challenges since it is a very laborious task time and effort wise, as well as it will suffer from coverage limitation. The problem is exacerbated in rich morphological languages where the average word in Arabic could have 12 possible inflection forms. Accordingly, in this paper we introduce a semi-automatic Arabic multiwords expressions resource (SAMER). We propose an automated method that identifies the morphological and syntactic flexibility of Arabic Verbal Multiword Expressions (AVMWE). All observed morphological variants and syntactic pattern alternations of an AVMWE are automatically acquired using large scale corpora. We look for three morphosyntactic aspects of AVMWE types investigating derivational and inflectional variations and syntactic templates, namely: 1) inflectional variation (inflectional paradigm) and calculating degree of flexibility; 2) derivational productivity; and 3) identifying and classifying the different syntactic types. We build a comprehensive list of AVMWE. Every token in the AVMWE list is lemmatized and tagged with POS information. We then search Arabic Gigaword and All ATBs for all possible flexible matches. For each AVMWE type we generate: a) a statistically ranked list of MWE-lexeme inflections and syntactic pattern alternations; b) An abstract syntactic template; and c) The most frequent form. Our technique is validated using a Golden MWE annotated list. The results shows that the quality of the generated resource is 80.04%.
This paper describes our attempt to build a sentiment analysis system for Indonesian tweets. With this system, we can study and identify sentiments and opinions in a text or document computationally. We used four thousand manually labeled tweets collected in February and March 2016 to build the model. Because of the variety of content in tweets, we analyze tweets into eight groups in total, including pos(itive), neg(ative), and neu(tral). Finally, we obtained 73.2% accuracy with Long Short Term Memory (LSTM) without normalizer.
In this paper, we propose to use a subfield of machine learning –grammatical inference– to measure linguistic complexity from a developmental point of view. We focus on relative complexity by considering a child learner in the process of first language acquisition. The relevance of grammatical inference models for measuring linguistic complexity from a developmental point of view is based on the fact that algorithms proposed in this area can be considered computational models for studying first language acquisition. Even though it will be possible to use different techniques from the field of machine learning as computational models for dealing with linguistic complexity -since in any model we have algorithms that can learn from data-, we claim that grammatical inference models offer some advantages over other tools.
In this paper, we introduce for the first time a Distributional Model for computing semantic complexity, inspired by the general principles of the Memory, Unification and Control framework(Hagoort, 2013; Hagoort, 2016). We argue that sentence comprehension is an incremental process driven by the goal of constructing a coherent representation of the event represented by the sentence. The composition cost of a sentence depends on the semantic coherence of the event being constructed and on the activation degree of the linguistic constructions. We also report the results of a first evaluation of the model on the Bicknell dataset (Bicknell et al., 2010).
We present a novel approach to the automatic assessment of text complexity based on a sliding-window technique that tracks the distribution of complexity within a text. Such distribution is captured by what we term “complexity contours” derived from a series of measurements for a given linguistic complexity measure. This approach is implemented in an automatic computational tool, CoCoGen – Complexity Contour Generator, which in its current version supports 32 indices of linguistic complexity. The goal of the paper is twofold: (1) to introduce the design of our computational tool based on a sliding-window technique and (2) to showcase this approach in the area of second language (L2) learning, i.e. more specifically, in the area of L2 writing.
This study demonstrates a weakness in how n-gram and PCFG surprisal are used to predict reading times in eye-tracking data. In particular, the information conveyed by words skipped during saccades is not usually included in the surprisal measures. This study shows that correcting the surprisal calculation improves n-gram surprisal and that upcoming n-grams affect reading times, replicating previous findings of how lexical frequencies affect reading times. In contrast, the predictivity of PCFG surprisal does not benefit from the surprisal correction despite the fact that lexical sequences skipped by saccades are processed by readers, as demonstrated by the corrected n-gram measure. These results raise questions about the formulation of information-theoretic measures of syntactic processing such as PCFG surprisal and entropy reduction when applied to reading times.
Computational approaches to readability assessment are generally built and evaluated using gold standard corpora labeled by publishers or teachers rather than being grounded in observations about human performance. Considering that both the reading process and the outcome can be observed, there is an empirical wealth that could be used to ground computational analysis of text readability. This will also support explicit readability models connecting text complexity and the reader’s language proficiency to the reading process and outcomes. This paper takes a step in this direction by reporting on an experiment to study how the relation between text complexity and reader’s language proficiency affects the reading process and performance outcomes of readers after reading We modeled the reading process using three eye tracking variables: fixation count, average fixation count, and second pass reading duration. Our models for these variables explained 78.9%, 74% and 67.4% variance, respectively. Performance outcome was modeled through recall and comprehension questions, and these models explained 58.9% and 27.6% of the variance, respectively. While the online models give us a better understanding of the cognitive correlates of reading with text complexity and language proficiency, modeling of the offline measures can be particularly relevant for incorporating user aspects into readability models.
Studies on the role of memory as a predictor of reading time latencies (1) differ in their predictions about when memory effects should occur in processing and (2) have had mixed results, with strong positive effects emerging from isolated constructed stimuli and weak or even negative effects emerging from naturally-occurring stimuli. Our study addresses these concerns by comparing several implementations of prominent sentence processing theories on an exploratory corpus and evaluating the most successful of these on a confirmatory corpus, using a new self-paced reading corpus of seemingly natural narratives constructed to contain an unusually high proportion of memory-intensive constructions. We show highly significant and complementary broad-coverage latency effects both for predictors based on the Dependency Locality Theory and for predictors based on a left-corner parsing model of sentence processing. Our results indicate that memory access during sentence processing does take time, but suggest that stimuli requiring many memory access events may be necessary in order to observe the effect.
Lexical complexity plays a central role in readability, particularly for dyslexic children and poor readers because of their slow and laborious decoding and word recognition skills. Although some features to aid readability may be common to most languages (e.g., the majority of ‘easy’ words are of low frequency), we believe that lexical complexity is mainly language-specific. In this paper, we define lexical complexity for French and we present a pilot study on the effects of text simplification in dyslexic children. The participants were asked to read out loud original and manually simplified versions of a standardized French text corpus and to answer comprehension questions after reading each text. The analysis of the results shows that the simplifications performed were beneficial in terms of reading speed and they reduced the number of reading errors (mainly lexical ones) without a loss in comprehension. Although the number of participants in this study was rather small (N=10), the results are promising and contribute to the development of applications in computational linguistics.
In this paper we will be dealing with different levels of complexity in the processing of Italian, a Romance language inheriting many properties from Latin which make it an almost free word order language . The paper is concerned with syntactic complexity as measurable on the basis of the cognitive parser that incrementally builds up a syntactic representation to be used by the semantic component. The theory behind will be LFG and parsing preferences will be used to justify one choice both from a principled and a processing point of view. LFG is a transformationless theory in which there is no deep structure separate from surface syntactic structure. This is partially in accordance with constructional theories in which noncanonical structures containing non-argument functions FOCUS/TOPIC are treated as multifunctional constituents. Complexity is computed on a processing basis following suggestions made by Blache and demonstrated by Kluender and Chesi
Previous researches have shown that learning multiple representations for polysemous words can improve the performance of word embeddings on many tasks. However, this leads to another problem. Several vectors of a word may actually point to the same meaning, namely pseudo multi-sense. In this paper, we introduce the concept of pseudo multi-sense, and then propose an algorithm to detect such cases. With the consideration of the detected pseudo multi-sense cases, we try to refine the existing word embeddings to eliminate the influence of pseudo multi-sense. Moreover, we apply our algorithm on previous released multi-sense word embeddings and tested it on artificial word similarity tasks and the analogy task. The result of the experiments shows that diminishing pseudo multi-sense can improve the quality of word representations. Thus, our method is actually an efficient way to reduce linguistic complexity.
In this paper, we present a comparative analysis of statistically predictive syntactic features of complexity and the treatment of these features by humans when simplifying texts. To that end, we have used a list of the most five statistically predictive features obtained automatically and the Corpus of Basque Simplified Texts (CBST) to analyse how the syntactic phenomena in these features have been manually simplified. Our aim is to go beyond the descriptions of operations found in the corpus and relate the multidisciplinary findings to understand text complexity from different points of view. We also present some issues that can be important when analysing linguistic complexity.
Pause analysis of key-stroke logged translations is a hallmark of process based translation studies. However, an exact definition of what a cognitively effortful pause during the translation process is has not been found yet (Saldanha and O’Brien, 2013). This paper investigates the design of a key-stroke and subject dependent identification system of cognitive effort to track complexity in translation with keystroke logging (cf. also (Dragsted, 2005) (Couto-Vale, in preparation)). It is an elastic measure that takes into account idiosyncratic pause duration of translators as well as further confounds such as bi-gram frequency, letter frequency and some motor tasks involved in writing. The method is compared to a common static threshold of 1000 ms in an analysis of cognitive effort during the translation of grammatical functions from English to German. Additionally, the results are triangulated with eye tracking data for further validation. The findings show that at least for smaller sets of data a dynamic pause assessment may lead to more accurate results than a generic static pause threshold of similar duration.
Data driven approaches to readability analysis for languages other than English has been plagued by a scarcity of suitable corpora. Often, relevant corpora consist only of easy-to-read texts with no rank information or empirical readability scores, making only binary approaches, such as classification, applicable. We propose a Bayesian, latent variable, approach to get the most out of these kinds of corpora. In this paper we present results on using such a model for readability ranking. The model is evaluated on a preliminary corpus of ranked student texts with encouraging results. We also assess the model by showing that it performs readability classification on par with a state of the art classifier while at the same being transparent enough to allow more sophisticated interpretations.
Informed by research on readability and language acquisition, computational linguists have developed sophisticated tools for the analysis of linguistic complexity. While some tools are starting to become accessible on the web, there still is a disconnect between the features that can in principle be identified based on state-of-the-art computational linguistic analysis, and the analyses a second language acquisition researcher, teacher, or textbook writer can readily obtain and visualize for their own collection of texts. This short paper presents a web-based tool development that aims to meet this challenge. The Common Text Analysis Platform (CTAP) is designed to support fully configurable linguistic feature extraction for a wide range of complexity analyses. It features a user-friendly interface, modularized and reusable analysis component integration, and flexible corpus and feature management. Building on the Unstructured Information Management framework (UIMA), CTAP readily supports integration of state-of-the-art NLP and complexity feature extraction maintaining modularization and reusability. CTAP thereby aims at providing a common platform for complexity analysis, encouraging research collaboration and sharing of feature extraction components—to jointly advance the state-of-the-art in complexity analysis in a form that readily supports real-life use by ordinary users.
We bring together knowledge from two different types of language learning data, texts learners read and texts they write, to improve linguistic complexity classification in the latter. Linguistic complexity in the foreign and second language learning context can be expressed in terms of proficiency levels. We show that incorporating features capturing lexical complexity information from reading passages can boost significantly the machine learning based classification of learner-written texts into proficiency levels. With an F1 score of .8 our system rivals state-of-the-art results reported for other languages for this task. Finally, we present a freely available web-based tool for proficiency level classification and lexical complexity visualization for both learner writings and reading texts.
Arabic writing is typically underspecified for short vowels and other markups, referred to as diacritics. In addition to the lexical ambiguity exhibited in most languages, the lack of diacritics in written Arabic adds another layer of ambiguity which is an artifact of the orthography. In this paper, we present the details of three annotation experimental conditions designed to study the impact of automatic ambiguity detection, on annotation speed and quality in a large scale annotation project.
Computational linguistic approaches to sign languages could benefit from investigating how complexity influences structure. We investigate whether morphological complexity has an effect on the order of Verb (V) and Object (O) in Swedish Sign Language (SSL), on the basis of elicited data from five Deaf signers. We find a significant difference in the distribution of the orderings OV vs. VO, based on an analysis of morphological weight. While morphologically heavy verbs exhibit a general preference for OV, humanness seems to affect the ordering in the opposite direction, with [+human] Objects pushing towards a preference for VO.
Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages of 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS), and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word-alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings, and can be used interchangeably.
Comparable or parallel corpora are beneficial for many NLP tasks. The automatic collection of corpora enables large-scale resources, even for less-resourced languages, which in turn can be useful for deducing rules and patterns for text rewriting algorithms, a subtask of automatic text simplification. We present two methods for the alignment of Swedish easy-to-read text segments to text segments from a reference corpus. The first method (M1) was originally developed for the task of text reuse detection, measuring sentence similarity by a modified version of a TF-IDF vector space model. A second method (M2), also accounting for part-of-speech tags, was developed, and the methods were compared. For evaluation, a crowdsourcing platform was built for human judgement data collection, and preliminary results showed that cosine similarity relates better to human ranks than the Dice coefficient. We also saw a tendency that including syntactic context to the TF-IDF vector space model is beneficial for this kind of paraphrase alignment task.
This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different Machine Learning classifiers for the task of readability assessment focusing on Portuguese and English texts, analysing the impact of variables like the feature inventory used in the resulting corpus. In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but the use of additional features results in even better results, in most cases. For English, shallow features also perform well as do classic readability formulas. Comparing different classifiers for the task, logistic regression obtained, in general, the best results, but with considerable differences between the results for two and those for three-classes, especially regarding the intermediary class. Given the large scale of the resulting corpus, for evaluation we adopt the agreement between different classifiers as an indication of readability assessment certainty. As a result of this work, a large corpus for Brazilian Portuguese was built, including 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.
This work investigates the application of a measure of surprisal to modeling a grammatical variation phenomenon between near-synonymous constructions. We investigate a particular variation phenomenon, word order variation in Dutch two-verb clusters, where it has been established that word order choice is affected by processing cost. Several multifactorial corpus studies of Dutch verb clusters have used other measures of processing complexity to show that this factor affects word order choice. This previous work allows us to compare the surprisal measure, which is based on constraint satisfaction theories of language modeling, to those previously used measures, which are more directly linked to empirical observations of processing complexity. Our results show that surprisal does not predict the word order choice by itself, but is a significant predictor when used in a measure of uniform information density (UID). This lends support to the view that human language processing is facilitated not so much by predictable sequences of words but more by sequences of words in which information is spread evenly.
The relative contributions of meaning and form to sentence processing remains an outstanding issue across the language sciences. We examine this issue by formalizing four incremental complexity metrics and comparing them against freely-available ROI timecourses. Syntax-related metrics based on top-down parsing and structural dependency-distance turn out to significantly improve a regression model, compared to a simpler model that formalizes only conceptual combination using a distributional vector-space model. This confirms the view of the anterior temporal lobes as combinatory engines that deal in both form (see e.g. Brennan et al., 2012; Mazoyer, 1993) and meaning (see e.g., Patterson et al., 2007). This same characterization applies to a posterior temporal region in roughly “Wernicke’s Area.”
This paper investigates the use of automatic speech recognition (ASR) errors as indicators of the second language (L2) learners’ listening difficulties and in doing so strives to overcome the shortcomings of Partial and Synchronized Caption (PSC) system. PSC is a system that generates a partial caption including difficult words detected based on high speech rate, low frequency, and specificity. To improve the choice of words in this system, and explore a better method to detect speech challenges, ASR errors were investigated as a model of the L2 listener, hypothesizing that some of these errors are similar to those of language learners’ when transcribing the videos. To investigate this hypothesis, ASR errors in transcription of several TED talks were analyzed and compared with PSC’s selected words. Both the overlapping and mismatching cases were analyzed to investigate possible improvement for the PSC system. Those ASR errors that were not detected by PSC as cases of learners’ difficulties were further analyzed and classified into four categories: homophones, minimal pairs, breached boundaries and negatives. These errors were embedded into the baseline PSC to make the enhanced version and were evaluated in an experiment with L2 learners. The results indicated that the enhanced version, which encompasses the ASR errors addresses most of the L2 learners’ difficulties and better assists them in comprehending challenging video segments as compared with the baseline.
Eye-tracking reading times have been attested to reflect cognitive processes underlying sentence comprehension. However, the use of reading times in NLP applications is an underexplored area of research. In this initial work we build an automatic system to assess sentence complexity using automatically predicted eye-tracking reading time measures and demonstrate the efficacy of these reading times for a well known NLP task, namely, readability assessment. We use a machine learning model and a set of features known to be significant predictors of reading times in order to learn per-word reading times from a corpus of English text having reading times of human readers. Subsequently, we use the model to predict reading times for novel text in the context of the aforementioned task. A model based only on reading times gave competitive results compared to the systems that use extensive syntactic features to compute linguistic complexity. Our work, to the best of our knowledge, is the first study to show that automatically predicted reading times can successfully model the difficulty of a text and can be deployed in practical text processing applications.
The article presents results of entropy rate estimation for human languages across six languages by using large, state-of-the-art corpora of up to 7.8 gigabytes. To obtain the estimates for data length tending to infinity, we use an extrapolation function given by an ansatz. Whereas some ansatzes of this kind were proposed in previous research papers, here we introduce a stretched exponential extrapolation function that has a smaller error of fit. In this way, we uncover a possibility that the entropy rates of human languages are positive but 20% smaller than previously reported.
The morphological complexity of languages differs widely and changes over time. Pathways of change are often driven by the interplay of multiple competing factors, and are hard to disentangle. We here focus on a paradigmatic scenario of language change: the reduction of morphological complexity from Latin towards the Romance languages. To establish a causal explanation for this phenomenon, we employ three lines of evidence: 1) analyses of parallel corpora to measure the complexity of words in actual language production, 2) applications of NLP tools to further tease apart the contribution of inflectional morphology to word complexity, and 3) experimental data from artificial language learning, which illustrate the learning pressures at play when morphology simplifies. These three lines of evidence converge to show that pressures associated with imperfect language learning are good candidates to causally explain the reduction in morphological complexity in the Latin-to-Romance scenario. More generally, we argue that combining corpus, computational and experimental evidence is the way forward in historical linguistics and linguistic typology.
The goal of this paper is to examine the impact of simple feature engineering mechanisms before applying more sophisticated techniques to the task of medical NER. Sometimes papers using scientifically sound techniques present raw baselines that could be improved adding simple and cheap features. This work focuses on entity recognition for the clinical domain for three languages: English, Swedish and Spanish. The task is tackled using simple features, starting from the window size, capitalization, prefixes, and moving to POS and semantic tags. This work demonstrates that a simple initial step of feature engineering can improve the baseline results significantly. Hence, the contributions of this paper are: first, a short list of guidelines well supported with experimental results on three languages and, second, a detailed description of the relevance of these features for medical NER.
Automated extraction of concepts from patient clinical records is an essential facilitator of clinical research. For this reason, the 2010 i2b2/VA Natural Language Processing Challenges for Clinical Records introduced a concept extraction task aimed at identifying and classifying concepts into predefined categories (i.e., treatments, tests and problems). State-of-the-art concept extraction approaches heavily rely on handcrafted features and domain-specific resources which are hard to collect and define. For this reason, this paper proposes an alternative, streamlined approach: a recurrent neural network (the bidirectional LSTM with CRF decoding) initialized with general-purpose, off-the-shelf word embeddings. The experimental results achieved on the 2010 i2b2/VA reference corpora using the proposed framework outperform all recent methods and ranks closely to the best submission from the original 2010 i2b2/VA challenge.
Due to the recent replacements of physical documents with electronic medical records (EMR), the importance of information processing in medical fields has been increased. We have been organizing the MedNLP task series in NTCIR-10 and 11. These workshops were the first shared tasks which attempt to evaluate technologies that retrieve important information from medical reports written in Japanese. In this report, we describe the NTCIR-12 MedNLPDoc task which is designed for more advanced and practical use for the medical fields. This task is considered as a multi-labeling task to a patient record. This report presents results of the shared task, discusses and illustrates remained issues in the medical natural language processing field.
Patient notes contain a wealth of information of potentially great interest to medical investigators. However, to protect patients’ privacy, Protected Health Information (PHI) must be removed from the patient notes before they can be legally released, a process known as patient note de-identification. The main objective for a de-identification system is to have the highest possible recall. Recently, the first neural-network-based de-identification system has been proposed, yielding state-of-the-art results. Unlike other systems, it does not rely on human-engineered features, which allows it to be quickly deployed, but does not leverage knowledge from human experts or from electronic health records (EHRs). In this work, we explore a method to incorporate human-engineered features as well as features derived from EHRs to a neural-network-based de-identification system. Our results show that the addition of features, especially the EHR-derived features, further improves the state-of-the-art in patient note de-identification, including for some of the most sensitive PHI types such as patient names. Since in a real-life setting patient notes typically come with EHRs, we recommend developers of de-identification systems to leverage the information EHRs contain.
Semi-supervised clustering is an attractive alternative for traditional (unsupervised) clustering in targeted applications. By using the information of a small annotated dataset, semi-supervised clustering can produce clusters that are customized to the application domain. In this paper, we present a semi-supervised clustering technique based on a multi-objective evolutionary algorithm (NSGA-II-clus). We apply this technique to the task of clustering medical publications for Evidence Based Medicine (EBM) and observe an improvement of the results against unsupervised and other semi-supervised clustering techniques.
Rapid growth in Electronic Medical Records (EMR) has emerged to an expansion of data in the clinical domain. The majority of the available health care information is sealed in the form of narrative documents which form the rich source of clinical information. Text mining of such clinical records has gained huge attention in various medical applications like treatment and decision making. However, medical records enclose patient Private Health Information (PHI) which can reveal the identities of the patients. In order to retain the privacy of patients, it is mandatory to remove all the PHI information prior to making it publicly available. The aim is to de-identify or encrypt the PHI from the patient medical records. In this paper, we propose an algorithm based on deep learning architecture to solve this problem. We perform de-identification of seven PHI terms from the clinical records. Experiments on benchmark datasets show that our proposed approach achieves encouraging performance, which is better than the baseline model developed with Conditional Random Field.
Paraphrase generation is important in various applications such as search, summarization, and question answering due to its ability to generate textual alternatives while keeping the overall meaning intact. Clinical paraphrase generation is especially vital in building patient-centric clinical decision support (CDS) applications where users are able to understand complex clinical jargons via easily comprehensible alternative paraphrases. This paper presents Neural Clinical Paraphrase Generation (NCPG), a novel approach that casts the task as a monolingual neural machine translation (NMT) problem. We propose an end-to-end neural network built on an attention-based bidirectional Recurrent Neural Network (RNN) architecture with an encoder-decoder framework to perform the task. Conventional bilingual NMT models mostly rely on word-level modeling and are often limited by out-of-vocabulary (OOV) issues. In contrast, we represent the source and target paraphrase pairs as character sequences to address this limitation. To the best of our knowledge, this is the first work that uses attention-based RNNs for clinical paraphrase generation and also proposes an end-to-end character-level modeling for this task. Extensive experiments on a large curated clinical paraphrase corpus show that the attention-based NCPG models achieve improvements of up to 5.2 BLEU points and 0.5 METEOR points over a non-attention based strong baseline for word-level modeling, whereas further gains of up to 6.1 BLEU points and 1.3 METEOR points are obtained by the character-level NCPG models over their word-level counterparts. Overall, our models demonstrate comparable performance relative to the state-of-the-art phrase-based non-neural models.
The proliferation of deep learning methods in natural language processing (NLP) and the large amounts of data they often require stands in stark contrast to the relatively data-poor clinical NLP domain. In particular, large text corpora are necessary to build high-quality word embeddings, yet often large corpora that are suitably representative of the target clinical data are unavailable. This forces a choice between building embeddings from small clinical corpora and less representative, larger corpora. This paper explores this trade-off, as well as intermediate compromise solutions. Two standard clinical NLP tasks (the i2b2 2010 concept and assertion tasks) are evaluated with commonly used deep learning models (recurrent neural networks and convolutional neural networks) using a set of six corpora ranging from the target i2b2 data to large open-domain datasets. While combinations of corpora are generally found to work best, the single-best corpus is generally task-dependent.
Importance of utilizing medical information is getting increased as electronic health records (EHRs) are widely used nowadays. We aim to assign international standardized disease codes, ICD-10, to Japanese textual information in EHRs for users to reuse the information accurately. In this paper, we propose methods to automatically extract diagnosis and to assign ICD codes to Japanese medical records. Due to the lack of available training data, we dare employed rule-based methods rather than machine learning. We observed characteristics of medical records carefully, writing rules to make effective methods by hand. We applied our system to the NTCIR-12 MedNLPDoc shared task data where participants are required to assign ICD-10 codes of possible diagnosis in given EHRs. In this shared task, our system achieved the highest F-measure score among all participants in the most severe evaluation criteria. Through comparison with other approaches, we show that our approach could be a useful milestone for the future development of Japanese medical record processing.
In this work we present a fine-grained annotation schema to detect named entities in German clinical data of chronically ill patients with kidney diseases. The annotation schema is driven by the needs of our clinical partners and the linguistic aspects of German language. In order to generate annotations within a short period, the work also presents a semi-automatic annotation which uses additional sources of knowledge such as UMLS, to pre-annotate concepts in advance. The presented schema will be used to apply novel techniques from natural language processing and machine learning to support doctors treating their patients by improved information access from unstructured German texts.
In recent years, detecting Alzheimer disease (AD) in early stages based on natural language processing (NLP) has drawn much attention. To date, vocabulary size, grammatical complexity, and fluency have been studied using NLP metrics. However, the content analysis of AD narratives is still unreachable for NLP. This study investigates features of the words that AD patients use in their spoken language. After recruiting 18 examinees of 53–90 years old (mean: 76.89), they were divided into two groups based on MMSE scores. The AD group comprised 9 examinees with scores of 21 or lower. The healthy control group comprised 9 examinees with a score of 22 or higher. Linguistic Inquiry and Word Count (LIWC) classified words were used to categorize the words that the examinees used. The word frequency was found from observation. Significant differences were confirmed for the usage of impersonal pronouns in the AD group. This result demonstrated the basic feasibility of the proposed NLP-based detection approach.
The number of unstructured medical records kept in hospital information systems is increasing. The conditions of patients are formulated as outcomes in clinical pathway. A variance of an outcome describes deviations from standards of care like a patient’s bad condition. The present paper applied text mining to extract feature words and phrases of the variance from admission records. We report the cases the variances of “pain control” and “no neuropathy worsening” in cerebral infarction.
Clinical narratives in electronic health record systems are a rich resource of patient-based information. They constitute an ongoing challenge for natural language processing, due to their high compactness and abundance of short forms. German medical texts exhibit numerous ad-hoc abbreviations that terminate with a period character. The disambiguation of period characters is therefore an important task for sentence and abbreviation detection. This task is addressed by a combination of co-occurrence information of word types with trailing period characters, a large domain dictionary, and a simple rule engine, thus merging statistical and dictionary-based disambiguation strategies. An F-measure of 0.95 could be reached by using the unsupervised approach presented in this paper. The results are promising for a domain-independent abbreviation detection strategy, because our approach avoids retraining of models or use case specific feature engineering efforts required for supervised machine learning approaches.
The issue of privacy has always been a concern when clinical texts are used for research purposes. Personal health information (PHI) (such as name and identification number) needs to be removed so that patients cannot be identified. Manual anonymization is not feasible due to the large number of clinical texts to be anonymized. In this paper, we tackle the task of anonymizing clinical texts written in sentence fragments and which frequently contain symbols, abbreviations, and misspelled words. Our clinical texts therefore differ from those in the i2b2 shared tasks which are in prose form with complete sentences. Our clinical texts are also part of a structured database which contains patient name and identification number in structured fields. As such, we formulate our anonymization task as spelling variant detection, exploiting patients’ personal information in the structured fields to detect their spelling variants in clinical texts. We successfully anonymized clinical texts consisting of more than 200 million words, using minimum edit distance and regular expression patterns.
Distributional Semantic Models (DSMs) have recently received increased attention, together with the rise of neural architectures for scalable training of dense vector embeddings. While some of the literature even includes terms like ‘vectors’ and ‘dimensionality’ in the definition of DSMs, there are some good reasons why we should consider alternative formulations of distributional models. As an instance, I present a scalable graph-based solution to distributional semantics. The model belongs to the family of ‘count-based’ DSMs, keeps its representation sparse and explicit, and thus fully interpretable. I will highlight some important differences between sparse graph-based and dense vector approaches to DSMs: while dense vector-based models are computationally easier to handle and provide a nice uniform representation that can be compared and combined in many ways, they lack interpretability, provenance and robustness. On the other hand, graph-based sparse models have a more straightforward interpretation, handle sense distinctions more naturally and can straightforwardly be linked to knowledge bases, while lacking the ability to compare arbitrary lexical units and a compositionality operation. Since both representations have their merits, I opt for exploring their combination in the outlook.
Notwithstanding the success of the notion of construction, the computational tradition still lacks a way to represent the semantic content of these linguistic entities. Here we present a simple corpus-based model implementing the idea that the meaning of a syntactic construction is intimately related to the semantics of its typical verbs. It is a two-step process, that starts by identifying the typical verbs occurring with a given syntactic construction and building their distributional vectors. We then calculated the weighted centroid of these vectors in order to derive the distributional signature of a construction. In order to assess the goodness of our approach, we replicated the priming effect described by Johnson and Golberg (2013) as a function of the semantic distance between a construction and its prototypical verbs. Additional support for our view comes from a regression analysis showing that our distributional information can be used to model behavioral data collected with a crowdsourced elicitation experiment.
Regular polysemy was extensively investigated in lexical semantics, but this phenomenon has been very little studied in distributional semantics. We propose a model for regular polysemy detection that is based on sense vectors and allows to work directly with senses in semantic vector space. Our method is able to detect polysemous words that have the same regular sense alternation as in a given example (a word with two automatically induced senses that represent one polysemy pattern, such as ANIMAL / FOOD). The method works equally well for nouns, verbs and adjectives and achieves average recall of 0.55 and average precision of 0.59 for ten different polysemy patterns.
Recognizing various semantic relations between terms is beneficial for many NLP tasks. While path-based and distributional information sources are considered complementary for this task, the superior results the latter showed recently suggested that the former’s contribution might have become obsolete. We follow the recent success of an integrated neural method for hypernymy detection (Shwartz et al., 2016) and extend it to recognize multiple relations. The empirical results show that this method is effective in the multiclass setting as well. We further show that the path-based information source always contributes to the classification, and analyze the cases in which it mostly complements the distributional information.
The identification of semantic relations between terms within texts is a fundamental task in Natural Language Processing which can support applications requiring a lightweight semantic interpretation model. Currently, semantic relation classification concentrates on relations which are evaluated over open-domain data. This work provides a critique on the set of abstract relations used for semantic relation classification with regard to their ability to express relationships between terms which are found in a domain-specific corpora. Based on this analysis, this work proposes an alternative semantic relation model based on reusing and extending the set of abstract relations present in the DOLCE ontology. The resulting set of relations is well grounded, allows to capture a wide range of relations and could thus be used as a foundation for automatic classification of semantic relations.
The interaction between roots and patterns in Arabic has intrigued lexicographers and morphologists for centuries. While roots provide the consonantal building blocks, patterns provide the syllabic vocalic moulds. While roots provide abstract semantic classes, patterns realize these classes in specific instances. In this way both roots and patterns are indispensable for understanding the derivational, morphological and, to some extent, the cognitive aspects of the Arabic language. In this paper we perform lemmatization (a high-level lexical processing) without relying on a lookup dictionary. We use a hybrid approach that consists of a machine learning classifier to predict the lemma pattern for a given stem, and mapping rules to convert stems to their respective lemmas with the vocalization defined by the pattern.
In this paper we present a clean, yet effective, model for word sense disambiguation. Our approach leverage a bidirectional long short-term memory network which is shared between all words. This enables the model to share statistical strength and to scale well with vocabulary size. The model is trained end-to-end, directly from the raw text to sense labels, and makes effective use of word order. We evaluate our approach on two standard datasets, using identical hyperparameter settings, which are in turn tuned on a third set of held out data. We employ no external resources (e.g. knowledge graphs, part-of-speech tagging, etc), language specific features, or hand crafted rules, but still achieve statistically equivalent results to the best state-of-the-art systems, that employ no such limitations.
Language production is largely a matter of words which, in the case of access problems, can be searched for in an external resource (lexicon, thesaurus). In this kind of dialogue the user provides the momentarily available knowledge concerning the target and the system responds with the best guess(es) it can make given this input. As tip-of-the-tongue (ToT)-studies have shown, people always have some knowledge concerning the target (meaning fragments, number of syllables, ...) even if its complete form is eluding them. We will show here how to tap on this knowledge to build a resource likely to help authors (speakers/writers) to overcome the ToT-problem. Yet, before doing so we need a better understanding of the various kinds of knowledge people have when looking for a word. To this end, we asked crowdworkers to provide some cues to describe a given target and to specify then how each one of them relates to the target, in the hope that this could help others to find the elusive word. Next, we checked how well a given search strategy worked when being applied to differently built lexical networks. The results showed quite dramatic differences, which is not really surprising. After all, different networks are built for different purposes; hence each one of them is more or less suited for a given task. What was more surprising though is the fact that the relational information given by the users did not allow us to find the elusive word in WordNet better than without it.
The shared task of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V) aims at providing a common benchmark for testing current corpus-based methods for the identification of lexical semantic relations (synonymy, antonymy, hypernymy, part-whole meronymy) and at gaining a better understanding of their respective strengths and weaknesses. The shared task uses a challenging dataset extracted from EVALution 1.0, which contains word pairs holding the above-mentioned relations as well as semantically unrelated control items (random). The task is split into two subtasks: (i) identification of related word pairs vs. unrelated ones; (ii) classification of the word pairs according to their semantic relation. This paper describes the subtasks, the dataset, the evaluation metrics, the seven participating systems and their results. The best performing system in subtask 1 is GHHH (F1 = 0.790), while the best system in subtask 2 is LexNet (F1 = 0.445). The dataset and the task description are available at https://sites.google.com/site/cogalex2016/home/shared-task.
We present a submission to the CogALex 2016 shared task on the corpus-based identification of semantic relations, using LexNET (Shwartz and Dagan, 2016), an integrated path-based and distributional method for semantic relation classification. The reported results in the shared task bring this submission to the third place on subtask 1 (word relatedness), and the first place on subtask 2 (semantic relation classification), demonstrating the utility of integrating the complementary path-based and distributional information sources in recognizing concrete semantic relations. Combined with a common similarity measure, LexNET performs fairly good on the word relatedness task (subtask 1). The relatively low performance of LexNET and all other systems on subtask 2, however, confirms the difficulty of the semantic relation classification task, and stresses the need to develop additional methods for this task.
This paper describes our system submission to the CogALex-2016 Shared Task on Corpus-Based Identification of Semantic Relations. Our system won first place for Task-1 and second place for Task-2. The evaluation results of our system on the test set is 88.1% (79.0% for TRUE only) f-measure for Task-1 on detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task-2 on identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNNs) with word embeddings from publicly available word vectors. We found that linear regression performs better in the binary classification (Task-1), while CNNs have better performance in the multi-class semantic classification (Task-2). We assume that word analogy is more suited for deterministic answers rather than handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balancing the distribution of labels in the training data.
This contribution provides a strong baseline result for the CogALex-V shared task using a traditional “count”-type DSM (placed in rank 2 out of 7 in subtask 1 and rank 3 out of 6 in subtask 2). Parameter tuning experiments reveal some surprising effects and suggest that the use of random word pairs as negative examples may be problematic, guiding the parameter optimization in an undesirable direction.
In this paper, we describe ROOT 18, a classifier using the scores of several unsupervised distributional measures as features to discriminate between semantically related and unrelated words, and then to classify the related pairs according to their semantic relation (i.e. synonymy, antonymy, hypernymy, part-whole meronymy). Our classifier participated in the CogALex-V Shared Task, showing a solid performance on the first subtask, but a poor performance on the second subtask. The low scores reported on the second subtask suggest that distributional measures are not sufficient to discriminate between multiple semantic relations at once.
In this paper, we describe a system (CGSRC) for classifying four semantic relations: synonym, hypernym, antonym and meronym using convolutional neural networks (CNN). We have participated in CogALex-V semantic shared task of corpus-based identification of semantic relations. Proposed approach using CNN-based deep neural networks leveraging pre-compiled word2vec distributional neural embeddings achieved 43.15% weighted-F1 accuracy on subtask-1 (checking existence of a relation between two terms) and 25.24% weighted-F1 accuracy on subtask-2 (classifying relation types).
Automatic discovery of semantically-related words is one of the most important NLP tasks, and has great impact on the theoretical psycholinguistic modeling of the mental lexicon. In this shared task, we employ the word embeddings model to testify two thoughts explicitly or implicitly assumed by the NLP community: (1). Word embedding models can reflect syntagmatic similarities in usage between words to distances in projected vector space. (2). Word embedding models can reflect paradigmatic relationships between words.
The CogALex-V Shared Task provides two datasets that consists of pairs of words along with a classification of their semantic relation. The dataset for the first task distinguishes only between related and unrelated, while the second data set distinguishes several types of semantic relations. A number of recent papers propose to construct a feature vector that represents a pair of words by applying a pairwise simple operation to all elements of the feature vector. Subsequently, the pairs can be classified by training any classification algorithm on these vectors. In the present paper we apply this method to the provided datasets. We see that the results are not better than from the given simple baseline. We conclude that the results of the investigated method are strongly depended on the type of data to which it is applied.
This paper deals with a seldom studied object/oblique alternation phenomenon in Japanese, which. We call this the bump alternation. This phenomenon, first discussed by Sadanobu (1990), is similar to the English with/against alternation. For example, compare hit the wall with the bat [=immobile-as-direct-object frame] to hit the bat against the wall [=mobile-as-direct-object frame]). However, in the Japanese version, the case frame remains constant. Although we fundamentally question Sadanobu’s acceptability judgment, we also claim that the causation type (i.e., whether the event is an instance of onset or extended causation; Talmy, 1988; 2000) could make an improvement. An extended causative interpretation could improve the acceptability of the otherwise awkward immobile-as-direct-object frame. We examined this claim through a rating study, and the results showed an interaction between the Causation type (extended/onset) and the Object type (mobile/immobile) in the direction we predicted. We propose that a perspective shift on what is moving causes the “extended causation” advantage.
German particle verbs represent a frequent type of multi-word-expression that forms a highly productive paradigm in the lexicon. Similarly to other multi-word expressions, particle verbs exhibit various levels of compositionality. One of the major obstacles for the study of compositionality is the lack of representative gold standards of human ratings. In order to address this bottleneck, this paper presents such a gold standard data set containing 400 randomly selected German particle verbs. It is balanced across several particle types and three frequency bands, and accomplished by human ratings on the degree of semantic compositionality.
This paper presents a method to discover possible terminological relationships from tweets. We match the histories of terms (frequency patterns). Similar history indicates a possible relationship between terms. For example, if two terms (t1, t2) appeared frequently in Twitter at particular days, and there is a ‘similarity’ in the frequencies over a period of time, then t1 and t2 can be related. Maintaining standard terminological repository with updated relationships can be difficult; especially in a dynamic domain such as social media where thousands of new terms (neology) are coined every day. So we propose to construct a raw repository of lexical units with unconfirmed relationships. We have experimented our method on time-sensitive Arabic terms used by the online Arabic community of Twitter. We draw relationships between these terms by matching their similar frequency patterns (timelines). We use dynamic time warping as a similarity measure. For evaluation, we have selected 630 possible terms (we call them preterms) and we matched the similarity of these terms over a period of 30 days. Around 270 correct relationships were discovered with a precision of 0.61. These relationships were extracted without considering the textual context of the term.
A lexical function represents a type of relation that exists between lexical units (words or expressions) in any language. For example, the antonymy is a type of relation that is represented by the lexical function Anti: Anti(big) = small. Those relations include both paradigmatic relations, i.e. vertical relations, such as synonymy, antonymy and meronymy and syntagmatic relations, i.e. horizontal relations, such as objective qualification (legitimate demand), subjective qualification (fruitful analysis), positive evaluation (good review) and support verbs (pay a visit, subject to an interrogation). In this paper, we present the Lexical Functions Ontology Model (lexfom) to represent lexical functions and the relation among lexical units. Lexfom is divided in four modules: lexical function representation (lfrep), lexical function family (lffam), lexical function semantic perspective (lfsem) and lexical function relations (lfrel). Moreover, we show how it combines to Lexical Model for Ontologies (lemon), for the transformation of lexical networks into the semantic web formats. So far, we have implemented 100 simple and 500 complex lexical functions, and encoded about 8,000 syntagmatic and 46,000 paradigmatic relations, for the French language.
The objectives of the work described in this paper are: 1. To list the differences between a general language resource (namely FrameNet) and a domain-specific resource; 2. To devise solutions to merge their contents in order to increase the coverage of the general resource. Both resources are based on Frame Semantics (Fillmore 1985; Fillmore and Baker 2010) and this raises specific challenges since the theoretical framework and the methodology derived from it provide for both a lexical description and a conceptual representation. We propose a series of strategies that handle both lexical and conceptual (frame) differences and implemented them in the specialized resource. We also show that most differences can be handled in a straightforward manner. However, some more domain specific differences (such as frames defined exclusively for the specialized domain or relations between these frames) are likely to be much more difficult to take into account since some are domain-specific.
The present paper investigates the phenomenon of antonym canonicity by providing new behavioural and distributional evidence on Italian adjectives. Previous studies have showed that some pairs of antonyms are perceived to be better examples of opposition than others, and are so considered representative of the whole category (e.g., Deese, 1964; Murphy, 2003; Paradis et al., 2009). Our goal is to further investigate why such canonical pairs (Murphy, 2003) exist and how they come to be associated. In the literature, two different approaches have dealt with this issue. The lexical-categorical approach (Charles and Miller, 1989; Justeson and Katz, 1991) finds the cause of canonicity in the high co-occurrence frequency of the two adjectives. The cognitive-prototype approach (Paradis et al., 2009; Jones et al., 2012) instead claims that two adjectives form a canonical pair because they are aligned along a simple and salient dimension. Our empirical evidence, while supporting the latter view, shows that the paradigmatic distributional properties of adjectives can also contribute to explain the phenomenon of canonicity, providing a corpus-based correlate of the cognitive notion of salience.
Understanding the semantic relationships between terms is a fundamental task in natural language processing applications. While structured resources that can express those relationships in a formal way, such as ontologies, are still scarce, a large number of linguistic resources gathering dictionary definitions is becoming available, but understanding the semantic structure of natural language definitions is fundamental to make them useful in semantic interpretation tasks. Based on an analysis of a subset of WordNet’s glosses, we propose a set of semantic roles that compose the semantic structure of a dictionary definition, and show how they are related to the definition’s syntactic configuration, identifying patterns that can be used in the development of information extraction frameworks and semantic models.
Although quantifiers/classifiers expressions occur frequently in everyday communications or written documents, there is no description for them in classical bilingual paper dictionaries, nor in machine-readable dictionaries. The paper describes a corpus and dictionary development for quantifiers/classifiers, and their usage in the framework of French-Japanese machine translation (MT). They often cause problems of lexical ambiguity and of set phrase recognition during analysis, in particular for a long-distance language pair like French and Japanese. For the development of a dictionary aiming at ambiguity resolution for expressions including quantifiers and classifiers which may be ambiguous with common nouns, we have annotated our corpus with UWs (interlingual lexemes) of UNL (Universal Networking Language) found on the UNL-jp dictionary. The extraction of potential classifiers/quantifiers from corpus is made by UNLexplorer web service. Keywords : classifiers, quantifiers, phraseology study, corpus annotation, UNL (Universal Networking Language), UWs dictionary, Tori Bank, French-Japanese machine translation (MT).
Cancer (a.k.a neoplasms in a broader sense) is one of the leading causes of death worldwide and its incidence is expected to exacerbate. To respond to the critical need from the society, there have been rigorous attempts for the cancer research community to develop treatment for cancer. Accordingly, we observe a surge in the sheer volume of research products and outcomes in relation to neoplasms. In this talk, we introduce the notion of entitymetrics to provide a new lens for understanding the impact, trend, and diffusion of knowledge associated with neoplasms research. To this end, we collected over two million records from PubMed, the most popular search engine in the medical domain. Coupled with text mining techniques including named entity recognition, sentence boundary detection, string approximate matching, entitymetrics enables us to analyze knowledge diffusion, impact, and trend at various knowledge entity units, such as bio-entity, organization, and country. At the end of the talk, the future applications and possible directions of entitymetrics will be discussed.
The present paper explores a novel method that integrates efficient distributed representations with terminology extraction. We show that the information from a small number of observed instances can be combined with local and global word embeddings to remarkably improve the term extraction results on unigram terms. To do so we pass the terms extracted by other tools to a filter made of the local-global embeddings and a classifier which in turn decides whether or not a term candidate is a term. The filter can also be used as a hub to merge different term extraction tools into a single higher-performing system. We compare filters that use the skip-gram architecture and filters that employ the CBOW architecture for the task at hand.
In the paper, we address the problem of recognition of non-domain phrases in terminology lists obtained with an automatic term extraction tool. We focus on identification of multi-word phrases that are general terms and discourse function expressions. We tested several methods based on domain corpora comparison and a method based on contexts of phrases identified in a large corpus of general language. We compared the results of the methods to manual annotation. The results show that the task is quite hard as the inter-annotator agreement is low. Several tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method with the precision about 0.75 at the half of the tested list was the context based method using a modified contextual diversity coefficient.
This article presents a domain-driven algorithm for the task of term sense disambiguation (TSD). TSD aims at automatically choosing which term record from a term bank best represents the meaning of a term occurring in a particular context. In a translation environment, finding the contextually appropriate term record is necessary to access the proper equivalent to be used in the target language text. The term bank TERMIUM Plus, recently published as an open access repository, is chosen as a domain-rich resource for testing our TSD algorithm, using English and French as source and target languages. We devise an experiment using over 1300 English terms found in scientific articles, and show that our domain-driven TSD algorithm is able to bring the best term record, and therefore the best French equivalent, at the average rank of 1.69 compared to a baseline random rank of 3.51.
In this paper, we propose a method of augmenting existing bilingual terminologies. Our method belongs to a “generate and validate” framework rather than extraction from corpora. Although many studies have proposed methods to find term translations or to augment terminology within a “generate and validate” framework, few has taken full advantage of the systematic nature of terminologies. A terminology of a domain represents the conceptual system of the domain fairly systematically, and we contend that making use of the systematicity fully will greatly contribute to the effective augmentation of terminologies. This paper proposes and evaluates a novel method to generate bilingual term candidates by using existing terminologies and delving into their systematicity. Experiments have shown that our method can generate much better term candidate pairs than the existing method and give improved performance for terminology augmentation.
The extraction of data exemplifying relations between terms can make use, at least to a large extent, of techniques that are similar to those used in standard hybrid term candidate extraction, namely basic corpus analysis tools (e.g. tagging, lemmatization, parsing), as well as morphological analysis of complex words (compounds and derived items). In this article, we discuss the use of such techniques for the extraction of raw material for a description of relations between terms, and we provide internal evaluation data for the devices developed. We claim that user-generated content is a rich source of term variation through paraphrasing and reformulation, and that these provide relational data at the same time as term variants. Germanic languages with their rich word formation morphology may be particularly good candidates for the approach advocated here.
We investigate how both model-related factors and application-related factors affect the accuracy of distributional semantic models (DSMs) in the context of specialized lexicography, and how these factors interact. This holistic approach to the evaluation of DSMs provides valuable guidelines for the use of these models and insight into the kind of semantic information they capture.
We propose and evaluate a method for identifying co-hyponym lexical units in a terminological resource. The principles of term recognition and distributional semantics are combined to extract terms from a similar category of concept. Given a set of candidate terms, random projections are employed to represent them as low-dimensional vectors. These vectors are derived automatically from the frequency of the co-occurrences of the candidate terms and words that appear within windows of text in their proximity (context-windows). In a k-nearest neighbours framework, these vectors are classified using a small set of manually annotated terms which exemplify concept categories. We then investigate the interplay between the size of the corpus that is used for collecting the co-occurrences and a number of factors that play roles in the performance of the proposed method: the configuration of context-windows for collecting co-occurrences, the selection of neighbourhood size (k), and the choice of similarity metric.
Despite advances in computer technology, terminologists still tend to rely on manual work to extract all the semantic information that they need for the description of specialized concepts. In this paper we propose the creation of new word sketches in Sketch Engine for the extraction of semantic relations. Following a pattern-based approach, new sketch grammars are devel-oped in order to extract some of the most common semantic relations used in the field of ter-minology: generic-specific, part-whole, location, cause and function.
This paper presents the construction and evaluation of Japanese and English controlled bilingual terminologies that are particularly intended for controlled authoring and machine translation with special reference to the Japanese municipal domain. Our terminologies are constructed by extracting terms from municipal website texts, and the term variations are controlled by defining preferred and proscribed terms for both the source Japanese and the target English. To assess the coverage of the terms/concepts in the municipal domain and validate the quality of the control, we employ a quantitative extrapolation method that estimates the potential vocabulary size. Using Large-Number-of-Rare-Event (LNRE) modelling, we compare two parameters: (1) uncontrolled and controlled and (2) Japanese and English. The results show that our terminologies currently cover about 45–65% of the terms and 50–65% of the concepts in the municipal domain, and are well controlled. The detailed analysis of growth patterns of terminologies also provides insight into the extent to which we can enlarge the terminologies within the realistic range.
By its own nature, the Natural Language Processing (NLP) community is a priori the best equipped to study the evolution of its own publications, but works in this direction are rare and only recently have we seen a few attempts at charting the field. In this paper, we use the algorithms, resources, standards, tools and common practices of the NLP field to build a list of terms characteristic of ongoing research, by mining a large corpus of scientific publications, aiming at the largest possible exhaustivity and covering the largest possible time span. Study of the evolution of this term list through time reveals interesting insights on the dynamics of field and the availability of the term database and of the corpus (for a large part) make possible many further comparative studies in addition to providing a test field for a new graphic interface designed to perform visual time analytics of large sized thesauri.
Annotating medical text such as clinical notes with human phenotype descriptors is an important task that can, for example, assist in building patient profiles. To automatically annotate text one usually needs a dictionary of predefined terms. However, do to the variety of human expressiveness, current state-of-the art phenotype concept recognizers and automatic annotators struggle with specific domain issues and challenges. In this paper we present results of an-notating gold standard corpus with a dictionary containing lexical variants for the Human Phenotype Ontology terms. The main purpose of the dictionary is to improve the recall of phenotype concept recognition systems. We compare the method with four other approaches and present results.
We propose a semi-automatic method for the acquisition of specialised ontological and terminological knowledge. An ontology and a terminology are automatically built from domain experts’ annotations. The ontology formalizes the common and shared conceptual vocabulary of those experts. Its associated terminology defines a glossary linking annotated terms to their semantic categories. These two resources evolve incrementally and are used for an automatic annotation of a new corpus at each iteration. The annotated corpus concerns the evaluation of French higher education and science institutions.
With many hospitals digitalizing clinical records it has opened opportunities for researchers in NLP, Machine Learning to apply techniques for extracting meaning and make actionable insights. There has been previous attempts in mapping free text to medical nomenclature like UMLS, SNOMED. However, in this paper, we had analyzed diagnosis in clinical reports using ICD10 to achieve a lightweight, real-time predictions by introducing concepts like WordInfo, root word identification. We were able to achieve 68.3% accuracy over clinical records collected from qualified clinicians. Our study would further help the healthcare institutes in organizing their clinical reports based on ICD10 mappings and derive numerous insights to achieve operational efficiency and better medical care.
Topic Models have been reported to be beneficial for aspect-based sentiment analysis. This paper reports the first topic model for sarcasm detection, to the best of our knowledge. Designed on the basis of the intuition that sarcastic tweets are likely to have a mixture of words of both sentiments as against tweets with literal sentiment (either positive or negative), our hierarchical topic model discovers sarcasm-prevalent topics and topic-level sentiment. Using a dataset of tweets labeled using hashtags, the model estimates topic-level, and sentiment-level distributions. Our evaluation shows that topics such as ‘work’, ‘gun laws’, ‘weather’ are sarcasm-prevalent topics. Our model is also able to discover the mixture of sentiment-bearing words that exist in a text of a given sentiment-related label. Finally, we apply our model to predict sarcasm in tweets. We outperform two prior work based on statistical classifiers with specific features, by around 25%.
In this paper, we aim at identifying uncertainty cues in Hungarian social media texts. We present our machine learning based uncertainty detector which is based on a rich features set including lexical, morphological, syntactic, semantic and discourse-based features, and we evaluate our system on a small set of manually annotated social media texts. We also carry out cross-domain and domain adaptation experiments using an annotated corpus of standard Hungarian texts and show that domain differences significantly affect machine learning. Furthermore, we argue that differences among uncertainty cue types may also affect the efficiency of uncertainty detection.
There has been extensive work on detecting the level of committed belief (also known as “factuality”) that an author is expressing towards the propositions in his or her utterances. Previous work on English has revealed that this can be done as a sequence tagging task. In this paper, we investigate the same task for Chinese and Spanish, two very different languages from English and from each other.
The utilization of social media material in journalistic workflows is increasing, demanding automated methods for the identification of mis- and disinformation. Since textual contradiction across social media posts can be a signal of rumorousness, we seek to model how claims in Twitter posts are being textually contradicted. We identify two different contexts in which contradiction emerges: its broader form can be observed across independently posted tweets and its more specific form in threaded conversations. We define how the two scenarios differ in terms of central elements of argumentation: claims and conversation structure. We design and evaluate models for the two scenarios uniformly as 3-way Recognizing Textual Entailment tasks in order to represent claims and conversation structure implicitly in a generic inference model, while previous studies used explicit or no representation of these properties. To address noisy text, our classifiers use simple similarity features derived from the string and part-of-speech level. Corpus statistics reveal distribution differences for these features in contradictory as opposed to non-contradictory tweet relations, and the classifiers yield state of the art performance.
Negation and modality are two important grammatical phenomena that have attracted recent research attention as they can contribute to extra-propositional meaning aspects, among with factuality, attribution, irony and sarcasm. These aspects go beyond analysis such as semantic role labeling, and modeling them is important as a step towards a higher level of language understanding, which is needed for practical applications such as sentiment analysis. In this talk, I will go beyond English, and I will discuss how negation and modality are expressed in other languages. I will also go beyond sentiment analysis and I will present some challenges that the two phenomena pose for machine translation (MT). In particular, I will demonstrate how contemporary MT systems fail on them, and I will discuss some possible solutions.
This paper presents the main sources of disagreement found during the annotation of the Spanish SFU Review Corpus with negation (SFU ReviewSP -NEG). Negation detection is a challenge in most of the task related to NLP, so the availability of corpora annotated with this phenomenon is essential in order to advance in tasks related to this area. A thorough analysis of the problems found during the annotation could help in the study of this phenomenon.
This paper discusses the need for a dictionary of affixal negations and regular antonyms to facilitate their automatic detection in text. Without such a dictionary, affixal negations are very difficult to detect. In addition, we show that the set of affixal negations is not homogeneous, and that different NLP tasks may require different subsets. A dictionary can store the subtypes of affixal negations, making it possible to select a certain subset or to make inferences on the basis of these subtypes. We take a first step towards creating a negation dictionary by annotating all direct antonym pairs inWordNet using an existing typology of affixal negations. By highlighting some of the issues that were encountered in this annotation experiment, we hope to provide some insights into the necessary steps of building a negation dictionary.
The paper is a corpus study of the factors involved in disambiguating potential scope ambiguity in sentences with negation and universal quantifier, such as “I don’t want talk to all these people”, which can alternatively mean ‘I don’t want to talk to any of these people’ and ‘I don’t want to talk to some of these people’. The relevant factors are demonstrated to be largely different from those involved in disambiguating lexical polysemy. They include the syntactic function of the constituent containing “all” quantifier (subject, direct complement, adjunct), as well as the deepness of its embedding; the status of the main predicate and “all” constituent with respect to the information structure of the 6utterance (topic vs. focus, given vs. new information); pragmatic implicatures pertaining to the situations described in the utterances.
In this talk, I will outline a range of challenges presented by multiword expressions in terms of (lexicalist) precision grammar engineering, and different strategies for accommodating those challenges, in an attempt to strike the right balance in terms of generalisation and over- and under-generation.
Microsyntactic linguistic units, such as syntactic idioms and non-standard syntactic constructions, are poorly represented in linguistic resources, mostly because the former are elements occupying an intermediate position between the lexicon and the grammar and the latter are too specific to be routinely tackled by general grammars. Consequently, many such units produce substantial gaps in systems intended to solve sophisticated computational linguistics tasks, such as parsing, deep semantic analysis, question answering, machine translation, or text generation. They also present obstacles for applying advanced techniques to these tasks, such as machine learning. The paper discusses an approach aimed at bridging such gaps, focusing on the development of monolingual and multilingual corpora where microsyntactic units are to be tagged.
An excellent example of a phenomenon bridging a lexicon and a grammar is provided by grammaticalized alternations (e.g., passivization, reflexivity, and reciprocity): these alternations represent productive grammatical processes which are, however, lexically determined. While grammaticalized alternations keep lexical meaning of verbs unchanged, they are usually characterized by various changes in their morphosyntactic structure. In this contribution, we demonstrate on the example of reciprocity and its representation in the valency lexicon of Czech verbs, VALLEX how a linguistic description of complex (and still systemic) changes characteristic of grammaticalized alternations can benefit from an integration of grammatical rules into a valency lexicon. In contrast to other types of grammaticalized alternations, reciprocity in Czech has received relatively little attention although it closely interacts with various linguistic phenomena (e.g., with light verbs, diatheses, and reflexivity).
Language-endowed intelligent agents benefit from leveraging lexical knowledge falling at different points along a spectrum of compositionality. This means that robust computational lexicons should include not only the compositional expectations of argument-taking words, but also non-compositional collocations (idioms), semi-compositional collocations that might be difficult for an agent to interpret (e.g., standard metaphors), and even collocations that could be compositionally analyzed but are so frequently encountered that recording their meaning increases the efficiency of interpretation. In this paper we argue that yet another type of string-to-meaning mapping can also be useful to intelligent agents: remembered semantic analyses of actual text inputs. These can be viewed as super-specific multi-word expressions whose recorded interpretations mimic a person’s memories of knowledge previously learned from language input. These differ from typical annotated corpora in two ways. First, they provide a full, context-sensitive semantic interpretation rather than select features. Second, they are are formulated in the ontologically-grounded metalanguage used in a particular agent environment, meaning that the interpretations contribute to the dynamically evolving cognitive capabilites of agents configured in that environment.
Universal Dependencies is an initiative to develop cross-linguistically consistent grammatical annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning and parsing research from a language typology perspective. It assumes a dependency-based approach to syntax and a lexicalist approach to morphology, which together entail that the fundamental units of grammatical annotation are words. Words have properties captured by morphological annotation and enter into relations captured by syntactic annotation. Moreover, priority is given to relations between lexical content words, as opposed to grammatical function words. In this position paper, I discuss how this approach allows us to capture similarities and differences across typologically diverse languages.
Human communication is a multimodal activity, involving not only speech and written expressions, but intonation, images, gestures, visual clues, and the interpretation of actions through perception. In this paper, we describe the design of a multimodal lexicon that is able to accommodate the diverse modalities that present themselves in NLP applications. We have been developing a multimodal semantic representation, VoxML, that integrates the encoding of semantic, visual, gestural, and action-based features associated with linguistic expressions.
Valency slot filling is a semantic glue, which brings together the meanings of words. As regards the position of an argument in the dependency structure with respect to its predicate, there exist three types of valency filling: active (canonical), passive, and discontinuous. Of these, the first type is studied much better than the other two. As a rule, canonical actants are unambiguously marked in the syntactic structure, and each actant corresponds to a unique syntactic position. Linguistic information on which syntactic function an actant might have (subject, direct or indirect object), what its morphological form should be and which prepositions or conjunctions it requires, can be given in the lexicon in the form of government patterns, subcategorization frames, or similar data structures. We concentrate on non-canonical cases of valency filling in Russian, which are characteristic of non-verbal parts of speech, such as adverbs, adjectives, and particles, in the first place. They are more difficult to handle than canonical ones, because the position of the actant in the tree is governed by more complicated rules. A valency may be filled by expressions occupying different syntactic positions, and a syntactic position may accept expressions filling different valencies of the same word. We show how these phenomena can be processed in a semantic analyzer.
Verbenet is a French lexicon developed by “translation” of its English counterpart — VerbNet (Kipper-Schuler, 2005)—and treatment of the specificities of French syntax (Pradet et al., 2014; Danlos et al., 2016). One difficulty encountered in its development springs from the fact that the list of (potentially numerous) frames has no internal organization. This paper proposes a type system for frames that shows whether two frames are variants of a given alternation. Frame typing facilitates coherence checking of the resource in a “virtuous circle”. We present the principles underlying a program we developed and used to automatically type frames in VerbeNet. We also show that our system is portable to other languages.
We present an attempt to automatically identify Czech deverbative nouns using several methods that use large corpora as well as existing lexical resources. The motivation for the task is to extend a verbal valency (i.e., predicate-argument) lexicon by adding nouns that share the valency properties with the base verb, assuming their properties can be derived (even if not trivially) from the underlying verb by deterministic grammatical rules. At the same time, even in inflective languages, not all deverbatives are simply created from their underlying base verb by regular lexical derivation processes. We have thus developed hybrid techniques that use both large parallel corpora and several standard lexical resources. Thanks to the use of parallel corpora, the resulting sets contain also synonyms, which the lexical derivation rules cannot get. For evaluation, we have manually created a small, 100-verb gold data since no such dataset was initially available for Czech.
We present an interdisciplinary study on the interaction between the interpretation of noun-noun deverbal compounds (DCs; e.g., task assignment) and the morphosyntactic properties of their deverbal heads in English. Underlying hypotheses from theoretical linguistics are tested with tools and resources from computational linguistics. We start with Grimshaw’s (1990) insight that deverbal nouns are ambiguous between argument-supporting nominal (ASN) readings, which inherit verbal arguments (e.g., the assignment of the tasks), and the less verbal and more lexicalized Result Nominal and Simple Event readings (e.g., a two-page assignment). Following Grimshaw, our hypothesis is that the former will realize object arguments in DCs, while the latter will receive a wider range of interpretations like root compounds headed by non-derived nouns (e.g., chocolate box). Evidence from a large corpus assisted by machine learning techniques confirms this hypothesis, by showing that, besides other features, the realization of internal arguments by deverbal heads outside compounds (i.e., the most distinctive ASN-property in Grimshaw 1990) is a good predictor for an object interpretation of non-heads in DCs.
We show how to turn a large-scale syntactic dictionary into a dependency-based unification grammar where each piece of lexical information calls a separate rule, yielding a super granular grammar. Subcategorization, raising and control verbs, auxiliaries and copula, passivization, and tough-movement are discussed. We focus on the semantics-syntax interface and offer a new perspective on syntactic structure.
This paper presents our ongoing work on compilation of English multi-word expression (MWE) lexicon. We are especially interested in collecting flexible MWEs, in which some other components can intervene the expression such as “a number of” vs “a large number of” where a modifier of “number” can be placed in the expression and inherit the original meaning. We fiest collect possible candidates of flexible English MWEs from the web, and annotate all of their occurrences in the Wall Street Journal portion of Ontonotes corpus. We make use of word dependency strcuture information of the sentences converted from the phrase structure annotation. This process enables semi-automatic annotation of MWEs in the corpus and simultanaously produces the internal and external dependency representation of flexible MWEs.
The paper presents a contrastive description of reflexive possessive pronouns “svůj” in Czech and “svoj” in Russian. The research concerns syntactic, semantic and pragmatic aspects. With our analysis, we shed a new light on the already investigated issue, which comes from a detailed comparison of the phenomenon of possessive reflexivization in two typologically and genetically similar languages. We show that whereas in Czech, the possessive reflexivization is mostly limited to syntactic functions and does not go beyond the grammar, in Russian it gets additional semantic meanings and moves substan-tially towards the lexicon. The obtained knowledge allows us to explain heretofore unclear marginal uses of reflexives in each language.
A specific language as used by different speakers and in different situations has a number of more or less distant varieties. Extending the notion of non-standard language to varieties that do not fit an explicitly or implicitly assumed norm or pattern, we look for methods and tools that could be applied to this domain. The needs start from the theoretical side: categories usable for the analysis of non-standard language are not readily available, and continue to methods and tools required for its detection and diagnostics. A general discussion of issues related to non-standard language is followed by two case studies. The first study presents a taxonomy of morphosyntactic categories as an attempt to analyse non-standard forms produced by non-native learners of Czech. The second study focusses on the role of a rule-based grammar and lexicon in the process of building and using a parsebank.
fast align is a simple and fast word alignment tool which is widely used in state-of-the-art machine translation systems. It yields comparable results in the end-to-end translation experiments of various language pairs. However, fast align does not perform as well as GIZA++ when applied to language pairs with distinct word orders, like English and Japanese. In this paper, given the lexical translation table output by fast align, we propose to realign words using the hierarchical sub-sentential alignment approach. Experimental results show that simple additional processing improves the performance of word alignment, which is measured by counting alignment matches in comparison with fast align. We also report the result of final machine translation in both English-Japanese and Japanese-English. We show our best system provided significant improvements over the baseline as measured by BLEU and RIBES.
This paper presents the comparison of how using different neural network based language modeling tools for selecting the best candidate fragments affects the final output translation quality in a hybrid multi-system machine translation setup. Experiments were conducted by comparing perplexity and BLEU scores on common test cases using the same training data set. A 12-gram statistical language model was selected as a baseline to oppose three neural network based models of different characteristics. The models were integrated in a hybrid system that depends on the perplexity score of a sentence fragment to produce the best fitting translations. The results show a correlation between language model perplexity and BLEU scores as well as overall improvements in BLEU.
We present a novel method of comparable corpora construction. Unlike the traditional methods which heavily rely on linguistic features, our method only takes image similarity into consid-eration. We use an image-image search engine to obtain similar images, together with the cap-tions in source language and target language. On the basis, we utilize captions of similar imag-es to construct sentence-level bilingual corpora. Experiments on 10,371 target captions show that our method achieves a precision of 0.85 in the top search results.
We present an algorithm for predicting translation equivalents between two languages, based on the corresponding WordNets. The assumption is that all synsets of one of the languages are linked to the corresponding synsets in the other language. In theory, given the exact sense of a word in a context it must be possible to translate it as any of the words in the linked synset. In practice, however, this does not work well since automatic and accurate sense disambiguation is difficult. Instead it is possible to define a more robust translation relation between the lexemes of the two languages. As far as we know the Finnish WordNet is the only one that includes that relation. Our algorithm can be used to predict the relation for other languages as well. This is useful for instance in hybrid machine translation systems which are usually more dependent on high-quality translation dictionaries.
Traditional machine translation evaluation metrics such as BLEU and WER have been widely used, but these metrics have poor correlations with human judgements because they badly represent word similarity and impose strict identity matching. In this paper, we propose some modifications to the traditional measures based on word embeddings for these two metrics. The evaluation results show that our modifications significantly improve their correlation with human judgements.
We describe experiments in Machine Translation using word sense disambiguation (WSD) information. This work focuses on WSD in verbs, based on two different approaches – verbal patterns based on corpus pattern analysis and verbal word senses from valency frames. We evaluate several options of using verb senses in the source-language sentences as an additional factor for the Moses statistical machine translation system. Our results show a statistically significant translation quality improvement in terms of the BLEU metric for the valency frames approach, but in manual evaluation, both WSD methods bring improvements.
We introduce a new statistical machine translation approach specifically geared to learning translation from low resource languages, that exploits monolingual English semantic parsing to bias inversion transduction grammar (ITG) induction. We show that in contrast to conventional statistical machine translation (SMT) training methods, which rely heavily on phrase memorization, our approach focuses on learning bilingual correlations that help translating low resource languages, by using the output language semantic structure to further narrow down ITG constraints. This approach is motivated by previous research which has shown that injecting a semantic frame based objective function while training SMT models improves the translation quality. We show that including a monolingual semantic objective function during the learning of the translation model leads towards a semantically driven alignment which is more efficient than simply tuning loglinear mixture weights against a semantic frame based evaluation metric in the final stage of statistical machine translation training. We test our approach with three different language pairs and demonstrate that our model biases the learning towards more semantically correct alignments. Both GIZA++ and ITG based techniques fail to capture meaningful bilingual constituents, which is required when trying to learn translation models for low resource languages. In contrast, our proposed model not only improve translation by injecting a monolingual objective function to learn bilingual correlations during early training of the translation model, but also helps to learn more meaningful correlations with a relatively small data set, leading to a better alignment compared to either conventional ITG or traditional GIZA++ based approaches.
We defend that bilingual lexicons automatically extracted from parallel corpora, whose entries have been meanwhile validated by linguists and classified as correct or incorrect, should constitute a specific parallel corpora. And, in this paper, we propose to use word-to-word translations to learn morph-units (comprising of bilingual stems and suffixes) from those bilingual lexicons for two language pairs L1-L2 and L1-L3 to induce a bilingual lexicon for the language pair L2-L3, apart from also learning morph-units for this other language pair. The applicability of bilingual morph-units in L1-L2 and L1-L3 is examined from the perspective of pivot-based lexicon induction for language pair L2-L3 with L1 as bridge. While the lexicon is derived by transitivity, the correspondences are identified based on previously learnt bilingual stems and suffixes rather than surface translation forms. The induced pairs are validated using a binary classifier trained on morphological and similarity-based features using an existing, automatically acquired, manually validated bilingual translation lexicon for language pair L2-L3. In this paper, we discuss the use of English (EN)-French (FR) and English (EN)-Portuguese (PT) lexicon of word-to-word translations in generating word-to-word translations for the language pair FR-PT with EN as pivot language. Generated translations are filtered out first using an SVM-based FR-PT classifier and then are manually validated.
The availability of Language Technology Resources and Tools generates a considerable methodological potential in the Digital Humanities: aspects of research questions from the Humanities and Social Sciences can be addressed on text collections in ways that were unavailable to traditional approaches. I start this talk by sketching some sample scenarios of Digital Humanities projects which involve various Humanities and Social Science disciplines, noting that the potential for a meaningful contribution to higher-level questions is highest when the employed language technological models are carefully tailored both (a) to characteristics of the given target corpus, and (b) to relevant analytical subtasks feeding the discipline-specific research questions. Keeping up a multidisciplinary perspective, I then point out a recurrent dilemma in Digital Humanities projects that follow the conventional set-up of collaboration: to build high-quality computational models for the data, fixed analytical targets should be specified as early as possible – but to be able to respond to Humanities questions as they evolve over the course of analysis, the analytical machinery should be kept maximally flexible. To reach both, I argue for a novel collaborative culture that rests on a more interleaved, continuous dialogue. (Re-)Specification of analytical targets should be an ongoing process in which the Humanities Scholars and Social Scientists play a role that is as important as the Computational Scientists’ role. A promising approach lies in the identification of re-occurring types of analytical subtasks, beyond linguistic standard tasks, which can form building blocks for text analysis across disciplines, and for which corpus-based characterizations (viz. annotations) can be collected, compared and revised. On such grounds, computational modeling is more directly tied to the evolving research questions, and hence the seemingly opposing needs of reliable target specifications vs. “malleable” frameworks of analysis can be reconciled. Experimental work following this approach is under way in the Center for Reflected Text Analytics (CRETA) in Stuttgart.
We examine two different methods for finding rising words (among which neologisms) and falling words (among which archaisms) in decades of magazine texts (millions of words) and in years of tweets (billions of words): one based on correlation coefficients of relative frequencies and time, and one based on comparing initial and final word frequencies of time intervals. We find that smoothing frequency scores improves the precision scores of both methods and that the correlation coefficients perform better on magazine text but worse on tweets. Since the two ranking methods find different words they can be used in side-by-side to study the behavior of words over time.
Multimodal question answering in the cultural heritage domain allows visitors to ask questions in a more natural way and thus provides better user experiences with cultural objects while visiting a museum, landmark or any other historical site. In this paper, we introduce the construction of a golden standard dataset that will aid research of multimodal question answering in the cultural heritage domain. The dataset, which will be soon released to the public, contains multimodal content including images of typical artworks from the fascinating old-Egyptian Amarna period, related image-containing documents of the artworks and over 800 multimodal queries integrating visual and textual questions. The multimodal questions and related documents are all in English. The multimodal questions are linked to relevant paragraphs in the related documents that contain the answer to the multimodal query.
In this paper a social network is extracted from a literary text. The social network shows, how frequent the characters interact and how similar their social behavior is. Two types of similarity measures are used: the first applies co-occurrence statistics, while the second exploits cosine similarity on different types of word embedding vectors. The results are evaluated by a paid micro-task crowdsourcing survey. The experiments suggest that specific types of word embeddings like word2vec are well-suited for the task at hand and the specific circumstances of literary fiction text.
We present an approach to detect differences in lexical semantics across English language registers, using word embedding models from distributional semantics paradigm. Models trained on register-specific subcorpora of the BNC corpus are employed to compare lists of nearest associates for particular words and draw conclusions about their semantic shifts depending on register in which they are used. The models are evaluated on the task of register classification with the help of the deep inverse regression approach. Additionally, we present a demo web service featuring most of the described models and allowing to explore word meanings in different English registers and to detect register affiliation for arbitrary texts. The code for the service can be easily adapted to any set of underlying models.
We are constructing an annotated diachronic corpora of the Japanese language. In part of thiswork, we construct a corpus of Manyosyu, which is an old Japanese poetry anthology. In thispaper, we describe how to align the transcribed text and its original text semiautomatically to beable to cross-reference them in our Manyosyu corpus. Although we align the original charactersto the transcribed words manually, we preliminarily align the transcribed and original charactersby using an unsupervised automatic alignment technique of statistical machine translation toalleviate the work. We found that automatic alignment achieves an F1-measure of 0.83; thus, each poem has 1–2 alignment errors. However, finding these errors and modifying them are less workintensiveand more efficient than fully manual annotation. The alignment probabilities can beutilized in this modification. Moreover, we found that we can locate the uncertain transcriptionsin our corpus and compare them to other transcriptions, by using the alignment probabilities.
Arabic is a widely-spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case-studies in which we show its application to the digital humanities.
We here describe a novel methodology for measuring affective language in historical text by expanding an affective lexicon and jointly adapting it to prior language stages. We automatically construct a lexicon for word-emotion association of 18th and 19th century German which is then validated against expert ratings. Subsequently, this resource is used to identify distinct emotional patterns and trace long-term emotional trends in different genres of writing spanning several centuries.
Historical treebanks tend to be manually annotated, which is not surprising, since state-of-the-art parsers are not accurate enough to ensure high-quality annotation for historical texts. We test whether automatic parsing can be an efficient pre-annotation tool for Old East Slavic texts. We use the TOROT treebank from the PROIEL treebank family. We convert the PROIEL format to the CONLL format and use MaltParser to create syntactic pre-annotation. Using the most conservative evaluation method, which takes into account PROIEL-specific features, MaltParser by itself yields 0.845 unlabelled attachment score, 0.779 labelled attachment score and 0.741 secondary dependency accuracy (note, though, that the test set comes from a relatively simple genre and contains rather short sentences). Experiments with human annotators show that preparsing, if limited to sentences where no changes to word or sentence boundaries are required, increases their annotation rate. For experienced annotators, the speed gain varies from 5.80% to 16.57%, for inexperienced annotators from 14.61% to 32.17% (using conservative estimates). There are no strong reliable differences in the annotation accuracy, which means that there is no reason to suspect that using preparsing might lower the final annotation quality.
In this paper we will discuss a method for data visualization together with its potential usefulness in digital humanities and philosophy of language. We compiled a multilingual parallel corpus from different versions of Wittgenstein’s Tractatus Logico-philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.
We introduce the third major release of WebAnno, a generic web-based annotation tool for distributed teams. New features in this release focus on semantic annotation tasks (e.g. semantic role labelling or event annotation) and allow the tight integration of semantic annotations with syntactic annotations. In particular, we introduce the concept of slot features, a novel constraint mechanism that allows modelling the interaction between semantic and syntactic annotations, as well as a new annotation user interface. The new features were developed and used in an annotation project for semantic roles on German texts. The paper briefly introduces this project and reports on experiences performing annotations with the new tool. On a comparative evaluation, our tool reaches significant speedups over WebAnno 2 for a semantic annotation task.
Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many down-stream applications, from machine translation to digital historiography, enabling Classicists, historians, and archaeologists for instance, to track the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality.
We present ANNISVis, a webapp for comparative visualization of geographical distribution of linguistic data, as well as a sample deployment for a corpus of Middle High German texts. Unlike existing geographical visualization solutions, which work with pre-existing data sets, or are bound to specific corpora, ANNISVis allows the user to formulate multiple ad-hoc queries and visualizes them on a map, and it can be configured for any corpus that can be imported into ANNIS. This enables explorative queries of the quantitative aspects of a corpus with geographical features. The tool will be made available to download in open source.
In the Danish CLARIN-DK infrastructure, chaining language technology (LT) tools into a workflow is easy even for a non-expert user, because she only needs to specify the input and the desired output of the workflow. With this information and the registered input and output profiles of the available tools, the CLARIN-DK workflow management system (WMS) computes combinations of tools that will give the desired result. This advanced functionality was originally not envisaged, but came within reach by writing the WMS partly in Java and partly in a programming language for symbolic computation, Bracmat. Handling LT tool profiles, including the computation of workflows, is easier with Bracmat’s language constructs for tree pattern matching and tree construction than with the language constructs offered by mainstream programming languages.
Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in non-accessible formats for machine processing (e.g., Historical or Legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often renders MT ineffective. In this paper, we propose a new OCR to MT framework based on adding a new OCR error correction module to enhance the overall quality of translation. Experimentation shows that our new system correction based on the combination of Language Modeling and Translation methods outperforms the baseline system by nearly 30% relative improvement.
In this paper we describe how the complexity of human communication can be analysed with the help of language technology. We present the HuComTech corpus, a multimodal corpus containing 50 hours of videotaped interviews containing a rich annotation of about 2 million items annotated on 33 levels. The corpus serves as a general resource for a wide range of re-search addressing natural conversation between humans in their full complexity. It can benefit particularly digital humanities researchers working in the field of pragmatics, conversational analysis and discourse analysis. We will present a number of tools and automated methods that can help such enquiries. In particular, we will highlight the tool Theme, which is designed to uncover hidden temporal patterns (called T-patterns) in human interaction, and will show how it can applied to the study of multimodal communication.
Most modern and post-modern poems have developed a post-metrical idea of lyrical prosody that employs rhythmical features of everyday language and prose instead of a strict adherence to rhyme and metrical schemes. This development is subsumed under the term free verse prosody. We present our methodology for the large-scale analysis of modern and post-modern poetry in both their written form and as spoken aloud by the author. We employ language processing tools to align text and speech, to generate a null-model of how the poem would be spoken by a naïve reader, and to extract contrastive prosodic features used by the poet. On these, we intend to build our model of free verse prosody, which will help to understand, differentiate and relate the different styles of free verse poetry. We plan to use our processing scheme on large amounts of data to iteratively build models of styles, to validate and guide manual style annotation, to identify further rhythmical categories, and ultimately to broaden our understanding of free verse poetry. In this paper, we report on a proof-of-concept of our methodology using smaller amounts of poems and a limited set of features. We find that our methodology helps to extract differentiating features in the authors’ speech that can be explained by philological insight. Thus, our automatic method helps to guide the literary analysis and this in turn helps to improve our computational models.
This paper presents a tool to investigate the design of multimodal instructions (MIs), i.e., instructions that contain both text and pictures. The benefit of including pictures in information presentation has been established, but the characteristics of those pictures and of their textual counterparts and the rela-tion(s) between them have not been researched in a systematic manner. We present the PAT Work-bench, a tool to store, annotate and retrieve MIs based on a validated coding scheme with currently 42 categories that describe instructions in terms of textual features, pictorial elements, and relations be-tween text and pictures. We describe how the PAT Workbench facilitates collaborative annotation and inter-annotator agreement calculation. Future work on the tool includes expanding its functionality and usability by (i) making the MI annotation scheme dynamic for adding relevant features based on empirical evaluations of the MIs, (ii) implementing algorithms for automatic tagging of MI features, and (iii) implementing automatic MI evaluation algorithms based on results obtained via e.g. crowdsourced assessments of MIs.
The increasing amount of multilingual text collections available in different domains makes its automatic processing essential for the development of a given field. However, standard processing techniques based on statistical clues and keyword searches have clear limitations. Instead, we propose a knowledge-based processing pipeline which overcomes most of the limitations of these techniques. This, in turn, enables direct comparison across texts in different languages without the need of translation. In this paper we show the potential of this approach for semantically indexing multilingual text collections in the history domain. In our experiments we used a version of the Bible translated in four different languages, evaluating the precision of our semantic indexing pipeline and showing its reliability on the cross-lingual text retrieval task.
This paper presents on-going work on creating NLP tools for under-resourced languages from very sparse training data coming from linguistic field work. In this work, we focus on Ingush, a Nakh-Daghestanian language spoken by about 300,000 people in the Russian republics Ingushetia and Chechnya. We present work on morphosyntactic taggers trained on transcribed and linguistically analyzed recordings and dependency parsers using English glosses to project annotation for creating synthetic treebanks. Our preliminary results are promising, supporting the goal of bootstrapping efficient NLP tools with limited or no task-specific annotated data resources available.
This paper gives an overview of the MultiTal project, which aims to create a research infrastructure that ensures long-term distribution of NLP tools descriptions. The goal is to make NLP tools more accessible and usable to end-users of different disciplines. The infrastructure is built on a meta-data scheme modelling and standardising multilingual NLP tools documentation. The model is conceptualised using an OWL ontology. The formal representation of the ontology allows us to automatically generate organised and structured documentation in different languages for each represented tool.
This article describes work on enabling the addition of temporal information to senses of words in linguistic linked open data lexica based on the lemonDia model. Our contribution in this article is twofold. On the one hand, we demonstrate how lemonDia enables the querying of diachronic lexical datasets using OWL-oriented Semantic Web based technologies. On the other hand, we present a preliminary version of an interactive interface intended to help users in creating lexical datasets that model meaning change over time.
(This is the abstract for the submission.) Large-scale comparisons between the poetry of Tang and Song dynasties shed light on how words and expressions were used and shared among the poets. That some words were used only in the Tang poetry and some only in the Song poetry could lead to interesting research in linguistics. That the most frequent colors are different in the Tang and Song poetry provides a trace of the changing social circumstances in the dynasties. Results of the current work link to research topics of lexicography, semantics, and social transitions. We discuss our findings and present our algorithms for efficient comparisons among the poems, which are crucial for completing billion times of comparisons within acceptable time.
The following paper describes the first steps in the development of an ontology for the textbook research discipline. The aim of the project WorldViews is to establish a digital edition focussing on views of the world depicted in textbooks. For this purpose an initial TEI profile has been formalised and tested as a use case to enable the semantical encoding of the resource ‘textbook’. This profile shall provide a basic data model describing major facets of the textbook’s structure relevant to historians.
In this paper we present a new combination of existing language tools for Polish with a popular data mining platform intended to help researchers from digital humanities perform computational analyses without any programming. The toolset includes RapidMiner Studio, a software solution offering graphical setup of integrated analytical processes and Multiservice, a Web service offering access to several state-of-the-art linguistic tools for Polish. The setting is verified in a simple task of counting frequencies of unknown words in a small corpus.
Learning functional expressions is one of the difficulties for language learners, since functional expressions tend to have multiple meanings and complicated usages in various situations. In this paper, we report an experiment of simplifying example sentences of Japanese functional expressions especially for Chinese-speaking learners. For this purpose, we developed “Japanese Functional Expressions List” and “Simple Japanese Replacement List”. To evaluate the method, we conduct a small-scale experiment with Chinese-speaking learners on the effectiveness of the simplified example sentences. The experimental results indicate that simplified sentences are helpful in learning Japanese functional expressions.
In learning Asian languages, learners encounter the problem of character types that are different from those in their first language, for instance, between Chinese characters and the Latin alphabet. This problem also affects listening because learners reconstruct letters from speech sounds. Hence, special attention should be paid to listening practice for learners of Asian languages. However, to our knowledge, few studies have evaluated the ease of listening comprehension (listenability) in Asian languages. Therefore, as a pilot study of listenability in Asian languages, we developed a measurement method for learners of English in order to examine the discriminability of linguistic and learner features. The results showed that the accuracy of our method outperformed a simple majority vote, which suggests that a combination of linguistic and learner features should be used to measure listenability in Asian languages as well as in English.
We propose a new approach for extracting argument structure from natural language texts that contain an underlying argument. Our approach comprises of two phases: Score Assignment and Structure Prediction. The Score Assignment phase trains models to classify relations between argument units (Support, Attack or Neutral). To that end, different training strategies have been explored. We identify different linguistic and lexical features for training the classifiers. Through ablation study, we observe that our novel use of word-embedding features is most effective for this task. The Structure Prediction phase makes use of the scores from the Score Assignment phase to arrive at the optimal structure. We perform experiments on three argumentation datasets, namely, AraucariaDB, Debatepedia and Wikipedia. We also propose two baselines and observe that the proposed approach outperforms baseline systems for the final task of Structure Prediction.
We address the problem of automatic short answer grading, evaluating a collection of approaches inspired by recent advances in distributional text representations. In addition, we propose an unsupervised approach for determining text similarity using one-to-many alignment of word vectors. We evaluate the proposed technique across two datasets from different domains, namely, computer science and English reading comprehension, that additionally vary between highschool level and undergraduate students. Experiments demonstrate that the proposed technique often outperforms other compositional distributional semantics approaches as well as vector space methods such as latent semantic analysis. When combined with a scoring scheme, the proposed technique provides a powerful tool for tackling the complex problem of short answer grading. We also discuss a number of other key points worthy of consideration in preparing viable, easy-to-deploy automatic short-answer grading systems for the real-world.
Word embeddings are now ubiquitous forms of word representation in natural language processing. There have been applications of word embeddings for monolingual word sense disambiguation (WSD) in English, but few comparisons have been done. This paper attempts to bridge that gap by examining popular embeddings for the task of monolingual English WSD. Our simplified method leads to comparable state-of-the-art performance without expensive retraining. Cross-Lingual WSD – where the word senses of a word in a source language come from a separate target translation language – can also assist in language learning; for example, when providing translations of target vocabulary for learners. Thus we have also applied word embeddings to the novel task of cross-lingual WSD for Chinese and provide a public dataset for further benchmarking. We have also experimented with using word embeddings for LSTM networks and found surprisingly that a basic LSTM network does not work well. We discuss the ramifications of this outcome.
This paper presents the NLP-TEA 2016 shared task for Chinese grammatical error diagnosis which seeks to identify grammatical error types and their range of occurrence within sentences written by learners of Chinese as foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 15 teams registered for this shared task, 9 teams developed the system and submitted a total of 36 runs. We expected this evaluation campaign could lead to the development of more advanced NLP techniques for educational applications, especially for Chinese error detection. All data sets with gold standards and scoring scripts are made publicly available to researchers.
Grammatical error diagnosis is an important task in natural language processing. This paper introduces our Chinese Grammatical Error Diagnosis (CGED) system in the NLP-TEA-3 shared task for CGED. The CGED system can diagnose four types of grammatical errors which are redundant words (R), missing words (M), bad word selection (S) and disordered words (W). We treat the CGED task as a sequence labeling task and describe three models, including a CRF-based model, an LSTM-based model and an ensemble model using stacking. We also show in details how we build and train the models. Evaluation includes three levels, which are detection level, identification level and position level. On the CGED-HSK dataset of NLP-TEA-3 shared task, our system presents the best F1-scores in all the three levels and also the best recall in the last two levels.
In the process of learning and using Chinese, foreigners may have grammatical errors due to negative migration of their native languages. Currently, the computer-oriented automatic detection method of grammatical errors is not mature enough. Based on the evaluating task — CGED2016, we select and analyze the classification model and design feature extraction method to obtain grammatical errors including Mission(M), Disorder(W), Selection (S) and Redundant (R) automatically. The experiment results based on the dynamic corpus of HSK show that the Chinese grammatical error automatic detection method, which uses CRF as classification model and n-gram as feature extraction method. It is simple and efficient which play a positive effect on the research of Chinese grammatical error automatic detection and also a supporting and guiding role in the teaching of Chinese as a foreign language.
This paper describe the CYUT-III system on grammar error detection in the 2016 NLP-TEA Chinese Grammar Error Detection shared task CGED. In this task a system has to detect four types of errors, in-cluding redundant word error, missing word error, word selection error and word ordering error. Based on the conditional random fields (CRF) model, our system is a linear tagger that can detect the errors in learners’ essays. Since the system performance depends on the features heavily, in this paper, we are going to report how to integrate the collocation feature into the CRF model. Our system presents the best detection accuracy and Identification accuracy on the TOCFL dataset, which is in traditional Chi-nese. The same system also works well on the simplified Chinese HSK dataset.
This paper discusses how to adapt two new word embedding features to build a more efficient Chinese Grammatical Error Diagnosis (CGED) systems to assist Chinese foreign learners (CFLs) in improving their written essays. The major idea is to apply word order sensitive Word2Vec approaches including (1) structured skip-gram and (2) continuous window (CWindow) models, because they are more suitable for solving syntax-based problems. The proposed new features were evaluated on the Test of Chinese as a Foreign Language (TOCFL) learner database provided by NLP-TEA-3&CGED shared task. Experimental results showed that the new features did work better than the traditional word order insensitive Word2Vec approaches. Moreover, according to the official evaluation results, our system achieved the lowest (0.1362) false positive (FA) and the highest precision rates in all three measurements.
We offer a fluctuation smoothing computational approach for unsupervised automatic short answer grading (ASAG) techniques in the educational ecosystem. A major drawback of the existing techniques is the significant effect that variations in model answers could have on their performances. The proposed fluctuation smoothing approach, based on classical sequential pattern mining, exploits lexical overlap in students’ answers to any typical question. We empirically demonstrate using multiple datasets that the proposed approach improves the overall performance and significantly reduces (up to 63%) variation in performance (standard deviation) of unsupervised ASAG techniques. We bring in additional benchmarks such as (a) paraphrasing of model answers and (b) using answers by k top performing students as model answers, to amplify the benefits of the proposed approach.
This paper introduces Japanese lexical simplification. Japanese lexical simplification is the task of replacing difficult words in a given sentence to produce a new sentence with simple words without changing the original meaning of the sentence. We purpose a method of supervised regression learning to estimate difficulty ordering of words with statistical features obtained from two types of Japanese corpora. For the similarity of words, we use a Japanese thesaurus and dependency-based word embeddings. Evaluation of the proposed method is performed by comparing the difficulty ordering of the words.
Due to the huge population that speaks Spanish and Chinese, these languages occupy an important position in the language learning studies. Although there are some automatic translation systems that benefit the learning of both languages, there is enough space to create resources in order to help language learners. As a quick and effective resource that can give large amount language information, corpus-based learning is becoming more and more popular. In this paper we enrich a Spanish-Chinese parallel corpus automatically with part of-speech (POS) information and manually with discourse segmentation (following the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988)). Two search tools allow the Spanish-Chinese language learners to carry out different queries based on tokens and lemmas. The parallel corpus and the research tools are available to the academic community. We propose some examples to illustrate how learners can use the corpus to learn Spanish and Chinese.
We present a novel approach to Computer Assisted Language Learning (CALL), using deep syntactic parsers and semantic based machine translation (MT) in diagnosing and providing explicit feedback on language learners’ errors. We are currently developing a proof of concept system showing how semantic-based machine translation can, in conjunction with robust computational grammars, be used to interact with students, better understand their language errors, and help students correct their grammar through a series of useful feedback messages and guided language drills. Ultimately, we aim to prove the viability of a new integrated rule-based MT approach to disambiguate students’ intended meaning in a CALL system. This is a necessary step to provide accurate coaching on how to correct ungrammatical input, and it will allow us to overcome a current bottleneck in the field — an exponential burst of ambiguity caused by ambiguous lexical items (Flickinger, 2010). From the users’ interaction with the system, we will also produce a richly annotated Learner Corpus, annotated automatically with both syntactic and semantic information.
This paper describes a corpus of nearly 10K French-Chinese aligned segments, produced by post-editing machine translated computer science courseware. This corpus was built from 2013 to 2016 within the PROJECT_NAME project, by native Chinese students. The quality, as judged by native speakers, is ad-equate for understanding (far better than by reading only the original French) and for getting better marks. This corpus is annotated at segment-level by a self-assessed quality score. It has been directly used as supplemental training data to build a statistical machine translation system dedicated to that sublanguage, and can be used to extract the specific bilingual terminology. To our knowledge, it is the first corpus of this kind to be released.
Much research in education has been done on the study of different language teaching methods. However, there has been little investigation using computational analysis to compare such methods in terms of readability or complexity progression. In this paper, we make use of existing readability scoring techniques and our own classifiers to analyze the textbooks used in two very different teaching methods for English as a Second Language – the grammar-based and the communicative methods. Our analysis indicates that the grammar-based curriculum shows a more coherent readability progression compared to the communicative curriculum. This finding corroborates with the expectations about the differences between these two methods and validates our approach’s value in comparing different teaching methods quantitatively.
Grammatical error diagnosis is an essential part in a language-learning tutoring system. Based on the data sets of Chinese grammar error detection tasks, we proposed a system which measures the likelihood of correction candidates generated by deleting or inserting characters or words, moving substrings to different positions, substituting prepositions with other prepositions, or substituting words with their synonyms or similar strings. Sentence likelihood is measured based on the frequencies of substrings from the space-removed version of Google n-grams. The evaluation on the training set shows that Missing-related and Selection-related candidate generation methods have promising performance. Our final system achieved a precision of 30.28% and a recall of 62.85% in the identification level evaluated on the test set.
Mandarin is not simple language for foreigner. Even using Mandarin as the mother tongue, they have to spend more time to learn when they were child. The following issues are the reason why causes learning problem. First, the word is envolved by Hieroglyphic. So a character can express meanings independently, but become a word has another semantic. Second, the Mandarin’s grammars have flexible rule and special usage. Therefore, the common grammatical errors can classify to missing, redundant, selection and disorder. In this paper, we proposed the structure of the Recurrent Neural Networks using Long Short-term memory (RNN-LSTM). It can detect the error type from the foreign learner writing. The features based on the word vector and part-of-speech vector. In the test data found that our method in the detection level of recall better than the others, even as high as 0.9755. That is because we give the possibility of greater choice in detecting errors.
Grammatical Error Diagnosis for Chinese has always been a challenge for both foreign learners and NLP researchers, for the variousity of grammar and the flexibility of expression. In this paper, we present a model based on Bidirectional Long Short-Term Memory(Bi-LSTM) neural networks, which treats the task as a sequence labeling problem, so as to detect Chinese grammatical errors, to identify the error types and to locate the error positions. In the corpora of this year’s shared task, there can be multiple errors in a single offset of a sentence, to address which, we simutaneously train three Bi-LSTM models sharing word embeddings which label Missing, Redundant and Selection errors respectively. We regard word ordering error as a special kind of word selection error which is longer during training phase, and then separate them by length during testing phase. In NLP-TEA 3 shared task for Chinese Grammatical Error Diagnosis(CGED), Our system achieved relatively high F1 for all the three levels in the traditional Chinese track and for the detection level in the Simpified Chinese track.
Abstract Automatic grammatical error detection for Chinese has been a big challenge for NLP researchers. Due to the formal and strict grammar rules in Chinese, it is hard for foreign students to master Chinese. A computer-assisted learning tool which can automatically detect and correct Chinese grammatical errors is necessary for those foreign students. Some of the previous works have sought to identify Chinese grammatical errors using template- and learning-based methods. In contrast, this study introduced convolutional neural network (CNN) and long-short term memory (LSTM) for the shared task of Chinese Grammatical Error Diagnosis (CGED). Different from traditional word-based embedding, single word embedding was used as input of CNN and LSTM. The proposed single word embedding can capture both semantic and syntactic information to detect those four type grammatical error. In experimental evaluation, the recall and f1-score of our submitted results Run1 of the TOCFL testing data ranked the fourth place in all submissions in detection-level.
We present Kathaa, an Open Source web-based Visual Programming Framework for Natural Language Processing (NLP) Systems. Kathaa supports the design, execution and analysis of complex NLP systems by visually connecting NLP components from an easily extensible Module Library. It models NLP systems an edge-labeled Directed Acyclic MultiGraph, and lets the user use publicly co-created modules in their own NLP applications irrespective of their technical proficiency in Natural Language Processing. Kathaa exposes an intuitive web based Interface for the users to interact with and modify complex NLP Systems; and a precise Module definition API to allow easy integration of new state of the art NLP components. Kathaa enables researchers to publish their services in a standardized format to enable the masses to use their services out of the box. The vision of this work is to pave the way for a system like Kathaa, to be the Lego blocks of NLP Research and Applications. As a practical use case we use Kathaa to visually implement the Sampark Hindi-Panjabi Machine Translation Pipeline and the Sampark Hindi-Urdu Machine Translation Pipeline, to demonstrate the fact that Kathaa can handle really complex NLP systems while still being intuitive for the end user.
The US National Science Foundation (NSF) SI2-funded LAPPS/Galaxy project has developed an open-source platform for enabling complex analyses while hiding complexities associated with underlying infrastructure, that can be accessed through a web interface, deployed on any Unix system, or run from the cloud. It provides sophisticated tool integration and history capabilities, a workflow system for building automated multi-step analyses, state-of-the-art evaluation capabilities, and facilities for sharing and publishing analyses. This paper describes the current facilities available in LAPPS/Galaxy and outlines the project’s ongoing activities to enhance the framework.
Most tools for natural language processing today are based on machine learning and come with pre-trained models. In addition, third-parties provide pre-trained models for popular NLP tools. The predictive power and accuracy of these tools depends on the quality of these models. Downstream researchers often base their results on pre-trained models instead of training their own. Consequently, pre-trained models are an essential resource to our community. However, to be best of our knowledge, no systematic study of pre-trained models has been conducted so far. This paper reports on the analysis of 274 pre-models for six NLP tools and four potential causes of problems: encoding, tokenization, normalization and change over time. The analysis is implemented in the open source tool Model Investigator. Our work 1) allows model consumers to better assess whether a model is suitable for their task, 2) enables tool and model creators to sanity-check their models before distributing them, and 3) enables improvements in tool interoperability by performing automatic adjustments of normalization or other pre-processing based on the models used.
In this research, we introduce and implement a method that combines human inputters and machine translators. When the languages of the participants vary widely, the cost of simultaneous translation becomes very high. However, the results of simply applying machine translation to speech text do not have the quality that is needed for real use. Thus, we propose a method that people who understand the language of the speaker cooperate with a machine translation service in support of multilingualization by the co-creation of value. We implement a system with this method and apply it to actual presentations. While the quality of direct machine translations is 1.84 (fluency) and 2.89 (adequacy), the system has corresponding values of 3.76 and 3.85.
Complaint classification aims at using information to deliver greater insights to enhance user experience after purchasing the products or services. Categorized information can help us quickly collect emerging problems in order to provide a support needed. Indeed, the response to the complaint without the delay will grant users highest satisfaction. In this paper, we aim to deliver a novel approach which can clarify the complaints precisely with the aim to classify each complaint into nine predefined classes i.e. acces-sibility, company brand, competitors, facilities, process, product feature, staff quality, timing respec-tively and others. Given the idea that one word usually conveys ambiguity and it has to be interpreted by its context, the word embedding technique is used to provide word features while applying deep learning techniques for classifying a type of complaints. The dataset we use contains 8,439 complaints of one company.
The Universal Dependencies (UD) Project seeks to build a cross-lingual studies of treebanks, linguistic structures and parsing. Its goal is to create a set of multilingual harmonized treebanks that are designed according to a universal annotation scheme. In this paper, we report on the conversion of the Uyghur dependency treebank to a UD version of the treebank which we term the Uyghur Universal Dependency Treebank (UyDT). We present the mapping of the Uyghur dependency treebank’s labelling scheme to the UD scheme, along with a clear description of the structural changes required in this conversion.
In this paper we describe a non-expert setup for Vietnamese speech recognition system using Kaldi toolkit. We collected a speech corpus over fifteen hours from about fifty Vietnamese native speakers and using it to test the feasibility of our setup. The essential linguistic components for the Automatic Speech Recognition (ASR) system was prepared basing on the written form of the language instead of expertise knowledge on linguistic and phonology as commonly seen in rich resource languages like English. The modeling of tones by integrating them into the phoneme and using the phonetic decision tree is also discussed. Experimental results showed this setup for ASR systems does yield competitive results while still have potentials for further improvements.
Annotated corpora are crucial language resources, and pre-annotation is an usual way to reduce the cost of corpus construction. Ensemble based pre-annotation approach combines multiple existing named entity taggers and categorizes annotations into normal annotations with high confidence and candidate annotations with low confidence, to reduce the human annotation time. In this paper, we manually annotate three English datasets under various pre-annotation conditions, report the effects of ensemble based pre-annotation, and analyze the experimental results. In order to verify the effectiveness of ensemble based pre-annotation in other languages, such as Chinese, three Chinese datasets are also tested. The experimental results show that the ensemble based pre-annotation approach significantly reduces the number of annotations which human annotators have to add, and outperforms the baseline approaches in reduction of human annotation time without loss in annotation performance (in terms of F1-measure), on both English and Chinese datasets.
Fragmentation and recombination is a key to create customized language environments for supporting various intercultural activities. Fragmentation provides various language resource components for the customized language environments and recombination builds each language environment according to user’s request by combining these components. To realize this fragmentation and recombination process, existing language resources (both data and programs) should be shared as language services and combined beyond mismatch of their service interfaces. To address this issue, standardization is inevitable: standardized interfaces are necessary for language services as well as data format required for language resources. Therefore, we have constructed a hierarchy of language services based on inheritance of service interfaces, which is called language service ontology. This ontology allows users to create a new customized language service that is compatible with existing ones. Moreover, we have developed a dynamic service binding technology that instantiates various executable customized services from an abstract workflow according to user’s request. By using the ontology and service binding together, users can bind the instantiated language service to another abstract workflow for a new customized one.
Different types of users require different functions in NLP software. It is difficult for a single platform to cover all types of users. When a framework aims to provide more interoperability, users are required to learn more concepts; users’ application designs are restricted to be compliant with the framework. While an interoperability framework is useful in certain cases, some types of users will not select the framework due to the learning cost and design restrictions. We suggest a rather simple framework for the interoperability aiming at developers. Reusing an existing NLP platform Kachako, we created an API oriented NLP system. This system loosely couples rich high-end functions, including annotation visualizations, statistical evaluations, an-notation searching, etc. This API do not require users much learning cost, providing customization ability for power users while also allowing easy users to employ many GUI functions.
Men are from Mars and women are from Venus - or so the genre of relationship literature would have us believe. But there is some truth in this idea, and researchers in fields as diverse as psychology, sociology, and linguistics have explored ways to better understand the differences between genders. In this paper, we take another look at the problem of gender discrimination and attempt to move beyond the typical surface-level text classification approach, by (1) identifying semantic and psycholinguistic word classes that reflect systematic differences between men and women and (2) finding differences between genders in the ways they use the same words. We describe several experiments and report results on a large collection of blogs authored by men and women.
Recent studies have demonstrated gender and cultural differences in the recognition of emotions in facial expressions. However, most studies were conducted on American subjects. In this paper, we explore the generalizability of several findings to a non-American culture in the form of Danish subjects. We conduct an emotion recognition task followed by two stereotype questionnaires with different genders and age groups. While recent findings (Krems et al., 2015) suggest that women are biased to see anger in neutral facial expressions posed by females, in our sample both genders assign higher ratings of anger to all emotions expressed by females. Furthermore, we demonstrate an effect of gender on the fear-surprise-confusion observed by Tomkins and McCarter (1964); females overpredict fear, while males overpredict surprise.
Many methods have been used to recognise author personality traits from text, typically combining linguistic feature engineering with shallow learning models, e.g. linear regression or Support Vector Machines. This work uses deep-learning-based models and atomic features of text, the characters, to build hierarchical, vectorial word and sentence representations for trait inference. This method, applied to a corpus of tweets, shows state-of-the-art performance across five traits compared with prior work. The results, supported by preliminary visualisation work, are encouraging for the ability to detect complex human traits.
We exploit the Facebook reaction feature in a distant supervised fashion to train a support vector machine classifier for emotion detection, using several feature combinations and combining different Facebook pages. We test our models on existing benchmarks for emotion detection and show that employing only information that is derived completely automatically, thus without relying on any handcrafted lexicon as it’s usually done, we can achieve competitive results. The results also show that there is large room for improvement, especially by gearing the collection of Facebook pages, with a view to the target domain.
This paper proposes a graphical framework to extract opinionated sentences which highlight different contexts within a given news article by introducing the concept of diversity in a graphical model for opinion detection.We conduct extensive evaluations and find that the proposed modification leads to impressive improvement in performance and makes the final results of the model much more usable. The proposed method (OP-D) not only performs much better than the other techniques used for opinion detection as well as introducing diversity, but is also able to select opinions from different categories (Asher et al. 2009). By developing a classification model which categorizes the identified sentences into various opinion categories, we find that OP-D is able to push opinions from different categories uniformly among the top opinions.
Automatic detection of five language components, which are all relevant for expressing opinions and for stance taking, was studied: positive sentiment, negative sentiment, speculation, contrast and condition. A resource-aware approach was taken, which included manual annotation of 500 training samples and the use of limited lexical resources. Active learning was compared to random selection of training data, as well as to a lexicon-based method. Active learning was successful for the categories speculation, contrast and condition, but not for the two sentiment categories, for which results achieved when using active learning were similar to those achieved when applying a random selection of training data. This difference is likely due to a larger variation in how sentiment is expressed than in how speakers express the other three categories. This larger variation was also shown by the lower recall results achieved by the lexicon-based approach for sentiment than for the categories speculation, contrast and condition.
We investigate the application of kernel methods to representing both structural and lexical knowledge for predicting polarity of opinions in consumer product review. We introduce any-gram kernels which model lexical information in a significantly faster way than the traditional n-gram features, while capturing all possible orders of n-grams n in a sequence without the need to explicitly present a pre-specified set of such orders. We also present an effective format to represent constituency and dependency structure together with aspect terms and sentiment polarity scores. Furthermore, we modify the traditional tree kernel function to compute the similarity based on word embedding vectors instead of exact string match and present experiments using the new models.
This paper explores humour recognition for Twitter-based hashtag games. Given their popularity, frequency, and relatively formulaic nature, these games make a good target for computational humour research and can leverage Twitter likes and retweets as humour judgments. In this work, we use pair-wise relative humour judgments to examine several measures of semantic relatedness between setups and punchlines on a hashtag game corpus we collected and annotated. Results show that perplexity, Normalized Google Distance, and free-word association-based features are all useful in identifying “funnier” hashtag game responses. In fact, we provide empirical evidence that funnier punchlines tend to be more obscure, although more obscure punchlines are not necessarily rated funnier. Furthermore, the asymmetric nature of free-word association features allows us to see that while punchlines should be harder to predict given a setup, they should also be relatively easy to understand in context.
Despite a substantial progress made in developing new sentiment lexicon generation (SLG) methods for English, the task of transferring these approaches to other languages and domains in a sound way still remains open. In this paper, we contribute to the solution of this problem by systematically comparing semi-automatic translations of common English polarity lists with the results of the original automatic SLG algorithms, which were applied directly to German data. We evaluate these lexicons on a corpus of 7,992 manually annotated tweets. In addition to that, we also collate the results of dictionary- and corpus-based SLG methods in order to find out which of these paradigms is better suited for the inherently noisy domain of social media. Our experiments show that semi-automatic translations notably outperform automatic systems (reaching a macro-averaged F1-score of 0.589), and that dictionary-based techniques produce much better polarity lists as compared to corpus-based approaches (whose best F1-scores run up to 0.479 and 0.419 respectively) even for the non-standard Twitter genre.
Detecting depression or personality traits, tutoring and student behaviour systems, or identifying cases of cyber-bulling are a few of the wide range of the applications, in which the automatic detection of emotion is a crucial element. Emotion detection has the potential of high impact by contributing the benefit of business, society, politics or education. Given this context, the main objective of our research is to contribute to the resolution of one of the most important challenges in textual emotion detection task: the problems of emotional corpora annotation. This will be tackled by proposing of a new semi-automatic methodology. Our innovative methodology consists in two main phases: (1) an automatic process to pre-annotate the unlabelled sentences with a reduced number of emotional categories; and (2) a refinement manual process where human annotators will determine which is the predominant emotion between the emotional categories selected in the phase 1. Our proposal in this paper is to show and evaluate the pre-annotation process to analyse the feasibility and the benefits by the methodology proposed. The results obtained are promising and allow obtaining a substantial improvement of annotation time and cost and confirm the usefulness of our pre-annotation process to improve the annotation task.
We created a model to estimate personality trait from authors’ text written in Japanese and measured its performance by conducting surveys and analyzing the Twitter data of 1,630 users. We used the Big Five personality traits for personality trait estimation. Our approach is a combination of category- and Word2Vec-based approaches. For the category-based element, we added several unique Japanese categories along with the ones regularly used in the English model, and for the Word2Vec-based element, we used a model called GloVe. We found that some of the newly added categories have a stronger correlation with personality traits than other categories do and that the combination of the category- and Word2Vec-based approaches improves the accuracy of the personality trait estimation compared with the case of using just one of them.
On June 23rd 2016, UK held the referendum which ratified the exit from the EU. While most of the traditional pollsters failed to forecast the final vote, there were online systems that hit the result with high accuracy using opinion mining techniques and big data. Starting one month before, we collected and monitored millions of posts about the referendum from social media conversations, and exploited Natural Language Processing techniques to predict the referendum outcome. In this paper we discuss the methods used by traditional pollsters and compare it to the predictions based on different opinion mining techniques. We find that opinion mining based on agreement/disagreement classification works better than opinion mining based on polarity classification in the forecast of the referendum outcome.
The conundrum of understanding and classifying sarcasm has been dealt with by the traditional theorists as an analysis of a sarcastic utterance and the ironic situation that surrounds it. The problem with such an approach is that it is too narrow, as it is unable to sufficiently utilize the two indispensable agents in making such an utterance, viz. the speaker and the listener. It undermines the necessary context required to comprehend a sarcastic utterance. In this paper, we propose a novel approach towards understanding sarcasm in terms of the existing knowledge hierarchy between the two participants, which forms the basis of the context that both agents share. The difference in relationship of the speaker of the sarcastic utterance and the disparate audience found on social media, such as Twitter, is also captured. We then apply our model on a corpus of tweets to achieve significant results and consequently, shed light on subjective nature of context, which is contingent on the relation between the speaker and the listener.
A growing body of research exploits social media behaviors to gauge psychological character-istics, though trait empathy has received little attention. Because of its intimate link to the abil-ity to relate to others, our research aims to predict participants’ levels of empathy, given their textual and friending behaviors on Facebook. Using Poisson regression, we compared the vari-ance explained in Davis’ Interpersonal Reactivity Index (IRI) scores on four constructs (em-pathic concern, personal distress, fantasy, perspective taking), by two classes of variables: 1) post content and 2) linguistic style. Our study lays the groundwork for a greater understanding of empathy’s role in facilitating interactions on social media.
This paper outlines a pilot study on multi-dimensional and multilingual sentiment analysis of social media content. We use parallel corpora of movie subtitles as a proxy for colloquial language in social media channels and a multilingual emotion lexicon for fine-grained sentiment analyses. Parallel data sets make it possible to study the preservation of sentiments and emotions in translation and our assessment reveals that the lexical approach shows great inter-language agreement. However, our manual evaluation also suggests that the use of purely lexical methods is limited and further studies are necessary to pinpoint the cross-lingual differences and to develop better sentiment classifiers.
In this paper, we address the issue of automatic prediction of readers’ mood from newspaper articles and comments. As online newspapers are becoming more and more similar to social media platforms, users can provide affective feedback, such as mood and emotion. We have exploited the self-reported annotation of mood categories obtained from the metadata of the Italian online newspaper corriere.it to design and evaluate a system for predicting five different mood categories from news articles and comments: indignation, disappointment, worry, satisfaction, and amusement. The outcome of our experiments shows that overall, bag-of-word-ngrams perform better compared to all other feature sets; however, stylometric features perform better for the mood score prediction of articles. Our study shows that self-reported annotations can be used to design automatic mood prediction systems.
Most work in NLP analysing microblogs focuses on textual content thus neglecting temporal and spatial information. We present a new interdisciplinary method for emotion classification that combines linguistic, temporal, and spatial information into a single metric. We create a graph of labeled and unlabeled tweets that encodes the relations between neighboring tweets with respect to their emotion labels. Graph-based semi-supervised learning labels all tweets with an emotion.
We explore a domain-agnostic approach for analyzing speech with the goal of opinion prediction. We represent the speech signal by mel-frequency cepstral coefficients and apply long short-term memory neural networks to automatically learn temporal regularities in speech. In contrast to previous work, our approach does not require complex feature engineering and works without textual transcripts. As a consequence, it can easily be applied on various speech analysis tasks for different languages and the results show that it can nevertheless be competitive to the state-of-the-art in opinion prediction. In a detailed error analysis for opinion mining we find that our approach performs well in identifying speaker-specific characteristics, but should be combined with additional information if subtle differences in the linguistic content need to be identified.
Considering the importance of public speech skills, a system which makes a prediction on where audiences laugh in a talk can be helpful to a person who prepares for a talk. We investigated a possibility that a state-of-the-art humor recognition system can be used in detecting sentences inducing laughters in talks. In this study, we used TED talks and laughters in the talks as data. Our results showed that the state-of-the-art system needs to be improved in order to be used in a practical application. In addition, our analysis showed that classifying humorous sentences in talks is very challenging due to close distance between humorous and non-humorous sentences.
Major depressive disorder, a debilitating and burdensome disease experienced by individuals worldwide, can be defined by several depressive symptoms (e.g., anhedonia (inability to feel pleasure), depressed mood, difficulty concentrating, etc.). Individuals often discuss their experiences with depression symptoms on public social media platforms like Twitter, providing a potentially useful data source for monitoring population-level mental health risk factors. In a step towards developing an automated method to estimate the prevalence of symptoms associated with major depressive disorder over time in the United States using Twitter, we developed classifiers for discerning whether a Twitter tweet represents no evidence of depression or evidence of depression. If there was evidence of depression, we then classified whether the tweet contained a depressive symptom and if so, which of three subtypes: depressed mood, disturbed sleep, or fatigue or loss of energy. We observed that the most accurate classifiers could predict classes with high-to-moderate F1-score performances for no evidence of depression (85), evidence of depression (52), and depressive symptoms (49). We report moderate F1-scores for depressive symptoms ranging from 75 (fatigue or loss of energy) to 43 (disturbed sleep) to 35 (depressed mood). Our work demonstrates baseline approaches for automatically encoding Twitter data with granular depressive symptoms associated with major depressive disorder.