Dimitrios Kokkinakis

2022

Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference
Dimitrios Kokkinakis | Charalambos K. Themistocleous | Kristina Lundholm Fors | Athanasios Tsanas | Kathleen C. Fraser

Extraction and Classification of Acoustic Features from Italian Speaking Children with Autism Spectrum Disorders.
Federica Beccaria | Gloria Gagliardi | Dimitrios Kokkinakis
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference

Autism Spectrum Disorders (ASD) are a group of complex developmental conditions whose effects and severity show high intraindividual variability. However, one of the main symptoms shared along the spectrum is social interaction impairments that can be explored through acoustic analysis of speech production. In this paper, we compare 14 Italian-speaking children with ASD and 14 typically developing peers. Accordingly, we extracted and selected the acoustic features related to prosody, quality of voice, loudness, and spectral distribution using the parameter set eGeMAPS provided by the openSMILE feature extraction toolkit. We implemented four supervised machine learning methods to evaluate the extraction performance. Our findings show that Decision Trees (DTs) and Support Vector Machines (SVMs) are the best-performing methods. The DT models reach 100% recall on all the trials, meaning they correctly recognise autistic features. However, half of the DT models overfit, while SVMs are more consistent. One of the results of the work is the creation of a speech pipeline to extract Italian speech biomarkers typical of ASD by comparing our results with studies based on other languages. A better understanding of this topic can support clinicians in diagnosing the disorder.
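
As a reading aid for the recall figure quoted in the abstract above, the following minimal sketch shows how recall, precision and accuracy follow from a confusion matrix. The counts are hypothetical and not taken from the paper; they are chosen only to illustrate that 100% recall on the ASD class says nothing about false alarms among typically developing children.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the classifier found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts for 14 children with ASD and 14 typically developing peers.
tp, fn = 14, 0   # every child with ASD is flagged
fp, tn = 4, 10   # four typically developing children are flagged by mistake

asd_recall = recall(tp, fn)
asd_precision = precision(tp, fp)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"recall={asd_recall:.2f} precision={asd_precision:.2f} accuracy={accuracy:.2f}")
```

With these assumed counts, recall is perfect while precision and accuracy are not, which is one reason a recall figure is best read alongside the other metrics.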

2019

Multilingual prediction of Alzheimer’s disease through domain adaptation and concept-based language modelling
Kathleen C. Fraser | Nicklas Linz | Bai Li | Kristina Lundholm Fors | Frank Rudzicz | Alexandra König | Jan Alexandersson | Philippe Robert | Dimitrios Kokkinakis
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

There is growing evidence that changes in speech and language may be early markers of dementia, but much of the previous NLP work in this area has been limited by the size of the available datasets. Here, we compare several methods of domain adaptation to augment a small French dataset of picture descriptions (n = 57) with a much larger English dataset (n = 550), for the task of automatically distinguishing participants with dementia from controls. The first challenge is to identify a set of features that transfer across languages; in addition to previously used features based on information units, we introduce a new set of features to model the order in which information units are produced by dementia patients and controls. These concept-based language model features improve classification performance in both English and French separately, and the best result (AUC = 0.89) is achieved using the multilingual training set with a combination of information and language model features.
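
The idea of modelling the order in which information units are produced can be sketched as follows. This is a toy illustration, not the paper's implementation: the bigram order, the add-one smoothing, the information-unit names and the example sequences are all assumptions made here. One language model is trained per group, and the average log-probability of a new description under each model serves as a feature pair.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab):
    """Count bigrams over information-unit sequences with start/end markers."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    # Possible next symbols: every information unit plus the end marker.
    return bigrams, contexts, len(vocab) + 1


def avg_log_prob(seq, model):
    """Average add-one-smoothed log-probability per transition."""
    bigrams, contexts, n_outcomes = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + n_outcomes))
    return total / (len(padded) - 1)


# Hypothetical information units and orderings (illustration only).
vocab = {"boy", "cookie", "stool", "mother", "sink"}
controls = [["boy", "cookie", "stool", "mother", "sink"],
            ["boy", "stool", "cookie", "mother", "sink"]]
patients = [["sink", "boy", "mother"],
            ["cookie", "sink", "boy"]]

control_lm = train_bigram(controls, vocab)
patient_lm = train_bigram(patients, vocab)

new_description = ["boy", "cookie", "stool", "mother", "sink"]
features = (avg_log_prob(new_description, control_lm),
            avg_log_prob(new_description, patient_lm))
print(features)
```

Because the models operate on language-independent concept labels rather than words, features of this kind can in principle be computed the same way for English and French descriptions.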

Temporal Analysis of the Semantic Verbal Fluency Task in Persons with Subjective and Mild Cognitive Impairment
Nicklas Linz | Kristina Lundholm Fors | Hali Lindsay | Marie Eckerström | Jan Alexandersson | Dimitrios Kokkinakis
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

The Semantic Verbal Fluency (SVF) task is a classical neuropsychological assessment where persons are asked to produce words belonging to a semantic category (e.g., animals) in a given time. This paper introduces a novel method of temporal analysis for SVF tasks utilizing time intervals and applies it to a corpus of elderly Swedish subjects (mild cognitive impairment, subjective cognitive impairment and healthy controls). A general decline in word count and lexical frequency over the course of the task is revealed, as well as an increase in word transition times. Persons with subjective cognitive impairment had a higher word count during the last intervals, but produced words of the same lexical frequencies. Persons with MCI had a steeper decline in both word count and lexical frequencies during the third interval. Additional correlations with neuropsychological scores suggest these findings are linked to a person’s overall vocabulary size and processing speed, respectively. Classification results improved when adding the novel features (AUC=0.72), supporting their diagnostic value.
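
A minimal sketch of the kind of interval-based analysis described above: word onsets are binned into equal time intervals, and the gaps between consecutive words give the transition times. The task length of 60 seconds, the six intervals and the onset times are assumptions for illustration; the abstract does not specify them.

```python
def interval_counts(onsets, task_seconds=60.0, n_intervals=6):
    """Count how many words start in each equal-width time interval."""
    width = task_seconds / n_intervals
    counts = [0] * n_intervals
    for t in onsets:
        if 0 <= t < task_seconds:
            # Guard against floating-point edge cases at the upper boundary.
            index = min(int(t // width), n_intervals - 1)
            counts[index] += 1
    return counts


def transition_times(onsets):
    """Time gaps between consecutive word onsets."""
    ordered = sorted(onsets)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical word onsets in seconds for one participant.
onsets = [1.2, 2.0, 3.1, 5.5, 8.0, 12.4, 15.0, 21.3, 27.9, 38.2, 55.0]

counts = interval_counts(onsets)
gaps = transition_times(onsets)
print(counts)
print([round(g, 1) for g in gaps])
```

Per-interval counts and gaps of this kind could then be compared across groups or fed to a classifier as additional features.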

2018

A Swedish Cookie-Theft Corpus
Dimitrios Kokkinakis | Kristina Lundholm Fors | Kathleen Fraser | Arto Nordlund
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Data Collection from Persons with Mild Forms of Cognitive Impairment and Healthy Controls - Infrastructure for Classification and Prediction of Dementia
Dimitrios Kokkinakis | Kristina Lundholm Fors | Eva Björkner | Arto Nordlund
Proceedings of the 21st Nordic Conference on Computational Linguistics

An analysis of eye-movements during reading for the detection of mild cognitive impairment
Kathleen C. Fraser | Kristina Lundholm Fors | Dimitrios Kokkinakis | Arto Nordlund
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present a machine learning analysis of eye-tracking data for the detection of mild cognitive impairment, a decline in cognitive abilities that is associated with an increased risk of developing dementia. We compare two experimental configurations (reading aloud versus reading silently), as well as two methods of combining information from the two trials (concatenation and merging). Additionally, we annotate the words being read with information about their frequency and syntactic category, and use these annotations to generate new features. Ultimately, we are able to distinguish between participants with and without cognitive impairment with up to 86% accuracy.

Named entity recognition (NER) is a knowledge-intensive information extraction task that is used for recognizing textual mentions of entities that belong to a predefined set of categories, such as locations, organizations and time expressions. NER is a challenging yet essential preprocessing technology for many natural language processing applications, and particularly crucial for language understanding. NER has been actively explored in academia and industry, especially in recent years with the advent of social media data. This paper describes the conversion, modeling and adaptation of a Swedish NER system from a hybrid environment, with integrated functionality from various processing components, to the Helsinki Finite-State Transducer Technology (HFST) platform. This new HFST-based NER (HFST-SweNER) is a full-fledged open source implementation that supports a variety of generic named entity types and consists of multiple, reusable resource layers, e.g., various n-gram-based named entity lists (gazetteers).
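
To illustrate what an n-gram-based gazetteer layer does, here is a small longest-match lookup over token n-grams. It is a plain dictionary sketch, not HFST code and not the HFST-SweNER implementation; the gazetteer entries, the labels and the maximum n-gram length are assumptions made for the example.

```python
def tag_entities(tokens, gazetteer, max_n=3):
    """Greedy longest-match lookup of token n-grams in a gazetteer.

    Returns (start, end, label) spans, with `end` exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first, then shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                spans.append((i, i + n, gazetteer[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


# Hypothetical gazetteer entries keyed by token tuples.
gazetteer = {
    ("Göteborgs", "universitet"): "ORG",
    ("Göteborg",): "LOC",
}

tokens = "Hon studerar vid Göteborgs universitet i Göteborg".split()
spans = tag_entities(tokens, gazetteer)
print(spans)
```

A finite-state implementation would compile such lists into transducers and combine them with context rules, which a flat lookup like this does not attempt.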

2013

Figurative Language in Swedish Clinical Texts
Dimitrios Kokkinakis
Proceedings of the IWCS 2013 Workshop on Computational Semantics in Clinical Text (CSCT 2013)

2012

Semantic Role Labeling with the Swedish FrameNet
Richard Johansson | Karin Friberg Heppin | Dimitrios Kokkinakis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the first results on semantic role labeling using the Swedish FrameNet, which is a lexical resource currently in development. Several aspects of the task are investigated, including the design and selection of machine learning features, the effect of choice of syntactic parser, and the ability of the system to generalize to new frames and new genres. In addition, we evaluate two methods to make the role label classifier more robust: cross-frame generalization and cluster-based features. Although the small amount of training data limits the performance achievable at the moment, we reach promising results. In particular, the classifier that extracts the boundaries of arguments works well for new frames, which suggests that it already at this stage can be useful in a semi-automatic setting.

Advanced Visual Analytics Methods for Literature Analysis
Daniela Oelke | Dimitrios Kokkinakis | Mats Malm
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2011

Character Profiling in 19th Century Fiction
Dimitrios Kokkinakis | Mats Malm
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

Reducing Complexity in Parsing Scientific Medical Data, a Diabetes Case Study
Dimitrios Kokkinakis
Proceedings of the Second Workshop on Biomedical Natural Language Processing

2010

Linking SweFN++ with Medical Resources, towards a MedFrameNet for Swedish
Dimitrios Kokkinakis | Maria Toporowska Gronostaj
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents

A Swedish Scientific Medical Corpus for Terminology Management and Linguistic Exploration
Dimitrios Kokkinakis | Ulla Gerdin
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the development of a new Swedish scientific medical corpus. We provide a detailed description of the characteristics of this new collection as well as results of an application of the corpus to term management tasks, including terminology validation and terminology extraction. Although the corpus is representative of the scientific medical domain, it still covers in detail many specialised sub-disciplines such as diabetes and osteoporosis, which makes it suitable for facilitating the production of smaller but more focused sub-corpora. We address this issue by making explicit some features of the corpus in order to demonstrate its usability, particularly for the quality assessment of subsets of official terminologies such as the Systematized NOmenclature of MEDicine - Clinical Terms (SNOMED CT). Domain-dependent language resources, labelled or not, are crucial components for progressing R&D in the human language technology field. Such resources are an indispensable, integrated part of terminology management, evaluation, software prototyping and design validation, and a prerequisite for the development and evaluation of a number of sublanguage-dependent applications, including information extraction, text mining and information retrieval.

Diabase: Towards a Diachronic BLARK in Support of Historical Studies
Lars Borin | Markus Forsberg | Dimitrios Kokkinakis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present our ongoing work on language technology-based e-science in the humanities, social sciences and education, with a focus on text-based research in the historical sciences. An important aspect of language technology is the research infrastructure known by the acronym BLARK (Basic LAnguage Resource Kit). A BLARK as normally presented in the literature arguably reflects a modern standard language, which is topic- and genre-neutral, thus abstracting away from all kinds of language variation. We argue that this notion could fruitfully be extended along any of the three axes implicit in this characterization (the social, the topical and the temporal), in our case the temporal axis, towards a diachronic BLARK for Swedish, which can be used to develop e-science tools in support of historical studies.

2009

Issues on Quality Assessment of SNOMED CT® Subsets – Term Validation and Term Extraction
Dimitrios Kokkinakis | Ulla Gerdin
Proceedings of the Workshop on Biomedical Information Extraction

2008

MeSH©: from a Controlled Vocabulary to a Processable Resource
Dimitrios Kokkinakis
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Large repositories of life science data in the form of domain-specific literature and large specialised textual collections increase on a daily basis to a level beyond what the human mind can grasp and interpret. As the volume of data continues to increase, the need for substantial support from new information technologies and computational techniques grounded in the mining paradigm is becoming apparent. These emerging technologies play a critical role in aiding research productivity, and they provide the means for reducing the workload for information access and decision support and for speeding up and enhancing the knowledge discovery process. In order to accomplish these higher level goals a fundamental and unavoidable starting point is the identification and mapping of terminology from unstructured data to biomedical knowledge sources and concept hierarchies. This paper provides a description of the work regarding terminology recognition using the Swedish MeSH© thesaurus and its corresponding English source. The various transformation and refinement steps applied to turn the original database tables into a fully fledged, processing-oriented annotation resource are explained. Particular attention has been given to a number of these steps in order to automatically map the extensive variability of lexical terms to structured MeSH© nodes. Issues on annotation and coverage are also discussed.

A Semantically Annotated Swedish Medical Corpus
Dimitrios Kokkinakis
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

With the information overload in the life sciences there is an increasing need for annotated corpora, particularly with biological and biomedical entities, which are the driving force for data-driven language processing applications and the empirical approach to language study. Inspired by the work on the GENIA Corpus, which is one of the very few such corpora and is extensively used in the biomedical field, and in order to fulfil the needs of our research, we have collected a Swedish medical corpus, the MEDLEX Corpus. MEDLEX is a large structurally and linguistically annotated document collection, consisting of a variety of text documents related to various medical text subfields. It does not focus on a particular medical genre, due to the lack of large Swedish resources within a particular medical subdomain. Out of this collection we selected 300 documents, which were manually examined by two human experts who inspected and, where necessary, corrected the automatically provided annotations according to a set of labelling guidelines. The annotations consist of medical terminology provided by the Swedish and English MeSH© (Medical Subject Headings) thesauri as well as named entity labels provided by an enhanced named entity recognition software.

2007

Naming the Past: Named Entity and Animacy Recognition in 19th Century Swedish Literature
Lars Borin | Dimitrios Kokkinakis | Leif-Jöran Olsson
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007).

Identification of Entity References in Hospital Discharge Letters
Dimitrios Kokkinakis | Anders Thurin
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

Lexical Parameters, Based on Corpus Analysis of English and Swedish Cancer Data, of Relevance for NLG
Dimitrios Kokkinakis | Maria Toporowska Gronostaj | Catalina Hallett | David Hardcastle
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

Collection, Encoding and Linguistic Processing of a Swedish Medical Corpus - The MEDLEX Experience
Dimitrios Kokkinakis
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Corpora annotated with structural and linguistic characteristics play a major role in nearly every area of language processing. In recent years a number of corpora and large data sets have become available to research, even in specialized fields such as medicine, but they still predominantly target the English language. This paper provides a description of the collection, encoding and linguistic processing of an ever-growing Swedish medical corpus, the MEDLEX Corpus. MEDLEX consists of a variety of text documents related to various medical text genres. The MEDLEX Corpus has been structurally annotated using the Corpus Encoding Standard for XML (XCES), lemmatized and automatically annotated with part-of-speech and semantic information (extended named entities and the Medical Subject Headings, MeSH, terminology). The results from the processing stages (part-of-speech, entities and terminology) have been merged into a single representation format and syntactically analysed using a cascaded finite state parser. Finally, the parser's results are converted into a tree structure that follows the TIGER-XML coding scheme, resulting in a fairly large treebank of Swedish medical texts that is suitable for further exploration.

Recognizing Acronyms and their Definitions in Swedish Medical Texts
Dimitrios Kokkinakis | Dana Dannélls
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper addresses the task of recognizing acronym-definition pairs in Swedish (medical) texts as well as the compilation of a freely available sample of such manually annotated pairs, a material suitable not only for supervised learning experiments but also as a testbed for evaluating the quality of future acronym-definition recognition systems. There are a number of approaches to the identification described in the literature, particularly within the biomedical domain, but none of those addresses the variation and complexity exhibited in a language other than English. This complexity arises from the fact that two languages, i.e. Swedish and English, can be mixed in the same document and/or sentence; that Swedish is a compounding language, which significantly degrades the performance of previous approaches (without adaptations); and, most importantly, that there is a large variation of possible acronym-definition permutations in the analysed corpora, a variation that is usually ignored in previous studies.
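
The compounding problem mentioned above can be made concrete with a deliberately naive baseline. The sketch below is not the paper's method: it pairs a parenthesised acronym with the preceding words only when their initial letters spell the acronym. The regular expression, the function name and both example sentences are assumptions made for illustration.

```python
import re

# Parenthesised upper-case acronyms, including the Swedish letters Å, Ä, Ö.
ACRONYM = re.compile(r"\(([A-ZÅÄÖ]{2,10})\)")


def initial_letter_pairs(text):
    """Pair each acronym with the preceding words if their initials match it."""
    pairs = []
    for match in ACRONYM.finditer(text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            pairs.append((acronym, " ".join(candidate)))
    return pairs


english = "Patients underwent magnetic resonance imaging (MRI) last week."
swedish = "Patienten undersöktes med magnetisk resonanstomografi (MRT)."

english_pairs = initial_letter_pairs(english)
swedish_pairs = initial_letter_pairs(swedish)
print(english_pairs)
print(swedish_pairs)
```

The English pair is found because each acronym letter starts its own word. The Swedish pair is missed because the compound "resonanstomografi" contributes both R and T from a single word, so a word-initial heuristic would need compound splitting or character-level alignment to handle it.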