2024
pdf
abs
UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology
Agata Savary
|
Daniel Zeman
|
Verginica Barbu Mititelu
|
Anabela Barreiro
|
Olesea Caftanatov
|
Marie-Catherine de Marneffe
|
Kaja Dobrovoljc
|
Gülşen Eryiğit
|
Voula Giouli
|
Bruno Guillaume
|
Stella Markantonatou
|
Nurit Melnik
|
Joakim Nivre
|
Atul Kr. Ojha
|
Carlos Ramisch
|
Abigail Walsh
|
Beata Wójtowicz
|
Alina Wróblewska
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an open call for participation aimed at new members and countries.
2023
pdf
bib
abs
Incorporating Dropped Pronouns into Coreference Resolution: The case for Turkish
Tuğba Pamay Arslan
|
Gülşen Eryiğit
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Representation of coreferential relations is a challenging and actively studied topic for pro-drop and morphologically rich languages (PD-MRLs) due to dropped pronouns (e.g., null subjects and omitted possessive pronouns). These phenomena require a representation scheme at the morphology level and enhanced evaluation methods. In this paper, we propose a representation and evaluation scheme to incorporate dropped pronouns into coreference resolution and validate it on the Turkish language. Using the scheme, we extend the annotations on the only existing Turkish coreference dataset, which originally did not contain annotations for dropped pronouns. We provide publicly available pre- and post-processors that extend the prominent CoNLL coreference scorer to also cover coreferential relations arising from dropped pronouns. As a final step, the paper reports the first neural Turkish coreference resolution results in the literature. Although validated on Turkish, the proposed scheme is language-independent and may be used for other PD-MRLs.
pdf
abs
Towards Automatic Grammatical Error Type Classification for Turkish
Harun Uz
|
Gülşen Eryiğit
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Automatic error type classification is an important process in both learner corpora creation and evaluation of large-scale grammatical error correction systems. Rule-based classifier approaches such as ERRANT have been widely used to classify edits between correct-erroneous sentence pairs into predefined error categories. However, the error categories used are far from universal, yielding many language-specific variants of ERRANT. In this paper, we discuss the applicability of the previously introduced grammatical error types to an agglutinative language, Turkish. We suggest changes to the current error categories and discuss a hierarchical structure to better suit the inflectional and derivational properties of this morphologically highly rich language. We also introduce ERRANT-TR, the first automatic error type classification toolkit for Turkish. ERRANT-TR currently uses a rule-based error type classification pipeline which relies on word-level morphological information. Due to the unavailability of learner corpora in Turkish, the proposed system is evaluated on a small set of 106 annotated sentences and its performance is measured as 77.04% F0.5 score. The next step is to use ERRANT-TR for the development of a Turkish learner corpus.
pdf
abs
Neural End-to-End Coreference Resolution using Morphological Information
Tuğba Pamay Arslan
|
Kutay Acar
|
Gülşen Eryiğit
Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution
In morphologically rich languages, words consist of morphemes containing deeper information in morphology, and thus such languages may necessitate the use of morpheme-level representations as well as word representations. This study introduces a neural multilingual end-to-end coreference resolution system by incorporating morphological information in transformer-based word embeddings on the baseline model. This proposed model participated in the Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023). Including morphological information explicitly into the coreference resolution improves the performance, especially in morphologically rich languages (e.g., Catalan, Hungarian, and Turkish). The introduced model outperforms the baseline system by 2.57 percentage points on average by obtaining 59.53% CoNLL F-score.
2022
pdf
abs
AMR Alignment for Morphologically-rich and Pro-drop Languages
K. Elif Oral
|
Gülşen Eryiğit
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Alignment between concepts in an abstract meaning representation (AMR) graph and the words within a sentence is one of the important stages of AMR parsing. Although there exist high-performing AMR aligners for English, these are not well suited to the many languages in which concepts often arise from morpho-semantic elements. For the first time in the literature, this paper presents an AMR aligner tailored for morphologically-rich and pro-drop languages by experimenting on the Turkish language, a prominent example of this language group. Our aligner focuses on the meaning considering the rich Turkish morphology and aligns AMR concepts that emerge from morphemes using a tree traversal approach without additional resources or rules. We evaluate our aligner over a manually annotated gold data set in terms of precision, recall and F1 score. Our aligner outperforms the Turkish adaptations of the previously proposed aligners for English and Portuguese with an F1 score of 0.87 and provides a relative error reduction of up to 76%.
2020
pdf
abs
Constructing Multimodal Language Learner Texts Using LARA: Experiences with Nine Languages
Elham Akhlaghi
|
Branislav Bédi
|
Fatih Bektaş
|
Harald Berthelsen
|
Matthias Butterweck
|
Cathy Chua
|
Catia Cucchiarin
|
Gülşen Eryiğit
|
Johanna Gerlach
|
Hanieh Habibi
|
Neasa Ní Chiaráin
|
Manny Rayner
|
Steinþór Steingrímsson
|
Helmer Strik
Proceedings of the Twelfth Language Resources and Evaluation Conference
LARA (Learning and Reading Assistant) is an open source platform whose purpose is to support easy conversion of plain texts into multimodal online versions suitable for use by language learners. This involves semi-automatically tagging the text, adding other annotations and recording audio. The platform is suitable for creating texts in multiple languages via crowdsourcing techniques that can be used for teaching a language via reading and listening. We present results of initial experiments by various collaborators where we measure the time required to produce substantial LARA resources, up to the length of short novels, in Dutch, English, Farsi, French, German, Icelandic, Irish, Swedish and Turkish. The first results are encouraging. Although there are some startup problems, the conversion task seems manageable for the languages tested so far. The resulting enriched texts are posted online and are freely available in both source and compiled form.
pdf
bib
Substituto – A Synchronous Educational Language Game for Simultaneous Teaching and Crowdsourcing
Marianne Grace Araneta
|
Gülşen Eryiğit
|
Alexander König
|
Ji-Ung Lee
|
Ana Luís
|
Verena Lyding
|
Lionel Nicolas
|
Christos Rodosthenous
|
Federico Sangati
Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning
2019
pdf
bib
abs
Extracting Complex Relations from Banking Documents
Berke Oral
|
Erdem Emekligil
|
Seçil Arslan
|
Gülşen Eryiğit
Proceedings of the Second Workshop on Economics and Natural Language Processing
In order to automate banking processes (e.g. payments, money transfers, foreign trade), we need to extract banking transactions from different types of media such as faxes, e-mails, and scanners. Banking orders may be considered complex documents since they contain quite complex relations compared to traditional datasets used in relation extraction research. In this paper, we present our method to extract intersentential, nested and complex relations from banking orders, and introduce a relation extraction method based on the maximal clique factorization technique. We demonstrate an 11% error reduction over previous methods.
pdf
abs
Towards Turkish Abstract Meaning Representation
Zahra Azin
|
Gülşen Eryiğit
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Using rooted, directed and labeled graphs, Abstract Meaning Representation (AMR) abstracts away from syntactic features such as word order and does not annotate every constituent in a sentence. AMR has been specified for English and was not supposed to be an Interlingua. However, several studies strived to overcome divergences in the annotations between English AMRs and those of their target languages by refining the annotation specification. Following this line of research, we have started to build the first Turkish AMR corpus by hand-annotating 100 sentences of the Turkish translation of the novel “The Little Prince” and comparing the results with the English AMRs available for the same corpus. The next step is to prepare the Turkish AMR annotation specification for training future annotators.
2018
pdf
abs
Detecting Code-Switching between Turkish-English Language Pair
Zeynep Yirmibeşoğlu
|
Gülşen Eryiğit
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text
Code-switching (the alternating use of different languages within a single conversation context) is a rapidly growing phenomenon in social media and colloquial usage which poses different challenges for natural language processing. This paper introduces the first study for the detection of Turkish-English code-switching and also a small test data set collected from social media in order to smooth the way for further studies. The proposed system using character level n-grams and conditional random fields (CRFs) obtains 95.6% micro-averaged F1-score on the introduced test data set.
2017
pdf
abs
Survey: Multiword Expression Processing: A Survey
Mathieu Constant
|
Gülşen Eryiğit
|
Johanna Monti
|
Lonneke van der Plas
|
Carlos Ramisch
|
Michael Rosner
|
Amalia Todirascu
Computational Linguistics, Volume 43, Issue 4 - December 2017
Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by “MWE processing,” distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: Parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.
2016
pdf
bib
SemEval-2016 Task 5: Aspect Based Sentiment Analysis
Maria Pontiki
|
Dimitris Galanis
|
Haris Papageorgiou
|
Ion Androutsopoulos
|
Suresh Manandhar
|
Mohammad AL-Smadi
|
Mahmoud Al-Ayyoub
|
Yanyan Zhao
|
Bing Qin
|
Orphée De Clercq
|
Véronique Hoste
|
Marianna Apidianaki
|
Xavier Tannier
|
Natalia Loukachevitch
|
Evgeniy Kotelnikov
|
Nuria Bel
|
Salud María Jiménez-Zafra
|
Gülşen Eryiğit
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
pdf
TGB at SemEval-2016 Task 5: Multi-Lingual Constraint System for Aspect Based Sentiment Analysis
Fatih Samet Çetin
|
Ezgi Yıldırım
|
Can Özbey
|
Gülşen Eryiğit
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
pdf
abs
Universal Dependencies for Turkish
Umut Sulubacak
|
Memduh Gokirmak
|
Francis Tyers
|
Çağrı Çöltekin
|
Joakim Nivre
|
Gülşen Eryiğit
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
The Universal Dependencies (UD) project was conceived after the substantial recent interest in unifying annotation schemes across languages. With its own annotation principles and abstract inventory for parts of speech, morphosyntactic features and dependency relations, UD aims to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. This paper presents the Turkish IMST-UD Treebank, the first Turkish treebank to be in a UD release. The IMST-UD Treebank was automatically converted from the IMST Treebank, which was also recently released. We describe this conversion procedure in detail, complete with mapping tables. We also present our evaluation of the parsing performances of both versions of the IMST Treebank. Our findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.
2015
pdf
Annotation and Extraction of Multiword Expressions in Turkish Treebanks
Gülşen Eryiğit
|
Kübra Adali
|
Dilara Torunoğlu-Selamet
|
Umut Sulubacak
|
Tuğba Pamay
Proceedings of the 11th Workshop on Multiword Expressions
pdf
The Annotation Process of the ITU Web Treebank
Tuğba Pamay
|
Umut Sulubacak
|
Dilara Torunoğlu-Selamet
|
Gülşen Eryiğit
Proceedings of the 9th Linguistic Annotation Workshop
pdf
Using Finite State Transducers for Helping Foreign Language Learning
Hasan Kaya
|
Gülşen Eryiğit
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications
pdf
Transition-based Dependency DAG Parsing Using Dynamic Oracles
Alper Tokgöz
|
Gülşen Eryiğit
Proceedings of the ACL-IJCNLP 2015 Student Research Workshop
2014
pdf
Vowel and Diacritic Restoration for Social Media Texts
Kübra Adali
|
Gülşen Eryiğit
Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)
pdf
A Cascaded Approach for Social Media Text Normalization of Turkish
Dilara Torunoğlu
|
Gülşen Eryiğit
Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)
pdf
bib
ITU Turkish NLP Web Service
Gülşen Eryiğit
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics
2013
pdf
TURKSENT: A Sentiment Annotation Tool for Social Media
Gülşen Eryiğit
|
Fatih Samet Çetin
|
Meltem Yanık
|
Tanel Temel
|
İlyas Çiçekli
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse
pdf
Representation of Morphosyntactic Units and Coordination Structures in the Turkish Dependency Treebank
Umut Sulubacak
|
Gülşen Eryiğit
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages
2012
pdf
Initial Explorations on using CRFs for Turkish Named Entity Recognition
Gökhan Akın Şeker
|
Gülşen Eryiğit
Proceedings of COLING 2012
pdf
abs
The Impact of Automatic Morphological Analysis & Disambiguation on Dependency Parsing of Turkish
Gülşen Eryiğit
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Studies on dependency parsing of Turkish have so far reported their results on the Turkish Dependency Treebank. This treebank consists of sentences where gold standard part-of-speech tags are manually assigned to each word and the words forming multi word expressions are also manually determined and combined into single units. For the first time, we investigate the results of parsing Turkish sentences from scratch and observe the accuracy drop at the end of processing raw data. We test one state-of-the-art morphological analyzer together with two different morphological disambiguators. We show separately the accuracy drop due to the automatic morphological processing and the drop due to the lack of multi word unit extraction. For this purpose, we use and present a new version of the Turkish Treebank where we detached the multi word expressions (MWEs) into multiple tokens and manually annotated the missing part-of-speech tags of these new tokens.
pdf
abs
Word Alignment for English-Turkish Language Pair
Mehmet Talha Çakmak
|
Süleyman Acar
|
Gülşen Eryiğit
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Word alignment is an important step for machine translation systems. Although the alignment performance between grammatically similar languages is reported to be very high in many studies, the case is not the same for language pairs from different language families. In this study, we focus on the English-Turkish language pair. Turkish is a highly agglutinative language with a very productive and rich morphology, whereas English has a very poor morphology by comparison. As a result, one Turkish word is usually aligned with several English words. The traditional models which use word-level alignment approaches generally fail in such circumstances. In this study, we evaluate a Giza++ system by splitting the words into their morphological units (stem and suffixes) and compare the model with the traditional one. For the first time, we evaluate the performance of our aligner on gold standard parallel sentences rather than in a real machine translation system. Our approach reduced the alignment error rate by 40% relative. Finally, a new test corpus of 300 manually aligned sentences is released together with this study.
pdf
Disambiguating Main POS tags for Turkish
Razieh Ehsani
|
Muzaffer Ege Alper
|
Gülşen Eryiğit
|
Eşref Adali
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
2011
pdf
Multiword Expressions in Statistical Dependency Parsing
Gülşen Eryiğit
|
Tugay İlbay
|
Ozan Arkan Can
Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages
2008
pdf
Dependency Parsing of Turkish
Gülşen Eryiğit
|
Joakim Nivre
|
Kemal Oflazer
Computational Linguistics, Volume 34, Number 3, September 2008
pdf
Erratum: Dependency Parsing of Turkish
Gülşen Eryiğit
|
Joakim Nivre
|
Kemal Oflazer
Computational Linguistics, Volume 34, Number 4, December 2008
2007
pdf
Single Malt or Blended? A Study in Multilingual Parser Optimization
Johan Hall
|
Jens Nilsson
|
Joakim Nivre
|
Gülşen Eryiğit
|
Beáta Megyesi
|
Mattias Nilsson
|
Markus Saers
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
pdf
ITU Treebank Annotation Tool
Gülşen Eryiğit
Proceedings of the Linguistic Annotation Workshop
2006
pdf
Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines
Joakim Nivre
|
Johan Hall
|
Jens Nilsson
|
Gülşen Eryiğit
|
Svetoslav Marinov
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)
pdf
Statistical Dependency Parsing for Turkish
Gülşen Eryiğit
|
Kemal Oflazer
11th Conference of the European Chapter of the Association for Computational Linguistics