Özlem Çetinoğlu

Also published as: Ozlem Cetinoglu, Özlem Çetinoglu

2022

Anonymising the SAGT Speech Corpus and Treebank
Özlem Çetinoğlu | Antje Schweitzer
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Anonymisation, that is identifying and neutralising sensitive references, is a crucial part of dataset creation. In this paper, we describe the anonymisation process of a Turkish-German code-switching corpus, namely SAGT, which consists of speech data and a treebank that is built on its transcripts. We employed a selective pseudonymisation approach where we manually identified sensitive references to anonymise and replaced them with surrogate values on the treebank side. In addition to maintaining data privacy, our primary concerns in surrogate selection were keeping the integrity of code-switching properties, morphosyntactic annotation layers, and semantics. After the treebank anonymisation, we anonymised the speech data by mapping between the treebank sentences and audio transcripts with the help of Praat scripts. The treebank is publicly available for research purposes and the audio files can be obtained via an individual licence agreement.

Improving Code-Switching Dependency Parsing with Semi-Supervised Auxiliary Tasks
Şaziye Betül Özateş | Arzucan Özgür | Tunga Gungor | Özlem Çetinoğlu
Findings of the Association for Computational Linguistics: NAACL 2022

Code-switching dependency parsing stands as a challenging task due to both the scarcity of necessary resources and the structural difficulties embedded in code-switched languages. In this study, we introduce novel sequence labeling models to be used as auxiliary tasks for dependency parsing of code-switched text in a semi-supervised scheme. We show that using auxiliary tasks enhances the performance of an LSTM-based dependency parsing model and leads to better results compared to an XLM-R-based model with significantly less computational and time complexity. As the first study that focuses on multiple code-switching language pairs for dependency parsing, we acquire state-of-the-art scores on all of the studied languages. Our best models outperform the previous work by 7.4 LAS points on average.

2021

Lexical Normalization for Code-switched Data and its Effect on POS Tagging
Rob van der Goot | Özlem Çetinoğlu
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet, using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data which we evaluate for two language pairs: Indonesian-English and Turkish-German. For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models significantly outperform monolingual ones, and lead to a 5.4% relative performance increase for POS tagging as compared to unnormalized input.

Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, there exists a lack of a common benchmark for comparison of systems across languages with a homogeneous data and evaluation setup. The MultiLexNorm shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark including 13 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task hosted at W-NUT 2021 attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.

A Language-aware Approach to Code-switched Morphological Tagging
Şaziye Betül Özateş | Özlem Çetinoğlu
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Morphological tagging of code-switching (CS) data becomes more challenging especially when language pairs composing the CS data have different morphological representations. In this paper, we explore a number of ways of implementing a language-aware morphological tagging method and present our approach for integrating language IDs into a transformer-based framework for CS morphological tagging. We perform our set of experiments on the Turkish-German SAGT Treebank. Experimental results show that including language IDs to the learning model significantly improves accuracy over other approaches.

Assessing Gender Bias in Wikipedia: Inequalities in Article Titles
Agnieszka Falenska | Özlem Çetinoğlu
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing

Potential gender biases existing in Wikipedia’s content can contribute to biased behaviors in a variety of downstream NLP systems. Yet, efforts in understanding what inequalities in portraying women and men occur in Wikipedia have so far focused only on *biographies*, leaving open the question of how often such harmful patterns occur in other topics. In this paper, we investigate gender-related asymmetries in Wikipedia titles from *all domains*. We assess that for only half of gender-related articles, i.e., articles with words such as *women* or *male* in their titles, symmetrical counterparts describing the same concept for the other gender (and clearly stating it in their titles) exist. Among the remaining imbalanced cases, the vast majority of articles concern sports- and social-related issues. We provide insights on how such asymmetries can influence other Wikipedia components and propose steps towards reducing the frequency of observed patterns.

2020

Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus
Mohamed Balabel | Injy Hamed | Slim Abdennadher | Ngoc Thang Vu | Özlem Çetinoğlu
Proceedings of the Twelfth Language Resources and Evaluation Conference

Code-switching has become a prevalent phenomenon across many communities. It poses a challenge to NLP researchers, mainly due to the lack of available data needed for training and testing applications. In this paper, we introduce a new resource: a corpus of Egyptian Arabic code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags. Besides the corpus itself, we provide annotation guidelines to address the unique challenges of annotating code-switch data. Another challenge that we address is the fact that Egyptian Arabic orthography and grammar are not standardized.

The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.

Tackling the Low-resource Challenge for Canonical Segmentation
Manuel Mager | Özlem Çetinoğlu | Katharina Kann
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM pointer-generator and a sequence-to-sequence model with hard monotonic attention trained with imitation learning. We find that, in the low-resource setting, the novel approaches out-perform existing ones on all languages by up to 11.4% accuracy. However, while accuracy in emulated low-resource scenarios is over 50% for all languages, for the truly low-resource languages Popoluca and Tepehua, our best model only obtains 37.4% and 28.4% accuracy, respectively. Thus, we conclude that canonical segmentation is still a challenging task for low-resource languages.

2019

Subword-Level Language Identification for Intra-Word Code-Switching
Manuel Mager | Özlem Çetinoğlu | Katharina Kann
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Language identification for code-switching (CS), the phenomenon of alternating between two or more languages in conversations, has traditionally been approached under the assumption of a single language per token. However, if at least one language is morphologically rich, a large number of words can be composed of morphemes from more than one language (intra-word CS). In this paper, we extend the language identification task to the subword-level, such that it includes splitting mixed words while tagging each part with a language ID. We further propose a model for this task, which is based on a segmental recurrent neural network. In experiments on a new Spanish–Wixarika dataset and on an adapted German–Turkish dataset, our proposed model performs slightly better than or roughly on par with our best baseline, respectively. Considering only mixed words, however, it strongly outperforms all baselines.

Challenges of Annotating a Code-Switching Treebank
Özlem Çetinoğlu | Çağrı Çöltekin
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

2017

A Code-Switching Corpus of Turkish-German Conversations
Özlem Çetinoğlu
Proceedings of the 11th Linguistic Annotation Workshop

We present a code-switching corpus of Turkish-German that is collected by recording conversations of bilinguals. The recordings are then transcribed in two layers following speech and orthography conventions, and annotated with sentence boundaries and intersentential, intrasentential, and intra-word switch points. The total amount of data is 5 hours of speech which corresponds to 3614 sentences. The corpus aims at serving as a resource for speech or text analysis, as well as a collection for linguistic inquiries.

Lexicalized vs. Delexicalized Parsing in Low-Resource Scenarios
Agnieszka Falenska | Özlem Çetinoğlu
Proceedings of the 15th International Conference on Parsing Technologies

We present a systematic analysis of lexicalized vs. delexicalized parsing in low-resource scenarios, and propose a methodology to choose one method over another under certain conditions. We create a set of simulation experiments on 41 languages and apply our findings to 9 low-resource languages. Experimental results show that our methodology chooses the best approach in 8 out of 9 cases.

2016

Part of Speech Annotation of a Turkish-German Code-Switching Corpus
Özlem Çetinoğlu | Çağrı Çöltekin
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

Challenges of Computational Processing of Code-Switching
Özlem Çetinoğlu | Sarah Schulz | Ngoc Thang Vu
Proceedings of the Second Workshop on Computational Approaches to Code Switching

A Turkish-German Code-Switching Corpus
Özlem Çetinoğlu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Bilingual communities often alternate between languages both in spoken and written communication. One such community, Germany residents of Turkish origin, produces Turkish-German code-switching by heavily mixing two languages at discourse, sentence, or word level. Code-switching in general, and Turkish-German code-switching in particular, has been studied for a long time from a linguistic perspective. Yet resources to study them from a more computational perspective are limited due to either small size or licence issues. In this work we contribute to the solution of this problem with a corpus. We present a Turkish-German code-switching corpus which consists of 1029 tweets, with a majority of intra-sentential switches. We share the different types of code-switching we have observed in our collection and describe our processing steps. The first step is data collection and filtering. This is followed by manual tokenisation and normalisation. Finally, we annotate the data with word-level language identification information. The resulting corpus is available for research purposes.

2015

Stacking or Supertagging for Dependency Parsing – What’s the Difference?
Agnieszka Faleńska | Anders Björkelund | Özlem Çetinoğlu | Wolfgang Seeker
Proceedings of the 14th International Conference on Parsing Technologies

A Graph-based Lattice Dependency Parser for Joint Morphological Segmentation and Syntactic Analysis
Wolfgang Seeker | Özlem Çetinoğlu
Transactions of the Association for Computational Linguistics, Volume 3

Space-delimited words in Turkish and Hebrew text can be further segmented into meaningful units, but syntactic and semantic context is necessary to predict segmentation. At the same time, predicting correct syntactic structures relies on correct segmentation. We present a graph-based lattice dependency parser that operates on morphological lattices to represent different segmentations and morphological analyses for a given input sentence. The lattice parser predicts a dependency tree over a path in the lattice and thus solves the joint task of segmentation, morphological analysis, and syntactic parsing. We conduct experiments on the Turkish and the Hebrew treebank and show that the joint model outperforms three state-of-the-art pipeline systems on both data sets. Our work corroborates findings from constituency lattice parsing for Hebrew and presents the first results for full lattice parsing on Turkish.

2014

Turkish Treebank as a Gold Standard for Morphological Disambiguation and Its Influence on Parsing
Özlem Çetinoğlu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

So far predicted scenarios for Turkish dependency parsing have used a morphological disambiguator that is trained on the data distributed with the tool (Sak et al., 2008). Although models trained on this data have high accuracy scores on the test and development data of the same set, the accuracy drastically drops when the model is used in the preprocessing of Turkish Treebank parsing experiments. We propose to use the Turkish Treebank (Oflazer et al., 2003) as a morphological resource to overcome this problem and convert the treebank to the morphological disambiguator's format. The experimental results show that we achieve improvements in disambiguating the Turkish Treebank and the results also carry over to parsing. With the help of better morphological analysis, we present the best labelled dependency parsing scores to date on Turkish.

Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM
Özlem Çetinoğlu | Jeffrey Heinz | Andreas Maletti | Jason Riggle
Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM

Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages
Yoav Goldberg | Yuval Marton | Ines Rehbein | Yannick Versley | Özlem Çetinoğlu | Joel Tetreault
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

Introducing the IMS-Wrocław-Szeged-CIS entry at the SPMRL 2014 Shared Task: Reranking and Morpho-syntax meet Unlabeled Data
Anders Björkelund | Özlem Çetinoğlu | Agnieszka Faleńska | Richárd Farkas | Thomas Mueller | Wolfgang Seeker | Zsolt Szántó
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

2013

Towards Joint Morphological Analysis and Dependency Parsing of Turkish
Özlem Çetinoğlu | Jonas Kuhn
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

(Re)ranking Meets Morphosyntax: State-of-the-art Results from the SPMRL 2013 Shared Task
Anders Björkelund | Özlem Çetinoğlu | Richárd Farkas | Thomas Mueller | Wolfgang Seeker
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

2012

Irish Treebanking and Parsing: A Preliminary Evaluation
Teresa Lynn | Özlem Çetinoğlu | Jennifer Foster | Elaine Uí Dhonnchadha | Mark Dras | Josef van Genabith
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Language resources are essential for linguistic research and the development of NLP applications. Low-density languages, such as Irish, therefore lack significant research in this area. This paper describes the early stages in the development of new language resources for Irish, namely the first Irish dependency treebank and the first Irish statistical dependency parser. We present the methodology behind building our new treebank and the steps we take to leverage the few existing resources. We discuss language-specific choices made when defining our dependency labelling scheme, and describe interesting Irish language characteristics such as prepositional attachment, copula, and clefting. We manually develop a small treebank of 300 sentences based on an existing POS-tagged corpus and report an inter-annotator agreement of 0.7902. We train MaltParser to achieve preliminary parsing results for Irish and describe a bootstrapping approach for further stages of development.

In this paper, we give a description of the Machine Translation (MT) system developed at DCU that was used for our fourth participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2009). Two techniques are deployed in our system in order to improve the translation quality in a low-resource scenario. The first technique is to use multiple segmentations in MT training and to utilise word lattices in the decoding stage. The second technique is used to select the optimal training data that can be used to build MT systems. In this year’s participation, we use three different prototype SMT systems, and the outputs of the systems are combined using a standard system combination method. Our system is the top system for the Chinese–English CHALLENGE task in terms of BLEU score.