Şaziye Betül Özateş

Also published as: Şaziye Betül Özateş, Şaziye Betül Özateş


2024

pdf
Towards a Clean Text Corpus for Ottoman Turkish
Fatih Karagöz | Berat Doğan | Şaziye Betül Özateş
Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024)

Ottoman Turkish, as a historical variant of modern Turkish, suffers from a scarcity of available corpora and NLP models. This paper outlines our pioneering endeavors to address this gap by constructing a clean text corpus of Ottoman Turkish materials. We detail the challenges encountered in this process and offer potential solutions. Additionally, we present a case study wherein the created corpus is employed in continual pre-training of BERTurk, followed by evaluation of the model’s performance on the named entity recognition task for Ottoman Turkish. Preliminary experimental results suggest the effectiveness of our corpus in adapting existing models developed for modern Turkish to historical Turkish.

pdf
Dependency Annotation of Ottoman Turkish with Multilingual BERT
Şaziye Betül Özateş | Tarık Tıraş | Efe Genç | Esma Bilgin Tasdemir
Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII)

This study introduces a pretrained large language model-based annotation methodology of the first dependency treebank in Ottoman Turkish. Our experimental results show that, through iteratively i) pseudo-annotating data using a multilingual BERT-based parsing model, ii) manually correcting the pseudo-annotations, and iii) fine-tuning the parsing model with the corrected annotations, we speed up and simplify the challenging dependency annotation process. The resulting treebank, that will be a part of the Universal Dependencies (UD) project, will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.

2022

pdf
Improving Code-Switching Dependency Parsing with Semi-Supervised Auxiliary Tasks
Şaziye Betül Özateş | Arzucan Özgür | Tunga Gungor | Özlem Çetinoğlu
Findings of the Association for Computational Linguistics: NAACL 2022

Code-switching dependency parsing stands as a challenging task due to both the scarcity of necessary resources and the structural difficulties embedded in code-switched languages. In this study, we introduce novel sequence labeling models to be used as auxiliary tasks for dependency parsing of code-switched text in a semi-supervised scheme. We show that using auxiliary tasks enhances the performance of an LSTM-based dependency parsing model and leads to better results compared to an XLM-R-based model with significantly less computational and time complexity. As the first study that focuses on multiple code-switching language pairs for dependency parsing, we acquire state-of-the-art scores on all of the studied languages. Our best models outperform the previous work by 7.4 LAS points on average.

2021

pdf
A Language-aware Approach to Code-switched Morphological Tagging
Şaziye Betül Özateş | Özlem Çetinoğlu
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Morphological tagging of code-switching (CS) data becomes more challenging especially when language pairs composing the CS data have different morphological representations. In this paper, we explore a number of ways of implementing a language-aware morphological tagging method and present our approach for integrating language IDs into a transformer-based framework for CS morphological tagging. We perform our set of experiments on the Turkish-German SAGT Treebank. Experimental results show that including language IDs to the learning model significantly improves accuracy over other approaches.

2020

pdf
First Steps towards Universal Dependencies for Laz
Utku Türk | Kaan Bayar | Ayşegül Dilara Özercan | Görkem Yiğit Öztürk | Şaziye Betül Özateş
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

This paper presents the first treebank for the Laz language, which is also the first Universal Dependencies Treebank for a South Caucasian language. This treebank aims to create a syntactically and morphologically annotated resource for further research. We also aim to document an endangered language in a systematic fashion within an inherently cross-linguistic framework: the Universal Dependencies Project (UD). As of now, our treebank consists of 576 sentences and 2,306 tokens annotated in light with the UD guidelines. We evaluated the treebank on the dependency parsing task using a pretrained multilingual parsing model, and the results are comparable with other low-resourced treebanks with no training set. We aim to expand our treebank in the near future to include 1,500 sentences. The bigger goal for our project is to create a set of treebanks for minority languages in Anatolia.

2019

pdf
Turkish Treebanking: Unifying and Constructing Efforts
Utku Türk | Furkan Atmaca | Şaziye Betül Özateş | Abdullatif Köksal | Balkiz Ozturk Basaran | Tunga Gungor | Arzucan Özgür
Proceedings of the 13th Linguistic Annotation Workshop

In this paper, we present the current version of two different treebanks, the re-annotation of the Turkish PUD Treebank and the first annotation of the Turkish National Corpus Universal Dependency (henceforth TNC-UD). The annotation of both treebanks, the Turkish PUD Treebank and TNC-UD, was carried out based on the decisions concerning linguistic adequacy of re-annotation of the Turkish IMST-UD Treebank (Türk et. al., forthcoming). Both of the treebanks were annotated with the same annotation process and morphological and syntactic analyses. The TNC-UD is planned to have 10,000 sentences. In this paper, we will present the first 500 sentences along with the annotation PUD Treebank. Moreover, this paper also offers the parsing results of a graph-based neural parser on the previous and re-annotated PUD, as well as the TNC-UD. In light of the comparisons, even though we observe a slight decrease in the attachment scores of the Turkish PUD treebank, we demonstrate that the annotation of the TNC-UD improves the parsing accuracy of Turkish. In addition to the treebanks, we have also constructed a custom annotation software with advanced filtering and morphological editing options. Both the treebanks, including a full edit-history and the annotation guidelines, and the custom software are publicly available under an open license online.

pdf
Improving the Annotations in the Turkish Universal Dependency Treebank
Utku Türk | Furkan Atmaca | Şaziye Betül Özateş | Balkız Öztürk Başaran | Tunga Güngör | Arzucan Özgür
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2018

pdf
A Morphology-Based Representation Model for LSTM-Based Dependency Parsing of Agglutinative Languages
Şaziye Betül Özateş | Arzucan Özgür | Tunga Güngör | Balkız Öztürk
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We propose two word representation models for agglutinative languages that better capture the similarities between words which have similar tasks in sentences. Our models highlight the morphological features in words and embed morphological information into their dense representations. We have tested our models on an LSTM-based dependency parser with character-based word embeddings proposed by Ballesteros et al. (2015). We participated in the CoNLL 2018 Shared Task on multilingual parsing from raw text to universal dependencies as the BOUN team. We show that our morphology-based embedding models improve the parsing performance for most of the agglutinative languages.

2016

pdf
Sentence Similarity based on Dependency Tree Kernels for Multi-document Summarization
Şaziye Betül Özateş | Arzucan Özgür | Dragomir Radev
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce an approach based on using the dependency grammar representations of sentences to compute sentence similarity for extractive multi-document summarization. We adapt and investigate the effects of two untyped dependency tree kernels, which have originally been proposed for relation extraction, to the multi-document summarization problem. In addition, we propose a series of novel dependency grammar based kernels to better represent the syntactic and semantic similarities among the sentences. The proposed methods incorporate the type information of the dependency relations for sentence similarity calculation. To our knowledge, this is the first study that investigates using dependency tree based sentence similarity for multi-document summarization.