Deniz Yuret


2024

pdf
Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking
Emre Can Acikgoz | Mete Erdogan | Deniz Yuret
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)

Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations, with a special focus on Turkish. We conduct an in-depth analysis to evaluate the impact of training strategies, model choices, and data availability on the performance of LLMs designed for underrepresented languages. Our approach includes two methodologies: (i) adapting existing LLMs originally pretrained in English to understand Turkish, and (ii) developing a model from the ground up using Turkish pretraining data, both supplemented with supervised fine-tuning on a novel Turkish instruction-tuning dataset aimed at enhancing reasoning capabilities. The relative performance of these methods is evaluated through the creation of a new leaderboard for Turkish LLMs, featuring benchmarks that assess different reasoning and knowledge skills. Furthermore, we conduct experiments on data and model scaling, both during pretraining and fine-tuning, simultaneously emphasizing the capacity for knowledge transfer across languages and addressing the challenges of catastrophic forgetting encountered during fine-tuning on a different language. Our goal is to offer a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making natural language processing (NLP) benefits more globally accessible.
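As a rough illustration of the first methodology above (continued pretraining of an English-pretrained LLM on Turkish text), here is a minimal sketch using the Hugging Face Trainer; the base checkpoint, corpus file, and hyperparameters are illustrative placeholders, not the paper's actual configuration.

```python
# Sketch of language adaptation via continued pretraining on Turkish text
# with the Hugging Face Trainer. The base checkpoint, corpus file, and
# hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

name = "mistralai/Mistral-7B-v0.1"                  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(name)

corpus = load_dataset("text", data_files={"train": "turkish_corpus.txt"})
train = corpus["train"].map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tr-adapted",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           learning_rate=2e-5, num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                                     # causal LM objective
```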

pdf
Neurocache: Efficient Vector Retrieval for Long-range Language Modeling
Ali Safaya | Deniz Yuret
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

This paper introduces Neurocache, an approach to extend the effective context size of large language models (LLMs) using an external vector cache to store their past states. Like recent vector retrieval approaches, Neurocache uses an efficient k-nearest-neighbor (kNN) algorithm to retrieve relevant past states and incorporate them into the attention process. Neurocache improves upon previous methods by (1) storing compressed states, which reduces cache size; (2) performing a single retrieval operation per token, which increases inference speed; and (3) extending the retrieval window to neighboring states, which improves both language modeling and downstream task accuracy. Our experiments show the effectiveness of Neurocache both for models trained from scratch and for pre-trained models such as Llama2-7B and Mistral-7B when enhanced with the cache mechanism. We also compare Neurocache with text retrieval methods and show improvements in single-document question-answering and few-shot learning tasks. The source code is available at: https://github.com/alisafaya/neurocache
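A minimal sketch of the cache-and-retrieve idea described above, in PyTorch; the dimensions, the learned compression projection, and the right-neighbor window are simplified stand-ins for the paper's actual design.

```python
# Sketch of the cache-and-retrieve loop: compress hidden states, store them
# in an external cache, and retrieve the k nearest neighbors (plus their
# right neighbors) for each token. Dimensions are illustrative.
import torch

d_model, d_cache, k = 512, 128, 4
compress = torch.nn.Linear(d_model, d_cache, bias=False)
cache = torch.empty(0, d_cache)            # grows as the document is processed

def retrieve_and_update(hidden):           # hidden: (seq_len, d_model)
    global cache
    q = compress(hidden)                   # compressed states: (seq_len, d_cache)
    retrieved = None
    if len(cache) >= k:
        sims = q @ cache.T                 # dot-product similarity to cache
        idx = sims.topk(k, dim=-1).indices
        nbr = (idx + 1).clamp(max=len(cache) - 1)   # widen window to neighbors
        retrieved = cache[torch.cat([idx, nbr], dim=-1)]  # (seq_len, 2k, d_cache)
    cache = torch.cat([cache, q.detach()]) # one cache update per segment
    return retrieved                       # fed to attention over past states

print(retrieve_and_update(torch.randn(16, d_model)))        # None: cache empty
print(retrieve_and_update(torch.randn(16, d_model)).shape)  # (16, 8, 128)
```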

pdf
Sequential Compositional Generalization in Multimodal Models
Semih Yagcioglu | Osman Batur İnce | Aykut Erdem | Erkut Erdem | Desmond Elliott | Deniz Yuret
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The rise of large-scale multimodal models has paved the way for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using CompAct (Compositional Activities), a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.

pdf bib
Unsupervised Learning of Turkish Morphology with Multiple Codebook VQ-VAE
Müge Kural | Deniz Yuret
Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024)

This paper presents an interpretable unsupervised morphological learning model that shows performance comparable to supervised models in learning the complex morphological rules of Turkish, as evidenced by its application to the problem of morphological inflection within the SIGMORPHON Shared Tasks. The significance of our unsupervised approach lies in its alignment with how humans naturally acquire rules from raw data without supervision. To achieve this, we construct a model with multiple codebooks of a VQ-VAE, employing continuous and discrete latent variables during word generation. We evaluate the model’s performance under high- and low-resource scenarios, and use probing techniques to examine the information encoded in its latent representations. We also evaluate its generalization capabilities by testing unseen suffixation scenarios within the SIGMORPHON-UniMorph 2022 Shared Task 0. Our results demonstrate the model’s ability to segment words into lemmas and suffixes, with each codebook specializing in different morphological features; this contributes to the interpretability of our model and enables effective morphological inflection on both seen and unseen morphological features.
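A minimal sketch of the multiple-codebook vector quantization step, with made-up codebook sizes; the straight-through estimator passes gradients through the discrete choice, as in standard VQ-VAE training.

```python
# Sketch of quantization with multiple codebooks: the encoder vector is
# split into heads and each head snaps to the nearest entry of its own
# codebook. Gradients flow via the straight-through estimator.
import torch

n_books, book_size, d_head = 4, 64, 32
codebooks = torch.nn.Parameter(torch.randn(n_books, book_size, d_head))

def quantize(z):                           # z: (batch, n_books * d_head)
    z = z.view(-1, n_books, d_head)
    # squared distance of every head to every codebook entry
    dists = ((z.unsqueeze(2) - codebooks.unsqueeze(0)) ** 2).sum(-1)
    codes = dists.argmin(-1)               # one discrete code per head
    q = codebooks[torch.arange(n_books), codes]   # (batch, n_books, d_head)
    q = z + (q - z).detach()               # straight-through estimator
    return q.reshape(z.size(0), -1), codes

z = torch.randn(8, n_books * d_head)
quantized, codes = quantize(z)
print(codes.shape)                         # torch.Size([8, 4])
```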

2022

pdf
Mukayese: Turkish NLP Strikes Back
Ali Safaya | Emirhan Kurtuluş | Arda Goktogan | Deniz Yuret
Findings of the Association for Computational Linguistics: ACL 2022

Having sufficient resources for language X lifts it from the under-resourced languages class, but not necessarily from the under-researched class. In this paper, we address the problem of the absence of organized benchmarks in the Turkish language. We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications. As a solution, we present Mukayese, a set of NLP benchmarks for the Turkish language that contains several NLP tasks. We work on one or more datasets for each benchmark and present two or more baselines. Moreover, we present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking. All datasets and baselines are available at: https://github.com/alisafaya/mukayese

pdf
CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions
Tayfun Ates | M. Ateşoğlu | Çağatay Yiğit | Ilker Kesen | Mert Kobas | Erkut Erdem | Aykut Erdem | Tilbe Goksun | Deniz Yuret
Findings of the Association for Computational Linguistics: ACL 2022

Humans are able to perceive, understand, and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step in this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories in CRAFT include the previously studied descriptive and counterfactual questions. Additionally, inspired by Force Dynamics Theory in cognitive linguistics, we introduce a new causal question category that involves understanding the causal interactions between objects through notions like cause, enable, and prevent. Our results show that even though the questions in CRAFT are easy for humans, the tested baseline models, including existing state-of-the-art methods, do not yet deal with the challenges posed by our benchmark.

pdf
Transformers on Multilingual Clause-Level Morphology
Emre Can Acikgoz | Tilek Chubakov | Muge Kural | Gözde Şahin | Deniz Yuret
Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)

This paper describes the KUIS-AI NLP team’s submission to the 1st Shared Task on Multilingual Clause-level Morphology (MRL 2022). We present our work on all three parts of the shared task: inflection, reinflection, and analysis. We mainly explore two approaches: Transformer models in combination with data augmentation, and exploiting state-of-the-art language modeling techniques for morphological analysis. Data augmentation leads to a remarkable performance improvement for most of the languages in the inflection task. Prefix-tuning on a pretrained mGPT model helps us adapt to the reinflection and analysis tasks in a low-data setting. Additionally, we use pipeline architectures built from publicly available open-source lemmatization tools and monolingual BERT-based morphological feature classifiers for the reinflection and analysis tasks, respectively. While Transformer architectures with data augmentation and pipeline architectures achieved the best results for the inflection and reinflection tasks, pipelines and prefix-tuning on mGPT achieved the best results for the analysis task. Our methods achieved first place in each of the three tasks and outperform the mT5 baseline by 89% for inflection, 80% for reinflection, and 12% for analysis. Our code is publicly available.
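As a rough sketch of the prefix-tuning setup on a pretrained mGPT model, using the peft library; the checkpoint name and the number of virtual tokens are illustrative, not the submission's actual configuration.

```python
# Sketch of prefix-tuning a pretrained multilingual GPT with the peft
# library; the checkpoint and the number of virtual tokens are illustrative.
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT")
config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # only the prefix parameters train
```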

2021

pdf bib
Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021): Workshop and Shared Task Report
Ali Hürriyetoğlu | Hristo Tanev | Vanni Zavarella | Jakub Piskorski | Reyyan Yeniterzi | Osman Mutlu | Deniz Yuret | Aline Villavicencio
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

This workshop is the fourth edition of a series of workshops on automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of this series of workshops is to foster research and development of reliable, valid, robust, and practical solutions for automatically detecting descriptions of socio-political events, such as protests, riots, wars, and armed conflicts, in text streams. This year’s workshop contributors make use of state-of-the-art NLP technologies, such as Deep Learning, Word Embeddings, and Transformers, and cover a wide range of topics from text classification to news bias detection. Around 40 teams registered and 15 teams contributed to three tasks: i) multilingual protest news detection, ii) fine-grained classification of socio-political events, and iii) discovering Black Lives Matter protest events. The workshop also highlights two keynote talks and four invited talks about various aspects of creating event data sets and multi- and cross-lingual machine learning in few- and zero-shot settings.

2020

pdf
Joint Training with Semantic Role Labeling for Better Generalization in Natural Language Inference
Cemil Cengiz | Deniz Yuret
Proceedings of the 5th Workshop on Representation Learning for NLP

End-to-end models trained on natural language inference (NLI) datasets show low generalization on out-of-distribution evaluation sets. The models tend to learn shallow heuristics due to dataset biases. The performance decreases dramatically on diagnostic sets measuring compositionality or robustness against simple heuristics. Existing solutions for this problem employ dataset augmentation, which has the drawbacks of being applicable to only a limited set of adversaries and, at worst, hurting model performance on other adversaries not included in the augmentation set. Instead, our proposed solution is to improve sentence understanding (and hence out-of-distribution generalization) with joint learning of explicit semantics. We show that a BERT-based model trained jointly on English semantic role labeling (SRL) and NLI achieves significantly higher performance on external evaluation sets measuring generalization performance.
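A minimal sketch of the joint training setup: one shared encoder feeding an NLI classification head and an SRL tagging head, optimizing the summed loss. The label counts, base checkpoint, and equal loss weighting are assumptions, not the paper's exact configuration.

```python
# Sketch of joint training: a shared BERT encoder feeds an NLI head (over
# the [CLS] vector) and an SRL tagging head (over tokens); the two losses
# are summed. Label counts and equal weighting are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")
nli_head = torch.nn.Linear(768, 3)         # entailment / neutral / contradiction
srl_head = torch.nn.Linear(768, 20)        # hypothetical SRL label inventory

def joint_loss(nli_inputs, nli_labels, srl_inputs, srl_labels):
    h_nli = encoder(**nli_inputs).last_hidden_state[:, 0]   # [CLS] vector
    loss_nli = F.cross_entropy(nli_head(h_nli), nli_labels)
    h_srl = encoder(**srl_inputs).last_hidden_state          # per-token states
    loss_srl = F.cross_entropy(srl_head(h_srl).flatten(0, 1),
                               srl_labels.flatten(), ignore_index=-100)
    return loss_nli + loss_srl             # equal weighting as a simple default
```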

pdf
KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media
Ali Safaya | Moutasem Abdullatif | Deniz Yuret
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we describe our approach to utilizing pre-trained BERT models with Convolutional Neural Networks for sub-task A of the Multilingual Offensive Language Identification shared task (OffensEval 2020), which is a part of SemEval 2020. We show that combining CNN with BERT is better than using BERT on its own, and we emphasize the importance of utilizing pre-trained language models for downstream tasks. Our system ranked 4th with a macro-averaged F1-score of 0.897 in Arabic, 4th with a score of 0.843 in Greek, and 3rd with a score of 0.814 in Turkish. Additionally, we present ArabicBERT, a set of pre-trained transformer language models for Arabic that we share with the community.
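A minimal sketch of the BERT-CNN combination: a 1-D convolution plus max-pooling over BERT's token vectors rather than classifying from the [CLS] vector alone. The filter count, kernel size, and checkpoint are illustrative, not the submitted system's exact setup.

```python
# Sketch of BERT-CNN: a 1-D convolution plus global max-pooling over BERT's
# token vectors, instead of classifying from [CLS] alone. Filter count,
# kernel size, and the checkpoint are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)
conv = torch.nn.Conv1d(768, 100, kernel_size=3, padding=1)
clf = torch.nn.Linear(100, 2)              # offensive vs. not offensive

def predict(texts):
    enc = tok(texts, return_tensors="pt", padding=True, truncation=True)
    h = bert(**enc).last_hidden_state      # (batch, seq_len, 768)
    c = torch.relu(conv(h.transpose(1, 2)))  # (batch, 100, seq_len)
    pooled = c.max(dim=-1).values          # global max pooling over positions
    return clf(pooled).softmax(-1)

print(predict(["example tweet"]).shape)    # torch.Size([1, 2])
```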

2019

pdf
Morphological Analysis Using a Sequence Decoder
Ekin Akyürek | Erenay Dayanık | Deniz Yuret
Transactions of the Association for Computational Linguistics, Volume 7

We introduce Morse, a recurrent encoder-decoder model that produces morphological analyses of each word in a sentence. The encoder turns the relevant information about the word and its context into a fixed size vector representation and the decoder generates the sequence of characters for the lemma followed by a sequence of individual morphological features. We show that generating morphological features individually rather than as a combined tag allows the model to handle rare or unseen tags and to outperform whole-tag models. In addition, generating morphological features as a sequence rather than, for example, an unordered set allows our model to produce an arbitrary number of features that represent multiple inflectional groups in morphologically complex languages. We obtain state-of-the-art results in nine languages of different morphological complexity under low-resource, high-resource, and transfer learning settings. We also introduce TrMor2018, a new high-accuracy Turkish morphology data set. Our Morse implementation and the TrMor2018 data set are available online to support future research: see https://github.com/ai-ku/Morse.jl for a Morse implementation in Julia/Knet (Yuret, 2016) and https://github.com/ai-ku/TrMor2018 for the new Turkish data set.
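The key output-encoding idea (lemma characters followed by individual feature tokens in one target sequence) can be sketched as follows; the tags shown are an illustrative subset of a Turkish analysis.

```python
# Sketch of Morse's target encoding: the analysis of a word is one sequence
# of lemma characters followed by individual feature tokens, so rare feature
# combinations decompose into frequent feature symbols.
def encode_analysis(lemma, feats):
    return list(lemma) + [f"<{f}>" for f in feats]

# Turkish "evlerde" ("in the houses"): lemma "ev" + plural + locative
print(encode_analysis("ev", ["Noun", "A3pl", "Loc"]))
# ['e', 'v', '<Noun>', '<A3pl>', '<Loc>']
```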

pdf
Learning from Implicit Information in Natural Language Instructions for Robotic Manipulations
Ozan Arkan Can | Pedro Zuidberg Dos Martires | Andreas Persson | Julian Gaal | Amy Loutfi | Luc De Raedt | Deniz Yuret | Alessandro Saffiotti
Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP)

Human-robot interaction often occurs in the form of instructions given from a human to a robot. For a robot to successfully follow instructions, a common representation of the world and objects in it should be shared between humans and the robot so that the instructions can be grounded. Achieving this representation can be done via learning, where both the world representation and the language grounding are learned simultaneously. However, in robotics this can be a difficult task due to the cost and scarcity of data. In this paper, we tackle the problem by separately learning the world representation of the robot and the language grounding. While this approach can address the challenges in getting sufficient data, it may give rise to inconsistencies between both learned components. Therefore, we further propose Bayesian learning to resolve such inconsistencies between the natural language grounding and a robot’s world representation by exploiting spatio-relational information that is implicitly present in instructions given by a human. Moreover, we demonstrate the feasibility of our approach on a scenario involving a robotic arm in the physical world.

pdf
KU_ai at MEDIQA 2019: Domain-specific Pre-training and Transfer Learning for Medical NLI
Cemil Cengiz | Ulaş Sert | Deniz Yuret
Proceedings of the 18th BioNLP Workshop and Shared Task

In this paper, we describe our system and results submitted for the Natural Language Inference (NLI) track of the MEDIQA 2019 Shared Task. As KU_ai team, we used BERT as our baseline model and pre-processed the MedNLI dataset to mitigate the negative impact of de-identification artifacts. Moreover, we investigated different pre-training and transfer learning approaches to improve the performance. We show that pre-training the language model on rich biomedical corpora has a significant effect in teaching the model domain-specific language. In addition, training the model on large NLI datasets such as MultiNLI and SNLI helps in learning task-specific reasoning. Finally, we ensembled our highest-performing models, and achieved 84.7% accuracy on the unseen test dataset and ranked 10th out of 17 teams in the official results.

2018

pdf
Tree-Stack LSTM in Transition Based Dependency Parsing
Ömer Kırnap | Erenay Dayanık | Deniz Yuret
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We introduce the tree-stack LSTM to model the state of a transition-based parser with recurrent neural networks. The tree-stack LSTM does not use any parse-tree-based or hand-crafted features, yet performs better than models with these features. We also develop a new set of embeddings from raw features to enhance the performance. The model has four main components: the stack’s σ-LSTM, the buffer’s β-LSTM, the actions’ LSTM, and a tree-RNN. All LSTMs use continuous dense feature vectors (embeddings) as input. The tree-RNN updates these embeddings based on transitions. We show that our model improves performance on low-resource languages compared with its predecessors. We participated in the CoNLL 2018 UD Shared Task as the “KParse” team and ranked 16th in LAS and 15th in the MLAS and BLEX metrics among 27 participants parsing 82 test sets from 57 languages.

pdf
SParse: KUniversity Graph-Based Parsing System for the CoNLL 2018 Shared Task
Berkay Önder | Can Gümeli | Deniz Yuret
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present SParse, our graph-based parsing model submitted to the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018). Our model extends the state-of-the-art biaffine parser (Dozat and Manning, 2016) with a structural meta-learning module, SMeta, that combines local and global label predictions. Our parser has been trained and run on Universal Dependencies datasets (Nivre et al., 2016, 2018) and, in our official submission, scores 87.48% LAS, 78.63% MLAS, 78.69% BLEX, and 81.76% CLAS (Nivre and Fang, 2017) on the Italian-ISDT dataset and 72.78% LAS, 59.10% MLAS, 61.38% BLEX, and 61.72% CLAS on the Japanese-GSD dataset. All other corpora were evaluated after the submission deadline; for these we present our unofficial test results.

2017

pdf
Parsing with Context Embeddings
Ömer Kırnap | Berkay Furkan Önder | Deniz Yuret
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We introduce context embeddings, dense vectors derived from a language model that represent the left/right context of a word instance, and demonstrate that context embeddings significantly improve the accuracy of our transition based parser. Our model consists of a bidirectional LSTM (BiLSTM) based language model that is pre-trained to predict words in plain text, and a multi-layer perceptron (MLP) decision model that uses features from the language model to predict the correct actions for an ArcHybrid transition based parser. We participated in the CoNLL 2017 UD Shared Task as the “Koç University” team and our system was ranked 7th out of 33 systems that parsed 81 treebanks in 49 languages.
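A minimal sketch of the context-embedding idea: the BiLSTM language model's forward state just before a word and backward state just after it summarize its left and right context, and these vectors serve as features for the parser's MLP. Dimensions are illustrative, and the model below is untrained.

```python
# Sketch of context embeddings from a BiLSTM language model: the forward
# state just before position i encodes its left context, and the backward
# state just after i encodes its right context. Sizes are illustrative.
import torch

vocab, d_emb, d_hid = 10_000, 64, 128
emb = torch.nn.Embedding(vocab, d_emb)
lm = torch.nn.LSTM(d_emb, d_hid, bidirectional=True, batch_first=True)

def context_embeddings(token_ids):         # token_ids: (batch, seq_len)
    h, _ = lm(emb(token_ids))              # (batch, seq_len, 2 * d_hid)
    fwd, bwd = h[..., :d_hid], h[..., d_hid:]
    # left context = forward state at i-1; right context = backward state
    # at i+1; zero vectors at the sentence boundaries
    left = torch.cat([torch.zeros_like(fwd[:, :1]), fwd[:, :-1]], dim=1)
    right = torch.cat([bwd[:, 1:], torch.zeros_like(bwd[:, :1])], dim=1)
    return torch.cat([left, right], dim=-1)  # parser MLP features

print(context_embeddings(torch.randint(0, vocab, (2, 7))).shape)  # (2, 7, 256)
```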

2016

pdf
Learning grammatical categories using paradigmatic representations: Substitute words for language acquisition
Mehmet Ali Yatbaz | Volkan Cirik | Aylin Küntay | Deniz Yuret
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Learning syntactic categories is a fundamental task in language acquisition. Previous studies show that co-occurrence patterns of preceding and following words are essential to group words into categories. However, the neighboring words, or frames, are rarely repeated exactly in the data. This creates data sparsity and hampers learning for frame-based models. In this work, we propose a paradigmatic representation of word context which uses probable substitutes instead of frames. Our experiments on child-directed speech show that models based on probable substitutes learn more accurate categories with fewer examples compared to models based on frames.
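A minimal sketch of the paradigmatic representation idea, using a masked language model as a convenient modern stand-in for the paper's language model: each context is described by its most probable substitutes rather than by its literal neighboring words.

```python
# Sketch of a paradigmatic context representation: characterize a position
# by its probable substitutes instead of its literal neighbors. A masked LM
# stands in here for the paper's original language model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def substitutes(left, right, k=5):
    text = f"{left} {fill.tokenizer.mask_token} {right}"
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=k)]

# different frames, overlapping substitute distributions -> same category
print(substitutes("the dog", "on the mat"))
print(substitutes("a cat", "in the box"))
```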

pdf
CharNER: Character-Level Named Entity Recognition
Onur Kuru | Ozan Arkan Can | Deniz Yuret
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We describe and evaluate a character-level tagger for language-independent Named Entity Recognition (NER). Instead of words, a sentence is represented as a sequence of characters. The model consists of stacked bidirectional LSTMs that take characters as input and output tag probabilities for each character. These probabilities are then converted to consistent word-level named entity tags using a Viterbi decoder. We are able to achieve close to state-of-the-art NER performance in seven languages with the same basic model, using only labeled NER data and no hand-engineered features or other external resources like syntactic taggers or gazetteers.
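A minimal sketch of the CharNER shape: a stacked character-level BiLSTM emits per-character tag probabilities, and word tags are read off from them. A simple per-word average stands in here for the paper's Viterbi decoding; the tiny tagset is hypothetical and the model below is untrained.

```python
# Sketch of CharNER: a stacked character-level BiLSTM emits tag
# probabilities per character; word tags are then read off. A per-word
# average replaces the paper's Viterbi decoding in this sketch.
import torch

n_chars, n_tags, d = 128, 9, 64
char_emb = torch.nn.Embedding(n_chars, d)
bilstm = torch.nn.LSTM(d, d, num_layers=2, bidirectional=True, batch_first=True)
out = torch.nn.Linear(2 * d, n_tags)

def tag(sentence, tagset):
    ids = torch.tensor([[min(ord(c), n_chars - 1) for c in sentence]])
    h, _ = bilstm(char_emb(ids))
    probs = out(h).softmax(-1)[0]          # (len(sentence), n_tags)
    tags, start = [], 0
    for word in sentence.split():
        end = start + len(word)
        tags.append(tagset[probs[start:end].mean(0).argmax()])
        start = end + 1                    # skip the following space
    return tags

print(tag("John lives in Ankara", ["O"] * 8 + ["PER"]))
```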

pdf
Transfer Learning for Low-Resource Neural Machine Translation
Barret Zoph | Deniz Yuret | Jonathan May | Kevin Knight
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf
Why Neural Translations are the Right Length
Xing Shi | Kevin Knight | Deniz Yuret
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf
Natural Language Communication with Robots
Yonatan Bisk | Deniz Yuret | Daniel Marcu
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf
Unsupervised Instance-Based Part of Speech Induction Using Probable Substitutes
Deniz Yuret | Mehmet Ali Yatbaz | Enis Sert
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf
Probabilistic Modeling of Joint-context in Distributional Similarity
Oren Melamud | Ido Dagan | Jacob Goldberger | Idan Szpektor | Deniz Yuret
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

2013

pdf bib
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
Suresh Manandhar | Deniz Yuret
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf
AI-KU: Using Substitute Vectors and Co-Occurrence Modeling For Word Sense Induction and Disambiguation
Osman Başkaya | Enis Sert | Volkan Cirik | Deniz Yuret
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf
Learning Syntactic Categories Using Paradigmatic Representations of Word Context
Mehmet Ali Yatbaz | Enis Sert | Deniz Yuret
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
Eneko Agirre | Johan Bos | Mona Diab | Suresh Manandhar | Yuval Marton | Deniz Yuret
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

2011

pdf
Instance Selection for Machine Translation using Feature Decay Algorithms
Ergun Biçici | Deniz Yuret
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf
RegMT System for Machine Translation, System Combination, and Evaluation
Ergun Biçici | Deniz Yuret
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

pdf
Unsupervised Part of Speech Tagging Using Unambiguous Substitutes from a Statistical Language Model
Mehmet Ali Yatbaz | Deniz Yuret
Coling 2010: Posters

pdf
The Noisy Channel Model for Unsupervised Word Sense Disambiguation
Deniz Yuret | Mehmet Ali Yatbaz
Computational Linguistics, Volume 36, Number 1, March 2010

pdf
SemEval-2010 Task 12: Parser Evaluation Using Textual Entailments
Deniz Yuret | Aydin Han | Zehra Turgut
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf
L1 Regularized Regression for Reranking and System Combination in Machine Translation
Ergun Biçici | Deniz Yuret
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

pdf
Modeling Morphologically Rich Languages Using Split Words and Unstructured Dependencies
Deniz Yuret | Ergun Biçici
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

pdf
Smoothing a Tera-word Language Model
Deniz Yuret
Proceedings of ACL-08: HLT, Short Papers

pdf
Discriminative vs. Generative Approaches in Semantic Role Labeling
Deniz Yuret | Mehmet Ali Yatbaz | Ahmet Engin Ural
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

pdf
The CoNLL 2007 Shared Task on Dependency Parsing
Joakim Nivre | Johan Hall | Sandra Kübler | Ryan McDonald | Jens Nilsson | Sebastian Riedel | Deniz Yuret
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf
SemEval-2007 Task 04: Classification of Semantic Relations between Nominals
Roxana Girju | Preslav Nakov | Vivi Nastase | Stan Szpakowicz | Peter Turney | Deniz Yuret
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf
KU: Word Sense Disambiguation by Substitution
Deniz Yuret
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf
Learning Morphological Disambiguation Rules for Turkish
Deniz Yuret | Ferhan Türe
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf
Dependency Parsing as a Classification Problem
Deniz Yuret
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

2004

pdf
Some experiments with a Naive Bayes WSD system
Deniz Yuret
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text