Shuly Wintner

2023

Shared Lexical Items as Triggers of Code Switching
Shuly Wintner | Safaa Shehadi | Yuli Zeira | Doreen Osmelak | Yuval Nov
Transactions of the Association for Computational Linguistics, Volume 11

Why do bilingual speakers code-switch (mix their two languages)? Among the several theories that attempt to explain this natural and ubiquitous phenomenon, the triggering hypothesis relates code-switching to the presence of lexical triggers, specifically cognates and proper names, adjacent to the switch point. We provide a fuller, more nuanced and refined exploration of the triggering hypothesis, based on five large datasets in three language pairs, reflecting both spoken and written bilingual interactions. Our results show that words that are assumed to reside in a mental lexicon shared by both languages indeed trigger code-switching, and that the tendency to switch depends on the distance of the trigger from the switch point and on whether the trigger precedes or succeeds the switch, but not on the etymology of the trigger words. We thus provide strong, robust, evidence-based confirmation of several hypotheses on the relationships between lexical triggers and code-switching.

The Denglisch Corpus of German-English Code-Switching
Doreen Osmelak | Shuly Wintner
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

When multilingual speakers engage in a conversation they inevitably introduce code-switching (CS), i.e., mixing of more than one language between and within utterances. CS is still an understudied phenomenon, especially in the written medium, and relatively few computational resources for studying it are available. We describe a corpus of German-English code-switching in social media interactions. We focus on some challenges in annotating CS, especially due to words whose language ID cannot be easily determined. We introduce a novel schema for such word-level annotation, with which we manually annotated a subset of the corpus. We then trained classifiers to predict and identify switches, and applied them to the remainder of the corpus. Thereby, we created a large-scale corpus of German-English mixed utterances with precise indications of CS points.

2022

Identifying Code-switching in Arabizi
Safaa Shehadi | Shuly Wintner
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.

Predicting the Proficiency Level of Nonnative Hebrew Authors
Isabelle Nguyen | Shuly Wintner
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present classifiers that can accurately predict the proficiency level of nonnative Hebrew learners. This is important for practical (mainly educational) applications, but the endeavor also sheds light on the features that support the classification, thereby improving our understanding of learner language in general, and transfer effects from Arabic, French, and Russian on nonnative Hebrew in particular.

The Hebrew Essay Corpus
Chen Gafni | Anat Prior | Shuly Wintner
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present the Hebrew Essay Corpus: an annotated corpus of Hebrew language argumentative essays authored by prospective higher-education students. The corpus includes both essays by native speakers, written as part of the psychometric exam that is used to assess their future success in academic studies; and essays authored by non-native speakers, with three different native languages, that were written as part of a language aptitude test. The corpus is uniformly encoded and stored. The non-native essays were annotated with target hypotheses whose main goal is to make the texts amenable to automatic processing (morphological and syntactic analysis). The corpus is available for academic purposes upon request. We describe the corpus and the error correction and annotation schemes used in its analysis. In addition to introducing this new resource, we discuss the challenges of identifying and analyzing non-native language use in general, and propose various ways for dealing with these challenges.

Speaker Information Can Guide Models to Better Inductive Biases: A Case Study On Predicting Code-Switching
Alissa Ostapenko | Shuly Wintner | Melinda Fricke | Yulia Tsvetkov
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural language processing (NLP) models trained on people-generated data can be unreliable because, without any constraints, they can learn from spurious correlations that are not relevant to the task. We hypothesize that enriching models with speaker information in a controlled, educated way can guide them to pick up on relevant inductive biases. For the speaker-driven task of predicting code-switching points in English–Spanish bilingual dialogues, we show that adding sociolinguistically-grounded speaker features as prepended prompts significantly improves accuracy. We find that by adding influential phrases to the input, speaker-informed models learn useful and explainable linguistic information. To our knowledge, we are the first to incorporate speaker characteristics in a neural model for code-switching, and more generally, take a step towards developing transparent, personalized models that use speaker information in a controlled way.

2021

Machine Translation into Low-resource Language Varieties
Sachin Kumar | Antonios Anastasopoulos | Shuly Wintner | Yulia Tsvetkov
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

State-of-the-art machine translation (MT) systems are typically trained to generate “standard” target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a general framework to rapidly adapt MT systems to generate language varieties that are close to, but different from, the standard target language, using no parallel (source–variety) data. This also includes adaptation of MT systems to low-resource typologically-related target languages. We experiment with adapting an English–Russian MT system to generate Ukrainian and Belarusian, an English–Norwegian Bokmål system to generate Nynorsk, and an English–Arabic system to generate four Arabic dialects, obtaining significant improvements over competitive baselines.

2019

Automatic Detection of Translation Direction
Ilia Sominsky | Shuly Wintner
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Parallel corpora are crucial resources for NLP applications, most notably for machine translation. The direction of the (human) translation of parallel corpora has been shown to have significant implications for the quality of statistical machine translation systems that are trained with such corpora. We describe a method for determining the direction of the (manual) translation of parallel corpora at the sentence-pair level. Using several linguistically-motivated features, coupled with a neural network model, we obtain high accuracy on several language pairs. Furthermore, we demonstrate that the accuracy is correlated with the (typological) distance between the two languages.

Topics to Avoid: Demoting Latent Confounds in Text Classification
Sachin Kumar | Shuly Wintner | Noah A. Smith | Yulia Tsvetkov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author’s native language is Swedish). We propose a method that represents the latent topical confounds and a model which “unlearns” confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.

2018

Framing and Agenda-setting in Russian News: a Computational Analysis of Intricate Political Strategies
Anjalie Field | Doron Kliger | Shuly Wintner | Jennifer Pan | Dan Jurafsky | Yulia Tsvetkov
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and “fake news”. Here, we draw on two concepts from political science literature to explore subtler strategies for government media manipulation: agenda-setting (selecting what topics to cover) and framing (deciding how topics are covered). We analyze 13 years (100K articles) of the Russian newspaper Izvestia and identify a strategy of distraction: articles mention the U.S. more frequently in the month directly following an economic downturn in Russia. We introduce embedding-based methods for cross-lingually projecting English frames to Russian, and discover that these articles emphasize U.S. moral failings and threats to the U.S. Our work offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.

Native Language Identification with User Generated Content
Gili Goldin | Ella Rabinovich | Shuly Wintner
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We address the task of native language identification in the context of social media content, where authors are highly-fluent, advanced nonnative speakers (of English). Using both linguistically-motivated features and the characteristics of the social media outlet, we obtain high accuracy on this challenging task. We provide a detailed analysis of the features that sheds light on differences between native and nonnative speakers, and among nonnative speakers with different backgrounds.

Native Language Cognate Effects on Second Language Lexical Choice
Ella Rabinovich | Yulia Tsvetkov | Shuly Wintner
Transactions of the Association for Computational Linguistics, Volume 6

We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are able to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages. We quantitatively analyze non-native lexical choice, highlighting cognate facilitation as one of the important phenomena shaping the language of non-native speakers.

2017

Found in Translation: Reconstructing Phylogenetic Language Trees from Translations
Ella Rabinovich | Noam Ordan | Shuly Wintner
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Translation has played an important role in trade, law, commerce, politics, and literature for thousands of years. Translators have always tried to be invisible; ideal translations should look as if they were written originally in the target language. We show that traces of the source language remain in the translation product to the extent that it is possible to uncover the history of the source language by looking only at the translation. Specifically, we automatically reconstruct phylogenetic language trees from monolingual texts (translated from several source languages). The signal of the source language is so powerful that it is retained even after two phases of translation. This strongly indicates that source language interference is the most dominant characteristic of translated texts, overshadowing the more subtle signals of universal properties of translation.

Personalized Machine Translation: Preserving Original Author Traits
Ella Rabinovich | Raj Nath Patel | Shachar Mirkin | Lucia Specia | Shuly Wintner
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that the author’s gender has a powerful, clear signal in original texts, but this signal is obfuscated in human and machine translation. We then propose simple domain-adaptation techniques that help retain the original gender traits in the translation, without harming the quality of the translation, thereby creating more personalized machine translation systems.

2016

Translationese: Between Human and Machine Translation
Shuly Wintner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Tutorial Abstracts

Translated texts, in any language, have unique characteristics that set them apart from texts originally written in the same language. Translation Studies is a research field that focuses on investigating these characteristics. Until recently, research in machine translation (MT) has been entirely divorced from translation studies. The main goal of this tutorial is to introduce some of the findings of translation studies to researchers interested mainly in machine translation, and to demonstrate that awareness of these findings can result in better, more accurate MT systems.

A Corpus of Native, Non-native and Translated Texts
Sergiu Nisioi | Ella Rabinovich | Liviu P. Dinu | Shuly Wintner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe a monolingual English corpus of original and (human) translated texts, with an accurate annotation of speaker properties, including the original language of the utterances and the speaker’s country of origin. We thus obtain three sub-corpora of texts reflecting native English, non-native English, and English translated from a variety of European languages. This dataset will facilitate the investigation of similarities and differences between these kinds of sub-languages. Moreover, it will facilitate a unified comparative study of translations and language produced by (highly fluent) non-native speakers, two closely-related phenomena that have only been studied in isolation so far.

2015

Statistical Machine Translation with Automatic Identification of Translationese
Naama Twitto | Noam Ordan | Shuly Wintner
Proceedings of the Tenth Workshop on Statistical Machine Translation

Unsupervised Identification of Translationese
Ella Rabinovich | Shuly Wintner
Transactions of the Association for Computational Linguistics, Volume 3

Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained on. We show that this is indeed the case, in a variety of evaluation scenarios. We then show that unsupervised classification is highly accurate on this task. We suggest a method for determining the correct labels of the clustering outcomes, and then use the labels for voting, improving the accuracy even further. Moreover, we suggest a simple method for clustering in the challenging case of mixed-domain datasets, in spite of the dominance of domain-related features over translation-related ones. The result is an effective, fully-unsupervised method for distinguishing between original and translated texts that can be applied to new domains with reasonable accuracy.

Hebrew and Arabic are related but mutually incomprehensible languages with complex morphology and scarce parallel corpora. Machine translation between the two languages is therefore interesting and challenging. We discuss similarities and differences between Hebrew and Arabic, the benefits and challenges that they induce, respectively, and their implications for machine translation. We highlight the shortcomings of using English as a pivot language and advocate a direct, transfer-based and linguistically-informed (but still statistical, and hence scalable) approach. We report preliminary results of such a system that we are currently developing.

A General Method for Creating a Bilingual Transliteration Dictionary
Amit Kirschenbaum | Shuly Wintner
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Transliteration is the rendering in one language of terms from another language (and, possibly, another writing system), approximating spelling and/or phonetic equivalents between the two languages. A transliteration dictionary is a crucial resource for a variety of natural language applications, most notably machine translation. We describe a general method for creating bilingual transliteration dictionaries from Wikipedia article titles. The method can be applied to any language pair with Wikipedia presence, independently of the writing systems involved, and requires only a single simple resource that can be provided by any literate bilingual speaker. It was successfully applied to extract a Hebrew-English transliteration dictionary which, when incorporated in a machine translation system, indeed improved its performance.

Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content
Yulia Tsvetkov | Shuly Wintner
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containing manually translated texts. This algorithm was implemented and tested on Hebrew-English parallel texts. With properly selected thresholds, precision of 100% can be obtained.

A Morphologically-Analyzed CHILDES Corpus of Hebrew
Bracha Nir | Brian MacWhinney | Shuly Wintner
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a corpus of transcribed spoken Hebrew that forms an integral part of a comprehensive data system that has been developed to suit the specific needs and interests of child language researchers: CHILDES (Child Language Data Exchange System). We introduce a dedicated transcription scheme for the spoken Hebrew data that is aware both of the phonology and of the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus.

Identifying Multi-word Expressions by Leveraging Morphological and Syntactic Idiosyncrasy
Hassan Al-Haj | Shuly Wintner
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Extraction of Multi-word Expressions from Small Parallel Corpora
Yulia Tsvetkov | Shuly Wintner
Coling 2010: Posters

2009

Lightly Supervised Transliteration for Machine Translation
Amit Kirschenbaum | Shuly Wintner
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages
Mike Rosner | Shuly Wintner
Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages

Last Words: What Science Underlies Natural Language Engineering?
Shuly Wintner
Computational Linguistics, Volume 35, Number 4, December 2009

2008

Identifying Semitic Roots: Machine Learning with Linguistic Constraints
Ezra Daya | Dan Roth | Shuly Wintner
Computational Linguistics, Volume 34, Number 3, September 2008

2007

High-accuracy Annotation and Parsing of CHILDES Transcripts
Kenji Sagae | Eric Davis | Alon Lavie | Brian MacWhinney | Shuly Wintner
Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition

Cross Lingual and Semantic Retrieval for Cultural Heritage Appreciation
Idan Szpektor | Ido Dagan | Alon Lavie | Danny Shacham | Shuly Wintner
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007)

Morphological Disambiguation of Hebrew: A Case Study in Classifier Combination
Danny Shacham | Shuly Wintner
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

11th Conference of the European Chapter of the Association for Computational Linguistics
Diana McCarthy | Shuly Wintner
11th Conference of the European Chapter of the Association for Computational Linguistics

A Computational Lexicon of Contemporary Hebrew
Alon Itai | Shuly Wintner | Shlomo Yona
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Computational lexicons are among the most important resources for natural language processing (NLP). Their importance is even greater in languages with rich morphology, where the lexicon is expected to provide morphological analyzers with enough information to enable them to correctly process intricately inflected forms. We describe the Haifa Lexicon of Contemporary Hebrew, the broadest-coverage publicly available lexicon of Modern Hebrew, currently consisting of over 20,000 entries. While other lexical resources of Modern Hebrew have been developed in the past, this is the first publicly available large-scale lexicon of the language. In addition to supporting morphological processors (analyzers and generators), which was our primary objective, the lexicon is used as a research tool in Hebrew lexicography and lexical semantics. It is open for browsing on the web and several search tools and interfaces were developed which facilitate on-line access to its information. The lexicon is currently used for a variety of NLP applications.

Partially Specified Signatures: A Vehicle for Grammar Modularity
Yael Cohen-Sygal | Shuly Wintner
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Highly Constrained Unification Grammars
Daniel Feinstein | Shuly Wintner
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Finite-State Registered Automata for Non-Concatenative Morphology
Yael Cohen-Sygal | Shuly Wintner
Computational Linguistics, Volume 32, Number 1, March 2006

2005

A Finite-State Morphological Grammar of Hebrew
Shlomo Yona | Shuly Wintner
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages

XFST2FSA: Comparing Two Finite-State Toolboxes
Yael Cohen-Sygal | Shuly Wintner
Proceedings of Workshop on Software

2004

Learning Hebrew Roots: Machine Learning with Linguistic Constraints
Ezra Daya | Dan Roth | Shuly Wintner
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

Rapid prototyping of a transfer-based Hebrew-to-English machine translation system
Alon Lavie | Erik Peterson | Katharina Probst | Shuly Wintner | Yaniv Eytani
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

2003

Finite state technology and its applications to machine translation
Shuly Wintner
Proceedings of Machine Translation Summit IX: Tutorials

Resources for processing Israeli Hebrew
Shuly Wintner | Shlomo Yona
Workshop on Machine Translation for Semitic languages: issues and approaches

We describe work in progress whose main objective is to create a collection of resources and tools for processing Hebrew. These resources include corpora of written texts, some of them annotated in various degrees of detail; tools for collecting, expanding and maintaining corpora; tools for annotation; lexicons, both monolingual and bilingual; a rule-based, linguistically motivated morphological analyzer and generator; and a WordNet for Hebrew. We emphasize the methodological issue of well-defined standards for the resources to be developed. The design of the resources guarantees their reusability, such that the output of one system can naturally be the input to another.

2002

Formal Language Theory for Natural Language Processing
Shuly Wintner
Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics

Guaranteeing Parsing Termination of Unification Grammars
Efrat Jaeger | Nissim Francez | Shuly Wintner
COLING 2002: The 19th International Conference on Computational Linguistics

Squibs and Discussions: A Note on Typing Feature Structures
Shuly Wintner | Anoop Sarkar
Computational Linguistics, Volume 28, Number 3, September 2002

1999

Compositional Semantics for Linguistic Formalisms
Shuly Wintner
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

Towards a linguistically motivated computational grammar for Hebrew
Shuly Wintner
Computational Approaches to Semitic Languages

System Demonstration Natural Language Generation With Abstract Machine
Evgeniy Gabrilovich | Nissim Francez | Shuly Wintner
Natural Language Generation

1995

Parsing with Typed Feature Structures
Shuly Wintner | Nissim Francez
Proceedings of the Fourth International Workshop on Parsing Technologies

In this paper we provide for parsing with respect to grammars expressed in a general TFS-based formalism, a restriction of ALE ([2]). Our motivation being the design of an abstract (WAM-like) machine for the formalism ([14]), we consider parsing as a computational process and use it as an operational semantics to guide the design of the control structures for the abstract machine. We emphasize the notion of abstract typed feature structures (AFSs) that encode the essential information of TFSs and define unification over AFSs rather than over TFSs. We then introduce an explicit construct of multi-rooted feature structures (MRSs) that naturally extend TFSs and use them to represent phrasal signs as well as grammar rules. We also employ abstractions of MRSs and give the mathematical foundations needed for manipulating them. We then present a simple bottom-up chart parser as a model for computation: grammars written in the TFS-based formalism are executed by the parser. Finally, we show that the parser is correct.