C. Anton Rytting

2024

pdf abs
Rapidly Piloting Real-time Linguistic Assistance for Simultaneous Interpreters with Untrained Bilingual Surrogates
Alvin C. Grissom II | Jo Shoemaker | Benjamin Goldman | Ruikang Shi | Craig Stewart | C. Anton Rytting | Leah Findlater | Jordan Boyd-Graber
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Simultaneous interpretation is a cognitively taxing task, and even seasoned professionals benefit from real-time assistance. However, both recruiting professional interpreters and evaluating new assistance techniques are difficult. We present a novel, realistic simultaneous interpretation task that mimics the cognitive load of interpretation with crowdworker surrogates. Our task tests different real-time assistance methods in a Wizard-of-Oz experiment with a large pool of proxy users and compares against professional interpreters. Both professional and proxy participants respond similarly to changes in interpreting conditions, including improvement with two assistance interventions—translation of specific terms and of numbers—compared to a no-assistance control.

2022

Social media has provided a platform for many individuals to easily express themselves naturally and publicly, and researchers have had the opportunity to utilize large quantities of this data to improve author trait analysis techniques and to improve author trait profiling systems. The majority of the work in this area, however, has been narrowly spent on English and other Western European languages, and generally focuses on a single social network at a time, despite the large quantity of data now available across languages and differences that have been found across platforms. This paper introduces RU-ADEPT, a dataset of Russian authors’ personality trait scores–Big Five and Dark Triad, demographic information (e.g. age, gender), with associated corpus of the authors’ cross-contributions to (up to) four different social media platforms–VKontakte (VK), LiveJournal, Blogger, and Moi Mir. We believe this to be the first publicly-available dataset associating demographic and personality trait data with Russian-language social media content, the first paper to describe the collection of Dark Triad scores with texts across multiple Russian-language social media platforms, and to a limited extent, the first publicly-available dataset of personality traits to author content across several different social media sites.

2021

pdf abs
Personality Trait Identification Using the Russian Feature Extraction Toolkit
James R. Hull | Valerie Novak | C. Anton Rytting | Paul Rodrigues | Victor M. Frank | Matthew Swahn
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Feature engineering is an important step in classical NLP pipelines, but machine learning engineers may not be aware of the signals to look for when processing foreign language text. The Russian Feature Extraction Toolkit (RFET) is a collection of feature extraction libraries bundled for ease of use by engineers who do not speak Russian. RFET’s current feature set includes features applicable to social media genres of text and to computational social science tasks. We demonstrate the effectiveness of the tool by using it in a personality trait identification task. We compare the performance of Support Vector Machines (SVMs) trained with and without the features provided by RFET; we also compare it to a SVM with neural embedding features generated by Sentence-BERT.

2018

pdf
Arabic Data Science Toolkit: An API for Arabic Language Feature Extraction
Paul Rodrigues | Valerie Novak | C. Anton Rytting | Julie Yelle | Jennifer Boutz
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf
DECCA Repurposed: Detecting transcription inconsistencies without an orthographic standard
C. Anton Rytting | Julie Yelle
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2014

2012

pdf abs
Typing Race Games as a Method to Create Spelling Error Corpora
Paul Rodrigues | C. Anton Rytting
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a method to elicit spelling error corpora using an online typing race game. After being tested for their native language, English-native participants were instructed to retype stimuli as quickly and as accurately as they could. The participants were informed that the system was keeping a score based on accuracy and speed, and that a high score would result in a position on a public scoreboard. Words were presented on the screen one at a time from a queue, and the queue was advanced by pressing the ENTER key following the stimulus. Responses were recorded and compared to the original stimuli. Responses that differed from the stimuli were considered a typographical or spelling error, and added to an error corpus. Collecting a corpus using a game offers several unique benefits. 1) A game attracts engaged participants, quickly. 2) The web-based delivery reduces the cost and decreases the time and effort of collecting the corpus. 3) Participants have fun. Spelling error corpora have been difficult and expensive to obtain for many languages and this research was performed to fill this gap. In order to evaluate the methodology, we compare our game data against three existing spelling corpora for English.

2010

We describe a new Arabic spelling correction system which is intended for use with electronic dictionary search by learners of Arabic. Unlike other spelling correction systems, this system does not depend on a corpus of attested student errors but on student- and teacher-generated ratings of confusable pairs of phonemes or letters. Separate error modules for keyboard mistypings, phonetic confusions, and dialectal confusions are combined to create a weighted finite-state transducer that calculates the likelihood that an input string could correspond to each citation form in a dictionary of Iraqi Arabic. Results are ranked by the estimated likelihood that a citation form could be misheard, mistyped, or mistranscribed for the input given by the user. To evaluate the system, we developed a noisy-channel model trained on students speech errors and use it to perturb citation forms from a dictionary. We compare our system to a baseline based on Levenshtein distance and find that, when evaluated on single-error queries, our system performs 28% better than the baseline (overall MRR) and is twice as good at returning the correct dictionary form as the top-ranked result. We believe this to be the first spelling correction system designed for a spoken, colloquial dialect of Arabic.