Paul Rodrigues


RU-ADEPT: Russian Anonymized Dataset with Eight Personality Traits
C. Anton Rytting | Valerie Novak | James R. Hull | Victor M. Frank | Paul Rodrigues | Jarrett G. W. Lee | Laurel Miller-Sims
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Social media has provided a platform for many individuals to express themselves naturally and publicly, and researchers have had the opportunity to use large quantities of this data to improve author trait analysis techniques and author trait profiling systems. Most work in this area, however, has focused narrowly on English and other Western European languages, and generally on a single social network at a time, despite the large quantity of data now available across languages and the differences that have been found across platforms. This paper introduces RU-ADEPT, a dataset of Russian authors' personality trait scores (Big Five and Dark Triad) and demographic information (e.g. age, gender), with an associated corpus of the authors' cross-contributions to (up to) four different social media platforms: VKontakte (VK), LiveJournal, Blogger, and Moi Mir. We believe this to be the first publicly available dataset associating demographic and personality trait data with Russian-language social media content, the first paper to describe the collection of Dark Triad scores with texts across multiple Russian-language social media platforms, and, to a limited extent, the first publicly available dataset linking personality traits to author content across several different social media sites.


Personality Trait Identification Using the Russian Feature Extraction Toolkit
James R. Hull | Valerie Novak | C. Anton Rytting | Paul Rodrigues | Victor M. Frank | Matthew Swahn
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Feature engineering is an important step in classical NLP pipelines, but machine learning engineers may not be aware of the signals to look for when processing foreign language text. The Russian Feature Extraction Toolkit (RFET) is a collection of feature extraction libraries bundled for ease of use by engineers who do not speak Russian. RFET's current feature set includes features applicable to social media genres of text and to computational social science tasks. We demonstrate the effectiveness of the tool by using it in a personality trait identification task. We compare the performance of Support Vector Machines (SVMs) trained with and without the features provided by RFET; we also compare it to an SVM with neural embedding features generated by Sentence-BERT.
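To illustrate the kind of surface-level, language-agnostic feature extraction such a toolkit bundles for engineers who do not read the language, the sketch below computes a few simple features that could feed a downstream SVM. The function name and feature choices are illustrative assumptions, not RFET's actual API.

```python
import re

def extract_surface_features(text: str) -> dict:
    """Compute simple surface features of the kind a feature-extraction
    toolkit might feed to an SVM classifier. Illustrative sketch only;
    not RFET's real feature set."""
    tokens = text.split()
    n_tokens = max(len(tokens), 1)
    n_chars = max(len(text), 1)
    return {
        # Average word length, a common stylometric signal.
        "avg_token_length": sum(len(t) for t in tokens) / n_tokens,
        # Punctuation and casing ratios, often informative for social media text.
        "exclamation_ratio": text.count("!") / n_chars,
        "uppercase_ratio": sum(c.isupper() for c in text) / n_chars,
        # Share of Cyrillic characters, useful when scripts are mixed.
        "cyrillic_ratio": len(re.findall(r"[\u0400-\u04FF]", text)) / n_chars,
    }
```

An engineer with no knowledge of Russian can compute such features blindly and let the classifier decide which ones carry signal.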


Arabic Data Science Toolkit: An API for Arabic Language Feature Extraction
Paul Rodrigues | Valerie Novak | C. Anton Rytting | Julie Yelle | Jennifer Boutz
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


ArCADE: An Arabic Corpus of Auditory Dictation Errors
C. Anton Rytting | Paul Rodrigues | Tim Buckwalter | Valerie Novak | Aric Bills | Noah H. Silbert | Mohini Madgavkar
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

The IUCL+ System: Word-Level Language Identification via Extended Markov Models
Levi King | Eric Baucom | Timur Gilmanov | Sandra Kübler | Dan Whyatt | Wolfgang Maier | Paul Rodrigues
Proceedings of the First Workshop on Computational Approaches to Code Switching


Typing Race Games as a Method to Create Spelling Error Corpora
Paul Rodrigues | C. Anton Rytting
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a method to elicit spelling error corpora using an online typing race game. After being screened for their native language, English-native participants were instructed to retype stimuli as quickly and as accurately as they could. The participants were informed that the system kept a score based on accuracy and speed, and that a high score would earn a position on a public scoreboard. Words were presented on the screen one at a time from a queue, and the queue was advanced by pressing the ENTER key after each stimulus. Responses were recorded and compared to the original stimuli; responses that differed from the stimuli were considered typographical or spelling errors and added to an error corpus. Collecting a corpus using a game offers several unique benefits: 1) a game quickly attracts engaged participants; 2) web-based delivery reduces the cost and decreases the time and effort of collecting the corpus; 3) participants have fun. Spelling error corpora have been difficult and expensive to obtain for many languages, and this research was performed to fill that gap. To evaluate the methodology, we compare our game data against three existing spelling corpora for English.
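The core harvesting step described above, pairing each stimulus with the typed response and keeping only mismatches, can be sketched in a few lines. The function below is a minimal illustration under assumed names, not the paper's implementation.

```python
def harvest_errors(stimuli, responses):
    """Pair each stimulus word with the participant's typed response and
    collect the pairs that differ as spelling/typo error instances.
    Minimal sketch of the comparison step, not the paper's code."""
    errors = []
    for stimulus, response in zip(stimuli, responses):
        # Strip trailing whitespace introduced by pressing ENTER
        # to advance the stimulus queue.
        response = response.strip()
        if response != stimulus:
            errors.append((stimulus, response))
    return errors
```

Each (stimulus, response) pair that survives the filter becomes one entry in the error corpus.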

A Random Forest System Combination Approach for Error Detection in Digital Dictionaries
Michael Bloodgood | Peng Ye | Paul Rodrigues | David Zajic | David Doermann
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data


Error Correction for Arabic Dictionary Lookup
C. Anton Rytting | Paul Rodrigues | Tim Buckwalter | David Zajic | Bridget Hirsch | Jeff Carnes | Nathanael Lynn | Sarah Wayland | Chris Taylor | Jason White | Charles Blake III | Evelyn Browne | Corey Miller | Tristan Purvis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a new Arabic spelling correction system which is intended for use with electronic dictionary search by learners of Arabic. Unlike other spelling correction systems, this system does not depend on a corpus of attested student errors but on student- and teacher-generated ratings of confusable pairs of phonemes or letters. Separate error modules for keyboard mistypings, phonetic confusions, and dialectal confusions are combined to create a weighted finite-state transducer that calculates the likelihood that an input string could correspond to each citation form in a dictionary of Iraqi Arabic. Results are ranked by the estimated likelihood that a citation form could be misheard, mistyped, or mistranscribed for the input given by the user. To evaluate the system, we developed a noisy-channel model trained on students' speech errors and used it to perturb citation forms from a dictionary. We compare our system to a baseline based on Levenshtein distance and find that, when evaluated on single-error queries, our system performs 28% better than the baseline (overall MRR) and is twice as good at returning the correct dictionary form as the top-ranked result. We believe this to be the first spelling correction system designed for a spoken, colloquial dialect of Arabic.
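The Levenshtein-distance baseline mentioned above ranks dictionary citation forms by minimal edit distance to the query. A standard dynamic-programming implementation of that baseline (a generic sketch, not the paper's code) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def rank_candidates(query, dictionary):
    """Baseline dictionary lookup: rank citation forms by edit distance
    to the learner's query string."""
    return sorted(dictionary, key=lambda w: levenshtein(query, w))
```

The system described in the paper replaces this uniform-cost edit model with a weighted finite-state transducer whose weights reflect how confusable each pair of sounds or letters actually is.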


On Statistical Parameter Setting
Damir Ćavar | Joshua Herring | Toshikazu Ikuta | Paul Rodrigues | Giancarlo Schrementi
Proceedings of the Workshop on Psycho-Computational Models of Human Language Acquisition