Steve Sloto



2023

Findings of the WMT 2023 Shared Task on Parallel Data Curation
Steve Sloto | Brian Thompson | Huda Khayrallah | Tobias Domhan | Thamme Gowda | Philipp Koehn
Proceedings of the Eighth Conference on Machine Translation

Building upon prior WMT shared tasks in document alignment and sentence filtering, we posed the open-ended shared task of finding the best subset of possible training data from a collection of Estonian-Lithuanian web data. Participants could focus on any portion of the end-to-end data curation pipeline, including alignment and filtering. We evaluated results based on downstream machine translation quality. We release processed Common Crawl data, along with various intermediate states from a strong baseline system, which we believe will enable future research on this topic.

2019

Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data
Amittai Axelrod | Anish Kumar | Steve Sloto
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We introduce a purely monolingual approach to filtering for parallel data from a noisy corpus in a low-resource scenario. Our work is inspired by Junczys-Dowmunt (2018), but we relax the requirements to allow for cases where no parallel data is available. Our primary contribution is a dual monolingual cross-entropy delta criterion, modified from Cynical data selection (Axelrod, 2017), which is competitive (within 1.8 BLEU) with the best bilingual filtering method when used to train SMT systems. Our approach is featherweight, and runs end-to-end on a standard laptop in three hours.

2018

Leveraging Data Resources for Cross-Linguistic Information Retrieval Using Statistical Machine Translation
Steve Sloto | Ann Clifton | Greg Hanneman | Patrick Porter | Donna Gates | Almut Hildebrand | Anish Kumar
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)