Amir Kamran


2020

ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness for creating machine translation systems.

CEF Data Marketplace: Powering a Long-term Supply of Language Data
Amir Kamran | Dace Dzeguze | Jaap van der Meer | Milica Panic | Alessandro Cattelan | Daniele Patrioli | Luisa Bentivogli | Marco Turchi
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

We describe the CEF Data Marketplace project, which focuses on the development of a trading platform for translation data aimed at language professionals: translators, machine translation (MT) developers, language service providers (LSPs), translation buyers and government bodies. The CEF Data Marketplace platform will be designed and built to manage and trade data for all languages and domains. This project will open a continuous and long-term supply of language data for MT and other machine learning applications.

2017

Results of the WMT17 Metrics Shared Task
Ondřej Bojar | Yvette Graham | Amir Kamran
Proceedings of the Second Conference on Machine Translation

2016

Results of the WMT16 Metrics Shared Task
Ondřej Bojar | Yvette Graham | Amir Kamran | Miloš Stanojević
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Results of the WMT16 Tuning Shared Task
Bushra Jawaid | Amir Kamran | Miloš Stanojević | Ondřej Bojar
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Enriching Source for English-to-Urdu Machine Translation
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

This paper focuses on the generation of case markers for free word order languages that use case markers as phrasal clitics for marking the relationship between the dependent noun and its head. The generation of such clitics becomes an essential task, especially when translating from fixed word order languages, where syntactic relations are identified by the positions of the dependent nouns. To address the problem of missing markers on the source side, artificial markers are added to the source to improve alignments with their target counterparts. An increase of up to 1 BLEU point over the baseline is observed on different test sets for English-to-Urdu.

2015

Results of the WMT15 Metrics Shared Task
Miloš Stanojević | Amir Kamran | Philipp Koehn | Ondřej Bojar
Proceedings of the Tenth Workshop on Statistical Machine Translation

Results of the WMT15 Tuning Shared Task
Miloš Stanojević | Amir Kamran | Ondřej Bojar
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

English to Urdu Statistical Machine Translation: Establishing a Baseline
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the Fifth Workshop on South and Southeast Asian Natural Language Processing

A Tagged Corpus and a Tagger for Urdu
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe a release of a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release the tagged corpus. Additionally, we use this data to train a single standalone tagger, which we hope will significantly simplify Urdu processing. The standalone tagger obtains an accuracy of 88.74% on test data.

2012

Probes in a Taxonomy of Factored Phrase-Based Models
Ondřej Bojar | Bushra Jawaid | Amir Kamran
Proceedings of the Seventh Workshop on Statistical Machine Translation

Selecting Data for English-to-Czech Machine Translation
Aleš Tamchyna | Petra Galuščáková | Amir Kamran | Miloš Stanojević | Ondřej Bojar
Proceedings of the Seventh Workshop on Statistical Machine Translation