The Challenges of Distributed Parallel Corpora
Mike O’Malley
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Government MT User Program

Parallel corpora have traditionally been created, maintained and disseminated by translators and analysts addressing specific domains. They grow by aggregation, individual contributions taking residence in the knowledge base. While the provenance of these new terms is known, their validity is not; they must be vetted by domain and language experts in order to be considered for use in the translation process. To address the evolving ecosystem surrounding parallel corpora, developers and analysts need to move beyond the data limitations of the static model. This traditional model does not fully take advantage of the new infiltration and exfiltration datapaths available in today's world of distributed knowledge bases. Incoming data are no longer simply textual: audio, imagery and video are all critical components of corpora utility. Corpora maintainers have access to these media types through a variety of data sources, such as automated media monitoring services, the output of any number of translation environments, and translation memory exchanges (TMXs) developed by domain and language experts. These inputs are often pre-vetted and ready for automated inclusion in the parallel corpora; their content should not be reduced to the strictly textual. Unfortunately, the quality of the automated alignment and segmentation systems used in these pipelines remains a concern for the bulk preprocessing needed by downstream systems. These data sources share a common characteristic: known provenance. Each is typically a vetted source and a regular provider to the parallel corpora, whether via daily newscasts or other means. Other data sources are distributed in nature and thus pose distinct challenges to the collection, vetting and exploitation processes. One of the most exciting of these infiltration paths is crowdsourcing. 
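To make the alignment concern above concrete, the following is a minimal sketch of the kind of automated sentence alignment such systems perform, in the spirit of length-based methods such as Gale and Church (1993). The cost function, the skip penalty of 100.0, and the restriction to 1-1, 1-0, and 0-1 beads are simplifying assumptions for illustration, not a description of any particular production aligner.

```python
def align(src, tgt):
    """Length-based sentence alignment sketch.

    Returns a list of beads (src_index, tgt_index), where None marks a
    sentence with no counterpart. Minimizes a squared length-difference
    cost via dynamic programming.
    """
    SKIP = 100.0  # penalty for an unaligned sentence (assumed value)
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 bead: cost grows with length mismatch
                c = cost[i][j] + (len(src[i]) - len(tgt[j])) ** 2 / 10.0
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = c
                    back[i + 1][j + 1] = (i, j)
            if i < n:  # 1-0 bead: source sentence left unaligned
                c = cost[i][j] + SKIP
                if c < cost[i + 1][j]:
                    cost[i + 1][j] = c
                    back[i + 1][j] = (i, j)
            if j < m:  # 0-1 bead: target sentence left unaligned
                c = cost[i][j] + SKIP
                if c < cost[i][j + 1]:
                    cost[i][j + 1] = c
                    back[i][j + 1] = (i, j)
    beads, i, j = [], n, m  # trace back from (n, m) to recover beads
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            beads.append((pi, pj))
        elif pi == i - 1:
            beads.append((pi, None))
        else:
            beads.append((None, pj))
        i, j = pi, pj
    return beads[::-1]
```

Even this toy version illustrates why alignment quality remains a concern for bulk preprocessing: the cost model sees only sentence lengths, so paraphrase, reordering, and non-textual context are invisible to it.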
A next-generation parallel corpora management system must be capable of, if not automatically incorporating crowdsourced terminology as a vetted source, at least facilitating manual inclusion of vetted crowdsourced terminology. This terminology may be submitted at any scale from practically any source. It may overlap or be contradictory; it almost certainly will require some degree of analysis and evaluation before inclusion. Fortunately, statistical analysis techniques are available to mitigate these concerns. One significant benefit of a crowdsourcing approach is the gain in alignment and segmentation accuracy over similar products offered by the automated systems mentioned above. Given the scalability of crowdsourcing methods, it is certainly a viable framework for bulk alignment and segmentation. Another consideration for the development of distributed parallel corpora systems is their position in the translation workflow. The outputs and exfiltration paths of such a system can be used for purposes as diverse as addition to existing TMXs, refinement of existing MT applications (through either improvement of their learning processes or inclusion of parallel-corpora-generated domain-specific lexicons), creation of sentence pairs and other products for language learning systems (LLSs), and support for exemplar language clips such as those developed by the State Department.
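As one example of the statistical analysis techniques alluded to above, overlapping or contradictory crowdsourced submissions can be reconciled by simple agreement scoring before a human expert reviews the remainder. The function name, the vote threshold, and the 0.6 agreement ratio below are illustrative assumptions, not parameters of any described system.

```python
from collections import Counter, defaultdict

def vet_submissions(submissions, min_votes=3, min_agreement=0.6):
    """Agreement-based vetting of crowdsourced term pairs (sketch).

    submissions: iterable of (source_term, proposed_translation) pairs.
    Returns {source_term: translation} for terms whose most frequent
    translation clears both thresholds; everything else is left for
    manual review by domain and language experts.
    """
    by_term = defaultdict(Counter)
    for src, tgt in submissions:
        by_term[src][tgt] += 1
    vetted = {}
    for src, counts in by_term.items():
        total = sum(counts.values())
        best, votes = counts.most_common(1)[0]
        # Accept only well-supported, high-agreement translations.
        if total >= min_votes and votes / total >= min_agreement:
            vetted[src] = best
    return vetted
```

For example, three submissions for "gato" split 2-1 between "cat" and "feline" would vet "cat", while a term with only two conflicting submissions would fall through to manual review, matching the workflow of vetted automatic inclusion alongside facilitated manual inclusion.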