Sean Colbath


Portable Speech-to-Speech Translation on an Android Smartphone: The MFLTS System
Ralf Meermeier | Sean Colbath | Martha Lillie
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)


Language and Translation Challenges in Social Media
Sean Colbath
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Government MT User Program

The explosive growth of social media has led to a wide range of new challenges for machine translation and language processing. The language used in social media occupies a new space between structured and unstructured media, formal and informal language, and dialect and standard usage. Yet these new platforms have given a digital voice to millions of users, offering them the opportunity to communicate on the first truly global stage – the Internet. Social media covers a broad category of communications formats, ranging from threaded conversations on Facebook to microblog and short message content on platforms like Twitter and Weibo – but it also includes user-generated comments on YouTube, as well as the contents of the video itself, and even includes ‘traditional’ blogs and forums. The common thread linking all of these is that the media is generated by, and targeted at, individuals. This talk will survey some of the most popular social media platforms, and identify key challenges in translating the content found in them – including dialect, code switching, mixed encodings, the use of “internet speak”, and platform-specific language phenomena, as well as volume and genre. In addition, we will talk about some of the challenges in analyzing social media from an operational point of view, and how language and translation issues influence higher-level analytic processes such as entity extraction, topic classification and clustering, geo-spatial analysis and other technologies that enable comprehension of social media. These latter capabilities are being adapted for social media analytics for US Government analysts under the support of the Technical Support Working Group at the US DoD, enabling translingual comprehension of this style of content in an operational environment.


Terminology Management for Web Monitoring
Sean Colbath
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Government MT User Program

The current state of the art in speech recognition, machine translation, and natural language processing (NLP) technologies has allowed the development of powerful media monitoring systems that provide today's analysts with automatic tools for ingesting and searching through different types of data, such as broadcast video, web pages, documents, and scanned images. However, the core human-language technologies (HLT) in these media monitoring systems are static learners, which means that they learn from a pool of labeled data and apply the induced knowledge to operational data in the field. To enable successful and widespread deployment and adoption of HLT, these technologies need to be able to adapt effectively to new operational domains on demand. To provide the US Government analyst with dynamic tools that adapt to these changing domains, these HLT systems must support customizable lexicons. However, the lexicon customization capability in HLT systems presents another unique challenge, especially in the context of multiple users of typical media monitoring system installations in the field. Lexicon customization requests from multiple users can be quite extensive, and may conflict in orthographic representation (spelling, transliteration, or stylistic consistency) or in overall meaning. To protect against spurious and inconsistent updates to the system, the media monitoring systems need to support a central terminology management capability to collect, manage, and execute customization requests across multiple users of the system. In this talk, we will describe the integration of a user-driven lexicon/dictionary customization and terminology management capability in the context of the Raytheon BBN Web Monitoring System (WMS) to allow intelligence analysts to update the Machine Translation (MT) system in the WMS with domain- and mission-specific source-to-English phrase translation rules.
The Language Learning Broker (LLB) tool from the Technology Development Group (TDG) is a distributed system that supports dictionary/terminology management, personalized dictionaries, and a workflow between linguists and linguist management. LLB is integrated with the WMS to provide a terminology management capability for users to submit, review, validate, and manage customizations of the MT system through the WMS User Interface (UI). We will also describe an ongoing controlled experiment, conducted with the help of intelligence analysts, to measure the effectiveness of this user-driven customization capability in terms of increased translation utility.


TAP-XL: An Automated Analyst’s Assistant
Sean Colbath | Francis Kubala
Companion Volume of the Proceedings of HLT-NAACL 2003 - Demonstrations