Benjamin K. Tsou

Also published as: B. K. T’sou, Benjamin K Tsou, Benjamin K. T’sou, Benjamin K.Y. Tsou, Benjamin Tsou


2023

Post-editing of Technical Terms based on Bilingual Example Sentences
Elsie K. Y. Chan | John Lee | Chester Cheng | Benjamin Tsou
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

As technical fields become ever more specialized, and with the continuous emergence of novel technical terms, it may not always be possible to engage bilingual experts in the field to perform translation. This paper investigates the performance of bilingual non-experts in Computer-Assisted Translation. The translators were asked to identify and correct errors in MT output of technical terms in patent materials, aided only by example bilingual sentences. Targeting English-to-Chinese translation, we automatically extract the example sentences from a bilingual corpus of English and Chinese patents. We identify the most frequent translation candidates of a term, and then select the most relevant example sentences for each candidate according to semantic similarity. Even when given only two example sentences for each translation candidate, the non-expert translators were able to post-edit effectively, correcting 67.2% of the MT errors while mistakenly revising correct MT output in only 17% of the cases.

Comparing Chinese-English MT Performance Involving ChatGPT and MT Providers and the Efficacy of AI mediated Post-Editing
Larry Cady | Benjamin Tsou | John Lee
Proceedings of Machine Translation Summit XIX, Vol. 2: Users Track

The recent introduction of ChatGPT has caused quite a stir in the translation industry because of its impressive translation performance against leaders in the industry. We review some major issues based on the BLEU comparisons of Chinese-to-English (C2E) and English-to-Chinese (E2C) machine translation (MT) performance by ChatGPT against a range of leading MT providers in mostly technical domains. Based on sample aligned sentences from a sizable bilingual Chinese-English patent corpus and other sources, we find that while ChatGPT generally performs better, it does not consistently perform better than others in all areas or cases. We also draw on novice translators as post-editors to explore a major component in MT post-editing: optimization of terminology. Many new technical words, including MWEs (Multi-Word Expressions), are problematic because they involve terminological developments which must balance proper encapsulation of technical innovation against conformity to past traditions. Drawing on the above-mentioned corpus, we have been developing an AI-mediated MT post-editing (MTPE) system through the optimization of precedent rendition distribution and semantic association to enhance the work of translators and MTPE practitioners.

2020

Using Bilingual Patents for Translation Training
John Lee | Benjamin Tsou | Tianyuan Cai
Proceedings of the 28th International Conference on Computational Linguistics

While bilingual corpora have been instrumental for machine translation, their utility for training translators has been less explored. We investigate the use of bilingual corpora as pedagogical tools for translation in the technical domain. In a user study, novice translators revised Chinese translations of English patents through bilingual concordancing. Results show that concordancing with an in-domain bilingual corpus can yield greater improvement in translation quality of technical terms than a general-domain bilingual corpus.

A corpus-based comparative study of light verbs in three Chinese speech communities
Benjamin K Tsou | Ka-Fai Yip
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Bilingual Multi-word Expressions, Multiple-correspondence, and their cultivation from parallel patents: The Chinese-English case
Benjamin K. Tsou | Ka Po Chow | John Lee | Ka-Fai Yip | Yaxuan Ji | Kevin Wu
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

2019

Difficulty-aware Distractor Generation for Gap-Fill Items
Chak Yan Yeung | John Lee | Benjamin Tsou
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

Towards a Proactive MWE Terminological Platform for Cross-Lingual Mediation in the Age of Big Data
Benjamin K. Tsou | Kapo Chow | Junru Nie | Yuan Yuan
Proceedings of the Human-Informed Translation and Interpreting Technology Workshop (HiT-IT 2019)

The emergence of China as a global economic power in the 21st century has brought about surging needs for cross-lingual and cross-cultural mediation, typically performed by translators. Advances in Artificial Intelligence and Language Engineering have been bolstered by Machine Learning and suitable Big Data cultivation. They have helped to meet some of the translator’s needs, though the technical specialists have not kept pace with the practical and expanding requirements in language mediation. One major technical and linguistic hurdle involves words outside the vocabulary of the translator or the lexical database he/she consults, especially Multi-Word Expressions (Compound Words) in technical subjects. A further problem is the multiplicity of renditions of a term in the target language. This paper discusses a proactive approach following the successful extraction and application of sizable bilingual Multi-Word Expressions (Compound Words) for language mediation in technical subjects, which do not fall within the expertise of typical translators, who have inadequate appreciation of the range of new technical tools available to help them. Our approach draws on the personal reflections of translators and teachers of translation and is based on prior R&D efforts relating to 300,000 comparable Chinese-English patents. The subsequent protocol we have developed aims to be proactive in meeting four identified practical challenges in technical translation (e.g. patents). It has broader economic implications in the Age of Big Data (Tsou et al., 2015) and Trade War, as the workload, if not the challenges, increasingly cannot be met by currently available front-line translators. We shall demonstrate how new tools can be harnessed to spearhead the application of language technology not only in language mediation but also in the “teaching” and “learning” of translation. It shows how a better appreciation of their needs may enhance the contributions of the technical specialists, and thus enhance the resultant synergetic benefits.

2015

Augmented Comparative Corpora and Monitoring Corpus in Chinese: LIVAC and Sketch Search Engine Compared
Benjamin K. Tsou
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

2012

Idiomaticity and Classical Traditions in Some East Asian Languages
Benjamin K Tsou
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2011

The Cultivation of a Chinese-English-Japanese Trilingual Parallel Corpus from Comparable Patents
Bin Lu | Ka Po Chow | Benjamin K. Tsou
Proceedings of Machine Translation Summit XIII: Papers

Machine translation between uncommon language pairs via a third common language: the case of patents
Benjamin K. Tsou | Bin Lu
Proceedings of Translating and the Computer 33

Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
Bin Lu | Chenhao Tan | Claire Cardie | Benjamin K. Tsou
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Mining Large-scale Parallel Corpora from Multilingual Patents: An English-Chinese example and its application to SMT
Bin Lu | Benjamin K. Tsou | Tao Jiang | Oi Yee Kwong | Jingbo Zhu
CIPS-SIGHAN Joint Conference on Chinese Language Processing

A Note on Pseudo-comparatives like “John is rich like X!” and “Like X, John is rich!”
Benjamin Tsou
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

CityU-DAC: Disambiguating Sentiment-Ambiguous Adjectives within Context
Bin Lu | Benjamin K. Tsou
Proceedings of the 5th International Workshop on Semantic Evaluation

2009

The Construction of a Chinese-English Patent Parallel Corpus
Bin Lu | Benjamin K. Tsou | Jingbo Zhu | Tao Jiang | Oi Yee Kwong
Proceedings of the Third Workshop on Patent Translation

Towards Bilingual Term Extraction in Comparable Patents
Bin Lu | Benjamin K. Tsou
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

2008

Extending a Thesaurus with Words from Pan-Chinese Sources
Oi Yee Kwong | Benjamin K. Tsou
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Active Learning with Sampling by Uncertainty and Density for Word Sense Disambiguation and Text Classification
Jingbo Zhu | Huizhen Wang | Tianshun Yao | Benjamin K Tsou
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

Extending a Thesaurus in the Pan-Chinese Context
Oi Yee Kwong | Benjamin K. Tsou
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Court Stenography-To-Text (“STT”) in Hong Kong: A Jurilinguistic Engineering Effort
Benjamin K. Tsou | Tom B.Y. Lai | K.K. Sin | Lawrence Y.L. Cheung
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Implementation of legal bilingualism in Hong Kong after 1997 has necessitated the production of voluminous and extensive court proceedings and judgments in both Chinese and English. For the former, Cantonese, a dialect of Chinese, is the home language of more than 90% of the population in Hong Kong and is therefore used in the courts. To record speech in Cantonese verbatim, a Chinese Computer-Aided Transcription system has been developed. The transcription system converts stenographic codes into Chinese text, i.e. from phonetic to orthographic representation of the language. The main challenge lies in the resolution of the severe ambiguity resulting from homocode problems in the conversion process. Cantonese Chinese is typified by problematic homonymy, which presents serious challenges. The N-gram statistical model is employed to estimate the most probable character string of the input transcription codes. Domain-specific corpora have been compiled to support the statistical computation. To improve accuracy, scalable techniques such as domain-specific transcription and special encoding are used. Put together, these techniques deliver 96% transcription accuracy.

Toward a Pan-Chinese Thesaurus
Benjamin K. Tsou | Oi Yee Kwong
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we propose a corpus-based approach to the construction of a Pan-Chinese lexical resource, starting out with the aim to enrich existing Chinese thesauri in the Pan-Chinese context. The resulting thesaurus is thus expected to contain not only the core senses and usages of Chinese lexical items but also usages specific to individual Chinese speech communities. We introduce the ideas behind the construction of the resource, outline the steps to be taken, and discuss some preliminary analyses. The work is backed up by a unique and large Chinese synchronous corpus containing textual data from various Chinese speech communities including Hong Kong, Beijing, Taipei and Singapore.

Regional Variation of Domain-Specific Lexical Items: Toward a Pan-Chinese Lexical Resource
Oi Yee Kwong | Benjamin K. Tsou
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

2005

Data Homogeneity and Semantic Role Tagging in Chinese
Oi Yee Kwong | Benjamin K. Tsou
Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition

A Synchronous Corpus-Based Study on the Usage and Perception of Judgement Terms in the Pan-Chinese Context
Oi Yee Kwong | Benjamin K. Tsou
International Journal of Computational Linguistics & Chinese Language Processing, Volume 10, Number 4, December 2005: Special Issue on Selected Papers from CLSW-5

Using Multiple Discriminant Analysis Approach for Linear Text Segmentation
Jingbo Zhu | Na Ye | Xinzhi Chang | Wenliang Chen | Benjamin K Tsou
Second International Joint Conference on Natural Language Processing: Full Papers

Semantic Role Tagging for Chinese at the Lexical Level
Oi Yee Kwong | Benjamin K. Tsou
Second International Joint Conference on Natural Language Processing: Full Papers

2004

Morpheme-based Derivation of Bipolar Semantic Orientation of Chinese Words
Raymond W.M. Yuen | Terence Y.W. Chan | Tom B.Y. Lai | O.Y. Kwong | Benjamin K.Y. Tsou
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

Categorial Fluidity in Chinese and its Implications for Part-of-speech Tagging
Oi Yee Kwong | Benjamin K. Tsou
10th Conference of the European Chapter of the Association for Computational Linguistics

A Synchronous Corpus-Based Study of Verb-Noun Fluidity in Chinese
Oi Yee Kwong | Benjamin K. Tsou
Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation

2002

Alignment and Extraction of Bilingual Legal Terminology from Context Profiles
Oi Yee Kwong | Benjamin K. Tsou | Tom B.Y. Lai | Robert W.P. Luk | Lawrence Y.L. Cheung | Francis C.Y. Chik
COLING-02: COMPUTERM 2002: Second International Workshop on Computational Terminology

Some Considerations on Guidelines for Bilingual Alignment and Terminology Extraction
Lawrence Cheung | Tom Lai | Robert Luk | Oi Yee Kwong | King Kui Sin | Benjamin K. Tsou
COLING-02: The First SIGHAN Workshop on Chinese Language Processing

Covering Ambiguity Resolution in Chinese Word Segmentation Based on Contextual Information
Xiao Luo | Maosong Sun | Benjamin K. Tsou
COLING 2002: The 19th International Conference on Computational Linguistics

2001

Identification of Chinese Personal Names in Unrestricted Texts
Lawrence Cheung | Benjamin K. Tsou | Maosong Sun
Proceedings of the 16th Pacific Asia Conference on Language, Information and Computation

Evaluating Chinese-English translation systems for personal name coverage
Benjamin K. Tsou | Oi Yee Kwong
Workshop on MT2010: Towards a Road Map for MT

This paper discusses the challenges which Chinese-English machine translation (MT) systems face in translating personal names. We show that the translation of names between Chinese and English is complicated by different factors, including orthographic, phonetic, geographic and social ones. Four existing systems were tested for their capability in translating personal names from Chinese to English. Test data embodying geographic and sociolinguistic differences were obtained from a synchronous Chinese corpus of news media texts. It is obvious that systems vary considerably in their ability to identify personal names in the source language and render them properly in the target language. Given the criticality of personal name translation to the overall intelligibility of a translated text, the coverage of personal names should be one of the important criteria in the evaluation of MT performance. Moreover, name translation, which calls for a hybrid approach, will remain a central issue for the future development of MT systems, especially for online and real-time applications.

Proceedings of the 15th Pacific Asia Conference on Language, Information and Computation
Benjamin K. T’sou | Olivia O.Y. Kwong | Tom B.Y. Lai
Proceedings of the 15th Pacific Asia Conference on Language, Information and Computation

2000

Jurilinguistic Engineering in Cantonese Chinese: An N-gram-based Speech to Text Transcription System
B. K. T’sou | K. K. Sin | S. W. K. Chan | T. B. Y. Lai | C Lun | K. T. Ko | G. K. K. Chan | L. Y. L. Cheung
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

Mining Discourse Markers for Chinese Textual Summarization
Samuel W. K. Chan | Tom B. Y. Lai | W. J. Gao | Benjamin K. T’sou
NAACL-ANLP 2000 Workshop: Automatic Summarization

Enhancement of a Chinese Discourse Marker Tagger with C4.5
Benjamin K. T’sou | Tom B.Y Lai | Samuel W.K. Chan | Weijun Gao | Xuegang Zhan
Second Chinese Language Processing Workshop

Textual Information Segmentation by Cohesive Ties
Samuel W.K. Chan | Benjamin K. T’sou | C.F. Choy
Proceedings of the 14th Pacific Asia Conference on Language, Information and Computation

Automatic Conversion from Phonetic to Textual Representation of Cantonese: The Case of Hong Kong Court Proceedings
Benjamin K. Tsou | K.K. Sin | Samuel W. K. Chan | Tom B. Y. Lai | Caesar Lun | K. T. Ko | Gary K. K. Chan | Lawrence Y. L. Cheung
Proceedings of the 14th Pacific Asia Conference on Language, Information and Computation

1999

Anaphora Resolution as Lexical Cohesion Identification
Samuel W.K. Chan | Benjamin K. T’sou
Proceedings of the 13th Pacific Asia Conference on Language, Information and Computation

MT evaluation
Margaret King | Eduard Hovy | Benjamin K. Tsou | John White | Yusoff Zaharin
Proceedings of Machine Translation Summit VII

This panel deals with the general topic of evaluation of machine translation systems. The first contribution sets out some recent work on creating standards for the design of evaluations. The second, by Eduard Hovy, takes up the particular issue of how metrics can be differentiated and systematized. Benjamin K. T’sou suggests that whilst men may evaluate machines, machines may also evaluate men. John S. White focuses on the question of the role of the user in evaluation design, and Yusoff Zaharin points out that circumstances and settings may have a major influence on evaluation design.

1998

Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
Maosong Sun | Dayang Shen | Benjamin K. Tsou
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
Maosong Sun | Dayang Shen | Benjamin K. Tsou
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

Human Judgment as a Basis for Evaluation of Discourse-Connective-Based Full-Text Abstraction in Chinese
Benjamin K. T’sou | Hing-Lung Lin | Tom B. Y. Lai | Samuel W. K. Chan
International Journal of Computational Linguistics & Chinese Language Processing, Volume 3, Number 1, February 1998: Special Issue on the 10th Research on Computational Linguistics International Conference

1997

Human Judgment as a Basis for Evaluation of Discourse-Connective-based Full-text Abstraction in Chinese
Benjamin K. T’sou | Hing-Lung Lin | Tom B. Y. Lai
Proceedings of the 10th Research on Computational Linguistics International Conference

Chinese Word Segmentation and Part-of-Speech Tagging in One Step
Tom B.Y. Lai | Maosong Sun | Benjamin K. T’sou | S. Caesar Lun
ROCLING 1997 Poster Papers

A Synchronous Chinese Language Corpus from Different Speech Communities: Construction and Applications
Benjamin K. T’sou | Hing-Lung Lin | Godfrey Liu | Terence Chan | Jerome Hu | Ching-hai Chew | John K.P Tse
International Journal of Computational Linguistics & Chinese Language Processing, Volume 2, Number 1, February 1997: Special Issue on Computational Resources for Research in Chinese Linguistics

1995

Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation
Benjamin K. T’sou | Tom B. Y. Lai
Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation

Ambiguity Resolution in Chinese Word Segmentation
Maosong Sun | Benjamin K. T’sou
Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation

1992

A Knowledge-based Machine-aided System for Chinese Text Abstraction
Benjamin K. Tsou | Hing-cheung Ho | Tom Bong-yeung Lai | Caesar Suen Lun | Hing-lung Lin
COLING 1992 Volume 3: The 14th International Conference on Computational Linguistics

1991

Automatic Chinese Text Generation Based On Inference Trees
Hing-Lung Lin | Benjamin K. T’sou | Hing-Cheung Ho | Bong-Yeung Lai | Suen Caesar Lun | Chi-Yuen Choi | Chun-yu Kit
Proceedings of Rocling IV Computational Linguistics Conference IV