Narrowing the Gap Between Termbases and Corpora in Commercial Environments

Kara Warburton


Abstract
Terminological resources offer potential to support applications beyond translation, such as controlled authoring and indexing, which are increasingly of interest to commercial enterprises. The ad-hoc semasiological approach adopted by commercial terminographers diverges considerably from methodologies prescribed by conventional theory. The notion of termhood in such production-oriented environments is driven by pragmatic criteria such as frequency and repurposability of the terminological unit. A high degree of correspondence between the commercial corpus and the termbase is desired. Research carried out at the City University of Hong Kong using four IT companies as case studies revealed a large gap between corpora and termbases. Problems in selecting terms and in encoding them properly in termbases account for a significant portion of this gap. A rigorous corpus-based approach to term selection would significantly reduce this gap and improve the effectiveness of commercial termbases. In particular, single-word terms (keywords) identified by comparison to a reference corpus offer great potential for identifying important multi-word terms in this context. We conclude that terminography for production purposes should be more corpus-based than is currently the norm.
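The keyword-based selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which keyness statistic or tooling was used, so the log-likelihood measure below, the function names (`log_likelihood`, `keywords`, `multiword_candidates`), and the bigram-only treatment of multi-word terms are all assumptions made for illustration, not the paper's method.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness (assumed statistic, not named in the abstract).

    a, b: frequency of the word in the target and reference corpus.
    c, d: total token counts of the target and reference corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(target_tokens, reference_tokens, top_n=10):
    """Rank single words that are relatively more frequent in the target corpus."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c = sum(target.values())
    d = sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only positively key words
            scored.append((word, log_likelihood(a, b, c, d)))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


def multiword_candidates(target_tokens, keyword_set):
    """Count bigrams that contain at least one keyword (hypothetical heuristic)."""
    bigrams = Counter(zip(target_tokens, target_tokens[1:]))
    return Counter({
        bigram: count
        for bigram, count in bigrams.items()
        if bigram[0] in keyword_set or bigram[1] in keyword_set
    })
```

In use, the tokenized commercial corpus would be passed as `target_tokens` and a general-language corpus as `reference_tokens`; the resulting keywords then seed the search for multi-word term candidates, which a terminographer would still need to review before adding them to the termbase.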
Anthology ID:
L14-1394
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
722–727
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/466_Paper.pdf
Cite (ACL):
Kara Warburton. 2014. Narrowing the Gap Between Termbases and Corpora in Commercial Environments. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 722–727, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Narrowing the Gap Between Termbases and Corpora in Commercial Environments (Warburton, LREC 2014)