Large Scale Lexical Analysis

Gregor Thurmair, Vera Aleksić, Christoph Schwarz


Abstract
This paper presents a lexical analysis component as implemented in the PANACEA project. The goal is to automatically extract lexicon entries from crawled corpora, in an attempt to use corpus-based methods for high-quality linguistic text processing, and to focus on the quality of data without neglecting quantitative aspects. Lexical analysis must assign linguistic information (such as part of speech, inflectional class, gender, subcategorisation frame, and semantic properties) to all parts of the input text. If tokens are ambiguous, lexical analysis must provide all possible sets of annotations for later (syntactic) disambiguation, be it tagging or full parsing. The paper presents an approach for assigning part-of-speech tags for German and English to large input corpora (more than 50 million tokens), providing a workflow which takes crawled corpora as input and produces POS-tagged lemmata ready for lexicon integration. Tools include sentence splitting, lexicon lookup, decomposition, and POS defaulting. Evaluation shows that the overall error rate can be brought down to about 2% if language resources are properly designed. The complete workflow is implemented as a sequence of web services integrated into the PANACEA platform.
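The four tools named in the abstract (sentence splitting, lexicon lookup, decomposition, POS defaulting) can be illustrated with a minimal Python sketch. This is not the PANACEA implementation: the toy lexicon, the suffix rules, the tag names, and the function names `split_sentences`, `decompose`, and `analyse` are all hypothetical, chosen only to show how ambiguous tokens keep every reading and how unknown tokens fall through to decomposition and then to a default.

```python
import re

# Toy lexicon (hypothetical entries): lowercased form -> set of (lemma, POS) readings.
LEXICON = {
    "the": {("the", "DET")},
    "walks": {("walk", "NOUN"), ("walk", "VERB")},
    "haus": {("Haus", "NOUN")},
    "tür": {("Tür", "NOUN")},
}

# Hypothetical suffix rules used to default the POS of unknown tokens.
SUFFIX_DEFAULTS = [("ly", "ADV"), ("ing", "VERB"), ("ung", "NOUN")]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def decompose(token):
    """Split an unknown token into two known lexicon parts, if possible."""
    for i in range(2, len(token) - 1):
        head, tail = token[:i], token[i:]
        if head in LEXICON and tail in LEXICON:
            return head, tail
    return None


def analyse(token):
    """Return all candidate (lemma, POS) readings for one token."""
    key = token.lower()
    if key in LEXICON:
        # Keep every reading; disambiguation is left to a later step.
        return set(LEXICON[key])
    parts = decompose(key)
    if parts:
        # Assumption: a compound takes its POS from its final part.
        return {(token, pos) for _, pos in LEXICON[parts[1]]}
    for suffix, pos in SUFFIX_DEFAULTS:
        if key.endswith(suffix):
            return {(token, pos)}
    # Last-resort default (an assumption of this sketch).
    return {(token, "NOUN")}
```

In this sketch, `analyse("walks")` returns both the noun and the verb reading, while `analyse("Haustür")` is resolved through decomposition into `haus` + `tür` and tagged as a noun.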
Anthology ID:
L12-1268
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2849–2855
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/493_Paper.pdf
Cite (ACL):
Gregor Thurmair, Vera Aleksić, and Christoph Schwarz. 2012. Large Scale Lexical Analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2849–2855, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Large Scale Lexical Analysis (Thurmair et al., LREC 2012)