Adam Kilgarriff

2015

2014

Hindi Word Sketches
Anil Krishna Eragani | Varun Kuchib Hotla | Dipti Misra Sharma | Siva Reddy | Adam Kilgarriff
Proceedings of the 11th International Conference on Natural Language Processing

Terminology finding in the Sketch Engine: an evaluation
Adam Kilgarriff
Proceedings of Translating and the Computer 36

Extrinsic Corpus Evaluation with a Collocation Dictionary Task
Adam Kilgarriff | Pavel Rychlý | Miloš Jakubíček | Vojtěch Kovář | Vít Baisa | Lucia Kocincová
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The NLP researcher or application-builder often wonders “what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which aim for good coverage of ‘general language’. The task we set is automatic creation of a publication-quality collocations dictionary. For a sample of 100 headwords of Czech and 100 of English, we identify a gold standard dataset of (ideally) all the collocations that should appear for these headwords in such a dictionary. The datasets are being made available alongside this paper. We then use them to determine precision and recall for a range of corpora, with a range of parameters.

Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora
Irina Temnikova | William A. Baumgartner Jr. | Negacy D. Hailu | Ivelina Nikolova | Tony McEnery | Adam Kilgarriff | Galia Angelova | K. Bretonnel Cohen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Sublanguages are varieties of language that form subsets of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.

Finding Terms in Corpora for Many Languages with the Sketch Engine
Miloš Jakubíček | Adam Kilgarriff | Vojtěch Kovář | Pavel Rychlý | Vít Suchomel
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

Terminology finding in the Sketch engine
Adam Kilgarriff
Proceedings of Translating and the Computer 35

2012

Word Sketches for Turkish
Bharat Ram Ambati | Siva Reddy | Adam Kilgarriff
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Word sketches are one-page, automatic, corpus-based summaries of a word's grammatical and collocational behaviour. In this paper we present word sketches for Turkish. Until now, word sketches have been generated using purpose-built finite-state grammars. Here, we use an existing dependency parser. We describe the process of collecting a 42 million word corpus, parsing it, and generating word sketches from it. We evaluate the word sketches in comparison with word sketches from a language-independent sketch grammar on an external evaluation task called topic coherence, using Turkish WordNet to derive an evaluation set of coherent topics.

2011

Helping Our Own: The HOO 2011 Pilot Shared Task
Robert Dale | Adam Kilgarriff
Proceedings of the 13th European Workshop on Natural Language Generation

2010

Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Adam Kilgarriff | Dekang Lin
Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop

Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task
Robert Dale | Adam Kilgarriff
Proceedings of the 6th International Natural Language Generation Conference

A Corpus Factory for Many Languages
Adam Kilgarriff | Siva Reddy | Jan Pomikálek | Avinesh PVS
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

For many languages there are no large, general-language corpora available. Until the web, all but the institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a corpus factory where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. We use the BootCaT method: we take a set of 'seed words' for the language from Wikipedia. Then, several hundred times over, we
* randomly select three or four of the seed words
* send as a query to Google or Yahoo or Bing, which returns a 'search hits' page
* gather the pages that Google or Yahoo point to and save the text.
This forms the corpus, which we then
* 'clean' (to remove navigation bars, advertisements etc)
* remove duplicates
* tokenise and (if tools are available) lemmatise and part-of-speech tag
* load into our corpus query tool, the Sketch Engine
The corpora we have developed are available for use in the Sketch Engine corpus query tool.

Fast Syntactic Searching in Very Large Corpora for Many Languages
Miloš Jakubíček | Adam Kilgarriff | Diana McCarthy | Pavel Rychlý
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

A Detailed, Accurate, Extensive, Available English Lexical Database
Adam Kilgarriff
Proceedings of the NAACL HLT 2010 Demonstration Session

2008

Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case
Kremena Ivanova | Ulrich Heid | Sabine Schulte im Walde | Adam Kilgarriff | Jan Pomikálek
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Word sketches are part of the Sketch Engine corpus query system. They represent automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Besides the corpus itself, word sketches require a sketch grammar, a regular expression-based shallow grammar over the part-of-speech tags, to extract evidence for the properties of the targeted words from the corpus. The paper presents a sketch grammar for German, a language which is not strictly configurational and which shows a considerable amount of case syncretism, and evaluates its accuracy, which has not been done for other sketch grammars. The evaluation focuses on NP case as a crucial part of the German grammar. We present various versions of NP definitions, thereby demonstrating the influence of grammar detail on precision and recall.

Cleaneval: a Competition for Cleaning Web Pages
Marco Baroni | Francis Chantree | Adam Kilgarriff | Serge Sharoff
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development. The first exercise took place in 2007. We describe how it was set up, results, and lessons learnt.

2007

An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)
Pavel Rychlý | Adam Kilgarriff
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

Last Words: Googleology is Bad Science
Adam Kilgarriff
Computational Linguistics, Volume 33, Number 1, March 2007

2006

Large Linguistically-Processed Web Corpora for Multiple Languages
Marco Baroni | Adam Kilgarriff
Demonstrations

Shared-Task Evaluations in HLT: Lessons for NLG
Anja Belz | Adam Kilgarriff
Proceedings of the Fourth International Natural Language Generation Conference

Annotated Web as corpus
Paul Rayson | James Walkerdine | William H. Fletcher | Adam Kilgarriff
Proceedings of the 2nd International Workshop on Web as Corpus

WebBootCaT. Instant Domain-Specific Corpora to Support Human Translators
Marco Baroni | Adam Kilgarriff | Jan Pomikálek | Pavel Rychlý
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

2005

Most MT lexicography is devoted to developing rules of the kind, “in context C, translate source-language word S as target-language word T”. Very many such rules are required, producing them is laborious, and MT companies standardly spend large sums on it. We present the WASP-Bench, a lexicographer's workstation for the rapid and semi-automatic development of such rule-sets. The WASP-Bench makes use of a large source-language corpus and state-of-the-art techniques for Word Sense Disambiguation. We show that the WSD accuracy is on a par with the best results published to date, with the advantage that the WASP-Bench, unlike other high-performance systems, does not require a sense-disambiguated training corpus as input. The WASP-Bench is designed to fit readily with MT companies' working practices, as it may be used for as many or as few source language words as present disambiguation problems for a given target.