Paul Clough

Also published as: Paul D. Clough


Generating Paths through Cultural Heritage Collections
Samuel Fernando | Paula Goodale | Paul Clough | Mark Stevenson | Mark Hall | Eneko Agirre
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

PATHS: A System for Accessing Cultural Heritage Collections
Eneko Agirre | Nikolaos Aletras | Paul Clough | Samuel Fernando | Paula Goodale | Mark Hall | Aitor Soroa | Mark Stevenson
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations


Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles
Monica Lestari Paramita | Paul Clough | Ahmet Aker | Robert Gaizauskas
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Wikipedia articles in different languages have been mined to support various tasks, such as Cross-Language Information Retrieval (CLIR) and Statistical Machine Translation (SMT). Articles on the same topic in different languages are often connected by inter-language links, which can be used to identify similar or comparable content. In this work, we investigate the correlation between similarity measures utilising language-independent and language-dependent features and respective human judgments. A collection of 800 Wikipedia pairs from 8 different language pairs was collected and judged for similarity by two assessors. We report the development of this corpus and inter-assessor agreement between judges across the languages. Results show that similarity measured using language-independent features is comparable to using an approach based on translating non-English documents. In both cases, the correlation with human judgments is low and also depends on the language pair. The results and corpus generated from this work also provide insights into the measurement of cross-language similarity.

Collecting and Using Comparable Corpora for Statistical Machine Translation
Inguna Skadiņa | Ahmet Aker | Nikos Mastropavlos | Fangzhong Su | Dan Tufis | Mateja Verlic | Andrejs Vasiļjevs | Bogdan Babych | Paul Clough | Robert Gaizauskas | Nikos Glaros | Monica Lestari Paramita | Mārcis Pinnis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods for improving machine translation systems using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquisition of comparable corpora from the Web and other sources, for evaluation of the comparability of collected corpora, for multi-level alignment of comparable corpora and for extraction of lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of collected corpora in domain-adapted machine translation and real-life applications.

Enabling the Discovery of Digital Cultural Heritage Objects through Wikipedia
Mark Michael Hall | Oier Lopez de Lacalle | Aitor Soroa Etxabe | Paul Clough | Eneko Agirre
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Detecting Text Reuse with Modified and Weighted N-grams
Rao Muhammad Adeel Nawab | Mark Stevenson | Paul Clough
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Comparing Taxonomies for Organising Collections of Documents
Samuel Fernando | Mark Hall | Eneko Agirre | Aitor Soroa | Paul Clough | Mark Stevenson
Proceedings of COLING 2012


Multilingual interactive experiments with Flickr
Paul D. Clough | Julio Gonzales | Jussi Karlgren
Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources


EuroWordNet as a Resource for Cross-language Information Retrieval
Mark Stevenson | Paul Clough
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)


Measuring Text Reuse
Paul Clough | Robert Gaizauskas | Scott S.L. Piao | Yorick Wilks
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

Building and annotating a corpus for the study of journalistic text reuse
Paul Clough | Robert Gaizauskas | S. L. Piao
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)