Scott S.L. Piao

Also published as: S. L. Piao, Scott Piao, Scott S. L. Piao


Survey on Thai NLP Language Resources and Tools
Ratchakrit Arreerard | Stephen Mander | Scott Piao
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Over the past decades, Natural Language Processing (NLP) research has been expanding to cover more languages. Recently in particular, the NLP community has paid increasing attention to under-resourced languages. However, there are still many languages for which NLP research is limited in terms of both language resources and software tools. Thai is one of the under-resourced languages in the NLP domain, although it is spoken by nearly 70 million people globally. In this paper, we report on our survey of the past development of Thai NLP research to help understand its current state and future research directions. Our survey shows that, although the Thai NLP community has made significant progress over the past three decades, particularly on upstream NLP tasks such as tokenisation, research on downstream tasks such as syntactic parsing and semantic analysis is still limited. We nevertheless foresee that Thai NLP research will advance rapidly as richer Thai language resources and more robust NLP techniques become available.


Metaphorical Expressions in Automatic Arabic Sentiment Analysis
Israa Alsiyat | Scott Piao
Proceedings of the Twelfth Language Resources and Evaluation Conference

In recent years, Arabic language resources and NLP tools have been under rapid development. One of the important tasks for Arabic natural language processing is sentiment analysis. While significant improvement has been achieved in this research area, the existing computational models and tools still lack the capability to deal with Arabic metaphorical expressions. Metaphor has an important role in the Arabic language due to its unique history and culture. Metaphors provide a linguistic mechanism for expressing ideas and notions that can be different from their surface form. Therefore, in order to efficiently identify the true sentiment of Arabic language data, a computational model needs to be able to “read between the lines”. In this paper, we examine the issue of metaphors in automatic Arabic sentiment analysis by carrying out an experiment in which we observe the performance of a state-of-the-art Arabic sentiment tool on metaphors and analyse the result to gain a deeper insight into the issue. Our experiment shows that metaphors have a significant impact on the performance of current Arabic sentiment tools, and that developing Arabic language resources and computational models for Arabic metaphors is an important task.

Infrastructure for Semantic Annotation in the Genomics Domain
Mahmoud El-Haj | Nathan Rutherford | Matthew Coole | Ignatius Ezeani | Sheryl Prentice | Nancy Ide | Jo Knight | Scott Piao | John Mariani | Paul Rayson | Keith Suderman
Proceedings of the Twelfth Language Resources and Evaluation Conference

We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.


Leveraging Pre-Trained Embeddings for Welsh Taggers
Ignatius Ezeani | Scott Piao | Steven Neale | Paul Rayson | Dawn Knight
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

While the application of word embedding models to downstream Natural Language Processing (NLP) tasks has been shown to be successful, the benefits for low-resource languages are somewhat limited due to the lack of adequate data for training the models. However, NLP research efforts for low-resource languages have focused on seeking ways to harness pre-trained models to improve the performance of NLP systems built to process these languages without the need to re-invent the wheel. One such language is Welsh. In this paper, we therefore present the results of our experiments on learning a simple multi-task neural network model for part-of-speech and semantic tagging for Welsh using a pre-trained embedding model from FastText. Our model’s performance was compared with that of the existing rule-based stand-alone taggers for part-of-speech and semantic tagging. Despite its simplicity and capacity to perform both tasks simultaneously, our tagger compared very well with the existing taggers.


Towards a Welsh Semantic Annotation System
Scott Piao | Paul Rayson | Dawn Knight | Gareth Watkins
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Profiling Medical Journal Articles Using a Gene Ontology Semantic Tagger
Mahmoud El-Haj | Paul Rayson | Scott Piao | Jo Knight
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds
Mahmoud El-Haj | Paul Rayson | Scott Piao | Stephen Wattam
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

Creating high-quality, wide-coverage multilingual semantic lexicons to support knowledge-based approaches is a challenging, time-consuming manual task. This has traditionally been performed by linguistic experts: a slow and expensive process. We present an experiment in which we adapt and evaluate crowdsourcing methods employing native speakers to generate a list of coarse-grained senses under a common multilingual semantic taxonomy for sets of words in six languages. 451 non-experts (including 427 Mechanical Turk workers) and 15 expert participants semantically annotated 250 words manually for Arabic, Chinese, English, Italian, Portuguese and Urdu lexicons. In order to avoid erroneous (spam) crowdsourced results, we used a novel task-specific two-phase filtering process where users were asked to identify synonyms in the target language, and remove erroneous senses.


Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages
Scott Piao | Paul Rayson | Dawn Archer | Francesca Bianchi | Carmen Dayrell | Mahmoud El-Haj | Ricardo-María Jiménez | Dawn Knight | Michal Křen | Laura Löfberg | Rao Muhammad Adeel Nawab | Jawad Shafi | Phoey Lee Teh | Olga Mudraya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The last two decades have seen the development of various semantic lexical resources such as WordNet (Miller, 1995) and the USAS semantic lexicon (Rayson et al., 2004), which have played an important role in the areas of natural language processing and corpus-based studies. Recently, increasing efforts have been devoted to extending the semantic frameworks of existing lexical knowledge resources to cover more languages, such as EuroWordNet and Global WordNet. In this paper, we report on the construction of large-scale multilingual semantic lexicons for twelve languages, which employ the unified Lancaster semantic taxonomy and provide a multilingual lexical knowledge base for the automatic UCREL semantic annotation system (USAS). Our work contributes towards the goal of constructing larger-scale and higher-quality multilingual semantic lexical resources and developing corpus annotation tools based on them. Lexical coverage is an important factor concerning the quality of the lexicons and the performance of the corpus annotation tools, and in this experiment we focus on evaluating the lexical coverage achieved by the multilingual lexicons and semantic annotation tools based on them. Our evaluation shows that some semantic lexicons such as those for Finnish and Italian have achieved lexical coverage of over 90% while others need further expansion.


Development of the Multilingual Semantic Annotation System
Scott Piao | Francesca Bianchi | Carmen Dayrell | Angela D’Egidio | Paul Rayson
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies


Clustering Related Terms with Definitions
Scott Piao | John McNaught | Sophia Ananiadou
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

It is a challenging task to match similar or related terms/expressions in NLP and Text Mining applications. Two typical areas in need of such work are terminology and ontology construction, where terms and concepts are extracted and organized into certain structures with various semantic relations. In the EU BOOTSTrep Project we test various techniques for matching terms that can assist human domain experts in building and enriching ontologies. This paper reports on work in which we evaluated a text comparison and clustering tool for this task. In particular, we explore the feasibility of matching related terms using their definitions. Ontology terms, such as Gene Ontology terms, are often assigned detailed definitions, which provide a fundamental information source for detecting relations between terms. Here we focus on the exploitation of term definitions for the term matching task. Our experiment shows that the tool is capable of grouping many related terms using their definitions.


An Annotation Type System for a Data-Driven NLP Pipeline
Udo Hahn | Ekaterina Buyko | Katrin Tomanek | Scott Piao | John McNaught | Yoshimasa Tsuruoka | Sophia Ananiadou
Proceedings of the Linguistic Annotation Workshop


ASSIST: Automated Semantic Assistance for Translators
Serge Sharoff | Bogdan Babych | Paul Rayson | Olga Mudraya | Scott Piao

Measuring MWE Compositionality Using Semantic Annotation
Scott S.L. Piao | Paul Rayson | Olga Mudraya | Andrew Wilson | Roger Garside
Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties

Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool
Scott S.L. Piao | Guangfan Sun | Paul Rayson | Qi Yuan
Proceedings of the Workshop on Multi-word-expressions in a multilingual context


Evaluating Lexical Resources for a Semantic Tagger
Scott S. L. Piao | Paul Rayson | Dawn Archer | Tony McEnery
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP community. In this paper, we evaluate the lexical coverage of the semantic lexicon both in terms of genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and some corpora of Newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76% to 97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has a remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems as well as to render them ‘future proof’, we need to evaluate their potential both synchronically and diachronically across genres.


Extracting Multiword Expressions with A Semantic Tagger
Scott S. L. Piao | Paul Rayson | Dawn Archer | Andrew Wilson | Tony McEnery
Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment


Building and annotating a corpus for the study of journalistic text reuse
Paul Clough | Robert Gaizauskas | S. L. Piao
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Measuring Text Reuse
Paul Clough | Robert Gaizauskas | Scott S.L. Piao | Yorick Wilks
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics