2022
Evaluating Retrieval for Multi-domain Scientific Publications
Nancy Ide | Keith Suderman | Jingxuan Tu | Marc Verhagen | Shanan Peters | Ian Ross | John Lawson | Andrew Borg | James Pustejovsky
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper provides an overview of the xDD/LAPPS Grid framework and provides results of evaluating the AskMe retrieval engine using the BEIR benchmark datasets. Our primary goal is to determine a solid baseline of performance to guide further development of our retrieval capabilities. Beyond this, we aim to dig deeper to determine when and why certain approaches perform well (or badly) on both in-domain and out-of-domain data, an issue that has to date received relatively little attention.
2020
Infrastructure for Semantic Annotation in the Genomics Domain
Mahmoud El-Haj | Nathan Rutherford | Matthew Coole | Ignatius Ezeani | Sheryl Prentice | Nancy Ide | Jo Knight | Scott Piao | John Mariani | Paul Rayson | Keith Suderman
Proceedings of the Twelfth Language Resources and Evaluation Conference
We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.
AskMe: A LAPPS Grid-based NLP Query and Retrieval System for Covid-19 Literature
Keith Suderman | Nancy Ide | Marc Verhagen | Brent Cochran | James Pustejovsky
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020
In a recent project, the Language Application Grid was augmented to support the mining of scientific publications. The results of that effort have now been repurposed to focus on Covid-19 literature, including modification of the LAPPS Grid “AskMe” query and retrieval engine. We describe the AskMe system and discuss its functionality as compared to other query engines available to search covid-related publications.
Towards Standardization of Web Service Protocols for NLPaaS
Jin-Dong Kim | Nancy Ide | Keith Suderman
Proceedings of the 1st International Workshop on Language Technology Platforms
Several web services for various natural language processing (NLP) tasks (“NLP-as-a-service” or NLPaaS) have recently been made publicly available. However, despite their similar functionality, these services often differ in the protocols they use, thus complicating the development of clients accessing them. A survey of currently available NLPaaS services suggests that it may be possible to identify a minimal application layer protocol that can be shared by NLPaaS services without sacrificing functionality or convenience, while at the same time simplifying the development of clients for these services. In this paper, we hope to raise awareness of the interoperability problems caused by the variety of existing web service protocols, and describe an effort to identify a set of best practices for NLPaaS protocol design. To that end, we survey and compare protocols used by NLPaaS services and suggest how these protocols may be further aligned to reduce variation.
2019
A Multi-Platform Annotation Ecosystem for Domain Adaptation
Richard Eckart de Castilho | Nancy Ide | Jin-Dong Kim | Jan-Christoph Klie | Keith Suderman
Proceedings of the 13th Linguistic Annotation Workshop
This paper describes an ecosystem consisting of three independent text annotation platforms. To demonstrate their ability to work in concert, we illustrate how to use them to address an interactive domain adaptation task in biomedical entity recognition. The platforms and the approach are in general domain-independent and can be readily applied to other areas of science.
2018
Bridging the LAPPS Grid and CLARIN
Erhard Hinrichs | Nancy Ide | James Pustejovsky | Jan Hajič | Marie Hinrichs | Mohammad Fazleh Elahi | Keith Suderman | Marc Verhagen | Kyeongmin Rim | Pavel Straňák | Jozef Mišutka
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Mining Biomedical Publications With The LAPPS Grid
Nancy Ide | Keith Suderman | Jin-Dong Kim
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Representation and Interchange of Linguistic Annotation. An In-Depth, Side-by-Side Comparison of Three Designs
Richard Eckart de Castilho | Nancy Ide | Emanuele Lapponi | Stephan Oepen | Keith Suderman | Erik Velldal | Marc Verhagen
Proceedings of the 11th Linguistic Annotation Workshop
For decades, most self-respecting linguistic engineering initiatives have designed and implemented custom representations for various layers of, for example, morphological, syntactic, and semantic analysis. Despite occasional efforts at harmonization or even standardization, our field today is blessed with a multitude of ways of encoding and exchanging linguistic annotations of these types, both at the levels of ‘abstract syntax’, naming choices, and of course file formats. To a large degree, it is possible to work within and across design plurality by conversion, and often there may be good reasons for divergent design reflecting differences in use. However, it is likely that some abstract commonalities across choices of representation are obscured by more superficial differences, and conversely there is no obvious procedure to tease apart what actually constitute contentful vs. mere technical divergences. In this study, we seek to conceptually align three representations for common types of morpho-syntactic analysis, pinpoint what in our view constitute contentful differences, and reflect on the underlying principles and specific requirements that led to individual choices. We expect that a more in-depth understanding of these choices across designs may lead to increased harmonization, or at least to more informed design of future representations.
2016
LAPPS/Galaxy: Current State and Next Steps
Nancy Ide | Keith Suderman | Eric Nyberg | James Pustejovsky | Marc Verhagen
Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016)
The US National Science Foundation (NSF) SI2-funded LAPPS/Galaxy project has developed an open-source platform for enabling complex analyses while hiding complexities associated with underlying infrastructure, that can be accessed through a web interface, deployed on any Unix system, or run from the cloud. It provides sophisticated tool integration and history capabilities, a workflow system for building automated multi-step analyses, state-of-the-art evaluation capabilities, and facilities for sharing and publishing analyses. This paper describes the current facilities available in LAPPS/Galaxy and outlines the project’s ongoing activities to enhance the framework.
The Language Application Grid and Galaxy
Nancy Ide | Keith Suderman | James Pustejovsky | Marc Verhagen | Christopher Cieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The NSF-SI2-funded LAPPS Grid project is a collaborative effort among Brandeis University, Vassar College, Carnegie Mellon University (CMU), and the Linguistic Data Consortium (LDC), which has developed an open, web-based infrastructure through which resources can be easily accessed and within which tailored language services can be efficiently composed, evaluated, disseminated and consumed by researchers, developers, and students across a wide variety of disciplines. The LAPPS Grid project recently adopted Galaxy (Giardine et al., 2005), a robust, well-developed, and well-supported front end for workflow configuration, management, and persistence. Galaxy allows data inputs and processing steps to be selected from graphical menus, and results are displayed in intuitive plots and summaries that encourage interactive workflows and the exploration of hypotheses. The Galaxy workflow engine provides significant advantages for deploying pipelines of LAPPS Grid web services, including not only means to create and deploy locally-run and even customized versions of the LAPPS Grid as well as running the LAPPS Grid in the cloud, but also access to a huge array of statistical and visualization tools that have been developed for use in genomics research.
2014
The Language Application Grid
Nancy Ide | James Pustejovsky | Christopher Cieri | Eric Nyberg | Di Wang | Keith Suderman | Marc Verhagen | Jonathan Wright
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The Language Application (LAPPS) Grid project is establishing a framework that enables language service discovery, composition, and reuse and promotes sustainability, manageability, usability, and interoperability of natural language processing (NLP) components. It is based on the service-oriented architecture (SOA), a more recent, web-oriented version of the pipeline architecture that has long been used in NLP for sequencing loosely-coupled linguistic analyses. The LAPPS Grid provides access to basic NLP processing tools and resources and enables pipelining such tools to create custom NLP applications, as well as composite services such as question answering and machine translation together with language resources such as mono- and multi-lingual corpora and lexicons that support NLP. The transformative aspect of the LAPPS Grid is that it orchestrates access to and deployment of language resources and processing functions available from servers around the globe and enables users to add their own language resources, services, and even service grids to satisfy their particular needs.
The Language Application Grid Web Service Exchange Vocabulary
Nancy Ide | James Pustejovsky | Keith Suderman | Marc Verhagen
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT
2012
A Model for Linguistic Resource Description
Nancy Ide | Keith Suderman
Proceedings of the Sixth Linguistic Annotation Workshop
2010
ANC2Go: A Web Application for Customized Corpus Creation
Nancy Ide | Keith Suderman | Brian Simms
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We describe a web application called ANC2Go that enables the user to select data from the Open American National Corpus (OANC) and the Manually Annotated Sub-corpus (MASC) together with some or all of the annotations available. The user also may select from among a variety of options for output format, or may receive the selected portions of the corpus and annotations in their original GrAF XML standoff format. The request is processed by merging the annotations selected and rendering them in the desired output format, then bundling the results and making them available for download. Thus, users can create a customized corpus with data and annotations of their choosing, delivered in the format that is most convenient for their use. ANC2Go will be released as a web service in the near future. Both the OANC and MASC are freely available for any use from the American National Corpus website and may be accessed through the ANC2Go application, or they may be downloaded in their entirety.
2009
Bridging the Gaps: Interoperability for GrAF, GATE, and UIMA
Nancy Ide | Keith Suderman
Proceedings of the Third Linguistic Annotation Workshop (LAW III)
2007
GrAF: A Graph-based Format for Linguistic Annotations
Nancy Ide | Keith Suderman
Proceedings of the Linguistic Annotation Workshop
2006
Integrating Linguistic Resources: The American National Corpus Model
Nancy Ide | Keith Suderman
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes the architecture of the American National Corpus and the design decisions we have made in order to make the corpus easy to use with a variety of existing tools with varying functionality, and to allow for layering multiple annotations over the data. The overall goal of the ANC project is to provide an open linguistic infrastructure for American English, consisting of as many self-generated or contributed annotations of the data as possible together with derived. The availability of a wide variety of annotations for the same data and in a common format should significantly simplify the processing required to extract annotations from different sources and enable use of the ANC and its annotations with off-the-shelf software.
Layering and Merging Linguistic Annotations
Keith Suderman | Nancy Ide
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing
2004
The American National Corpus First Release
Nancy Ide | Keith Suderman
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2002
The American National Corpus: More Than the Web Can Provide
Nancy Ide | Randi Reppen | Keith Suderman
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)