Nancy Ide

Also published as: Nancy M. Ide

2022

This paper provides an overview of the xDD/LAPPS Grid framework and reports the results of evaluating the AskMe retrieval engine using the BEIR benchmark datasets. Our primary goal is to determine a solid baseline of performance to guide further development of our retrieval capabilities. Beyond this, we aim to dig deeper to determine when and why certain approaches perform well (or badly) on both in-domain and out-of-domain data, an issue that has to date received relatively little attention.

2020

AskMe: A LAPPS Grid-based NLP Query and Retrieval System for Covid-19 Literature
Keith Suderman | Nancy Ide | Marc Verhagen | Brent Cochran | James Pustejovsky
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

In a recent project, the Language Application Grid was augmented to support the mining of scientific publications. The results of that effort have now been repurposed to focus on Covid-19 literature, including modification of the LAPPS Grid “AskMe” query and retrieval engine. We describe the AskMe system and discuss its functionality as compared to other query engines available to search Covid-related publications.

Towards Standardization of Web Service Protocols for NLPaaS
Jin-Dong Kim | Nancy Ide | Keith Suderman
Proceedings of the 1st International Workshop on Language Technology Platforms

Several web services for various natural language processing (NLP) tasks (“NLP-as-a-service” or NLPaaS) have recently been made publicly available. However, despite their similar functionality, these services often differ in the protocols they use, thus complicating the development of clients accessing them. A survey of currently available NLPaaS services suggests that it may be possible to identify a minimal application layer protocol that can be shared by NLPaaS services without sacrificing functionality or convenience, while at the same time simplifying the development of clients for these services. In this paper, we hope to raise awareness of the interoperability problems caused by the variety of existing web service protocols, and describe an effort to identify a set of best practices for NLPaaS protocol design. To that end, we survey and compare protocols used by NLPaaS services and suggest how these protocols may be further aligned to reduce variation.

We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.

Interchange Formats for Visualization: LIF and MMIF
Kyeongmin Rim | Kelley Lynch | Marc Verhagen | Nancy Ide | James Pustejovsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

Promoting interoperable computational linguistics (CL) and natural language processing (NLP) application platforms and interchangeable data formats has contributed to improving the discoverability and accessibility of openly available NLP software. In this paper, we discuss the enhanced data visualization capabilities that are also enabled by inter-operating NLP pipelines and interchange formats. To add openly available visualization tools and graphical annotation tools to the Language Applications Grid (LAPPS Grid) and Computational Linguistics Applications for Multimedia Services (CLAMS) toolboxes, we have developed interchange formats that can carry annotations and metadata for text and audiovisual source data. We describe those data formats and present case studies where we successfully adopt open-source visualization tools and combine them with CL tools.

2019

A Multi-Platform Annotation Ecosystem for Domain Adaptation
Richard Eckart de Castilho | Nancy Ide | Jin-Dong Kim | Jan-Christoph Klie | Keith Suderman
Proceedings of the 13th Linguistic Annotation Workshop

This paper describes an ecosystem consisting of three independent text annotation platforms. To demonstrate their ability to work in concert, we illustrate how to use them to address an interactive domain adaptation task in biomedical entity recognition. The platforms and the approach are in general domain-independent and can be readily applied to other areas of science.

2018

Mining Biomedical Publications With The LAPPS Grid
Nancy Ide | Keith Suderman | Jin-Dong Kim
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)
Nancy Ide | Aurélie Herbelot | Lluís Màrquez
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

For decades, most self-respecting linguistic engineering initiatives have designed and implemented custom representations for various layers of, for example, morphological, syntactic, and semantic analysis. Despite occasional efforts at harmonization or even standardization, our field today is blessed with a multitude of ways of encoding and exchanging linguistic annotations of these types, at the levels of ‘abstract syntax’, naming choices, and of course file formats. To a large degree, it is possible to work within and across design plurality by conversion, and often there may be good reasons for divergent design reflecting differences in use. However, it is likely that some abstract commonalities across choices of representation are obscured by more superficial differences, and conversely there is no obvious procedure to tease apart what actually constitute contentful vs. mere technical divergences. In this study, we seek to conceptually align three representations for common types of morpho-syntactic analysis, pinpoint what in our view constitute contentful differences, and reflect on the underlying principles and specific requirements that led to individual choices. We expect that a more in-depth understanding of these choices across designs may lead to increased harmonization, or at least to more informed design of future representations.

2016

Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016)
Yohei Murakami | Donghui Lin | Nancy Ide | James Pustejovsky
Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016)

LAPPS/Galaxy: Current State and Next Steps
Nancy Ide | Keith Suderman | Eric Nyberg | James Pustejovsky | Marc Verhagen
Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016)

The US National Science Foundation (NSF) SI2-funded LAPPS/Galaxy project has developed an open-source platform that enables complex analyses while hiding the complexities of the underlying infrastructure and that can be accessed through a web interface, deployed on any Unix system, or run from the cloud. It provides sophisticated tool integration and history capabilities, a workflow system for building automated multi-step analyses, state-of-the-art evaluation capabilities, and facilities for sharing and publishing analyses. This paper describes the current facilities available in LAPPS/Galaxy and outlines the project’s ongoing activities to enhance the framework.

The Language Application Grid and Galaxy
Nancy Ide | Keith Suderman | James Pustejovsky | Marc Verhagen | Christopher Cieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The NSF-SI2-funded LAPPS Grid project is a collaborative effort among Brandeis University, Vassar College, Carnegie Mellon University (CMU), and the Linguistic Data Consortium (LDC), which has developed an open, web-based infrastructure through which resources can be easily accessed and within which tailored language services can be efficiently composed, evaluated, disseminated and consumed by researchers, developers, and students across a wide variety of disciplines. The LAPPS Grid project recently adopted Galaxy (Giardine et al., 2005), a robust, well-developed, and well-supported front end for workflow configuration, management, and persistence. Galaxy allows data inputs and processing steps to be selected from graphical menus, and results are displayed in intuitive plots and summaries that encourage interactive workflows and the exploration of hypotheses. The Galaxy workflow engine provides significant advantages for deploying pipelines of LAPPS Grid web services, including not only the means to create and deploy locally run and even customized versions of the LAPPS Grid and to run the LAPPS Grid in the cloud, but also access to a huge array of statistical and visualization tools that have been developed for use in genomics research.

2015

Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications
Christian Chiarcos | John Philip McCrae | Petya Osenova | Philipp Cimiano | Nancy Ide
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

2014

FrameNet and Linked Data
Nancy Ide
Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore (1929-2014)

Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT
Nancy Ide | Jens Grivolla
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

The Language Application Grid Web Service Exchange Vocabulary
Nancy Ide | James Pustejovsky | Keith Suderman | Marc Verhagen
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

The Language Application (LAPPS) Grid project is establishing a framework that enables language service discovery, composition, and reuse and promotes sustainability, manageability, usability, and interoperability of natural language processing (NLP) components. It is based on the service-oriented architecture (SOA), a more recent, web-oriented version of the pipeline architecture that has long been used in NLP for sequencing loosely-coupled linguistic analyses. The LAPPS Grid provides access to basic NLP processing tools and resources and enables pipelining such tools to create custom NLP applications, as well as composite services such as question answering and machine translation together with language resources such as mono- and multi-lingual corpora and lexicons that support NLP. The transformative aspect of the LAPPS Grid is that it orchestrates access to and deployment of language resources and processing functions available from servers around the globe and enables users to add their own language resources, services, and even service grids to satisfy their particular needs.

Biber Redux: Reconsidering Dimensions of Variation in American English
Rebecca J. Passonneau | Nancy Ide | Songqiao Su | Jesse Stuart
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF
Arne Neumann | Nancy Ide | Manfred Stede
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

Proceedings of the Sixth Linguistic Annotation Workshop
Nancy Ide | Fei Xia
Proceedings of the Sixth Linguistic Annotation Workshop

A Model for Linguistic Resource Description
Nancy Ide | Keith Suderman
Proceedings of the Sixth Linguistic Annotation Workshop

The MASC Word Sense Corpus
Rebecca J. Passonneau | Collin F. Baker | Christiane Fellbaum | Nancy Ide
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The MASC project has produced a multi-genre corpus with multiple layers of linguistic annotation, together with a sentence corpus containing WordNet 3.1 sense tags for 1000 occurrences of each of 100 words produced by multiple annotators, accompanied by in-depth inter-annotator agreement data. Here we give an overview of the contents of MASC and then focus on the word sense sentence corpus, describing the characteristics that differentiate it from other word sense corpora and detailing the inter-annotator agreement studies that have been performed on the annotations. Finally, we discuss the potential to grow the word sense sentence corpus through crowdsourcing and the plan to enhance the content and annotations of MASC through a community-based collaborative effort.

Empirical Comparisons of MASC Word Sense Annotations
Gerard de Melo | Collin F. Baker | Nancy Ide | Rebecca J. Passonneau | Christiane Fellbaum
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We analyze how different conceptions of lexical semantics affect sense annotations and how multiple sense inventories can be compared empirically, based on annotated text. Our study focuses on the MASC project, where data has been annotated using WordNet sense identifiers on the one hand, and FrameNet lexical units on the other. This allows us to compare the sense inventories of these lexical resources empirically rather than just theoretically, based on their glosses, leading to new insights. In particular, we compute contingency matrices and develop a novel measure, the Expected Jaccard Index, that quantifies the agreement between annotations of the same data based on two different resources even when they have different sets of categories.

2011

Proceedings of the 5th Linguistic Annotation Workshop
Nancy Ide | Adam Meyers | Sameer Pradhan | Katrin Tomanek
Proceedings of the 5th Linguistic Annotation Workshop

2010

Anveshan: A Framework for Analysis of Multiple Annotators’ Labeling Behavior
Vikas Bhardwaj | Rebecca Passonneau | Ansaf Salleb-Aouissi | Nancy Ide
Proceedings of the Fourth Linguistic Annotation Workshop

Anatomy of Annotation Schemes: Mapping to GrAF
Nancy Ide | Harry Bunt
Proceedings of the Fourth Linguistic Annotation Workshop

The Manually Annotated Sub-Corpus: A Community Resource for and by the People
Nancy Ide | Collin Baker | Christiane Fellbaum | Rebecca Passonneau
Proceedings of the ACL 2010 Conference Short Papers

ANC2Go: A Web Application for Customized Corpus Creation
Nancy Ide | Keith Suderman | Brian Simms
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a web application called ANC2Go that enables the user to select data from the Open American National Corpus (OANC) and the Manually Annotated Sub-corpus (MASC) together with some or all of the annotations available. The user also may select from among a variety of options for output format, or may receive the selected portions of the corpus and annotations in their original GrAF XML standoff format. The request is processed by merging the annotations selected and rendering them in the desired output format, then bundling the results and making them available for download. Thus, users can create a customized corpus with data and annotations of their choosing, delivered in the format that is most convenient for their use. ANC2Go will be released as a web service in the near future. Both the OANC and MASC are freely available for any use from the American National Corpus website and may be accessed through the ANC2Go application, or they may be downloaded in their entirety.

Word Sense Annotation of Polysemous Words by Multiple Annotators
Rebecca J. Passonneau | Ansaf Salleb-Aouissi | Vikas Bhardwaj | Nancy Ide
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe results of a word sense annotation task using WordNet, involving half a dozen well-trained annotators on ten polysemous words for three parts of speech. One hundred sentences for each word were annotated. Annotators had the same level of training and experience, but interannotator agreement (IA) varied across words. There was some effect of part of speech, with higher agreement on nouns and adjectives, but within the words for each part of speech there was wide variation. This variation in IA does not correlate with number of senses in the inventory, or the number of senses actually selected by annotators. In fact, IA was sometimes quite high for words with many senses. We claim that the IA variation is due to the word meanings, contexts of use, and individual differences among annotators. We find some correlation of IA with sense confusability as measured by a sense confusion threshold (CT). Data mining for association rules on a flattened data representation indicating each annotator's sense choices identifies outliers for some words, and systematic differences among pairs of annotators on others.

LRs remain expensive to create and thus rare relative to demand across languages and technology types. The accidental re-creation of an LR that already exists is a nearly unforgivable waste of scarce resources that is unfortunately not so easy to avoid. The number of catalogs the HLT researcher must search, with their different formats, makes it possible to overlook an existing resource. This paper sketches the sources of this problem and outlines a proposal to rectify it, along with a new vision of LR cataloging that will facilitate the documentation and exploitation of a much wider range of LRs than previously considered.

2009

Latin Etymologies as Features on BNC Text Categorization
Alex Chengyu Fang | Wanyin Li | Nancy Ide
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

Making Sense of Word Sense Variation
Rebecca Passonneau | Ansaf Salleb-Aouissi | Nancy Ide
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

Proceedings of the Third Linguistic Annotation Workshop (LAW III)
Manfred Stede | Chu-Ren Huang | Nancy Ide | Adam Meyers
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

Bridging the Gaps: Interoperability for GrAF, GATE, and UIMA
Nancy Ide | Keith Suderman
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

The SILT and FlaReNet International Collaboration for Interoperability
Nancy Ide | James Pustejovsky | Nicoletta Calzolari | Claudia Soria
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

MASC: the Manually Annotated Sub-Corpus of American English
Nancy Ide | Collin Baker | Christiane Fellbaum | Charles Fillmore | Rebecca Passonneau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing maximum accessibility for researchers from around the globe.

A Bilingual Corpus of Inter-linked Events
Tommaso Caselli | Nancy Ide | Roberto Bartolini
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the creation of a bilingual corpus of inter-linked events for Italian and English. Linkage is accomplished through the Inter-Lingual Index (ILI) that links ItalWordNet with WordNet. The availability of this resource, on the one hand, enables contrastive analysis of the linguistic phenomena surrounding events in both languages, and on the other hand, can be used to perform multilingual temporal analysis of texts. In addition to describing the methodology for construction of the inter-linked corpus and the analysis of the data collected, we demonstrate that the ILI could potentially be used to bootstrap the creation of comparable corpora by exporting layers of annotation for words that have the same sense.

2007

GrAF: A Graph-based Format for Linguistic Annotations
Nancy Ide | Keith Suderman
Proceedings of the Linguistic Annotation Workshop

Shared Corpora Working Group Report
Adam Meyers | Nancy Ide | Ludovic Denoyer | Yusuke Shinyama
Proceedings of the Linguistic Annotation Workshop

2006

Layering and Merging Linguistic Annotations
Keith Suderman | Nancy Ide
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing

Integrating Linguistic Resources: The American National Corpus Model
Nancy Ide | Keith Suderman
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the architecture of the American National Corpus and the design decisions we have made in order to make the corpus easy to use with a variety of existing tools with varying functionality, and to allow for layering multiple annotations over the data. The overall goal of the ANC project is to provide an open linguistic infrastructure for American English, consisting of as many self-generated or contributed annotations of the data as possible together with derived. The availability of a wide variety of annotations for the same data and in a common format should significantly simplify the processing required to extract annotations from different sources and enable use of the ANC and its annotations with off-the-shelf software.

Representing Linguistic Corpora and Their Annotations
Nancy Ide | Laurent Romary
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A Linguistic Annotation Framework (LAF) is being developed within the International Standards Organization Technical Committee 37 Sub-committee on Language Resource Management (ISO TC37 SC4). LAF is intended to provide a standardized means to represent linguistic data and its annotations that is defined broadly enough to accommodate all types of linguistic annotations, and at the same time provide means to represent precise and potentially complex linguistic information. The general principles informing the design of LAF have been previously reported (Ide and Romary, 2003; Ide and Romary, 2004a). This paper describes some of the more technical aspects of the LAF design that have been addressed in the process of finalizing the specifications for the standard.

Machine-readable versions of everyday dictionaries have been seen as a likely source of information for use in natural language processing because they contain an enormous amount of lexical and semantic knowledge. However, after 15 years of research, the results appear to be disappointing. No comprehensive evaluation of machine-readable dictionaries (MRDs) as a knowledge source has been made to date, although this is necessary to determine what, if anything, can be gained from MRD research. To this end, this paper will first consider the postulates upon which MRD research has been based over the past fifteen years, discuss the validity of these postulates, and evaluate the results of this work. We will then propose possible future directions and applications that may exploit these years of effort, in the light of current directions in not only NLP research, but also fields such as lexicography and electronic publishing.