Marc Verhagen

2023

pdf abs
The Coreference under Transformation Labeling Dataset: Entity Tracking in Procedural Texts Using Event Models
Kyeongmin Rim | Jingxuan Tu | Bingyang Ye | Marc Verhagen | Eben Holderness | James Pustejovsky
Findings of the Association for Computational Linguistics: ACL 2023

We demonstrate that coreference resolution in procedural texts is significantly improved when performing transformation-based entity linking prior to coreference relation identification. When events in the text introduce changes to the state of participating entities, it is often impossible to accurately link entities in anaphoric and coreference relations without an understanding of the transformations those entities undergo. We show how adding event semantics helps to better model entity coreference. We argue that all transformation predicates, not just creation verbs, introduce a new entity into the discourse, as a kind of generalized Result Role, which is typically not textually mentioned. This allows us to model procedural texts as process graphs and to compute the coreference type for any two entities in the recipe. We present our annotation methodology and the corpus generated as well as describe experiments on coreference resolution of entity mentions under a process-oriented model of events.

2022

pdf abs
The CLAMS Platform at Work: Processing Audiovisual Data from the American Archive of Public Broadcasting
Marc Verhagen | Kelley Lynch | Kyeongmin Rim | James Pustejovsky
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Computational Linguistics Applications for Multimedia Services (CLAMS) platform provides access to computational content analysis tools for multimedia material. The version we present here is a robust update of an initial prototype implementation from 2019. The platform now sports a variety of image, video, audio and text processing tools that interact via a common multi-modal representation language named MMIF (Multi-Media Interchange Format). We describe the overall architecture, the MMIF format, some of the tools included in the platform, the process to set up and run a workflow, visualizations included in CLAMS, and evaluate aspects of the platform on data from the American Archive of Public Broadcasting, showing how CLAMS can add metadata to mass-digitized multimedia collections, metadata that are typically only available implicitly in now largely unsearchable digitized media in archives and libraries.

This paper provides an overview of the xDD/LAPPS Grid framework and provides results of evaluating the AskMe retrievalengine using the BEIR benchmark datasets. Our primary goal is to determine a solid baseline of performance to guide furtherdevelopment of our retrieval capabilities. Beyond this, we aim to dig deeper to determine when and why certain approachesperform well (or badly) on both in-domain and out-of-domain data, an issue that has to date received relatively little attention.

2021

pdf abs
Exploration and Discovery of the COVID-19 Literature through Semantic Visualization
Jingxuan Tu | Marc Verhagen | Brent Cochran | James Pustejovsky
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

We propose semantic visualization as a linguistic visual analytic method. It can enable exploration and discovery over large datasets of complex networks by exploiting the semantics of the relations in them. This involves extracting information, applying parameter reduction operations, building hierarchical data representation and designing visualization. We also present the accompanying COVID-SemViz a searchable and interactive visualization system for knowledge exploration of COVID-19 data to demonstrate the application of our proposed method. In the user studies, users found that semantic visualization-powered COVID-SemViz is helpful in terms of finding relevant information and discovering unknown associations.

2020

pdf abs
Interchange Formats for Visualization: LIF and MMIF
Kyeongmin Rim | Kelley Lynch | Marc Verhagen | Nancy Ide | James Pustejovsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

Promoting interoperrable computational linguistics (CL) and natural language processing (NLP) application platforms and interchange-able data formats have contributed improving discoverabilty and accessbility of the openly available NLP software. In this paper, wediscuss the enhanced data visualization capabilities that are also enabled by inter-operating NLP pipelines and interchange formats.For adding openly available visualization tools and graphical annotation tools to the Language Applications Grid (LAPPS Grid) andComputational Linguistics Applications for Multimedia Services (CLAMS) toolboxes, we have developed interchange formats that cancarry annotations and metadata for text and audiovisual source data. We descibe those data formats and present case studies where wesuccessfully adopt open-source visualization tools and combine them with CL tools.

2018

2017

For decades, most self-respecting linguistic engineering initiatives have designed and implemented custom representations for various layers of, for example, morphological, syntactic, and semantic analysis. Despite occasional efforts at harmonization or even standardization, our field today is blessed with a multitude of ways of encoding and exchanging linguistic annotations of these types, both at the levels of ‘abstract syntax’, naming choices, and of course file formats. To a large degree, it is possible to work within and across design plurality by conversion, and often there may be good reasons for divergent design reflecting differences in use. However, it is likely that some abstract commonalities across choices of representation are obscured by more superficial differences, and conversely there is no obvious procedure to tease apart what actually constitute contentful vs. mere technical divergences. In this study, we seek to conceptually align three representations for common types of morpho-syntactic analysis, pinpoint what in our view constitute contentful differences, and reflect on the underlying principles and specific requirements that led to individual choices. We expect that a more in-depth understanding of these choices across designs may led to increased harmonization, or at least to more informed design of future representations.

2016

pdf abs
The Language Application Grid and Galaxy
Nancy Ide | Keith Suderman | James Pustejovsky | Marc Verhagen | Christopher Cieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The NSF-SI2-funded LAPPS Grid project is a collaborative effort among Brandeis University, Vassar College, Carnegie-Mellon University (CMU), and the Linguistic Data Consortium (LDC), which has developed an open, web-based infrastructure through which resources can be easily accessed and within which tailored language services can be efficiently composed, evaluated, disseminated and consumed by researchers, developers, and students across a wide variety of disciplines. The LAPPS Grid project recently adopted Galaxy (Giardine et al., 2005), a robust, well-developed, and well-supported front end for workflow configuration, management, and persistence. Galaxy allows data inputs and processing steps to be selected from graphical menus, and results are displayed in intuitive plots and summaries that encourage interactive workflows and the exploration of hypotheses. The Galaxy workflow engine provides significant advantages for deploying pipelines of LAPPS Grid web services, including not only means to create and deploy locally-run and even customized versions of the LAPPS Grid as well as running the LAPPS Grid in the cloud, but also access to a huge array of statistical and visualization tools that have been developed for use in genomics research.

pdf bib abs
LAPPS/Galaxy: Current State and Next Steps
Nancy Ide | Keith Suderman | Eric Nyberg | James Pustejovsky | Marc Verhagen
Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016)

The US National Science Foundation (NSF) SI2-funded LAPPS/Galaxy project has developed an open-source platform for enabling complex analyses while hiding complexities associated with underlying infrastructure, that can be accessed through a web interface, deployed on any Unix system, or run from the cloud. It provides sophisticated tool integration and history capabilities, a workflow system for building automated multi-step analyses, state-of-the-art evaluation capabilities, and facilities for sharing and publishing analyses. This paper describes the current facilities available in LAPPS/Galaxy and outlines the project’s ongoing activities to enhance the framework.

2015

pdf
SemEval-2015 Task 6: Clinical TempEval
Steven Bethard | Leon Derczynski | Guergana Savova | James Pustejovsky | Marc Verhagen
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf
The Language Application Grid Web Service Exchange Vocabulary
Nancy Ide | James Pustejovsky | Keith Suderman | Marc Verhagen
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

pdf
A Conceptual Framework of Online Natural Language Processing Pipeline Application
Chunqi Shi | James Pustejovsky | Marc Verhagen
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

pdf
Extracting Aspects and Polarity from Patents
Peter Anick | Marc Verhagen | James Pustejovsky
Proceedings of the COLING Workshop on Synchronic and Diachronic Approaches to Analyzing Technical Language

pdf abs
Identification of Technology Terms in Patents
Peter Anick | Marc Verhagen | James Pustejovsky
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Natural language analysis of patents holds promise for the development of tools designed to assist analysts in the monitoring of emerging technologies. One component of such tools is the identification of technology terms. We describe an approach to the discovery of technology terms using supervised machine learning and evaluate its performance on subsets of patents in three languages: English, German, and Chinese.

The Language Application (LAPPS) Grid project is establishing a framework that enables language service discovery, composition, and reuse and promotes sustainability, manageability, usability, and interoperability of natural language Processing (NLP) components. It is based on the service-oriented architecture (SOA), a more recent, web-oriented version of the pipeline architecture that has long been used in NLP for sequencing loosely-coupled linguistic analyses. The LAPPS Grid provides access to basic NLP processing tools and resources and enables pipelining such tools to create custom NLP applications, as well as composite services such as question answering and machine translation together with language resources such as mono- and multi-lingual corpora and lexicons that support NLP. The transformative aspect of the LAPPS Grid is that it orchestrates access to and deployment of language resources and processing functions available from servers around the globe and enables users to add their own language resources, services, and even service grids to satisfy their particular needs.

2013

pdf bib
SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations
Naushad UzZaman | Hector Llorens | Leon Derczynski | James Allen | Marc Verhagen | James Pustejovsky
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf abs
The TARSQI Toolkit
Marc Verhagen | James Pustejovsky
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present and demonstrate the updated version of the TARSQI Toolkit, a suite of temporal processing modules that extract temporal information from natural language texts. It parses the document and identifies temporal expressions, recognizes events, anchor events to temporal expressions and orders events relative to each other. The toolkit was previously demonstrated at COLING 2008, but has since seen substantial changes including: (1) incorporation of a new time expression tagger, (2)~embracement of stand-off annotation, (3) application to the medical domain and (4) introduction of narrative containers.

pdf abs
ATLIS: Identifying Locational Information in Text Automatically
John Vogel | Marc Verhagen | James Pustejovsky
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

ATLIS (short for ATLIS Tags Locations in Strings) is a tool being developed using a maximum-entropy machine learning model for automatically identifying information relating to spatial and locational information in natural language text. It is being developed in parallel with the ISO-Space standard for annotation of spatial information (Pustejovsky, Moszkowicz & Verhagen 2011). The goal of ATLIS is to be able to take in a document as raw text and mark it up with ISO-Space annotation data, so that another program could use the information in a standardized format to reason about the semantics of the spatial information in the document. The tool (as well as ISO-Space itself) is still in the early stages of development. At present it implements a subset of the proposed ISO-Space annotation standard: it identifies expressions that refer to specific places, as well as identifying prepositional constructions that indicate a spatial relationship between two objects. In this paper, the structure of the ATLIS tool is presented, along with preliminary evaluations of its performance.

2011

pdf
Medstract - The Next Generation
Marc Verhagen | James Pustejovsky
Proceedings of BioNLP 2011 Workshop

2010

pdf abs
The Brandeis Annotation Tool
Marc Verhagen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Brandeis Annotation Tool (BAT) is a web-based text annotation tool that is centered around the notions of layered annotation and task decomposition. It allows annotations to refer to other annotations and to take a complicated task and split it into easier subtasks. The central organizing concept of BAT is the annotation layer. A corpus administrator can create annotation layers that involve annotation of extents, attributes or relations. The layer definition includes the labels used, the attributes that are available and restrictions on the values for those attributes. For each annotation layer, files can be assigned to one or more annotators and one judge. When annotators log in, the assigned layers and files therein are presented. When selecting a file to annotate, the interface uses the layer definition to display the annotation interface. The web-interface connects administrators and annotators to a central repository for all data and simplifies many of the housekeeping tasks while keeping requirements at a minimum (that is, users only need an internet connection and a well-behaved browser). BAT has been used mainly for temporal annotation, but can be considered a more general tool for several kinds of textual annotation.

pdf
SemEval-2010 Task 13: TempEval-2
Marc Verhagen | Roser Saurí | Tommaso Caselli | James Pustejovsky
Proceedings of the 5th International Workshop on Semantic Evaluation

Natural language processing researchers currently have access to a wealth of information about words and word senses. This presents problems as well as resources, as it is often difficult to search through and coordinate lexical information across various data sources. We have approached this problem by creating a shared environment for various lexical resources. This browser, BULB (Brandeis Unified Lexical Browser) and its accompanying front-end provides the NLP researcher with a coordinated display from many of the available lexical resources, focusing, in particular, on a newly developed lexical database, the Brandeis Semantic Ontology (BSO). BULB is a module-based browser focusing on the interaction and display of modules from existing NLP tools. We discuss the BSO, PropBank, FrameNet, WordNet, and CQP, as well as other modules which will extend the system. We then outline future extensions to this work and present a release schedule for BULB.

pdf abs
Towards a Generative Lexical Resource: The Brandeis Semantic Ontology
James Pustejovsky | Catherine Havasi | Jessica Littman | Anna Rumshisky | Marc Verhagen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe the structure and development of the Brandeis Semantic Ontology (BSO), a large generative lexicon ontology and lexical database. The BSO has been designed to allow for more widespread access to Generative Lexicon-based lexical resources and help researchers in a variety of computational tasks. The specification of the type system used in the BSO largely follows that proposed by the SIMPLE specification (Busa et al., 2001), which was adopted by the EU-sponsored SIMPLE project (Lenci et al., 2000).

pdf abs
Annotation of Temporal Relations with Tango
Marc Verhagen | Robert Knippen | Inderjeet Mani | James Pustejovsky
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Temporal annotation is a complex task characterized by low markup speed and low inter-annotator agreements scores. Tango is a graphical annotation tool for temporal relations. It is developed for the TimeML annotation language and allows annotators to build a graph that resembles a timeline. Temporal relations are added by selecting events and drawing labeled arrows between them. Tango is integrated with a temporal closure component and includes features like SmartLink, user prompting and automatic linking of time expressions. Tango has been used to create two corpora with temporal annotation, TimeBank and the AQUAINT Opinion corpus.

pdf abs
SlinkET: A Partial Modal Parser for Events
Roser Saurí | Marc Verhagen | James Pustejovsky
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present SlinkET, a parser for identifying contexts of event modality in text developed within the TARSQI (Temporal Awareness and Reasoning Systems for Question Interpretation) research framework. SlinkET is grounded on TimeML, a specification language for capturing temporal and event related information in discourse, which provides an adequate foundation to handle event modality. SlinkET builds on top of a robust event recognizer, and provides each relevant event with a value that specifies the degree of certainty about its factuality; e.g., whether it has happened or holds (factive or counter-factive), whether it is being reported or witnessed by somebody else (evidential), or if it is introduced as a possibility (modal). It is based on well-established technology in the field (namely, finite-state techniques), and informed with corpus-induced knowledge that relies on basic information, such as morphological features, POS, and chunking. SlinkET is under continuing development and it currently achieves a performance ratio of 70% F1-measure.