Seth Kulick

2023

pdf bib
Parsing “Early English Books Online” for Linguistic Search
Seth Kulick | Neville Ryant | Beatrice Santorini
Proceedings of the Society for Computation in Linguistics 2023

2022

pdf bib abs
Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis
Seth Kulick | Neville Ryant | Beatrice Santorini
Findings of the Association for Computational Linguistics: NAACL 2022

The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.7-million-word treebank that is an important resource for research in syntactic change, has several properties that present potential challenges for NLP technologies. We describe the key features of PPCEME that make it challenging for parsing, including a larger and more varied set of function tags than in the Penn Treebank, and present results for this corpus using a modified version of the Berkeley Neural Parser and the approach to function tag recovery of Gabbard et al. (2006). While this approach to function tag recovery gives reasonable results, it is in some ways inappropriate for span-based parsers. We also present further evidence of the importance of in-domain pretraining for contextualized word representations. The resulting parser will be used to parse Early English Books Online, a 1.5-billion-word corpus whose utility for the study of syntactic change will be greatly increased with the addition of accurate parse trees.

pdf bib
Parsing Early Modern English for Linguistic Search
Seth Kulick | Neville Ryant | Beatrice Santorini
Proceedings of the Society for Computation in Linguistics 2022

2019

2016

pdf bib abs
Rapid Development of Morphological Analyzers for Typologically Diverse Languages
Seth Kulick | Ann Bies
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Low Resource Language research conducted under DARPA’s Broad Operational Language Translation (BOLT) program required the rapid creation of text corpora of typologically diverse languages (Turkish, Hausa, and Uzbek) which were annotated with morphological information, along with other types of annotation. Since the output of morphological analyzers is a significant aid to morphological annotation, we developed a morphological analyzer for each language in order to support the annotation task, and also as a deliverable by itself. Our framework for analyzer creation results in tables similar to those used in the successful SAMA analyzer for Arabic, but with a more abstract linguistic level, from which the tables are derived. A lexicon was developed from available resources for integration with the analyzer, and given the speed of development and uncertain coverage of the lexicon, we assumed that the analyzer would necessarily be lacking in some coverage for the project annotation. Our analyzer framework was therefore focused on rapid implementation of the key structures of the language, together with accepting “wildcard” solutions as possible analyses for a word with an unknown stem, building upon our similar experiences with morphological annotation with Modern Standard Arabic and Egyptian Arabic.

2015

2014

pdf bib abs
Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development
Mohamed Maamouri | Ann Bies | Seth Kulick | Michael Ciul | Nizar Habash | Ramy Eskander
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the parallel development of an Egyptian Arabic Treebank and a morphological analyzer for Egyptian Arabic (CALIMA). By the very nature of Egyptian Arabic, the data collected is informal, for example Discussion Forum text, which we use for the treebank discussed here. In addition, Egyptian Arabic, like other Arabic dialects, is sufficiently different from Modern Standard Arabic (MSA) that tools and techniques developed for MSA cannot simply be transferred to Egyptian Arabic. In particular, a morphological analyzer for Egyptian Arabic is needed to mediate between the written text and the segmented, vocalized form used for the syntactic trees. This required a feedback loop between the treebank team and the analyzer team, as improvements in each area were fed to the other. Close cooperation between the annotation team and the tool development team was therefore necessary, and it was to their mutual benefit. Collaboration on this type of challenge, where tools and resources are limited, proved to be remarkably synergistic and opens the way to further fruitful work on Arabic dialects.

pdf bib abs
Incorporating Alternate Translations into English Translation Treebank
Ann Bies | Justin Mott | Seth Kulick | Jennifer Garland | Colin Warner
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

New annotation guidelines and new processing methods were developed to accommodate English treebank annotation of a parallel English/Chinese corpus of web data that includes alternate English translations (one fluent, one literal) of expressions that are idiomatic in the Chinese source. In previous machine translation programs, alternate translations of idiomatic expressions had been present in untreebanked data only, but due to the high frequency of such expressions in informal genres such as discussion forums, machine translation system developers requested that alternatives be added to the treebanked data as well. In consultation with machine translation researchers, we chose a pragmatic approach of syntactically annotating only the fluent translation, while retaining the alternate literal translation as a segregated node in the tree. Since the literal translation alternates are often incompatible with English syntax, this approach allows us to create fluent trees without losing information. This resource is expected to support machine translation efforts, and the flexibility provided by the alternate translations is an enhancement to the treebank for this purpose.

pdf bib
The Penn Parsed Corpus of Modern British English: First Parsing Results and Analysis
Seth Kulick | Anthony Kroch | Beatrice Santorini
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Parser Evaluation Using Derivation Trees: A Complement to evalb
Seth Kulick | Ann Bies | Justin Mott | Anthony Kroch | Beatrice Santorini | Mark Liberman
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Inter-annotator Agreement for ERE annotation
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation

2013

pdf bib
Using Derivation Trees for Informative Treebank Inter-Annotator Agreement Evaluation
Seth Kulick | Ann Bies | Justin Mott | Mohamed Maamouri | Beatrice Santorini | Anthony Kroch
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Automatic Correction and Extension of Morphological Annotations
Ramy Eskander | Nizar Habash | Ann Bies | Seth Kulick | Mohamed Maamouri
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

pdf bib abs
Further Developments in Treebank Error Detection Using Derivation Trees
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This work describes how derivation tree fragments based on a variant of Tree Adjoining Grammar (TAG) can be used to check treebank consistency. Annotations of word sequences are compared both for their internal structural consistency and for their external relation to the rest of the tree. We expand on earlier work in this area in three ways. First, we provide a more complete description of the system, showing how a naive use of TAG structures will not work, leading to a necessary refinement. We also provide a more complete account of the processing pipeline, including the grouping together of structurally similar errors and the elimination of duplicates. Second, we include the new experimental external relation check to find an additional class of errors. Third, we broaden the evaluation to include both the internal and external relation checks, and evaluate the system on both an Arabic and an English treebank. The evaluation has been successful enough that the internal check has been integrated into the standard pipeline for current English treebank construction at the Linguistic Data Consortium.

pdf bib abs
Expanding Arabic Treebank to Speech: Results from Broadcast News
Mohamed Maamouri | Ann Bies | Seth Kulick
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Treebanking a large corpus of relatively structured speech transcribed from various Arabic Broadcast News (BN) sources has allowed us to begin to address the many challenges of annotating and parsing a speech corpus in Arabic. The now completed Arabic Treebank BN corpus consists of 432,976 source tokens (517,080 tree tokens) in 120 files of manually transcribed news broadcasts. Because news broadcasts are predominantly scripted, most of the transcribed speech is in Modern Standard Arabic (MSA). As such, the lexical and syntactic structures are very similar to the MSA in written newswire data. However, because this is spoken news, cross-linguistic speech effects such as restarts, fillers, hesitations, and repetitions are common. There is also a certain amount of dialect data present in the BN corpus, from on-the-street interviews and similar informal contexts. In this paper, we describe the finished corpus and focus on some of the necessary additions to our annotation guidelines, along with some of the technical challenges of a treebanked speech corpus and an initial parsing evaluation for this data. This corpus will be available to the community in 2012 as an LDC publication.

pdf bib
Using Supertags and Encoded Annotation Principles for Improved Dependency to Phrase Structure Conversion
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2011

pdf bib
Using Derivation Trees for Treebank Error Detection
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib abs
From Speech to Trees: Applying Treebank Annotation to Arabic Broadcast News
Mohamed Maamouri | Ann Bies | Seth Kulick | Wajdi Zaghouani | Dave Graff | Mike Ciul
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Arabic Treebank (ATB) Project at the Linguistic Data Consortium (LDC) has embarked on a large corpus of Broadcast News (BN) transcriptions, and this has led to a number of new challenges for the data processing and annotation procedures that were originally developed for Arabic newswire text (ATB1, ATB2 and ATB3). The corpus requirements currently posed by the DARPA GALE Program, including English translation of Arabic BN transcripts, word-level alignment of Arabic and English data, and creation of a corresponding English Treebank, place significant new constraints on ATB corpus creation, and require careful coordination among a wide assortment of concurrent activities and participants. Nonetheless, the ATB's newly improved pipeline and revised annotation guidelines for newswire have proven to be robust enough that very few changes were necessary to account for the new genre of data. This paper presents the points where some adaptation has been necessary, and the overall pipeline as used in the production of BN ATB data.

pdf bib abs
Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank
Seth Kulick | Ann Bies | Mohamed Maamouri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Complications arise for standoff annotation when the annotation is not on the source text itself, but on a more abstract representation. This is particularly the case in a language such as Arabic with morphological and orthographic challenges, and we discuss various aspects of these issues in the context of the Arabic Treebank. The Standard Arabic Morphological Analyzer (SAMA) is closely integrated into the annotation workflow, as the basis for the abstraction between the explicit source text and the more abstract token representation. However, this integration with SAMA gives rise to various problems for the annotation workflow and for maintaining the link between the Treebank and SAMA. In this paper we discuss how we have overcome these problems with consistent and more precise categorization of all of the tokens for their relationship with SAMA. We also discuss how we have improved the creation of several distinct alternative forms of the tokens used in the syntactic trees. As a result, the Treebank provides a resource relating the different forms of the same underlying token with varying degrees of vocalization, in terms of how they relate (1) to each other, (2) to the syntactic structure, and (3) to the morphological analyzer.

pdf bib
A Treebank Query System Based on an Extracted Tree Grammar
Seth Kulick | Ann Bies
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer
Seth Kulick
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
A TAG-derived Database for Treebank Search and Parser Analysis
Seth Kulick | Ann Bies
Proceedings of the 10th International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+10)

2008

pdf bib abs
Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation
Mohamed Maamouri | Seth Kulick | Ann Bies
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Arabic Treebank (ATB), released by the Linguistic Data Consortium, contains multiple annotation files for each source file, due in part to the role of diacritic inclusion in the annotation process. The data is made available in both vocalized and unvocalized forms, with and without the diacritic marks, respectively. Much parsing work with the ATB has used the unvocalized form, on the basis that it more closely represents the real-world situation. We point out some problems with this usage of the unvocalized data and explain why the unvocalized form does not in fact represent real-world data. This is due to some aspects of the treebank annotation that to our knowledge have never before been published.

pdf bib abs
Enhancing the Arabic Treebank: a Collaborative Effort toward New Annotation Guidelines
Mohamed Maamouri | Ann Bies | Seth Kulick
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Arabic Treebank team at the Linguistic Data Consortium has significantly revised and enhanced its annotation guidelines and procedure over the past year. Improvements were made to both the morphological and syntactic annotation guidelines, and annotators were trained in the new guidelines, focusing on areas of low inter-annotator agreement. The revised guidelines are now being applied in annotation production, and the combination of the revised guidelines and a period of intensive annotator training has raised inter-annotator agreement f-measure scores already and has also improved parsing results.

pdf bib
Construct State Modification in the Arabic Treebank
Ryan Gabbard | Seth Kulick
Proceedings of ACL-08: HLT, Short Papers

2007

pdf bib
Determining Case in Arabic: Learning Complex Linguistic Behavior Requires Complex Linguistic Features
Nizar Habash | Ryan Gabbard | Owen Rambow | Seth Kulick | Mitch Marcus
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib abs
Diacritization: A Challenge to Arabic Treebank Annotation and Parsing
Mohamed Maamouri | Seth Kulick | Ann Bies
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT

Arabic diacritization (referred to sometimes as vocalization or vowelling), defined as the full or partial representation of short vowels, shadda (consonantal length or gemination), tanween (nunation or definiteness), and hamza (the glottal stop and its support letters), is still largely understudied in the current NLP literature. In this paper, the lack of diacritics in standard Arabic texts is presented as a major challenge to most Arabic natural language processing tasks, including parsing. Recent studies (Messaoudi, et al. 2004; Vergyri & Kirchhoff 2004; Zitouni, et al. 2006 and Maamouri, et al. forthcoming) about the place and impact of diacritization in text-based NLP research are presented along with an analysis of the weight of the missing diacritics on Treebank morphological and syntactic analyses and the impact on parser development.

We report on the success of a two-pass approach to annotating metadata, speech effects and syntactic structure in English conversational speech: separately annotating transcribed speech for structural metadata, or structural events (fillers, speech repairs (or edit disfluencies), and SUs, or syntactic/semantic units), and for syntactic structure (treebanking constituent structure and shallow argument structure). The two annotations were then combined into a single representation. Certain alignment issues between the two types of annotation led to the discovery and correction of annotation errors in each, resulting in a more accurate and useful resource. The development of this corpus was motivated by the need to have both metadata and syntactic structure annotated in order to support synergistic work on speech parsing and structural event detection. Automatic detection of these speech phenomena would simultaneously improve parsing accuracy and provide a mechanism for cleaning up transcriptions for downstream text processing. Similarly, constraints imposed by text processing systems such as parsers can be used to help improve identification of disfluencies and sentence boundaries. This paper reports on our efforts to develop a linguistic resource providing both spoken metadata and syntactic structure information, and describes the resulting corpus of English conversational speech.

pdf bib
Fully Parsing the Penn Treebank
Ryan Gabbard | Seth Kulick | Mitchell Marcus
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

Recent work in machine translation and information extraction has demonstrated the utility of a level that represents the predicate-argument structure. It would be especially useful for machine translation to have two such Proposition Banks, one for each language under consideration. A Proposition Bank for English has been developed over the last few years, and we describe here our development of a tool for facilitating the development of a Chinese Proposition Bank. We also discuss some issues specific to the Chinese Treebank that complicate the matter of mapping syntactic representation to a predicate-argument level, and report on some preliminary evaluation of the accuracy of the semantic tagging tool.

1998

pdf bib
TAG and raising in VSO languages
Heidi Harley | Seth Kulick
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

pdf bib
Partial proof trees and structural modalities
Aravind K. Joshi | Seth Kulick | Natasha Kurtonina
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

pdf bib
Clitic climbing in Romance: “Restructuring”, causatives, and object-control verbs
Seth Kulick
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

1995

pdf bib abs
Heuristics and Parse Ranking
B. Srinivas | Christine Doran | Seth Kulick
Proceedings of the Fourth International Workshop on Parsing Technologies

There are currently two philosophies for building grammars and parsers: statistically induced grammars and wide-coverage grammars. One way to combine the strengths of both approaches is to have a wide-coverage grammar with a heuristic component which is domain-independent but whose contribution is tuned to particular domains. In this paper, we discuss a three-stage approach to disambiguation in the context of a lexicalized grammar, using a variety of domain-independent heuristic techniques. We present a training algorithm which uses hand-bracketed treebank parses to set the weights of these heuristics. We compare the performance of our grammar against the performance of the IBM statistical grammar, using both untrained and trained weights for the heuristics.

pdf bib
Using Higher-Order Logic Programming for Semantic Interpretation of Coordinate Constructs
Seth Kulick
33rd Annual Meeting of the Association for Computational Linguistics