pdf
bib
Proceedings of the Fourth International Workshop on Parsing Technologies
pdf
bib
abs
Acyclic Contextsensitive Grammars
Erik Aarts
A grammar formalism is introduced that generates parse trees with crossing branches. The uniform recognition problem is NPcomplete, but for any fixed grammar the recognition problem is polynomial.
pdf
bib
abs
Parsing in Dialogue Systems Using Typed Feature Structures
Rieks op den Akker

Hugo ter Doest

Mark Moll

Anton Nijholt
The analysis of natural language in the context of keyboarddriven dialogue systems is the central issue addressed in this paper. A module that corrects typing errors, performs domainspecific morphological analysis is developed. A parser for typed unification grammars is designed and implemented in C++; for description of the lexicon and the grammer a specialised specification language is developed. It is argued that typed unification grammars and especially the newly developed specification language are convenient formalisms for describing natural language use in dialogue systems. Research on these issues is carried out in the context of the SCHISMA project, a research project in linguistic engineering; participants in SCHISMA are KPN Research and the University of Twente.
pdf
Parallel Parsing: Different Distribution Schemata for Charts
Jan W. Amtrup
pdf
abs
A Fuzzy Approach to Erroneous Inputs in ContextFree Language Recognition
Peter R.J. Asveld
Using fuzzy contextfree grammars one can easily describe a finite number of ways to derive incorrect strings together with their degree of correctness. However, in general there is an infinite number of ways to perform a certain task wrongly. In this paper we introduce a generalization of fuzzy contextfree grammars, the socalled fuzzy contextfree Kgrammars, to model the situation of malting a finite choice out of an infinity of possible grammatical errors during each contextfree derivation step. Under minor assumptions on the parameter K this model happens to be a very general framework to describe correctly as well as erroneously derived sentences by a single generating mechanism. Our first result characterizes the generating capacity of these fuzzy contextfree Kgrammars. As consequences we obtain: (i) bounds on modeling grammatical errors within the framework of fuzzy contextfree grammars, and (ii) the fact that the family of languages generated by fuzzy contextfree Kgrammars shares closure properties very similar to those of the family of ordinary contextfree languages. The second part of the paper is devoted to a few algorithms to recognize fuzzy contextfree languages: viz. a variant of a functional version of CockeYoungerKasami’s algorithm and some recursive descent algorithms. These algorithms tum out to be robust in some very elementary sense and they can easily be extended to corresponding parsing algorithms.
pdf
abs
Parsing NonImmediate Dominance Relations
Tilman Becker

Owen Rambow
We present a new technique for parsing grammar formalisms that express nonimmediate dominance relations by ‘dominancelinks’. Dominance links have been introduced in various formalisms such as extensions to CFG and TAG in order to capture longdistance dependencies in freeword order languages (Becker et al., 1991; Rambow, 1994). We show how the addition of ‘link counters’ to standard parsing algorithms such as CKY and Earleybased methods for TAG results in a polynomial time complexity algorithm for parsing lexicalized VTAG, a multicomponent version of TAGs defined in (Rambow, 1994). A variant of this method has previously been applied to contextfree grammar based formalisms such as UVGDL.
pdf
abs
Yet Another 0(n^{6}) Recognition Algorithm for Mildly ContextSensitive Languages
Pierre Boullier
VijayShanker and Weir have shown in [17] that Tree Adjoining Grammars and Combinatory Categorial Grammars can be transformed into equivalent Linear Indexed Grammars (LIGs) which can be recognized in 0(n^{6}) time using a CockeKasamiYounger style algorithm. This paper exhibits another recognition algorithm for LIGs, with the same upperbound complexity, but whose average case behaves much better. This algorithm works in two steps: first a general contextfree parsing algorithm (using the underlying contextfree grammar) builds a shared parse forest, and second, the LIG properties are checked on this forest. This check is based upon the composition of simple relations and does not require any computation of symbol stacks.
pdf
abs
Developing and Evaluating a Probabilistic LR Parser of PartofSpeech and Punctuation Labels
Ted Briscoe

John Carroll
We describe an approach to robust domainindependent syntactic parsing of unrestricted naturallyoccurring (English) input. The technique involves parsing sequences of partofspeech and punctuation labels using a unificationbased grammar coupled with a probabilistic LR parser. We describe the coverage of several corpora using this grammar and report the results of a parsing experiment using probabilities derived from bracketed training data. We report the first substantial experiments to assess the contribution of punctuation to deriving an accurate syntactic analysis, by parsing identical texts both with and without naturallyoccurring punctuation marks.
pdf
abs
An Abstract Machine for AttributeValue Logics
Bob Carpenter

Yan Qu
A direct abstract machine implementation of the core attributevalue logic operations is shown to decrease the number of operations and conserve the amount of storage required when compared to interpreters or indirect compilers. In this paper, we describe the fundamental data structures and compilation techniques that we have employed to develop a unification and constraintresolution engine capable of performance rivaling that of directly compiled Prolog terms while greatly exceeding Prolog in flexibility, expressiveness and modularity. In this paper, we will discuss the core architecture of our machine. We begin with a survey of the data structures supporting the small set of attributevalue logic instructions. These instructions manipulate feature structures by means of features, equality, and typing, and manipulate the program state by search and sequencing operations. We further show how these core operations can be integrated with a broad range of standard parsing techniques. Feature structures improve upon Prolog terms by allowing data to be organized by feature rather than by position. This encourages modular program development through the use of sparse structural descriptions which can be logically conjoined into larger units and directly executed. Standard linguistic representations, even of relatively simple local syntactic and semantic structures, typically run to hundreds of substructures. The type discipline we impose organizes information in an objectoriented manner by the multiple inheritance of classes and their associated features and type value constraints. In practice, this allows the construction of largescale grammars in a relatively short period of time. At runtime, eager copying and structuresharing is replaced with lazy, incremental, and localized branch and write operations. In order to allow for applications with parallel search, incremental backtracking can be localized to disjunctive choice points within the description of a single structure, thus supporting the kind of conditional mutual consistency checks used in modern grammatical theories such as HPSG, GB, and LFG. Further attention is paid to the bytecoding of instructions and their efficient indexing and subsequent retrieval, all of which is keyed on type information.
pdf
abs
A ChunkingandRaising Partial Parser
HsinHsi Chen

YueShi Lee
Parsing is often seen as a combinatorial problem. It is not due to the properties of the natural languages, but due to the parsing strategies. This paper investigates a Constrained Grammar extracted from a Treebank and applies it in a noncombinatorial partial parser. This parser is a simpler version of a chunkingandraising parser. The chunking and raising actions can be done in linear time. The shortterm goal of this research is to help the development of a partially bracketed corpus, i.e., a simpler version of a treebank. The longterm goal is to provide high level linguistic constraints for many natural language applications.
pdf
Distributed Parsing With HPSG Grammars
Abdel Kader Diagne

Walter Kasper

HansUlrich Krieger
pdf
Chartbased Incremental Semantics Construction with Anaphora Resolution Using 𝜆DRT
Ingrid Fischer

Bernd Geistert

Günther Görz
pdf
Term Encoding of Typed Feature Structures
Dale Gerdemann
pdf
abs
Generic Rules and NonConstituent Coordination
Julio Gonzalo

Teresa Solías
We present a metagrammatical formalism, generic rules, to give a default interpretation to grammar rules. Our formalism introduces a process of dynamic binding interfacing the level of pure grammatical knowledge representation and the parsing level. We present an approach to nonconstituent coordination within categorial grammars, and reformulate it as a generic rule. This reformulation is contextfree parsable and reduces drastically the search space associated to the parsing task for such phenomena.
pdf
abs
A Robust Parsing Algorithm for Link Grammars
Dennis Grinberg

John Lafferty

Daniel Sleator
In this paper we present a robust parsing algorithm based on the link grammar formalism for parsing natural languages. Our algorithm is a natural extension of the original dynamic programming recognition algorithm which recursively counts the number of linkages between two words in the input sentence. The modified algorithm uses the notion of a null link in order to allow a connection between any pair of adjacent words, regardless of their dictionary definitions. The algorithm proceeds by making three dynamic programming passes. In the first pass, the input is parsed using the original algorithm which enforces the constraints on links to ensure grammaticality. In the second pass, the total cost of each substring of words is computed, where cost is determined by the number of null links necessary to parse the substring. The final pass counts the total number of parses with minimal cost. All of the original pruning techniques have natural counterparts in the robust algorithm. When used together with memoization, these techniques enable the algorithm to run efficiently with cubic worstcase complexity. We have implemented these ideas and tested them by parsing the Switchboard corpus of conversational English. This corpus is comprised of approximately three million words of text, corresponding to more than 150 hours of transcribed speech collected from telephone conversations restricted to 70 different topics. Although only a small fraction of the sentences in this corpus are “grammatical” by standard criteria, the robust link grammar parser is able to extract relevant structure for a large portion of the sentences. We present the results of our experiments using this system, including the analyses of selected and random sentences from the corpus. We placed a version of the robust parser on the Word Wide Web for experimentation. It can be reached at URL
http://www.cs.cmu.edu/afs/es.emu.edu/project/link/www/robust.html. In this version there are some limitations such as the maximum length of a sentence in words and the maximum amount of memory the parser can use.
pdf
abs
An Implementation of Syntactic Analysis of Czech
Tomáš Holan

Vladislav Kuboň

Martin Plátek
This paper describes current results achieved during the work on parsing of a freewordorder natural language (Czech) . It contains theoretical base for a new class of grammars  CFG extended for dependecies and nonprojectivities – and also the description of the implementation of a parser and grammarchecker. The paper also describes some typical problems of parsing of freewordorder languages and their solutions (or discusssion of those problems), which are still subject of investigation. The implementation described here serves currently as a testing tool for the development of a large scale grammar of Czech. Some of the quantitative data from a processing of test sentences are also included.
pdf
abs
Analyzing Coordinate Structures Including Punctuation in English
Sadao Kurohashi
We present a met hod of identifying coordinate structure scopes and determining usages of commas in sentences at the same time. All possible interpretations concerning comma usages and coordinate structure scopes are ranked by taking advantage of parallelism between conjoined phrases/clauses/sentences and calculating their similarity scores. We evaluated this method through experiments on heldout test sentences and obtained promising results: both the success ratio of interpreting commas and that of detecting CS scopes were about 80%.
pdf
On Parsing Control for Efficient Text Analysis
Fabio Ciravegna

Alberto Lavelli
pdf
abs
A Practical Dependency Parser
Vincenzo Lombardo

Leonardo Lesmo
The working assumption is that cognitive modeling of NLP and engineering solutions to free text parsing can converge to optimal parsing. The claim of the paper is that the methodology to achieve such a result is to develop a concrete environment with a flexible parser, that allows the testing of various psycholinguistic strategies on real texts. In this paper we outline a flexible parser based on a dependency grammar.
pdf
abs
A Labelled Analytic Theorem Proving Environment for Categorial Grammar
Saturnino F. LuzFilho

Patrick Sturt
We present a system for the investigation of computational properties of categorial grammar parsing based on a labelled analytic tableaux theorem prover. This proof method allows us to take a modular approach, in which the basic grammar can be kept constant, while a range of categorial calculi can be captured by assigning different properties to the labelling algebra. The theorem proving strategy is particularly well suited to the treatment of categorial grammar, because it allows us to distribute the computational cost between the algorithm which deals with the grammatical types and the algebraic checker which constrains the derivation.
pdf
abs
A UnificationBased ID/LP Parsing Schema
Frank Morawietz
In contemporary natural language formalisms like HPSG (Pollard and Sag 1994) the ID/LP format is used to separate the information on dominance from the one on linear precedence thereby allowing significant generalizations on word order. In this paper, we define unification ID/LP grammars. But as mentioned in Seiffert (1991) there are problems concerning the locality of the information determining LP acceptability during parsing. Since one is dealing with partially specified data, the information that is relevant to decide whether the local tree under construction is LP acceptable might be instantiated further during processing. In this paper we propose a modification of the Earley/Shieber algorithm on direct parsing of ID/LP grammars. We extend the items involved to include the relevant underspecified information using it in the completion steps to ensure the acceptability of the resulting structure. Following Sikkel (1993) we define it not as an algorithm, but as a parsing schema to allow the most abstract representation.
pdf
abs
Parsing Without Grammar
Shinsuke Mori

Makoto Nagao
We describe and evaluate experimentally a method to parse a tagged corpus without grammar modeling a natural language on contextfree language. This method is based on the following three hypotheses. 1) Partofspeech sequences on the righthand side of a rewriting rule are less constrained as to what partofspeech precedes and follows them than nonconstituent sequences. 2) Partofspeech sequences directly derived from the same nonterminal symbol have similar environments. 3) The most suitable set of rewriting rules makes the greatest reduction of the corpus size. Based on these hypotheses, the system finds a set of constituentlike partofspeech sequences and replaces them with a new symbol. The repetition of these processes brings us a set of rewriting rules, a grammar, and the bracketed corpus.
pdf
A Formalism and a Parser for Lexicalised Dependency Grammars
Alexis Nasr
pdf
abs
Errortolerant Finite State Recognition
Kemal Oflazer
Errortolerant recognition enables the recognition of strings that deviate slightly from any string in the regular set recognized by the underlying finite state recognizer. In the context of natural language processing, it has applications in errortolerant morphological analysis, and spelling correction. After a description of the concepts and algorithms involved, we give examples from these two applications: In morphological analysis, errortolerant recognition allows misspelled input word forms to be corrected, and morphologically analyzed concurrently. The algorithm can be applied to the moiphological analysis of any language whose morphology is fully captured by a single (and possibly very large) finite state transducer, regardless of the word formation processes (such as agglutination or productive compounding) and morphographemic phenomena involved. We present an application to error tolerant analysis of agglutinative morphology of Turkish words. In spelling correction, errortolerant recognition can be used to enumerate correct candidate forms from a given misspelled string within a certain edit distance. It can be applied to any language whose morphology is fully described by a finite state transducer, or with a word list comprising all inflected forms with very large word lists of root and inflected forms (some containing well over 200,000 forms), generating all candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a SparcStation 10/41. For spelling correction in Turkish, errortolerant recognition operating with a (circular) recognizer of Turkish words (with about 29,000 states and 119,000 transitions) can generate all candidate words in less than 20 milliseconds (with edit distance 1). Spelling correction using a recognizer constructed from a large word German list that simulates compounding, also indicates that the approach is applicable in such cases.
pdf
abs
A Novel Framework for Reductionistic Statistical Parsing
Christer Samuelsson
A reductionistic statistical framework for partofspeech tagging and surface syntactic parsing is presented that has the same expressive power as the highly successful Constraint Grammar approach, see [Karlsson et al. 1995]. The structure of the Constraint Grammar rules allows them to be viewed as conditional probabilities that can be used to update the lexical tag probabilities, after which lowprobability tags are repeatedly removed. Experiments using strictly conventional information sources on the Susanne and Teleman corpora indicate that the system performs as well as a traditional HMMbased partofspeech tagger, yielding stateoftheart results. The scheme also enables using the same information sources as the Constraint Grammar approach, and the hope is that it can improve on the performance of both statistical taggers and surfacesyntactic analyzers.
pdf
abs
A Corpusbased Probabilistic Grammar with Only Two Nonterminals
Satoshi Sekine

Ralph Grishman
The availability of large, syntacticallybracketed corpora such as the Penn Tree Bank affords us the opportunity to automatically build or train broadcoverage grammars, and in particular to train probabilistic grammars. A number of recent parsing experiments have also indicated that grammars whose production probabilities are dependent on the context can be more effective than contextfree grammars in selecting a correct parse. To make maximal use of context, we have automatically constructed, from the Penn Tree Bank version 2, a grammar in which the symbols S and NP are the only real nonterminals, and the other nonterminals or grammatical nodes are in effect embedded into the righthandsides of the S and NP rules. For example, one of the rules extracted from the tree bank would be S > NP VBX JJ CC VBX NP [1] ( where NP is a nonterminal and the other symbols are terminals – partofspeech tags of the Tree Bank). The most common structure in the Tree Bank associated with this expansion is (S NP (VP (VP VBX (ADJ JJ) CC (VP VBX NP)))) [2]. So if our parser uses rule [1] in parsing a sentence, it will generate structure [2] for the corresponding part of the sentence. Using 94% of the Penn Tree Bank for training, we extracted 32,296 distinct rules ( 23,386 for S, and 8,910 for NP). We also built a smaller version of the grammar based on higher frequency patterns for use as a backup when the larger grammar is unable to produce a parse due to memory limitation. We applied this parser to 1,989 Wall Street Journal sentences (separate from the training set and with no limit on sentence length). Of the parsed sentences (1,899), the percentage of nocrossing sentences is 33.9%, and Parseval recall and precision are 73.43% and 72 .61%.
pdf
abs
Heuristics and Parse Ranking
B. Srinivas

Christine Doran

Seth Kulick
There are currently two philosophies for building grammars and parsers – Statistically induced grammars and Widecoverage grammars. One way to combine the strengths of both approaches is to have a widecoverage grammar with a heuristic component which is domain independent but whose contribution is tuned to particular domains. In this paper, we discuss a threestage approach to disambiguation in the context of a lexicalized grammar, using a variety of domain independent heuristic techniques. We present a training algorithm which uses handbracketed treebank parses to set the weights of these heuristics. We compare the performance of our grammar against the performance of the IBM statistical grammar, using both untrained and trained weights for the heuristics.
pdf
abs
Stochastic ParseTree Recognition by a Pushdown Automaton
Frédéric Tendeau
We present the stochastic generalization of what is usually called correctness theorems: we guarantee that the probabilities computed operationally by the parsing algorithms are the same as those defined denotationally on the trees and forests defined by the grammar. The main idea of the paper is to precisely relate the parsing strategy with a parsetree exploration strategy: a computational path of a parsing. algorithm simply performs an exploration of a parsetree for the input portion already parsed. This approach is applied in particular to Earley and LeftCorner parsing algorithms. Probability computations follow parsing operations: looping problems (in rule prediction and subtree recognition) are solved by introducing probability variables (which may not be immediately evaluated). Convergence is ensured by the syntactic construction that leads to stochastic equations systems, which are solved as soon as possible. Our algorithms accept any (probabilistic) CF grammar. No restrictions are made such as prescribing normal form, proscribing empty rules or cyclic grammars.
pdf
An HPSGbased Parser for Automatic Knowledge Acquisition
Kentaro Torisawa

Jun’ichi Tsujii
pdf
Parsing DTree Grammars
K. VijayShanker

David Weir

Owen Rambow
pdf
The Influence of Tagging on the Results of Partial Parsing in German Corpora
Oliver Wauschkuhn
pdf
Partitioning Grammars and Composing Parsers
Fuliang Weng

Andreas Stolcke
pdf
abs
Parsing with Typed Feature Structures
Shuly Wintner

Nissim Francez
In this paper we provide for parsing with respect to grammars expressed in a general TFSbased formalism, a restriction of ALE ([2]). Our motivation being the design of an abstract (WAMlike) machine for the formalism ([14]), we consider parsing as a computational process and use it as an operational semantics to guide the design of the control structures for the abstract machine. We emphasize the notion of abstract typed feature structures (AFSs) that encode the essential information of TFSs and define unification over AFSs rather than over TFSs. We then introduce an explicit construct of multirooted feature structures (MRSs) that naturally extend TFSs and use them to represent phrasal signs as well as grammar rules. We also employ abstractions of MRSs and give the mathematical foundations needed for manipulating them. We then present a simple bottomup chart parser as a model for computation: grammars written in the TFSbased formalism are executed by the parser. Finally, we show that the parser is correct.