Eva Hajicova
Also published as:
Eva Hajičová,
E. Hajičová,
Eva Hajicová
We introduce the first version of the Czech RST Discourse Treebank, a collection of Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST), a global coherence model proposed by Mann and Thompson (1988). Each document in the corpus is represented as a single tree-like structure, where discourse units are interconnected through hierarchical rhetorical relations and their relative importance for the main purpose of a text is modeled by the nuclearity principle. The treebank is freely available in the LINDAT/CLARIAH-CZ repository under the Creative Commons license; for some documents, it includes two gold annotations representing divergent yet relevant interpretations. The paper outlines the annotation process, provides corpus statistics and evaluation, and discusses the issue of consistency associated with the global level of textual interpretation. In general, good agreement on the structure and labeling could be achieved on the lowest, local tree level and on the identification of the most central (nuclear) elementary discourse units. Disagreements mostly concerned segmentation and, in the structure, differences in the stepwise process of linking the largest text blocks. The project contributes to the advancement of RST research and its application to real-world text analysis challenges.
Recently, many corpora have been developed that contain multiple annotations of various linguistic phenomena, from morphological categories of words through the syntactic structure of sentences to discourse and coreference relations in texts. Discussions are ongoing on an appropriate annotation scheme for a large amount of diverse information. In our contribution we express our conviction that a multilayer annotation scheme allows the language system to be viewed in its complexity and in the interaction of individual phenomena, and that there are at least two aspects that support such a scheme: (i) A multilayer annotation scheme makes it possible to use the annotation of one layer to design the annotation of another layer (or layers), both conceptually and in the form of a pre-annotation procedure or annotation checking rules. (ii) A multilayer annotation scheme presents a reliable ground for corpus studies based on features across the layers. These aspects are demonstrated on the case of the Prague Dependency Treebank. Its multilayer annotation scheme has withstood the test of time and also serves well for complex textual annotations, in which earlier morpho-syntactic annotations are advantageously used. In addition to referring to previous projects that utilise its annotation scheme, we present several current investigations.
This paper reports on an extended version of a synonym verb class lexicon, newly called SynSemClass (formerly CzEngClass). This lexicon stores cross-lingual semantically similar verb senses in synonym classes extracted from a richly annotated parallel corpus, the Prague Czech-English Dependency Treebank. When building the lexicon, we make use of predicate-argument relations (valency) and link them to semantic roles; in addition, each entry is linked to several external lexicons of more or less “semantic” nature, namely FrameNet, WordNet, VerbNet, OntoNotes and PropBank, and Czech VALLEX. The aim is to provide a linguistic resource that can be used to compare semantic roles and their syntactic properties and features across languages within and across synonym groups (classes, or ’synsets’), as well as gold standard data for automatic NLP experiments with such synonyms, such as synonym discovery, feature mapping, etc. However, perhaps the most important goal is to eventually build an event type ontology that can be referenced and used as a human-readable and human-understandable “database” for all types of events, processes and states. While the current paper describes primarily the content of the lexicon, we are also presenting a preliminary design of a format compatible with Linked Data, on which we are hoping to get feedback during discussions at the workshop. Once the resource (in whichever form) is applied to corpus annotation, deep analysis will be possible using such combined resources as training data.
The view that the representation of information structure (IS) should be a part of (any type of) representation of meaning is based on the fact that IS is a semantically relevant phenomenon. In the contribution, three arguments supporting this view are briefly summarized, namely, the relation of IS to the interpretation of negation and presupposition, the relevance of IS to the understanding of discourse connectivity and for the establishment and interpretation of coreference relations. Afterwards, possible integration of the description of the main ingredient of IS into a meaning representation is illustrated.
This paper describes CzEngClass, a bilingual lexical resource being built to investigate verbal synonymy in bilingual context and to relate semantic roles common to one synonym class to verb arguments (verb valency). In addition, the resource is linked to existing resources with the same or a similar aim: English and Czech WordNet, FrameNet, PropBank, VerbNet (SemLink), and valency lexicons for Czech and English (PDT-Vallex, Vallex, and EngVallex). There are several goals of this work and resource: (a) to provide gold standard data for automatic experiments in the future (such as automatic discovery of synonym classes, word sense disambiguation, assignment of classes to occurrences of verbs in text, coreferential linking of verb and event arguments in text, etc.), (b) to build a core (bilingual) lexicon linked to existing resources, for comparative studies and possibly for training automatic tools, and (c) to enrich the annotation of a parallel treebank, the Prague Czech-English Dependency Treebank, which so far contained valency annotation but has not linked synonymous senses of verbs together. The method used for extracting the synonym classes is a semi-automatic process with a substantial amount of manual work during filtering, role assignment to classes and individual class members’ arguments, and linking to the external lexical resources. We present the first version with 200 classes (about 1800 verbs) and evaluate interannotator agreement using several metrics.
“Interoperability” of annotation schemes is one of the key words in the discussions about annotation of corpora. In the present contribution, we propose to look at the so-called interoperability from (at least) three angles, namely (i) as a relation (and possible interaction or cooperation) of different annotation schemes for different layers or phenomena of a single language, (ii) the possibility to annotate different languages by a single (modified or not) annotation scheme, and (iii) the relation between different annotation schemes for a single language, or for a single phenomenon or layer of the same language. The pros and cons of each of these aspects are discussed as well as their contribution to linguistic studies and natural language processing. It is stressed that a communication and collaboration between different annotation schemes requires an explicit specification and consistency of each of the schemes.
We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This paper gives a high level overview of the underlying linguistic theory (the so-called tectogrammatical annotation) with some details of the most important features like valency annotation, ellipsis reconstruction or coreference.
Currently, research infrastructures are being designed and established in many disciplines since they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools the CLARIN initiative has been funded since 2008 to overcome many of the integration and interoperability hurdles. CLARIN can build on knowledge and work from many projects that were carried out during the last years and wants to build stable and robust services that can be used by researchers. Here service centres will play an important role that have the potential of being persistent and that adhere to criteria as they have been established by CLARIN. In the last year of the so-called preparatory phase these centres are currently developing four use cases that can demonstrate how the various pillars CLARIN has been working on can be integrated. All four use cases fulfil the criteria of being cross-national.
The present paper reports on a preparatory research for building a language corpus annotation scenario capturing the discourse relations in Czech. We primarily focus on the description of the syntactically motivated relations in discourse, basing our findings on the theoretical background of the Prague Dependency Treebank 2.0 and the Penn Discourse Treebank 2. Our aim is to revisit the present-day syntactico-semantic (tectogrammatical) annotation in the Prague Dependency Treebank, extend it for the purposes of a sentence-boundary-crossing representation and eventually to design a new, discourse level of annotation. In this paper, we propose a feasible process of such a transfer, comparing the possibilities the Praguian dependency-based approach offers with the Penn discourse annotation based primarily on the analysis and classification of discourse connectives.
In the present contribution we claim that corpus annotation serves, among other things, as an invaluable test for linguistic theories standing behind the annotation schemes, and as such represents an irreplaceable resource of linguistic information for the build-up of grammars. To support this claim we present four linguistic phenomena for the study and relevant description of which in grammar a deep layer of corpus annotation as introduced in the Prague Dependency Treebank has brought important observations, namely the information structure of the sentence, condition of projectivity and word order, types of dependency relations and textual coreference.
The claim made in this paper is that in a formal description of language, it is possible and useful to work with dependency-based underlying representations of sentences (tectogrammatical representations) meeting the condition of projectivity. The reasons for the inclusion of this condition into the definition of the tectogrammatical representations are both formally and empirically sound (Section 1). An analysis of the material offered by the Prague Dependency Treebank with annotations of the underlying syntactic structure of sentences (described in Section 2) has led to an interesting classification of non-projective constructions in Czech (Section 3). It documents that most (types of) constructions that appear to be non-projective in the surface shape of sentences can be described by means of projective trees. The realization of the surface word order (with the use of movement rules) is then relegated to the morphemic level, where the representation of the sentence has the shape of a string rather than a tree.
The annotation of the Prague Dependency Treebank (PDT) is conceived of as a multilayered scenario that also comprises dependency representations (tectogrammatical tree structures, TGTS's) of the underlying structure of the sentences. TGTS's capture three basic aspects of the underlying structure of sentences: (a) the dependency tree structure, (b) the kinds of dependency syntactic relations, and (c) the basic characteristics of the topic-focus articulation (TFA). Since the PDT is a large collection and the annotations on the deepest layer are to a large extent performed by several human annotators (based on an automatic preprocessing module), it is more than necessary to observe the consistency of the annotators and the agreement among them. In the present paper, we summarize the results of the evaluation of parallel annotations of several samples taken from PDT and the measures adopted to improve the consistency of annotations.
After a brief characterization of the theory of the topic-focus articulation of the sentence (TFA), rules are formulated that determine the assignment of appropriate values of the TFA attribute in the process of syntactico-semantic tagging of a very large corpus of Czech.
The procedure of reconstruction of the underlying structure of sentences (in the process of tagging a very large corpus of Czech) is described, with a special attention paid to the conditions under which the reconstruction of ellipted nodes is carried out.
The dichotomy of topic and focus, based, in the Praguean Functional Generative Description, on the scale of communicative dynamism, is relevant not only for a possible placement of the sentence in a context, but also for its semantic interpretation. An automatic identification of topic and focus may use the input information on word order, on the systemic ordering of kinds of complementations (reflected by the underlying order of the items included in the focus), on definiteness, and on lexical semantic properties of words. An algorithm for the analysis of English sentences has been implemented and is discussed and illustrated on several examples.
An algorithm for automatic identification of topic and focus of the sentence is presented, based on dependency syntax and using written input, which is much more ambiguous than spoken utterance.
The paper develops further the idea of using the notion of the stock of shared knowledge (SSK) for anaphora resolution following a more subtle treatment of the influence of the topic/focus articulation of the sentence on the degrees of salience of items of the SSK. An algorithmic evaluation procedure of the SSK is formulated taking into account the notions of contextual boundness, syntactic associations, complexity of the sentences and existence/nonexistence of possible competitors, and a general evaluating function is proposed, essential for the process of anaphora resolution. In the present paper the analysis is performed for Czech; however, the considerations are claimed to be of a universal validity, the actual relations between different factors and the values, of course, being language-dependent.
The authors collect lexical data for a module of English syntactic analysis in the context of a bilingual research project. The computer-usable version of OALD (Hornby, 1974) is used as the primary source. The main focus is on the structure and derivation of valency frames for verbal entries in the target lexicon. The complex relation between OALD's verb subcategorization codes and the target complementation paradigms is illustrated, and an approach to the design of the derivation procedure is suggested.
The hierarchy of salience of the items of the knowledge assumed by the speaker to be shared by him and by the hearer constitutes one aspect of a dynamic account of discourse (Sect. 1). It is claimed that a representation of this hierarchy is a good support for discourse analysis (reference assignment, Sect. 2) and for discourse production (pronominalization, definite description, Sect. 3).
A system of fail-soft (emergency) measures for a production-oriented MT system is discussed, stating first the specific purposes of such a system, and showing then, how these measures are being used in the system of English-to-Czech machine translation as prepared by the group of mathematical linguistics at Charles University in Prague.
In the present paper we characterize in more detail some of the aspects of a question answering system using as its starting point the underlying structure of sentences (which with some approaches can be identified with the level of meaning or of logical form). First of all, the criteria are described that are used to identify the elementary units of underlying structure and the operations conjoining them into complex units (Sect. 1), then the main types of units and operations resulting from an empirical investigation on the basis of the criteria are registered (Sect. 2), and finally the rules of inference, accounting for the relevant aspects of the relationship between linguistic and cognitive structures, are illustrated (Sect. 3).
The elements of the stock of knowledge shared by the speaker and the hearer change their salience, in the sense of being immediately accessible in the hearer's memory. The hierarchy of salience is argued to be a basic component of a mechanism serving for the identification of reference. Some of the regularities of this mechanism are discussed, the description of which is a necessary prerequisite of an automatic understanding of connected texts.
The necessity of and means for distinguishing between a level of linguistic meaning and a domain of "factual knowledge" (or cognitive content) are argued for, supported by a survey of relevant operational criteria. The level of meaning is characterized as a safe base for computational applications, which allows for a set of inference rules accounting for the content (factual relations) of a given domain.