Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias (Editors)

Anthology ID:
Marrakech, Morocco
European Language Resources Association (ELRA)
Bib Export formats:

Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Nicoletta Calzolari | Khalid Choukri | Bente Maegaard | Joseph Mariani | Jan Odijk | Stelios Piperidis | Daniel Tapias

pdf bib
Unsupervised Relation Extraction From Web Documents
Kathrin Eichler | Holmer Hemsen | Günter Neumann

The IDEX system is a prototype of an interactive dynamic Information Extraction (IE) system. A user of the system expresses an information request in the form of a topic description, which is used for an initial search in order to retrieve a relevant set of documents. On basis of this set of documents, unsupervised relation extraction and clustering is done by the system. The results of these operations can then be interactively inspected by the user. In this paper we describe the relation extraction and clustering components of the IDEX system. Preliminary evaluation results of these components are presented and an overview is given of possible enhancements to improve the relation extraction and clustering components.

pdf bib
Combining Multiple Models for Speech Information Retrieval
Muath Alzghool | Diana Inkpen

In this article we present a method for combining different information retrieval models in order to increase the retrieval performance in a Speech Information Retrieval task. The formulas for combining the models are tuned on training data. Then the system is evaluated on test data. The task is particularly difficult because the text collection is automatically transcribed spontaneous speech, with many recognition errors. Also, the topics are real information needs, difficult to satisfy. Information Retrieval systems are not able to obtain good results on this data set, except for the case when manual summaries are included.

Event Detection and Summarization in Weblogs with Temporal Collocations
Chun-Yuan Teng | Hsin-Hsi Chen

This paper deals with the relationship between weblog content and time. With the proposed temporal mutual information, we analyze the collocations in time dimension, and the interesting collocations related to special events. The temporal mutual information is employed to observe the strength of term-to-term associations over time. An event detection algorithm identifies the collocations that may cause an event in a specific timestamp. An event summarization algorithm retrieves a set of collocations which describe an event. We compare our approach with the approach without considering the time interval. The experimental results demonstrate that the temporal collocations capture the real world semantics and real world events over time.

The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines
Cvetana Krstev | Ranka Stanković | Duško Vitas | Ivan Obradović

In this paper we present how resources and tools developed within the Human Language Technology Group at the University of Belgrade can be used for tuning queries before submitting them to a web search engine. We argue that the selection of words chosen for a query, which are of paramount importance for the quality of results obtained by the query, can be substantially improved by using various lexical resources, such as morphological dictionaries and wordnets. These dictionaries enable semantic and morphological expansion of the query, the latter being very important in highly inflective languages, such as Serbian. Wordnets can also be used for adding another language to a query, if appropriate, thus making the query bilingual. Problems encountered in retrieving documents of interest are discussed and illustrated by examples. A brief description of resources is given, followed by an outline of the web tool which enables their integration. Finally, a set of examples is chosen in order to illustrate the use of the lexical resources and tool in question. Results obtained for these examples show that the number of documents obtained through a query by using our approach can double and even quadruple in some cases.

The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
Steven Bird | Robert Dale | Bonnie Dorr | Bryan Gibson | Mark Joseph | Min-Yen Kan | Dongwon Lee | Brett Powley | Dragomir Radev | Yee Fan Tan

The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standardized reference corpus derived from the ACL Anthology that can be used for research in scholarly document processing. This corpus, which we call the ACL Anthology Reference Corpus (ACL ARC), brings together the recent activities of a number of research groups around the world. Our goal is to make the corpus widely available, and to encourage other researchers to use it as a standard testbed for experiments in both bibliographic and bibliometric research.

The Linguistic Data Consortium Member Survey: Purpose, Execution and Results
Marian Reed | Denise DiPersio | Christopher Cieri

The Linguistic Data Consortium (LDC) seeks to provide its members with quality linguistic resources and services. In order to pursue these ideals and to remain current, LDC monitors the needs and sentiments of its communities. One mechanism LDC uses to generate feedback on consortium and resource issues is the LDC Member Survey. The survey allows LDC Members and nonmembers to provide LDC with valuable insight into their own unique circumstances, their current and future data needs and their views on LDC’s role in meeting them. When the 2006 Survey was found to be a useful tool for communicating with the Consortium membership, a 2007 Survey was organized and administered. As a result of the surveys, LDC has confirmed that it has made a positive impact on the community and has identified ways to improve the quality of service and the diversity of monthly offerings. Many respondents recommended ways to improve LDC’s functions, ordering mechanism and webpage. Some of these comments have inspired changes to LDC’s operation and strategy.

Language-Sites: Accessing and Presenting Language Resources via Geographic Information Systems
Dieter Van Uytvanck | Alex Dukers | Jacquelijn Ringersma | Paul Trilsbeek

The emerging area of Geographic Information Systems (GIS) has proven to add an interesting dimension to many research projects. Within the language-sites initiative we have brought together a broad range of links to digital language corpora and resources. Via Google Earth’s visually appealing 3D-interface users can spin the globe, zoom into an area they are interested in and access directly the relevant language resources. This paper focuses on several ways of relating the map and the online data (lexica, annotations, multimedia recordings, etc.). Furthermore, we discuss some of the implementation choices that have been made, including future challenges. In addition, we show how scholars (both linguists and anthropologists) are using GIS tools to fulfill their specific research needs by making use of practical examples. This illustrates how both scientists and the general public can benefit from geography-based access to digital language data.

CLARIN: Common Language Resources and Technology Infrastructure
Tamás Váradi | Steven Krauwer | Peter Wittenburg | Martin Wynne | Kimmo Koskenniemi

The paper provides a general introduction to the CLARIN project, a large-scale European research infrastructure project designed to establish an integrated and interoperable infrastructure of language resources and technologies. The goal is to make language resources and technology much more accessible to all researchers working with language material, particularly non-expert users in the Humanities and Social Sciences. CLARIN intends to build a virtual, distributed infrastructure consisting of a federation of trusted digital archives and repositories where language resources and tools are accessible through web services. The CLARIN project consists of 32 partners from 22 countries and is currently engaged in the preparatory phase of developing the infrastructure. The paper describes the objectives of the project in terms of its technical, legal, linguistic and user dimensions.

Evaluating Dialogue Act Tagging with Naive and Expert Annotators
Jeroen Geertzen | Volha Petukhova | Harry Bunt

In this paper the dialogue act annotation of naive and expert annotators, both annotating the same data, are compared in order to characterise the insights annotations made by different kind of annotators may provide for evaluating dialogue act tagsets. It is argued that the agreement among naive annotators provides insight in the clarity of the tagset, whereas agreement among expert annotators provides an indication of how reliably the tagset can be applied when errors are ruled out that are due to deficiencies in understanding the concepts of the tagset, to a lack of experience in using the annotation tool, or to little experience in annotation more generally. An indication of the differences between the two groups in terms of inter-annotator agreement and tagging accuracy on task-oriented dialogue in different domains, annotated with the DIT++ dialogue act tagset is presented, and the annotations of both groups are assessed against a gold standard. Additionally, the effect of the reduction of the tagset’s granularity on the performances of both groups is looked into. In general, it is concluded that the annotations of both groups provide complementary insights in reliability, clarity, and more fundamental conceptual issues.

Validating the Quality of Full Morphological Annotation
Drahomíra „johanka“ Spoustová | Pavel Pecina | Jan Hajič | Miroslav Spousta

In our paper we present a methodology used for low-cost validation of quality of Part-of-Speech annotation of the Prague Dependency Treebank based on multiple re-annotation of data samples carefully selected with the help of several different Part-of-Speech taggers.

Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case
Kremena Ivanova | Ulrich Heid | Sabine Schulte im Walde | Adam Kilgarriff | Jan Pomikálek

Word sketches are part of the Sketch Engine corpus query system. They represent automatic, corpus-derived summaries of the words’ grammatical and collocational behaviour. Besides the corpus itself, word sketches require a sketch grammar, a regular expression-based shallow grammar over the part-of-speech tags, to extract evidence for the properties of the targeted words from the corpus. The paper presents a sketch grammar for German, a language which is not strictly configurational and which shows a considerable amount of case syncretism, and evaluates its accuracy, which has not been done for other sketch grammars. The evaluation focuses on NP case as a crucial part of the German grammar. We present various versions of NP definitions, so demonstrating the influence of grammar detail on precision and recall.

Evaluating Complement-Modifier Distinctions in a Semantically Annotated Corpus
Mark McConville | Myroslava O. Dzikovska

We evaluate the extent to which the distinction between semantically core and non-core dependents as used in the FrameNet corpus corresponds to the traditional distinction between syntactic complements and modifiers of a verb, for the purposes of harvesting a wide-coverage verb lexicon from FrameNet for use in deep linguistic processing applications. We use the VerbNet verb database as our gold standard for making judgements about complement-hood, in conjunction with our own intuitions in cases where VerbNet is incomplete. We conclude that there is enough agreement between the two notions (0.85) to make practical the simple expedient of equating core PP dependents in FrameNet with PP complements in our lexicon. Doing so means that we lose around 13% of PP complements, whilst around 9% of the PP dependents left in the lexicon are not complements.

The PIT Corpus of German Multi-Party Dialogues
Petra-Maria Strauß | Holger Hoffmann | Wolfgang Minker | Heiko Neumann | Günther Palm | Stefan Scherer | Harald Traue | Ulrich Weidenbacher

The PIT corpus is a German multi-media corpus of multi-party dialogues recorded in a Wizard-of-Oz environment at the University of Ulm. The scenario involves two human dialogue partners interacting with a multi-modal dialogue system in the domain of restaurant selection. In this paper we present the characteristics of the data which was recorded in three sessions resulting in a total of 75 dialogues and about 14 hours of audio and video data. The corpus is available at

Annotation and analysis of overlapping speech in political interviews
Martine Adda-Decker | Claude Barras | Gilles Adda | Patrick Paroubek | Philippe Boula de Mareüil | Benoit Habert

Looking for a better understanding of spontaneous speech-related phenomena and to improve automatic speech recognition (ASR), we present here a study on the relationship between the occurrence of overlapping speech segments and disfluencies (filled pauses, repetitions, revisions) in political interviews. First we present our data, and our overlap annotation scheme. We detail our choice of overlapping tags and our definition of disfluencies; the observed ratios of the different overlapping tags are examined, as well as their correlation with of the speaker role and propose two measures to characterise speakers’ interacting attitude: the attack/resist ratio and the attack density. We then study the relationship between the overlapping speech segments and the disfluencies in our corpus, before concluding on the perspectives that our experiments offer.

Data Collection for the CHIL CLEAR 2007 Evaluation Campaign
Nicolas Moreau | Djamel Mostefa | Rainer Stiefelhagen | Susanne Burger | Khalid Choukri

This paper describes in detail the data that was collected and annotated during the third and final year of the CHIL project. This data was used for the CLEAR evaluation campaign in spring 2007. The paper also introduces the CHIL Evaluation Package 2007 that resulted from this campaign including a complete description of the performed evaluation tasks. This evaluation package will be made available to the community through the ELRA General Catalogue.

A Comparative Cross-Domain Study of the Occurrence of Laughter in Meeting and Seminar Corpora
Susanne Burger | Kornel Laskowski | Matthias Woelfel

Laughter is an intrinsic component of human-human interaction, and current automatic speech understanding paradigms stand to gain significantly from its detection and modeling. In the current work, we produce a manual segmentation of laughter in a large corpus of interactive multi-party seminars, which promises to be a valuable resource for acoustic modeling purposes. More importantly, we quantify the occurrence of laughter in this new domain, and contrast our observations with findings for laughter in multi-party meetings. Our analyses show that, with respect to the majority of measures we explore, the occurrence of laughter in both domains is quite similar.

SpatialML: Annotation Scheme, Corpora, and Tools
Inderjeet Mani | Janet Hitzeman | Justin Richer | Dave Harris | Rob Quimby | Ben Wellner

SpatialML is an annotation scheme for marking up references to places in natural language. It covers both named and nominal references to places, grounding them where possible with geo-coordinates, including both relative and absolute locations, and characterizes relationships among places in terms of a region calculus. A freely available annotation editor has been developed for SpatialML, along with a corpus of annotated documents released by the Linguistic Data Consortium. Inter-annotator agreement on SpatialML is 77.0 F-measure for extents on that corpus. An automatic tagger for SpatialML extents scores 78.5 F-measure. A disambiguator scores 93.0 F-measure and 93.4 Predictive Accuracy. In adapting the extent tagger to new domains, merging the training data from the above corpus with annotated data in the new domain provides the best performance.

Building a Corpus of Temporal-Causal Structure
Steven Bethard | William Corvey | Sara Klingenstein | James H. Martin

While recent corpus annotation efforts cover a wide variety of semantic structures, work on temporal and causal relations is still in its early stages. Annotation efforts have typically considered either temporal relations or causal relations, but not both, and no corpora currently exist that allow the relation between temporals and causals to be examined empirically. We have annotated a corpus of 1000 event pairs for both temporal and causal relations, focusing on a relatively frequent construction in which the events are conjoined by the word “and”. Temporal relations were annotated using an extension of the BEFORE and AFTER scheme used in the TempEval competition, and causal relations were annotated using a scheme based on connective phrases like “and as a result”. The annotators achieved 81.2% agreement on temporal relations and 77.8% agreement on causal relations. Analysis of the resulting corpus revealed some interesting findings, for example, that over 30% of CAUSAL relations do not have an underlying BEFORE relation. The corpus was also explored using machine learning methods, and while model performance exceeded all baselines, the results suggested that simple grammatical cues may be insufficient for identifying the more difficult temporal and causal relations.

Computational Models for Event Type Classification in Context
Alessandra Zarcone | Alessandro Lenci

Verb lexical semantic properties are only one of the factors that contribute to the determination of the event type expressed by a sentence, which is instead the result of a complex interplay between the verb meaning and its linguistic context. We report on two computational models for the automatic identification of event type in Italian. Both models use linguistically-motivated features extracted from Italian corpora. The main goal of our experiments is to evaluate the contribution of different types of linguistic indicators to identify the event type of a sentence, as well as to model various cases of context-driven event type shift. In the first model, event type identification has been modelled as a supervised classification task, performed with Maximum Entropy classifiers. In the second model, Self-Organizing Maps have been used to define and identify event types in an unsupervised way. The interaction of various contextual factors in determining the event type expressed by a sentence makes event type identification a highly challenging task. Computational models can help us to shed new light on the real structure of event type classes as well as to gain a better understanding of context-driven semantic shifts.

GMT to +2 or how can TimeML be used in Romanian
Corina Forăscu

The paper describes the construction and usage of the Romanian version of the TimeBank corpus. The success rate of 96.53% for the automatic import of the temporal annotation from English to Romanian shows that the automatic transfer is a worth doing enterprise if temporality is to be studied in another language than the one for which TimeML, the annotation standard used, was developed. A preliminary study identifies the main situations that occurred during the automatic transfer, as well as temporal elements not (yet) marked in the English corpus.

Annotating “tense” in a Tense-less Language
Nianwen Xue | Hua Zhong | Kai-Yun Chen

In the context of Natural Language Processing, annotation is about recovering implicit information that is useful for natural language applications. In this paper we describe a “tense” annotation task for Chinese - a language that does not have grammatical tense - that is designed to infer the temporal location of a situation in relation to the temporal deixis, the moment of speech. If successful, this would be a highly rewarding endeavor as it has application in many natural language systems. Our preliminary experiments show that while this is a very challenging annotation task for which high annotation consistency is very difficult but not impossible to achieve. We show that guidelines that provide a conceptually intuitive framework will be crucial to the success of this annotation effort.

Subdomain Sensitive Statistical Parsing using Raw Corpora
Barbara Plank | Khalil Sima’an

Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g. sports, politics, music), which implies that the statistics gathered by current statistical parsers are mixtures of subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web to introduce subdomain sensitivity into a given parser. We employ statistical techniques for creating an ensemble of domain sensitive parsers, and explore methods for amalgamating their predictions. Our experiments show that introducing domain sensitivity by exploiting raw corpora can improve over a tough, state-of-the-art baseline.

Developing a TT-MCTAG for German with an RCG-based Parser
Laura Kallmeyer | Timm Lichte | Wolfgang Maier | Yannick Parmentier | Johannes Dellert

Developing linguistic resources, in particular grammars, is known to be a complex task in itself, because of (amongst others) redundancy and consistency issues. Furthermore some languages can reveal themselves hard to describe because of specific characteristics, e.g. the free word order in German. In this context, we present (i) a framework allowing to describe tree-based grammars, and (ii) an actual fragment of a core multicomponent tree-adjoining grammar with tree tuples (TT-MCTAG) for German developed using this framework. This framework combines a metagrammar compiler and a parser based on range concatenation grammar (RCG) to respectively check the consistency and the correction of the grammar. The German grammar being developed within this framework already deals with a wide range of scrambling and extraction phenomena.

Some Fine Points of Hybrid Natural Language Parsing
Peter Adolphs | Stephan Oepen | Ulrich Callmeier | Berthold Crysmann | Dan Flickinger | Bernd Kiefer

Large-scale grammar-based parsing systems nowadays increasingly rely on independently developed, more specialized components for pre-processing their input. However, different tools make conflicting assumptions about very basic properties such as tokenization. To make linguistic annotation gathered in pre-processing available to “deep” parsing, a hybrid NLP system needs to establish a coherent mapping between the two universes. Our basic assumption is that tokens are best described by attribute value matrices (AVMs) that may be arbitrarily complex. We propose a powerful resource-sensitive rewrite formalism, “chart mapping”, that allows us to mediate between the token descriptions delivered by shallow pre-processing components and the input expected by the grammar. We furthermore propose a novel way of unknown word treatment where all generic lexical entries are instantiated that are licensed by a particular token AVM. Again, chart mapping is used to give the grammar writer full control as to which items (e.g. native vs. generic lexical items) enter syntactic parsing. We discuss several further uses of the original idea and report on early experiences with the new machinery.

Evaluating and Extending the Coverage of HPSG Grammars: A Case Study for German
Jeremy Nicholson | Valia Kordoni | Yi Zhang | Timothy Baldwin | Rebecca Dridan

In this work, we examine and attempt to extend the coverage of a German HPSG grammar. We use the grammar to parse a corpus of newspaper text and evaluate the proportion of sentences which have a correct attested parse, and analyse the cause of errors in terms of lexical or constructional gaps which prevent parsing. Then, using a maximum entropy model, we evaluate prediction of lexical types in the HPSG type hierarchy for unseen lexemes. By automatically adding entries to the lexicon, we observe that we can increase coverage without substantially decreasing precision.

Robust Parsing with a Large HPSG Grammar
Yi Zhang | Valia Kordoni

In this paper we propose a partial parsing model which achieves robust parsing with a large HPSG grammar. Constraint-based precision grammars, like the HPSG grammar we are using for the experiments reported in this paper, typically lack robustness, especially when applied to real world texts. To maximally recover the linguistic knowledge from an unsuccessful parse, a proper selection model must be used. Also, the efficiency challenges usually presented by the selection model must be answered. Building on the work reported in (Zhang et al., 2007), we further propose a new partial parsing model that splits the parsing process into two stages, both of which use the bottom-up chart-based parsing algorithm. The algorithm is implemented and a preliminary experiment shows promising results.

Modeling Document Dynamics: an Evolutionary Approach
Jahna Otterbacher | Dragomir Radev

News articles about the same event published over time have properties that challenge NLP and IR applications. A cluster of such texts typically exhibits instances of paraphrase and contradiction, as sources update the facts surrounding the story, often due to an ongoing investigation. The current hypothesis is that the stories “evolve” over time, beginning with the first text published on a given topic. This is tested using a phylogenetic approach as well as one based on language modeling. The fit of the evolutionary models is evaluated with respect to how well they facilitate the recovery of chronological relationships between the documents. Over all data clusters, the language modeling approach consistently outperforms the phylogenetics model. However, on manually collected clusters in which the documents are published within short time spans of one another, both have a similar performance, and produce statistically significant results on the document chronology recovery evaluation.

Semantic Vectors: a Scalable Open Source Package and Online Technology Management Application
Dominic Widdows | Kathleen Ferraro

This paper describes the open source SemanticVectors package that efficiently creates semantic vectors for words and documents from a corpus of free text articles. We believe that this package can play an important role in furthering research in distributional semantics, and (perhaps more importantly) can help to significantly reduce the current gap that exists between good research results and valuable applications in production software. Two clear principles that have guided the creation of the package so far include ease-of-use and scalability. The basic package installs and runs easily on any Java-enabled platform, and depends only on Apache Lucene. Dimension reduction is performed using Random Projection, which enables the system to scale much more effectively than other algorithms used for the same purpose. This paper also describes a trial application in the Technology Management domain, which highlights some user-centred design challenges which we believe are also key to successful deployment of this technology.

Revealing Relations between Open and Closed Answers in Questionnaires through Text Clustering Evaluation
Magnus Rosell | Sumithra Velupillai

Open answers in questionnaires contain valuable information that is very time-consuming to analyze manually. We present a method for hypothesis generation from questionnaires based on text clustering. Text clustering is used interactively on the open answers, and the user can explore the cluster contents. The exploration is guided by automatic evaluation of the clusters against a closed answer regarded as a categorization. This simplifies the process of selecting interesting clusters. The user formulates a hypothesis from the relation between the cluster content and the closed answer categorization. We have applied our method on an open answer regarding occupation compared to a closed answer on smoking habits. With no prior knowledge of smoking habits in different occupation groups we have generated the hypothesis that farmers smoke less than the average. The hypothesis is supported by several separate surveys. Closed answers are easy to analyze automatically but are restricted and may miss valuable aspects. Open answers, on the other hand, fully capture the dynamics and diversity of possible outcomes. With our method the process of analyzing open answers becomes feasible.

Personae: a Corpus for Author and Personality Prediction from Text
Kim Luyckx | Walter Daelemans

We present a new corpus for computational stylometry, more specifically authorship attribution and the prediction of author personality from text. Because of the large number of authors (145), the corpus will allow previously impossible studies of variation in features considered predictive for writing style. The innovative meta-information (personality profiles of the authors) associated with these texts allows the study of personality prediction, a not yet very well researched aspect of style. In this paper, we describe the contents of the corpus and show its use in both authorship attribution and personality prediction. We focus on features that have been proven useful in the field of author recognition. Syntactic features like part-of-speech n-grams are generally accepted as not being under the author’s conscious control and therefore providing good clues for predicting gender or authorship. We want to test whether these features are helpful for personality prediction and authorship attribution on a large set of authors. Both tasks are approached as text categorization tasks. First a document representation is constructed based on feature selection from the linguistically analyzed corpus (using the Memory-Based Shallow Parser (MBSP)). These are associated with each of the 145 authors or each of the four components of the Myers-Briggs Type Indicator (Introverted-Extraverted, Sensing-iNtuitive, Thinking-Feeling, Judging-Perceiving). Authorship attribution on 145 authors achieves results around 50%-accuracy. Preliminary results indicate that the first two personality dimensions can be predicted fairly accurately.

Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution
Leanne Spracklin | Diana Inkpen | Amiya Nayak

Traditional Authorship Attribution models extract normalized counts of lexical elements such as nouns, common words and punctuation and use these normalized counts or ratios as features for author fingerprinting. The text is viewed as a “bag-of-words” and the order of words and their position relative to other words is largely ignored. We propose a new method of feature extraction which quantifies the distribution of lexical elements within the text using Kolmogorov complexity estimates. Testing carried out on blog corpora indicates that such measures outperform ratios when used as features in an SVM authorship attribution model. Moreover, by adding complexity estimates to a model using ratios, we were able to increase the F-measure by 5.2-11.8%

An Exchange Format for Multimodal Annotations
Thomas Schmidt | Susan Duncan | Oliver Ehmer | Jeffrey Hoyt | Michael Kipp | Dan Loehr | Magnus Magnusson | Travis Rose | Han Sloetjes

This paper presents the results of a joint effort of a group of multimodality researchers and tool developers to improve the interoperability between several tools used for the annotation of multimodality. We propose a multimodal annotation exchange format, based on the annotation graph formalism, which is supported by import and export routines in the respective tools.

SCARE: a Situated Corpus with Annotated Referring Expressions
Laura Stoia | Darla Magdalene Shockley | Donna K. Byron | Eric Fosler-Lussier

Even though a wealth of speech data is available for the dialog systems research community, the particular field of situated language has yet to find an appropriate free resource. The corpus required to answer research questions related to situated language should connect world information to the human language. In this paper we report on the release of a corpus of English spontaneous instruction giving situated dialogs. The corpus was collected using the Quake environment, a first-person virtual reality game, and consists of pairs of participants completing a direction giver- direction follower scenario. The corpus contains the collected audio and video, as well as word-aligned transcriptions and the positional/gaze information of the player. Referring expressions in the corpus are annotated with the IDs of their virtual world referents.

Annotation by Category: ELAN and ISO DCR
Han Sloetjes | Peter Wittenburg

The Data Category Registry is one of the ISO initiatives towards the establishment of standards for Language Resource management, creation and coding. Successful application of the DCR depends on the availability of tools that can interact with it. This paper describes the first steps that have been taken to provide users of the multimedia annotation tool ELAN, with the means to create references from tiers and annotations to data categories defined in the ISO Data Category Registry. It first gives a brief description of the capabilities of ELAN and the structure of the documents it creates. After a concise overview of the goals and current state of the ISO DCR infrastructure, a description is given of how the preliminary connectivity with the DCR is implemented in ELAN.

A Common Multimedia Annotation Framework for Cross Linking Cultural Heritage Digital Collections
Hennie Brugman | Véronique Malaisé | Laura Hollink

In the context of the CATCH research program that is currently carried out at a number of large Dutch cultural heritage institutions our ambition is to combine and exchange heterogeneous multimedia annotations between projects and institutions. As first step we designed an Annotation Meta Model: a simple but powerful RDF/OWL model mainly addressing the anchoring of annotations to segments of the many different media types used in the collections of the archives, museums and libraries involved. The model includes support for the annotation of annotations themselves, and of segments of annotation values, to be able to layer annotations and in this way enable projects to process each other’s annotation data as the primary data for further annotation. On basis of AMM we designed an application programming interface for accessing annotation repositories and implemented it both as a software library and as a web service. Finally, we report on our experiences with the application of model, API and repository when developing web applications for collection managers in cultural heritage institutions.

Creating and Exploiting Multimodal Annotated Corpora
Philippe Blache | Roxane Bertrand | Gaëlle Ferré

The paper presents a project of the Laboratoire Parole & Langage which aims at collecting, annotating and exploiting a corpus of spoken French in a multimodal perspective. The project directly meets the present needs in linguistics where a growing number of researchers become aware of the fact that a theory of communication which aims at describing real interactions should take into account the complexity of these interactions. However, in order to take into account such a complexity, linguists should have access to spoken corpora annotated in different fields. The paper presents the annotation schemes used in phonetics, morphology and syntax, prosody, gestuality at the LPL together with the type of linguistic description made from the annotations seen in two examples.

The Encoding of lexical implications in VerbNet Predicates of change of locations
Annie Zaenen | Daniel Bobrow | Cleo Condoravdi

This paper describes an attempt to use the information contained in VerbNet to obtain change of location inferences. We show that the information is available but not encoded in a consistent enough form to be optimally useful.

FATE: a FrameNet-Annotated Corpus for Textual Entailment
Aljoscha Burchardt | Marco Pennacchiotti

Several studies indicate that the level of predicate-argument structure is relevant for modeling prevalent phenomena in current textual entailment corpora. Although large resources like FrameNet have recently become available, attempts to integrate this type of information into a system for textual entailment did not confirm the expected gain in performance. The reasons for this are not fully obvious; candidates include FrameNet’s restricted coverage, limitations of semantic parsers, or insufficient modeling of FrameNet information. To enable further insight on this issue, in this paper we present FATE (FrameNet-Annotated Textual Entailment), a manually crafted, fully reliable frame-annotated RTE corpus. The annotation has been carried out over the 800 pairs of the RTE-2 test set. This dataset offers a safe basis for RTE systems to experiment, and enables researchers to develop clearer ideas on how to effectively integrate frame knowledge in semantic inferenence tasks like recognizing textual entailment. We describe and present statistics over the adopted annotation, which introduces a new schema based on full-text annotation of so called relevant frame evoking elements.

Projecting Propbank Roles onto the CCGbank
Stephen Boxwell | Michael White

This paper describes a method of accurately projecting Propbank roles onto constituents in the CCGbank and automatically annotating verbal categories with the semantic roles of their arguments. This method will be used to improve the structure of the derivations in the CCGbank and to facilitate research on semantic role tagging and broad coverage generation with CCG.

Integrating Lexical Units, Synsets and Ontology in the Cornetto Database
Piek Vossen | Isa Maks | Roxane Segers | Hennie VanderVliet

Cornetto is a two-year Stevin project (project number STE05039) in which a lexical semantic database is built that combines Wordnet with Framenet-like information for Dutch. The combination of the two lexical resources (the Dutch Wordnet and the Referentie Bestand Nederlands) will result in a much richer relational database that may improve natural language processing (NLP) technologies, such as word sense-disambiguation, and language-generation systems. In addition to merging the Dutch lexicons, the database is also mapped to a formal ontology to provide a more solid semantic backbone. Since the database represents different traditions and perspectives of semantic organization, a key issue in the project is the alignment of concepts across the resources. This paper discusses our methodology to first automatically align the word meanings and secondly to manually revise the most critical cases.

Complete and Consistent Annotation of WordNet using the Top Concept Ontology
Javier Álvez | Jordi Atserias | Jordi Carrera | Salvador Climent | Egoitz Laparra | Antoni Oliver | German Rigau

This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and made available to the NLP community. Up to now only an initial core set of 1,024 synsets, the so-called Base Concepts, was ontologized in such a way. The work has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy while setting inheritance blockage points. Since this labeling has been set on the EuroWordNet’s Interlingual Index (ILI), it can be also used to populate any other wordnet linked to it through a simple porting process. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks and for testing for the first time componential analysis on real environments. Moreover, the quantitative analysis of the work shows that more than 40% of the nominal part of WordNet is involved in structure errors or inadequacies.

A Conceptual Approach to Web Image Retrieval
Adrian Popescu | Gregory Grefenstette

People use the Internet to find a wide variety of images. Existing image search engines do not understand the pictures they return. The introduction of semantic layers in information retrieval frameworks may enhance the quality of the results compared to existing systems. One important challenge in the field is to develop architectures that fit the requirements of real-life applications, like the Internet search engines. In this paper, we describe Olive, an image retrieval application that exploits a large scale conceptual hierarchy (extracted from WordNet) to automatically reformulate user queries, search for associated images and present results in an interactive and structured fashion. When searching a concept in the hierarchy, Olive reformulates the query using its deepest subtypes in WordNet. On the answers page, the system displays a selection of related classes and proposes a content based retrieval functionality among the pictures sharing the same linguistic label. In order to validate our approach, we run to series of tests to assess the performances of the application and report the results here. First, two precision evaluations over a panel of concepts from different domains are realized and second, a user test is designed so as to assess the interaction with the system.

On the Use of Web Resources and Natural Language Processing Techniques to Improve Automatic Speech Recognition Systems
Gwénolé Lecorvé | Guillaume Gravier | Pascale Sébillot

Language models used in current automatic speech recognition systems are trained on general-purpose corpora and are therefore not relevant to transcribe spoken documents dealing with successive precise topics, such as long multimedia streams, frequently tacking reportages and debates. To overcome this problem, this paper shows that Web resources and natural language processing techniques can be effective to automatically adapt the baseline language model of an automatic speech recognition system to any encountered topic. More precisely, we detail how to characterize the topic of transcription segment and how to collect Web pages from which a topic-specific language model can be trained. Then, an adapted language model is obtained by combining the topic-specific language model with the general-purpose language model. Finally, new transcriptions are generated using the adapted language model and are compared with transcriptions previously obtained with the baseline language model. Experiments show that our topic adaptation technique leads to significant transcription quality gains.

Local Methods for On-Demand Out-of-Vocabulary Word Retrieval
Stanislas Oger | Georges Linarès | Frédéric Béchet

Most of the Web-based methods for lexicon augmenting consist in capturing global semantic features of the targeted domain in order to collect relevant documents from the Web. We suggest that the local context of the out-of-vocabulary (OOV) words contains relevant information on the OOV words. With this information, we propose to use the Web to build locally-augmented lexicons which are used in a final local decoding pass. First, an automatic web based OOV word detection method is proposed. Then, we demonstrate the relevance of the Web for the OOV word retrieval. Different methods are proposed to retrieve the hypothesis words. We finally retrieve about 26% of the OOV words with a lexicon increase of less than 1000 words using the reference context.

Exploring and Enriching a Language Resource Archive via the Web
Marc Kemps-Snijders | Alex Klassmann | Claus Zinn | Peter Berck | Albert Russel | Peter Wittenburg

The “download first, then process paradigm” is still the predominant working method amongst the research community. The web-based paradigm, however, offers many advantages from a tool development and data management perspective as they allow a quick adaptation to changing research environments. Moreover, new ways of combining tools and data are increasingly becoming available and will eventually enable a true web-based workflow approach, thus challenging the “download first, then process” paradigm. The necessary infrastructure for managing, exploring and enriching language resources via the Web will need to be delivered by projects like CLARIN and DARIAH.

Talking and Looking: the SmartWeb Multimodal Interaction Corpus
Florian Schiel | Hannes Mögele

Nowadays portable devices such as smart phones can be used to capture the face of a user simultaneously with the voice input. Server based or even embedded dialogue system might utilize this additional information to detect whether the speaking user addresses the system or other parties or whether the listening user is focused on the display or not. Depending on these findings the dialogue system might change its strategy to interact with the user improving the overall communication between human and system. To develop and test methods for On/Off-Focus detection a multimodal corpus of user-machine interactions was recorded within the German SmartWeb project. The corpus comprises 99 recording sessions of a triad communication between the user, the system and a human companion. The user can address/watch/listen to the system but also talk to his companion, read from the display or simply talk to herself. Facial video is captured with a standard built-in video camera of a smart phone while voice input in being recorded by a high quality close microphone as well as over a realistic transmission line via Bluetooth and WCDMA. The resulting SmartWeb Video Corpus (SVC) can be obtained from the Bavarian Archive for Speech Signals.

In Contrast - A Complex Discourse Connective
Erhard Hinrichs | Monica Lău

This paper presents a corpus-based study of the discourse connective “in contrast”. The corpus data are drawn from the British National Corpus (BNC) and are analyzed at the levels of syntax, discourse structure, and compositional semantics. Following Webber et al. (2003), the paper argues that “in contrast” crucially involves discourse anaphora and, thus, resembles other discourse adverbials such as “then”, “otherwise”, and “nevertheless”. The compositional semantics proposed for other discourse connectives, however, does not straightforwardly generalize to “in contrast”, for which the notions of contrast pairs and contrast properties are essential.

Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems
Georg Rehm | Marina Santini | Alexander Mehler | Pavel Braslavski | Rüdiger Gleim | Andrea Stubbe | Svetlana Symonenko | Mirko Tavosanis | Vedrana Vidulin

We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres.

Error Analysis for Learning-based Coreference Resolution
Olga Uryupina

State-of-the-art coreference resolution engines show similar performance figures (low sixties on the MUC-7 data). Our system with a rich linguistically motivated feature set yields significantly better performance values for a variety of machine learners, but still leaves substantial room for improvement. In this paper we address a relatively unexplored area of coreference resolution - we present a detailed error analysis in order to understand the issues raised by corpus-based approaches to coreference resolution.

From Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank
Lucie Mladová | Šárka Zikánová | Eva Hajičová

The present paper reports on a preparatory research for building a language corpus annotation scenario capturing the discourse relations in Czech. We primarily focus on the description of the syntactically motivated relations in discourse, basing our findings on the theoretical background of the Prague Dependency Treebank 2.0 and the Penn Discourse Treebank 2. Our aim is to revisit the present-day syntactico-semantic (tectogrammatical) annotation in the Prague Dependency Treebank, extend it for the purposes of a sentence-boundary-crossing representation and eventually to design a new, discourse level of annotation. In this paper, we propose a feasible process of such a transfer, comparing the possibilities the Praguian dependency-based approach offers with the Penn discourse annotation based primarily on the analysis and classification of discourse connectives.

A Corpus for Cross-Document Co-reference
David Day | Janet Hitzeman | Michael Wick | Keith Crouch | Massimo Poesio

This paper describes a newly created text corpus of news articles that has been annotated for cross-document co-reference. Being able to robustly resolve references to entities across document boundaries will provide a useful capability for a variety of tasks, ranging from practical information retrieval applications to challenging research in information extraction and natural language understanding. This annotated corpus is intended to encourage the development of systems that can more accurately address this problem. A manual annotation tool was developed that allowed the complete corpus to be searched for likely co-referring entity mentions. This corpus of 257K words links mentions of co-referent people, locations and organizations (subject to some additional constraints). Each of the documents had already been annotated for within-document co-reference by the LDC as part of the ACE series of evaluations. The annotation process was bootstrapped with a string-matching-based linking procedure, and we report on some of initial experimentation with the data. The cross-document linking information will be made publicly available.

Named Entity WordNet
Antonio Toral | Rafael Muñoz | Monica Monachini

This paper presents the automatic extension of Princeton WordNet with Named Entities (NEs). This new resource is called Named Entity WordNet. Our method maps the noun is-a hierarchy of WordNet to Wikipedia categories, identifies the NEs present in the latter and extracts different information from them such as written variants, definitions, etc. This information is inserted into a NE repository. A module that converts from this generic repository to the WordNet specific format has been developed. The paper explores different aspects of our methodology such as the treatment of polysemous terms, the identification of hyponyms within the Wikipedia categorization system, the identification of Wikipedia articles which are NEs and the design of a NE repository compliant with the LMF ISO standard. So far, this procedure enriches WordNet with 310,742 NEs and 381,043 “instance of” relations.

Is this NE tagger getting old?
Cristina Mota | Ralph Grishman

This paper focuses on the influence of changing the text time frame on the performance of a named entity tagger. We followed a twofold approach to investigate this subject: on the one hand, we analyzed a corpus that spans 8 years, and, on the other hand, we assessed the performance of a name tagger trained and tested on that corpus. We created 8 samples from the corpus, each drawn from the articles for a particular year. In terms of corpus analysis, we calculated the corpus similarity and names shared between samples. To see the effect on tagger performance, we implemented a semi-supervised name tagger based on co-training; then, we trained and tested our tagger on those samples. We observed that corpus similarity, names shared between samples, and tagger performance all decay as the time gap between the samples increases. Furthermore, we observed that the corpus similarity and names shared correlate with the tagger F-measure. These results show that named entity recognition systems may become obsolete in a short period of time.

Improving NER in Arabic Using a Morphological Tagger
Benjamin Farber | Dayne Freitag | Nizar Habash | Owen Rambow

We discuss a named entity recognition system for Arabic, and show how we incorporated the information provided by MADA, a full morphological tagger which uses a morphological analyzer. Surprisingly, the relevant features used are the capitalization of the English gloss chosen by the tagger, and the fact that an analysis is returned (that a word is not OOV to the morphological analyzer). The use of the tagger also improves over a third system which just uses a morphological analyzer, yielding a 14\% reduction in error over the baseline. We conduct a thorough error analysis to identify sources of success and failure among the variations, and show that by combining the systems in simple ways we can significantly influence the precision-recall trade-off.

Identifying Foreign Person Names in Chinese Text
Stephan Busemann | Yajing Zhang

Foreign name expressions written in Chinese characters are difficult to recognize since the sequence of characters represents the Chinese pronunciation of the name. This paper suggests that known English or German person names can reliably be identified on the basis of the similarity between the Chinese and the foreign pronunciation. In addition to locating a person name in the text and learning that it is foreign, the corresponding foreign name is identified, thus gaining precious additional information for cross-lingual applications. This idea is implemented as a statistical module into the rule-based shallow parsing system SProUT, forming the HyFex system. The statistical component is invoked if a sequence of “trigger” characters is found that may correspond to a foreign name. Their phonetic Pinyin representation is produced and compared to the phonetic representations (SAMPA) of given foreign names, which are generated by the MARY TTS system for German and English pronunciations. This comparison is achieved by a hand-crafted metric that assigns costs to specific edit operations. The person name corresponding to the SAMPA representation with the lowest costs attached is returned as the most similar result, if a threshold is not exceeded. Our evaluation on publicly available data shows competitive results.

Low-Complexity Heuristics for Deriving Fine-Grained Classes of Named Entities from Web Textual Data
Marius Paşca

We introduce a low-complexity method for acquiring fine-grained classes of named entities from the Web. The method exploits the large amounts of textual data available on the Web, while avoiding the use of any expensive text processing techniques or tools. The quality of the extracted classes is encouraging with respect to both the precision of the sets of named entities acquired within various classes, and the labels assigned to the sets of named entities.

Annotation Guidelines for Chinese-Korean Word Alignment
Jin-Ji Li | Dong-Il Kim | Jong-Hyeok Lee

For a language pair such as Chinese and Korean that belong to entirely different language families in terms of typology and genealogy, finding the correspondences is quite obscure in word alignment. We present annotation guidelines for Chinese-Korean word alignment through contrastive analysis of morpho-syntactic encodings. We discuss the differences in verbal systems that cause most of linking obscurities in annotation process. Systematic comparison of verbal systems is conducted by analyzing morpho-syntactic encodings. The viewpoint of grammatical category allows us to define consistent and systematic instructions for linguistically distant languages such as Chinese and Korean. The scope of our guidelines is limited to the alignment between Chinese and Korean, but the instruction methods exemplified in this paper are also applicable in developing systematic and comprehensible alignment guidelines for other languages having such different linguistic phenomena.

CzEng 0.7: Parallel Corpus with Community-Supplied Translations
Ondřej Bojar | Miroslav Janíček | Zdeněk Žabokrtský | Pavel Češka | Peter Beňa

This paper describes CzEng 0.7, a new release of Czech-English parallel corpus freely available for research and educational purposes. We provide basic statistics of the corpus and focus on data produced by a community of volunteers. Anonymous contributors manually correct the output of a machine translation (MT) system, generating on average 2000 sentences a month, 70% of which are indeed correct translations. We compare the utility of community-supplied and of professionally translated training data for a baseline English-to-Czech MT system.

Toward Active Learning in Data Selection: Automatic Discovery of Language Features During Elicitation
Jonathan Clark | Robert Frederking | Lori Levin

Data Selection has emerged as a common issue in language technologies. We define Data Selection as the choosing of a subset of training data that is most effective for a given task. This paper describes deductive feature detection, one component of a data selection system for machine translation. Feature detection determines whether features such as tense, number, and person are expressed in a language. The database of the The World Atlas of Language Structures provides a gold standard against which to evaluate feature detection. The discovered features can be used as input to a Navigator, which uses active learning to determine which piece of language data is the most important to acquire next.

Babylon Parallel Text Builder: Gathering Parallel Texts for Low-Density Languages
Michael Mohler | Rada Mihalcea

This paper describes Babylon, a system that attempts to overcome the shortage of parallel texts in low-density languages by supplementing existing parallel texts with texts gathered automatically from the Web. In addition to the identification of entire Web pages, we also propose a new feature specifically designed to find parallel text chunks within a single document. Experiments carried out on the Quechua-Spanish language pair show that the system is successful in automatically identifying a significant amount of parallel texts on the Web. Evaluations of a machine translation system trained on this corpus indicate that the Web-gathered parallel texts can supplement manually compiled parallel texts and perform significantly better than the manually compiled texts when tested on other Web-gathered data.

SECTra_w.1: an Online Collaborative System for Evaluating, Post-editing and Presenting MT Translation Corpora
Cong-Phap Huynh | Christian Boitet | Hervé Blanchon

SECTra_w is a web-oriented system mainly dedicated to the evaluation of MT systems. After importing a source corpus, and possibly reference translations, one can call various MT systems, store their results, and have a collection of human judges perform subjective evaluation online (fluidity, adequacy). It is also possible to perform objective, task-oriented evaluation by letting humans post-edit the MT results, using a web translation editor, and measuring an edit distance and/or the post-editing time. The post-edited results can be added to the set of reference translations, or constitute it if there were no references. SECTra_w makes it possible to show not only tables of figures as results of an evaluation campaign, but also the real data (source, MT outputs, references, post-edited outputs), and to make the post-edition effort sensible by transforming the trace of the edit distance computation in an intuitive presentation, much like a “revision” presentation in Word. The system is written in java under Xwiki and uses the Ajax technique. It can handle large, multilingual and multimedia corpora: EuroParl, BTEC, ERIM (bilingual interpreted dialogues with audio and text), Unesco-B@bel, and a test corpus by France Telecom have been loaded together and used in tests.

Adjudicator Agreement and System Rankings for Person Name Search
Mark Arehart | Chris Wolf | Keith J. Miller

We have analyzed system rankings for person name search algorithms using a data set for which several versions of ground truth were developed by employing different means of resolving adjudicator conflicts. Thirteen algorithms were ranked by F-score, using bootstrap resampling for significance testing, on a dataset containing 70,000 romanized names from various cultures. We found some disagreement among the four adjudicators, with kappa ranging from 0.57 to 0.78. Truth sets based on a single adjudicator, and on the intersection or union of positive adjudications produced sizeable variability in scoring sensitivity - and to a lesser degree rank order - compared to the consensus truth set. However, results on truth sets constructed by randomly choosing an adjudicator for each item were highly consistent with the consensus. The implication is that an evaluation where one adjudicator has judged each item is nearly as good as a more expensive and labor-intensive one where multiple adjudicators have judged each item and conflicts are resolved through voting.

Evaluating Summaries Automatically - A system Proposal
Paulo C F de Oliveira | Edson Wilson Torrens | Alexandre Cidral | Sidney Schossland | Evandro Bittencourt

We propose in this paper an automatic evaluation procedure based on a metric which could provide summary evaluation without human assistance. Our system includes two metrics, which are presented and discussed. The first metric is based on a known and powerful statistical test, the X2 goodness-of-fit test, and has been used in several applications. The second metric is derived from three common metrics used to evaluate Natural Language Processing (NLP) systems, namely precision, recall and f-measure. The combination of these two metrics is intended to allow one to assess the quality of summaries quickly, cheaply and without the need of human intervention, minimizing though, the role of subjective judgment and bias.

Do we Still Need Gold Standards for Evaluation?
Thierry Poibeau | Cédric Messiant

The availability of a huge mass of textual data in electronic format has increased the need for fast and accurate techniques for textual data processing. Machine learning and statistical approaches have been increasingly used in NLP since a decade, mainly because they are quick, versatile and efficient. However, despite this evolution of the field, evaluation still rely (most of the time) on a comparison between the output of a probabilistic or statistical system on the one hand, and a non-statistic, most of the time hand-crafted, gold standard on the other hand. In this paper, we take the example of the acquisition of subcategorization frames from corpora as a practical example. Our study is motivated by the fact that, even if a gold standard is an invaluable resource for evaluation, a gold standard is always partial and does not really show how accurate and useful results are.

The Dutch-Flemish Comprehensive Approach to HLT Stimulation and Innovation: STEVIN, HLT Agency and beyond
Peter Spyns | Elisabeth D’Halleweyn | Catia Cucchiarini

This paper shows how a research and industry stimulation programme on human language technologies (HLT) for Dutch can be “enhanced” with more specific innovation policy aspects to support the take-up by the HLT industry in the Netherlands and Flanders. Important to note is the distinction between the HLT programme itself (called STEVIN) with its specific related committees and actions and the overall policy instruments (HLT Agency, HLT steering board?) that try to span the entire domain of HLT for Dutch and have a more permanent character. The establishment of a pricing committee and a PR & communication working group is explained as a consequence of adopting the notion of “innovation system” as a theoretical framework. It means that a stronger emphasis is put on improving knowledge transfer and exchange amongst actors in the field. Therefore, the focus at the programme management level is shifting from the projects’ research activities producing results to gathering the results, making them available at a certain cost and advertising them through the appropriate channels to the appropriate potential customers. Our conclusion is that this policy stimulates the transfer from academia to industry though it is too soon for an in-depth assessment of the STEVIN programme and other HLT innovation policy instruments.

15 Years of Language Resource Creation and Sharing: a Progress Report on LDC Activities
Christopher Cieri | Mark Liberman

This paper, the fifth in a series of biennial progress reports, reviews the activities of the Linguistic Data Consortium with particular emphasis on general trends in the language resource landscape and on changes that distinguish the two years since LDC’s last report at LREC from the preceding 8 years. After providing a perspective on the current landscape of language resources, the paper goes on to describe our vision of the role of LDC within the research communities it serves before sketching briefly specific publications and resources creations projects that have been the focus our attention since the last report.

Estimating the Resource Adaption Cost from a Resource Rich Language to a Similar Resource Poor Language
Anil Kumar Singh | Kiran Pala | Harshit Surana

Developing resources which can be used for Natural Language Processing is an extremely difficult task for any language, but is even more so for less privileged (or less computerized) languages. One way to overcome this difficulty is to adapt the resources of a linguistically close resource rich language. In this paper we discuss how the cost of such adaption can be estimated using subjective and objective measures of linguistic similarity for allocating financial resources, time, manpower etc. Since this is the first work of its kind, the method described in this paper should be seen as only a preliminary method, indicative of how better methods can be developed. Corpora of several less computerized languages had to be collected for the work described in the paper, which was difficult because for many of these varieties there is not much electronic data available. Even if it is, it is in non-standard encodings, which means that we had to build encoding converters for these varieties. The varieties we have focused on are some of the varieties spoken in the South Asian region.

Latest Developments in ELRA’s Services
Valérie Mapelli | Victoria Arranz | Hélène Mazo | Khalid Choukri

This paper describes the latest developments in ELRA’s services within the field of Language Resources (LR). These developments focus on 4 main groups of activities: the identification and distribution of Language Resources; the production of LRs; the evaluation of Human Language Technology (HLT), and the dissemination of information in the field. ELRA’s initial work on the distribution of language resources has evolved throughout the years, currently covering a much wider range of activities that have been considered crucial for the current needs of the R&D community and the “good health” of the LR world. Regarding distribution, considerable work has been done on a broader identification, which does not only consider resources to be immediately negotiated for distribution but which aims to inform on all available resources. This has been the seed for the Universal Catalogue. Furthermore, a Catalogue of LRs with favourable conditions for R&D has also been created. Moreover, the different activities in what regards identification on demand, production within different frameworks, evaluation of language technologies and participation in evaluation campaigns, as well as our very specific focus on information dissemination are described in detail in this paper.

From Research to Application in Multilingual Information Access: the Contribution of Evaluation
Carol Peters | Martin Braschler | Giorgio Di Nunzio | Nicola Ferro | Julio Gonzalo | Mark Sanderson

The importance of evaluation in promoting research and development in the information retrieval and natural language processing domains has long been recognised but is this sufficient? In many areas there is still a considerable gap between the results achieved by the research community and their implementation in commercial applications. This is particularly true for the cross-language or multilingual retrieval areas. Despite the strong demand for and interest in multilingual IR functionality, there are still very few operational systems on offer. The Cross Language Evaluation Forum (CLEF) is now taking steps aimed at changing this situation. The paper provides a critical assessment of the main results achieved by CLEF so far and discusses plans now underway to extend its activities in order to have a more direct impact on the application sector.

Clustering Related Terms with Definitions
Scott Piao | John McNaught | Sophia Ananiadou

It is a challenging task to match similar or related terms/expressions in NLP and Text Mining applications. Two typical areas in need for such work are terminology and ontology constructions, where terms and concepts are extracted and organized into certain structures with various semantic relations. In the EU BOOTSTrep Project we test various techniques for matching terms that can assist human domain experts in building and enriching ontologies. This paper reports on a work in which we evaluated a text comparing and clustering tool for this task. Particularly, we explore the feasibility of matching related terms with their definitions. Ontology terms, such as Gene Ontology terms, are often assigned with detailed definitions, which provide a fundamental information source for detecting relations between terms. Here we focus on the exploitation of term definitions for the term matching task. Our experiment shows that the tool is capable of grouping many related terms using their definitions.

Challenges in Pronoun Resolution System for Biomedical Text
Ngan Nguyen | Jin-Dong Kim | Jun’ichi Tsujii

This paper presents our findings on the feasibility of doing pronoun resolution for biomedical texts, in comparison with conducting pronoun resolution for the newswire domain. In our experiments, we built a simple machine learning-based pronoun resolution system, and evaluated the system on three different corpora: MUC, ACE, and GENIA. Comparative statistics not only reveal the noticeable issues in constructing an effective pronoun resolution system for a new domain, but also provides a comprehensive view of those corpora often used for this task.

Exploiting Multiply Annotated Corpora in Biomedical Information Extraction Tasks
Barry Haddow | Beatrice Alex

This paper discusses the problem of utilising multiply annotated data in training biomedical information extraction systems. Two corpora, annotated with entities and relations, and containing a number of multiply annotated documents, are used to train named entity recognition and relation extraction systems. Several methods of automatically combining the multiple annotations to produce a single annotation are compared, but none produces better results than simply picking one of the annotated versions at random. It is also shown that adding extra singly annotated documents produces faster performance gains than adding extra multiply annotated documents.

GENIA-GR: a Grammatical Relation Corpus for Parser Evaluation in the Biomedical Domain
Yuka Tateisi | Yusuke Miyao | Kenji Sagae | Jun’ichi Tsujii

We report the construction of a corpus for parser evaluation in the biomedical domain. A 50-abstract subset (492 sentences) of the GENIA corpus (Kim et al., 2003) is annotated with labeled head-dependent relations using the grammatical relations (GR) evaluation scheme (Carroll et al., 1998) ,which has been used for parser evaluation in the newswire domain.

Learning the Species of Biomedical Named Entities from Annotated Corpora
Xinglong Wang | Claire Grover

In biomedical articles, terms with the same surface forms are often used to refer to different entities across a number of model organisms, in which case determining the species becomes crucial to term identification systems that ground terms to specific database identifiers. This paper describes a rule-based system that extracts “species indicating words”, such as human or murine, which can be used to decide the species of the nearby entity terms, and a machine-learning species disambiguation system that was developed on manually species-annotated corpora. Performance of both systems were evaluated on gold-standard datasets, where the machine-learning system yielded better overall results.

Acquiring Naturalistic Concept Descriptions from the Web
Tony Veale | Yanfen Hao

Many of the beliefs that one uses to reason about everyday entities and events are neither strictly true or even logically consistent. Rather, people appear to rely on a large body of folk knowledge in the form of stereotypical associations, clichés and other kinds of naturalistic descriptions, many of which express views of the world that are second-hand, overly-simplified and, in some cases, non-literal to the point of being poetic. These descriptions pervade our language yet one rarely finds them in authoritative linguistic resources like dictionaries and encyclopaedias. We describe here how such naturalistic descriptions can be harvested from the web in the guise of explicit similes and related text patterns, and empirically demonstrate that these descriptions do broadly capture the way people see the world, at least from the perspective of category organization in an ontology.

Tools for Collocation Extraction: Preferences for Active vs. Passive
Ulrich Heid | Marion Weller

We present and partially evaluate procedures for the extraction of noun+verb collocation candidates from German text corpora, along with their morphosyntactic preferences, especially for the active vs. passive voice. We start from tokenized, tagged, lemmatized and chunked text, and we use extraction patterns formulated in the CQP corpus query language. We discuss the results of a precision evaluation, on administrative texts from the European Union: we find a considerable amount of specialized collocations, as well as general ones and complex predicates; overall the precision is considerably higher than that of a statistical extractor used as a baseline.

Boot-Strapping a WordNet Using Multiple Existing WordNets
Francis Bond | Hitoshi Isahara | Kyoko Kanzaki | Kiyotaka Uchimoto

In this paper we describe the construction of an illustrated Japanese Wordnet. We bootstrap the Wordnet using existing multiple existing wordnets in order to deal with the ambiguity inherent in translation. We illustrate it with pictures from the Open Clip Art Library.

Corpus-based Semantic Relatedness for the Construction of Polish WordNet
Bartosz Broda | Magdalena Derwojedowa | Maciej Piasecki | Stanislaw Szpakowicz

The construction of a wordnet, a labour-intensive enterprise, can be significantly assisted by automatic grouping of lexical material and discovery of lexical semantic relations. The objective is to ensure high quality of automatically acquired results before they are presented for lexicographers’ approval. We discuss a software tool that suggests synset members using a measure of semantic relatedness with a given verb or adjective; this extends previous work on nominal synsets in Polish WordNet. Syntactically-motivated constraints are deployed on a large morphologically annotated corpus of Polish. Evaluation has been performed via the WordNet-Based Similarity Test and additionally supported by human raters. A lexicographer also manually assessed a suitable sample of suggestions. The results compare favourably with other known methods of acquiring semantic relations.

Developing Verb Frames for Hindi
Rafiya Begum | Samar Husain | Lakshmi Bai | Dipti Misra Sharma

This paper introduces an ongoing work on developing verb frames for Hindi. Verb frames capture syntactic commonalities of semantically related verbs. The main objective of this work is to create a linguistic resource which will prove to be indispensable for various NLP applications. We also hope this resource to help us better understand Hindi verbs. We motivate the basic verb argument structure using relations as introduced by Panini. We show the methodology used in preparing these frames and the criteria followed for classifying Hindi verbs.

Uncertainty Corpus: Resource to Study User Affect in Complex Spoken Dialogue Systems
Kate Forbes-Riley | Diane Litman | Scott Silliman | Amruta Purandare

We present a corpus of spoken dialogues between students and an adaptive Wizard-of-Oz tutoring system, in which student uncertainty was manually annotated in real-time. We detail the corpus contents, including speech files, transcripts, annotations, and log files, and we discuss possible future uses by the computational linguistics community as a novel resource for studying naturally occurring user affect and adaptation in complex spoken dialogue systems.

On the Role of the NIMITEK Corpus in Developing an Emotion Adaptive Spoken Dialogue System
Milan Gnjatović | Dietmar Roesner

This paper reports on the creation of the multimodal NIMITEK corpus of affected behavior in human-machine interaction and its role in the development of the NIMITEK prototype system. The NIMITEK prototype system is a spoken dialogue system for supporting users while they solve problems in a graphics system. The central feature of the system is adaptive dialogue management. The system dynamically defines a dialogue strategy according to the current state of the interaction (including also the emotional state of the user). Particular emphasis is devoted to the level of naturalness of interaction. We discuss that a higher level of naturalness can be achieved by combining a habitable natural language interface and an appropriate dialogue strategy. The role of the NIMITEK multimodal corpus in achieving these requirements is twofold: (1) in developing the model of attentional state on the level of user’s commands that facilitates processing of flexibly formulated commands, and (2) in defining the dialogue strategy that takes the emotional state of the user into account. Finally, we sketch the implemented prototype system and describe the incorporated dialogue management module. Whereas the prototype system itself is task-specific, the described underlying concepts are intended to be task-independent.

Emotion Recognition from Speech: Stress Experiment
Stefan Scherer | Hansjörg Hofmann | Malte Lampmann | Martin Pfeil | Steffen Rhinow | Friedhelm Schwenker | Günther Palm

The goal of this work is to introduce an architecture to automatically detect the amount of stress in the speech signal close to real time. For this an experimental setup to record speech rich in vocabulary and containing different stress levels is presented. Additionally, an experiment explaining the labeling process with a thorough analysis of the labeled data is presented. Fifteen subjects were asked to play an air controller simulation that gradually induced more stress by becoming more difficult to control. During this game the subjects were asked to answer questions, which were then labeled by a different set of subjects in order to receive a subjective target value for each of the answers. A recurrent neural network was used to measure the amount of stress contained in the utterances after training. The neural network estimated the amount of stress at a frequency of 25 Hz and outperformed the human baseline.

Automatic Phone Segmentation of Expressive Speech
Laure Charonnat | Gaëlle Vidal | Olivier Boeffard

In order to improve the flexibility and the precision of an automatic phone segmentation system for a type of expressive speech, the dubbing into French of fiction movies, we developed both the phonetic labeling process and the alignment process. The automatic labelling system relies on an automatic grapheme-to-phoneme conversion including all the variants of the phonetic chain and on HMM modeling. In this article, we will distinguish three sets of phone models: a set of context independent models, a set of left and right context dependant models and finally a mixing of the two that combines phone and triphone models according to the precision of alignment obtained for each phonetic broad-class. The three models are evaluated on a test corpus. On the one hand we notice a little decrease in the score of phonetic labelling mainly due to pauses insertions, but on the other hand the mixed set of models gives the best results for the score of precision of the alignment.

Multimodal Spontaneous Expressive Speech Corpus for Hungarian
Márk Fék | Nicolas Audibert | János Szabó | Albert Rilliard | Géza Németh | Véronique Aubergé

A Hungarian multimodal spontaneous expressive speech corpus was recorded following the methodology of a similar French corpus. The method relied on a Wizard of Oz scenario-based induction of varying affective states. The subjects were interacting with a supposedly voice-recognition driven computer application using simple command words. Audio and video signals were captured for the 7 recorded subjects. After the experiment, the subjects watched the video recording of their session and labelled the recorded corpus themselves, freely describing the evolution of their affective states. The obtained labels were later classified into one of the following broad emotional categories: satisfaction, dislike, stress, or other. A listening test was performed by 25 naïve listeners in order to validate the category labels originating from the self-labelling. For 52 of the 149 stimuli, listeners’ judgements of the emotional content were in agreement with the labels. The result of the listening test was compared with an earlier test validating a part of the French corpus. While the French test had a higher success ratio, validating the labels of 79 tested stimuli, out of the 193, the stimuli validated by the two tests can form the basis of cross linguistic comparison experiments.

Vox Populi Annotation: Measuring Intensity of Ideological Perspectives by Aggregating Group Judgments
Wei-Hao Lin | Alexander Hauptmann

Polarizing discussions about political and social issues are common in mass media. Annotations on the degree to which a sentence expresses an ideological perspective can be valuable for evaluating computer programs that can automatically identify strongly biased sentences, but such annotations remain scarce. We annotated the intensity of ideological perspectives expressed in 250 sentences by aggregating judgments from 18 annotators. We proposed methods of determining the number of annotators and assessing reliability, and showed the annotations were highly consistent across different annotator groups.

A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources
Carmen Banea | Rada Mihalcea | Janyce Wiebe

This paper introduces a method for creating a subjectivity lexicon for languages with scarce resources. The method is able to build a subjectivity lexicon by using a small seed set of subjective words, an online dictionary, and a small raw corpus, coupled with a bootstrapping process that ranks new candidate words based on a similarity measure. Experiments performed with a rule-based sentence level subjectivity classifier show an 18% absolute improvement in F-measure as compared to previously proposed semi-supervised methods.

Finding the Sources and Targets of Subjective Expressions
Josef Ruppenhofer | Swapna Somasundaran | Janyce Wiebe

As many popular text genres such as blogs or news contain opinions by multiple sources and about multiple targets, finding the sources and targets of subjective expressions becomes an important sub-task for automatic opinion analysis systems. We argue that while automatic semantic role labeling systems (ASRL) have an important contribution to make, they cannot solve the problem for all cases. Based on the experience of manually annotating opinions, sources, and targets in various genres, we present linguistic phenomena that require knowledge beyond that of ASRL systems. In particular, we address issues relating to the attribution of opinions to sources; sources and targets that are realized as zero-forms; and inferred opinions. We also discuss in some depth that for arguing attitudes we need to be able to recover propositions and not only argued-about entities. A recurrent theme of the discussion is that close attention to specific discourse contexts is needed to identify sources and targets correctly.

Annotating Topics of Opinions
Veselin Stoyanov | Claire Cardie

Fine-grained subjectivity analysis has been the subject of much recent research attention. As a result, the field has gained a number of working definitions, technical approaches and manually annotated corpora that cover many facets of subjectivity. Little work has been done, however, on one aspect of fine-grained opinions - the specification and identification of opinion topics. In particular, due to the difficulty of manual opinion topic annotation, no general-purpose opinion corpus with information about topics of fine-grained opinions currently exists. In this paper, we propose a methodology for the manual annotation of opinion topics and use it to annotate a portion of an existing general-purpose opinion corpus with opinion topic information. Inter-annotator agreement results according to a number of metrics suggest that the annotations are reliable.

From Extracting to Abstracting: Generating Quasi-abstractive Summaries
Zhuli Xie | Barbara Di Eugenio | Peter C. Nelson

In this paper, we investigate quasi-abstractive summaries, a new type of machine-generated summaries that do not use whole sentences, but only fragments from the source. Quasi-abstractive summaries aim at bridging the gap between human-written abstracts and extractive summaries. We present an approach that learns how to identify sets of sentences, where each set contains fragments that can be used to produce one sentence in the abstract; and then uses these sets to produce the abstract itself. Our experiments show very promising results. Importantly, we obtain our best results when the summary generation is anchored by the most salient Noun Phrases predicted from the text to be summarized.

Controlling Redundancy in Referring Expressions
Jette Viethen | Robert Dale | Emiel Krahmer | Mariët Theune | Pascal Touset

Krahmer et al.’s (2003) graph-based framework provides an elegant and flexible approach to the generation of referring expressions. In this paper, we present the first reported study that systematically investigates how to tune the parameters of the graph-based framework on the basis of a corpus of human-generated descriptions. We focus in particular on replicating the redundant nature of human referring expressions, whereby properties not strictly necessary for identifying a referent are nonetheless included in descriptions. We show how statistics derived from the corpus data can be integrated to boost the framework’s performance over a non-stochastic baseline.

Anaphoric Annotation in the ARRAU Corpus
Massimo Poesio | Ron Artstein

Arrau is a new corpus annotated for anaphoric relations, with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans. The corpus contains texts from different genres: task-oriented dialogues from the Trains-91 and Trains-93 corpus, narratives from the English Pear Stories corpus, newspaper articles from the Wall Street Journal portion of the Penn Treebank, and mixed text from the Gnome corpus.

Knowledge Sources for Bridging Resolution in Multi-Party Dialog
Mark-Christoph Mueller | Margot Mieskes | Michael Strube

In this paper we investigate the coverage of the two knowledge sources WordNet and Wikipedia for the task of bridging resolution. We report on an annotation experiment which yielded pairs of bridging anaphors and their antecedents in spoken multi-party dialog. Manual inspection of the two knowledge sources showed that, with some interesting exceptions, Wikipedia is superior to WordNet when it comes to the coverage of information necessary to resolve the bridging anaphors in our data set. We further describe a simple procedure for the automatic extraction of the required knowledge from Wikipedia by means of an API, and discuss some of the implications of the procedure’s performance.

The Penn Discourse TreeBank 2.0.
Rashmi Prasad | Nikhil Dinesh | Alan Lee | Eleni Miltsakaki | Livio Robaldo | Aravind Joshi | Bonnie Webber

We present the second version of the Penn Discourse Treebank, PDTB-2.0, describing its lexically-grounded annotations of discourse relations and their two abstract object arguments over the 1 million word Wall Street Journal corpus. We describe all aspects of the annotation, including (a) the argument structure of discourse relations, (b) the sense annotation of the relations, and (c) the attribution of discourse relations and each of their arguments. We list the differences between PDTB-1.0 and PDTB-2.0. We present representative statistics for several aspects of the annotation in the corpus.

A Coreference Corpus and Resolution System for Dutch
Iris Hendrickx | Gosse Bouma | Frederik Coppens | Walter Daelemans | Veronique Hoste | Geert Kloosterman | Anne-Marie Mineur | Joeri Van Der Vloet | Jean-Luc Verschelde

We present the main outcomes of the COREA project: a corpus annotated with coreferential relations and a coreference resolution system for Dutch. In the project we developed annotation guidelines for coreference resolution for Dutch and annotated a corpus of 135K tokens. We discuss these guidelines, the annotation tool, and the inter-annotator agreement. We also show a visualization of the annotated relations. The standard approach to evaluate a coreference resolution system is to compare the predictions of the system to a hand-annotated gold standard test set (cross-validation). A more practically oriented evaluation is to test the usefulness of coreference relation information in an NLP application. We run experiments with an Information Extraction module for the medical domain, and measure the performance of this module with and without the coreference relation information. We present the results of both this application-oriented evaluation of our system and of a standard cross-validation evaluation. In a separate experiment we also evaluate the effect of coreference information produced by a simple rule-based coreference module in a Question Answering application.

Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data
Kirk Baker | Chris Brew

This paper describes an accurate, extensible method for automatically classifying unknown foreign words that requires minimal monolingual resources and no bilingual training data (which is often difficult to obtain for an arbitrary language pair). We use a small set of phonologically-based transliteration rules to generate a potentially unlimited amount of pseudo-data that can be used to train a classifier to distinguish etymological classes of actual words. We ran a series of experiments on identifying English loanwords in Korean, in order to explore the consequences of using pseudo-data in place of the original training data. Results show that a sufficient quantity of automatically generated training data, even produced by fairly low precision transliteration rules, can be used to train a classifier that performs within 0.3% of one trained on actual English loanwords (96% accuracy).

Romanian Semantic Role Resource
Diana Trandabăţ | Maria Husarciuc

Semantic databases are a stable starting point in developing knowledge based systems. Since creating language resources demands many temporal, financial and human resources, a possible solution could be the import of a resource annotation from one language to another. This paper presents the creation of a semantic role database for Romanian, starting from the English FrameNet semantic resource. The intuition behind the importing program is that most of the frames defined in the English FN are likely to be valid cross-lingual, since semantic frames express conceptual structures, language independent at the deep structure level. The surface realization, the surface level, is realized according to each language syntactic constraints. In the paper we present the advantages of choosing to import the English FrameNet annotation, instead of annotating a new corpus. We also take into account the mismatches encountered in the validation process. The rules created to manage particular situations are used to improve the import program. We believe the information and argumentations in this paper could be of interest for those who wish develop FrameNet-like systems for other languages.

Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora
Alessandro Lenci | Barbara McGillivray | Simonetta Montemagni | Vito Pirrelli

In this paper, we reported experiments of unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on syntactically shallow-parsed corpora on the basis of a limited number of search heuristics not relying on any previous lexico-syntactic knowledge about SCFs. Although preliminary, reported results are in line with state-of-the-art lexical acquisition systems. The issue of whether verbs sharing similar SCFs distributions happen to share similar semantic properties as well was also explored by clustering verbs that share frames with the same distribution using the Minimum Description Length Principle (MDL). First experiments in this direction were carried out on Italian verbs with encouraging results.

A Method for Automatically Constructing Case Frames for English
Daisuke Kawahara | Kiyotaka Uchimoto

Case frames are an important knowledge base for a variety of natural language processing (NLP) systems. For the practical use of these systems in the real world, wide-coverage case frames are required. In order to acquire such large-scale case frames, in this paper, we automatically compile case frames from a large corpus. The resultant case frames that are compiled from the English Gigaword corpus contain 9,300 verb entries. The case frames include most examples of normal usage, and are ready to be used in numerous NLP analyzers and applications.

Automatic Acquisition for low frequency lexical items
Núria Bel | Sergio Espeja | Montserrat Marimon

This paper addresses a specific case of the task of lexical acquisition understood as the induction of information about the linguistic characteristics of lexical items on the basis of information gathered from their occurrences in texts. Most of the recent works in the area of lexical acquisition have used methods that take as much textual data as possible as source of evidence, but their performance decreases notably when only few occurrences of a word are available. The importance of covering such low frequency items lies in the fact that a large quantity of the words in any particular collection of texts will be occurring few times, if not just once. Our work proposes to compensate the lack of information resorting to linguistic knowledge on the characteristics of lexical classes. This knowledge, obtained from a lexical typology, is formulated probabilistically to be used in a Bayesian method to maximize the information gathered from single occurrences as to predict the full set of characteristics of the word. Our results show that our method achieves better results than others for the treatment of low frequency items.

BioSec Multimodal Biometric Database in Text-Dependent Speaker Recognition
Doroteo Toledano | Daniel Hernandez-Lopez | Cristina Esteve-Elizalde | Julian Fierrez | Javier Ortega-Garcia | Daniel Ramos | Joaquin Gonzalez-Rodriguez

In this paper we briefly describe the BioSec multimodal biometric database and analyze its use in automatic text-dependent speaker recognition research. The paper is structured into four parts: a short introduction to the problem of text-dependent speaker recognition; a brief review of other existing databases, including monomodal text-dependent speaker recognition databases and multimodal biometric recognition databases; a description of the BioSec database; and, finally, an experimental section in which speaker recognition results on some of these databases are presented and compared, using the same underlying speaker recognition technique in all cases.

Text Independent Speaker Identification in Multilingual Environments
Iker Luengo | Eva Navas | Iñaki Sainz | Ibon Saratxaga | Jon Sanchez | Igor Odriozola | Inma Hernaez

Speaker identification and verification systems have a poor performance when model training is done in one language while the testing is done in another. This situation is not unusual in multilingual environments, where people should be able to access the system in any language he or she prefers in each moment, without noticing a performance drop. In this work we study the possibility of using features derived from prosodic parameters in order to reinforce the language robustness of these systems. First the features’ properties in terms of language and session variability are studied, predicting an increase in the language robustness when frame-wise intonation and energy values are combined with traditional MFCC features. The experimental results confirm that these features provide an improvement in the speaker recognition rates under language-mismatch conditions. The whole study is carried out in the Basque Country, a bilingual region in which Basque and Spanish languages co-exist.

NineOneOne: Recognizing and Classifying Speech for Handling Minority Language Emergency Calls
Udhyakumar Nallasamy | Alan Black | Tanja Schultz | Robert Frederking

In this paper, we describe NineOneOne (9-1-1), a system designed to recognize and translate Spanish emergency calls for better dispatching. We analyze the research challenges in adapting speech translation technology to 9-1-1 domain. We report our initial research towards building the system and the results of our initial experiments.

Bridging the Gap between Linguists and Technology Developers: Large-Scale, Sociolinguistic Annotation for Dialect and Speaker Recognition
Christopher Cieri | Stephanie Strassel | Meghan Glenn | Reva Schwartz | Wade Shen | Joseph Campbell

Recent years have seen increased interest within the speaker recognition community in high-level features including, for example, lexical choice, idiomatic expressions or syntactic structures. The promise of speaker recognition in forensic applications drives development toward systems robust to channel differences by selecting features inherently robust to channel difference. Within the language recognition community, there is growing interest in differentiating not only languages but also mutually intelligible dialects of a single language. Decades of research in dialectology suggest that high-level features can enable systems to cluster speakers according to the dialects they speak. The Phanotics (Phonetic Annotation of Typicality in Conversational Speech) project seeks to identify high-level features characteristic of American dialects, annotate a corpus for these features, use the data to dialect recognition systems and also use the categorization to create better models for speaker recognition. The data, once published, should be useful to other developers of speaker and dialect recognition systems and to dialectologists and sociolinguists. We expect the methods will generalize well beyond the speakers, dialects, and languages discussed here and should, if successful, provide a model for how linguists and technology developers can collaborate in the future for the benefit of both groups and toward a deeper understanding of how languages vary and change.

Speaker Recognition: Building the Mixer 4 and 5 Corpora
Linda Brandschain | Christopher Cieri | David Graff | Abby Neely | Kevin Walker

The original Mixer corpus was designed to satisfy developing commercial and forensic needs. The resulting Mixer corpora, Phases 1 through 5, have evolved to support and increasing variety of research tasks, including multilingual and cross-channel recognition. The Mixer Phases 4 and 5 corpora feature a wider variety of channels and greater variation in the situations under which the speech is recorded. This paper focuses on the plans, progress and results of Mixer 4 and 5.

MASC: the Manually Annotated Sub-Corpus of American English
Nancy Ide | Collin Baker | Christiane Fellbaum | Charles Fillmore | Rebecca Passonneau

To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing maximum accessibility for researchers from around the globe.

Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on heterogeneous Tagging System
Chu-Ren Huang | Lung-Hao Lee | Wei-guang Qu | Jia-Fei Hong | Shiwen Yu

We propose a set of heuristics for improving annotation quality of very large corpora efficiently. The Xinhua News portion of the Chinese Gigaword Corpus was tagged independently with both the Peking University ICL tagset and the Academia Sinica CKIP tagset. The corpus-based POS tags mapping will serve as the basis of the possible contrast in grammatical systems between PRC and Taiwan. And it can serve as the basic model for mapping between the CKIP and ICL tagging systems for any data.

An eRulemaking Corpus: Identifying Substantive Issues in Public Comments
Claire Cardie | Cynthia Farina | Matt Rawding | Adil Aijaz

We describe the creation of a corpus that supports a real-world hierarchical text categorization task in the domain of electronic rulemaking (eRulemaking). Features of the task and of the eRulemaking domain engender both a non-traditional text categorization corpus and a correspondingly difficult machine learning task. Interannotator agreement results are presented for a group of six annotators. We also briefly describe the results of experiments that apply standard and hierarchical text categorization techniques to the eRulemaking data sets. The corpus is the first in a series of related sentence-level text categorization corpora to be developed in the eRulemaking domain.

Navigating through Dense Annotation Spaces
Branimir Boguraev | Mary Neff

Pattern matching, or querying, over annotations is a general purpose paradigm for inspecting, navigating, mining, and transforming annotation repositories - the common representation basis for modern pipelined text-processing frameworks. Configurability of such frameworks and expressiveness of feature structure-based annotation schemes account for the “high density” of some such annotation repositories. This particular characteristic makes challenging the design of a pattern matching engine, capable of interpreting (or imposing) flat patterns over an arbitrarily dense annotation lattice. We present an approach where a finite state device carries out the application of (compiled) grammars over what is, in effect, a linearized “projection” of a unique route through the lattice; a route derived by a mix of static pattern (grammar) analysis and interpretation of navigational directives within the extended grammar formalism. Our approach achieves a mix of finite state scanning and lattice traversal for expressive and efficient pattern matching in dense annotations stores.

An Unsupervised Probabilistic Approach for the Detection of Outliers in Corpora
David Guthrie | Louise Guthrie | Yorick Wilks

Many applications of computational linguistics are greatly influenced by the quality of corpora available and as automatically generated corpora continue to play an increasingly common role, it is essential that we not overlook the importance of well-constructed and homogeneous corpora. This paper describes an automatic approach to improving the homogeneity of corpora using an unsupervised method of statistical outlier detection to find documents and segments that do not belong in a corpus. We consider collections of corpora that are homogeneous with respect to topic (i.e. about the same subject), or genre (written for the same audience or from the same source) and use a combination of stylistic and lexical features of the texts to automatically identify pieces of text in these collections that break the homogeneity. These pieces of text that are significantly different from the rest of the corpus are likely to be errors that are out of place and should be removed from the corpus before it is used for other tasks. We evaluate our techniques by running extensive experiments over large artificially constructed corpora that each contain single pieces of text from a different topic, author, or genre than the rest of the collection and measure the accuracy of identifying these pieces of text without the use of training data. We show that when these pieces of text are reasonably large (1,000 words) we can reliably identify them in a corpus.

Using Log-linear Models for Tuning Machine Translation Output
Michael Carl

We describe a set of experiments to explore statistical techniques for ranking and selecting the best translations in a graph of translation hypotheses. In a previous paper (Carl, 2007) we have described how the graph of hypotheses is generated through shallow transfer and chunk permutation rules, where nodes consist of vectors representing morpho-syntactic properties of words and phrases. This paper describes a number of methods to train statistical feature functions from some of the vector’s components. The feature functions are trained off-line on different types of text and their log-linear combination is then used to retrieve the best translation paths in the graph. We compare two language modelling toolkits, the CMU and the SRI toolkit and arrive at three results: 1) models of lemma-based feature functions produce better results than token-based models, 2) adding PoS-tag feature function to the lemma models improves the output and 3) weights for lexical translations are suited if the training material is similar to the texts to be translated.

Generalising Lexical Translation Strategies for MT Using Comparable Corpora
Bogdan Babych | Serge Sharoff | Anthony Hartley

We report on an on-going research project aimed at increasing the range of translation equivalents which can be automatically discovered by MT systems. The methodology is based on semi-supervised learning of indirect translation strategies from large comparable corpora and applying them in run-time to generate novel, previously unseen translation equivalents. This approach is different from methods based on parallel resources, which currently can reuse only individual translation equivalents. Instead it models translation strategies which generalise individual equivalents and can successfully generate an open class of new translation solutions. The task of the project is integration of the developed technology into open-source MT systems.

Post-MT Term Swapper: Supplementing a Statistical Machine Translation System with a User Dictionary
Masaki Itagaki | Takako Aikawa

A statistical machine translation (SMT) system requires homogeneous training data in order to get domain-sensitive (or context-sensitive) terminology translations. If the data contains various domains, it is difficult for an SMT to learn context-sensitive terminology mappings probabilistically. Yet, terminology translation accuracy is an important issue for MT users. This paper explores an approach to tackle this terminology translation problem for an SMT. We propose a way to identify terminology translations from MT output and automatically swap them with user-defined translations. Our approach is simple and can be applied to any type of MT system. We call our prototype “Term Swapper”. Term Swapper allows MT users to draw on their own dictionaries without affecting any parts of the MT output except for the terminology translation(s) in question. Using an SMT developed at Microsoft Research, called MSR-MT (Quirk et al., (2005); Menezes & Quirk (2005)), we conducted initial experiments to investigate the coverage rate of Term Swapper and its impact on the overall quality of MT output. The results from our experiments show high coverage and positive impact on the overall MT quality.

Using Parsed Corpora for Estimating Stochastic Inversion Transduction Grammars
Germán Sanchis | Joan Andreu Sánchez

An important problem when using Stochastic Inversion Transduction Grammars is their computational cost. More specifically, when dealing with corpora such as Europarl. only one iteration of the estimation algorithm becomes prohibitive. In this work, we apply a reduction of the cost by taking profit of the bracketing information in parsed corpora and show machine translation results obtained with a bracketed Europarl corpus, yielding interresting improvements when increasing the number of non-terminal symbols.

Experiments on Processing Overlapping Parallel Corpora
Mark Fishel | Heiki-Jaan Kaalep

The number and sizes of parallel corpora keep growing, which makes it necessary to have automatic methods of processing them: combining, checking and improving corpora quality, etc. We here introduce a method which enables performing many of these by exploiting overlapping parallel corpora. The method finds the correspondence between sentence pairs in two corpora: first the corresponding language parts of the corpora are aligned and then the two resulting alignments are compared. The method takes into consideration slight differences in the source documents, different levels of segmentation of the input corpora, encoding differences and other aspects of the task. The paper describes two experiments conducted to test the method. In the first experiment, the Estonian-English part of the JRC-Acquis corpus was combined with another corpus of legislation texts. In the second experiment alternatively aligned versions of the JRC-Acquis are compared to each other with the example of all language pairs between English, Estonian and Latvian. Several additional conclusions about the corpora can be drawn from the results. The method proves to be effective for several parallel corpora processing tasks.

Parser Evaluation and the BNC: Evaluating 4 constituency parsers with 3 metrics
Jennifer Foster | Josef van Genabith

We evaluate discriminative parse reranking and parser self-training on a new English test set using four versions of the Charniak parser and a variety of parser evaluation metrics. The new test set consists of 1,000 hand-corrected British National Corpus parse trees. We directly evaluate parser output using both the Parseval and the Leaf Ancestor metrics. We also convert the hand-corrected and parser output phrase structure trees to dependency trees using a state-of-the-art functional tag labeller and constituent-to-dependency conversion tool, and then calculate label accuracy, unlabelled attachment and labelled attachment scores over the dependency structures. We find that reranking leads to a performance improvement on the new test set (albeit a modest one). We find that self-training using BNC data leads to significantly better results. However, it is not clear how effective self-training is when the training material comes from the North American News Corpus.

EASY, Evaluation of Parsers of French: what are the Results?
Patrick Paroubek | Isabelle Robba | Anne Vilnat | Christelle Ayache

This paper presents EASY, which has been the first campaign evaluating syntactic parsers on all the common syntactic phenomena and a large set of dependency relations. The language analyzed was French. During this campaign, an annotation scheme has been elaborated with the different actors: participants and corpus providers; then a corpus made of several syntactic materials has been built and annotated: it reflects a great variety of linguistic styles (from literature to oral transcriptions, and from newspapers to medical texts). Both corpus and annotation scheme are here briefly presented. Moreover, evaluation measures are explained and detailed results are given. The results of the 15 parsers coming from 12 teams are analyzed. To conclude, a first experiment aiming to combine the outputs of the different systems is shown.

Evaluation Metrics for Automatic Temporal Annotation of Texts
Xavier Tannier | Philippe Muller

Recent years have seen increasing attention in temporal processing of texts as well as a lot of standardization effort of temporal information in natural language. A central part of this information lies in the temporal relations between events described in a text, when their precise times or dates are not known. Reliable human annotation of such information is difficult, and automatic comparisons must follow procedures beyond mere precision-recall of local pieces of information, since a coherent picture can only be considered at a global level. We address the problem of evaluation metrics of such information, aiming at fair comparisons between systems, by proposing some measures taking into account the globality of a text.

A Comparative Study on Language Identification Methods
Lena Grothe | Ernesto William De Luca | Andreas Nürnberger

In this paper we present two experiments conducted for comparison of different language identification algorithms. Short words-, frequent words- and n-gram-based approaches are considered and combined with the Ad-Hoc Ranking classification method. The language identification process can be subdivided into two main steps: first a document model is generated for the document and a language model for the language; second the language of the document is determined on the basis of the language model and is added to the document as additional information. In this work we present our evaluation results and discuss the importance of a dynamic value for the out-of-place measure.

PASSAGE: from French Parser Evaluation to Large Sized Treebank
Éric Villemonte de la Clergerie | Olivier Hamon | Djamel Mostefa | Christelle Ayache | Patrick Paroubek | Anne Vilnat

In this paper we present the PASSAGE project which aims at building automatically a French Treebank of large size by combining the output of several parsers, using the EASY annotation scheme. We present also the results of the of the first evaluation campaign of the project and the preliminary results we have obtained with our ROVER procedure for combining parsers automatically.

Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations
Jáchym Kolář | Jan Švec

Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output to forms that are more useful to humans and to downstream automatic processes. It may be achieved by inserting boundaries of syntactic/semantic units to the flow of speech, labeling non-content words like filled pauses and discourse markers for optional removal, and identifying sections of disfluent speech. This paper compares two Czech MDE speech corpora, one in the domain of broadcast news and the other in the domain of broadcast conversations. A variety of statistics about fillers, edit disfluencies, and syntactic/semantic units are presented. In addition, it is reported that disfluent portions of speech show differences in the distribution of parts of speech (POS) of their content in comparison with the general POS distribution. The two Czech corpora are not only compared with each other, but also with available numbers relating to English MDE corpora of broadcast news and telephone conversations.

Thai Broadcast News Corpus Construction and Evaluation
Markpong Jongtaveesataporn | Chai Wutiwiwatchai | Koji Iwano | Sadaoki Furui

Large speech and text corpora are crucial to the development of a state-of-the-art speech recognition system. This paper reports on the construction and evaluation of the first Thai broadcast news speech and text corpora. Specifications and conventions used in the transcription process are described in the paper. The speech corpus contains about 17 hours of speech data while the text corpus was transcribed from around 35 hours of television broadcast news. The characteristics of the corpus were analyzed and shown in the paper. The speech corpus was split according to the evaluation focus condition used in the DARPA Hub-4 evaluation. An 18K-word Thai speech recognition system was setup to test with this speech corpus as a preliminary experiment. Acoustic model adaptations were performed to improve the system performance. The best system yielded a word error rate of about 20% for clean and planned speech, and below 30% for the overall condition.

RUNDKAST: an Annotated Norwegian Broadcast News Speech Corpus
Ingunn Amdal | Ole Morten Strand | Jørn Almberg | Torbjørn Svendsen

This paper describes the Norwegian broadcast news speech corpus RUNDKAST. The corpus contains recordings of approximately 77 hours of broadcast news shows from the Norwegian broadcasting company NRK. The corpus covers both read and spontaneous speech as well as spontaneous dialogues and multipart discussions, including frequent occurrences of non-speech material (e.g. music, jingles). The recordings have large variations in speaking styles, dialect use and recording/transmission quality. RUNDKAST has been annotated for research in speech technology. The entire corpus has been manually segmented and transcribed using hierarchical levels. A subset of one hour of read and spontaneous speech from 10 different speakers has been manually annotated using broad phonetic labels. We provide a description of the database content, the annotation tools and strategies, and the conventions used for the different levels of annotation. A corpus of this kind has up to this point not been available for Norwegian, but is considered a necessary part of the infrastructure for language technology research in Norway. The RUNDKAST corpus is planned to be included in a future national Norwegian language resource bank.

First Broadcast News Transcription System for Khmer Language
Sopheap Seng | Sethserey Sam | Laurent Besacier | Brigitte Bigi | Eric Castelli

In this paper we present an overview on the development of a large vocabulary continuous speech recognition (LVCSR) system for Khmer, the official language of Cambodia, spoken by more than 15 million people. As an under-resourced language, develop a LVCSR system for Khmer is a challenging task. We describe our methodologies for quick language data collection and processing for language modeling and acoustic modeling. For language modeling, we investigate the use of word and sub-word as basic modeling unit in order to see the potential of sub-word units in the case of unsegmented language like Khmer. Grapheme-based acoustic modeling is used to quickly build our Khmer language acoustic model. Furthermore, the approaches and tools used for the development of our system are documented and made publicly available on the web. We hope this will contribute to accelerate the development of LVCSR system for a new language, especially for under-resource languages of developing countries where resources and expertise are limited.

Quick Rich Transcriptions of Arabic Broadcast News Speech Data
Chomicha Bendahman | Meghan Glenn | Djamel Mostefa | Niklas Paulsson | Stephanie Strassel

This paper describes the collect and transcription of a large set of Arabic broadcast news speech data. A total of more than 2000 hours of data was transcribed. The transcription factor for transcribing the broadcast news data has been reduced using a method such as Quick Rich Transcription (QRTR) as well as reducing the number of quality controls performed on the data. The data was collected from several Arabic TV and radio sources and from both Modern Standard Arabic and dialectal Arabic. The orthographic transcriptions included segmentation, speaker turns, topics, sentence unit types and a minimal noise mark-up. The transcripts were produced as a part of the GALE project.

A General Methodology for Mapping EuroWordNets to the Suggested Upper Merged Ontology
Dennis Spohr

This paper presents a general methodology to mapping EuroWordNets (Vossen, 1998) to the Suggested Upper Merged Ontology (SUMO; Niles and Pease (2001)), and we show its application to the French EuroWordNet. The process makes use of existing work on mapping Princeton WordNet (Fellbaum, 1998) to SUMO (Niles and Pease, 2003). After a general discussion of the usefulness of our approach, we provide details on the procedure of mapping individual EuroWordNet synsets to SUMO conceptual classes, and discuss issues arising from a fully automatic mapping. In addition to this, we present a quantitative analysis of the thus created semantic resource and discuss how the accuracy in determining the correct SUMO class for a particular EuroWordNet synset might be improved. Finally, we briefly hint at how such resources may be used, e.g. in order to extract selectional preferences of verbal predicates with respect to the ontological categories of their syntactic arguments.

Extended Named Entity Ontology with Attribute Information
Satoshi Sekine

Named Entities (NE) are regarded as an important type of semantic knowledge in many natural language processing (NLP) applications. Originally, a limited number of NE categories were proposed. In MUC, it was 7 categories - people, organization, location, time, date, money and percentage expressions. However, it was noticed that such a limited number of NE categories is too small for many applications. The author has proposed Extended Named Entity (ENE), which has about 200 categories (Sekine and Nobata 04). During the development of ENE, we noticed that many ENE categories have specific attributes, and those provide very important information for the entities. For example, “rivers” have attributes like “source location”, “outflow”, and “length”. Some such information is essential to “knowing about” the river, while the name is only a label which can be used to refer to the river. Also, such attributes are important information for many NLP applications. In this paper, we report on the design of a set of attributes for ENE categories. We used a bottom up approach to creating the knowledge using a Japanese encyclopedia, which contains abundant descriptions of ENE instances.

Towards a Glossary of Activities in the Ontology Engineering Field
Mari Carmen Suárez-Figueroa | Asunción Gómez-Pérez

The Semantic Web of the future will be characterized by using a very large number of ontologies embedded in ontology networks. It is important to provide strong methodological support for collaborative and context-sensitive development of networks of ontologies. This methodological support includes the identification and definition of which activities should be carried out when ontology networks are collaboratively built. In this paper we present the consensus reaching process followed within the NeOn consortium for the identification and definition of the activities involved in the ontology network development process. The consensus reaching process here presented produces as a result the NeOn Glossary of Activities. This work was conceived due to the lack of standardization in the Ontology Engineering terminology, which clearly contrasts with the Software Engineering field. Our future aim is to standardize the NeOn Glossary of Activities.

Chinese Core Ontology Construction from a Bilingual Term Bank
Yirong Chen | Qin Lu | Wenjie Li | Gaoying Cui

A core ontology is a mid-level ontology which bridges the gap between an upper ontology and a domain ontology. Automatic Chinese core ontology construction can help quickly model domain knowledge. A graph based core ontology construction algorithm (COCA) is proposed to automatically construct a core ontology from an English-Chinese bilingual term bank. This algorithm computes the mapping strength from a selected Chinese term to WordNet synset with association to an upper-level SUMO concept. The strength is measured using a graph model integrated with several mapping features from multiple information sources. The features include multiple translation feature between Chinese core term and WordNet, extended string feature and Part-of-Speech feature. Evaluation of COCA repeated on an English-Chinese bilingual Term bank with more than 130K entries shows that the algorithm is improved in performance compared with our previous research and can better serve the semi-automatic construction of mid-level ontology.

The European Thesaurus on International Relations and Area Studies - a Multilingual Resource for Indexing, Retrieval, and Translation
Michael Kluck | Axel Huckstorf

The multilingual European Thesaurus on International Relations and Area Studies (European Thesaurus) is a special subject thesaurus for the field of international affairs. It is intended for use in libraries and documentation centres of academic institutions and international organizations. The European Thesaurus was established in a collaborative project involving a number of leading European research institutes on international politics. It integrates the controlled terminologies of several existing thesauri. The European Thesaurus comprises about 8,200 terms and proper names from the 24 subject areas covered by the thesaurus. Because of its multilinguality, the European Thesaurus can not only be used for indexing, retrieval and terminological reference, but serves also as a translation tool for the languages represented. The establishment of cross-concordances to related thesauri extends the range of application of the European Thesaurus even further. They enable the treatment of semantic heterogeneity within subject gateways. The European Thesaurus is available both in a seven-lingual print-version as well as in an eight-lingual online-version. To reflect the changes in terminology the European Thesau-rus is regularly being amended and modified. Further languages are going to be included.

Building Bilingual Lexicons using Lexical Translation Probabilities via Pivot Languages
Takashi Tsunakawa | Naoaki Okazaki | Jun’ichi Tsujii

This paper proposes a method of increasing the size of a bilingual lexicon obtained from two other bilingual lexicons via a pivot language. When we apply this approach, there are two main challenges, “ambiguity” and “mismatch” of terms; we target the latter problem by improving the utilization ratio of the bilingual lexicons. Given two bilingual lexicons between language pairs Lf-Lp and Lp-Le, we compute lexical translation probabilities of word pairs by using a statistical word-alignment model, and term decomposition/composition techniques. We compare three approaches to generate the bilingual lexicon: “exact merging”, “word-based merging”, and our proposed “alignment-based merging”. In our method, we combine lexical translation probabilities and a simple language model for estimating the probabilities of translation pairs. The experimental results show that our method could drastically improve the number of translation terms compared to the two methods mentioned above. Additionally, we evaluated and discussed the quality of the translation outputs.

Improving Statistical Machine Translation Efficiency by Triangulation
Yu Chen | Andreas Eisele | Martin Kay

In current phrase-based Statistical Machine Translation systems, more training data is generally better than less. However, a larger data set eventually introduces a larger model that enlarges the search space for the decoder, and consequently requires more time and more resources to translate. This paper describes an attempt to reduce the model size by filtering out the less probable entries based on testing correlation using additional training data in an intermediate third language. The central idea behind the approach is triangulation, the process of incorporating multilingual knowledge in a single system, which eventually utilizes parallel corpora available in more than two languages. We conducted experiments using Europarl corpus to evaluate our approach. The reduction of the model size can be up to 70% while the translation quality is being preserved.

Phrase-Based Machine Translation based on Simulated Annealing
Caroline Lavecchia | David Langlois | Kamel Smaïli

In this paper, we propose a new phrase-based translation model based on inter-lingual triggers. The originality of our method is double. First we identify common source phrases. Then we use inter-lingual triggers in order to retrieve their translations. Furthermore, we consider the way of extracting phrase translations as an optimization issue. For that we use simulated annealing algorithm to find out the best phrase translations among all those determined by inter-lingual triggers. The best phrases are those which improve the translation quality in terms of Bleu score. Tests are achieved on movie subtitle corpora. They show that our phrase-based machine translation (PBMT) system outperforms a state-of-the-art PBMT system by almost 7 points.

Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
Marine Carpuat | Dekai Wu

We present new direct data analysis showing that dynamically-built context-dependent phrasal translation lexicons are more useful resources for phrase-based statistical machine translation (SMT) than conventional static phrasal translation lexicons, which ignore all contextual information. After several years of surprising negative results, recent work suggests that context-dependent phrasal translation lexicons are an appropriate framework to successfully incorporate Word Sense Disambiguation (WSD) modeling into SMT. However, this approach has so far only been evaluated using automatic translation quality metrics, which are important, but aggregate many different factors. A direct analysis is still needed to understand how context-dependent phrasal translation lexicons impact translation quality, and whether the additional complexity they introduce is really necessary. In this paper, we focus on the impact of context-dependent translation lexicons on lexical choice in phrase-based SMT and show that context-dependent lexicons are more useful to a phrase-based SMT system than a conventional lexicon. A typical phrase-based SMT system makes use of more and longer phrases with context modeling, including phrases that were not seen very frequently in training. Even when the segmentation is identical, the context-dependent lexicons yield translations that match references more often than conventional lexicons.

A Multi-Genre SMT System for Arabic to French
Saša Hasan | Hermann Ney

This work presents improvements of a large-scale Arabic to French statistical machine translation system over a period of three years. The development includes better preprocessing, more training data, additional genre-specific tuning for different domains, namely newswire text and broadcast news transcripts, and improved domain-dependent language models. Starting with an early prototype in 2005 that participated in the second CESTA evaluation, the system was further upgraded to achieve favorable BLEU scores of 44.8% for the text and 41.1% for the audio setting. These results are compared to a system based on the freely available Moses toolkit. We show significant gains both in terms of translation quality (up to +1.2% BLEU absolute) and translation speed (up to 16 times faster) for comparable configuration settings.

Investigating the Structure of Procedural Texts for Answering How-to Questions
Estelle Delpech | Patrick Saint-Dizier

This paper presents ongoing work dedicated to parsing the textual structure of procedural texts. We propose here a model for the intructional structure and criteria to identify its main components: titles, instructions, warnings and prerequisites. The main aim of this project, besides a contribution to text processing, is to be able to answer procedural questions (How-to? questions), where the answer is a well-formed portion of a text, not a small set of words as for factoid questions.

Analysis and Performance of Morphological Query Expansion and Language-Filtering Words on Basque Web Searching
Igor Leturia | Antton Gurrutxaga | Nerea Areta | Eli Pociello

Morphological query expansion and language-filtering words have proved to be valid methods when searching the web for content in Basque via APIs of commercial search engines, as the implementation of these methods in recent IR and web-as-corpus tools shows, but no real analysis has been carried out to ascertain the degree of improvement, apart from a comparison of recall and precision using a classical web search engine and measured in terms of hit counts. This paper deals with a more theoretical study that confirms the validity of the combination of both methods. We have measured the increase in recall obtained by morphological query expansion and the increase in precision and loss in recall produced by language-filtering-words, but not only by searching the web directly and looking at the hit counts “which are not considered to be very reliable at best”, but also using both a Basque web corpus and a classical lemmatised corpus, thus providing more exact quantitative results. Furthermore, we provide various corpora-extracted data to be used in the aforementioned methods, such as lists of the most frequent inflections and declinations (cases, persons, numbers, times, etc.) for each POS “the most interesting word forms for a morphologically expanded query”, or a list of the most used Basque words with their frequencies and document-frequencies “the ones that should be used as language-filtering words”.

Scaling Answer Type Detection to Large Hierarchies
Kirk Roberts | Andrew Hickl

This paper describes the creation of a state-of-the-art answer type detection system capable of recognizing more than 200 different expected answer types with greater than 85% precision and recall. After describing how we constructed a new, multi-tiered answer type hierarchy from the set of entity types recognized by Language Computer Corporation’s CICEROLITE named entity recognition system, we describe how we used this hierarchy to annotate a new corpus of more than 10,000 English factoid questions. We show how an answer type detection system trained on this corpus can be used to enhance the accuracy of a state-of-the-art question-answering system (Hickl et al., 2007; Hickl et al., 2006b) by more than 7% overall.

Answering List Questions using Co-occurrence and Clustering
Majid Razmara | Leila Kosseim

Although answering list questions is not a new research area, answering them automatically still remains a challenge. The median F-score of systems that participated in TREC 2007 Question Answering track is still very low (0.085) while 74% of the questions had a median F-score of 0. In this paper, we propose a novel approach to answering list questions. This approach is based on the hypothesis that answer instances of a list question co-occur in the documents and sentences related to the topic of the question. We use a clustering method to group the candidate answers that co-occur more often. To pinpoint the right cluster, we use the target and the question keywords as spies to return the cluster that contains these keywords.

Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
Torsten Zesch | Christof Müller | Iryna Gurevych

Recently, collaboratively constructed resources such as Wikipedia and Wiktionary have been discovered as valuable lexical semantic knowledge bases with a high potential in diverse Natural Language Processing (NLP) tasks. Collaborative knowledge bases however significantly differ from traditional linguistic knowledge bases in various respects, and this constitutes both an asset and an impediment for research in NLP. This paper addresses one such major impediment, namely the lack of suitable programmatic access mechanisms to the knowledge stored in these large semantic knowledge bases. We present two application programming interfaces for Wikipedia and Wiktionary which are especially designed for mining the rich lexical semantic information dispersed in the knowledge bases, and provide efficient and structured access to the available knowledge. As we believe them to be of general interest to the NLP community, we have made them freely available for research purposes.

Odds of Successful Transfer of Low-Level Concepts: a Key Metric for Bidirectional Speech-to-Speech Machine Translation in DARPA’s TRANSTAC Program
Gregory Sanders | Sébastien Bronsart | Sherri Condon | Craig Schlenoff

The Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program is a Defense Advanced Research Agency (DARPA) program to create bidirectional speech-to-speech machine translation (MT) that will allow U.S. Soldiers and Marines, speaking only English, to communicate, in tactical situations, with civilian populations who speak only other languages (for example, Iraqi Arabic). A key metric for the program is the odds of successfully transferring low-level concepts, defined as the source-language content words. The National Institute of Standards and Technology (NIST) has now carried out two large-scale evaluations of TRANSTAC systems, using that metric. In this paper we discuss the merits of that metric. It has proven to be quite informative. We describe exactly how we defined this metric and how we obtained values for it from panels of bilingual judges allowing others to do what we have done. We compare results on this metric to results on Likert-type judgments of semantic adequacy, from the same panels of bilingual judges, as well as to a suite of typical automated MT metrics (BLEU, TER, METEOR).

Question Answering on Speech Transcriptions: the QAST evaluation in CLEF
Lori Lamel | Sophie Rosset | Christelle Ayache | Djamel Mostefa | Jordi Turmo | Pere Comas

This paper reports on the QAST track of CLEF aiming to evaluate Question Answering on Speech Transcriptions. Accessing information in spoken documents provides additional challenges to those of text-based QA, needing to address the characteristics of spoken language, as well as errors in the case of automatic transcriptions of spontaneous speech. The framework and results of the pilot QAst evaluation held as part of CLEF 2007 is described, illustrating some of the additional challenges posed by QA in spoken documents relative to written ones. The current plans for future multiple-language and multiple-task QAst evaluations are described.

Evaluation of Spoken Document Retrieval for Historic Speech Collections
Willemijn Heeren | Franciska de Jong | Laurens van der Werff | Marijn Huijbregts | Roeland Ordelman

The re-use of spoken word audio collections maintained by audiovisual archives is severely hindered by their generally limited access. The CHoral project, which is part of the CATCH program funded by the Dutch Research Council, aims to provide users of speech archives with online, instead of on-location, access to relevant fragments, instead of full documents. To meet this goal, a spoken document retrieval framework is being developed. In this paper the evaluation efforts undertaken so far to assess and improve various aspects of the framework are presented. These efforts include (i) evaluation of the automatically generated textual representations of the spoken word documents that enable word-based search, (ii) the development of measures to estimate the quality of the textual representations for use in information retrieval, and (iii) studies to establish the potential user groups of the to-be-developed technology, and the first versions of the user interface supporting online access to spoken word collections.

Applying Automated Metrics to Speech Translation Dialogs
Sherri Condon | Jon Phillips | Christy Doran | John Aberdeen | Dan Parvaz | Beatrice Oshika | Greg Sanders | Craig Schlenoff

Over the past five years, the Defense Advanced Research Projects Agency (DARPA) has funded development of speech translation systems for tactical applications. A key component of the research program has been extensive system evaluation, with dual objectives of assessing progress overall and comparing among systems. This paper describes the methods used to obtain BLEU, TER, and METEOR scores for two-way English-Iraqi Arabic systems. We compare the scores with measures based on human judgments and demonstrate the effects of normalization operations on BLEU scores. Issues that are highlighted include the quality of test data and differential results of applying automated metrics to Arabic vs. English.

A Three-stage Disfluency Classifier for Multi Party Dialogues
Margot Mieskes | Michael Strube

We present work on a three-stage system to detect and classify disfluencies in multi party dialogues. The system consists of a regular expression based module and two machine learning based modules. The results are compared to other work on multi party dialogues and we show that our system outperforms previously reported ones.

Towards Heterogeneous Automatic MT Error Analysis
Jesús Giménez | Lluís Màrquez

This work studies the viability of performing heterogeneous automatic MT error analyses. Error analysis is, undoubtly, one of the most crucial stages in the development cycle of an MT system. However, often not enough attention is paid to this process. The reason is that performing an accurate error analysis requires intensive human labor. In order to speed up the error analysis process, we suggest partially automatizing it by having automatic evaluation metrics play a more active role. For that purpose, we have compiled a large and heterogeneous set of features at different linguistic levels and at different levels of granularity. Through a practical case study, we show how these features provide an effective means of ellaborating interpretable and detailed automatic reports of translation quality.

Sensitivity of Automated MT Evaluation Metrics on Higher Quality MT Output: BLEU vs Task-Based Evaluation Methods
Bogdan Babych | Anthony Hartley

We report the results of our experiment on assessing the ability of automated MT evaluation metrics to remain sensitive to variations in MT quality as the average quality of the compared systems goes up. We compare two groups of metrics: those, which measure the proximity of MT output to some reference translation, and those which evaluate the performance of some automated process on degraded MT output. The experiment shows that proximity-based metrics (such as BLEU) loose sensitivity as the scores go up, but performance-based metrics (e.g., Named Entity recognition from MT output) remain sensitive across the scale. We suggest a model for explaining this result, which attributes stable sensitivity of performance-based metrics to measuring cumulative functional effect of different language levels, while proximity-based metrics measure structural matches on a lexical level and therefore miss higher-level errors that are more typical for better MT systems. Development of new automated metrics should take into account possible decline in sensitivity on higher-quality MT, which should be tested as part of meta-evaluation of the metrics.

Translation Adequacy and Preference Evaluation Tool (TAP-ET)
Mark Przybocki | Kay Peterson | Sébastien Bronsart

Evaluation of Machine Translation (MT) technology is often tied to the requirement for tedious manual judgments of translation quality. While automated MT metrology continues to be an active area of research, a well known and often accepted standard metric is the manual human assessment of adequacy and fluency. There are several software packages that have been used to facilitate these judgments, but for the 2008 NIST Open MT Evaluation, NIST’s Speech Group created an online software tool to accommodate the requirement for centralized data and distributed judges. This paper introduces the NIST TAP-ET application and reviews the reasoning underlying its design. Where available, analysis of data sets judged for Adequacy and Preference using the TAP-ET application will be presented. TAP-ET is freely available and ready to download, and contains a variety of customizable features.

Evaluation of a Cross-lingual Romanian-English Multi-document Summariser
Constantin Orăsan | Oana Andreea Chiorean

The rapid growth of the Internet means that more information is available than ever before. Multilingual multi-document summarisation offers a way to access this information even when it is not in a language spoken by the reader by extracting the gist from related documents and translating it automatically. This paper presents an experiment in which Maximal Marginal Relevance (MMR), a well known multi-document summarisation method, is used to produce summaries from Romanian news articles. A task-based evaluation performed on both the original summaries and on their automatically translated versions reveals that they still contain a significant portion of the important information from the original texts. However, direct evaluation of the automatically translated summaries shows that they are not very legible and this can put off some readers who want to find out more about a topic.

The BNC Parsed with RASP4UIMA
Øistein E. Andersen | Julien Nioche | Ted Briscoe | John Carroll

We have integrated the RASP system with the UIMA framework (RASP4UIMA) and used this to parse the XML-encoded version of the British National Corpus (BNC). All original annotation is preserved, and parsing information, mainly in the form of grammatical relations, is added in an XML format. A few specific adaptations of the system to give better results with the BNC are discussed briefly. The RASP4UIMA system is publicly available and can be used to parse other corpora or document collections, and the final parsed version of the BNC will be deposited with the Oxford Text Archive.

Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and its Application
Kiyotaka Uchimoto | Yasuharu Den

In Japanese, the syntactic structure of a sentence is generally represented by the relationship between phrasal units, bunsetsus in Japanese, based on a dependency grammar. In many cases, the syntactic structure of a bunsetsu is not considered in syntactic structure annotation. This paper gives the criteria and definitions of dependency relationships between words in a bunsetsu and their applications. The target corpus for the word-level dependency annotation is a large spontaneous Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ). One application of word-level dependency relationships is to find basic units for constructing accent phrases.

Induction of Treebank-Aligned Lexical Resources
Tejaswini Deoskar | Mats Rooth

We describe the induction of lexical resources from unannotated corpora that are aligned with treebank grammars, providing a systematic correspondence between features in the lexical resource and a treebank syntactic resource. We first describe a methodology based on parsing technology for augmenting a treebank database with linguistic features. A PCFG containing these features is created from the augmented treebank. We then use a procedure based on the inside-outside algorithm to learn lexical resources aligned with the treebank PCFG from large unannotated corpora. The method has been applied in creating a feature-annotated English treebank based on the Penn Treebank. The unsupervised estimation procedure gives a substantial error reduction (up to 31.6%) on the task of learning the subcategorization preference of novel verbs that are not present in the annotated training sample.

A Unified Database of Dependency Treebanks: Integrating, Quantifying & Evaluating Dependency Data
Olga Pustylnikov | Alexander Mehler | Rüdiger Gleim

This paper describes a database of 11 dependency treebanks which were unified by means of a two-dimensional graph format. The format was evaluated with respect to storage-complexity on the one hand, and efficiency of data access on the other hand. An example of how the treebanks can be integrated within a unique interface is given by means of the DTDB interface.

Amazigh Language Terminology in Morocco or Management of a “Multidimensional” Variation
Aicha Bouhjar

The present communication brings to the fore the work undertaken at the Royal Institute of the Amazigh Culture (IRCAM, henceforth) within the Language Planning Center known as “Centre de l’Aménagement Linguistique” (CAL) within the framework of the language planning of Amazigh, particularly on the side of terminology. The focus will be on the concept of “variation” that affects different levels in the course of standardizing a language: orthography, spelling, grammar and lexis. Thus, after a brief survey of the main features of the Amazigh (Berber) language in general, the missions and the projects far achieved by CAL will be presented, particularly the objectives that relate to the work on the multiply varied corpus-based terminology. It appears that eliciting the pertinent information, for the most part, requires a whole amount of work on the re-writing of corpora so that the latter become exploitable in the standardization process. It should be pointed out that this stage of data homogenization, seemingly unwieldy for optimal exploitation, cannot be undertaken Amazighist linguists being involved in theoretical and methodological presuppositions that are at the root of this variation.

Chinese Term Extraction Based on Delimiters
Yuhang Yang | Qin Lu | Tiejun Zhao

Existing techniques extract term candidates by looking for internal and contextual information associated with domain specific terms. The algorithms always face the dilemma that fewer features are not enough to distinguish terms from non-terms whereas more features lead to more conflicts among selected features. This paper presents a novel approach for term extraction based on delimiters which are much more stable and domain independent. The proposed approach is not as sensitive to term frequency as that of previous works. This approach has no strict limit or hard rules and thus they can deal with all kinds of terms. It also requires no prior domain knowledge and no additional training to adapt to new domains. Consequently, the proposed approach can be applied to different domains easily and it is especially useful for resource-limited domains. Evaluations conducted on two different domains for Chinese term extraction show significant improvements over existing techniques which verifies its efficiency and domain independent nature. Experiments on new term extraction indicate that the proposed approach can also serve as an effective tool for domain lexicon expansion.

A Multi-Word Term Extraction Program for Arabic Language
Siham Boulaknadel | Beatrice Daille | Driss Aboutajdine

Terminology extraction commonly includes two steps: identification of term-like units in the texts, mostly multi-word phrases, and the ranking of the extracted term-like units according to their domain representativity. In this paper, we design a multi-word term extraction program for Arabic language. The linguistic filtering performs a morphosyntactic analysis and takes into account several types of variations. The domain representativity is measure thanks to statistical scores. We evalutate several association measures and show that the results we otained are consitent with those obtained for Romance languages.

Using Similarity Metrics For Terminology Recognition
Jonathan Butters | Fabio Ciravegna

In this paper we present an approach to terminology recognition whereby a sublanguage term (e.g. an aircraft engine component term extracted from a maintenance log) is matched to its corresponding term from a pre-defined list (such as a taxonomy representing the official break-down of the engine). Terminology recognition is addressed as a classification task whereby the extracted term is associated to one or more potential terms in the official description list via the application of string similarity metrics. The solution described in the paper uses dynamically computed similarity cut-off thresholds calculated on the basis of modeling a noise curve. Dissimilar string matches form a Gaussian distributed noise curve that can be identified and extracted leaving only mostly similar string matches. Dynamically calculated thresholds are preferable over fixed similarity thresholds as fixed thresholds are inherently imprecise, that is, there is no similarity boundary beyond which any two strings always describe the same concept.

Resources for Persuasion
Marco Guerini | Carlo Strapparava | Oliviero Stock

This paper presents resources and strategies for persuasive natural language processing. After the introduction of a specifically tagged corpus, some techniques for affective language processing and for persuasive lexicon extraction are provided together with prospective scenarios of application.

Semi-automatic Building Method for a Multidimensional Affect Dictionary for a New Language
Guillaume Pitel | Gregory Grefenstette

Detecting the tone or emotive content of a text message is increasingly important in many natural language processing applications. While for the English language there exists a number of affect, emotive, opinion, or affect computer-usable lexicons for automatically processing text, other languages rarely possess these primary resources. Here we present a semi-automatic technique for quickly building a multidimensional affect lexicon for a new language. Most of the work consists of defining 44 paired affect directions (e.g. love-hate, courage-fear, etc.) and choosing a small number of seed words for each dimension. From this initial investment, we show how a first pass affect lexicon can be created for new language, using a SVM classifier trained on a feature space produced from Latent Semantic Analysis over a large corpus in the new language. We evaluate the accuracy of placing newly found emotive words in one or more of the defined semantic dimensions. We illustrate this technique by creating an affect lexicon for French, but the techniques can be applied to any language found on the Web and for which a large quantity of text exists.

Coding Emotional Events in Audiovisual Corpora
Laurence Devillers | Jean-Claude Martin

The modelling of realistic emotional behaviour is needed for various applications in multimodal human-machine interaction such as the design of emotional conversational agents (Martin et al., 2005) or of emotional detection systems (Devillers and Vidrascu, 2007). Yet, building such models requires appropriate definition of various levels for representing the emotions themselves but also some contextual information such as the events that elicit these emotions. This paper presents a coding scheme that has been defined following annotations of a corpus of TV interviews (EmoTV). Deciding which events triggered or may trigger which emotion is a challenge for building efficient emotion eliciting protocols. In this paper, we present the protocol that we defined for collecting another corpus of spontaneous human-human interactions recorded in laboratory conditions (EmoTaboo). We discuss the events that we designed for eliciting emotions. Part of this scheme for coding emotional event is being included in the specifications that are currently defined by a working group of the W3C (the W3C Emotion Incubator Working group). This group is investigating the feasibility of working towards a standard representation of emotions and related states in technological contexts.

Annotating Expressions of Opinion and Emotion in the Italian Content Annotation Bank
Andrea Esuli | Fabrizio Sebastiani | Ilaria Urciuoli

In this paper we describe the result of manually annotating I-CAB, the Italian Content Annotation Bank, by expressions of private state (EPSs), i.e., expressions that denote the presence of opinions, emotions, and other cognitive states. The aim of this effort was the generation of a standard resource for supporting the development of opinion extraction algorithms for Italian, and of a benchmark for testing such algorithms. To this end we have employed a previously existing annotation language (here dubbed WWC, from the initials of its proponents). We here describe the results of this annotation effort, including the results of a thorough inter-annotator agreement test. We conclude by discussing how WWC can be adapted to the specificities of a Romance language such as Italian.

Adjectives in the Dutch Semantic Lexical Database CORNETTO
Isa Maks | Piek Vossen | Roxane Segers | Hennie van der Vliet

The goal of this paper is to describe how adjectives are encoded in Cornetto, a semantic lexical database for Dutch. Cornetto combines two existing lexical resources with different semantic organisation, i.e. Dutch Wordnet (DWN) with a synset organisation and Referentie Bestand Nederlands (RBN) with an organisation in Lexical Units. Both resources will be aligned and mapped on the formal ontology SUMO. In this paper, we will first present details of the description of adjectives in each of the the two resources. We will then address the problems that are encountered during alignment to the SUMO ontology which are greatly due to the fact that SUMO has never been tested for its adequacy with respect to adjectives. We contrasted SUMO with an existing semantic classification which resulted in a further refined and extended SUMO geared for the description of adjectives.

Detecting Errors in Semantic Annotation
Markus Dickinson | Chong Min Lee

We develop a method for detecting errors in semantic predicate-argument annotation, based on the variation n-gram error detection method. After establishing an appropriate data representation, we detect inconsistencies by searching for identical text with varying annotation. By remaining data-driven, we are able to detect inconsistencies arising from errors at lower layers of annotation.

Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information
Michael Roth | Sabine Schulte im Walde

Distributional, corpus-based descriptions have frequently been applied to model aspects of word meaning. However, distributional models that use corpus data as their basis have one well-known disadvantage: even though the distributional features based on corpus co-occurrence were often successful in capturing meaning aspects of the words to be described, they generally fail to capture those meaning aspects that refer to world knowledge, because coherent texts tend not to provide redundant information that is presumably available knowledge. The question we ask in this paper is whether dictionary and encyclopaedic resources might complement the distributional information in corpus data, and provide world knowledge that is missing in corpora. As test case for meaning aspects, we rely on a collection of semantic associates to German verbs and nouns. Our results indicate that a combination of the knowledge resources should be helpful in work on distributional descriptions.

Ontology Learning and Semantic Annotation: a Necessary Symbiosis
Emiliano Giovannetti | Simone Marchi | Simonetta Montemagni | Roberto Bartolini

Semantic annotation of text requires the dynamic merging of linguistically structured information and a “world model”, usually represented as a domain-specific ontology. On the other hand, the process of engineering a domain-ontology through semi-automatic ontology learning system requires the availability of a considerable amount of semantically annotated documents. Facing this bootstrapping paradox requires an incremental process of annotation-acquisition-annotation, whereby domain-specific knowledge is acquired from linguistically-annotated texts and then projected back onto texts for extra linguistic information to be annotated and further knowledge layers to be extracted. The presented methodology is a first step in the direction of a full “virtuous” circle where the semantic annotation platform and the evolving ontology interact in symbiosis. As a case study we have chosen the semantic annotation of product catalogues. We propose a hybrid approach, combining pattern matching techniques to exploit the regular structure of product descriptions in catalogues, and Natural Language Processing techniques which are resorted to analyze natural language descriptions. The semantic annotation involves the access to the ontology, semi-automatically bootstrapped with an ontology learning tool from annotated collections of catalogues.

Semantically Annotated Snapshot of the English Wikipedia
Jordi Atserias | Hugo Zaragoza | Massimiliano Ciaramita | Giuseppe Attardi

This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia processed differently might make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers of both NLP and IR communities by building a reference corpus to homogenize experiments and make results comparable. These resources, a semantically annotated corpus and a “entity containment” derived graph, are licensed under the GNU Free Documentation License and available from

Annotating Students’ Understanding of Science Concepts
Rodney D. Nielsen | Wayne Ward | James Martin | Martha Palmer

This paper summarizes the annotation of fine-grained entailment relationships in the context of student answers to science assessment questions. We annotated a corpus of 15,357 answer pairs with 145,911 fine-grained entailment relationships. We provide the rationale for such fine-grained analysis and discuss its perceived benefits to an Intelligent Tutoring System. The corpus also has potential applications in other areas, such as question answering and multi-document summarization. Annotators achieved 86.2% inter-annotator agreement (Kappa=0.728, corresponding to substantial agreement) annotating the fine-grained facets of reference answers with regard to understanding expressed in student answers and labeling from one of five possible detailed relationship categories. The corpus described in this paper, which is the only one providing such detailed entailment annotations, is available as a public resource for the research community. The corpus is expected to enable application development, not only for intelligent tutoring systems, but also for general textual entailment applications, that is currently not practical.

Relation between Agreement Measures on Human Labeling and Machine Learning Performance: Results from an Art History Domain
Rebecca Passonneau | Tom Lippincott | Tae Yano | Judith Klavans

We discuss factors that affect human agreement on a semantic labeling task in the art history domain, based on the results of four experiments where we varied the number of labels annotators could assign, the number of annotators, the type and amount of training they received, and the size of the text span being labeled. Using the labelings from one experiment involving seven annotators, we investigate the relation between interannotator agreement and machine learning performance. We construct binary classifiers and vary the training and test data by swapping the labelings from the seven annotators. First, we find performance is often quite good despite lower than recommended interannotator agreement. Second, we find that on average, learning performance for a given functional semantic category correlates with the overall agreement among the seven annotators for that category. Third, we find that learning performance on the data from a given annotator does not correlate with the quality of that annotator’s labeling. We offer recommendations for the use of labeled data in machine learning, and argue that learners should attempt to accommodate human variation. We also note implications for large scale corpus annotation projects that deal with similarly subjective phenomena.

The Construction and Evaluation of Word Space Models
Yves Peirsman | Simon De Deyne | Kris Heylen | Dirk Geeraerts

Semantic similarity is a key issue in many computational tasks. This paper goes into the development and evaluation of two common ways of automatically calculating the semantic similarity between two words. On the one hand, such methods may depend on a manually constructed thesaurus like (Euro)WordNet. Their performance is often evaluated on the basis of a very restricted set of human similarity ratings. On the other hand, corpus-based methods rely on the distribution of two words in a corpus to determine their similarity. Their performance is generally quantified through a comparison with the judgements of the first type of approach. This paper introduces a new Gold Standard of more than 5,000 human intra-category similarity judgements. We show that corpus-based methods often outperform (Euro)WordNet on this data set, and that the use of the latter as a Gold Standard for the former, is thus often far from ideal.

Annotation of Nuggets and Relevance in GALE Distillation Evaluation
Olga Babko-Malaya

This paper presents an approach to annotation that BAE Systems has employed in the DARPA GALE Phase 2 Distillation evaluation. The purpose of the GALE Distillation evaluation is to quantify the amount of relevant and non-redundant information a distillation engine is able to produce in response to a specific, formatted query; and to compare that amount of information to the amount of information gathered by a bilingual human using commonly available state-of-the-art tools. As part of the evaluation, following NIST evaluation methodology of complex question answering (Voorhees, 2003), human annotators were asked to establish the relevancy of responses as well as the presence of atomic facts or information units, called nuggets of information. This paper discusses various challenges to the annotation of nuggets, called nuggetization, which include interaction between the granularity of nuggets and relevancy of these nuggets to the query in question. The approach proposed in the paper views nuggetization as a procedural task and allows annotators to revisit nuggetization based on the requirements imposed by the relevancy guidelines defined with a specific end-user in mind. This approach is shown in the paper to produce consistent annotations with high inter-annotator agreement scores.

Statistical Evaluation of Information Distillation Systems
J.V. White | D. Hunter | J.D. Goldstein

We describe a methodology for evaluating the statistical performance of information distillation systems and apply it to a simple illustrative example. (An information distiller provides written English responses to English queries based on automated searches/transcriptions/translations of English and foreign-language sources. The sources include written documents and sound tracks.) The evaluation methodology extracts information nuggets from the distiller response texts and gathers them into fuzzy equivalence classes called nugs. Themethodology supports the usual performancemetrics, such as recall and precision, as well as a new information-theoretic metric called proficiency, which measures how much information a distiller provides relative to all of the information provided by a collection of distillers working on a common query and corpora. Unlike previous evaluation techniques, the methodology evaluates the relevance, granularity, and redundancy of information nuggets explicitly.

Automatic Learning and Evaluation of User-Centered Objective Functions for Dialogue System Optimisation
Verena Rieser | Oliver Lemon

The ultimate goal when building dialogue systems is to satisfy the needs of real users, but quality assurance for dialogue strategies is a non-trivial problem. The applied evaluation metrics and resulting design principles are often obscure, emerge by trial-and-error, and are highly context dependent. This paper introduces data-driven methods for obtaining reliable objective functions for system design. In particular, we test whether an objective function obtained from Wizard-of-Oz (WOZ) data is a valid estimate of real users’ preferences. We test this in a test-retest comparison between the model obtained from the WOZ study and the models obtained when testing with real users. We can show that, despite a low fit to the initial data, the objective function obtained from WOZ data makes accurate predictions for automatic dialogue evaluation, and, when automatically optimising a policy using these predictions, the improvement over a strategy simply mimicking the data becomes clear from an error analysis.

Building the Valency Lexicon of Arabic Verbs
Viktor Bielický | Otakar Smrž

This paper describes the building of a valency lexicon of Arabic verbs using a morphologically and syntactically annotated corpus, the Prague Arabic Dependency Treebank (PADT), as its primary source. We present the theoretical account on valency developed within the Functional Generative Description (FGD) theory. We apply the framework to Modern Standard Arabic and discuss various valency-related phenomena with respect to examples from the corpus. We then outline the methodology and the linguistic and technical resources used in the building of the lexicon. The key concept in our scenario is that of PDT-VALLEX of Czech. Our lexicon will be developed by linking the conceivable entries with their instances in the treebank. Conversely, the treebank’s annotations will be linked to the lexicon. While a comparable scheme has been developed for Czech, our own contribution is to design and implement this model thoroughly for Arabic and the PADT data. The Arabic valency lexicon is intended for applications in computational parsing or language generation, and for use by human researchers. The proposed valency lexicon will be exploited in particular during further tectogrammatical annotations of PADT and might serve for enriching the expected second edition of the corpus-based Arabic-Czech Dictionary.

Combining Terminology Resources and Statistical Methods for Entity Recognition: an Evaluation
Angus Roberts | Robert Gaizasukas | Mark Hepple | Yikun Guo

Terminologies and other knowledge resources are widely used to aid entity recognition in specialist domain texts. As well as providing lexicons of specialist terms, linkage from the text back to a resource can make additional knowledge available to applications. Use of such resources is especially pertinent in the biomedical domain, where large numbers of these resources are available, and where they are widely used in informatics applications. Terminology resources can be most readily used by simple lexical lookup of terms in the text. A major drawback with such lexical lookup, however, is poor precision caused by ambiguity between domain terms and general language words. We combine lexical lookup with simple filtering of ambiguous terms, to improve precision. We compare this lexical lookup with a statistical method of entity recognition, and to a method which combines the two approaches. We show that the combined method boosts precision with little loss of recall, and that linkage from recognised entities back to the domain knowledge resources can be maintained.

A Suite to Compile and Analyze an LSP Corpus
Rogelio Nazar | Jorge Vivaldi | Teresa Cabré

This paper presents a series of tools for the extraction of specialized corpora from the web and its subsequent analysis mainly with statistical techniques. It is an integrated system of original as well as standard tools and has a modular conception that facilitates its re-integration on different systems. The first part of the paper describes the original techniques, which are devoted to the categorization of documents as relevant or irrelevant to the corpus under construction, considering relevant a specialized document of the selected technical domain. Evaluation figures are provided for the original part, but not for the second part involving the analysis of the corpus, which is composed of algorithms that are well known in the field of Natural Language Processing, such as Kwic search, measures of vocabulary richness, the sorting of n-grams by frequency of occurrence or by measures of statistical association, distribution or similarity.

Causal Relation Extraction
Eduardo Blanco | Nuria Castell | Dan Moldovan

This paper presents a supervised method for the detection and extraction of Causal Relations from open domain text. First we give a brief outline of the definition of causation and how it relates to other Semantic Relations, as well as a characterization of their encoding. In this work, we only consider marked and explicit causations. Our approach first identifies the syntactic patterns that may encode a causation, then we use Machine Learning techniques to decide whether or not a pattern instance encodes a causation. We focus on the most productive pattern, a verb phrase followed by a relator and a clause, and its reverse version, a relator followed by a clause and a verb phrase. As relators we consider the words as, after, because and since. We present a set of lexical, syntactic and semantic features for the classification task, their rationale and some examples. The results obtained are discussed and the errors analyzed.

Learning Morphology with Morfette
Grzegorz Chrupala | Georgiana Dinu | Josef van Genabith

Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using the Maximum Entropy classifier. The third module dynamically combines the predictions of the Maximum-Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module exploits the idea of recasting lemmatization as a classification task by using class labels which encode mappings from word forms to lemmas. Experimental evaluation results and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific feature engineering or additional resources.

Corpus Exploitation from Wikipedia for Ontology Construction
Gaoying Cui | Qin Lu | Wenjie Li | Yirong Chen

Ontology construction usually requires a domain-specific corpus for building corresponding concept hierarchy. The domain corpus must have a good coverage of domain knowledge. Wikipedia(Wiki), the world’s largest online encyclopaedic knowledge source, is open-content, collaboratively edited, and free of charge. It covers millions of articles and still keeps on expanding continuously. These characteristics make Wiki a good candidate as domain corpus resource in ontology construction. However, the selected article collection must have considerable quality and quantity. In this paper, a novel approach is proposed to identify articles in Wiki as domain-specific corpus by using available classification information in Wiki pages. The main idea is to generate a domain hierarchy from the hyperlinked pages of Wiki. Only articles strongly linked to this hierarchy are selected as the domain corpus. The proposed approach makes use of linked category information in Wiki pages to produce the hierarchy as a directed graph for obtaining a set of pages in the same connected branch. Ranking and filtering are then done on these pages based on the classification tree generated by the traversal algorithm. The experiment and evaluation results show that Wiki is a good resource for acquiring a relative high quality domain-specific corpus for ontology construction.

Development and Alignment of a Domain-Specific Ontology for Question Answering
Shiyan Ou | Viktor Pekar | Constantin Orasan | Christian Spurk | Matteo Negri

With the appearance of Semantic Web technologies, it becomes possible to develop novel, sophisticated question answering systems, where ontologies are usually used as the core knowledge component. In the EU-funded project, QALL-ME, a domain-specific ontology was developed and applied for question answering in the domain of tourism, along with the assistance of two upper ontologies for concept expansion and reasoning. This paper focuses on the development of the QALL-ME ontology in the tourism domain and its alignment with the upper ontologies - WordNet and SUMO. The design of the ontology is presented in the paper, and a semi-automatic alignment procedure is described with some alignment results given as well. Furthermore, the aligned ontology was used to semantically annotate original data obtained from the tourism web sites and natural language questions. The storage schema of the annotated data and the data access method for retrieving answers from the annotated data are also reported in the paper.

Unsupervised and Domain Independent Ontology Learning: Combining Heterogeneous Sources of Evidence
David Manzano-Macho | Asunción Gómez-Pérez | Daniel Borrajo

Acquiring knowledge from the Web to build domain ontologies has become a common practice in the Ontological Engineering field. The vast amount of freely available information allows collecting enough information about any domain. However, the Web usually suffers a lack of structure, untrustworthiness and ambiguity of the content. These drawbacks hamper the application of unsupervised methods of building ontologies demanded by the increasingly popular applications of the Semantic Web. We believe that the combination of several processing mechanisms and complementary information sources may potentially solve the problem. The analysis of different sources of evidence allows determining with greater reliability the validity of the detected knowledge. In this paper, we present GALeOn (General Architecture for Learning Ontologies) that combines sources and processing resources to provide complementary and redundant evidence for making better estimations about the relevance of the extracted knowledge and their relationships. Our goal in this paper is to show how combining several information sources and extraction mechanisms is possible to build a taxonomy of concepts with a higher accuracy than if only one of them is applied. The experimental results show how this combination notably increases the precision of the obtained results with minimum user intervention.

L-ISA: Learning Domain Specific Isa-Relations from the Web
Alessandra Potrich | Emanuele Pianta

Automated extraction of ontological knowledge from text corpora is a relevant task in Natural Language Processing. In this paper, we focus on the problem of finding hypernyms for relevant concepts in a specific domain (e.g. Optical Recording) in the context of a concrete and challenging application scenario (patent processing). To this end information available on the Web is exploited. The extraction method includes four mains steps. Firstly, the Google search engine is exploited to retrieve possible instances of isa-patterns reported in the literature. Then, the returned snippets are filtered on the basis of lexico-syntactic criteria (e.g. the candidate hypernym must be expressed as a noun phrase without complex modifiers). In a further filtering step, only candidate hypernyms compatible with the target domain are kept. Finally a candidate ranking mechanism is applied to select one hypernym as output of the algorithm. The extraction method was evaluated on 100 concepts of the Optical Recording domain. Moreover, the reliability of isa-patterns reported in the literature as predictors of isa-relations was assessed by manually evaluating the template instances remaining after lexico-syntactic filtering, for 3 concepts of the same domain. While more extensive testing is needed the method appears promising especially for its portability across different domains.

A Common Ground for Virtual Humans: Using an Ontology in a Natural Language Oriented Virtual Human Architecture
Arno Hartholt | Thomas Russ | David Traum | Eduard Hovy | Susan Robinson

When dealing with large, distributed systems that use state-of-the-art components, individual components are usually developed in parallel. As development continues, the decoupling invariably leads to a mismatch between how these components internally represent concepts and how they communicate these representations to other components: representations can get out of synch, contain localized errors, or become manageable only by a small group of experts for each module. In this paper, we describe the use of an ontology as part of a complex distributed virtual human architecture in order to enable better communication between modules while improving the overall flexibility needed to change or extend the system. We focus on the natural language understanding capabilities of this architecture and the relationship between language and concepts within the entire system in general and the ontology in particular.

Using the Multilingual Central Repository for Graph-Based Word Sense Disambiguation
Eneko Agirre | Aitor Soroa

This paper presents the results of a graph-based method for performing knowledge-based Word Sense Disambiguation (WSD). The technique exploits the structural properties of the graph underlying the chosen knowledge base. The method is general, in the sense that it is not tied to any particular knowledge base, but in this work we have applied it to the Multilingual Central Repository (MCR). The evaluation has been performed on the Senseval-3 all-words task. The main contributions of the paper are twofold: (1) We have evaluated the separate and combined performance of each type of relation in the MCR, and thus indirectly validated the contents of the MCR and their potential for WSD. (2) We obtain state-of-the-art results, and in fact yield the best results that can be obtained using publicly available data.

A Japanese-English Technical Lexicon for Translation and Language Research
Fredric Gey | David Kirk Evans | Noriko Kando

In this paper we present a Japanese-English Bilingual lexicon of technical terms. The lexicon was derived from the first and second NTCIR evaluation collections for research into cross-language information retrieval for Asian languages. While it can be utilized for translation between Japanese and English, the lexicon is also suitable for language research and language engineering. Since it is collection-derived, it contains instances of word variants and miss-spellings which make it eminently suitable for further research. For a subset of the lexicon we make available the collection statistics. In addition we make available a Katakana subset suitable for transliteration research.

Mutual Bilingual Terminology Extraction
Le An Ha | Gabriela Fernandez | Ruslan Mitkov | Gloria Corpas

This paper describes a novel methodology to perform bilingual terminology extraction, in which automatic alignment is used to improve the performance of terminology extraction for each language. The strengths of monolingual terminology extraction for each language are exploited to improve the performance of terminology extraction in the other language, thanks to the availability of a sentence-level aligned bilingual corpus, and an automatic noun phrase alignment mechanism. The experiment indicates that weaknesses in monolingual terminology extraction due to the limitation of resources in certain languages can be overcome by using another language which has no such limitation.

Building a Golden Collection of Parallel Multi-Language Word Alignment
João Graça | Joana Paulo Pardal | Luísa Coheur | Diamantino Caseiro

This paper reports an experience on producing manual word alignments over six different language pairs (all combinations between Portuguese, English, French and Spanish) (Graça et al., 2008). Word alignment of each language pair is made over the first 100 sentences of the common test set from the Europarl corpora (Koehn, 2005), corresponding to 600 new annotated sentences. This collection is publicly available at http://www.l2f.inesc- It contains, to our knowledge, the first word alignment gold set for the Portuguese language, with three other languages. Besides, it is to our knowledge, the first multi-language manual word aligned parallel corpus, where the same sentences are annotated for each language pair. We started by using the guidelines presented at (Mariño, 2005) and performed several refinements: some due to under-specifications on the original guidelines, others because of disagreement on some choices. This lead to the development of an extensive new set of guidelines for multi-lingual word alignment annotation that, we believe, makes the alignment process less ambiguous. We evaluate the inter-annotator agreement obtaining an average of 91.6% agreement between the different language pairs.

The QALL-ME Benchmark: a Multilingual Resource of Annotated Spoken Requests for Question Answering
Elena Cabrio | Milen Kouylekov | Bernardo Magnini | Matteo Negri | Laura Hasler | Constantin Orasan | David Tomás | Jose Luis Vicedo | Guenter Neumann | Corinna Weber

This paper presents the QALL-ME benchmark, a multilingual resource of annotated spoken requests in the tourism domain, freely available for research purposes. The languages currently involved in the project are Italian, English, Spanish and German. It introduces a semantic annotation scheme for spoken information access requests, specifically derived from Question Answering (QA) research. In addition to pragmatic and semantic annotations, we propose three QA-based annotation levels: the Expected Answer Type, the Expected Answer Quantifier and the Question Topical Target of a request, to fully capture the content of a request and extract the sought-after information. The QALL-ME benchmark is developed under the EU-FP6 QALL-ME project which aims at the realization of a shared and distributed infrastructure for Question Answering (QA) systems on mobile devices (e.g. mobile phones). Questions are formulated by the users in free natural language input, and the system returns the actual sequence of words which constitutes the answer from a collection of information sources (e.g. documents, databases). Within this framework, the benchmark has the twofold purpose of training machine learning based applications for QA, and testing their actual performance with a rapid turnaround in controlled laboratory setting.

Tools & Resources for Visualising Conversational-Speech Interaction
Nick Campbell

This paper describes tools and techniques for accessing large quantities of speech data and for the visualisation of discourse interactions and events at levels above that of linguistic content. We are working with large quantities of dialogue speech including business meetings, friendly discourse, and telephone conversations, and have produced web-based tools for the visualisation of non-verbal and paralinguistic features of the speech data. In essence, they provide higher-level displays so that specific sections of speech, text, or other annotation can be accessed by the researcher and provide an interactive interface to the large amount of data through an Archive Browser.

A Web Browser Extension for Growing-up Ontological Knowledge from Traditional Web Content
Maria Teresa Pazienza | Marco Pennacchiotti | Armando Stellato

While the Web is facing interesting new changes in the way users access, interact and even participate to its growth, the most traditional applications dedicated to its fruition: web browsers, are not responding with the same euphoric boost for innovation, mostly relying on third party or open-source community-driven extensions for addressing the new Social and Semantic Web trends and technologies. This technological and decisional gap, which is probably due to the lack of a strong standardization commitment on the one side (Web 2.0/Social Web) and in the delay of massive adherence to new officially approved standards (W3C approved Semantic Web languages), has to be filled by successful stories which could lay the path for the evolution of browsers. In this work we present a novel web browser extension which combines several features coming from the worlds of terminology and information extraction, semantic annotation and knowledge management, to support users in the process of both keeping track of interesting information they find on the web, and organizing its associated content following knowledge representation standards offered by the Semantic Web

A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Architecture
Youssef Drissi | Branimir Boguraev | David Ferrucci | Paul Keyser | Anthony Levas

Information extraction from large data repositories is critical to Information Management solutions. In addition to prerequisite corpus analysis, to determine domain-specific characteristics of text resources, developing, refining and evaluating analytics entails a complex and lengthy process, typically requiring more than just domain expertise. Modern architectures for text processing, while facilitating reuse and (re-)composition of analytical pipelines, do place additional constraints upon the analytics development, as domain experts need not only configure individual annotator components, but situate these within a fully functional annotator pipeline. We present the design, and current status, of a tool for configuring model-driven annotators, which abstracts away from annotator implementation details, pipeline composition constraints, and data management. Instead, the tool embodies support for all stages of ontology-centric model development cycle from corpus analysis and concept definition, to model development and testing, to large scale evaluation, to easy and rapid composition of text applications deploying these concept models. With our design, we aim to meet the needs of domain experts, who are not necessarily expert NLP practitioners.

Ontology-Based XQuery’ing of XML-Encoded Language Resources on Multiple Annotation Layers
Georg Rehm | Richard Eckart | Christian Chiarcos | Johannes Dellert

We present an approach for querying collections of heterogeneous linguistic corpora that are annotated on multiple layers using arbitrary XML-based markup languages. An OWL ontology provides a homogenising view on the conceptually different markup languages so that a common querying framework can be established using the method of ontology-based query expansion. In addition, we present a highly flexible web-based graphical interface that can be used to query corpora with regard to several different linguistic properties such as, for example, syntactic tree fragments. This interface can also be used for ontology-based querying of multiple corpora simultaneously.

A Lightweight and Efficient Tool for Cleaning Web Pages
Stefan Evert

Originally conceived as a “naïve” baseline experiment using traditional n-gram language models as classifiers, the NCleaner system has turned out to be a fast and lightweight tool for cleaning Web pages with state-of-the-art accuracy (based on results from the CLEANEVAL competition held in 2007). Despite its simplicity, the algorithm achieves a significant improvement over the baseline (i.e. plain, uncleaned text dumps), trading off recall for substantially higher precision. NCleaner is available as an open-source software package. It is pre-configured for English Web pages, but can be adapted to other languages with minimal amounts of manually cleaned training data. Since NCleaner does not make use of HTML structure, it can also be applied to existing Web corpora that are only available in plain text format, with a minor loss in classfication accuracy.

Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages
Lynette Melnar | Chen Liu

In this paper we describe an approach that both creates crosslingual acoustic monophone model sets for speech recognition tasks and objectively predicts their performance without target-language speech data or acoustic measurement techniques. This strategy is based on a series of linguistic metrics characterizing the articulatory phonetic and phonological distances of target-language phonemes from source-language phonemes. We term these algorithms the Combined Phonetic and Phonological Crosslingual Distance (CPP-CD) metric and the Combined Phonetic and Phonological Crosslingual Prediction (CPP-CP) metric. The particular motivations for this project are the current unavailability and often prohibitively high production cost of speech databases for many strategically important low- and middle-density languages. First, we describe the CPP-CD approach and compare the performance of CPP-CD-specified models to both native language models and crosslingual models selected by the Bhattacharyya acoustic-model distance metric in automatic speech recognition (ASR) experiments. Results confirm that the CPP-CD approach nearly matches those achieved by the acoustic distance metric. We then test the CPP-CP algorithm on the CPP-CD models by comparing the CPP-CP scores to the recognition phoneme error rates. Based on this comparison, we conclude that the CPP-CP algorithm is a reliable indicator of crosslingual model performance in speech recognition tasks.

Corpus Analysis of Spoken Smart-Home Interactions with Older Users
Sebastian Möller | Florian Gödde | Maria Wolters

In this paper, we present the collection and analysis of a spoken dialogue corpus obtained from interactions of older and younger users with a smart-home system. Our aim is to identify the amount and the origin of linguistic differences in the way older and younger users address the system. In addition, we investigate changes in the users’ linguistic behaviour after exposure to the system. The results show that the two user groups differ in their speaking style as well as their vocabulary. In contrast to younger users, who adapt their speaking style to the expected limitations of the system, older users tend to use a speaking style that is closer to human-human communication in terms of sentence complexity and politeness. However, older users are far less easy to stereotype than younger users.

A Fully Annotated Corpus for Studying the Effect of Cognitive Ageing on Users’ Interactions with Spoken Dialogue Systems
Kallirroi Georgila | Maria Wolters | Vasilis Karaiskos | Melissa Kronenthal | Robert Logie | Neil Mayo | Johanna Moore | Matt Watson

In this paper we present a corpus of interactions of older and younger users with nine different dialogue systems. The corpus has been fully transcribed and annotated with dialogue acts and “Information State Update” (ISU) representations of dialogue context. Users not only underwent a comprehensive battery of cognitive assessments, but they also rated the usability of each dialogue system on a standardised questionnaire. In this paper, we discuss the corpus collection and outline the semi-automatic methods we used for discourse-level annotations. We expect that the corpus will provide a key resource for modelling older people’s interaction with spoken dialogue systems.

Recording Speech of Children, Non-Natives and Elderly People for HLT Applications: the JASMIN-CGN Corpus.
Catia Cucchiarini | Joris Driesen | Hugo Van hamme | Eric Sanders

Within the framework of the Dutch-Flemish programme STEVIN, the JASMIN-CGN (Jongeren, Anderstaligen en Senioren in Mens-machine Interactie’ Corpus Gesproken Nederlands) project was carried out, which was aimed at collecting speech of children, non-natives and elderly people. The JASMIN-CGN project is an extension of the Spoken Dutch Corpus (CGN) along three dimensions. First, by collecting a corpus of contemporary Dutch as spoken by children of different age groups, elderly people and non-natives with different mother tongues, an extension along the age and mother tongue dimensions was achieved. In addition, we collected speech material in a communication setting that was not envisaged in the CGN: human-machine interaction. One third of the data was collected in Flanders and two thirds in the Netherlands. In this paper we report on our experiences in collecting this corpus and we describe some of the important decisions that we made in the attempt to combine efficiency and high quality.

F0 of Adolescent Speakers - First Results for the German Ph@ttSessionz Database
Christoph Draxler | Florian Schiel | Tania Ellbogen

The first release of the German Ph@ttSessionz speech database contains read and spontaneous speech from 864 adolescent speakers and is the largest database of its kind for German. It was recorded via the WWW in over 40 public schools in all dialect regions of Germany. In this paper, we present a cross-sectional study of f0 measurements on this database. The study documents the profound changes in male voices at the age 13-15. Furthermore, it shows that on a perceptive mel-scale, there is little difference in the relative f0 variability for male and female speakers. A closer analysis reveals that f0 variability is dependent on the speech style and both the length and the type of the utterance. The study provides statistically reliable voice parameters of adolescent speakers for German. The results may contribute to making spoken dialog systems more robust by restricting user input to utterances with low f0 variability.

Dialogue, Speech and Images: the Companions Project Data Set
Yorick Wilks | David Benyon | Christopher Brewster | Pavel Ircing | Oli Mival

This paper describes part of the corpus collection efforts underway in the EC funded Companions project. The Companions project is collecting substantial quantities of dialogue a large part of which focus on reminiscing about photographs. The texts are in English and Czech. We describe the context and objectives for which this dialogue corpus is being collected, the methodology being used and make observations on the resulting data. The corpora will be made available to the wider research community through the Companions Project web site.

Creating and Using a Correlated Corpus to Glean Communicative Commonalities
Jade Goldstein-Stewart | Kerri Goodwin | Roberta Sabin | Ransom Winder

This paper describes a collection of correlated communicative samples collected from the same individuals across six diverse genres. Three of the genres were computer mediated: email, blog, and chat, and three non-computer-mediated: essay, interview, and discussion. Participants were drawn from a college student population with an equal number of males and females recruited. All communication expressed opinion on six pre-selected, current topics that had been determined to stimulate communication. The experimental design including methods of collection, randomization of scheduling of genre order and topic order is described. Preliminary results for two descriptive metrics, word count and Flesch readability, are presented. Interesting and, in some cases, significant effects were observed across genres by topic and by gender of participant. This corpus will provide a resource to investigate communication stylistics of individuals across genres, the identification of individuals from correlated data, as well as commonalities and differences across samples that agree in genre, topic, and/or gender of participant.

Information Extraction Tools and Methods for Understanding Dialogue in a Companion
Roberta Catizone | Alexiei Dingli | Hugo Pinto | Yorick Wilks

This paper discusses how Information Extraction is used to understand and manage Dialogue in the EU-funded Companions project. This will be discussed with respect to the Senior Companion, one of two applications under development in the EU-funded Companions project. Over the last few years, research in human-computer dialogue systems has increased and much attention has focused on applying learning methods to improving a key part of any dialogue system, namely the dialogue manager. Since the dialogue manager in all dialogue systems relies heavily on the quality of the semantic interpretation of the user’s utterance, our research in the Companions project, focuses on how to improve the semantic interpretation and combine it with knowledge from the Knowledge Base to increase the performance of the Dialogue Manager. Traditionally the semantic interpretation of a user utterance is handled by a natural language understanding module which embodies a variety of natural language processing techniques, from sentence splitting, to full parsing. In this paper we discuss the use of a variety of NLU processes and in particular Information Extraction as a key part of the NLU module in order to improve performance of the dialogue manager and hence the overall dialogue system.

Production in a Multimodal Corpus: how Speakers Communicate Complex Actions
Carlos Gómez Gallo | T. Florian Jaeger | James Allen | Mary Swift

We describe a new multimodal corpus currently under development. The corpus consists of videos of task-oriented dialogues that are annotated for speaker’s verbal requests and domain action executions. This resource provides data for new research on language production and comprehension. The corpus can be used to study speakers’ decisions as to how to structure their utterances given the complexity of the message they are trying to convey.

Towards Formal Interpretation of Semantic Annotation
Harry Bunt | Chwhynny Overbeeke

In this paper we present a novel approach to the incremental incorporation of semantic information in natural language processing which does not fall victim to the notorious problems of ambiguity and lack of robustness, namely through the formal interpretation of semantic annotation. We present a formal semantics for a language for the integrated annotation of several types of semantic information, such as (co-)reference relations, temporal information, and semantic roles. This semantics has the form of a compositional translation into second-order logic. We show that a truly semantic approach to the annotation of different types of semantic information raises interesting issues relating to the borders between these areas of semantics, and to the consistency of semantic annotations in multiple areas or in multiple annotation layers. The approach is compositional, in the sense that every well-formed subexpression of the annotation language can be translated to formal logic (and hence interpreted) independent of the rest of the annotation structure. The approach is also incremental in the sense that it is designed to be extendable to the semantic annotation of many other types of semantic information, such as spatial information, noun-noun relations, or quantification and modification structures.

Towards a Vector Space Model for FrameNet-like Resources
Marco Pennacchiotti | Diego De Cao | Paolo Marocco | Roberto Basili

In this paper, we present an original framework to model frame semantic resources (namely, FrameNet) using minimal supervision. This framework can be leveraged both to expand an existing FrameNet with new knowledge, and to induce a FrameNet in a new language. Our hypothesis is that a frame semantic resource can be modeled and represented by a suitable semantic space model. The intuition is that semantic spaces are an effective model of the notion of “being characteristic of a frame” for both lexical elements and full sentences. The paper gives two main contributions. First, it shows that our hypothesis is valid and can be successfully implemented. Second, it explores different types of semantic VSMs, outlining which one is more suitable for representing a frame semantic resource. In the paper, VSMs are used for modeling the linguistic core of a frame, the lexical units. Indeed, if the hypothesis is verified for these units, the proposed framework has a much wider application.

KnoFusius: a New Knowledge Fusion System for Interpretation of Gene Expression Data
Pavel Smrž

This paper introduces a new architecture that aims at combining molecular biology data with information automatically extracted from relevant scientific literature (using text mining techniques on PubMed abstracts and fulltext papers) to help biomedical experts to interpret experimental results in hand. The infrastructural level bears on semantic-web technologies and standards that facilitate the actual fusion of the multi-source knowledge.

Modelling Word Similarity: an Evaluation of Automatic Synonymy Extraction Algorithms.
Kris Heylen | Yves Peirsman | Dirk Geeraerts | Dirk Speelman

Vector-based models of lexical semantics retrieve semantically related words automatically from large corpora by exploiting the property that words with a similar meaning tend to occur in similar contexts. Despite their increasing popularity, it is unclear which kind of semantic similarity they actually capture and for which kind of words. In this paper, we use three vector-based models to retrieve semantically related words for a set of Dutch nouns and we analyse whether three linguistic properties of the nouns influence the results. In particular, we compare results from a dependency-based model with those from a 1st and 2nd order bag-of-words model and we examine the effect of the nouns’ frequency, semantic speficity and semantic class. We find that all three models find more synonyms for high-frequency nouns and those belonging to abstract semantic classses. Semantic specificty does not have a clear influence.

Children’s Oral Reading Corpus (CHOREC): Description and Assessment of Annotator Agreement
Leen Cleuren | Jacques Duchateau | Pol Ghesquière | Hugo Van hamme

Within the scope of the SPACE project, the CHildren’s Oral REading Corpus (CHOREC) is developed. This database contains recorded, transcribed and annotated read speech (42 GB or 130 hours) of 400 Dutch speaking elementary school children with or without reading difficulties. Analyses of inter- and intra-annotator agreement are carried out in order to investigate the consistency with which reading errors are detected, orthographic and phonetic transcriptions are made, and reading errors and reading strategies are labeled. Percentage agreement scores and kappa values both show that agreement between annotations, and therefore the quality of the annotations, is high. Taken all double or triple annotations (for 10% resp. 30% of the corpus) together, % agreement varies between 86.4% and 98.6%, whereas kappa varies between 0.72 and 0.97 depending on the annotation tier that is being assessed. School type and reading type seem to account for systematic differences in % agreement, but these differences disappear when kappa values are calculated that correct for chance agreement. To conclude, an analysis of the annotation differences with respect to the ’*s’ label (i.e. a label that is used to annotate undistinguishable spelling behaviour), phoneme labels, reading strategy and error labels is given.

A Bilingual Corpus of Inter-linked Events
Tommaso Caselli | Nancy Ide | Roberto Bartolini

This paper describes the creation of a bilingual corpus of inter-linked events for Italian and English. Linkage is accomplished through the Inter-Lingual Index (ILI) that links ItalWordNet with WordNet. The availability of this resource, on the one hand, enables contrastive analysis of the linguistic phenomena surrounding events in both languages, and on the other hand, can be used to perform multilingual temporal analysis of texts. In addition to describing the methodology for construction of the inter-linked corpus and the analysis of the data collected, we demonstrate that the ILI could potentially be used to bootstrap the creation of comparable corpora by exporting layers of annotation for words that have the same sense.

New Resources for Document Classification, Analysis and Translation Technologies
Stephanie Strassel | Lauren Friedman | Safa Ismael | Linda Brandschain

The goal of the DARPA MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Program is to automatically convert foreign language text images into English transcripts, for use by humans and downstream applications. The first phase the program focuses on translation of handwritten Arabic documents. Linguistic Data Consortium (LDC) is creating publicly available linguistic resources for MADCAT technologies, on a scale and richness not previously available. Corpora will consist of existing LDC corpora and data donations from MADCAT partners, plus new data collection to provide high quality material for evaluation and to address strategic gaps (for genre, dialect, image quality, etc.) in the existing resources. Training and test data properties will expand over time to encompass a wide range of topics and genres: letters, diaries, training manuals, brochures, signs, ledgers, memos, instructions, postcards and forms among others. Data will be ground truthed, with line, word and token segmentation and zoning, and translations and word alignments will be produced for a subset. Evaluation data will be carefully selected from the available data pools and high quality references will be produced, which can be used to compare MADCAT system performance against the human-produced gold standard.

Approximating Learning Curves for Active-Learning-Driven Annotation
Katrin Tomanek | Udo Hahn

Active learning (AL) is getting more and more popular as a methodology to considerably reduce the annotation effort when building training material for statistical learning methods for various NLP tasks. A crucial issue rarely addressed, however, is when to actually stop the annotation process to profit from the savings in efforts. This question is tightly related to estimating the classifier performance after a certain amount of data has already been annotated. While learning curves are the default means to monitor the progress of the annotation process in terms of classifier performance, this requires a labeled gold standard which - in realistic annotation settings, at least - is often unavailable. We here propose a method for committee-based AL to approximate the progression of the learning curve based on the disagreement among the committee members. This method relies on a separate, unlabeled corpus and is thus well suited for situations where a labeled gold standard is not available or would be too expensive to obtain. Considering named entity recognition as a test case we provide empirical evidence that this approach works well under simulation as well as under real-world annotation conditions.

Lexicon Schemas and Related Data Models: when Standards Meet Users
Thorsten Trippel | Michael Maxwell | Greville Corbett | Cambell Prince | Christopher Manning | Stephen Grimes | Steve Moran

Lexicon schemas and their use are discussed in this paper from the perspective of lexicographers and field linguists. A variety of lexicon schemas have been developed, with goals ranging from computational lexicography (DATR) through archiving (LIFT, TEI) to standardization (LMF, FSR). A number of requirements for lexicon schemas are given. The lexicon schemas are introduced and compared to each other in terms of conversion and usability for this particular user group, using a common lexicon entry and providing examples for each schema under consideration. The formats are assessed and the final recommendation is given for the potential users, namely to request standard compliance from the developers of the tools used. This paper should foster a discussion between authors of standards, lexicographers and field linguists.

LexSchem: a Large Subcategorization Lexicon for French Verbs
Cédric Messiant | Thierry Poibeau | Anna Korhonen

This paper presents LexSchem - the first large, fully automatically acquired subcategorization lexicon for French verbs. The lexicon includes subcategorization frame and frequency information for 3297 French verbs. When evaluated on a set of 20 test verbs against a gold standard dictionary, it shows 0.79 precision, 0.55 recall and 0.65 F-measure. We have made this resource freely available to the research community on the web.

Arabic WordNet: Semi-automatic Extensions using Bayesian Inference
Horacio Rodríguez | David Farwell | Javi Ferreres | Manuel Bertran | Musa Alkhalifa | M. Antonia Martí

This presentation focuses on the semi-automatic extension of Arabic WordNet (AWN) using lexical and morphological rules and applying Bayesian inference. We briefly report on the current status of AWN and propose a way of extending its coverage by taking advantage of a limited set of highly productive Arabic morphological rules for deriving a range of semantically related word forms from verb entries. The application of this set of rules, combined with the use of bilingual Arabic-English resources and Princeton’s WordNet, allows the generation of a graph representing the semantic neighbourhood of the original word. In previous work, a set of associations between the hypothesized Arabic words and English synsets was proposed on the basis of this graph. Here, a novel approach to extending AWN is presented whereby a Bayesian Network is automatically built from the graph and then the net is used as an inferencing mechanism for scoring the set of candidate associations. Both on its own and in combination with the previous technique, this new approach has led to improved results.

Subjective Evaluation of an Emotional Speech Database for Basque
Iñaki Sainz | Ibon Saratxaga | Eva Navas | Inmaculada Hernáez | Jon Sanchez | Iker Luengo | Igor Odriozola

This paper describes the evaluation process of an emotional speech database recorded for standard Basque, in order to determine its adequacy for the analysis of emotional models and its use in speech synthesis. The corpus consists of seven hundred semantically neutral sentences that were recorded for the Big Six emotions and neutral style, by two professional actors. The test results show that every emotion is readily recognized far above chance level for both speakers. Therefore the database is a valid linguistic resource for the research and development purposes it was designed for.

How to Compare Treebanks
Sandra Kübler | Wolfgang Maier | Ines Rehbein | Yannick Versley

Recent years have seen an increasing interest in developing standards for linguistic annotation, with a focus on the interoperability of the resources. This effort, however, requires a profound knowledge of the advantages and disadvantages of linguistic annotation schemes in order to avoid importing the flaws and weaknesses of existing encoding schemes into the new standards. This paper addresses the question how to compare syntactically annotated corpora and gain insights into the usefulness of specific design decisions. We present an exhaustive evaluation of two German treebanks with crucially different encoding schemes. We evaluate three different parsers trained on the two treebanks and compare results using EvalB, the Leaf-Ancestor metric, and a dependency-based evaluation. Furthermore, we present TePaCoC, a new testsuite for the evaluation of parsers on complex German grammatical constructions. The testsuite provides a well thought-out error classification, which enables us to compare parser output for parsers trained on treebanks with different encoding schemes and provides interesting insights into the impact of treebank annotation schemes on specific constructions like PP attachment or non-constituent coordination.

The INFILE Project: a Crosslingual Filtering Systems Evaluation Campaign
Romaric Besançon | Stéphane Chaudiron | Djamel Mostefa | Ismaïl Timimi | Khalid Choukri

The InFile project (INformation, FILtering, Evaluation) is a cross-language adaptive filtering evaluation campaign, sponsored by the French National Research Agency. The campaign is organized by the CEA LIST, ELDA and the University of Lille3-GERiiCO. It has an international scope as it is a pilot track of the CLEF 2008 campaigns. The corpus is built from a collection of about 1.4 million newswires (10 GB) in three languages, Arabic, English and French provided by the French news Agency Agence France Press (AFP) and selected from a 3-year period. The profiles corpus is made of 50 profiles from which 30 concern general news and events (national and international affairs, politics, sports?) and 20 concern scientific and technical subjects.

DIAC+: a Professional Diacritics Recovering System
Dan Tufiş | Alexandru Ceauşu

In languages that use diacritical characters, if these special signs are stripped-off from a word, the resulted string of characters may not exist in the language, and therefore its normative form is, in general, easy to recover. However, this is not always the case, as presence or absence of a diacritical sign attached to a base letter of a word which exists in both variants, may change its grammatical properties or even the meaning, making the recovery of the missing diacritics a difficult task, not only for a program but sometimes even for a human reader. We describe and evaluate an accurate knowledge-based system for automatic recovery of the missing diacritics in MS-Office documents written in Romanian. For the rare cases when the system is not able to make a reliable decision, it either provides the user a list of words with their recovery suggestions, or probabilistically chooses one of the possible changes, but leaves a trace (a highlighted comment) on each word the modification of which was uncertain.

Annotating an Arabic Learner Corpus for Error
Ghazi Abuhakema | Reem Faraj | Anna Feldman | Eileen Fitzpatrick

This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database FRIDA tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate and advanced-level student writings. We describe the need for such corpora, the learner data we have collected and the tagset we have developed. We also describe the error frequency distribution of both proficiency levels and the ongoing work.

All, and only, the Errors: more Complete and Consistent Spelling and OCR-Error Correction Evaluation
Martin Reynaert

Some time in the future, some spelling error correction system will correct all the errors, and only the errors. We need evaluation metrics that will tell us when this has been achieved and that can help guide us there. We survey the current practice in the form of the evaluation scheme of the latest major publication on spelling correction in a leading journal. We are forced to conclude that while the metric used there can tell us exactly when the ultimate goal of spelling correction research has been achieved, it offers little in the way of directions to be followed to eventually get there. We propose to consistently use the well-known metrics Recall and Precision, as combined in the F score, on 5 possible levels of measurement that should guide us more informedly along that path. We describe briefly what is then measured or measurable at these levels and propose a framework that should allow for concisely stating what it is one performs in one’s evaluations. We finally contrast our preferred metrics to Accuracy, which is widely used in this field to this day and to the Area-Under-the-Curve, which is increasingly finding acceptance in other fields.

Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora
Einav Itamar | Alon Itai

This paper presents a method for compiling a large-scale bilingual corpus from a database of movie subtitles. To create the corpus, we propose an algorithm based on Gale and Church’s sentence alignment algorithm(1993). However, our algorithm not only relies on character length information, but also uses subtitle-timing information, which is encoded in the subtitle files. Timing is highly correlated between subtitles in different versions (for the same movie), since subtitles that match should be displayed at the same time. However, the absolute time values can’t be used for alignment, since the timing is usually specified by frame numbers and not by real time, and converting it to real time values is not always possible, hence we use normalized subtitle duration instead. This results in a significant reduction in the alignment error rate.

The IFADV Corpus: a Free Dialog Video Corpus
Rob van Son | Wieneke Wesseling | Eric Sanders | Henk van den Heuvel

Research into spoken language has become more visual over the years. Both fundamental and applied research have progressively included gestures, gaze, and facial expression. Corpora of multi-modal conversational speech are rare and frequently difficult to use due to privacy and copyright restrictions. A freely available annotated corpus is presented, gratis and libre, of high quality video recordings of face-to-face conversational speech. Annotations include orthography, POS tags, and automatically generated phonemes transcriptions and word boundaries. In addition, labeling of both simple conversational function and gaze direction has been a performed. Within the bounds of the law, everything has been done to remove copyright and use restrictions. Annotations have been processed to RDBMS tables that allow SQL queries and direct connections to statistical software. From our experiences we would like to advocate the formulation of “best practises” for both legal handling and database storage of recordings and annotations.

WOZ Acoustic Data Collection for Interactive TV
Alessio Brutti | Luca Cristoforetti | Walter Kellermann | Lutz Marquardt | Maurizio Omologo

This paper describes a multichannel acoustic data collection recorded under the European DICIT project, during the Wizard of Oz (WOZ) experiments carried out at FAU and FBK-irst laboratories. The scenario is a distant-talking interface for interactive control of a TV. The experiments involve the acquisition of multichannel data for signal processing front-end and were carried out due to the need to collect a database for testing acoustic pre-processing algorithms. In this way, realistic scenarios can be simulated at a preliminary stage, instead of real-time implementations, allowing for repeatable experiments. To match the project requirements, the WOZ experiments were recorded in three languages: English, German and Italian. Besides the user inputs, the database also contains non-speech related acoustic events, room impulse response measurements and video data, the latter used to compute 3D labels. Sessions were manually transcribed and segmented at word level, introducing also specific labels for acoustic events.

Process Model for Composing High-quality Text Corpora
Mikko Lounela

The Teko corpus composing model offers a decentralized, dynamic way of collecting high-quality text corpora for linguistic research. The resulting corpus consists of independent text sets. The sets are composed in cooperation with linguistic research projects, so each of them responds to a specific research need. The corpora are morphologically annotated and XML-based, with in-built compatibilty with the Kaino user interface used in the corpus server of the Research Institute for the Languages of Finland. Furthermore, software for extracting standard quantitative reports from the text sets has been created during the project. The paper describes the project, and estimates its benefits and problems. It also gives an overview of the technical qualities of the corpora and corpus interface connected to the Teko project.

AnCora: Multilevel Annotated Corpora for Catalan and Spanish
Mariona Taulé | M. Antònia Martí | Marta Recasens

This paper presents AnCora, a multilingual corpus annotated at different linguistic levels consisting of 500,000 words in Catalan (AnCora-Ca) and in Spanish (AnCora-Es). At present AnCora is the largest multilayer annotated corpus of these languages freely available from The two corpora consist mainly of newspaper texts annotated at different levels of linguistic description: morphological (PoS and lemmas), syntactic (constituents and functions), and semantic (argument structures, thematic roles, semantic verb classes, named entities, and WordNet nominal senses). All resulting layers are independent of each other, thus making easier the data management. The annotation was performed manually, semiautomatically, or fully automatically, depending on the encoded linguistic information. The development of these basic resources constituted a primary objective, since there was a lack of such resources for these languages. A second goal was the definition of a consistent methodology that can be followed in further annotations. The current versions of AnCora have been used in several international evaluation competitions

The U.S. Policy Agenda Legislation Corpus Volume 1 - a Language Resource from 1947 - 1998
Stephen Purpura | John Wilkerson | Dustin Hillard

We introduce the corpus of United States Congressional bills from 1947 to 1998 for use by language research communities. The U.S. Policy Agenda Legislation Corpus Volume 1 (USPALCV1) includes more than 375,000 legislative bills annotated with a hierarchical policy area category. The human annotations in USPALCV1 have been reliably applied over time to enable social science analysis of legislative trends. The corpus is a member of an emerging family of corpora that are annotated by policy area to enable comparative parallel trend recognition across countries and domains (legislation, political speeches, newswire articles, budgetary expenditures, web sites, etc.). This paper describes the origins of the corpus, its creation, ways to access it, design criteria, and an analysis with common supervised machine learning methods. The use of machine learning methods establishes a baseline proposed modeling for the topic classification of legal documents.

Unsupervised Resource Creation for Textual Inference Applications
Jeremy Bensley | Andrew Hickl

This paper explores how a battery of unsupervised techniques can be used in order to create large, high-quality corpora for textual inference applications, such as systems for recognizing textual entailment (TE) and textual contradiction (TC). We show that it is possible to automatically generate sets of positive and negative instances of textual entailment and contradiction from textual corpora with greater than 90% precision. We describe how we generated more than 1 million TE pairs - and a corresponding set of and 500,000 TC pairs - from the documents found in the 2 GB AQUAINT-2 newswire corpus.

A Simple Method for Tagset Comparision
Markus Dickinson | Charles Jochim

Based on the idea that local contexts predict the same basic category across a language, we develop a simple method for comparing tagsets across corpora. The principle differences between tagsets are evidenced by variation in categories in one corpus in the same contexts where another corpus exhibits only a single tag. Such mismatches highlight differences in the definitions of tags which are crucial when porting technology from one annotation scheme to another.

From D-Coi to SoNaR: a reference corpus for Dutch
Nelleke Oostdijk | Martin Reynaert | Paola Monachesi | Gertjan Van Noord | Roeland Ordelman | Ineke Schuurman | Vincent Vandeghinste

The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.

Relationships between Nursing Converstaions and Activities
Hiromi Itoh Ozaku | Akinori Abe | Kaoru Sagara | Kiyoshi Kogure

In this paper, we determine the relationships between nursing activities and nurseing conversations based on the principle of maximum entropy. For analysis of the features of nursing activities, we built nursing corpora from actual nursing conversation sets collected in hospitals that involve various information about nursing activities. Ex-nurses manually assigned nursing activity information to the nursing conversations in the corpora. Since it is inefficient and too expensive to attach all information manually, we introduced an automatic nursing activity determination method for which we built models of relationships between nursing conversations and activities. In this paper, we adopted a maximum entropy approach for learning. Even though the conversation data set is not large enough for learning, acceptable results were obtained.

Management of Large Annotation Projects Involving Multiple Human Judges: a Case Study of GALE Machine Translation Post-editing
Meghan Lammie Glenn | Stephanie Strassel | Lauren Friedman | Haejoong Lee | Shawn Medero

Managing large groups of human judges to perform any annotation task is a challenge. Linguistic Data Consortium coordinated the creation of manual machine translation post-editing results for the DARPA Global Autonomous Language Exploration Program. Machine translation is one of three core technology components for GALE, which includes an annual MT evaluation administered by National Institute of Standards and Technology. Among the training and test data LDC creates for the GALE program are gold standard translations for system evaluation. The GALE machine translation system evaluation metric is edit distance, measured by HTER (human translation edit rate), which calculates the minimum number of changes required for highly-trained human editors to correct MT output so that it has the same meaning as the reference translation. LDC has been responsible for overseeing the post-editing process for GALE. We describe some of the accomplishments and challenges of completing the post-editing effort, including developing a new web-based annotation workflow system, and recruiting and training human judges for the task. In addition, we suggest that the workflow system developed for post-editing could be ported efficiently to other annotation efforts.

Bootstrapping Language Description: the case of Mpiemo (Bantu A, Central African Republic)
Harald Hammarström | Christina Thornell | Malin Petzell | Torbjörn Westerlund

Linguists have long been producing grammatical decriptions of yet undescribed languages. This is a time-consuming process, which has already adapted to improved technology for recording and storage. We present here a novel application of NLP techniques to bootstrap analysis of collected data and speed-up manual selection work. To be more precise, we argue that unsupervised induction of morphology and part-of-speech analysis from raw text data is mature enough to produce useful results. Experiments with Latent Semantic Analysis were less fruitful. We exemplify this on Mpiemo, a so-far essentially undescribed Bantu language of the Central African Republic, for which raw text data was available.

Automatic Assessment of Japanese Text Readability Based on a Textbook Corpus
Satoshi Sato | Suguru Matsuyoshi | Yohsuke Kondoh

This paper describes a method of readability measurement of Japanese texts based on a newly compiled textbook corpus. The textbook corpus consists of 1,478 sample passages extracted from 127 textbooks of elementary school, junior high school, high school, and university; it is divided into thirteen grade levels and the total size is about a million characters. For a given text passage, the readability measurement method determines the grade level to which the passage is the most similar by using character-unigram models, which are constructed from the textbook corpus. Because this method does not require sentence-boundary analysis and word-boundary analysis, it is applicable to texts that include incomplete sentences and non-regular text fragments. The performance of this method, which is measured by the correlation coefficient, is considerably high (R > 0.9); in case that the length of a text passage is limited in 25 characters, the correlation coefficient is still high (R = 0.83).

Building a Bio-Event Annotated Corpus for the Acquisition of Semantic Frames from Biomedical Corpora
Paul Thompson | Philip Cotter | John McNaught | Sophia Ananiadou | Simonetta Montemagni | Andrea Trabucco | Giulia Venturi

This paper reports on the design and construction of a bio-event annotated corpus which was developed with a specific view to the acquisition of semantic frames from biomedical corpora. We describe the adopted annotation scheme and the annotation process, which is supported by a dedicated annotation tool. The annotated corpus contains 677 abstracts of biomedical research articles.

Language Resources and Chemical Informatics
C.J. Rupp | Ann Copestake | Peter Corbett | Peter Murray-Rust | Advaith Siddharthan | Simone Teufel | Benjamin Waldron

Chemistry research papers are a primary source of information about chemistry, as in any scientific field. The presentation of the data is, predominantly, unstructured information, and so not immediately susceptible to processes developed within chemical informatics for carrying out chemistry research by information processing techniques. At one level, extracting the relevant information from research papers is a text mining task, requiring both extensive language resources and specialised knowledge of the subject domain. However, the papers also encode information about the way the research is conducted and the structure of the field itself. Applying language technology to research papers in chemistry can facilitate eScience on several different levels. The SciBorg project sets out to provide an extensive, analysed corpus of published chemistry research. This relies on the cooperation of several journal publishers to provide papers in an appropriate form. The work is carried out as a collaboration involving the Computer Laboratory, Chemistry Department and eScience Centre at Cambridge University, and is funded under the UK eScience programme.

Semantic Annotations for Biology: a Corpus Development Initiative at the Jena University Language & Information Engineering (JULIE) Lab
Udo Hahn | Elena Beisswanger | Ekaterina Buyko | Michael Poprat | Katrin Tomanek | Joachim Wermter

We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab which are focused on life science documents. Special emphasis is laid on semantic annotations in terms of a large amount of biomedical named entities (almost 100 entity types), semantic relations, as well as discourse phenomena, reference relations in particular.

A lexicon for biology and bioinformatics: the BOOTStrep experience.
Valeria Quochi | Monica Monachini | Riccardo Del Gratta | Nicoletta Calzolari

This paper describes the design, implementation and population of a lexical resource for biology and bioinformatics (the BioLexicon) developed within an ongoing European project. The aim of this project is text-based knowledge harvesting for support to information extraction and text mining in the biomedical domain. The BioLexicon is a large-scale lexical-terminological resource encoding different information types in one single integrated resource. In the design of the resource we follow the ISO/DIS 24613 “Lexical Mark-up Framework” standard, which ensures reusability of the information encoded and easy exchange of both data and architecture. The design of the resource also takes into account the needs of our text mining partners who automatically extract syntactic and semantic information from texts and feed it to the lexicon. The present contribution first describes in detail the model of the BioLexicon along its three main layers: morphology, syntax and semantics; then, it briefly describes the database implementation of the model and the population strategy followed within the project, together with an example. The BioLexicon database in fact comes equipped with automatic uploading procedures based on a common exchange XML format, which guarantees that the lexicon can be properly populated with data coming from different sources.

Dependency-Based Relation Mining for Biomedical Literature
Fabio Rinaldi | Gerold Schneider | Kaarel Kaljurand | Michael Hess

We describe techniques for the automatic detection of relationships among domain entities (e.g. genes, proteins, diseases) mentioned in the biomedical literature. Our approach is based on the adaptive selection of candidate interactions sentences, which are then parsed using our own dependency parser. Specific syntax-based filters are used to limit the number of possible candidate interacting pairs. The approach has been implemented as a demonstrator over a corpus of 2000 richly annotated MedLine abstracts, and later tested by participation to a text mining competition. In both cases, the results obtained have proved the adequacy of the proposed approach to the task of interaction detection.

MeSH©: from a Controlled Vocabulary to a Processable Resource
Dimitrios Kokkinakis

Large repositories of life science data in the form of domain-specific literature and large specialised textual collections increase on a daily basis to a level beyond the human mind can grasp and interpret. As the volume of data continues to increase, substantial support from new information technologies and computational techniques grounded in the mining paradigm is becoming apparent. These emerging technologies play a critical role in aiding research productivity, and they provide the means for reducing the workload for information access and decision support and for speeding up and enhancing the knowledge discovery process. In order to accomplish these higher level goals a fundamental and unavoidable starting point is the identification and mapping of terminology from unstructured data to biomedical knowledge sources and concept hierarchies. This paper provides a description of the work regarding terminology recognition using the Swedish MeSH© thesaurus and its corresponding English source. The various transformation and refinement steps applied to the original database tables into a fully-fledged processing-oriented annotating resource are explained. Particular attention has been given to a number of these steps in order to automatically map the extensive variability of lexical terms to structured MeSH© nodes. Issues on annotation and coverage are also discussed.

A Semantically Annotated Swedish Medical Corpus
Dimitrios Kokkinakis

With the information overload in the life sciences there is an increasing need for annotated corpora, particularly with biological and biomedical entities, which is the driving force for data-driven language processing applications and the empirical approach to language study. Inspired by the work in the GENIA Corpus, which is one of the very few of such corpora, extensively used in the biomedical field, and in order to fulfil the needs of our research, we have collected a Swedish medical corpus, the MEDLEX Corpus. MEDLEX is a large structurally and linguistically annotated document collection, consisting of a variety of text documents related to various medical text subfields, and does not focus at a particular medical genre, due to the lack of large Swedish resources within a particular medical subdomain. Out of this collection we selected 300 documents which were manually examined by two human experts who inspected, corrected and/or accordingly modified the automatically provided annotations according to a set of provided labelling guidelines. The annotations consist of medical terminology provided by the Swedish and English MeSH© (Medical Subject Headings) thesauri as well as named entity labels provided by an enhanced named entity recognition software.

Learning Patterns for Building Resources about Semantic Relations in the Medical Domain
Mehdi Embarek | Olivier Ferret

In this article, we present a method for extracting automatically semantic relations from texts in the medical domain using linguistic patterns. These patterns refer to three levels of information about words: inflected form, lemma and part-of-speech. The method we present consists first in identifying the entities that are part of the relations to extract, that is to say diseases, exams, treatments, drugs or symptoms. Thereafter, sentences that contain couples of entities are extracted and the presence of a semantic relation is validated by applying linguistic patterns. These patterns were previously learnt automatically from a manually annotated corpus by relying onan algorithm based on the edit distance. We first report the results of an evaluation of our medical entity tagger for the five types of entities we have mentioned above and then, more globally, the results of an evaluation of our extraction method for four relations between these entities. Both evaluations were done for French.

Automatic extraction of subcategorization frames for Italian
Dino Ienco | Serena Villata | Cristina Bosco

Subcategorization is a kind of knowledge which can be considered as crucial in several NLP tasks, such as Information Extraction or parsing, but the collection of very large resources including subcategorization representation is difficult and time-consuming. Various experiences show that the automatic extraction can be a practical and reliable solution for acquiring such a kind of knowledge. The aim of this paper is to investigate the relationships between subcategorization frame extraction and the nature of data from which the frames have to be extracted, e.g. how much the task can be influenced by the richness/poorness of the annotation. Therefore, we present some experiments that apply statistical subcategorization extraction methods, known in literature, on an Italian treebank that exploits a rich set of dependency relations that can be annotated at different degrees of specificity. Benefiting from the availability of relation sets that implement different granularity in the representation of relations, we evaluate our results with reference to previous works in a cross-linguistic perspective.

Parallel Multi-Theory Annotations of Syntactic Structure
Jerid Francom | Mans Hulden

We present an approach to creating a treebank of sentences using multiple notations or linguistic theories simultaneously. We illustrate the method by annotating sentences from the Penn Treebank II in three different theories in parallel: the original PTB notation, a Functional Dependency Grammar notation, and a Government and Binding style notation. Sentences annotated with all of these theories are represented in XML as a directed acyclic graph where nodes and edges may carry extra information depending on the theory encoded.

Tagging a Hebrew Corpus: the Case of Participles
Meni Adler | Yael Netzer | Yoav Goldberg | David Gabay | Michael Elhadad

We report on an effort to build a corpus of Modern Hebrew tagged with part-of-speech and morphology. We designed a tagset specific to Hebrew while focusing on four aspects: the tagset should be consistent with common linguistic knowledge; there should be maximal agreement among taggers as to the tags assigned to maintain consistency; the tagset should be useful for machine taggers and learning algorithms; and the tagset should be effective for applications relying on the tags as input features. In this paper, we illustrate these issues by explaining our decision to introduce a tag for beinoni forms in Hebrew. We explain how this tag is defined, and how it helped us improve manual tagging accuracy to a high-level, while improving automatic tagging and helping in the task of syntactic chunking.

Unsupervised Parts-of-Speech Induction for Bengali
Joydeep Nath | Monojit Choudhury | Animesh Mukherjee | Christian Biemann | Niloy Ganguly

We present a study of the word interaction networks of Bengali in the framework of complex networks. The topological properties of these networks reveal interesting insights into the morpho-syntax of the language, whereas clustering helps in the induction of the natural word classes leading to a principled way of designing POS tagsets. We compare different network construction techniques and clustering algorithms based on the cohesiveness of the word clusters. Cohesiveness is measured against two gold-standard tagsets by means of the novel metric of tag-entropy. The approach presented here is a generic one that can be easily extended to any language.

Tagging Spanish Texts: the Problem of Problem of “SE
Guadalupe Aguado de Cea | Javier Puche | José Ángel Ramos

Automatic tagging in Spanish has historically faced many problems because of some specific grammatical constructions. One of these traditional pitfalls is the “se” particle. This particle is a multifunctional and polysemous word used in many different contexts. Many taggers do not distinguish the possible uses of “se” and thus provide poor results at this point. In tune with the philosophy of free software, we have taken a free annotation tool as a basis, we have improved and enhanced its behaviour by adding new rules at different levels and by modifying certain parts in the code to allow for its possible implementation in other EAGLES-compliant tools. In this paper, we present the analysis carried out with different annotators for selecting the tool, the results obtained in all cases as well as the improvements added and the advantages of the modified tagger.

Does Netgraph Fit Prague Dependency Treebank?
Jiří Mírovský

On many examples we present a query language of Netgraph - a fully graphical tool for searching in the Prague Dependency Treebank 2.0. To demonstrate that the query language fits the treebank well, we study an annotation manual for the most complex layer of the treebank - the tectogrammatical layer - and show that linguistic phenomena annotated on the layer can be searched for using the query language.

The Kalshnikov 691 Dependency Bank
Tomas By

The PARC 700 dependency bank has a number of features that would seem to make it less than optimally suited for its intended purpose, parser evaluation. However, it is difficult to know precisely what impact these problems have on the evaluation results, and as a first step towards making comparison possible, a subset of the same sentences is presented here, marked up using a different format that avoids them. In this new representation, the tokens contain exactly the same sequence of characters as the original text, word order is encoded explicitly, and there is no artificial distinction between full tokens and attribute tokens. There is also a clear division between word tokens and empty nodes, and the token attributes are stored together with the word, instead of being spread out individually in the file. A standard programming language syntax is used for the data, so there is little room for markup errors. Finally, the dependency links are closer to standard grammatical terms, which presumably makes it easier to understand what they mean and to convert any particular parser output format to the Kalashnikov 691 representation. The data is provided both in machine-readable format and as graphical dependency trees.

Treebank-Based Acquisition of LFG Parsing Resources for French
Natalie Schluter | Josef van Genabith

Motivated by the expense in time and other resources to produce hand-crafted grammars, there has been increased interest in automatically obtained wide-coverage grammars from treebanks for natural language processing. In particular, recent years have seen the growth in interest in automatically obtained deep resources that can represent information absent from simple CFG-type structured treebanks and which are considered to produce more language-neutral linguistic representations, such as dependency syntactic trees. As is often the case in early pioneering work on natural language processing, English has provided the focus of first efforts towards acquiring deep-grammar resources, followed by successful treatments of, for example, German, Japanese, Chinese and Spanish. However, no comparable large-scale automatically acquired deep-grammar resources have been obtained for French to date. The goal of this paper is to present the application of treebank-based language acquisition to the case of French. We show that with modest changes to the established parsing architectures, encouraging results can be obtained for French, with an overall best dependency structure f-score of 86.73%.

Chooser: a Multi-Task Annotation Tool
Svetla Koeva | Borislav Rizov | Svetlozara Leseva

The paper presents a tool assisting manual annotation of linguistic data developed at the Department of Computational linguistics, IBL-BAS. Chooser is a general-purpose modular application for corpus annotation based on the principles of commonality and reusability of the created resources, language and theory independence, extendibility and user-friendliness. These features have been achieved through a powerful abstract architecture within the Model-View-Controller paradigm that is easily tailored to task-specific requirements and readily extendable to new applications. The tool is to a considerable extent independent of data format and representation and produces outputs that are largely consistent with existing standards. The annotated data are therefore reusable in tasks requiring different levels of annotation and are accessible to external applications. The tool incorporates edit functions, pass and arrangement strategies that facilitate annotators’ work. The relevant module produces tree-structured and graph-based representations in respective annotation modes. Another valuable feature of the application is concurrent access by multiple users and centralised storage of lexical resources underlying annotation schemata, as well as of annotations, including frequency of selection, updates in the lexical database, etc. Chooser has been successfully applied to a number of tasks: POS tagging, WS and syntactic annotation.

BOEMIE Ontology-Based Text Annotation Tool
Pavlina Fragkou | Georgios Petasis | Aris Theodorakos | Vangelis Karkaletsis | Constantine Spyropoulos

The huge amount of the available information in the Web creates the need of effective information extraction systems that are able to produce metadata that satisfy user’s information needs. The development of such systems, in the majority of cases, depends on the availability of an appropriately annotated corpus in order to learn extraction models. The production of such corpora can be significantly facilitated by annotation tools that are able to annotate, according to a defined ontology, not only named entities but most importantly relations between them. This paper describes the BOEMIE ontology-based annotation tool which is able to locate blocks of text that correspond to specific types of named entities, fill tables corresponding to ontology concepts with those named entities and link the filled tables based on relations defined in the domain ontology. Additionally, it can perform annotation of blocks of text that refer to the same topic. The tool has a user-friendly interface, supports automatic pre-annotation, annotation comparison as well as customization to other annotation schemata. The annotation tool has been used in a large scale annotation task involving 3,000 web pages regarding athletics. It has also been used in another annotation task involving 503 web pages with medical information, in different languages.

Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles
Ralf Krestel | Sabine Bergler | René Witte

Reported speech in the form of direct and indirect reported speech is an important indicator of evidentiality in traditional newspaper texts, but also increasingly in the new media that rely heavily on citation and quotation of previous postings, as for instance in blogs or newsgroups. This paper details the basic processing steps for reported speech analysis and reports on performance of an implementation in form of a GATE resource.

KYOTO: a System for Mining, Structuring and Distributing Knowledge across Languages and Cultures
Piek Vossen | Eneko Agirre | Nicoletta Calzolari | Christiane Fellbaum | Shu-kai Hsieh | Chu-Ren Huang | Hitoshi Isahara | Kyoko Kanzaki | Andrea Marchetti | Monica Monachini | Federico Neri | Remo Raffaelli | German Rigau | Maurizio Tescon | Joop VanGent

We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners (“Kybots”). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here.

Extracting and Querying Relations in Scientific Papers on Language Technology
Ulrich Schäfer | Hans Uszkoreit | Christian Federmann | Torsten Marek | Yajing Zhang

We describe methods for extracting interesting factual relations from scientific texts in computational linguistics and language technology taken from the ACL Anthology. We use a hybrid NLP architecture with shallow preprocessing for increased robustness and domain-specific, ontology-based named entity recognition, followed by a deep HPSG parser running the English Resource Grammar (ERG). The extracted relations in the MRS (minimal recursion semantics) format are simplified and generalized using WordNet. The resulting “quriples” are stored in a database from where they can be retrieved (again using abstraction methods) by relation-based search. The query interface is embedded in a web browser-based application we call the Scientist’s Workbench. It supports researchers in editing and online-searching scientific papers.

Named Entity Relation Mining using Wikipedia
Adrian Iftene | Alexandra Balahur-Dobrescu

Discovering relations among Named Entities (NEs) from large corpora is both a challenging, as well as useful task in the domain of Natural Language Processing, with applications in Information Retrieval (IR), Summarization (SUM), Question Answering (QA) and Textual Entailment (TE). The work we present resulted from the attempt to solve practical issues we were confronted with while building systems for the tasks of Textual Entailment Recognition and Question Answering, respectively. The approach consists in applying grammar induced extraction patterns on a large corpus - Wikipedia - for the extraction of relations between a given Named Entity and other Named Entities. The results obtained are high in precision, determining a reliable and useful application of the built resource.

Named Entity Recognition for Digitised Historical Texts
Claire Grover | Sharon Givon | Richard Tobin | Julian Ball

We describe and evaluate a prototype system for recognising person and place names in digitised records of British parliamentary proceedings from the late 17th and early 19th centuries. The output of an OCR engine is the input for our system and we describe certain issues and errors in this data and discuss the methods we have used to overcome the problems. We describe our rule-based named entity recognition system for person and place names which is implemented using the LT-XML2 and LT-TTT2 text processing tools. We discuss the annotation of a development and testing corpus and provide results of an evaluation of our system on the test corpus.

Entity Translation and Alignment in the ACE-07 ET Task
Zhiyi Song | Stephanie Strassel

Entities - people, organizations, locations and the like - have long been a central focus of natural language processing technology development, since entities convey essential content in human languages. For multilingual systems, accurate translation of named entities and their descriptors is critical. LDC produced Entity Translation pilot data to support the ACE ET 2007 Evaluation and the current paper delves more deeply into the entity alignment issue across languages, combining the automatic alignment techniques developed for ACE-07 with manual alignment. Altogether 84% of the Chinese-English entity mentions and 74% of the Arabic-English entity mentions are perfect aligned. The results of this investigation offer several important insights. Automatic alignment algorithms predicted that perfect alignment for the ET corpus was likely to be no greater than 55%; perfect alignment on the 15 pilot documents was predicted at 62.5%. Our results suggest the actual perfect alignment rate is substantially higher (82% average, 92% for NAM entities). The careful analysis of alignment errors also suggests strategies for human translation to support the ET task; for instance, translators might be given additional guidance about preferred treatments of name versus nominal translation. These results can also contribute to refined methods of evaluating ET systems.

Automated Subject Induction from Query Keywords through Wikipedia Categories and Subject Headings
Yoji Kiyota | Noriyuki Tamura | Satoshi Sakai | Hiroshi Nakagawa | Hidetaka Masuda

This paper addresses a novel approach that integrates two different types of information resources: the World Wide Web and libraries. This approach is based on a hypothesis: advantages and disadvantages of the Web and libraries are complemental. The integration is based on correspondent conceptual label names between the Wikipedia categories and subject headings of library materials. The method enables us to find locations of bookshelves in a library easily, using any query keywords. Any keywords which are registered as Wikipedia items are acceptable. The advantages of the method are: the integrative approach makes subject access of library resources have broader coverage than an approach which only uses subject headings; and the approach navigates us to reliable information resources. We implemented the proposed method into an application system, and are now operating the system at several university libraries in Japan. We are planning to evaluate the method based on the query logs collected by the system.

Using Random Indexing to improve Singular Value Decomposition for Latent Semantic Analysis
Linus Sellberg | Arne Jönsson

In this paper we present results from using Random indexing for Latent Semantic Analysis to handle Singular Value Decomposition tractability issues. In the paper we compare Latent Semantic Analysis, Random Indexing and Latent Semantic Analysis on Random Indexing reduced matrices. Our results show that Latent Semantic Analysis on Random Indexing reduced matrices provide better results on Precision and Recall than Random Indexing only. Furthermore, computation time for Singular Value Decomposition on a Random indexing reduced matrix is almost halved compared to Latent Semantic Analysis.

Harvesting Multi-Word Expressions from Parallel Corpora
Špela Vintar | Darja Fišer

The paper presents a set of approaches to extend the automatically created Slovene wordnet with nominal multi-word expressions. In the first approach multi-word expressions from Princeton WordNet are translated with a technique that is based on word-alignment and lexico-syntactic patterns. This is followed by extracting new terms from a monolingual corpus using keywordness ranking and contextual patterns. Finally, the multi-word expressions are assigned a hypernym and added to our wordnet. Manual evaluation and comparison of the results shows that the translation approach is the most straightforward and accurate. However, it is successfully complemented by the two monolingual approaches which are able to identify more term candidates in the corpus that would otherwise go unnoticed. Some weaknesses of the proposed wordnet extension techniques are also addressed.

Integration of a Multilingual Keyword Extractor in a Document Management System
Andrea Agili | Marco Fabbri | Alessandro Panunzi | Manuel Zini

In this paper we present a new Document Management System called DrStorage. This DMS is multi-platform, JCR-170 compliant, supports WebDav, versioning, user authentication and authorization and the most widespread file formats (Adobe PDF, Microsoft Office, HTML,...). It is also easy to customize in order to enhance its search capabilities and to support automatic metadata assignment. DrStorage has been integrated with an automatic language guesser and with an automatic keyword extractor: these metadata can be assigned automatically to documents, because the DrStorage’s server part has benn modified to allow that metadata assignment takes place as documents are put in the repository. Metadata can greatly improve the search capabilites and the results quality of a search engine. DrStorage’s client has been customized with two search results view: the first, called timeline view, shows temporal trends of queries as an histogram, the second, keyword cloud, shows which words are correlated and how much are correlated with the results of a particular day.

Dictionary of Multiword Expressions for Translation into highly Inflected Languages
Daiga Deksne | Raivis Skadiņš | Inguna Skadiņa

Treatment of Multiword Expressions (MWEs) is one of the most complicated issues in natural language processing, especially in Machine Translation (MT). The paper presents dictionary of MWEs for a English-Latvian MT system, demonstrating a way how MWEs could be handled for inflected languages with rich morphology and rather free word order. The proposed dictionary of MWEs consists of two constituents: a lexicon of phrases and a set of MWE rules. The lexicon of phrases is rather similar to translation lexicon of the MT system, while MWE rules describe syntactic structure of the source and target sentence allowing correct transformation of different MWE types into the target language and ensuring correct syntactic structure. The paper demonstrates this approach on different MWE types, starting from simple syntactic structures, followed by more complicated cases and including fully idiomatic expressions. Automatic evaluation shows that the described approach increases the quality of translation by 0.6 BLEU points.

Verb-Noun Collocation SyntLex Dictionary: Corpus-Based Approach
Grazyna Vetulani | Zygmunt Vetulani | Tomasz Obrębski

The project presented here is a part of a long term research program aiming at a full lexicon grammar for Polish (SyntLex). The main concern of this project is computer-assisted acquisition and morpho-syntactic description of verb-noun collocations in Polish. We present methodology and resources obtained in three main project phases which are: dictionary-based acquisition of collocation lexicon, feasibility study for corpus-based lexicon enlargement phase, corpus-based lexicon enlargement and collocation description. In this paper we focus on the results of the third phase. The presented here corpus-based approach permitted us to triple the size the verb-noun collocation dictionary for Polish. In the paper we describe the SyntLex Dictionary of Collocations and announce some future research intended to be a separate project continuation.

Targeting Chinese Nominal Compounds in Corpora
Weiruo Qu | Christoph Ringlstetter | Randy Goebel

For compounding languages, a great part of the topical semantics is transported via nominal compounds. Various applications of natural language processing can profit from explicit access to these compounds, provided by a lexicon. The best way to acquire such a resource is to harvest corpora that represent the domain in question. For Chinese, a significant difficulty lies in the fact that the text comes as a string of characters, only segmented by sentence boundaries. Extraction algorithms that solely rely on context variety do not perform precisely enough. We propose a pipeline of filters that starts from a candidate set established by accessor variety and then employs several methods to improve precision. For the experiments the Xinhua part of the Chinese Gigaword Corpus was used. We extracted a random sample of 200 story texts with 119,509 Hanzi characters. All compound words of this evaluation corpus were tagged, segmented into their morphemes, and augmented with the POS-information of their segments. A cascade of filters applied to a preliminary set of compound candidates led to a very high precision of over 90%, measured for the types. The result also holds for a small corpus where a solely contextual method introduces too much noise, even for the longer compounds. An introduction of MI into the basic candidacy algorithm led to a much higher recall with still reasonable precision for subsequent manual processing. Especially for the four-character compounds, that in our sample represent over 40% of the target data, the method has sufficient efficacy to support the rapid construction of compound dictionaries from domain corpora.

Using Semantically Annotated Corpora to Build Collocation Resources
Margarita Alonso Ramos | Owen Rambow | Leo Wanner

We present an experiment in extracting collocations from the FrameNet corpus, specifically, support verbs such as direct in Environmentalists directed strong criticism at world leaders. Support verbs do not contribute meaning of their own and the meaning of the construction is provided by the noun; the recognition of support verbs is thus useful in text understanding. Having access to a list of support verbs is also useful in applications that can benefit from paraphrasing, such as generation (where paraphrasing can provide variety). This paper starts with a brief presentation of the notion of lexical function in Meaning-Text Theory, where they fall under the notion of lexical function, and then discusses how relevant information is encoded in the FrameNet corpus. We describe the resource extracted from the FrameNet corpus.

Eksairesis: A Domain-Adaptable System for Ontology Building from Unstructured Text
Katia Lida Kermanidis | Aristomenis Thanopoulos | Manolis Maragoudakis | Nikos Fakotakis

This paper describes Eksairesis, a system for learning economic domain knowledge automatically from Modern Greek text. The knowledge is in the form of economic terms and the semantic relations that govern them. The entire process in based on the use of minimal language-dependent tools, no external linguistic resources, and merely free, unstructured text. The methodology is thereby easily portable to other domains and other languages. The text is pre-processed with basic morphological annotation, and semantic (named and other) entities are identified using supervised learning techniques. Statistical filtering, i.e. corpora comparison is used to extract domain terms and supervised learning is again employed to detect the semantic relations between pairs of terms. Advanced classification schemata, ensemble learning, and one-sided sampling, are experimented with in order to deal with the noise in the data, which is unavoidable due to the low pre-processing level and the lack of sophisticated resources. An average 68.5% f-score over all the classes is achieved when learning semantic relations. Bearing in mind the use of minimal resources and the highly automated nature of the process, classification performance is very promising, compared to results reported in previous work.

Conceptual Modeling of Ontology-based Linguistic Resources with a Focus on Semantic Relations
Francisco Alvarez Montero | Antonio Vaquero Sanchez | Fernando Sáenz Perez

Although ontologies and linguistic resources play a key role in applied AI and NLP, they have not been developed in a common and systematic way. The lack of a systematic methodology for their development has lead to the production of resources that exhibit common flaws between them, and that, at least when it come to ontologies, negatively impact their results and reusability. In this paper, we introduce a software-engineering methodology for the construction of ontology-based linguistic resources, and present a sound conceptual schema that takes into account several considerations for the construction of software tools that allow the systematic and controlled construction of ontology-based linguistic resources.

Ontology Search with the OntoSelect Ontology Library
Paul Buitelaar | Thomas Eigner

OntoSelect is a dynamic web-based ontology library that harvests, analyzes and organizes ontologies published on the Semantic Web. OntoSelect allows searching as well as browsing of ontologies according to size (number of classes, properties), representation format (DAML, RDFS, OWL), connectedness (score over the number of included and referring ontologies) and human languages used for class- and object property-labels. Ontology search in OntoSelect is based on a combined measure of coverage, structure and connectedness. Further, and in contrast to other ontology search engines, OntoSelect provides ontology search based on a complete web document instead of one or more keywords only.

A Framework for Multilingual Ontology Mapping
Cássia Trojahn | Paulo Quaresma | Renata Vieira

In the field of ontology mapping, multilingual ontology mapping is an issue that is not well explored. This paper proposes a framework for mapping of multilingual Description Logics (DL) ontologies. First, the DL source ontology is translated to the target ontology language, using a lexical database or a dictionary, generating a DL translated ontology. The target and the translated ontologies are then used as input for the mapping process. The mappings are computed by specialized agents using different mapping approaches. Next, these agents use argumentation to exchange their local results, in order to agree on the obtained mappings. Based on their preferences and confidence of the arguments, the agents compute their preferred mapping sets. The arguments in such preferred sets are viewed as the set of globally acceptable arguments. A DL mapping ontology is generated as result of the mapping process. In this paper we focus on the process of generating the DL translated ontology.

Acquiring a Taxonomy from the German Wikipedia
Laura Kassner | Vivi Nastase | Michael Strube

This paper presents the process of acquiring a large, domain independent, taxonomy from the German Wikipedia. We build upon a previously implemented platform that extracts a semantic network and taxonomy from the English version of the Wikipedia. We describe two accomplishments of our work: the semantic network for the German language in which isa links are identified and annotated, and an expansion of the platform for easy adaptation for a new language. We identify the platform’s strengths and shortcomings, which stem from the scarcity of free processing resources for languages other than English. We show that the taxonomy induction process is highly reliable - evaluated against the German version of WordNet, GermaNet, the resource obtained shows an accuracy of 83.34%.

LMM: an OWL-DL MetaModel to Represent Heterogeneous Lexical Knowledge
Davide Picca | Alfio Massimiliano Gliozzo | Aldo Gangemi

In this paper we present a Linguistic Meta-Model (LMM) allowing a semiotic-cognitive representation of knowledge. LMM is freely available and integrates the schemata of linguistic knowledge resources, such as WordNet and FrameNet, as well as foundational ontologies, such as DOLCE and its extensions. In addition, LMM is able to deal with multilinguality and to represent individuals and facts in an open domain perspective.

Development of the Japanese WordNet
Hitoshi Isahara | Francis Bond | Kiyotaka Uchimoto | Masao Utiyama | Kyoko Kanzaki

After a long history of compilation of our own lexical resources, EDR Japanese/English Electronic Dictionary, and discussions with major players on development of various WordNets, Japanese National Institute of Information and Communications Technology started developing the Japanese WordNet in 2006 and will publicly release the first version, which includes both the synset in Japanese and the annotated Japanese corpus of SemCor, in June 2008. As the first step in compiling the Japanese WordNet, we added Japanese equivalents to synsets of the Princeton WordNet. Of course, we must also add some synsets which do not exist in the Princeton WordNet, and must modify synsets in the Princeton WordNet, in order to make the hierarchical structure of Princeton synsets represent thesaurus-like information found in the Japanese language, however, we will address these tasks in a future study. We then translated English sentences which are used in the SemCor annotation into Japanese and annotated them using our Japanese WordNet. This article describes the overview of our project to compile Japanese WordNet and other resources which relate to our Japanese WordNet.

Lexical Ontology Extraction using Terminology Analysis: Automating Video Annotation
Neil Newbold | Bogdan Vrusias | Lee Gillam

The majority of work described in this paper was conducted as part of the Recovering Evidence from Video by fusing Video Evidence Thesaurus and Video MetaData (REVEAL) project, sponsored by the UK’s Engineering and Physical Sciences Research Council (EPSRC). REVEAL is concerned with reducing the time-consuming, yet essential, tasks undertaken by UK Police Officers when dealing with terascale collections of video related to crime-scenes. The project is working towards technologies which will archive video that has been annotated automatically based on prior annotations of similar content, enabling rapid access to CCTV archives and providing capabilities for automatic video summarisation. This involves considerations of semantic annotation relating, amongst other things, to content and to temporal reasoning. In this paper, we describe the ontology extraction components of the system in development, and its use in REVEAL for automatically populating a CCTV ontology from analysis of expert transcripts of the video footage.

Workbench with Authoring Tools for Collaborative Multi-lingual Ontological Knowledge Construction and Maintenance
Mukda Suktarachan | Dussadee Thamvijit | Daoyos Noikongka | Panita Yongyuth | Puwarat Pavaputanont Na Mahasarakham | Asanee Kawtrakul | Asanee Kawtrakul | Margherita Sini

An ontological knowledge management system requires dynamic and encapsulating operation in order to share knowledge among communities. The key to success of knowledge sharing in the field of agriculture is using and sharing agreed terminologies such as ontological knowledge especially in multiple languages. This paper proposes a workbench with three authoring tools for collaborative multilingual ontological knowledge construction and maintenance, in order to add value and support communities in the field of food and agriculture. The framework consists of the multilingual ontological knowledge construction and maintenance workbench platform, which composes of ontological knowledge management and user management, and three ontological knowledge authoring tools. The authoring tools used are two ontology extraction tools, ATOM and KULEX, and one ontology integration tool.

Towards Semi Automatic Construction of a Lexical Ontology for Persian
Mehrnoush Shamsfard

Lexical ontologies and semantic lexicons are important resources in natural language processing. They are used in various tasks and applications, especially where semantic processing is evolved such as question answering, machine translation, text understanding, information retrieval and extraction, content management, text summarization, knowledge acquisition and semantic search engines. Although there are a number of semantic lexicons for English and some other languages, Persian lacks such a complete resource to be used in NLP works. In this paper we introduce an ongoing project on developing a lexical ontology for Persian called FarsNet. We exploited a hybrid semi-automatic approach to acquire lexical and conceptual knowledge from resources such as WordNet, bilingual dictionaries, mono-lingual corpora and morpho-syntactic and semantic templates. FarsNet is an ontology whose elements are lexicalized in Persian. It provides links between various types of words (cross POS relations) and also between words and their corresponding concepts in other ontologies (cross ontologies relations). FarsNet aggregates the power of WordNet on nouns, the power of FrameNet on verbs and the wide range of conceptual relations from ontology community.

Mapping Roget’s Thesaurus and WordNet to French
Gerard de Melo | Gerhard Weikum

Roget’s Thesaurus and WordNet are very widely used lexical reference works. We describe an automatic mapping procedure that effectively produces French translations of the terms in these two resources. Our approach to the challenging task of disambiguation is based on structural statistics as well as measures of semantic relatedness that are utilized to learn a classification model for associations between entries in the thesaurus and French terms taken from bilingual dictionaries. By building and applying such models, we have produced French versions of Roget’s Thesaurus and WordNet with a considerable level of accuracy, which can be used for a variety of different purposes, by humans as well as in computational applications.

Representation of Atypical Entities in Ontologies
Christophe Jouis | Julien Bourdaillet

This paper is a contribution to formal ontology study. Some entities belong more or less to a class. In particular, some individual entities are attached to classes whereas they do not check all the properties of the class. To specify whether an individual entity belonging to a class is typical or not, we borrow the topological concepts of interior, border, closure, and exterior. We define a system of relations by adapting these topological operators. A scale of typicality, based on topology, is introduced. It enables to define levels of typicality where individual entities are more or less typical elements of a concept.

Extracting Concrete Senses of Lexicon through Measurement of Conceptual Similarity in Ontologies
Siaw-Fong Chung | Laurent Prévot | Mingwei Xu | Kathleen Ahrens | Shu-Kai Hsieh | Chu-Ren Huang

The measurement of conceptual similarity in a hierarchical structure has been proposed by studies such as Wu and Palmer (1994) which have been summarized and evaluated in Budanisky and Hirst (2006). The present study applies the measurement of conceptual similarity to conceptual metaphor research by comparing concreteness of ontological resource nodes to several prototypical concrete nodes selected by human subjects. Here, the purpose of comparing conceptual similarity between nodes is to select a concrete sense for a word which is used metaphorically. Through using WordNet-SUMO interface such as SinicaBow (Huang, Chang and Lee, 2004), concrete senses of a lexicon will be selected once its SUMO nodes have been compared in terms of conceptual similarity with the prototypical concrete nodes. This study has strong implications for the interaction of psycholinguistic and computational linguistic fields in conceptual metaphor research.

A Contextual Dynamic Network Model for WSD Using Associative Concept Dictionary
Jun Okamoto | Kiyoko Uchiyama | Shun Ishizaki

Many of the Japanese ideographs (Chinese characters) have a few meanings. Such ambiguities should be identified by using their contextual information. For example, we have an ideograph which has two pronunciations, /hitai/ and /gaku/, the former means a forehead of the human body and the latter has two meanings, an amount of money and a picture frame. Conventional methods for such a disambiguation problem have been using statistical methods with co-occurrence of words in their context. In this research, Contextual Dynamic Network Model is developed using the Associative Concept Dictionary which includes semantic relations among concepts/words and the relations can be represented with quantitative distances. In this model, an interactive activation method is used to identify a word’s meaning on the Contextual Semantic Network where the activation on the network is calculated using the distances. The proposed method constructs dynamically the Contextual Semantic Network according to the input words sequentially that appear in the sentence including an ambiguous word.

A Semantic Memory for Incremental Ontology Population
Berenike Loos | Lasse Schwarten

Generally, ontology learning and population is applied as a semi-automatic approach to knowledge acquisition in natural language understanding systems. That means, after the ontology is created or populated, an expert of the domain can still change or refine the newly acquired knowledge. In an incremental ontology learning framework (as e.g. applied for open-domain dialog systems) this approach is not sufficient as knowledge about the real world is dynamic and, therefore, has to be acquired and updated constantly. In this paper we propose the storing of newly acquired instances of an ontological concept in a separate database instead of integrating them directly into the system’s knowledge base. The advantage is that possibly incorrect knowledge is not part of the system’s ontology but stored aside. Furthermore, information about the confidence about the learned instances can be displayed and used for a final revision as well as a further automatic acquisition.

Turning a Term Extractor into a new Domain: first Experiences
Jorge Vivaldi | Anna Joan | Mercè Lorente

Computational terminology has notably evolved since the advent of computers. Regarding the extraction of terms in particular, a large number of resources have been developed: from very general tools to other much more specific acquisition methodologies. Such acquisition methodologies range from using simple linguistic patterns or frequency counting methods to using much more evolved strategies combining morphological, syntactical, semantical and contextual information. Researchers usually develop a term extractor to be applied to a given domain and, in some cases, some testing about the tool performance is also done. Afterwards, such tools may also be applied to other domains, though frequently no additional test is made in such cases. Usually, the application of a given tool to other domain does not require any tuning. Recently, some tools using semantic resources have been developed. In such cases, either a domain-specific or a generic resource may be used. In the latter case, some tuning may be necessary in order to adapt the tool to a new domain. In this paper, we present the task started in order to adapt YATE, a term extractor that uses a generic resource as EWN and that is already developed for the medical domain, into the economic one.

Similar Term Discovery using Web Search
Peter Anick | Vijay Murthi | Shaji Sebastian

We present an approach to the discovery of semantically similar terms that utilizes a web search engine as both a source for generating related terms and a tool for estimating the semantic similarity of terms. The system works by associating with each document in the search engine’s index a weighted term vector comprising those phrases that best describe the document’s subject matter. Related terms for a given seed phrase are generated by running the seed as a search query and mining the result vector produced by averaging the weights of terms associated with the top documents of the query result set. The degree of similarity between the seed term and each related term is then computed as the cosine of the angle between their respective result vectors. We test the effectiveness of this approach for building a term recommender system designed to help online advertisers discover additional phrases to describe their product offering. A comparison of its output with that of several alternative methods finds it to be competitive with the best known alternative.

Temporal Aspects of Terminology for Automatic Term Recognition: Case Study on Women’s Studies Terms
Junko Kubo | Keita Tsuji | Shigeo Sugimoto

The purpose of this paper is to clarify the temporal aspect of terminology focusing on the dictionary’s impact on terms. We used women’s studies terms as data and examined the changes of their values of five automatic term recognition (ATR) measures before and after dictionary publication. The changes of precision and recall of extraction based on these measures were also examined. The measures are TFIDF, C-value, MC-value, Nakagawa’s FLR, and simple document frequencies. We found that being listed in dictionaries gives longevity to terms and prevent them from losing termhood that is represented by these ATR measures. The peripheral or relatively less important terms are more likely to be influenced by dictionaries and their termhood increase after being listed in dictionaries. Among the termhood, the potential of word formation that can be measured by Nakagawa’s FLR seemed to be influenced most and the terms gradually gained it after being listed in dictionaries.

A Comparative Evaluation of Term Recognition Algorithms
Ziqi Zhang | Jose Iria | Christopher Brewster | Fabio Ciravegna

Automatic Term recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. From a large number of methodologies available in the literature only a few are able to handle both single and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach us¬ing a voting mechanism. We evaluated the six approaches using two different corpora and show how the voting algo¬rithm performs best on one corpus (a collection of texts from Wikipedia) and less well using the Genia corpus (a standard life science corpus). This indicates that choice and design of corpus has a major impact on the evaluation of term recog¬nition algorithms. Our experiments also showed that single-word terms can be equally important and occupy a fairly large proportion in certain domains. As a result, algorithms that ignore single-word terms may cause problems to tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects and this means information extraction techniques need to be integrated into the term recognition process.

Learning-based Detection of Scientific Terms in Patient Information
Veronique Hoste | Els Lefever | Klaar Vanopstal | Isabelle Delaere

In this paper, we investigate the use of a machine-learning based approach to the specific problem of scientific term detection in patient information. Lacking lexical databases which differentiate between the scientific and popular nature of medical terms, we used local context, morphosyntactic, morphological and statistical information to design a learner which accurately detects scientific medical terms. This study is the first step towards the automatic replacement of a scientific term by its popular counterpart, which should have a beneficial effect on readability. We show a F-score of 84% for the prediction of scientific terms in an English and Dutch EPAR corpus. Since recasting the term extraction problem as a classification problem leads to a large skewedness of the resulting data set, we rebalanced the data set through the application of some simple TF-IDF-based and Log-likelihood-based filters. We show that filtering indeed has a beneficial effect on the learner’s performance. However, the results of the filtering approach combined with the learning-based approach remain below those of the learning-based approach.

WNTERM: Enriching the MCR with a Terminological Dictionary
Eli Pociello | Antton Gurrutxaga | Eneko Agirre | Izaskun Aldezabal | German Rigau

In this paper we describe the methodology and the first steps for the creation of WNTERM (from WordNet and Terminology), a specialized lexicon produced from the merger of the EuroWordNet-based Multilingual Central Repository (MCR) and the Basic Encyclopaedic Dictionary of Science and Technology (BDST). As an example, the ecology domain has been used. The final result is a multilingual (Basque and English) light-weight domain ontology, including taxonomic and other semantic relations among its concepts, which is tightly connected to other wordnets.

Encoding Terms from a Scientific Domain in a Terminological Database: Methodology and Criteria
Rita Marinelli | Melissa Tiberi | Remo Bindi

This paper reports on the main phases of a research which aims at enhancing a maritime terminological database by means of a set of terms belonging to meteorology. The structure of the terminological database, according to EuroWordNet/ItalWordNet model is described; the criteria used to build corpora of specialized texts are explained as well as the use of the corpora as source for term selection and extraction. The contribution of the semantic databases is taken into account: on the one hand, the most recent version of the Princeton WordNet has been exploited as reference for comparing and evaluating synsets; on the other hand, the Italian WordNet has been employed as source for exporting synsets to be coded in the terminological resource. The set of semantic relations useful to codify new terms belonging to the discipline of meteorology is examined, revising the semantic relations provided by the IWN model, introducing new relations which are more suitably tailored to specific requirements either scientific or pragmatic. The need for a particular relation is highlighted to represent the mental association which is made when a term intuitively recalls another term, but they are neither synonyms nor connected by means of a hyperonymy/hyponymy relation.

An Evaluation Resource for Geographic Information Retrieval
Thomas Mandl | Fredric Gey | Giorgio Di Nunzio | Nicola Ferro | Mark Sanderson | Diana Santos | Christa Womser-Hacker

In this paper we present an evaluation resource for geographic information retrieval developed within the Cross Language Evaluation Forum (CLEF). The GeoCLEF track is dedicated to the evaluation of geographic information retrieval systems. The resource encompasses more than 600,000 documents, 75 topics so far, and more than 100,000 relevance judgments for these topics. Geographic information retrieval requires an evaluation resource which represents realistic information needs and which is geographically challenging. Some experimental results and analysis are reported

Bilingual Text Classification using the IBM 1 Translation Model
Jorge Civera | Alfons Juan-Císcar

Manual categorisation of documents is a time-consuming task that has been significantly alleviated with the deployment of automatic and machine-aided text categorisation systems. However, the proliferation of multilingual documentation has become a common phenomenon in many international organisations, while most of the current systems have focused on the categorisation of monolingual text. It has been recently shown that the inherent redundancy in bilingual documents can be effectively exploited by relatively simple, bilingual naive Bayes (multinomial) models. In this work, we present a refined version of these models in which this redundancy is explicitly captured by a combination of a unigram (multinomial) model and the well-known IBM 1 translation model. The proposed model is evaluated on two bilingual classification tasks and compared to previous work.

Ping-pong Document Clustering using NMF and Linkage-Based Refinement
Hiroyuki Shinnou | Minoru Sasaki

This paper proposes a ping-pong document clustering method using NMF and the linkage based refinement alternately, in order to improve the clustering result of NMF. The use of NMF in the ping-pong strategy can be expected effective for document clustering. However, NMF in the ping-pong strategy often worsens performance because NMF often fails to improve the clustering result given as the initial values. Our method handles this problem with the stop condition of the ping-pong process. In the experiment, we compared our method with the k-means and NMF by using 16 document data sets. Our method improved the clustering result of NMF significantly.

Spectral Clustering for a Large Data Set by Reducing the Similarity Matrix Size
Hiroyuki Shinnou | Minoru Sasaki

Spectral clustering is a powerful clustering method for document data set. However, spectral clustering needs to solve an eigenvalue problem of the matrix converted from the similarity matrix corresponding to the data set. Therefore, it is not practical to use spectral clustering for a large data set. To overcome this problem, we propose the method to reduce the similarity matrix size. First, using k-means, we obtain a clustering result for the given data set. From each cluster, we pick up some data, which are near to the central of the cluster. We take these data as one data. We call this data set as “committee”. Data except for committees remain one data. For these data, we construct the similarity matrix. Definitely, the size of this similarity matrix is reduced so much that we can perform spectral clustering using the reduced similarity matrix.

A Text-based Query Interface to OWL Ontologies
Danica Damljanovic | Valentin Tablan | Kalina Bontcheva

Accessing structured data in the form of ontologies requires training and learning formal query languages (e.g., SeRQL or SPARQL) which poses significant difficulties for non-expert users. One of the ways to lower the learning overhead and make ontology queries more straightforward is through a Natural Language Interface (NLI). While there are existing NLIs to structured data with reasonable performance, they tend to require expensive customisation to each new domain or ontology. Additionally, they often require specific adherence to a pre-defined syntax which, in turn, means that users still have to undergo training. In this paper we present Question-based Interface to Ontologies (QuestIO) - a tool for querying ontologies using unconstrained language-based queries. QuestIO has a very simple interface, requires no user training and can be easily embedded in any system or used with any ontology or knowledge base without prior customisation.

A Research on Automatic Chinese Catchword Extraction
Han Ren | Donghong Ji | Lei Han

Catchwords refer to popular words or phrases within certain area in certain period of time. In this paper, we propose a novel approach for automatic Chinese catchwords extraction. At the beginning, we discuss the linguistic definition of catchwords and analyze the features of catchwords by manual evaluation. According to those features of catchwords, we define three aspects to describe Popular Degree of catchwords. To extract terms with maximum meaning, we adopt an effective ATE algorithm for multi-character words and long phrases. Then we use conic fitting in Time Series Analysis to build Popular Degree Curves of extracted terms. To calculate Popular Degree Values of catchwords, a formula is proposed which includes values of Popular Trend, Peak Value and Popular Keeping. Finally, a ranking list of catchword candidates is built according to Popular Degree Values. Experiments show that automatic Chinese catchword extraction is effective and objective in comparison with manual evaluation.

ParsCit: an Open-source CRF Reference String Parsing Package
Isaac Councill | C. Lee Giles | Min-Yen Kan

We describe ParsCit, a freely available, open-source implementation of a reference string parsing package. At the core of ParsCit is a trained conditional random field (CRF) model used to label the token sequences in the reference string. A heuristic model wraps this core with added functionality to identify reference strings from a plain text file, and to retrieve the citation contexts. The package comes with utilities to run it as a web service or as a standalone utility. We compare ParsCit on three distinct reference string datasets and show that it compares well with other previously published work.

Automatic Acquisition of Usage Information for Language Resources
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara

Recently, language resources (LRs) are becoming indispensable for linguistic research. Unfortunately, it is not easy to find their usages by searching the web even though they must be described in the Internet or academic articles. This indicates that the intrinsic value of LRs is not recognized very well. In this research, therefore, we extract a list of usage information for each LR to promote the efficient utilization of LRs. In this paper, we proposed a method for extracting a list of usage information from academic articles by using rules based on syntactic information. The rules are generated by focusing on the syntactic features that are observed in the sentences describing usage information. As a result of experiments, we achieved 72.9% in recall and 78.4% in precision for the closed test and 60.9% in recall and 72.7% in precision for the open test.

Cost-Sensitive Learning in Answer Extraction
Michael Wiegand | Jochen L. Leidner | Dietrich Klakow

One problem of data-driven answer extraction in open-domain factoid question answering is that the class distribution of labeled training data is fairly imbalanced. In an ordinary training set, there are far more incorrect answers than correct answers. The class-imbalance is, thus, inherent to the classification task. It has a deteriorating effect on the performance of classifiers trained by standard machine learning algorithms. They usually have a heavy bias towards the majority class, i.e. the class which occurs most often in the training set. In this paper, we propose a method to tackle class imbalance by applying some form of cost-sensitive learning which is preferable to sampling. We present a simple but effective way of estimating the misclassification costs on the basis of class distribution. This approach offers three benefits. Firstly, it maintains the distribution of the classes of the labeled training data. Secondly, this form of meta-learning can be applied to a wide range of common learning algorithms. Thirdly, this approach can be easily implemented with the help of state-of-the-art machine learning software.

Definition Extraction Using a Sequential Combination of Baseline Grammars and Machine Learning Classifiers
Łukasz Degórski | Michał Marcińczuk | Adam Przepiórkowski

The paper deals with the task of definition extraction from a small and noisy corpus of instructive texts. Three approaches are presented: Partial Parsing, Machine Learning and a sequential combination of both. We show that applying ML methods with the support of a trivial grammar gives results better than a relatively complicated partial grammar, and much better than pure ML approach.

Yet another Platform for Extracting Knowledge from Corpora
Francesca Fallucchi | Fabio Massimo Zanzotto

The research field of “extracting knowledge bases from text collections” seems to be mature: its target and its working hypotheses are clear. In this paper we propose a platform, YAPEK, i.e., Yet Another Platform for Extracting Knowledge from corpora, that wants to be the base to collect the majority of algorithms for extracting knowledge bases from corpora. The idea is that, when many knowledge extraction algorithms are collected under the same platform, relative comparisons are clearer and many algorithms can be leveraged to extract more valuable knowledge for final tasks such as Textual Entailment Recognition. As we want to collect many knowledge extraction algorithms, YAPEK is based on the three working hypotheses of the area: the basic hypothesis, the distributional hypothesis, and the point-wise assertion patterns. In YAPEK, these three hypotheses define two spaces: the space of the target textual forms and the space of the contexts. This platform guarantees the possibility of rapidly implementing many models for extracting knowledge from corpora as the platform gives clear entry points to model what is really different in the different algorithms: the feature spaces, the distances in these spaces, and the actual algorithm.

A Framework for Identity Resolution and Merging for Multi-source Information Extraction
Milena Yankova | Horacio Saggion | Hamish Cunningham

In the context of ontology-based information extraction, identity resolution is the process of deciding whether an instance extracted from text refers to a known entity in the target domain (e.g. the ontology). We present an ontology-based framework for identity resolution which can be customized to different application domains and extraction tasks. Rules for identify resolution, which compute similarities between target and source entities based on class information and instance properties and values, can be defined for each class in the ontology. We present a case study of the application of the framework to the problem of multi-source job vacancy extraction

Experiments to Investigate the Connection between Case Distribution and Topical Relevance of Search Terms in an Information Retrieval Setting
Jussi Karlgren | Hercules Dalianis | Bart Jongejan

We have performed a set of experiments made to investigate the utility of morphological analysis to improve retrieval of documents written in languages with relatively large morphological variation in a practical commercial setting, using the SiteSeeker search system developed and marketed by Euroling Ab. The objective of the experiments was to evaluate different lemmatisers and stemmers to determine which would be the most practical for the task at hand: highly interactive, relatively high precision web searches in commercial customer-oriented document collections. This paper gives an overview of some of the results for Finnish and German, and describes specifically one experiment designed to investigate the case distribution of nouns in a highly inflectional language (Finnish) and the topicality of the nouns in target texts. We find that topical nouns taken from queries are distributed differently over relevant and non-relevant documents depending on their grammatical case.

Identifying Strategic Information from Scientific Articles through Sentence Classification
Fidelia Ibekwe-SanJuan | Chaomei Chen | Roberto Pinho

We address here the need to assist users in rapidly accessing the most important or strategic information in the text corpus by identifying sentences carrying specific information. More precisely, we want to identify contribution of authors of scientific papers through a categorization of sentences using rhetorical and lexical cues. We built local grammars to annotate sentences in the corpus according to their rhetorical status: objective, new things, results, findings, hypotheses, conclusion, related_word, future work. The annotation is automatically projected automatically onto two other corpora to test their portability across several domains. The local grammars are implemented in the Unitex system. After sentence categorization, the annotated sentences are clustered and users can navigate the result by accessing specific information types. The results can be used for advanced information retrieval purposes.

Keywords, k-NN and Neural Networks: a Support for Hierarchical Categorization of Texts in Brazilian Portuguese
Susana Azeredo | Silvia Moraes | Vera Lima

A frequent problem in automatic categorization applications involving Portuguese language is the absence of large corpora of previously classified documents, which permit the validation of experiments carried out. Generally, the available corpora are not classified or, when they are, they contain a very reduced number of documents. The general goal of this study is to contribute to the development of applications which aim at text categorization for Brazilian Portuguese. Specifically, we point out that keywords selection associated with neural networks can improve results in the categorization of Brazilian Portuguese texts. The corpus is composed of 30 thousand texts from the Folha de São Paulo newspaper, organized in 29 sections. In the process of categorization, the k-Nearest Neighbor (k-NN) algorithm and the Multilayer Perceptron neural networks trained with the backpropagation algorithm are used. It is also part of our study to test the identification of keywords parting from the log-likelihood statistical measure and to use them as features in the categorization process. The results clearly show that the precision is better when using neural networks than when using the k-NN.

Automatic Extraction of Textual Elements from News Web Pages
Hossam Ibrahim | Kareem Darwish | Abdel-Rahim Madany

In this paper we present an algorithm for automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning classification technique based on the use of a Support Vector Machine (SVM) classifier to extract the desired textual elements. The technique uses internal structural features of a webpage without relying on the Document Object Model to which many content authors fail to adhere. The classifier uses a set of features which rely on the length of text, the percentage of hypertext, etc. The resulting classifier is nearly perfect on previously unseen news pages from different sites. The proposed technique is successfully employed in, which is the largest Arabic news aggregator on the web.

Extraction of Informative Expressions from Domain-specific Documents
Eiko Yamamoto | Hitoshi Isahara | Akira Terada | Yasunori Abe

What kinds of lexical resources are helpful for extracting useful information from domain-specific documents? Although domain-specific documents contain much useful knowledge, it is not obvious how to extract such knowledge efficiently from the documents. We need to develop techniques for extracting hidden information from such domain-specific documents. These techniques do not necessarily use state-of-the-art technologies and achieve deep and accurate language understanding, but are based on huge amounts of linguistic resources, such as domain-specific lexical databases. In this paper, we introduce two techniques for extracting informative expressions from documents: the extraction of related words that are not only taxonomically related but also thematically related, and the acquisition of salient terms and phrases. With these techniques we then attempt to automatically and statistically extract domain-specific informative expressions in aviation documents as an example and evaluate the results.

Connecting Text Mining and Pathways using the PathText Resource
Rune Sætre | Brian Kemper | Kanae Oda | Naoaki Okazaki | Yukiko Matsuoka | Norihiro Kikuchi | Hiroaki Kitano | Yoshimasa Tsuruoka | Sophia Ananiadou | Jun’ichi Tsujii

Many systems have been developed in the past few years to assist researchers in the discovery of knowledge published as English text, for example in the PubMed database. At the same time, higher level collective knowledge is often published using a graphical notation representing all the entities in a pathway and their interactions. We believe that these pathway visualizations could serve as an effective user interface for knowledge discovery if they can be linked to the text in publications. Since the graphical elements in a Pathway are of a very different nature than their corresponding descriptions in English text, we developed a prototype system called PathText. The goal of PathText is to serve as a bridge between these two different representations. In this paper, we first describe the overall architecture and the interfaces of the PathText system, and then provide some details about the core Text Mining components.

Detecting Co-Derivative Documents in Large Text Collections
Jan Pomikálek | Pavel Rychlý

We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. While the memory requirements for computing co-derivative documents can be reduced to up to 1% by only using duplicate n-grams, SPEX needs about 40 times more memory for computing the list of duplicate n-grams itself. Therefore the memory requirements of the whole process are not reduced enough to make the algorithm practical for very large collections. We propose a solution for this problem using an external sort with the suffix array in-memory sorting and temporary file compression. The proposed algorithm for computing duplicate n-grams uses a fixed amount of memory for any input size.

Extraction and Evaluation of Keywords from Learning Objects: a Multilingual Approach
Lothar Lemnitzer | Paola Monachesi

We report about a project which brings together Natural Language Processing and eLearning. One of the functionalities developed within this project is the possibility to annotate learning objects semi-automatically with keywords. To this end, a keyword extractor has been created which is able to handle documents in 8 languages. The approach employed is based on a linguistic processing step which is followed by a filtering step of candidate keywords and their subsequent ranking based on frequency criteria. Three tests have been carried out to provide a rough evaluation of the performance of the tool, to measure inter annotator agreement in order to determine the complexity of the task and to evaluate the acceptance of the proposed keywords by users.

Exploiting the Role of Position Feature in Chinese Relation Extraction
Peng Zhang | Wenjie Li | Furu Wei | Qin Lu | Yuexian Hou

Relation extraction is the task of finding pre-defined semantic relations between two entities or entity mentions from text. Many methods, such as feature-based and kernel-based methods, have been proposed in the literature. Among them, feature-based methods draw much attention from researchers. However, to the best of our knowledge, existing feature-based methods did not explicitly incorporate the position feature and no in-depth analysis was conducted in this regard. In this paper, we define and exploit nine types of position information between two named entity mentions and then use it along with other features in a multi-class classification framework for Chinese relation extraction. Experiments on the ACE 2005 data set show that the position feature is more effective than the other recognized features like entity type/subtype and character-based N-gram context. Most important, it can be easily captured and does not require as much effort as applying deep natural language processing.

Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation
Ben Allison | Louise Guthrie

The release of the Enron corpus provided a unique resource for studying aspects of email use, because it is largely unfiltered, and therefore presents a relatively complete collection of emails for a reasonably large number of correspondents. This paper describes a newly created subcorpus of the Enron emails which we suggest can be used to test techniqes for authorship attribution, and further shows the application of three different classification methods to this task to present baseline results. Two of the classifiers used are are standard, and have been shown to perform well in the literature, and one of the classifiers is novel and based on concurrent work that proposes a Bayesian hierarchical distribution for word counts in documents. For each of the classifiers, we present results using six text representations, including use of linguistic structures derived from a parser as well as lexical information.

Creating a Research Collection of Question Answer Sentence Pairs with Amazon’s Mechanical Turk
Michael Kaisser | John Lowe

Each year NIST releases a set of question, document id, answer-triples for the factoid questions used in the TREC Question Answering track. While this resource is widely used and proved itself useful for many purposes, it also is too coarse a grain-size for a lot of other purposes. In this paper we describe how we have used Amazon’s Mechanical Turk to have multiple subjects read the documents and identify the sentences themselves which contain the answer. For most of the 1911 questions in the test sets from 2002 to 2006 and each of the documents said to contain an answer, the Question-Answer Sentence Pairs (QASP) corpus introduced in this paper contains the identified answer sentences. We believe that this corpus, which we will make available to the public, can further stimulate research in QA, especially linguistically motivated research, where matching the question to the answer sentence by either syntactic or semantic means is a central concern.

Adaptation of Relation Extraction Rules to New Domains
Feiyu Xu | Hans Uszkoreit | Hong Li | Niko Felger

This paper presents various strategies for improving the extraction performance of less prominent relations with the help of the rules learned for similar relations, for which large volumes of data are available that exhibit suitable data properties. The rules are learned via a minimally supervised machine learning system for relation extraction called DARE. Starting from semantic seeds, DARE extracts linguistic grammar rules associated with semantic roles from parsed news texts. The performance analysis with respect to different experiment domains shows that the data property plays an important role for DARE. Especially the redundancy of the data and the connectivity of instances and pattern rules have a strong influence on recall. However, most real-world data sets do not possess the desirable small-world property. Therefore, we propose three scenarios to overcome the data property problem of some domains by exploiting a similar domain with better data properties. The first two strategies stay with the same corpus but try to extract new similar relations with learned rules. The third strategy adapts the learned rules to a new corpus. All three strategies show that frequently mentioned relations can help in the detection of less frequent relations.

Boosting Precision and Recall of Hyponymy Relation Acquisition from Hierarchical Layouts in Wikipedia
Asuka Sumida | Naoki Yoshinaga | Kentaro Torisawa

This paper proposes an extension of Sumida and Torisawa’s method of acquiring hyponymy relations from hierachical layouts in Wikipedia (Sumida and Torisawa, 2008). We extract hyponymy relation candidates (HRCs) from the hierachical layouts in Wikipedia by regarding all subordinate items of an item x in the hierachical layouts as x’s hyponym candidates, while Sumida and Torisawa (2008) extracted only direct subordinate items of an item x as x’s hyponym candidates. We then select plausible hyponymy relations from the acquired HRCs by running a filter based on machine learning with novel features, which even improve the precision of the resulting hyponymy relations. Experimental results show that we acquired more than 1.34 million hyponymy relations with a precision of 90.1%.

Parameters for Topic Boundary Detection in Multi-Party Dialogues
Margot Mieskes | Michael Strube

We present a topic boundary detection method that searches for connections between sequences of utterances in multi party dialogues. The connections are established based on word identity. We compare our method to a state-of-the art automatic Topic boundary detection method that was also used on multi party dialogues. We checked various methods of preprocessing of the data, including stemming, lemmatization and stopword filtering with a text-based as well as speech-based stopword lists. Using standard evaluation methods we found that our method outperformed the state-of-the art method.

Semantic Press
Eugenio Picchi | Eva Sassolini | Sebastiana Cucurullo | Francesca Bertagna | Paola Baroni

In this paper Semantic Press, a tool for the automatic press review, is introduced. It is based on Text Mining technologies and is tailored to meet the needs of the eGovernment and eParticipation communities. First, a general description of the application demands emerging from the eParticipation and eGovernment sectors is offered. Then, an introduction to the framework of the automatic analysis and classification of newspaper content is provided, together with a description of the technologies underlying it.

An Approach to Modeling Heterogeneous Resources for Information Extraction
Lei Xia | José Iria

In this paper, we describe an approach that aims to model heterogeneous resources for information extraction. Document is modeled in graph representation that enables better understanding of multi-media document and its structure which ultimately could result better cross-media information extraction. We also describe our proposed algorithm that segment document-based on the document modeling approach we described in this paper.

On Classifying Coherent/Incoherent Romanian Short Texts
Anca Dinu

In this paper we present and discuss the results of a text coherence experiment performed on a small corpus of Romanian text from a number of alternative high school manuals. During the last 10 years, an abundance of alternative manuals for high school was produced and distributed in Romania. Due to the large amount of material and to the relative short time in which it was produced, the question of assessing the quality of this material emerged; this process relied mostly of subjective human personal opinion, given the lack of automatic tools for Romanian. Debates and claims of poor quality of the alternative manuals resulted in a number of examples of incomprehensible / incoherent paragraphs extracted from such manuals. Our goal was to create an automatic tool which may be used as an indication of poor quality of such texts. We created a small corpus of representative texts from Romanian alternative manuals. We manually classified the chosen paragraphs from such manuals into two categories: comprehensible/coherent text and incomprehensible/incoherent text. We then used different machine learning techniques to automatically classify them in a supervised manner. Our approach is rather simple, but the results are encouraging.

Characterization of Scientific and Popular Science Discourse in French, Japanese and Russian
Lorraine Goeuriot | Natalia Grabar | Béatrice Daille

We aim to characterize the comparability of corpora, we address this issue in the trilingual context through the distinction of expert and non expert documents. We work separately with corpora composed of documents from the medical domain in three languages (French, Japanese and Russian) which present an important linguistic distance between them. In our approach, documents are characterized in each language by their topic and by a discursive typology positioned at three levels of document analysis: structural, modal and lexical. The document typology is implemented with two learning algorithms (SVMlight and C4.5). Evaluation of results shows that the proposed discursive typology can be transposed from one language to another, as it indeed allows to distinguish the two aimed discourses (science and popular science). However, we observe that performances vary a lot according to languages, algorithms and types of discursive characteristics.

Converting Romanized Persian to the Arabic Writing Systems
Jalal Maleki | Lars Ahrenberg

This paper describes a syllabification based conversion method for converting romanized Persian text to the traditional Arabic-based writing system. The system is implemented in Xerox XFST and relies on rule based conversion of words rather than using morphological analysis. The paper presents a brief evaluation of the accuracy of the transcriptions generated by the method.

Unsupervised Learning-based Anomalous Arabic Text Detection
Nasser Abouzakhar | Ben Allison | Louise Guthrie

The growing dependence of modern society on the Web as a vital source of information and communication has become inevitable. However, the Web has become an ideal channel for various terrorist organisations to publish their misleading information and send unintelligible messages to communicate with their clients as well. The increase in the number of published anomalous misleading information on the Web has led to an increase in security threats. The existing Web security mechanisms and protocols are not appropriately designed to deal with such recently developed problems. Developing technology to detect anomalous textual information has become one of the major challenges within the NLP community. This paper introduces the problem of anomalous text detection by automatically extracting linguistic features from documents and evaluating those features for patterns of suspicious and/or inconsistent information in Arabic documents. In order to achieve that, we defined specific linguistic features that characterise various Arabic writing styles. Also, the paper introduces the main challenges in Arabic processing and describes the proposed unsupervised learning model for detecting anomalous Arabic textual information.

Condensing Sentences for Subtitle Generation
Prokopis Prokopidis | Vassia Karra | Aggeliki Papagianopoulou | Stelios Piperidis

Text condensation aims at shortening the length of an utterance without losing essential textual information. In this paper, we report on the implementation and preliminary evaluation of a sentence condensation tool for Greek using a manually constructed table of 450 lexical paraphrases, and a set of rules that delete syntactic subtrees that carry minor semantic information. Evaluation on two-sentence sets show promising results regarding grammaticality and semantic acceptability of compressed versions.

Making Text Resources Accessible to the Reader: the Case of Patent Claims
Simon Mille | Leo Wanner

Hardly any other kind of text structures is as notoriously difficult to read as patents. This is first of all due to their abstract vocabulary and their very complex syntactic constructions. Especially the claims in a patent are a challenge: in accordance with international patent writing regulations, each claim must be rendered in a single sentence. As a result, sentences with more than 200 words are not uncommon. Therefore, paraphrasing of the claims in terms the user can understand is of high demand. We present a rule-based paraphrasing module that realizes paraphrasing of patent claims in English as a rewriting task. Prior to the rewriting proper, the module implies the stages of simplification and discourse and syntactic analyses. The rewriting makes use of a full-fledged text generator and consists in a number of genuine generation tasks such as aggregation, selection of referring expressions, choice of discourse markers and syntactic generation. As generator, we use the MATE-work bench, which is based on the Meaning-Text Theory of linguistics.

Exploiting Lexical Resources for Disambiguating CJK and Arabic Orthographic Variants
Jack Halpern

The orthographical complexities of Chinese, Japanese, Korean (CJK) and Arabic pose a special challenge to developers of NLP applications. These difficulties are exacerbated by the lack of a standardized orthography in these languages, especially the highly irregular Japanese orthography and the ambiguities of the Arabic script. This paper focuses on CJK and Arabic orthographic variation and provides a brief analysis of the linguistic issues. The basic premise is that statistical methods by themselves are inadequate, and that linguistic knowledge supported by large-scale lexical databases should play a central role in achieving high accuracy in disambiguating and normalizing orthographic variants.

Automatic Document Quality Control
Neil Newbold | Lee Gillam

This paper focuses on automatically improving the readability of documents. We explore mechanisms relating to content control that could be used (i) by authors to improve the quality and consistency of the language used in authoring; and (ii) to find a means to demonstrate this to readers. To achieve this, we implemented and evaluated a number of software components, including those of the University of Surrey Department of Computing’s content analysis applications (System Quirk). The software integrates these components within the commonly available GATE software and incorporates language resources considered useful within the standards development process: a Plain English thesaurus; lookup of ISO terminology provided from a terminology management system (TMS) via ISO 16642; automatic terminology discovery using statistical and linguistic techniques; and readability metrics. Results lead us to the development of an assistive tool, initially for authors of standards but not considered to be limited only to such authors, and also to a system that provides automatic annotation of texts to help readers to understand them. We describe the system developed and made freely available under the auspices of the EU eContent project LIRICS.

OpenCCG Workbench and Visualization Tool
Thepchai Supnithi | Suchinder Singh | Taneth Ruangrajitpakorn | Prachya Boonkwan | Monthika Boriboon

Combinatorial Category Grammar is (CCG) a lexicalized grammar formalism which is expressed by syntactic category, a logical form representation. There are difficulties in representing CCG without any visualization tools. This paper presents a design framework of OpenCCG workbench and visualization tool which enables linguists to develop CCG based lexicons more easily. Our research is aimed to resolve these gaps by developing a user-friendly tool. OpenCCG Workbench, an open source web-based environment, was developed to enable multiple users to visually create and update grammars for using with the OpenCCG library. It was designed to streamline and speed-up the lexicon building process, and to free the linguists from writing XML files which is both cumbersome and error-prone. The system consists of three sub-systems: grammar management system, grammar validator system, and concordance retrieval system. In this paper we will mainly discuss the most important parts, grammar management and validation systems, which are directly related to a CCG lexicon construction. We support users in three levels; Expert linguists who play a role as lexical entry designer, normal linguists who adds or edits lexicons, and guests who requires an acquisition to the lexicon into their applications.

Using the Web as a Linguistic Resource to Automatically Correct Lexico-Syntactic Errors
Matthieu Hermet | Alain Désilets | Stan Szpakowicz

This paper presents an algorithm for correcting language errors typical of second-language learners. We focus on preposition errors, which are very common among second-language learners but are not addressed well by current commercial grammar correctors and editing aids. The algorithm takes as input a sentence containing a preposition error (and possibly other errors as well), and outputs the correct preposition for that particular sentence context. We use a two-phase hybrid rule-based and statistical approach. In the first phase, rule-based processing is used to generate a short expression that captures the context of use of the preposition in the input sentence. In the second phase, Web searches are used to evaluate the frequency of this expression, when alternative prepositions are used instead of the original one. We tested this algorithm on a corpus of 133 French sentences written by intermediate second-language learners, and found that it could address 69.9% of those cases. In contrast, we found that the best French grammar and spell checker currently on the market, Antidote, addressed only 3% of those cases. We also showed that performance degrades gracefully when using a corpus of frequent n-grams to evaluate frequencies.

I saw TREE trees in the park: How to Correct Real-Word Spelling Mistakes
Davide Fossati | Barbara Di Eugenio

This paper presents a context sensitive spell checking system that uses mixed trigram models, and introduces a new empirically grounded method for building confusion sets. The proposed method has been implemented, tested, and evaluated in terms of coverage, precision, and recall. The results show that the method is effective.

User-Centred Design of Error Correction Tools
Martí Quixal | Toni Badia | Francesc Benavent | Jose R. Boullosa | Judith Domingo | Bernat Grau | Guillem Massó | Oriol Valentín

This paper presents a methodology for the design and implementation of user-centred language checking applications. The methodology is based on the separation of three critical aspects in this kind of application: functional purpose (educational or corrective goal), types of warning messages, and linguistic resources and computational techniques used. We argue that to assure a user-centred design there must be a clear-cut division between the “error” typology underlying the system and the software architecture. The methodology described has been used to implement two different user-driven spell, grammar and style checkers for Catalan. We discuss that this is an issue often neglected in commercial applications, and remark the benefits of such a methodology in the scalability of language checking applications. We evaluate our application in terms of recall, precision and noise, and compare it to the only other existing grammar checker for Catalan, to our knowledge.

Professor or Screaming Beast? Detecting Anomalous Words in Chinese
Wei Liu | Ben Allison | Louise Guthrie

The Internet has become the most popular platform for communication. However because most of the modern computer keyboard is Latin-based, Asian languages such as Chinese cannot input its characters (Hanzi) directly with these keyboards. As a result, methods for representing Chinese characters using Latin alphabets were introduced. The most popular method among these is the Pinyin input system. Pinyin is also called “Romanised” Chinese in that it phonetically resembles a Chinese character. Due to the highly ambiguous mapping from Pinyin to Chinese characters, word misuses can occur using standard computer keyboard, and more commonly so in internet chat-rooms or instant messengers where the language used is less formal. In this paper we aim to develop a system that can automatically identify such anomalies, whether they are simple typos or whether they are intentional. After identifying them, the system should suggest the correct word to be used.

Spelling Correction: from Two-Level Morphology to Open Source
Iñaki Alegria | Klara Ceberio | Nerea Ezeiza | Aitor Soroa | Gregorio Hernandez

Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of languages and there are two-level based descriptions for very different languages. After doing the morphological description for a language, it is easy to develop a spelling checker/corrector for this language. However, what happens if we want to use the speller in the “free world” (OpenOffice, Mozilla, emacs, LaTeX, etc.)? Ispell and similar tools (aspell, hunspell, myspell) are the usual mechanisms for these purposes, but they do not fit the two-level model. In the absence of two-level morphology based mechanisms, an automatic conversion from two-level description to hunspell is described in this paper.

Automatic Rewriting of Patient Record Narratives
Catalina Hallett | David Hardcastle

Patients require access to Electronic Patient Records, however medical language is often too difficult for patients to understand. Explaining records to patients is a time-consuming task, which we attempt to simplify by automating the translation procedure. This paper introduces a research project dealing with the automatic rewriting of medical narratives for the benefit of patients. We are looking at various ways in which technical language can be transposed into patient-friendly language by means of a comparison with patient information materials. The text rewriting procedure we describe could potentially have an impact on the quality of information delivered to patients. We report on some preliminary experiments concerning rewriting at lexical and paragaph level. This is an ongoing project which currently addresses a restricted number of issues, including target text modelling and text rewriting at lexical level.

BART: A modular toolkit for coreference resolution
Yannick Versley | Simone Ponzetto | Massimo Poesio | Vladimir Eidelman | Alan Jern | Jason Smith | Xiaofeng Yang | Alessandro Moschitti

Developing a full coreference system able to run all the way from raw text to semantic interpretation is a considerable engineering effort. Accordingly, there is very limited availability of off-the shelf tools for researchers whose interests are not primarily in coreference or others who want to concentrate on a specific aspect of the problem. We present BART, a highly modular toolkit for developing coreference applications. In the Johns Hopkins workshop on using lexical and encyclopedic knowledge for entity disambiguation, the toolkit was used to extend a reimplementation of Soon et al.’s proposal with a variety of additional syntactic and knowledge-based features, and experiment with alternative resolution processes, preprocessing tools, and classifiers. BART has been released as open source software and is available from

ANAWIKI: Creating Anaphorically Annotated Resources through Web Cooperation
Massimo Poesio | Udo Kruschwitz | Jon Chamberlain

The ability to make progress in Computational Linguistics depends on the availability of large annotated corpora, but creating such corpora by hand annotation is very expensive and time consuming; in practice, it is unfeasible to think of annotating more than one million words. However, the success of Wikipedia and other projects shows that another approach might be possible: take advantage of the willingness of Web users to contribute to collaborative resource creation. AnaWiki is a recently started project that will develop tools to allow and encourage large numbers of volunteers over the Web to collaborate in the creation of semantically annotated corpora (in the first instance, of a corpus annotated with information about anaphora).

Influence of Text Type and Text Length on Anaphoric Annotation
Daniela Goecke | Maik Stührenberg | Andreas Witt

We report the results of a study that investigates the agreement of anaphoric annotations. The study focuses on the influence of the factors text length and text type on a corpus of scientific articles and newspaper texts. In order to measure inter-annotator agreement we compare existing approaches and we propose to measure each step of the annotation process separately instead of measuring the resulting anaphoric relations only. A total amount of 3,642 anaphoric relations has been annotated for a corpus of 53,038 tokens (12,327 markables). The results of the study show that text type has more influence on inter-annotator agreement than text length. Furthermore, the definition of well-defined annotation instructions and coder training is a crucial point in order to receive good annotation results.

Annotating Abstract Pronominal Anaphora in the DAD Project
Costanza Navarretta | Sussi Olsen

In this paper we present an extension of the MATE/GNOME annotation scheme for anaphora (Poesio, 2004) which accounts for abstract anaphora in Danish and Italian. By abstract anaphora it is here meant pronouns whose linguistic antecedents are verbal phrases, clauses and discourse segments. The extended scheme, which we call the DAD annotation scheme, allows to annotate information about abstract anaphora which is important to investigate their use, see i.a. (Webber, 1988; Gundel et al., 2003; Navarretta, 2004; Navarretta, 2007) and which can influence their automatic treatment. Intercoder agreement scores obtained by applying the DAD annotation scheme on texts and dialogues in the two languages are given and show that the information proposed in the scheme can be recognised in a reliable way.

Deriving Rhetorical Complexity Data from the RST-DT Corpus
Sandra Williams | Richard Power

This paper describes a study of the levels at which different rhetorical relations occur in rhetorical structure trees. In a previous empirical study (Williams and Reiter, 2003) of the RST-DT (Rhetorical Structure Theory Discourse Treebank) Corpus (Carlson et al., 2003), we noticed that certain rhetorical relations tended to occur more frequently at higher levels in a rhetorical structure tree, whereas others seemed to occur more often at lower levels. The present study takes a closer look at the data, partly to test this observation, and partly to investigate related issues such as the relative complexity of satellite and nucleus for each type of relation. One practical application of this investigation would be to guide discourse planning in Natural Language Generation (NLG), so that it reflects more accurately the structures found in documents written by human authors. We present our preliminary findings and discuss their relevance for discourse planning.

Knowledge-based Coreference Resolution for Hungarian
Márton Miháltz

We present a knowledge-based coreference resolution system for noun phrases in Hungarian texts. The system is used as a module in an automated psychological text processing project. Our system uses rules that rely on knowledge from the morphological, syntactic and semantic output of a deep parser and semantic relations form the Hungarian WordNet ontology. We also use rules that rely on Binding Theory, research results in Hungarian psycholinguistics, current research on proper name coreference identification and our own heuristics. We describe the constraints-and-preferences algorithm in detail that attempts to find coreference information for proper names, common nouns, pronouns and zero pronouns in texts. We present evaluation results for our system on a corpus manually annotated with coreference relations. Precision of the resolution of various coreference types reaches up to 80%, while overall recall is 63%. We also present an investigation of the various error types our system produced along with an analysis of the results.

The Italian Particle “ne”: Corpus Construction and Analysis
Malvina Nissim | Sara Perboni

The Italian particle “ne” exhibits interesting anaphoric properties that have not been yet explored in depth from a corpus and computational linguistic perspective. We provide: (i) an overview of the phenomenon; (ii) a set of annotation schemes for marking up occurrences of “ne”; (iii) the description of a corpus annotated for this phenomenon ; (iv) a first assessment of the resolution task. We show that the schemes we developed are reliable, and that the actual distribution of partitive and non-partitive uses of “ne” is inversely proportional to the amount of attention that the two different uses have received in the linguistic literature. As an assessment of the complexity of the resolution task, we find that a recency-based baseline yields an accuracy of less than 30% on both development and test data.

Introducing DRS (The Digital Replay System): a Tool for the Future of Corpus Linguistic Research and Analysis
Dawn Knight | Paul Tennent

This paper outlines the new resource technologies, products and applications that have been constructed during the development of a multi-modal (MM hereafter) corpus tool on the DReSS project (Understanding New Forms of the Digital Record for e-Social Science), based at the University of Nottingham, England. The paper provides a brief outline of the DRS (Digital Replay System, the software tool at the heart of the corpus), highlighting its facility to display synchronised video, audio and textual data and, most relevantly, a concordance tool capable of interrogating data constructed from textual transcriptions anchored to video or audio, and from coded annotations of specific features of gesture-in-talk. This is complemented by a real-time demonstration of the DRS interface in-use as part of the LREC 2008 conference. This will serve to show the manner in which a system such as the DRS can be used to facilitate the assembly, storage and analysis of multi modal corpora, supporting both qualitative and quantitative approaches to the analysis of collected data.

An Inverted Index for Storing and Retrieving Grammatical Dependencies
Michaela Atterer | Hinrich Schütze

Web count statistics gathered from search engines have been widely used as a resource in a variety of NLP tasks. For some tasks, however, the information they exploit is not fine-grained enough. We propose an inverted index over grammatical relations as a fast and reliable resource to access more general and also more detailed frequency information. To build the index, we use a dependency parser to parse a large corpus. We extract binary dependency relations, such as he-subj-say (“he” is the subject of “say”) as index terms and construct the index using publicly available open-source indexing software. The unit we index over is the sentence. The index can be used to extract grammatical relations and frequency counts for these relations. The framework also provides the possibility to search for partial dependencies (say, the frequency of “he” occurring in subject position), words, strings and a combination of these. One possible application is the disambiguation of syntactic structures.

MaltEval: an Evaluation and Visualization Tool for Dependency Parsing
Jens Nilsson | Joakim Nivre

This paper presents a freely available evaluation tool for dependency parsing: MaltEval ( It is flexible and extensible, and provides functionality for both quantitative evaluation and visualization of dependency structure. The quantitative evaluation is compatible with other standard evaluation software for dependency structure which does not produce visualization of dependency structure, and can output more details as well as new types of evaluation metrics. In addition, MaltEval has generic support for confusion matrices. It can also produce statistical significance tests when more than one parsed file is specified. The visualization module also has the ability to highlight discrepancies between the gold-standard files and the parsed files, and it comes with an easy to use GUI functionality to search in the dependency structure of the input files.

New Functions of FrameSQL for Multilingual FrameNets
Hiroaki Sato

The Berkeley FrameNet Project (BFN) is making an English lexical database called FrameNet, which describes syntactic and semantic properties of an English lexicon extracted from large electronic text corpora (Baker et al., 1998). Other projects dealing with Spanish, German and Japanese follow a similar approach and annotate large corpora. FrameSQL is a web-based application developed by the author, and it allows the user to search the BFN database in a variety of ways (Sato, 2003). FrameSQL shows a clear view of the headword’s grammar and combinatorial properties offered by the FrameNet database. FrameSQL has been developing and new functions were implemented for processing the Spanish FrameNet data (Subirats and Sato, 2004). FrameSQL is also in the process of incorporating the data of the Japanese FrameNet Project (Ohara et al., 2003) and that of the Saarbrücken Lexical Semantics Acquisition Project (Erk et al., 2003) into the database and will offer the same user-interface for searching these lexical data. This paper describes new functions of FrameSQL, showing how FrameSQL deals with the lexical data of English, Spanish, Japanese and German seamlessly.

Division of Example Sentences Based on the Meaning of a Target Word Using Semi-Supervised Clustering
Hiroyuki Shinnou | Minoru Sasaki

In this paper, we describe a system that divides example sentences (data set) into clusters, based on the meaning of the target word, using a semi-supervised clustering technique. In this task, the estimation of the cluster number (the number of the meaning) is critical. Our system primarily concentrates on this aspect. First, a user assigns the system an initial cluster number for the target word. The system then performs general clustering on the data set to obtain small clusters. Next, using constraints given by the user, the system integrates these clusters to obtain the final clustering result. Our system performs this entire procedure with high precision and requiring only a few constraints. In the experiment, we tested the system for 12 Japanese nouns used in the SENSEVAL2 Japanese dictionary task. The experiment proved the effectiveness of our system. In the future, we will improve sentence similarity measurements.

The Japanese FrameNet Software Tools
Hiroaki Saito | Shunta Kuboya | Takaaki Sone | Hayato Tagami | Kyoko Ohara

This paper describes an ongoing project “Japanese FrameNet (JFN)”, a corpus-based lexicon of Japanese in the FrameNet style. This paper focuses on the set of software tools tailored for the JFN annotation process. As the first step in the annotation, annotators select target sentences from the JFN corpus using the JFN kwic search tool, where they can specify cooccurring words and/or the part of speech of collocates. Our search tool is capable of displaying the parsed tree of a target sentence and its neigbouring sentences. The JFN corpus mainly consists of balanced and copyright-free “Japanese Corpus” which is being built as a national project. After the sentence to be annotated is chosen, the annotator labels syntactic and semantic tags to the appropriate phrases in the sentence. This work is performed on an annotation platform called JFNDesktop, in which the functions of labeling assist and consistency checking of annotations are available. Preliminary evaluation of our platform shows such functions accelerate the annotation process.

JMWNL: an Extensible Multilingual Library for Accessing Wordnets in Different Languages
Maria Teresa Pazienza | Armando Stellato | Alexandra Tudorache

In this paper we present JMWNL, a multilingual extension of the JWNL java library, which was originally developed for accessing Princeton WordNet dictionaries. JMWNL broadens the range of JWNL’s accessible resources by covering also dictionaries produced inside the EuroWordNet project. Specific resources, such as language-dependent algorithmic stemmers, have been adopted to cover the diversities in the morphological nature of words in the addressed idioms. New semantic and lexical relations have been included to maximize compatibility with new versions of the original Princeton WordNet and to include the whole range of relations from EuroWordNet. Relations from Princeton WordNet on one side and EuroWordNet on the other one have in some cases been mapped to provide a uniform reference for coherent cross-linguistic use of the library.

Benchmarking Textual Annotation Tools for the Semantic Web
Diana Maynard

This paper investigates the state of the art in automatic textual annotation tools, and examines the extent to which they are ready for use in the real world. We define some benchmarking criteria for measuring the usability of annotation tools, and examine those factors which are particularly important for a real user to be able to determine which is the most suitable tool for their use. We discuss factors such as usability, accessibility, interoperability and scalability, and evaluate a set of annotation tools according to these factors. Finally, we draw some conclusions about the current state of research in annotation and make some suggestions for the future.

Authorship Identification of Romanian Texts with Controversial Paternity
Liviu Dinu | Marius Popescu | Anca Dinu

In this work we propose a new strategy for the authorship identification problem and we test it on an example from Romanian literature: did Radu Albala found the continuation of Mateiu Caragiale’s novel Sub pecetea tainei, or did he write himself the respective continuation? The proposed strategy is based on the similarity of rankings of function words; we compare the obtained results with the results obtained by a learning method (namely Support Vector Machines -SVM- with a string kernel).

Ensuring Semantic Interoperability on Lexical Resources
Marc Kemps-Snijders | Claus Zinn | Jacquelijn Ringersma | Menzo Windhouwer

In this paper, we describe a unifying approach to tackle data heterogeneity issues for lexica and related resources. We present LEXUS, our software that implements the Lexical Markup Framework (LMF) to uniformly describe and manage lexica of different structures. LEXUS also makes use of a central Data Category Registry (DCR) to address terminological issues with regard to linguistic concepts as well as the handling of working and object languages. Finally, we report on ViCoS, a LEXUS extension, providing support for the definition of arbitrary semantic relations between lexical entries or parts thereof.

Exploring and Navigating: Tools for GermaNet
Marc Finthammer | Irene Cramer

GermaNet is regarded to be a valuable resource for many German NLP applications, corpus research, and teaching. This demo presents three GUI-based tools meant to facilitate the exploration of and navigation through GermaNet. The GermaNet Explorer exhibits various retrieval, sort, filter and visualization functions for words/synsets and also provides an insight into the modeling of GermaNet’s semantic relations as well as its representation as a graph. The GermaNet-Measure-API and GermaNet Pathfinder offer methods for the calculation of semantic relatedness based on GermaNet as a resource and the visualization of (semantic) paths between words/synsets. The GermaNet-Measure-API furthermore features a flexible interface, which facilitates the integration of all relatedness measures provided into user-defined applications. We have already used the three tools in our research on thematic chaining and thematic indexing, as a tool for the manual annotation of lexical chains, and as a resource in our courses on corpus linguistics and semantics.

A Knowledge-Modeling Approach for Multilingual Regulus Lexica
Marianne Santaholma | Nikos Chatzichrisafis

Development of lexical resources is, along with grammar development, one of the main efforts when building multilingual NLP applications. In this paper, we present a tool-based approach for more efficient manual lexicon development for a spoken language translation system. The approach in particular addresses the common problems of multilingual lexica including the redundancy of encoded information and inconsistency of lexica of different languages. The general benefits of this practical tool-based approach are clear and user-friendly lexicon structure, inheritance of information inside of a language and between different system languages, and transparency and consistency of coverage between system languages. The visual tool-based approach is user-friendly to linguistic informants that don’t have previous experience of lexicon development, while at the same time, it still is a powerful tool for expert system developers.

ODL: an Object Description Language for Lexical Information
Michael Rosner

This paper describes ODL, a description language for lexical information that is being developed within the context of a national project called MLRS (Maltese Language Resource Server) whose goal is to create a national corpus and computational lexicon for the Maltese language. The main aim of ODL is to make the task of the lexicographer easier by allowing lexical specifications to be set out formally so that actual entries will conform to them. The paper describes some of the background motivation, the ODL language itself, and concludes with a short example of how lexical values expressed in ODL can be mapped to an existing tagset together with some speculations about future work.

How to Evaluate and Raise the Quality in a Collaborative Lexicographic Approach
Dan Cristea | Corina Forăscu | Marius Răschip | Michael Zock

This paper focuses on different aspects of collaborative work used to create the electronic version of a dictionary in paper format, edited and printed by the Romanian Academy during the last century. In order to ensure accuracy in a reasonable amount of time, collaborative proofreading of the scanned material, through an on-line interface has been initiated. The paper details the activities and the heuristics used to maximize accuracy, and to evaluate the work of anonymous contributors with diverse backgrounds. Observing the behaviour of the enterprise for a period of 6 months allows estimating the feasibility of the approach till the end of the project.

Merging a Syntactic Resource with a WordNet: a Feasibility Study of a Merge between STO and DanNet
Bolette Sandford Pedersen | Anna Braasch | Lina Henriksen | Sussi Olsen | Claus Povlsen

This paper presents a feasibility study of a merge between SprogTeknologisk Ordbase (STO), which contains morphological and syntactic information, and DanNet, which is a Danish WordNet containing semantic information in terms of synonym sets and semantic relations. The aim of the merge is to develop a richer, composite resource which we believe will have a broader usage perspective than the two seen in isolation. In STO, the organizing principle is based on the observable syntactic features of a lemma’s near context (labeled syntactic units or SynUs). In contrast, the basic unit in DanNet is constituted by semantic senses or - in wordnet terminology - synonym sets (synsets). The merge of the two resources is thus basically to be understood as a linking between SynUs and synsets. In the paper we discuss which parts of the merge can be performed semi-automatically and which parts require manual linguistic matching procedures. We estimate that this manual work will amount to approx. 39% of the lexicon material.

Hydra: a Modal Logic Tool for Wordnet Development, Validation and Exploration
Borislav Rizov

This paper presents a multipurpose system for wordnet (WN) development, named Hydra. Hydra is an application for data editing and validation, as well as for data retrieval and synchronization between wordnets for different languages. The use of modal language for wordnet, the representation of wordnet as a relational database and the concurrent access are among its main advantages.

Evaluation of several Maximum Likelihood Linear Regression Variants for Language Adaptation
Míriam Luján | Carlos D. Martínez | Vicent Alabau

Multilingual Automatic Speech Recognition (ASR) systems are of great interest in multilingual environments. We studied the case of the Comunitat Valenciana where the two official languages are Spanish and Valencian. These two languages share most of their phonemes, and their syntax and vocabulary are also quite similar since they have influenced each other for many years. We constructed a system, and trained its acoustic models with a small corpus of Spanish and Valencian, which has produced poor results due to the lack of data. Adaptation techniques can be used to adapt acoustic models that are trained with a large corpus of a language inr order to obtain acoustic models for a phonetically similar language. This process is known as language adaptation. The Maximum Likelihood Linear Regression (MLLR) technique has commonly been used in speaker adaptation; however we have used MLLR in language adaptation. We compared several MLLR variants (mean square, diagonal matrix and full matrix) for language adaptation in order to choose the best alternative for our system.

Evaluation of Lexical Resources and Semantic Networks on a Corpus of Mental Associations
Laurianne Sitbon | Patrice Bellot | Philippe Blache

When a user cannot find a word, he may think of semantically related words that could be used into an automatic process to help him. This paper presents an evaluation of lexical resources and semantic networks for modelling mental associations. A corpus of associations has been constructed for its evaluation. It is composed of 20 low frequency target words each associated 5 times by 20 users. In the experiments we look for the target word in propositions made from the associated words thanks to 5 different resources. The results show that even if each resource has a useful specificity, the global recall is low. An experiment to extract common semantic features of several associations showed that we cannot expect to see the target word below a rank of 20 propositions.

Measures for Term and Sentence Relevances: an Evaluation for German
Heike Bieler | Stefanie Dipper

Terms, term relevances, and sentence relevances are concepts that figure in many NLP applications, such as Text Summarization. These concepts are implemented in various ways, though. In this paper, we want to shed light on the impact that different implementations can have on the overall performance of the systems. In particular, we examine the interplay between term definitions and sentence-scoring functions. For this, we define a gold standard that ranks sentences according to their significance and evaluate a range of relevant parameters with respect to the gold standard.

Annotation of Information Structure: an Evaluation across different Types of Texts
Julia Ritz | Stefanie Dipper | Michael Götze

We report on the evaluation of information structural annotation according to the Linguistic Information Structure Annotation Guidelines (LISA, (Dipper et al., 2007)). The annotation scheme differentiates between the categories of information status, topic, and focus. It aims at being language-independent and has been applied to highly heterogeneous data: written and spoken evidence from typologically diverse languages. For the evaluation presented here, we focused on German texts of different types, both written texts and transcriptions of spoken language, and analyzed the annotation quantitatively and qualitatively.

Word Segmentation of Vietnamese Texts: a Comparison of Approaches
Quang Thắng Đinh | Hồng Phương Lê | Thị Minh Huyền Nguyễn | Cẩm Tú Nguyễn | Mathias Rossignol | Xuân Lương Vũ

We present in this paper a comparison between three segmentation systems for the Vietnamese language. Indeed, the majority of Vietnamese words is built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. So the identification of word boundaries in a text is not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese, and introduce a reference corpus developed for the purpose of evaluating such a task. The results observed confirm that it can be relatively well treated by automatic means, although a solution needs to be found to take into account out-of-vocabulary words.

Comparing Italian parsers on a common Treebank: the EVALITA experience
Cristina Bosco | Alessandro Mazzei | Vincenzo Lombardo | Giuseppe Attardi | Anna Corazza | Alberto Lavelli | Leonardo Lesmo | Giorgio Satta | Maria Simi

The EVALITA 2007 Parsing Task has been the first contest among parsing systems for Italian. It is the first attempt to compare the approaches and the results of the existing parsing systems specific for this language using a common treebank annotated using both a dependency and a constituency-based format. The development data set for this parsing competition was taken from the Turin University Treebank, which is annotated both in dependency and constituency format. The evaluation metrics were those standardly applied in CoNLL and PARSEVAL. The results of the parsing results are very promising and higher than the state-of-the-art for dependency parsing of Italian. An analysis of such results is provided, which takes into account other experiences in treebank-driven parsing for Italian and for other Romance languages (in particular, the CoNLL X & 2007 shared tasks for dependency parsing). It focuses on the characteristics of data sets, i.e. type of annotation and size, parsing paradigms and approaches applied also to languages other than Italian.

Evaluation of Natural Language Tools for Italian: EVALITA 2007
Bernardo Magnini | Amedeo Cappelli | Fabio Tamburini | Cristina Bosco | Alessandro Mazzei | Vincenzo Lombardo | Francesca Bertagna | Nicoletta Calzolari | Antonio Toral | Valentina Bartalesi Lenzi | Rachele Sprugnoli | Manuela Speranza

EVALITA 2007, the first edition of the initiative devoted to the evaluation of Natural Language Processing tools for Italian, provided a shared framework where participants’ systems had the possibility to be evaluated on five different tasks, namely Part of Speech Tagging (organised by the University of Bologna), Parsing (organised by the University of Torino), Word Sense Disambiguation (organised by CNR-ILC, Pisa), Temporal Expression Recognition and Normalization (organised by CELCT, Trento), and Named Entity Recognition (organised by FBK, Trento). We believe that the diffusion of shared tasks and shared evaluation practices is a crucial step towards the development of resources and tools for Natural Language Processing. Experiences of this kind, in fact, are a valuable contribution to the validation of existing models and data, allowing for consistent comparisons among approaches and among representation schemes. The good response obtained by EVALITA, both in the number of participants and in the quality of results, showed that pursuing such goals is feasible not only for English, but also for other languages.

A Bottom-up Comparative Study of EuroWordNet and WordNet 3.0 Lexical and Semantic Relations
Maria Teresa Pazienza | Armando Stellato | Alexandra Tudorache

The paper presents a comparative study of semantic and lexical relations defined and adopted in WordNet and EuroWordNet. This document describes the experimental observations achieved through the analysis of data from different WordNet versions and EuroWordNet distributions for different languages, during the development of JMWNL (Java Multilingual WordNet Library), an extensible multilingual library for accessing WordNet-like resources in different languages and formats. The goal of this work was to realize an operative mapping between the relations defined in the two lexical resources and to unify library access and content navigation methods for both WordNet and EuroWordNet. The analysis focused on similarities, differences, semantic overlaps or inclusions, factual misinterpretations and inconsistencies between the intended and practical use of each single relation defined in these two linguistic resources. The paper details with examples the produced mapping, discussing required operations which implied merging, extending or simply keeping separate the examined relations

Evaluating the Ontology underlying sMail - the Conceptual Framework for Semantic Email Communication
Simon Scerri | Myriam Mencke | Brian Davis | Siegfried Handschuh

The lack of structure in the content of email messages makes it very hard for data channelled between the sender and the recipient to be correctly interpreted and acted upon. As a result, the purposes of messages frequently end up not being fulfilled, prompting prolonged communication and stalling the disconnected workflow that is characteristic of email. This problem could be partially solved by extending the current email model to support light-weight semantics pertaining to the intents of the sender and the expectations from the recipient(s), thus leaving no room for ambiguity. Semantically-aware email clients will then be able to support the user with the workflow of email-generated tasks. In line with this thinking, we present the sMail Conceptual Framework. At its core, this framework has an Email Speech Act Model. Given this model, email content can be categorized into a set of speech acts, each carrying specific expectations. In this paper we present and discuss the methodology and results of this model?s statistical evaluation. By performing the same evaluation on another existing model, we demonstrate our model?s higher sophistication. After careful observations, we perform changes to the model and subsequently accommodate the changes in the revised sMail Conceptual Framework.

Inter-sentential Coreferences in Semantic Networks: An Evaluation of Manual Annotation
Václav Novák | Keith Hall

We present an evaluation of inter-sentential coreference annotation in the context of manually created semantic networks. The semantic networks are constructed independently be each annotator and require an entity mapping priori to evaluating the coreference. We introduce a model used for mapping the semantic entities as well as an algorithm used for our evaluation task. Finally, we report the raw statistics for inter-annotator agreement and describe the inherent difficulty in evaluating coreference in semantic networks.

Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation
Mohamed Maamouri | Seth Kulick | Ann Bies

The Arabic Treebank (ATB), released by the Linguistic Data Consortium, contains multiple annotation files for each source file, due in part to the role of diacritic inclusion in the annotation process. The data is made available in both “vocalized” and “unvocalized” forms, with and without the diacritic marks, respectively. Much parsing work with the ATB has used the unvocalized form, on the basis that it more closely represents the “real-world” situation. We point out some problems with this usage of the unvocalized data and explain why the unvocalized form does not in fact represent “real-world” data. This is due to some aspects of the treebank annotation that to our knowledge have never before been published.

Evaluation of Virtual Keyboards for West-African Languages
Chantal Enguehard | Harouna Naroua

West African languages are written with alphabets that comprize non classical Latin characters. It is possible to design virtual keyboards which allow the writing of such special characters with a combination of keys. During the last decade, many different virtual keyboards had been created, without any standardization to fix the correspondence between each character and the keys to press to obtain it. We define a grid to evaluate such keyboards and apply it to five virtual keyboards in relation with the five main languages of Niger (Fulfulde, Hausa, Kanuri, Songhai-Zarma, Tamashek), Bambara and Soninke from Mali and Dyoula from Burkina Faso. We conclude hat the African LLACAN keyboard should be recommended in Niger because it covers all the characters used in the alphabets of the main languages of this country, it produces valid Unicode codes and it minimizes the number of keys to be pressed.

Anaphora Resolution Exercise: an Overview
Constantin Orăsan | Dan Cristea | Ruslan Mitkov | António Branco

Evaluation campaigns have become an established way to evaluate automatic systems which tackle the same task. This paper presents the first edition of the Anaphora Resolution Exercise (ARE) and the lessons learnt from it. This first edition focused only on English pronominal anaphora and NP coreference, and was organised as an exploratory exercise where various issues were investigated. ARE proposed four different tasks: pronominal anaphora resolution and NP coreference resolution on a predefined set of entities, pronominal anaphora resolution and NP coreference resolution on raw texts. For each of these tasks different inputs and evaluation metrics were prepared. This paper presents the four tasks, their input data and evaluation metrics used. Even though a large number of researchers in the field expressed their interest to participate, only three institutions took part in the formal evaluation. The paper briefly presents their results, but does not try to interpret them because in this edition of ARE our aim was not about finding why certain methods are better, but to prepare the ground for a fully-fledged edition.

Portuguese-English Word Alignment: some Experiments
Diana Santos | Alberto Simões

In this paper we describe some studies of Portuguese-English word alignment, focusing on (i) measuring the importance of the coupling between dictionaries and corpus; (ii) assessing the relevance of using syntactic information (POS and lemma) or just word forms, and (iii) taking into account the direction of translation. We first provide some motivation for the studies, as well as insist in separating type from token anlignment. We then briefly describe the resources employed: the EuroParl and COMPARA corpora, and the alignment tools, NATools, introducing some measures to evaluate the two kinds of dictionaries obtained. We then present the results of several experiments, comparing sizes, overlap, translation fertility and alignment density of the several bilingual resources built. We also describe preliminary data as far as quality of the resulting dictionaries or alignment results is concerned.

System Evaluation on a Named Entity Corpus from Clinical Notes
Karin Schuler | Vinod Kaggal | James Masanz | Philip Ogren | Guergana Savova

This paper presents the evaluation of the dictionary look-up component of Mayo Clinic’s Information Extraction system. The component was tested on a corpus of 160 free-text clinical notes which were manually annotated with the named entity disease. This kind of clinical text presents many language challenges such as fragmented sentences and heavy use of abbreviations and acronyms. The dictionary used for this evaluation was a subset of SNOMED-CT with semantic types corresponding to diseases/disorders without any augmentation. The algorithm achieves an F-score of 0.56 for exact matches and F-scores of 0.76 and 0.62 for right and left-partial matches respectively. Machine learning techniques are currently under investigation to improve this task.

Constructing Evaluation Corpora for Automated Clinical Named Entity Recognition
Philip Ogren | Guergana Savova | Christopher Chute

We report on the construction of a gold-standard dataset consisting of annotated clinical notes suitable for evaluating our biomedical named entity recognition system. The dataset is the result of consensus between four human annotators and contains 1,556 annotations on 160 clinical notes using 658 unique concept codes from SNOMED-CT corresponding to human disorders. Inter-annotator agreement was calculated on annotations from 100 of the documents for span (90.9%), concept code (81.7%), context (84.8%), and status (86.0%) agreement. Complete agreement for span, concept code, context, and status was 74.6%. We found that creating a consensus set based on annotations from two independently-created annotation sets can reduce inter-annotator disagreement by 32.3%. We found little benefit to pre-annotating the corpus with a third-party named entity recognizer.

Assessing the Costs of Machine-Assisted Corpus Annotation through a User Study
Eric Ringger | Marc Carmen | Robbie Haertel | Kevin Seppi | Deryle Lonsdale | Peter McClanahan | James Carroll | Noel Ellison

Fixed, limited budgets often constrain the amount of expert annotation that can go into the construction of annotated corpora. Estimating the cost of annotation is the first step toward using annotation resources wisely. We present here a study of the cost of annotation. This study includes the participation of annotators at various skill levels and with varying backgrounds. Conducted over the web, the study consists of tests that simulate machine-assisted pre-annotation, requiring correction by the annotator rather than annotation from scratch. The study also includes tests representative of an annotation scenario involving Active Learning as it progresses from a naïve model to a knowledgeable model; in particular, annotators encounter pre-annotation of varying degrees of accuracy. The annotation interface lists tags considered likely by the annotation model in preference to other tags. We present the experimental parameters of the study and report both descriptive and inferential statistics on the results of the study. We conclude with a model for estimating the hourly cost of annotation for annotators of various skill levels. We also present models for two granularities of annotation: sentence at a time and word at a time.

Training and Evaluation of POS Taggers on the French MULTITAG Corpus
Alexandre Allauzen | Hélène Bonneau-Maynard

The explicit introduction of morphosyntactic information into statistical machine translation approaches is receiving an important focus of attention. The current freely available Part of Speech (POS) taggers for the French language are based on a limited tagset which does not account for some flectional particularities. Moreover, there is a lack of a unified framework of training and evaluation for these kinds of linguistic resources. Therefore in this paper, three standard POS taggers (Treetagger, Brill’s tagger and the standard HMM POS tagger) are trained and evaluated in the same conditions on the French MULTITAG corpus. This POS-tagged corpus provides a tagset richer than the usual ones, including gender and number distinctions, for example. Experimental results show significant differences of performance between the taggers. According to the tagging accuracy estimated with a tagset of 300 items, taggers may be ranked as follows: Treetagger (95.7%), Brill’s tagger (94.6%), HMM tagger (93.4%). Examples of translation outputs illustrate how considering gender and number distinctions in the POS tagset can be relevant.

Cleaneval: a Competition for Cleaning Web Pages
Marco Baroni | Francis Chantree | Adam Kilgarriff | Serge Sharoff

Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development. The first exercise took place in 2007. We describe how it was set up, results, and lessons learnt

A Ground Truth Dataset for Matching Culturally Diverse Romanized Person Names
Mark Arehart | Keith J. Miller

This paper describes the development of a ground truth dataset of culturally diverse Romanized names in which approximately 70,000 names are matched against a subset of 700. We ran the subset as queries against the complete list using several matchers, created adjudication pools, adjudicated the results, and compiled two versions of ground truth based on different sets of adjudication guidelines and methods for resolving adjudicator conflicts. The name list, drawn from publicly available sources, was manually seeded with over 1500 name variants. These names include transliteration variation, database fielding errors, segmentation differences, incomplete names, titles, initials, abbreviations, nicknames, typos, OCR errors, and truncated data. These diverse types of matches, along with the coincidental name similarities already in the list, make possible a comprehensive evaluation of name matching systems. We have used the dataset to evaluate several open source and commercial algorithms and provide some of those results.

Producing a Test Collection for Patent Machine Translation in the Seventh NTCIR Workshop
Atsushi Fujii | Masao Utiyama | Mikio Yamamoto | Takehito Utsuro

In aiming at research and development on machine translation, we produced a test collection for Japanese-English machine translation in the seventh NTCIR Workshop. This paper describes details of our test collection. From patent documents published in Japan and the United States, we extracted patent families as a parallel corpus. A patent family is a set of patent documents for the same or related invention and these documents are usually filed to more than one country in different languages. In the parallel corpus, we aligned Japanese sentences with their counterpart English sentences. Our test collection, which includes approximately 2,000,000 sentence pairs, can be used to train and test machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval and the contribution of machine translation to a patent retrieval task can also be evaluated. Our test collection will be available to the public for research purposes after the NTCIR final meeting.

A Test Suite for Inference Involving Adjectives
Marilisa Amoia | Claire Gardent

Recently, most of the research in NLP has concentrated on the creation of applications coping with textual entailment. However, there still exist very few resources for the evaluation of such applications. We argue that the reason for this resides not only in the novelty of the research field but also and mainly in the difficulty of defining the linguistic phenomena which are responsible for inference. As the TSNLP project has shown test suites provide optimal diagnostic and evaluation tools for NLP applications, as contrary to text corpora they provide a deep insight in the linguistic phenomena allowing control over the data. Thus in this paper, we present a test suite specifically developed for studying inference problems shown by English adjectives. The construction of the test suite is based on the deep linguistic analysis and following classification of entailment patterns of adjectives and follows the TSNLP guidelines on linguistic databases providing a clear coverage, systematic annotation of inference tasks, large reusability and simple maintenance. With the design of this test suite we aim at creating a resource supporting the evaluation of computational systems handling natural language inference and in particular at providing a benchmark against which to evaluate and compare existing semantic analysers.

Evaluation Framework for Distant-talking Speech Recognition under Reverberant Environments: newest Part of the CENSREC Series -
Takanobu Nishiura | Masato Nakayama | Yuki Denda | Norihide Kitaoka | Kazumasa Yamamoto | Takeshi Yamada | Satoru Tsuge | Chiyomi Miyajima | Masakiyo Fujimoto | Tetsuya Takiguchi | Satoshi Tamura | Shingo Kuroiwa | Kazuya Takeda | Satoshi Nakamura

Recently, speech recognition performance has been drastically improved by statistical methods and huge speech databases. Now performance improvement under such realistic environments as noisy conditions is being focused on. Since October 2001, we from the working group of the Information Processing Society in Japan have been working on evaluation methodologies and frameworks for Japanese noisy speech recognition. We have released frameworks including databases and evaluation tools called CENSREC-1 (Corpus and Environment for Noisy Speech RECognition 1; formerly AURORA-2J), CENSREC-2 (in-car connected digits recognition), CENSREC-3 (in-car isolated word recognition), and CENSREC-1-C (voice activity detection under noisy conditions). In this paper, we newly introduce a collection of databases and evaluation tools named CENSREC-4, which is an evaluation framework for distant-talking speech under hands-free conditions. Distant-talking speech recognition is crucial for a hands-free speech interface. Therefore, we measured room impulse responses to investigate reverberant speech recognition. The results of evaluation experiments proved that CENSREC-4 is an effective database suitable for evaluating the new dereverberation method because the traditional dereverberation process had difficulty sufficiently improving the recognition performance. The framework was released in March 2008, and many studies are being conducted with it in Japan.

An Experimental Methodology for an End-to-End Evaluation in Speech-to-Speech Translation
Olivier Hamon | Djamel Mostefa

This paper describes the evaluation methodology used to evaluate the TC-STAR speech-to-speech translation (SST) system and the results from the third year of the project. It follows the results presented in Hamon (2007), dealing with the first end-to-end evaluation of the project. In this paper, we try to experiment with the methodology and the protocol during a second end-to-end evaluation, by comparing outputs from the TC-STAR system with interpreters from the European parliament. For this purpose, we test different criteria of evaluation and type of questions within a comprehension test. The results show that interpreters do not translate all the information (as opposed to the automatic system), but the quality of SST is still far from that of human translation. The experimental comprehension test used provides new information to study the quality of automatic systems, but without settling the issue of which protocol is the best. This depends on what the evaluator wants to know about the SST: either to have a subjective end-user evaluation or a more objective one.

Evaluation of Different Segmentation Techniques for Dialogue Turns
Carlos D. Martínez-Hinarejos | Vicent Tamarit

In dialogue systems, it is necessary to decode the user input into semantically meaningful units. These semantical units, usually Dialogue Acts (DA), are used by the system to produce the most appropriate response. The user turns can be segmented into utterances, which are meaningful segments from the dialogue viewpoint. In this case, a single DA is associated to each utterance. Many previous works have used DA assignation models on segmented dialogue corpora, but only a few have tried to perform the segmentation and assignation at the same time. The knowledge of the segmentation of turns into utterances is not common in dialogue corpora, and knowing the quality of the segmentations provided by the models that simultaneously perform segmentation and assignation would be interesting. In this work, we evaluate the accuracy of the segmentation offered by this type of model. The evaluation is done on a Spanish dialogue system on a railway information task. The results reveal that one of these techniques provides a high quality segmentation for this corpus.

Acquisition and Evaluation of a Dialog Corpus through WOz and Dialog Simulation Techniques
David Griol | Lluís F. Hurtado | Encarna Segarra | Emilio Sanchis

In this paper, we present a comparison between two corpora acquired by means of two different techniques. The first corpus was acquired by means of the Wizard of Oz technique. A dialog simulation technique has been developed for the acquisition of the second corpus. A random selection of the user and system turns has been used, defining stop conditions for automatically deciding if the simulated dialog is successful or not. We use several evaluation measures proposed in previous research to compare between our two acquired corpora, and then discuss the similarities and differences between the two corpora with regard to these measures.

What would you Ask a conversational Agent? Observations of Human-Agent Dialogues in a Museum Setting
Susan Robinson | David Traum | Midhun Ittycheriah | Joe Henderer

Embodied Conversational Agents have typically been constructed for use in limited domain applications, and tested in very specialized environments. Only in recent years have there been more cases of moving agents into wider public applications (e.g.Bell et al., 2003; Kopp et al., 2005). Yet little analysis has been done to determine the differing needs, expectations, and behavior of human users in these environments. With an increasing trend for virtual characters to “go public”, we need to expand our understanding of what this entails for the design and capabilities of our characters. This paper explores these issues through an analysis of a corpus that has been collected since December 2006, from interactions with the virtual character Sgt Blackwell at the Cooper Hewitt Museum in New York. The analysis includes 82 hierarchical categories of user utterances, as well as specific observations on user preferences and behaviors drawn from interactions with Blackwell.

An Evaluation of Spoken and Textual Interaction in the RITEL Interactive Question Answering System
Dave Toney | Sophie Rosset | Aurélien Max | Olivier Galibert | Eric Bilinski

The RITEL project aims to integrate a spoken language dialogue system and an open-domain information retrieval system in order to enable human users to ask a general question and to refine their search for information interactively. This type of system is often referred to as an Interactive Question Answering (IQA) system. In this paper, we present an evaluation of how the performance of the RITEL system differs when users interact with it using spoken versus textual input and output. Our results indicate that while users do not perceive the two versions to perform significantly differently, many more questions are asked in a typical text-based dialogue.

Classification Procedures for Software Evaluation
Muriel Amar | Sophie David | Rachel Panckhurst | Lisa Whistlecroft

We outline a methodological classification for evaluation approaches of software in general. This classification was initiated partly owing to involvement in a biennial European competition (the European Academic Software Award, EASA) which was held for over a decade. The evaluation grid used in EASA gradually became obsolete and inappropriate in recent years, and therefore needed to be revised. In order to do this, it was important to situate the competition in relation to other software evaluation procedures. A methodological perspective for the classification is adopted rather than a conceptual one, since a number of difficulties arise with the latter. We focus on three main questions: What to evaluate? How to evaluate? and Who does evaluate? The classification is therefore hybrid: it allows one to account for the most common evaluation approaches and is also an observatory. Two main approaches are differentiated: system and usage. We conclude that any evaluation always constructs its own object, and the objects to be evaluated only partially determine the evaluation which can be applied to them. Generally speaking, this allows one to begin apprehending what type of knowledge is objectified when one or another approach is chosen.

Cross-Corpus Evaluation of Word Alignment
Sylwia Ozdowska

We present the procedures we implemented to carry out system oriented evaluation of a syntax-based word aligner, ALIBI. While cross-corpus evaluation is still relatively rare in NLP, we take the approach of regarding cross-corpus evaluation as part of system oriented evaluation. Our hypothesis is that the granularity of alignments and the level of syntactic correspondence depend on corpus type; our objective is to assess how this impacts on alignment quality. We test our system on three English-French parallel corpora. The evaluation procedures are defined in accordance with state-of-the-art word alignment evaluation principles. They include, for each corpus, the creation of a reference set containing multiple annotations of the same data, the assessment of inter-annotator agreement rates and an analysis of the reference set obtained. We show that alignment performance varies across corpora according to the multiple reference annotations produced and further motivate our choice of preserving all reference annotations without solving disagreements between annotators.

Evaluating Evaluation Metrics for Ontology-Based Applications: Infinite Reflection
Diana Maynard | Wim Peters | Yaoyong Li

In this paper, we discuss methods of measuring the performance of ontology-based information extraction systems. We focus particularly on the Balanced Distance Metric (BDM), a new metric we have proposed which aims to take into account the more flexible nature of ontologically-based applications. We first examine why traditional Precision and Recall metrics, as used for flat information extraction tasks, are inadequate when dealing with ontologies. We then describe the Balanced Distance Metric (BDM) which takes ontological similarity into account. Finally, we discuss a range of experiments designed to test the accuracy and usefulness of the BDM when compared with traditional metrics and with a standard distance-based metric.

Lexical Substitution as a Framework for Multiword Evaluation
Diana McCarthy

In this paper we analyse data from the SemEval lexical substitution task in those cases where the annotators indicated that the target word was part of a phrase before substituting the target with a synonym. We classify the types of phrases that were provided in this way by the annotators in order to evaluate the utility of the method as a means of producing a gold-standard for multiword evaluation. Multiword evaluation is a difficult area because lexical resources are not complete and people’s judgments on multiwords vary. Whilst we do not believe lexical substitution is necessarily a panacea for multiword evaluation, we do believe it is a useful methodology because the annotator is focused on the task of substitution. Following the analysis, we make some recommendations which would make the data easier to classify.

Tree Distance and Some Other Variants of Evalb
Martin Emms

Some alternatives to the standard evalb measures for parser evaluation are considered, principally the use of a tree-distance measure, which assigns a score to a linearity and ancestry respecting mapping between trees, in contrast to the evalb measures, which assign a score to a span preserving mapping. Additionally, analysis of the evalb measures suggests some further variants, concerning different normalisations, the portions of a tree compared and whether scores should be micro or macro averaged. The outputs of 6 parsing systems on Section 23 of the Penn Treebank were taken. It is shown that the ranking of the parsing systems varies as the alternative evaluation measures are used. For a fixed parsing system, it is also shown that the ranking of the parses from best to worst will vary according to whether the evalb or tree-distance measure is used. It is argued that the tree-distance measure ameliorates a problem that has been noted concerning over-penalisation of attachment errors.

BLEU+: a Tool for Fine-Grained BLEU Computation
A. Cüneyd Tantuǧ | Kemal Oflazer | Ilknur Durgar El-Kahlout

We present a tool, BLEU+, which implements various extension to BLEU computation to allow for a better understanding of the translation performance, especially for morphologically complex languages. BLEU+ takes into account both “closeness” in morphological structure, “closeness” of the root words in the WordNet hierarchy while comparing tokens in the candidate and reference sentence. In addition to gauging performance at a finer level of granularity, BLEU+ also allows the computation of various upper bound oracle scores: comparing all tokens considering only the roots allows us to get an upper bound when all errors due to morphological structure are fixed, while comparing tokens in an error-tolerant way considering minor morpheme edit operations, allows us to get a (more realistic) upper bound when tokens that differ in morpheme insertions/deletions and substitutions are fixed. We use BLEU+ in the fine-grained evaluation of the output of our English-to-Turkish statistical MT system.

Elicited Imitation as an Oral Proficiency Measure with ASR Scoring
C. Ray Graham | Deryle Lonsdale | Casey Kennington | Aaron Johnson | Jeremiah McGhee

This paper discusses development and evaluation of a practical, valid and reliable instrument for evaluating the spoken language abilities of second-language (L2) learners of English. First we sketch the theory and history behind elicited imitation (EI) tests and the renewed interest in them. Then we present how we developed a new test based on various language resources, and administered it to a few hundred students of varying levels. The students were also scored using standard evaluation techniques, and the EI results were compared to more traditionally derived scores. We also sketch how we developed a new integrated tool that allows the session recordings of the EI data to be analyzed with a widely-used automatic speech recognition (ASR) engine. We discuss the promising results of the ASR engine’s processing of these files and how they correlated with human scoring of the same items. We indicate how the integrated tool will be used in the future. Further development plans and prospects for follow-on work round out the discussion.

Methodology for Evaluating the Usability of User Interfaces in Mobile Services
Pedro Concejero | Daniel Tapias | Juan José Rodríguez | Juan Carlos Luengo | Sebastián Sánchez

In this paper we present a usability measure adapted to mobile services, which is based on the well-known theoretical framework defined in the ISO 9241-11 standard. This measure is then applied to a representative set of services of the Telefónica’s portfolio for residential customers. The user tests that we present were carried out by a total of 327 people. Additionally, we describe the detailed application of the methodology to a particular service and present the results of all the experiments that were carried out with the different services. These results show highly significant differences in the three usability measures considered (effectiveness, efficiency and satisfaction), though all of them have the same trend. The worst performers in all cases were the WAP and i-mode user interfaces (UI), while the best performers were the SMS and web based UIs closely followed by the voice UI. Finally, we also analyse the results and present our conclusions.

An Economic View on Human Language Technology Evaluation
Edouard Geoffrois

This paper analyses some general issues about human language technology evaluation, focusing on economic aspects. It first provides a scientific rationale for the need to organize evaluation in the form of campaigns, by relating this need to some basic characteristics of human language technologies, namely that they involve learning to process information in a way which reproduces human capabilities. It then reviews the benefits and constraints of these evaluation campaigns. Borrowing concepts from the field of economics, it also provides an analysis of the economic incentives to organize evaluation campaigns. It entails from this analysis that fitting evaluation campaigns to the needs of scientific research requires a strong implication in term of research policy and public funding.

Comparing Corpus-based to Web-based Lookup Techniques for Automatic English Inclusion Detection
Beatrice Alex

The influence of English as a global language continues to grow to an extent that its words and expressions permeate the original forms of other languages. This paper evaluates a modular Web-based sub-component of an existing English inclusion classifier and compares it to a corpus-based lookup technique. Both approaches are evaluated on a German gold standard data set. It is demonstrated to what extent the Web-based approach benefits from the amount of data available online and the fact that this data is constantly updated.

Centering Theory for Evaluation of Coherence in Computer-Aided Summaries
Laura Hasler

This paper investigates a new evaluation method for assessing the coherence of computer-aided summaries, justified by the inappropriacy of existing evaluation methods for this task. It develops a metric for Centering Theory (CT), a theory of local coherence and salience, to measure coherence in pairs of extracts and abstracts produced in a computer-aided summarisation environment. 100 news text summaries (50 pairs of extracts and their corresponding abstracts) are analysed using CT and the metric is applied to obtain a score for each summary; the summary with the higher score out of a pair is considered more coherent. Human judgement is also obtained to allow a comparison with the CT evaluation to assess the validity of the development of CT as a useful evaluation metric in computer-aided summarisation.

Linguistic Resources and Evaluation Techniques for Evaluation of Cross-Document Automatic Content Extraction
Stephanie Strassel | Mark Przybocki | Kay Peterson | Zhiyi Song | Kazuaki Maeda

The NIST Automatic Content Extraction (ACE) Evaluation expands its focus in 2008 to encompass the challenge of cross-document and cross-language global integration and reconciliation of information. While past ACE evaluations have been limited to local (within-document) detection and disambiguation of entities, relations and events, the current evaluation adds global (cross-document and cross-language) entity disambiguation tasks for Arabic and English. This paper presents the 2008 ACE XDoc evaluation task and associated infrastructure. We describe the linguistic resources created by LDC to support the evaluation, focusing on new approaches required for data selection, data processing, annotation task definitions and annotation software, and we conclude with a discussion of the metrics developed by NIST to support the evaluation.

Let’s not Argue about Semantics
Johan Bos

What’s the best way to assess the performance of a semantic component in an NLP system? Tradition in NLP evaluation tells us that comparing output against a gold standard is a good idea. To define a gold standard, one first needs to decide on the representation language, and in many cases a first-order language seems a good compromise between expressive power and efficiency. Secondly, one needs to decide how to represent the various semantic phenomena, in particular the depth of analysis of quantification, plurals, eventualities, thematic roles, scope, anaphora, presupposition, ellipsis, comparatives, superlatives, tense, aspect, and time-expressions. Hence it will be hard to come up with an annotation scheme unless one permits different level of semantic granularity. The alternative is a theory-neutral black-box type evaluation where we just look at how systems react on various inputs. For this approach, we can consider the well-known task of recognising textual entailment, or the lesser-known task of textual model checking. The disadvantage of black-box methods is that it is difficult to come up with natural data that cover specific semantic phenomena.

Can we Evaluate the Quality of Generated Text?
David Hardcastle | Donia Scott

Evaluating the output of NLG systems is notoriously difficult, and performing assessments of text quality even more so. A range of automated and subject-based approaches to the evaluation of text quality have been taken, including comparison with a putative gold standard text, analysis of specific linguistic features of the output, expert review and task-based evaluation. In this paper we present the results of a variety of such approaches in the context of a case study application. We discuss the problems encountered in the implementation of each approach in the context of the literature, and propose that a test based on the Turing test for machine intelligence offers a way forward in the evaluation of the subjective notion of text quality.

An Infrastructure, Tools and Methodology for Evaluation of Multicultural Name Matching Systems
Keith J. Miller | Mark Arehart | Catherine Ball | John Polk | Alan Rubenstein | Kenneth Samuel | Elizabeth Schroeder | Eva Vecchi | Chris Wolf

This paper describes a Name Matching Evaluation Laboratory that is a joint effort across multiple projects. The lab houses our evaluation infrastructure as well as multiple name matching engines and customized analytical tools. Included is an explanation of the methodology used by the lab to carry out evaluations. This methodology is based on standard information retrieval evaluation, which requires a carefully-constructed test data set. The paper describes how we created that test data set, including the “ground truth” used to score the systems’ performance. Descriptions and snapshots of the lab’s various tools are provided, as well as information on how the different tools are used throughout the evaluation process. By using this evaluation process, the lab has been able to identify strengths and weaknesses of different name matching engines. These findings have led the lab to an ongoing investigation into various techniques for combining results from multiple name matching engines to achieve optimal results, as well as into research on the more general problem of identity management and resolution.

Evaluating Robustness Of A QA System Through A Corpus Of Real-Life Questions
Laurianne Sitbon | Patrice Bellot | Philippe Blache

This paper presents the sequential evaluation of the question answering system SQuaLIA. This system is based on the same sequential process as most statistical question answering systems, involving 4 main steps from question analysis to answer extraction.The evaluation is based on a corpus made from 20 questions taken in the set of an evaluation campaign and which were well answered by SQuaLIA. Each of the 20 questions has been typed by 17 native participants, non natives and dyslexics. They were vocally instructed the target of each question. Each of the 4 analysis steps of the system involves a loss of accuracy, until an average of 60 of right answers at the end of the process. The main cause of this loss seems to be the orthographic mistakes users make on nouns.

Sentiment Analysis and the Use of Extrinsic Datasets in Evaluation
Ann Devitt | Khurshid Ahmad

The field of automated sentiment analysis has emerged in recent years as an exciting challenge to the computational linguistics community. Research in the field investigates how emotion, bias, mood or affect is expressed in language and how this can be recognised and represented automatically. To date, the most successful applications have been in the classification of product reviews and editorials. This paper aims to open a discussion about alternative evaluation methodologies for sentiment analysis systems that broadens the scope of this new field to encompass existing work in other domains such as psychology and to exploit existing resources in diverse domains such as finance or medicine. We outline some interesting avenues for research which investigate the impact of affective text content on the human psyche and on external factors such as stock markets.

Certification and Cleaning up of a Text Corpus: Towards an Evaluation of the “Grammatical” Quality of a Corpus
Cyril Grouin

We present in this article the methods we used for obtaining measures to ensure the quality and well-formedness of a text corpus. These measures allow us to determine the compatibility of a corpus with the treatments we want to apply on it. We called this method “certification of corpus”. These measures are based upon the characteristics required by the linguistic treatments we have to apply on the corpus we want to certify. Since the certification of corpus allows us to highlight the errors present in a text, we developed modules to carry out an automatic correction. By applying these modules, we reduced the number of errors. In consequence, it increases the quality of the corpus making it possible to use a corpus that a first certification would not have admitted.

WEB-Based Listening Test System for Speech Synthesis and Speech Conversion Evaluation
Laurent Blin | Olivier Boeffard | Vincent Barreaud

In this article, we propose a web based listening test system that can be used with a large range of listeners. Our main goals were to make the configuration of the tests as simple and flexible as possible, to simplify the recruiting of the testees and, of course, to keep track of the results using a relational database. This first version of our system can perform the most widely used listening tests in the speech processing community (AB-BA, ABX and MOS tests). It can also easily evolve and propose other tests implemented by the tester by means of a module interface. This scenario is explored in this article which proposes an implementation of a module for Comparison Mean Opinion Score (CMOS) tests and conduct of such an experiment. This test allowed us to extract from the BREF120 corpus a couple of voices of distinct supra-segmental characteristics. This system is offered to the speech synthesis and speech conversion community under free license.

Semiotic-based Ontology Evaluation Tool (S-OntoEval)
Renata Dividino | Massimo Romanelli | Daniel Sonntag

The objective of the Semiotic-based Ontology Evaluation Tool (S-OntoEval) is to evaluate and propose improvements to a given ontological model. The evaluation aims at assessing the quality of the ontology by drawing upon semiotic theory, taking several metrics into consideration for assessing the syntactic, semantic, and pragmatic aspects of ontology quality. We consider an ontology to be a semiotic object and we identify three main types of semiotic ontology evaluation levels: the structural level, assessing the ontology syntax and formal semantics; the functional level, assessing the ontology cognitive semantics and; the usability-related level, assessing the ontology pragmatics. The Ontology Evaluation Tool implements metrics for each semiotic ontology level: on the structural level by making use of reasoner such as the RACER System and Pellet to check the logical consistency of our ontological model (TBoxes and ABoxes) and graph-theory measures such as Depth; on the functional level by making use of a task-based evaluation approach which measures the quality of the ontology based on the adequacy of the ontological model for a specific task; and on the usability-profiling level by applying a quantitative analysis of the amount of annotation. Other metrics can be easily integrated and added to the respective evaluation level. In this work, the Ontology Evaluation Tool is used to test and evaluate the SWIntO Ontology of the SmartWeb project.

ANNALIST - ANNotation ALIgnment and Scoring Tool
George Demetriou | Robert Gaizauskas | Haotian Sun | Angus Roberts

In this paper we describe ANNALIST (Annotation, Alignment and Scoring Tool), a scoring system for the evaluation of the output of semantic annotation systems. ANNALIST has been designed as a system that is easily extensible and configurable for different domains, data formats, and evaluation tasks. The system architecture enables data input via the use of plugins and the users can access the system’s internal alignment and scoring mechanisms without the need to convert their data to a specified format. Although developed for evaluation tasks that involve the scoring of entity mentions and relations primarily, ANNALIST’s generic object representation and the availability of a range of criteria for the comparison of annotations enable the system to be tailored to a variety of scoring jobs. The paper reports on results from using ANNALIST in real-world situations in comparison to other scorers which are more established in the literature. ANNALIST has been used extensively for evaluation tasks within the VIKEF (EU FP6) and CLEF (UK MRC) projects.

Task-Based Evaluation of Meeting Browsers: from Task Elicitation to User Behavior Analysis
Andrei Popescu-Belis | Mike Flynn | Pierre Wellner | Philippe Baudrion

This paper presents recent results of the application of the task-based Browser Evaluation Test (BET) to meeting browsers, that is, interfaces to multimodal databases of meeting recordings. The tasks were defined by browser-neutral BET observers. Two groups of human subjects used the Transcript-based Query and Browsing interface (TQB), and attempted to solve as many BET tasks - pairs of true/false statements to disambiguate - as possible in a fixed amount of time. Their performance was measured in terms of precision and speed. Results indicate that the browser’s annotation-based search functionality is frequently used, in particular the keyword search. A more detailed analysis of each test question for each participant confirms that despite considerable variation across strategies, the use of queries is correlated to successful performance.

Improving Contextual Quality Models for MT Evaluation Based on Evaluators’ Feedback
Paula Estrella | Andrei Popescu-Belis | Maghi King

The Framework for the Evaluation for Machine Translation (FEMTI) contains guidelines for building a quality model that is used to evaluate MT systems in relation to the purpose and intended context of use of the systems. Contextual quality models can thus be constructed, but entering into FEMTI the knowledge required for this operation is a complex task. An experiment has been set up in order to transfer knowledge from MT evaluation experts into the FEMTI guidelines, by polling experts about the evaluation methods they would use in a particular context, then inferring from the results generic relations between characteristics of the context of use and quality characteristics. The results of this hands-on exercise, carried out as part of a conference tutorial, have served to refine FEMTI’s “generic contextual quality model” and to obtain feedback on the FEMTI guidelines in general.

Performance Evaluation of Speech Translation Systems
Brian Weiss | Craig Schlenoff | Greg Sanders | Michelle Steves | Sherri Condon | Jon Phillips | Dan Parvaz

One of the most challenging tasks for uniformed service personnel serving in foreign countries is effective verbal communication with the local population. To remedy this problem, several companies and academic institutions have been funded to develop machine translation systems as part of the DARPA TRANSTAC (Spoken Language Communication and Translation System for Tactical Use) program. The goal of this program is to demonstrate capabilities to rapidly develop and field free-form, two-way translation systems that would enable speakers of different languages to communicate with one another in real-world tactical situations. DARPA has mandated that each TRANSTAC technology be evaluated numerous times throughout the life of the program and has tasked the National Institute of Standards and Technology (NIST) to lead this effort. This paper describes the experimental design methodology and test procedures from the most recent evaluation, conducted in July 2007, which focused on English to/from Iraqi Arabic.

Automatic Evaluation Measures for Statistical Machine Translation System Optimization
Arne Mauser | Saša Hasan | Hermann Ney

Evaluation of machine translation (MT) output is a challenging task. In most cases, there is no single correct translation. In the extreme case, two translations of the same input can have completely different words and sentence structure while still both being perfectly valid. Large projects and competitions for MT research raised the need for reliable and efficient evaluation of MT systems. For the funding side, the obvious motivation is to measure performance and progress of research. This often results in a specific measure or metric taken as primarily evaluation criterion. Do improvements in one measure really lead to improved MT performance? How does a gain in one evaluation metric affect other measures? This paper is going to answer these questions by a number of experiments.

RACAI’s Linguistic Web Services
Dan Tufiş | Radu Ion | Alexandru Ceauşu | Dan Ştefănescu

Nowadays, there are hundreds of Natural Language Processing applications and resources for different languages that are developed and/or used, almost exclusively with a few but notable exceptions, by their creators. Assuming that the right to use a particular application or resource is licensed by the rightful owner, the user is faced with the often not so easy task of interfacing it with his/her own systems. Even if standards are defined that provide a unified way of encoding resources, few are the cases when the resources are actually coded in conformance to the standard (and, at present time, there is no such thing as general NLP application interoperability). Semantic Web came with the promise that the web will be a universal medium for information exchange whatever its content. In this context, the present article outlines a collection of linguistic web services for Romanian and English, developed at the Research Institute for AI for the Romanian Academy (RACAI) which are ready to provide a standardized way of calling particular NLP operations and extract the results without caring about what exactly is going on in the background.

Words in Contexts: Digital Editions of Literary Journals in the “AAC - Austrian Academy Corpus”
Hanno Biber | Evelyn Breiteneder | Karlheinz Mörth

In this paper two highly innovative digital editions will be presented. For the creation and the implementation of these editions the latest developments within corpus research have been taken into account. The digital editions of the historical literary journals Die Fackel (published by Karl Kraus in Vienna from 1899 to 1936) and Der Brenner (published by Ludwig Ficker in Innsbruck from 1910 to 1954) have been developed within the corpus research framework of the “AAC - Austrian Academy Corpus” at the Austrian Academy of Sciences in collaboration with other researchers and programmers in the AAC from Vienna together with the graphic designer Anne Burdick from Los Angeles. For the creation of these scholarly digital editions the AAC edition philosophy and edition principles have been applied whereby new corpus research methods have been made use of for questions of computational philology and textual studies in a digital environment. The examples of the digital online editions of the literary journals Die Fackel and Der Brenner will give insights into the potentials and the benefits of making corpus research methods and techniques available for scholarly research into language and literature.

ASV Toolbox: a Modular Collection of Language Exploration Tools
Chris Biemann | Uwe Quasthoff | Gerhard Heyer | Florian Holz

ASV Toolbox is a modular collection of tools for the exploration of written language data both for scientific and educational purposes. It includes modules that operate on word lists or texts and allow to perform various linguistic annotation, classification and clustering tasks, including language detection, POS-tagging, base form reduction, named entity recognition, and terminology extraction. On a more abstract level, the algorithms deal with various kinds of word similarity, using pattern-based and statistical approaches. The collection can be used to work on large real-world data sets as well as for studying the underlying algorithms. Each module of the ASV Toolbox is designed to work either on a plain text files or with a connection to a MySQL database. While it is especially designed to work with corpora of the Leipzig Corpora Collection, it can easily be adapted to other sources.

LX-Service: Web Services of Language Technology for Portuguese
António Branco | Francisco Costa | Pedro Martins | Filipe Nunes | João Silva | Sara Silveira

In the present paper we report on the development of a cluster of web services of language technology for Portuguese that we named as LXService. These web services permit the direct interaction of client applications with language processing tools via the Internet. This way of making available language technology was motivated by the need of its integration in an eLearning environment. In particular, it was motivated by the development of new multilingual functionalities that were aimed at extending a Learning Management System and that needed to resort to the outcome of some of those tools in a distributed and remote fashion. This specific usage situation happens however to be representative of a typical and recurrent set up in the utilization of language processing tools in different settings and projects. Therefore, the approach reported here offers not only a solution for this specific problem, which immediately motivated it, but contributes also some first steps for what we see as an important paradigm shift in terms of the way language technology can be distributed and find a better way to unleash its full potential and impact.

The TextPro Tool Suite
Emanuele Pianta | Christian Girardi | Roberto Zanoli

We present TextPro, a suite of modular Natural Language Processing (NLP) tools for analysis of Italian and English texts. The suite has been designed so as to integrate and reuse state of the art NLP components developed by researchers at FBK. The current version of the tool suite provides functions ranging from tokenization to chunking and Named Entity Recognition (NER). The system’s architecture is organized as a pipeline of processors wherein each stage accepts data from an initial input or from an output of a previous stage, executes a specific task, and sends the resulting data to the next stage, or to the output of the pipeline. TextPro performed the best on the task of Italian NER and Italian PoS Tagging at EVALITA 2007. When tested on a number of other standard English benchmarks, TextPro confirms that it performs as state of the art system. Distributions for Linux, Solaris and Windows are available, for both research and commercial purposes. A web-service version of the system is under development.

An AI-inspired intelligent agent/student architecture to combine Language Resources research and teaching
Bayan Abu Shawar | Eric Atwell

This paper describes experimental use of the multi-agent architecture to integrate Natural Language and Information Systems research and teaching, by casting a group of students as intelligent agents to collect and analyse English language resources from around the world. Section 2 and section 3 describe the hybrid intelligent information systems experiments at the University of Leeds and the results generated, including several research papers accepted at international conferences, and a finalist entry in the British Computer Society Machine Intelligence contest. Our proposals for applying the multi-agent idea in other universities such as the Arab Open University are presented in section 4. The conclusion is presented in section 5: the success of hybrid intelligent information systems experiments in generating research papers within a limited time.

Language Resources and Tools for Swedish: A Survey
Kjell Elenius | Eva Forsbom | Beáta Megyesi

Language resources and tools to create and process these resources are necessary components in human language technology and natural language applications. In this paper, we describe a survey of existing language resources for Swedish, and the need for Swedish language resources to be used in research and real-world applications in language technology as well as in linguistic research. The survey is based on a questionnaire sent to industry and academia, institutions and organizations, and to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide.

Glossa: a Multilingual, Multimodal, Configurable User Interface
Lars Nygaard | Joel Priestley | Anders Nøklestad | Janne Bondi Johannessen

We describe a web-based corpus query system, Glossa, which combines the expressiveness of regular query languages with the user-friendliness of a graphical interface. Since corpus users are usually linguists with little interest in technical matters, we have developed a system where the user need not have any prior knowledge of the search system. Furthermore, no previous knowledge of abbreviations for metavariables such as part of speech and source text is needed. All searches are done using checkboxes, pull-down menus, or writing simple letters to make words or other strings. Querying for more than one word is simply done by adding an additional query box, and for parts of words by choosing a feature such as “start of word”. The Glossa system also allows a wide range of viewing and post-processing options. Collocations can be viewed and counted in a number of ways, and be viewed as different kinds of graphical charts. Further annotation and deletion of single results for further processing is also easy. The Glossa system is already in use for a number of corpora. Corpus administrators can easily adapt the system to a wide range of corpora, including multilingual corpora and corpora with audio and video content.

Ontology-Based Interface Specifications for a NLP Pipeline Architecture
Ekaterina Buyko | Christian Chiarcos | Antonio Pareja Lora

The high level of heterogeneity between linguistic annotations usually complicates the interoperability of processing modules within an NLP pipeline. In this paper, a framework for the interoperation of NLP components, based on a data-driven architecture, is presented. Here, ontologies of linguistic annotation are employed to provide a conceptual basis for the tagset-neutral processing of linguistic annotations. The framework proposed here is based on a set of structured OWL ontologies: a reference ontology, a set of annotation models which formalize different annotation schemes, and a declarative linking between these, specified separately. This modular architecture is particularly scalable and flexible as it allows for the integration of different reference ontologies of linguistic annotations in order to overcome the absence of a consensus for an ontology of linguistic terminology. Our proposal originates from three lines of research from different fields: research on annotation type systems in UIMA; the ontological architecture OLiA, originally developed for sustainable documentation and annotation-independent corpus browsing, and the ontologies of the OntoTag model, targeted towards the processing of linguistic annotations in Semantic Web applications. We describe how UIMA annotations can be backed up by ontological specifications of annotation schemes as in the OLiA model, and how these are linked to the OntoTag ontologies, which allow for further ontological processing.

Foundation of a Component-based Flexible Registry for Language Resources and Technology
Daan Broeder | Thierry Declerck | Erhard Hinrichs | Stelios Piperidis | Laurent Romary | Nicoletta Calzolari | Peter Wittenburg

Within the CLARIN e-science infrastructure project it is foreseen to develop a component-based registry for metadata for Language Resources and Language Technology. With this registry it is hoped to overcome the problems of the current available systems with respect to inflexible fixed schema, unsuitable terminology and interoperability problems. The registry will address interoperability needs by refering to a shared vocabulary registered in data category registries as they are suggested by ISO.

Building a Federation of Language Resource Repositories: the DAM-LR Project and its Continuation within CLARIN.
Daan Broeder | David Nathan | Sven Strömqvist | Remco van Veenendaal

The DAM-LR project aims at virtually integrating various European language resource archives that allow users to navigate and operate in a single unified domain of language resources. This type of integration introduces Grid technology to the humanities disciplines and forms a federation of archives. The complete architecture is designed based on a few well-known components .This is considered the basis for building a research infrastructure for Language Resources as is planned within the CLARIN project. The DAM-LR project was purposefully started with only a small number of participants for flexibility and to avoid complex contract negotiations with respect to legal issues. Now that we have gained insights into the basic technology issues and organizational issues, it is foreseen that the federation will be expanded considerably within the CLARIN project that will also address the associated legal issues.

A Grid of Regional Language Archives
Paul Trilsbeek | Daan Broeder | Tobias Valkenhoef | Peter Wittenburg

About two years ago, the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, started an initiative to install regional language archives in various places around the world, particularly in places where a large number of endangered languages exist and are being documented. These digital archives make use of the LAT archiving framework that the MPI has developed over the past nine years. This framework consists of a number of web-based tools for depositing, organizing and utilizing linguistic resources in a digital archive. The regional archives are in principle autonomous archives, but they can decide to share metadata descriptions and language resources with the MPI archive in Nijmegen and become part of a grid of linked LAT archives. By doing so, they will also take advantage of the long-term preservation strategy of the MPI archive. This paper describes the reasoning behind this initiative and how in practice such an archive is set up.

Adapting International Standard for Asian Language Technologies
Takenobu Tokunaga | Dain Kaplan | Chu-Ren Huang | Shu-Kai Hsieh | Nicoletta Calzolari | Monica Monachini | Claudia Soria | Kiyoaki Shirai | Virach Sornlertlamvanich | Thatsanee Charoenporn | YingJu Xia

Corpus-based approaches and statistical approaches have been the main stream of natural language processing research for the past two decades. Language resources play a key role in such approaches, but there is an insufficient amount of language resources in many Asian languages. In this situation, standardisation of language resources would be of great help in developing resources in new languages. This paper presents the latest development efforts of our project which aims at creating a common standard for Asian language resources that is compatible with an international standard. In particular, the paper focuses on i) lexical specification and data categories relevant for building multilingual lexical resources for Asian languages; ii) a core upper-layer ontology needed for ensuring multilingual interoperability and iii) the evaluation platform used to test the entire architectural framework.

A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure
Keiji Shinzato | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi

In recent years, language resources acquired from theWeb are released, and these data improve the performance of applications in several NLP tasks. Although the language resources based on the web page unit are useful in NLP tasks and applications such as knowledge acquisition, document retrieval and document summarization, such language resources are not released so far. In this paper, we propose a data format for results of web page processing, and a search engine infrastructure which makes it possible to share approximately 100 million Japanese web data. By obtaining the web data, NLP researchers are enabled to begin their own processing immediately without analyzing web pages by themselves.

UFRA: a UIMA-based Approach to Federated Language Resource Architecture
Riccardo Del Gratta | Roberto Bartolini | Tommaso Caselli | Monica Monachini | Claudia Soria | Nicoletta Calzolari

In this paper we address the issue of developing an interoperable infrastructure for language resources and technologies. In our approach, called UFRA, we extend the Federate Database Architecture System adding typical functionalities caming from UIMA. In this way, we capitalize the advantages of a federated architecture, such as autonomy, heterogeneity and distribution of components, monitored by a central authority responsible for checking both the integration of components and user rights on performing different tasks. We use the UIMA approach to manage and define one common front-end, enabling users and clients to query, retrieve and use language resources and technologies. The purpose of this paper is to show how UIMA leads from a Federated Database Architecture to a Federated Resource Architecture, adding to a registry of available components both static resources such as lexicons and corpora and dynamic ones such as tools and general purpose language technologies. At the end of the paper, we present a case-study that adopts this framework to integrate the SIMPLE lexicon and TIMEML annotation guidelines to tag natural language texts.

The Metadata-Database of a Next Generation Sustainability Web-Platform for Language Resources
Georg Rehm | Oliver Schonefeld | Andreas Witt | Timm Lehmberg | Christian Chiarcos | Hanan Bechara | Florian Eishold | Kilian Evang | Magdalena Leshtanska | Aleksandar Savkov | Matthias Stark

Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.

From Field Notes towards a Knowledge Base
Piroska Lendvai | Steve Hunt

We describe the process of converting plain text cultural heritage data to elements of a domain-specific knowledge base, using general machine learning techniques. First, digitised expedition field notes are segmented and labelled automatically. In order to obtain perfect records, we create an annotation tool that features selective sampling, allowing domain experts to validate automatically labelled text, which is then stored in a database. Next, the records are enriched with semi-automatically derived secondary metadata. Metadata enable fine-grained querying, the results of which are additionally visualised using maps and photos.

Construction of a Metadata Database for Efficient Development and Use of Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara

The National Institute of Information and Communications Technology (NICT) and Nagoya University have been jointly constructing a large scale database named SHACHI by collecting detailed meta-information on language resources (LRs) in Asia and Western countries, for the purpose of effectively combining LRs. The purpose of this project is to investigate languages, tag sets, and formats compiled in LRs throughout the world, to systematically store LR metadata, to create a search function for this information, and to ultimately utilize all this for a more efficient development of LRs. This metadata database contains more than 2,000 compiled LRs such as corpora, dictionaries, thesauruses and lexicons, forming a large scale metadata of LRs archive. Its metadata, an extended version of OLAC metadata set conforming to Dublin Core, which contain detailed meta-information, have been collected semi-automatically. This paper explains the design and the structure of the metadata database, as well as the realization of the catalogue search tool. Additionally, the website of this database is now open to the public and accessible to all Internet users.

A Taxonomy of Lexical Metadata Categories
Bodil Nistrup Madsen | Hanne Erdman Thomsen

Metadata registries comprising sets of categories to be used in data collections exist in many fields. The purpose of a metadata registry is to facilitate data exchange and interoperability within a domain, and registries often contain definitions and examples. In this paper we will argue that in order to ensure completeness, consistency, user-friendliness and extensibility, metadata registries should be structured as taxonomies. Furthermore we will illustrate the usefulness of using terminological ontologies as the basis for developing metadata taxonomies. In this connection we will discuss the principles of developing ontologies and the differences between taxonomies and ontologies. The paper includes examples of initiatives for developing metadata standards within the field of language resources, more specifically lexical data categories, elaborated at international and national level. However, the principles that we introduce for the development of data category registries are relevant not only for metadata registries for lexical resources, but for all kinds of metadata registries.

The 2008 Oriental COCOSDA Book Project: in Commemoration of the First Decade of Sustained Activities in Asia
Shuichi Itahashi | Chiu-yu Tseng

The purpose of Oriental COCOSDA is to provide the Asian community a platform to exchange ideas, to share information and to discuss regional matters on creation, utilization, dissemination of spoken language corpora of oriental languages and also on the assessment methods of speech recognition/synthesis systems as well as to promote speech research on oriental languages. Since its preparatory meeting in Hong Kong in 1997, annual workshops have been organized and held in Japan, Taiwan, China, Korea, Thailand, Singapore, India, Indonesia, Malaysia, and Vietnam from 1998 onwards. The organization is managed by a convener, three advisory members, and 26 committee members from 13 regions in Oriental area. In order to commemorate 10 years of continued activities, the members have decided to publish a book which covers a wide range of speech research. Special focus will be on speech resources or speech corpora in Oriental countries and standardization of speech input/output systems performance evaluation methods on which key technologies for speech systems development are based. The book will also include linguistic outlines of oriental languages, annotation, labeling, and software tools for speech processing.

Towards the National Corpus of Polish
Adam Przepiórkowski | Rafał L. Górski | Barbara Lewandowska-Tomaszyk | Marek Łaziński

This paper presents a new corpus project, aiming at building a national corpus of Polish. What makes it different from a typical YACP (Yet Another Corpus Project) is 1) the fact that all four partners in the project have in the past constructed corpora of Polish, sometimes in the spirit of collaboration, at other times - in the spirit of competition, 2) the partners bring into the project varying areas of expertise and experience, so the synergy effect is anticipated, 3) the corpus will be built with an eye on specific applications in various fields, including lexicography (the corpus will be the empirical basis of a new large general dictionary of Polish) and natural language processing (a number of NLP tools will be constructed within the project).

Strengthening the Estonian Language Technology
Einar Meister | Jaak Vilo

The paper will give an overview of developments in Estonia in the field of Human Language Technologies. Despite of the fact that Estonian is one of the smallest official languages in EU and therefore in less favourable position in the HLT-market, the national initiatives are undertaken in order to promote HLT development in Estonia. The paper will introduce recent activities in Estonia, including National Programme for Estonian Language Technology (2006-2010).

MEDAR: Collaboration between European and Mediterranean Arabic Partners to Support the Development of Language Technology for Arabic
Bente Maegaard | Mohammed Atiyya | Khalid Choukri | Steven Krauwer | Chafic Mokbel | Mustafa Yaseen

After the successful completion of the NEMLAR project 2003-2005, a new opportunity for a project was opened by the European Commission, and a group of largely the same partners is now executing the MEDAR project. MEDAR will be updating the surveys and BLARK for Arabic already made, and will then focus on machine translation (and other tools for translation) and information retrieval with a focus on language resources, tools and evaluation for these applications. A very important part of the MEDAR project is to reinforce and extend the NEMLAR network and to create a cooperation roadmap for Human Language Technologies for Arabic. It is expected that the cooperation roadmap will attract wide attention from other parties and that it can help create a larger platform for collaborative projects. Finally, the project will focus on dissemination of knowledge about existing resources and tools, as well as actors and activities; this will happen through newsletter, website and an international conference which will follow up on the Cairo conference of 2004. Dissemination to user communities will also be important, e.g. through participation in translators? conferences. The goal of these activities is to create a stronger and lasting collaboration between EU countries and Arabic speaking countries.

Slovene Terminology Web Portal and the TBX-Compatible Simplified DTD/schema
Simon Krek | Vojko Gorjanc | Špela Arhar

The paper describes the project whose main purpose is the creation of the Slovene terminology web portal, funded by the Slovene Research Agency and the Amebis software company. It focuses on the DTD/schema used for the unification of different terminology resources in different input formats into one database available on the web. Two projects involving unification DTD/schemas were taken as the model for the resulting DTD/schema: the CONCEDE project and the TMF project. The final DTD/schema was tested on twenty different specialized dictionaries, both monolingual and bilingual, in various formats either without any existing markup or with complex XML structure. The result of the project will be an on-line terminology resource for Slovenian which will also include didactic material on terminology and free tools for uploading domain-specific text collections to be processed with NLP software, including a term extractor.

LIRICS Semantic Role Annotation: Design and Evaluation of a Set of Data Categories
Volha Petukhova | Harry Bunt

Semantic roles have often proved to be useful labels for stating linguistic generalisations of various sorts. There is, however, a lack of agreement on their defining criteria, which causes serious problems for semantic roles to be a useful classificatory device for predicate-argument relations. These criteria should (a) support the design of a semantic role set which is complete but does not contain redundant relations; (b) be based on semantic rather than morphological, lexical or syntactic properties; and (c) enable formal interpretation. In this paper we report on the analyses of alternative approaches to annotation and representation of semantic role information (such as FrameNet, PropBank and VerbNet) with respect to their models of description, granularity of semantic role sets, definitions of semantic roles concepts, consistency and reliability of annotations. We present methodological principles for characterising well-defined concepts which were developed within the LIRICS (Linguistic InfRastructure for Interoperable ResourCes and Systems; see project, as well as the designed set of semantic roles and their definitions in ISO 12620 format. We discuss evaluation results of the defined concepts for semantic role annotation concerning the redundancy and completeness of the tagset and the reliability of annotations in terms of inter-annotator agreement.

Reusable Tagset Conversion Using Tagset Drivers
Daniel Zeman

Part-of-speech or morphological tags are important means of annotation in a vast number of corpora. However, different sets of tags are used in different corpora, even for the same language. Tagset conversion is difficult, and solutions tend to be tailored to a particular pair of tagsets. We propose a universal approach that makes the conversion tools reusable. We also provide an indirect evaluation in the context of a parsing task.

Presentation of the New ISO-Standard for the Representation of Entries in Dictionaries: ISO 1951
Marie-Jeanne Derouin | André Le Meur

Times have changed over the last ten years in terms of dictionary production. With the introduction of digital support and networking, the lifespan of dictionaries has been considerably extended. The dictionary manuscript has become a unique data-source that can be reused and manipulated many times by numerous in-house and external experts. The traditional relationship between author, publisher and user has now been extended to include other partners: data-providers - either other publishers or institutions or industry-partners - , software developers, language-tool providers, etc. All these dictionary experts need a basic common language to optimize their work flow and to be able to co-operate in developing new products while avoiding time-consuming and expensive data manipulations. In this paper we will first of all present the ISO standardization for Lexicography which takes these new market needs into account, and then go on to describe the new standard ISO 1951: -“Presentation/Representation” of entries in dictionaries- which was published in March 2007. In conclusion, we will outline the benefits of standardization for the dictionary publishing industry.

ISOcat: Corralling Data Categories in the Wild
Marc Kemps-Snijders | Menzo Windhouwer | Peter Wittenburg | Sue Ellen Wright

To achieve true interoperability for valuable linguistic resources different levels of variation need to be addressed. ISO Technical Committee 37, Terminology and other language and content resources, is developing a Data Category Registry. This registry will provide a reusable set of data categories. A new implementation, dubbed ISOcat, of the registry is currently under construction. This paper shortly describes the new data model for data categories that will be introduced in this implementation. It goes on with a sketch of the standardization process. Completed data categories can be reused by the community. This is done by either making a selection of data categories using the ISOcat web interface, or by other tools which interact with the ISOcat system using one of its various Application Programming Interfaces. Linguistic resources that use data categories from the registry should include persistent references, e.g. in the metadata or schemata of the resource, which point back to their origin. These data category references can then be used to determine if two or more resources share common semantics, thus providing a level of interoperability close to the source data and a promising layer for semantic alignment on higher levels.

Standardising Bilingual Lexical Resources According to the Lexicon Markup Framework
Isa Maks | Carole Tiberius | Remco van Veenendaal

The Dutch HLT agency for language and speech technology (known as TST-centrale) at the Institute for Dutch Lexicology is responsible for the maintenance, distribution and accessibility of (Dutch) digital language resources. In this paper we present a project which aims to standardise the format of a set of bilingual lexicons in order to make them available to potential users, to facilitate the exchange of data (among the resources and with other (monolingual) resources) and to enable reuse of these lexicons for NLP applications like machine translation and multilingual information retrieval. We pay special attention to the methods and tools we used and to some of the problematic issues we encountered during the conversion process. As these problems are mainly caused by the fact that the standard LMF model fails in representing the detailed semantic and pragmatic distinctions made in our bilingual data, we propose some modifications to the standard. In general, we think that a standard for lexicons should provide a model for bilingual lexicons that is able to represent all detailed and fine-grained translation information which is generally found in these types of lexicons.

A Framework for Standardized Syntactic Annotation
Thierry Declerck

This poster presents an ISO framework for the standardization of syntactic annotation (SynAF). The normative part SynAF is concerned with a metamodel for syntactic annotation that covers both dimensions of constituency and dependency, and propose thus a multi-layered annotation framework that allows the combined and interrelated annotation of language data along both lines of consideration. This standard is designed to be used in close conjuncion with the metamodel presented in the Linguistic Annotation Framework (LAF) and with ISO 12620, Terminology and other language resources - Data categories.

A Guide for the Production of Reusable Language Resources
Victoria Arranz | Franck Gandcher | Valérie Mapelli | Khalid Choukri

The project described in this paper is funded by the French Ministry of Research. It aims at providing producers of Language Resources, and HLT players in general, with a guide which offers technical, legal and strategic recommendations/guidelines for the reuse of their Language Resources. The guide is dedicated in particular to academic laboratories which produce Language Resources and may benefit from further advice to start development, but also to any HLT player who wishes to follow the best practices in this field. The guidelines focus on different steps of a Language Resource’s life, i.e. specifications, production, validation, distribution, and maintenance. This paper gives a brief overview of the guide, and describes a) technical formats, standards and best practices which correspond to the current state of the art, for different types of resources, whether written or spoken, at different steps of the production line, b) legal issues and models/templates which can be used for the dissemination of Language Resources as widely as possible, c) strategic issues, by offering a dissemination plan which takes into account all types of constraints faced by HLT community players.

Prolexbase: a Multilingual Relational Lexical Database of Proper Names
Denis Maurel

This paper deals with a multilingual relational lexical database of proper name, Prolexbase, a free resource available on the CNRTL website. The Prolex model is based on two main concepts: firstly, a language independent pivot and, secondly, the prolexeme (the projection of the pivot onto particular language), that is a set of lemmas (names and derivatives). These two concepts model the variations of proper name: firstly, independent of language and, secondly, language dependent by morphology or knowledge. Variation processing is very important for NLP: the same proper name can be written in different instances, maybe in different parts of speech, and it can also be replaced by another one, a lexical anaphora (that reveals semantic link). The pivot represents different referent’s points of view, i.e. language independent variations of name. Pivots are linked by three semantic relations (quasi-synonymy, partitive relation and associative relation). The prolexeme is a set of variants (aliases), quasi-synonyms and morphosemantic derivatives. Prolexemes are linked to classifying contexts and reliability code.

Ontologizing Lexicon Access Functions based on an LMF-based Lexicon Taxonomy
Yoshihiko Hayashi | Chiharu Narawa | Monica Monachini | Claudia Soria | Nicoletta Calzolari

This paper discusses ontologization of lexicon access functions in the context of a service-oriented language infrastructure, such as the Language Grid. In such a language infrastructure, an access function to a lexical resource, embodied as an atomic Web service, plays a crucially important role in composing a composite Web service tailored to a user’s specific requirement. To facilitate the composition process involving service discovery, planning and invocation, the language infrastructure should be ontology-based; hence the ontologization of a range of lexicon functions is highly required. In a service-oriented environment, lexical resources however can be classified from a service-oriented perspective rather than from a lexicographically motivated standard. Hence to address the issue of interoperability, the taxonomy for lexical resources should be ground to principled and shared lexicon ontology. To do this, we have ontologized the standardized lexicon modeling framework LMF, and utilized it as a foundation to stipulate the service-oriented lexicon taxonomy and the corresponding ontology for lexicon access functions. This paper also examines a possible solution to fill the gap between the ontological descriptions and the actual Web service API by adopting a W3C recommendation SAWSDL, with which Web service descriptions can be linked with the domain ontology.

Romanian Lexical Data Bases: Inflected and Syllabic Forms Dictionaries
Ana-Maria Barbu

This paper presents two lexical data bases for Romanian: RoMorphoDict, a dictionary of inflected forms and RoSyllabiDict, a dictionary of syllabified inflected forms. Each data basis is available in two Unicode formats: text and XML. An entry of RoMorphoDict, in text format, contains information on inflected form, its lemma, its morpho-syntactic description and the marking of the stressed vowel in pronunciation, while in XML format, an entry, representing the whole paradigm of a word, contains further informations about roots and paradigm class. An entry of RoSyllabiDict, in both formats, contains information about unsyllabified word, its syllabified correspondent, grammatical information and/or type of syllabification, if it is the case. The stressed vowel is also marked on the syllabified form. Each lexical data base includes the corresponding inflected forms of about 65,000 lemmas, that is, over 700,000 entries in RoMorphoDict, and over 500,000 entries in RoSyllabiDict. Both resources are available for free. The paper describes in detail the content of these data bases and the procedure of building them.

Producing an Encyclopedic Dictionary using Patent Documents
Atsushi Fujii

Although the World Wide Web has late become an important source to consult for the meaning of words, a number of technical terms related to high technology are not found on the Web. This paper describes a method to produce an encyclopedic dictionary for high-tech terms from patent information. We used a collection of unexamined patent applications published by the Japanese Patent Office as a source corpus. Given this collection, we extracted terms as headword candidates and retrieved applications including those headwords. Then, we extracted paragraph-style descriptions and categorized them into technical domains. We also extracted related terms for each headword. We have produced a dictionary including approximately 400,000 Japanese terms as headwords. We have also implemented an interface with which users can explore our dictionary by reading text descriptions and viewing a related-term graph.

Evaluating the Relationship between Linguistic and Geographic Distances using a 3D Visualization
Folkert de Vriend | Jan Pieter Kunst | Louis ten Bosch | Charlotte Giesbers | Roeland van Hout

In this paper we discuss how linguistic and geographic distances can be related using a 3D visualization. We will convert linguistic data for locations along the German-Dutch border to linguistic distances that can be compared directly to geographic distances. This enables us to visualize linguistic distances as “real” distances with the use of the third dimension available in 3D modelling software. With such a visualization we will test if descriptive dialect data support the hypothesis that the German-Dutch state border became a linguistic border between the German and Dutch dialects. Our visualization is implemented in the 3D modelling software SketchUp.

Enhancing an English-Polish Electronic Dictionary for Multiword Expression Research
Piotr Bański | Radosław Moszczyński

This paper describes a project aimed at converting a legacy representation of English idioms into an XML-based format. The project is set in the context of a large electronic English-Polish dictionary which contains several hundred formalized idiom descriptions and which has been released under the terms of a free license. In short, the project consists of three phases: cleaning up the dictionary markup, extracting the legacy idiom representations, and converting them into TEI P5 XML constrained by a RelaxNG grammar created for this purpose and constituting a module that can be included as part of the TEI P5 schema. The paper contains general descriptions of the individual phases and several examples of XML-encoded idioms. It also suggests some directions for further research, which include abstracting the XML-ized idiom representations into general syntactic patterns and using the representations to automatically identify idioms in tagged corpora.

ProPOSEL: A Prosody and POS English Lexicon for Language Engineering
Claire Brierley | Eric Atwell

ProPOSEL is a prototype prosody and PoS (part-of-speech) English lexicon for Language Engineering, derived from the following language resources: the computer-usable dictionary CUVPlus, the CELEX-2 database, the Carnegie-Mellon Pronouncing Dictionary, and the BNC, LOB and Penn Treebank PoS-tagged corpora. The lexicon is designed for the target application of prosodic phrase break prediction but is also relevant to other machine learning and language engineering tasks. It supplements the existing record structure for wordform entries in CUVPlus with syntactic annotations from rival PoS-tagging schemes, mapped to fields for default closed and open-class word categories and for lexical stress patterns representing the rhythmic structure of wordforms and interpreted as potential new text-based features for automatic phrase break classifiers. The current version of the lexicon comes as a textfile of 104052 separate entries and is intended for distribution with the Natural Language ToolKit; it is therefore accompanied by supporting Python software for manipulating the data so that it can be used for Natural Language Processing (NLP) and corpus-based research in speech synthesis and speech recognition.

Creating Glossaries Using Pattern-Based and Machine Learning Techniques
Eline Westerhout | Paola Monachesi

One of the aims of the Language Technology for eLearning project is to show that Natural Language Processing techniques can be employed to enhance the learning process. To this end, one of the functionalities that has been developed is a pattern-based glossary candidate detector which is capable of extracting definitions in eight languages. In order to improve the results obtained with the pattern-based approach, machine learning techniques are applied on the Dutch results to filter out incorrectly extracted definitions. In this paper, we discuss the machine learning techniques used and we present the results of the quantitative evaluation. We also discuss the integration of the tool into the Learning Management System ILIAS.

Using Similarity Measures to Extend the LinGO Lexicon
Lynne Cahill

Deep processing of natural language requires large scale lexical resources that have sufficient coverage at a sufficient level of detail and accuracy (i.e. both recall and precision). Hand-crafted lexicons are extremely labour-intensive to create and maintain, and require continuous updating and extension to retain their level of usability. In this paper we present a technique for extending lexicons using similarity measures that can be extracted from corpora. The technique involves creating lexical entries for unknown words based on entries for words that are known and that are deemed to be distributionally similar. We demonstrate the applicability of the approach by providing an extended lexicon for the LinGO system using similarity measures extracted from the BNC. We also discuss the advantages and disadvantages of using such lexical extensions in different ways: principally either as part of the main lexicon or as a separate resource used only for “last resort” use.

Acquiring a Poor Man’s Inflectional Lexicon for German
Peter Adolphs

Many NLP modules and applications require the availability of a module for wide-coverage inflectional analysis. One way to obtain such analyses is to use a morphological analyser in combination with an inflectional lexicon. Since large text corpora nowadays are easily available and inflectional systems are in general well understood, it seems feasible to acquire lexical data from raw texts, guided by our knowledge of inflection. I present an acquisition method along these lines for German. The general idea can be roughly summarised as follows: first, generate a set of lexical entry hypotheses for each word-form in the corpus; then, select hypotheses that explain the word-forms found in the corpus “best”. To this end, I have turned an existing morphological grammar, cast in finite-state technology (Schmid et al. 2004), into a hypothesiser for lexical entries. Irregular forms are simply listed so that they do not interfere with the regular rules used in the hypothesiser. Running the hypothesiser on a text corpus yields a large number of lexical entry hypotheses. These are then ranked according to their validity with the help of a statistical model that is based on the number of attested and predicted word forms for each hypothesis.

COLDIC, a Lexicographic Platform for LMF compliant lexica
Núria Bel | Sergio Espeja | Montserrat Marimon | Marta Villegas

Despite of the importance of lexical resources for a number of NLP applications (Machine Translation, Information Extraction, Question Answering, among others), there has been a traditional lack of generic tools for the creation, maintenance and management of computational lexica. The most direct obstacle for the development of generic tools, independent of any particular application format, was the lack of standards for the description and encoding of lexical resources. The availability of the Lexical Markup Framework (LMF) has changed this scenario and has made it possible the development of generic lexical platforms. COLDIC is a generic platform for working with computational lexica. The system has been designed to let the user concentrate on lexicographical tasks, but still being autonomous in the management of the tools. The creation and maintenance of the database, which is the core of the tool, demand no specific training in databases. A LMF compliant schema implemented in a Document Type Definition (DTD) describing the lexical resources is taken by the system to automatically configure the platform. Besides, the most standard web services for interoperability are also generated automatically. Other components of the platform include build-in functions supporting the most common tasks of the lexicographic work.

The Annotation Guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank: the Treatment of some specific Syntactic Constructions in Latin
David Bamman | Marco Passarotti | Roberto Busa | Gregory Crane

The paper describes the treatment of some specific syntactic constructions in two treebanks of Latin according to a common set of annotation guidelines. Both projects work within the theoretical framework of Dependency Grammar, which has been demonstrated to be an especially appropriate framework for the representation of languages with a moderately free word order, where the linear order of constituents is broken up with elements of other constituents. The two projects are the first of their kind for Latin, so no prior established guidelines for syntactic annotation are available to rely on. The general model for the adopted style of representation is that used by the Prague Dependency Treebank, with departures arising from the Latin grammar of Pinkster, specifically in the traditional grammatical categories of the ablative absolute, the accusative + infinitive, and gerunds/gerundives. Sharing common annotation guidelines allows us to compare the datasets of the two treebanks for tasks such as mutually checking annotation consistency, diachronically studying specific syntactic constructions, and training statistical dependency parsers.

Unsupervised Lexical Acquisition for Part of Speech Tagging
Dan Tufiş | Elena Irimia | Radu Ion | Alexandru Ceauşu

It is known that POS tagging is not very accurate for unknown words (words which the POS tagger has not seen in the training corpora). Thus, a first step to improve the tagging accuracy would be to extend the coverage of the tagger’s learned lexicon. It turns out that, through the use of a simple procedure, one can extend this lexicon without using additional, hard to obtain, hand-validated training corpora. The basic idea consists of merely adding new words along with their (correct) POS tags to the lexicon and trying to estimate the lexical distribution of these words according to similar ambiguity classes already present in the lexicon. We present a method of automatically acquire high quality POS tagging lexicons based on morphologic analysis and generation. Currently, this procedure works on Romanian for which we have a required paradigmatic generation procedure but the architecture remains general in the sense that given the appropriate substitutes for the morphological generator and POS tagger, one should obtain similar results.

A Hybrid Approach to Extracting and Classifying Verb+Noun Constructions
Amalia Todiraşcu | Dan Tufiş | Ulrich Heid | Christopher Gledhill | Dan Ştefanescu | Marion Weller | François Rousselot

We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first method applies language-indepedent statistical techniques followed by a linguistic filtering, while the second approach, available only for German, is based on a set of lexico-syntactic patterns to extract collocation candidates. To define extraction and filtering patterns, we studied a specific collocation category, the Verb-Noun constructions, using a model inspired by the systemic functional grammar, proposing three level analysis: lexical, functional and semantic criteria. From tagged and lemmatized corpus, we identify some contextual morpho-syntactic properties helping to filter the output of the statistical methods and to extract some potential interesting VN constructions (complex predicates vs complex predicators). The extracted candidates are validated and classified manually.

Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds.
Ekaterina Lapshinova-Koltunski | Ulrich Heid

In this paper we discuss an approach to the semi-automatic extraction and classification of the compounds extracted from German corpora. Compound nominals are semi-automatically extracted from text corpora along with their sentential complements. In this study we concentrate on that­, wh­ or if subclauses although our methods can be applied to other complements as well. We elaborate an architecture using linguistic knowledge about the phenomena we extract, and aim at answering the following questions: how can data about subcategorisation properties of nominal compounds be extracted from text corpora, and how can compounds be classified according to their subcategorisation properties? Our classification is based on the relationships between the subcategorisation of nominal compounds, e.g. Grundfrage, Wettstreit and Beweismittel, and that of their constituent parts, such as Frage, Streit, Beweis, etc. We show that there are cases which do not match the commonly accepted assumption that the head of a compound is its valency bearer. Such cases should receive a specific treatment in NLP dictionary building. This calls for tools to identify and classify such cases by means of data extraction from corpora. We propose precision-oriented semi­automatic extraction which can operate on tokenized, tagged and lemmatized texts. In the future, we are going to extend the kinds of extracted complements beyond subclauses and analyze the nature of the non-head valency-bearer of compounds, as well as an extension of the kinds of extracted complements beyond subclauses.

A LAF/GrAF based Encoding Scheme for underspecified Representations of syntactic Annotations.
Manuel Kountz | Ulrich Heid | Kerstin Eckart

Data models and encoding formats for syntactically annotated text corpora need to deal with syntactic ambiguity; underspecified representations are particularly well suited for the representation of ambiguous data because they allow for high informational efficiency. We discuss the issue of being informationally efficient, and the trade-off between efficient encoding of linguistic annotations and complete documentation of linguistic analyses. The main topic of this article is a data model and an encoding scheme based on LAF/GrAF (Ide and Romary, 2006; Ide and Suderman, 2007) which provides a flexible framework for encoding underspecified representations. We show how a set of dependency structures and a set of TiGer graphs (Brants et al., 2002) representing the readings of an ambiguous sentence can be encoded, and we discuss basic issues in querying corpora which are encoded using the framework presented here.

The JOS Morphosyntactically Tagged Corpus of Slovene
Tomaž Erjavec | Simon Krek

The JOSmorphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpora: jos100k, a 100,000 word balanced monolingual sampled corpus annotated with hand validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, the 1 million-word partially hand validated corpus. The two corpora have been sampled from the 600M-word Slovene reference corpus FidaPLUS. The JOS resources have a standardised encoding, with the MULTEXT-East-type morphosyntactic specifications and the corpora encoded according to the Text Encoding Initiative Guidelines P5. JOS resources are available as a dataset for research under the Creative Commons licence and are meant to facilitate developments of HLT for Slovene.

♠ Demo: An Open Source Tool for Partial Parsing and Morphosyntactic Disambiguation
Aleksander Buczyński | Adam Przepiórkowski

The paper presents Spejd, an Open Source Shallow Parsing and Disambiguation Engine. Spejd (abbreviated to ♠) is based on a fully uniform formalism both for constituency partial parsing and for morphosyntactic disambiguation - the same grammar rule may contain structure-building operations, as well as morphosyntactic correction and disambiguation operations. The formalism and the engine are more flexible than either the usual shallow parsing formalisms, which assume disambiguated input, or the usual unification-based formalisms, which couple disambiguation (via unification) with structure building. Current applications of Spejd include rule-based disambiguation, detection of multiword expressions, valence acquisition, and sentiment analysis. The functionality can be further extended by adding external lexical resources. While the examples are based on the set of rules prepared for the parsing of the IPI PAN Corpus of Polish, ♠ is fully language-independent and we hope it will also be useful in the processing of other languages.

Annotating Superlatives
Silke Scheible

This paper describes a three-part annotation scheme for superlatives. The first identifies syntactic classes, since superlatives can serve different semantic purposes. The second and third only apply to superlatives that express straight-forward comparisons between targets and their comparison sets. The second form of annotation identifies the spans of each target and comparison set, which is of interest for relation extraction. The third form labels superlatives as facts or opinions, which has not yet been undertaken in the area of sentiment detection. The annotation scheme has been tested and evaluated on 500 tokens of superlatives, the results of which are presented in Section 5. In addition to providing a platform for investigating superlatives on a larger scale, this research also introduces a new text-based Wikipedia corpus which is especially suitable for linguistic research.

POS Tagging for German: how important is the Right Context?
Steliana Ivanova | Sandra Kuebler

Part-of-Speech tagging is generally performed by Markov models, based on bigram or trigram models. While Markov models have a strong concentration on the left context of a word, many languages require the inclusion of right context for correct disambiguation. We show for German that the best results are reached by a combination of left and right context. If only left context is available, then changing the direction of analysis and going from right to left improves the results. In a version of MBT with default parameter settings, the inclusion of the right context improved POS tagging accuracy from 94.00% to 96.08%, thus corroborating our hypothesis. The version with optimized parameters reaches 96.73%.

UnsuParse: unsupervised Parsing with unsupervised Part of Speech Tagging
Christian Hänig | Stefan Bordag | Uwe Quasthoff

Based on simple methods such as observing word and part of speech tag co-occurrence and clustering, we generate syntactic parses of sentences in an entirely unsupervised and self-inducing manner. The parser learns the structure of the language in question based on measuring “breaking points” within sentences. The learning process is divided into two phases, learning and application of learned knowledge. The basic learning works in an iterative manner which results in a hierarchical constituent representation of the sentence. Part-of-Speech tags are used to circumvent the data sparseness problem for rare words. The algorithm is applied on untagged data, on manually assigned tags and on tags produced by an unsupervised part of speech tagger. The results are unsurpassed by any self-induced parser and challenge the quality of trained parsers with respect to finding certain structures such as noun phrases.

Enriching the Venice Italian Treebank with Dependency and Grammatical Relations
Sara Tonelli | Rodolfo Delmonte | Antonella Bristot

In this paper we propose a rule-based approach to extract dependency and grammatical functions from the Venice Italian Treebank, a Treebank of written text with PoS and constituent labels consisting of 10,200 utterances and about 274,000 tokens. As manual corpus annotation is expensive and time-consuming, we decided to exploit this existing constituency-based Treebank to derive dependency structures with lower effort. After describing the procedure to extract heads and dependents, based on a head percolation table for Italian, we introduce the rules adopted to add grammatical relation labels. To this purpose, we manually relabeled all non-canonical arguments, which are very frequent in Italian, then we automatically labeled the remaining complements or arguments following some syntactic restrictions based on the position of the constituents w.r.t to parent and sibling nodes. The final section of the paper describes evaluation results. Evaluation was carried out in two steps, one for dependency relations and one for grammatical roles. Results are in line with similar conversion algorithms carried out for other languages, with 0.97 precision on dependency arcs and F-measure for the main grammatical functions scoring 0.96 or above, except for obliques with 0.75.

Rule-Based Chunker for Croatian
Kristina Vučković | Marko Tadić | Zdravko Dovedan

In this paper we discuss a rule-based approach to chunking sentences in Croatian, implemented using local regular grammars within the NooJ development environment. We describe the rules and their implementation by regular grammars and at the same time show that in NooJ environment it is extremely easy to fine tune their different sub-rules. Since Croatian has strong morphosyntactic features that are shared between most or all elements of a chunk, the rules are built by taking these features into account and strongly relying on them. For the evaluation of our chunker we used a extracted set of manually annotated sentences from 100 kw MSD/tagged and disambiguated Croatian corpus. Our chunker performed the best on VP-chunks (F: 97.01), while NP-chunks (F: 92.31) and PP-chunks (F: 83.08) were of lower quality. The results are comparable to chunker performance of CoNLL-2000 shared task of chunking.

Learning properties of Noun Phrases: from data to functions
Valeria Quochi | Basilio Calderone

The paper presents two experiments of unsupervised classification of Italian noun phrases. The goal of the experiments is to identify the most prominent contextual properties that allow for a functional classification of noun phrases. For this purpose, we used a Self Organizing Map is trained with syntactically-annotated contexts containing noun phrases. The contexts are defined by means of a set of features representing morpho-syntactic properties of both nouns and their wider contexts. Two types of experiments have been run: one based on noun types and the other based on noun tokens. The results of the type simulation show that when frequency is the most prominent classification factor, the network isolates idiomatic or fixed phrases. The results of the token simulation experiment, instead, show that, of the 36 attributes represented in the original input matrix, only a few of them are prominent in the re-organization of the map. In particular, key features in the emergent macro-classification are the type of determiner and the grammatical number of the noun. An additional but not less interesting result is an organization into semantic/pragmatic micro-classes. In conclusions, our result confirm the relative prominence of determiner type and grammatical number in the task of noun (phrase)categorization.

A Study of Parentheticals in Discourse Corpora - Implications for NLG Systems
Eva Banik | Alan Lee

This paper presents a corpus study of parenthetical constructions in two different corpora: the Penn Discourse Treebank (PDTB, (PDTBGroup, 2008)) and the RST Discourse Treebank (Carlson et al., 2001). The motivation for the study is to gain a better understanding of the rhetorical properties of parentheticals in order to enable a natural language generation system to produce parentheticals as part of a rhetorically well-formed output. We argue that there is a correlation between syntactic and rhetorical types of parentheticals and establish two main categories: ELABORATION/EXPANSION-type NP-modifier parentheticals and NON-ELABORATION/EXPANSION-type VP- or S-modifier parentheticals. We show several strategies for extracting these from the two corpora and discuss how the seemingly contradictory results obtained can be reconciled in light of the rhetorical and syntactic properties of parentheticals as well as the decisions taken in the annotation guidelines.

Enhancing the Arabic Treebank: a Collaborative Effort toward New Annotation Guidelines
Mohamed Maamouri | Ann Bies | Seth Kulick

The Arabic Treebank team at the Linguistic Data Consortium has significantly revised and enhanced its annotation guidelines and procedure over the past year. Improvements were made to both the morphological and syntactic annotation guidelines, and annotators were trained in the new guidelines, focusing on areas of low inter-annotator agreement. The revised guidelines are now being applied in annotation production, and the combination of the revised guidelines and a period of intensive annotator training has raised inter-annotator agreement f-measure scores already and has also improved parsing results.

A Pilot Arabic Propbank
Martha Palmer | Olga Babko-Malaya | Ann Bies | Mona Diab | Mohamed Maamouri | Aous Mansouri | Wajdi Zaghouani

In this paper, we present the details of creating a pilot Arabic proposition bank (Propbank). Propbanks exist for both English and Chinese. However the morphological and syntactic expression of linguistic phenomena in Arabic yields a very different type of process in creating an Arabic propbank. Hence, we highlight those characteristics of Arabic that make creating a propbank for the language a different challenge compared to the creation of an English Propbank.We believe that many of the lessons learned in dealing with Arabic could generalise to other languages that exhibit equally rich morphology and relatively free word order.

Saxon: an Extensible Multimedia Annotator
Mark Greenwood | José Iria | Fabio Ciravegna

This paper introduces Saxon, a rule-based document annotator that is capable of processing and annotating several document formats and media, both within and across documents. Furthermore, Saxon is readily extensible to support other input formats due to both it’s flexible rule formalism and the modular plugin architecture of the Runes framework upon which it is built. In this paper we introduce the Saxon rule formalism through examples aimed at highlighting its power and flexibility.

Spatiotemporal Coding in ANVIL
Michael Kipp

We present a new coding mechanism, spatiotemporal coding, that allows coders to annotate points and regions in the video frame by drawing directly on the screen. Coders can not only attach labels to time intervals in the video but can specify a possibly moving region on the video screen. This opens up the spatial dimension for multi-track video coding and is an essential asset in almost every area of video coding, e.g. gesture coding, facial expression coding, encoding semantics for information retrieval etc. We discuss conceptual variants, design decisions and the relation to the MPEG-7 standard and tools.

Integrating Audio and Visual Information for Modelling Communicative Behaviours Perceived as Different
Michelina Savino | Laura Scivetti | Mario Refice

In human face-to-face interaction, participants can rely on a number of audio-visual information for interpreting interlocutors’ communicative intentions, such information strongly contributing to the successfulness of communication. Modelling these typical human abilities represents a main objective in human communication research, including technological applications like human-machine interaction. In this pilot study we explore the possibility of using audio-visual parameters for describing/measuring the differences perceived in interlocutor’s communicative behaviours. Preliminary results derived from the multimodal analysis of a single subject seem to indicate that measuring the distribution of some prosodic and hand gesture events which are temporally co-occurring contribute to the accounting of such perceived differences. Moreover, as far as gesture events are concerned, it has been observed that relevant information are not simply to be found in the occurences of single gestures, but mainly in some gesture modalities (for example, ’single stroke’ vs ’multiple stroke’ gestures, one-hand vs both-hands gestures, etc?). In this paper we also introduce and describe a software package, ViSuite, we developed for multimodal processing and used for the work described in his paper.

Annotation Tool Development for Large-Scale Corpus Creation Projects at the Linguistic Data Consortium
Kazuaki Maeda | Haejoong Lee | Shawn Medero | Julie Medero | Robert Parker | Stephanie Strassel

The Linguistic Data Consortium (LDC) creates a variety of linguistic resources - data, annotations, tools, standards and best practices - for many sponsored projects. The programming staff at LDC has created the tools and technical infrastructures to support the data creation efforts for these projects, creating tools and technical infrastructures for all aspects of data creation projects: data scouting, data collection, data selection, annotation, search, data tracking and worklow management. This paper introduces a number of samples of LDC programming staff’s work, with particular focus on the recent additions and updates to the suite of software tools developed by LDC. Tools introduced include the GScout Web Data Scouting Tool, LDC Data Selection Toolkit, ACK - Annotation Collection Kit, XTrans Transcription and Speech Annotation Tool, GALE Distillation Toolkit, and the GALE MT Post Editing Workflow Management System.

Design and Recording of Czech Audio-Visual Database with Impaired Conditions for Continuous Speech Recognition
Jana Trojanová | Marek Hrúz | Pavel Campr | Miloš Železný

In this paper we discuss the design, acquisition and preprocessing of a Czech audio-visual speech corpus. The corpus is intended for training and testing of existing audio-visual speech recognition system. The name of the database is UWB-07-ICAVR, where ICAVR stands for Impaired Condition Audio Visual speech Recognition. The corpus consists of 10,000 utterances of continuous speech obtained from 50 speakers. The total length of the database is 25 hours. Each utterance is stored as a separate sentence. The corpus extends existing databases by covering condition of variable illumination. We acquired 50 speakers, where half of them were men and half of them were women. Recording was done by two cameras and two microphones. Database introduced in this paper can be used for testing of visual parameterization in audio-visual speech recognition (AVSR). Corpus can be easily split into training and testing part. Each speaker pronounced 200 sentences: first 50 were the same for all, the rest of them were different. Six types of illumination were covered. Session for one speaker can fit on one DVD disk. All files are accompanied by visual labels. Labels specify region of interest (mouth and area around them specified by bounding box). Actual pronunciation of each sentence is transcribed into the text file.

The UJIpenchars Database: a Pen-Based Database of Isolated Handwritten Characters
D. Llorens | F. Prat | A. Marzal | J. M. Vilar | M. J. Castro | J. C. Amengual | S. Barrachina | A. Castellanos | S. España | J. A. Gómez | J. Gorbe | A. Gordo | V. Palazón | G. Peris | R. Ramos-Garijo | F. Zamora

The availability of large amounts of data is a fundamental prerequisite for building handwriting recognition systems. Any system needs a test set of labelled samples for measuring its performance along its development and guiding it. Moreover, there are systems that need additional samples for learning the recognition task they have to cope with later, i.e. a training set. Thus, the acquisition and distribution of standard databases has become an important issue in the handwriting recognition research community. Examples of widely used databases in the online domain are UNIPEN, IRONOFF, and Pendigits. This paper describes the current state of our own database, UJIpenchars, whose first version contains online representations of 1,364 isolated handwritten characters produced by 11 writers and is freely available at the UCI Machine Learning Repository. Moreover, we have recently concluded a second acquisition phase, totalling more than 11,000 samples from 60 writers to be made available in short as UJIpenchars2.

Sign Language Corpus Annotation: toward a new Methodology
Emilie Chételat-Pelé | Annelies Braffort

This paper deals with non manual gestures annotation involved in Sign Language within the context of automatic generation of Sign Language. We will tackle linguistic researches in sign language, present descriptions of non manual gestures and problems lead to movement description. Then, we will propose a new annotation methodology, which allows non manual gestures description. This methodology can describe all Non Manual Gestures with precision, economy and simplicity. It is based on four points: Movement description (instead of position description); Movement decomposition (the diagonal movement is described with horizontal movement and vertical movement separately); Element decomposition (we separate higher eyelid and lower eyelid); Use of a set of symbols rather than words. One symbol can describe many phenomena (with use of colours, height...). First analysis results allow us to define precisely the structure of eye blinking and give the very first ideas for the rules to be designed. All the results must be refined and confirmed by extending the study on the whole corpus. In a second step, our annotation will be used to produce analyses in order to define rules and structure definition of Non Manual Gestures that will be evaluate in LIMSI’s automatic French Sign Language generation system.

Benchmark Databases for Video-Based Automatic Sign Language Recognition
Philippe Dreuw | Carol Neidle | Vassilis Athitsos | Stan Sclaroff | Hermann Ney

A new, linguistically annotated, video database for automatic sign language recognition is presented. The new RWTH-BOSTON-400 corpus, which consists of 843 sentences, several speakers and separate subsets for training, development, and testing is described in detail. For evaluation and benchmarking of automatic sign language recognition, large corpora are needed. Recent research has focused mainly on isolated sign language recognition methods using video sequences that have been recorded under lab conditions using special hardware like data gloves. Such databases have often consisted generally of only one speaker and thus have been speaker-dependent, and have had only small vocabularies. A new database access interface, which was designed and created to provide fast access to the database statistics and content, makes it possible to easily browse and retrieve particular subsets of the video database. Preliminary baseline results on the new corpora are presented. In contradistinction to other research in this area, all databases presented in this paper will be publicly available.

The ATIS Sign Language Corpus
Jan Bungeroth | Daniel Stein | Philippe Dreuw | Hermann Ney | Sara Morrissey | Andy Way | Lynette van Zijl

Systems that automatically process sign language rely on appropriate data. We therefore present the ATIS sign language corpus that is based on the domain of air travel information. It is available for five languages, English, German, Irish sign language, German sign language and South African sign language. The corpus can be used for different tasks like automatic statistical translation and automatic sign language recognition and it allows the specific modeling of spatial references in signing space.

Collection and Preprocessing of Czech Sign Language Corpus for Sign Language Recognition
Pavel Campr | Marek Hrúz | Jana Trojanová

This paper discusses the design, recording and preprocessing of a Czech sign language corpus. The corpus is intended for training and testing of sign language recognition (SLR) systems. The UWB-07-SLR-P corpus contains video data of 4 signers recorded from 3 different perspectives. Two of the perspectives contain whole body and provide 3D motion data, the third one is focused on signer’s face and provide data for face expression and lip feature extraction. Each signer performed 378 signs with 5 repetitions. The corpus consists of several types of signs: numbers (35 signs), one and two-handed finger alphabet (64), town names (35) and other signs (244). Each sign is stored in a separate AVI file. In total the corpus consists of 21853 video files in total length of 11.1 hours. Additionally each sign is preprocessed and basic features such as 3D hand and head trajectories are available. The corpus is mainly focused on feature extraction and isolated SLR rather than continuous SLR experiments.

A Multimodal Infant Behavior Annotation for Developmental Analysis of Demonstrative Expressions
Shigeyoshi Kitazawa | Shinya Kiriyama | Tomohiko Kasami | Shogo Ishikawa | Naofumi Otani | Hiroaki Horiuchi | Yoichi Takebayashi

We have obtained the valuable findings about the developmental processes of demonstrative expression skills, which is concerned with the fundamental commonsense of human knowledge, such as to get an object and to catch someone’s attention. We have already developed a framework to record genuine spontaneous speech of infants. We are constructing a multimodal infant behavior corpus, which enables us to elucidate human commonsense knowledge and its acquisition mechanism. Based on the observation of the corpus, we proposed a multimodal behavior description for observation of demonstrative expressions. We proved that the proposed model has the nearly 90% coverage in an open test of the behavior description task. The analysis using the model produced many valuable findings from multimodal viewpoints; for example, the change of “line of sight” from “object to person” to “person to object” means that the infant has obtained a better way to catch someone’s attention. Our intention-based analysis provided us with an infant behavior model that may apply to a likely behavior simulation system.

Automatic Emotional Degree Labeling for Speakers’ Anger Utterance during Natural Japanese Dialog
Yoshiko Arimoto | Sumio Ohno | Hitoshi Iida

This paper describes a method of automatic emotional degree labeling for speaker’s anger utterances during natural Japanese dialog. First, we explain how to record anger utterance naturally appeared in natural Japanese dialog. Manual emotional degree labeling was conducted in advance to grade the utterances by a 6 Likert scale to obtain a correct anger degree. Then experiments of automatic anger degree estimation were conducted to label an anger degree with each utterance by its acoustic features. Also estimation experiments were conducted with speaker-dependent datasets to find out any influence of individual emotional expression on automatic emotional degree labeling. As a result, almost all the speaker’s models show higher adjusted R square so that those models are superior to the speaker-independent model in those estimation capabilities. However, a residual between automatic emotional degree and manual emotional degree (0.73) is equivalent to those of speaker’s models. There still has a possibility to label utterances with the speaker-independent model.

A Real-World Emotional Speech Corpus for Modern Greek
Theodoros Kostoulas | Todor Ganchev | Iosif Mporas | Nikos Fakotakis

The present paper deals with the design and the annotation of a Greek real-world emotional speech corpus. The speech data consist of recordings collected during the interaction of naïve users with a smart-home dialogue system. Annotation of the speech data with respect to the uttered command and emotional state was performed. Initial experimentations towards recognizing negative emotional states were performed and the experimental results indicate the range of difficulties when dealing with real-world data.

Annotating Subjective Content in Meetings
Theresa Wilson

This paper presents an annotation scheme for marking subjective content in meetings, specifically the opinions and sentiments that participants express as part of their discussion. The scheme adapts concepts from the Multi-perspective Question Answering (MPQA) Annotation Scheme, an annotation scheme for marking opinions and attributions in the news. The adaptations reflect the differences in multiparty conversation as compared to text, as well as the overall goals of our project.

The AUTONOMATA Spoken Names Corpus
Henk van den Heuvel | Jean-Pierre Martens | Bart D’hoore | Kristof D’hanens | Nanneke Konings

In the Autonomata project we have collected a corpus of spoken name utterances with manually corrected phonemic transcriptions of these utterances. The corpus was designed with the intention to become a major resource for the development of automatic speech recognition engines that can achieve a high accuracy on the recognition of person and geographical names spoken in Dutch. The recorded names were selected so as to reveal the major pronunciation variations that a speech recognizer of e.g. a navigation system with speech input is going to be confronted with. This includes native speakers speaking foreign names and vice versa.

Acquiring Pronunciation Data for a Placenames Lexicon in a Less-Resourced Language
Briony Williams | Rhys James Jones

A new procedure is described for generating pronunciations for a dictionary of place-names in a less-resourced language (Welsh, spoken in Wales, UK). The method is suitable for use in a situation where there is a lack of skilled phoneticians with expertise in the language, but where there are native speakers available, as well as a text-to-speech synthesiser for the language. The lack of skilled phoneticians will make it impossible to carry out direct editing of pronunciations, and so a method has been devised that makes it possible for non-phonetician native speakers to edit pronunciations without knowledge of the phonology of the language. The key advance in this method is the use of “re-spelling” to indicate pronunciation in a linguistically-naïve fashion on the part of the non-specialist native speaker. The “re-spelled” forms of placenames are used to drive a set of specially-adapted letter-to-sound rules, which generate the pronunciations desired. The speech synthesiser is used to provide audio feedback to the native speaker editor for purposes of verification. A graphical user interface acts as the link between the database, the speech synthesiser and the native speaker editor. This method has been used successfully to generate pronunciations for placenames in Wales.

Constructing a Database of Non-Japanese Pronunciations of Different Japanese Romanizations
Reiko Kaji | Hajime Mochizuki

In this paper, we investigated how foreign language speakers pronounce Japanese words transliterated using two major Romanization systems, Hepburn and Kunrei. First, we recorded foreign language speakers pronouncing Romanized Japanese words. Next, Japanese speakers listened to the recordings and wrote down the words in Japanese Kana. Sets of each Romanized Japanese word, its correct Kana expression, its recorded reading, and the Kana dictated from the recording were stored in our database. We also investigated which of the two Romanization systems was pronounced more correctly by foreign language speakers by comparing the correctness of their respective readings. We also investigated which system’s pronunciation by foreign language speakers was judged as more acceptable by Japanese speakers.

Combined Systems for Automatic Phonetic Transcription of Proper Nouns
Antoine Laurent | Téva Merlin | Sylvain Meignier | Yannick Estève | Paul Deléglise

Large vocabulary automatic speech recognition (ASR) technologies perform well in known, controlled contexts. However recognition of proper nouns is commonly considered as a difficult task. Accurate phonetic transcription of a proper noun is difficult to obtain, although it can be one of the most important resources for a recognition system. In this article, we propose methods of automatic phonetic transcription applied to proper nouns. The methods are based on combinations of the rule-based phonetic transcription generator LIA_PHON and an acoustic-phonetic decoding system. On the ESTER corpus, we observed that the combined systems obtain better results than our reference system (LIA_PHON). The WER (Word Error Rate) decreased on segments of speech containing proper nouns, without affecting negatively the results on the rest of the corpus. On the same corpus, the Proper Noun Error Rate (PNER, which is a WER computed on proper nouns only), decreased with our new system.

Evaluation of Modules and Tools for Speech Synthesis: the ECESS Framework
Harald Höge | Zdravko Kacic | Bojan Kotnik | Matej Rojc | Nicolas Moreau | Horst-Udo Hain

The consortium ECESS (European Center of Excellence for Speech Synthesis) has set up a framework for evaluation of software modules and tools relevant for speech synthesis. Till now two lines of evaluation campaigns have been established: (1) Evaluation of the ECESS TTS modules (text processing, prosody, acoustic synthesis). (2) Evaluation of ECESS tools (pitch extraction, voice activity detection, phonetic segmentation). The functionality and interfaces of the ECESS TTS have been developed by a joint effort between ECESS and the EC-funded project TC-STAR . First evaluation campaigns were conducted within TC-STAR using the ECESS framework. As TC-STAR finished in March 2007, ECESS continued and extended the evaluation of ECESS TTS modules and tools by its own. Within the paper we describe a novel framework which allows performing remote evaluation for modules via the web. First experimental results are reported. Further the result of several evaluation campaigns for tools handling pitch extraction and voice activity detection are presented.

An Automatic Close Copy Speech Synthesis Tool for Large-Scale Speech Corpus Evaluation
Dafydd Gibbon | Jolanta Bachan

The production of rich multilingual speech corpus resources on a large scale is a requirement for many linguistic, phonetic and technological tasks, in both research and application domains. It is also time-consuming and therefore expensive. The human component in the resource creation process is also prone to inconsistencies, a situation frequently documented in cross-transcriber consistency studies. In the present case, corpora of three languages were to be evaluated and corrected: (1) Polish, a large automatically annotated and manually corrected single-speaker TTS unit-selection corpus in the BOSS Label File (BLF) format, (2) German and (3) English, the second and third being manually annotated multi-speaker story-telling learner corpora in Praat TextGrid format. A method is provided for supporting the evaluation and correction of time-aligned annotations for the three corpora by permitting a rapid audio screening of the annotations by an expert listener for the detection of perceptually conspicuous systematic or isolated errors in the annotations. The criterion for perceptual conspicuousness was provided by converting the annotation formats into the interface format required by the MBROLA speech synthesiser. The audio screening procedure is complementary to other methods of corpus evaluation and does not replace them.

A Flexible Wizard of Oz Environment for Rapid Prototyping
Stefan Scherer | Petra-Maria Strauß

This paper presents a freely-available, and flexible Wizard of Oz environment for rapid prototyping. The system is designed to investigate the required features of a dialog system using the commonly used Wizard of Oz approach. The idea is that the time consuming design of such a tool can be avoided by using the provided architecture. The developers can easily adapt the database and extend the tool to the individual needs of the targeted dialog system. The tool is designed as a client-server architecture and provides efficient input features and versatile output types including voice, or an avatar as visual output. Furthermore, a scenario, namely restaurant selection, is introduced in order to give an example application for a dialog system.

Building of a Speech Corpus Optimised for Unit Selection TTS Synthesis
Jindřich Matoušek | Daniel Tihelka | Jan Romportl

The paper deals with the process of designing a phonetically and prosodically rich speech corpus for unit selection speech synthesis. The attention is given mainly to the recording and verification stage of the process. In order to ensure as high quality and consistency of the recordings as possible, a special recording environment consisting of a recording session management and “pluggable” chain of checking modules was designed and utilised. Other stages, namely text collection (including) both phonetically and prosodically balanced sentence selection and a careful annotation on both orthographic and phonetic level are also mentioned.

Methodologies for Designing and Recording Speech Databases for Corpus Based Synthesis
Luís Oliveira | Sérgio Paulo | Luís Figueira | Carlos Mendes | Ana Nunes | Joaquim Godinho

In this paper we share our experience and describe the methodologies that we have used in designing and recording large speech databases for applications requiring speech synthesis. Given the growing demand for customized and domain specific voices for use in corpus based synthesis systems, we believe that good practices should be established for the creation of these databases which are a key factor in the quality of the resulting speech synthesizer. We will focus on the designing of the recording prompts, on the speaker selection procedure, on the recording setup and on the quality control of the resulting database. One of the major challenges was to assure the uniformity of the recordings during the 20 two-hour recording sessions that each speaker had to perform, to produce a total of 13 hours of recorded speech for each of the four speakers. This work was conducted in the scope of the Tecnovoz project that brought together 4 speech research centers and 9 companies with the goal of integrating speech technologies in a wide range of applications.

MISTRAL: a Statistical Machine Translation Decoder for Speech Recognition Lattices
Alexandre Patry | Philippe Langlais

This paper presents MISTRAL, an open source statistical machine translation decoder dedicated to spoken language translation. While typical machine translation systems take a written text as input, MISTRAL translates word lattices produced by automatic speech recognition systems. The lattices are translated in two passes using a phrase-based model. Our experiments reveal an improvement in BLEU when translating lattices instead of sentences returned by a speech recognition system.

LC-STAR II: Starring more Lexica
Ute Ziegenhain | Hanne Fersoe | Henk van den Heuvel | Asuncion Moreno

LC-STAR II is a follow-up project of the EU funded project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components, IST-2001-32216). LC-STAR II develops large lexica containing information for speech processing in ten languages targeting especially automatic speech recognition and text to speech synthesis but also other applications like speech-to-speech translation and tagging. The project follows by large the specifications developed within the scope of LC-STAR covering thirteen languages: Catalan, Finnish, German, Greek, Hebrew, Italian, Mandarin Chinese, Russian, Turkish, Slovenian, Spanish, Standard Arabic and US-English. The ten new LC-STAR II languages are: Brazilian-Portuguese, Cantonese, Czech, English-UK, French, Hindi, Polish, Portuguese, Slovak, and Urdu. The project started in 2006 with a lifetime of two years. The project is funded by a consortium, which includes Microsoft (USA), Nokia (Finland), NSC (Israel), Siemens (Germany) and Harmann/Becker (Germany). The project is coordinated by UPC (Spain) and validation is performed by SPEX (The Netherlands), and CST (Denmark). The developed language resources will be shared among partners.This paper presents a summary of the creation of word lists and lexica and an overview of adaptations of the specifications and conceptual representation model from LC-STAR to the new languages. The validation procedure will be presented too.

Communicating Unknown Words in Machine Translation
Matthias Eck | Stephan Vogel | Alex Waibel

A new approach to handle unknown words in machine translation is presented. The basic idea is to find definitions for the unknown words on the source language side and translate those definitions instead. Only monolingual resources are required, which generally offer a broader coverage than bilingual resources and are available for a large number of languages. In order to use this in a machine translation system definitions are extracted automatically from online dictionaries and encyclopedias. The translated definition is then inserted and clearly marked in the original hypothesis. This is shown to lead to significant improvements in (subjective) translation quality.

Developing Non-European Translation Pairs in a Medium-Vocabulary Medical Speech Translation System
Pierrette Bouillon | Sonia Halimi | Yukie Nakao | Kyoko Kanzaki | Hitoshi Isahara | Nikos Tsourakis | Marianne Starlander | Beth Ann Hockey | Manny Rayner

We describe recent work on MedSLT, a medium-vocabulary interlingua-based medical speech translation system, focussing on issues that arise when handling languages of which the grammar engineer has little or no knowledge. We show how we can systematically create and maintain multiple forms of grammars, lexica and interlingual representations, with some versions being used by language informants, and some by grammar engineers. In particular, we describe the advantages of structuring the interlingua definition as a simple semantic grammar, which includes a human-readable surface form. We show how this allows us to rationalise the process of evaluating translations between languages lacking common speakers, and also makes it possible to create a simple generic tool for debugging to-interlingua translation rules. Examples presented focus on the concrete case of translation between Japanese and Arabic in both directions.

CLIoS: Cross-lingual Induction of Speech Recognition Grammars
Nadine Perera | Michael Pitz | Manfred Pinkal

We present an approach for the cross-lingual induction of speech recognition grammars that separates the task of translation from the task of grammar generation. The source speech recognition grammar is used to generate phrases, which are translated by a common translation service. The target recognition grammar is induced by using the production rules of the source language, manually translated sentences and a statistical word alignment tool. We induce grammars for the target languages Spanish and Japanese. The coverage of the resulting grammars is evaluated on two corpora and compared quantitatively and qualitatively to a grammar induced with unsupervised monolingual grammar induction.

Construction and Analysis of Word-level Time-aligned Simultaneous Interpretation Corpus
Takahiro Ono | Hitomi Tohyama | Shigeki Matsubara

In this paper, quantitative analyses of the delay in Japanese-to-English (J-E) and English-to-Japanese (E-J) interpretations are described. The Simultaneous Interpretation Database of Nagoya University (SIDB) was used for the analyses. Beginning time and end time of each word were provided to the corpus using HMM-based phoneme segmentation, and the time lag between the corresponding words was calculated as the word-level delay. Word-level delay was calculated for 3,722 pairs and 4,932 pairs of words for J-E and E-J interpretations, respectively. The analyses revealed that J-E interpretation has much larger delay than E-J interpretation and that the difference of word order between Japanese and English affect the degree of delay.

Semantic Frame Annotation on the French MEDIA corpus
Marie-Jean Meurs | Frédéric Duvert | Frédéric Béchet | Fabrice Lefèvre | Renato de Mori

This paper introduces a knowledge representation formalism used for annotation of the French MEDIA dialogue corpus in terms of high level semantic structures. The semantic annotation, worked out according to the Berkeley FrameNet paradigm, is incremental and partially automated. We describe an automatic interpretation process for composing semantic structures from basic semantic constituents using patterns involving words and constituents. This process contains procedures which provide semantic compositions and generating frame hypotheses by inference. The MEDIA corpus is a French dialogue corpus recorded using a Wizard of Oz system simulating a telephone server for tourist information and hotel booking. It had been manually transcribed and annotated at the word and semantic constituent levels. These levels support the automatic interpretation process which provides a high level semantic frame annotation. The Frame based Knowledge Source we composed contains Frame definitions and composition rules. We finally provide some results obtained on the automatically-derived annotation.

Cross-Domain Dialogue Act Tagging
Nick Webb | Ting Liu | Mark Hepple | Yorick Wilks

We present recent work in the area of Cross-Domain Dialogue Act (DA) tagging. We have previously reported on the use of a simple dialogue act classifier based on purely intra-utterance features - principally involving word n-gram cue phrases automatically generated from a training corpus. Such a classifier performs surprisingly well, rivalling scores obtained using far more sophisticated language modelling techniques. In this paper, we apply these automatically extracted cues to a new annotated corpus, to determine the portability and generality of the cues we learn.

Building Mobile Spoken Dialogue Applications Using Regulus
Nikos Tsourakis | Maria Georgescul | Pierrette Bouillon | Manny Rayner

Regulus is an Open Source platform that supports construction of rule-based medium-vocabulary spoken dialogue applications. It has already been used to build several substantial speech-enabled applications, including NASA’s Clarissa procedure navigator and Geneva University’s MedSLT medical speech translator. System like these would be far more useful if they were available on a hand-held device, rather than, as with the present version, on a laptop. In this paper we describe the Open Source framework we have developed, which makes it possible to run Regulus applications on generally available mobile devices, using a distributed client-server architecture that offers transparent and reliable integration with different types of ASR systems. We describe the architecture, an implemented calendar application prototype hosted on a mobile device, and an evaluation. The evaluation shows that performance on the mobile device is as good as performance on a normal desktop PC.

Active Annotation in the LUNA Italian Corpus of Spontaneous Dialogues
Christian Raymond | Kepa Joseba Rodriguez | Giuseppe Riccardi

In this paper we present an active approach to annotate with lexical and semantic labels an Italian corpus of conversational human-human and Wizard-of-Oz dialogues. This procedure consists in the use of a machine learner to assist human annotators in the labeling task. The computer assisted process engages human annotators to check and correct the automatic annotation rather than starting the annotation from un-annotated data. The active learning procedure is combined with an annotation error detection to control the reliablity of the annotation. With the goal of converging as fast as possible to reliable automatic annotations minimizing the human effort, we follow the active learning paradigm, which selects for annotation the most informative training examples required to achieve a better level of performance. We show that this procedure allows to quickly converge on correct annotations and thus minimize the cost of human supervision.

A Comparison of Various Methods for Concept Tagging for Spoken Language Understanding
Stefan Hahn | Patrick Lehnen | Christian Raymond | Hermann Ney

The extraction of flat concepts out of a given word sequence is usually one of the first steps in building a spoken language understanding (SLU) or dialogue system. This paper explores five different modelling approaches for this task and presents results on a French state-of-the-art corpus, MEDIA. Additionally, two log-linear modelling approaches could be further improved by adding morphologic knowledge. This paper goes beyond what has been reported in the literature. We applied the models on the same training and testing data and used the NIST scoring toolkit to evaluate the experimental results to ensure identical conditions for each of the experiments and the comparability of the results. Using a model based on conditional random fields, we achieve a concept error rate of 11.8% on the MEDIA evaluation corpus.

Morphosyntactic Resources for Automatic Speech Recognition
Stéphane Huet | Guillaume Gravier | Pascale Sébillot

Texts generated by automatic speech recognition (ASR) systems have some specificities, related to the idiosyncrasies of oral productions or the principles of ASR systems, that make them more difficult to exploit than more conventional natural language written texts. This paper aims at studying the interest of morphosyntactic information as a useful resource for ASR. We show the ability of automatic methods to tag outputs of ASR systems, by obtaining a tag accuracy similar for automatic transcriptions to the 95-98 % usually reported for written texts, such as newspapers. We also demonstrate experimentally that tagging is useful to improve the quality of transcriptions by using morphosyntactic information in a post-processing stage of speech decoding. Indeed, we obtain a significant decrease of the word error rate with experiments done on French broadcast news from the ESTER corpus; we also notice an improvement of the sentence error rate and observe that a significant number of agreement errors are corrected.

STC-TIMIT: Generation of a Single-channel Telephone Corpus
Nicolás Morales | Javier Tejedor | Javier Garrido | José Colás | Doroteo T. Toledano

This paper describes a new speech corpus, STC-TIMIT, and discusses the process of design, development and its distribution through LDC. The STC-TIMIT corpus is derived from the widely used TIMIT corpus by sending it through a real and single telephone channel. TIMIT is phonetically balanced, covers the dialectal diversity in continental USA and has been extensively used as a benchmark for speech recognition algorithms, especially in early stages of development. The experimental usability of TIMIT has been increased eventually with the creation of derived corpora, passing the original data through different channels. One such example is the well-known NTIMIT corpus, where the original files in TIMIT are re-recorded after being sent through different telephone calls, resulting in a corpus that characterizes telephone channels in a wide sense. In STC-TIMIT, we followed a similar procedure, but the whole corpus was transmitted in a single telephone call with the goal of obtaining data from a real and yet highly stable telephone channel across the whole corpus. Files in STC-TIMIT are aligned to those of TIMIT with a theoretical precision of 0.125 ms, making TIMIT labels valid for the new corpus. The experimental section presents several results on speech recognition accuracy.

LILA: Cellular Telephone Speech Databases from Asia
Eric Sanders | Asuncion Moreno | Herbert Tropf | Lynette Melnar | Nurit Dekel | Breanna Gillies | Niklas Paulsson

The goal of the LILA project was the collection of speech databases over cellular telephone networks of five languages in three Asian countries. Three languages were recorded in India: Hindi by first language speakers, Hindi by second language speakers and Indian English. Furthermore, Mandarin was recorded in China and Korean in South-Korea. The databases are part of the SpeechDat-family and follow the SpeechDat rules in many respects. All databases have been finished and have passed the validation tests. Both Hindi databases and the Korean database will be available to the public for sale.

JURISDIC: Polish Speech Database for Taking Dictation of Legal Texts
Grazyna Demenko | Stefan Grocholewski | Katarzyna Klessa | Jerzy Ogórkiewicz | Agnieszka Wagner | Marek Lange | Daniel Śledziński | Natalia Cylwik

The paper provides an overview of the Polish Speech Database for taking dictation of legal texts, created for the purpose of LVCSR system for Polish. It presents background information about the design of the database and the requirements coming from its future uses. The applied method of the text corpora construction is presented as well as the database structure and recording scenarios. The most important details on the recording conditions and equipment are specified, followed by the description of the assessment methodology of recording quality, and the annotation specification and evaluation. Additionally, the paper contains current statistics from the database and the information about both the ongoing and planned stages of the database development process.

A Multi-sensor Speech Database with Applications towards Robust Speech Processing in hostile Environments
Tomas Dekens | Yorgos Patsis | Werner Verhelst | Frédéric Beaugendre | François Capman

In this paper, we present a database with speech in different types of background noises. The speech and noise were recorded with a set of different microphones and including some sensors that pick up the speech vibrations by making contact with the skull, the throat and the ear canal, respectively. As these sensors should be less sensitive to noise sources, our database can be especially useful for investigating the properties of these special microphones and comparing them to those of conventional microphones for applications requiring noise robust speech capturing and processing. In this paper we describe some experiments that were carried out using this database in the field of Voice Activity Detection (VAD). It is shown that the signals of a special microphone such as the throat microphone exhibit a high signal to noise ratio and that this property can be exploited to significantly improve the accuracy of a VAD algorithm.

The LECTRA Corpus - Classroom Lecture Transcriptions in European Portuguese
Isabel Trancoso | Rui Martins | Helena Moniz | Ana Isabel Mata | M. Céu Viana

This paper describes the corpus of university lectures that has been recorded in European Portuguese, and some of the recognition experiments we have done with it. The highly specific topic domain and the spontaneous speech nature of the lectures are two of the most challenging problems. Lexical and language model adaptation proved difficult given the scarcity of domain material in Portuguese, but improvements can be achieved with unsupervised acoustic model adaptation. From the point of view of the study of spontaneous speech characteristics, namely disflluencies, the LECTRA corpus has also proved a very valuable resource.

ALC: Alcohol Language Corpus
Florian Schiel | Christian Heinrich | Sabine Barfüßer | Thomas Gilg

A number of forensic studies published during the last 50 years report that intoxication with alcohol influences speech in a way that is made manifest in certain features of the speech signal. However, most of these studies are based on data that are not publicly available nor of statistically sufficient size. Furthermore, in spite of the positive reports nobody ever successfully implemented a method to detect alcoholic intoxication from the speech signal. The Alcohol Language Corpus (ALC) aims to answer these open questions by providing a publicly available large and statistically sound corpus of intoxicated and sober speech. This paper gives a detailed description of the corpus features and methodology. Also, we will present some preliminary results on a series of verifications about reported potential features that are claimed to reliably indicate alcoholic intoxication.

Design of a Multimodal Database for Research on Automatic Detection of Severe Apnoea Cases
Rubén Fernández | Luis A. Hernández | Eduardo López | José Alcázar | Guillermo Portillo | Doroteo T. Toledano

The aim of this paper is to present the design of a multimodal database suitable for research on new possibilities for automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases can be very useful to give priority to their early treatment optimizing the expensive and time-consuming tests of current diagnosis methods based on full overnight sleep in a hospital. This work is part of an on-going collaborative project between medical and signal processing groups towards the design of a multimodal database as an innovative resource to promote new research efforts on automatic OSA diagnosis through speech and image processing technologies. In this contribution we present the multimodal design criteria derived from the analysis of specific voice properties related to OSA physiological effects as well as from the morphological facial characteristics in apnoea patients. Details on the database structure and data collection methodology are also given as it is intended to be an open resource to promote further research in this field. Finally, preliminary experimental results on automatic OSA voice assessment are presented for the collected speech data in our OSA multimodal database. Standard GMM speaker recognition techniques obtain an overall correct classification rate of 82%. This represents an initial promising result underlining the interest of this research framework and opening further perspectives for improvement using more specific speech and image recognition technologies.

Test Collections for Spoken Document Retrieval from Lecture Audio Data
Tomoyosi Akiba | Kiyoaki Aikawa | Yoshiaki Itoh | Tatsuya Kawahara | Hiroaki Nanjo | Hiromitsu Nishizaki | Norihito Yasuda | Yoichi Yamashita | Katunobu Itou

The Spoken Document Processing Working Group, which is part of the special interest group of spoken language processing of the Information Processing Society of Japan, is developing a test collection for evaluation of spoken document retrieval systems. A prototype of the test collection consists of a set of textual queries, relevant segment lists, and transcriptions by an automatic speech recognition system, allowing retrieval from the Corpus of Spontaneous Japanese (CSJ). From about 100 initial queries, application of the criteria that a query should have more than five relevant segments that consist of about one minute speech segments yielded 39 queries. Targeting the test collection, an ad hoc retrieval experiment was also conducted to assess the baseline retrieval performance by applying a standard method for spoken document retrieval.

In-car Speech Data Collection along with Various Multimodal Signals
Akira Ozaki | Sunao Hara | Takashi Kusakawa | Chiyomi Miyajima | Takanori Nishino | Norihide Kitaoka | Katunobu Itou | Kazuya Takeda

In this paper, a large-scale real-world speech database is introduced along with other multimedia driving data. We designed a data collection vehicle equipped with various sensors to synchronously record twelve-channel speech, three-channel video, driving behavior including gas and brake pedal pressures, steering angles, and vehicle velocities, physiological signals including driver heart rate, skin conductance, and emotion-based sweating on the palms and soles, etc. These multimodal data are collected while driving on city streets and expressways under four different driving task conditions including two kinds of monologues, human-human dialog, and human-machine dialog. We investigated the response timing of drivers against navigator utterances and found that most overlapped with the preceding utterance due to the task characteristics and the features of Japanese. When comparing utterance length, speaking rate, and the filler rate of driver utterances in human-human and human-machine dialogs, we found that drivers tended to use longer and faster utterances with more fillers to talk with humans than machines.

Developing Corpus of Japanese Classroom Lecture Speech Contents
Masatoshi Tsuchiya | Satoru Kogure | Hiromitsu Nishizaki | Kengo Ohta | Seiichi Nakagawa

This paper explains our developing Corpus of Japanese classroom Lecture speech Contents (henceforth, denoted as CJLC). Increasing e-Learning contents demand a sophisticated interactive browsing system for themselves, however, existing tools do not satisfy such a requirement. Many researches including large vocabulary continuous speech recognition and extraction of important sentences against lecture contents are necessary in order to realize the above system. CJLC is designed as their fundamental basis, and consists of speech, transcriptions, and slides that were collected in real university classroom lectures. This paper also explains the difference about disfluency acts between classroom lectures and academic presentations.

The ATCOSIM Corpus of Non-Prompted Clean Air Traffic Control Speech
Konrad Hofbauer | Stefan Petrik | Horst Hering

Air traffic control (ATC) is based on voice communication between pilots and controllers and uses a highly task and domain specific language. Due to this very reason, spoken language technologies for ATC require domain-specific corpora, of which only few exist to this day. The ATCOSIM Air Traffic Control Simulation Speech corpus is a speech database of non-prompted and clean ATC operator speech. It consists of ten hours of speech data, which were recorded in typical ATC control room conditions during ATC real-time simulations. The database includes orthographic transcriptions and additional information on speakers and recording sessions. The ATCOSIM corpus is publicly available and provided online free of charge. In this paper, we first give an overview of ATC related corpora and their shortcomings. We then show the difficulties in obtaining operational ATC speech recordings and propose the use of existing ATC real-time simulations. We describe the recording, transcription, production and validation process of the ATCOSIM corpus, and outline an application example for automatic speech recognition in the ATC domain.

The MoveOn Motorcycle Speech Corpus
Thomas Winkler | Theodoros Kostoulas | Richard Adderley | Christian Bonkowski | Todor Ganchev | Joachim Köhler | Nikos Fakotakis

A speech and noise corpus dealing with the extreme conditions of the motorcycle environment is developed within the MoveOn project. Speech utterances in British English are recorded and processed approaching the issue of command and control and template driven dialog systems on the motorcycle. The major part of the corpus comprises noisy speech and environmental noise recorded on a motorcycle, but several clean speech recordings in a silent environment are also available. The corpus development focuses on distortion free recordings and accurate descriptions of both recorded speech and noise. Not only speech segments are annotated but also annotation of environmental noise is performed. The corpus is a small-sized speech corpus with about 12 hours of clean and noisy speech utterances and about 30 hours of segments with environmental noise without speech. This paper addresses the motivation and development of the speech corpus and finally presents some statistics and results of the database creation.

Audio Database in Support of Potentiel Threat and Crisis Situation Management
Stavros Ntalampiras | Ilyas Potamitis | Todor Ganchev | Nikos Fakotakis

This paper describes a corpus consisting of audio data for automatic space monitoring based solely on the perceived acoustic information. The particular database is created as part of a project aiming at the detection of abnormal events, which lead to life-threatening situations or property damage. The audio corpus is composed of vocal reactions and environmental sounds that are usually encountered in atypical situations. The audio data is composed of three parts: Phase I - professional sound effects collections, Phase II recordings obtained from action and drama movies and Phase III - vocal reactions related to real-world emergency events as retrieved from television, radio broadcast news, documentaries etc. The annotation methodology is given in details along with preliminary classification results and statistical analysis of the dataset regarding Phase I. The main objective of such a dataset is to provide training data for automatic recognition machines that detect hazardous situations and to provide security enhancement in public environments, which otherwise require human supervision.

CallSurf: Automatic Transcription, Indexing and Structuration of Call Center Conversational Speech for Knowledge Extraction and Query by Content
Martine Garnier-Rizet | Gilles Adda | Frederik Cailliau | Sylvie Guillemin-Lanne | Claire Waast-Richard | Lori Lamel | Stephan Vanni | Claire Waast-Richard

Being the client’s first interface, call centres worldwide contain a huge amount of information of all kind under the form of conversational speech. If accessible, this information can be used to detect eg. major events and organizational flaws, improve customer relations and marketing strategies. An efficient way to exploit the unstructured data of telephone calls is data-mining, but current techniques apply on text only. The CallSurf project gathers a number of academic and industrial partners covering the complete platform, from automatic transcription to information retrieval and data mining. This paper concentrates on the speech recognition module as it discusses the collection, the manual transcription of the training corpus and the techniques used to build the language model. The NLP techniques used to pre-process the transcribed corpus for data mining are POS tagging, lemmatization, noun group and named entity recognition. Some of them have been especially adapted to the conversational speech characteristics. POS tagging and preliminary data mining results obtained on the manually transcribed corpus are briefly discussed.