Luís Sarmento

In this paper we introduce a public resource named BACO (Base de Co-Ocorrências), a very large textual database built from the WPT03 collection, a publicly available crawl of the whole Portuguese web in 2003. BACO uses a generic relational database engine to store 1.5 million web documents in raw text (more than 6GB of plain text), corresponding to 35 million sentences and more than 1000 million words. BACO comprises four lexicon tables, namely a standard single-token lexicon and three n-gram tables (2-grams, 3-grams and 4-grams) with several hundred million entries, as well as a table containing 780 million co-occurrence pairs. We describe the design choices and explain the preparation tasks involved in loading the data into the relational database. We present several statistics regarding storage requirements and demonstrate how this resource is currently used.

Component Evaluation in a Question Answering System
Luís Fernando Costa | Luís Sarmento
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Automatic question answering (QA) is a complex task that lies at the crossroads of Natural Language Processing, Information Retrieval and Human Computer Interaction. A typical QA system has four modules: question processing, document retrieval, answer extraction and answer presentation. In each of these modules, a multitude of tools can be used. Evaluating the performance of each component is therefore of great importance, both to assess its impact on overall performance and to determine whether it is necessary or needs to be improved or replaced. This paper describes some experiments performed to evaluate several components of the question answering system Esfinge. We describe the experimental setup and present the results of an error analysis based on Esfinge's runtime logs. We present the results of a component analysis, which provides good insights into the importance of the individual components and pre-processing modules at various levels, namely stemming, named-entity recognition, PoS filtering and filtering of undesired answers. We also present the results of substituting the document source in which Esfinge searches for possible answers, and compare the results obtained using web sources such as Google, Yahoo and BACO, a large database of web documents in Portuguese.

Corpógrafo V3 - From Terminological Aid to Semi-automatic Knowledge Engineering
Luís Sarmento | Belinda Maia | Diana Santos | Ana Pinto | Luís Cabral
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we present Corpógrafo, a mature web-based environment for working with corpora, for terminology extraction, and for ontology development. We explain Corpógrafo's workflow and describe the most important information extraction methods used, namely its procedures for term extraction and for identifying definitions and semantic relations. We describe current Corpógrafo users and give a brief overview of the XML format currently used to export terminology databases. Finally, we present future improvements for this tool.