2006
pdf
abs
COMBINA-PT: A Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions
Amália Mendes
|
Sandra Antunes
|
Maria Fernanda Bacelar do Nascimento
|
João Miguel Casteleiro
|
Luísa Pereira
|
Tiago Sá
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese Multiword (MW) expressions. The objective of this on-going project is to compile a large lexical database of multiword (MW) units of the Portuguese language, automatically extracted from a balanced 50 million word corpus, and manually validated with the help of lexical association measures. MW expressions considered in the database include named entities and lexical associations with different degrees of cohesion, ranging from frozen groups, which undergo little or no variation, to lexical collocations composed of words that tend to occur together and that constitute syntactic dependencies, although with a low degree of fixedness. This new resource has a two-fold objective: (i) to be an important research tool which supports the development of MW expressions typologies and their lexicographic treatment; (ii) to be of major help in developing and evaluating language processing tools able of dealing with MW expressions.
pdf
abs
The African Varieties of Portuguese: Compiling Comparable Corpora and Analyzing Data-Derived Lexicon
Maria Fernanda Bacelar do Nascimento
|
José Bettencourt Gonçalves
|
Luísa Pereira
|
Antónia Estrela
|
Afonso Pereira
|
Rui Santos
|
Sancho M. Oliveira
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Linguistic Resources for the Study of the Portuguese African Varieties is an ongoing project that aims at the constitution, treatment, analysis and availability of a corpus of the African varieties of Portuguese, with 3 million words of written and spoken texts, constituted by five comparable subcorpora, corresponding to the varieties of Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe. This material will allow intra and intercorpora comparative studies, which will make visible variations that result from discursive and pragmatic differences of each corpus and aspects of linguistic unity or diversity that characterise the spoken Portuguese of this referred five African countries. The five corpora are comparable in size (600,000 words each), in chronology (the last 30 years) and in types and genres (24,000 spoken words and c. 580,000 written words, the last belonging to newspapers, literature and varia). The corpus is automatically annotated and after the extraction of alphabetical lists of lexical forms, these data will be automatically lemmatised. Five separated lists of vocabulary for each variety will be established. A tool for word extraction and preferential calculus according to predefined indexes in order to achieve lexicon comparison of the African Portuguese Varieties is being developed. Concordances extraction will be also performed.
pdf
bib
abs
Corpus-based extraction and identification of Portuguese Multiword Expressions
Sandra Antunes
|
Maria Fernanda Bacelar do Nascimento
|
João Miguel Casteleiro
|
Amália Mendes
|
Luísa Pereira
|
Tiago Sá
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters
This presentation reports on an on-going project aimed at building a large lexical database of corpus-extracted multiword (MW) expressions for the Portuguese language. MW expressions were automatically extracted from a balanced 50 million word corpus compiled for this project, furthermore these were statistically interpreted using lexical association measures, followed by a manual validation process. The lexical database covers different types of MW expressions, from named entities to lexical associations with different degrees of cohesion, ranging from totally frozen idioms to favoured co-occurring forms, such as collocations. We aim to achieve two main objectives with this resource. Firstly to build on the large set of data of different types of MW expressions, thus revising existing typologies of collocations and integrating them in a larger theory of MW units. Secondly, to use the extensive hand-checked data as training data to evaluate existing statistical lexical association measures.
2004
pdf
abs
Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons
Maria Fernanda Bacelar do Nascimento
|
Amália Mendes
|
Luísa Pereira
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL's webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at the present, more than 300 million words, taken by sampling from several types of written text (literary, newspaper, technical, didactic, juridical, parlamentary, etc.) and spoken text (informal and formal), pertaining to national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese). The LRs available for on-line queries include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from CRPC for specific CLUL's projects and now available for on-line queries; b) a published sample of "Português Fundamental", a spoken CRPC subcorpus, available for texts download; c) a frequency lexicon extracted from a CRPC subcorpus available for both on-line queries and download. Other RLs available for Portuguese are also referred: C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages, a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon and the SIMPLE Lexicon.
2000
pdf
Portuguese Corpora at CLUL
Maria Fernanda Bacelar do Nascimento
|
Luisa Pereira
|
João Saramago
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)