A Syntactically Annotated Corpus of Tibetan

Andreas Wagner, Bettina Zeisler


Abstract
This paper describes the creation of a syntactically annotated Tibetan corpus. This corpus forms a part of the TUSNELDA collection of corpora and databases for linguistic research. It will ultimately comprise spoken and written Tibetan texts originating from different regions and historical epochs. These texts are annotated with several kinds of linguistic information, in particular POS tags, phrases, argument structures of verbs, clauses and sentences, as well as several kinds of discourse units and textual segments. The annotation is done in XML. The primary research interest which guides the development of the corpus is the investigation of cross-clausal references, especially the relation between empty arguments (i.e. arguments not overtly realised in a clause) and their antecedents in previous clauses. For this purpose, such references are explicitly encoded so that they can be qualitatively and quantitatively evaluated with the help of standard XML techniques such as XPath search and XSLT transformations. Apart from this primary research interest, we expect that our corpus will be useful for other projects concerning Tibetan and related languages. Like other data in TUSNELDA, it will be made accessible via a WWW query interface.
Anthology ID:
L04-1156
Volume:
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Month:
May
Year:
2004
Address:
Lisbon, Portugal
Editors:
Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa, Raquel Silva
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2004/pdf/293.pdf
DOI:
Bibkey:
Cite (ACL):
Andreas Wagner and Bettina Zeisler. 2004. A Syntactically Annotated Corpus of Tibetan. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).
Cite (Informal):
A Syntactically Annotated Corpus of Tibetan (Wagner & Zeisler, LREC 2004)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2004/pdf/293.pdf