Challenges in the development of annotated corpora of computer-mediated communication in Indian Languages: A Case of Hindi

Ritesh Kumar


Abstract
This paper describes an ongoing effort to compile and annotate a large corpus of computer-mediated communication (CMC) in Hindi. It covers how the corpus is compiled, its basic structure, its annotation, and the challenges faced in creating such a corpus. It also describes the technologies developed for processing the data, adding metadata, and annotating the corpus. Because it is a corpus of written communication, it poses a distinctive challenge for the annotation process. Besides part-of-speech (POS) annotation, the corpus will also be annotated at higher levels of representation. Once fully developed, it will be a useful Hindi resource for research in linguistics, NLP, and social sciences concerned with communication, particularly computer-mediated communication. In addition, the challenges discussed here, and the ways they are tackled, could serve as a model for developing corpora of computer-mediated communication in other Indian languages. Furthermore, the technologies developed for constructing this corpus will be made publicly available.
Anthology ID:
L12-1355
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
299–302
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/619_Paper.pdf
Cite (ACL):
Ritesh Kumar. 2012. Challenges in the development of annotated corpora of computer-mediated communication in Indian Languages: A Case of Hindi. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 299–302, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Challenges in the development of annotated corpora of computer-mediated communication in Indian Languages: A Case of Hindi (Kumar, LREC 2012)
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/619_Paper.pdf