Bengali dataset format
======================
<word><TAB><part of speech><TAB><lemma>

The end of a sentence is indicated by a blank line.
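
The format above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset distribution: the function name read_sentences and the placeholder tokens in the sample are hypothetical, and only the three-column layout and the blank-line sentence boundary come from the description above.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str, str]  # (word, part of speech, lemma)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of (word, pos, lemma) tuples."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line marks the end of a sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"expected 3 tab-separated fields, got {len(fields)}: {line!r}"
            )
        word, pos, lemma = fields
        sentence.append((word, pos, lemma))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Placeholder tokens for illustration only; not taken from the dataset.
sample = "word1\tNNP\tlemma1\nword2\tUNK\tlemma2\n\nword3\tNNP\tlemma3\n"
sentences = list(read_sentences(sample.splitlines()))
```

To read a real file, an open file object can be passed in place of the sample lines, for example open(path, encoding="utf-8"); the UTF-8 encoding is an assumption, since the description above does not state one.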

One linguist completed the annotation in 2 months. A second person then checked it, and any differences were resolved.

For each of the 91 short stories of Tagore, we calculated the ratio (# total tokens / # distinct tokens). The 11 stories with the lowest ratios were selected (lower is better). They are as follows: ANADHIKAR PRABESH, DURBUDDHI, EKTI KHUDRO PURATAN GALPO, GINNI, PUTROJOGGO, RITIMOTO NOVEL, SADAR O ANDAR, SANASKAR, SESH PROSHKAR, UDDHAR, ULUKHORER BIPOD.
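
The selection criterion can be illustrated with a short Python sketch. It rests on assumptions: the text does not say how the stories were tokenized, so whitespace splitting is assumed here, and the names token_ratio and select_stories as well as the toy inputs are hypothetical.

```python
from typing import Dict, List


def token_ratio(tokens: List[str]) -> float:
    """Return (# total tokens) / (# distinct tokens) for one story."""
    if not tokens:
        raise ValueError("story has no tokens")
    return len(tokens) / len(set(tokens))


def select_stories(stories: Dict[str, List[str]], k: int) -> List[str]:
    """Return the k story titles with the lowest ratio (lower is better)."""
    return sorted(stories, key=lambda title: token_ratio(stories[title]))[:k]


# Toy data for illustration only; whitespace tokenization is an assumption.
toy = {
    "A": "x y z x".split(),  # 4 tokens, 3 distinct
    "B": "x x x x".split(),  # 4 tokens, 1 distinct
    "C": "p q r s".split(),  # 4 tokens, 4 distinct
}
best = select_stories(toy, k=2)
```

A ratio of 1.0 means every token in the story is distinct; higher values mean more repetition.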

In addition to the above, 17 news articles from http://www.anandabazar.com/ were collected, covering the following domains: Travelogue, Science, Business, Country, Food, Psychology, Health, Animal, Archaeology, Education and Politics.

List of POS tags in Bengali dataset:
====================================
NNP	-> proper noun
POSঅব্য	-> a composite tag covering prepositions, conjunctions and interjections
POSক্রি	-> verb
POSক্রিবিণ	-> adverb
POSবি	-> noun
POSবিণ	-> adjective
POSবিণবিণ	-> a Bengali-specific POS tag indicating an adjective modifier
POSসর্ব	-> pronoun
UNK	-> unknown token


Hindi dataset format
====================
<word><TAB><part of speech><TAB><lemma>

The end of a sentence is indicated by a blank line.

List of POS tags in Hindi dataset:
==================================
adj	-> adjective
adv	-> adverb
avy	-> a composite tag covering prepositions, conjunctions and interjections
n	-> noun
nnp	-> proper noun
num	-> number
pn	-> pronoun
sym	-> symbol
unk	-> unknown token
v	-> verb
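
The two tag lists can be kept as lookup tables to check a file for unexpected tags. This is a sketch only: the tags and glosses are copied from the lists in this file, while the names BENGALI_TAGS, HINDI_TAGS and unexpected_tags are hypothetical.

```python
from typing import Dict, Iterable, List

# Tag inventories as listed in this file.
BENGALI_TAGS: Dict[str, str] = {
    "NNP": "proper noun",
    "POSঅব্য": "preposition, conjunction or interjection",
    "POSক্রি": "verb",
    "POSক্রিবিণ": "adverb",
    "POSবি": "noun",
    "POSবিণ": "adjective",
    "POSবিণবিণ": "adjective modifier",
    "POSসর্ব": "pronoun",
    "UNK": "unknown token",
}

HINDI_TAGS: Dict[str, str] = {
    "adj": "adjective",
    "adv": "adverb",
    "avy": "preposition, conjunction or interjection",
    "n": "noun",
    "nnp": "proper noun",
    "num": "number",
    "pn": "pronoun",
    "sym": "symbol",
    "unk": "unknown token",
    "v": "verb",
}


def unexpected_tags(tags: Iterable[str], inventory: Dict[str, str]) -> List[str]:
    """Return the sorted tags that do not appear in the given inventory."""
    return sorted(set(tags) - set(inventory))


# "xyz" is a made-up tag used only to show the check.
problems = unexpected_tags(["n", "v", "xyz", "adj"], HINDI_TAGS)
```

The check is case-sensitive, so the uppercase Bengali tags NNP and UNK are reported as unexpected when compared against the Hindi inventory.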
