In recent years, low-resource languages (LRLs) have seen a surge of interest, both because many tasks have been solved for larger languages and because LRLs pose distinctive challenges (data sparsity, scarcity of experts and expertise, unusual structural properties, etc.). In the wake of this interest, resources and technologies have been created for many of them. However, for some very small languages this has not yet led to a significant change. We focus here on one such language (Nogai) and on one larger small language (Maori). Since smaller languages in particular often have very similar sibling languages, or a larger and more accessible sister language, the rate of noise in data gathered on them so far is often high. We therefore present small corpora for our two case-study languages, obtained through web information retrieval, along with corpora for their noise-inducing distractor languages, and conduct a small language identification experiment in which documents are classified in a boolean fashion as either belonging or not belonging to the target language. We release our test corpora for these two scenarios in the format of the An Crubadan project (Scannell, 2007), together with a tool for unsupervised language identification using alphabet and toponym information.
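The boolean alphabet-and-toponym classification described above can be sketched roughly as follows. The alphabet set, toponym list and threshold are illustrative assumptions for exposition, not the parameters of the released tool:

```python
def looks_like_target(text, alphabet, toponyms, alpha_thresh=0.9):
    """Boolean language ID sketch: accept a document if most of its
    letters come from the target alphabet and it mentions at least
    one known toponym of the target language's area."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return False
    # fraction of characters drawn from the target alphabet
    in_alphabet = sum(c in alphabet for c in letters) / len(letters)
    # presence of at least one target-language toponym
    has_toponym = any(t in text.lower() for t in toponyms)
    return in_alphabet >= alpha_thresh and has_toponym

# Toy example with a Maori-like letter inventory and toponyms:
maori_letters = set("aeghikmnoprtuwāēīōū")
maori_places = ["aotearoa", "rotorua"]
looks_like_target("Kia ora, Aotearoa", maori_letters, maori_places)  # True
```

A distractor document in a sibling language would typically fail one of the two cues, either using letters outside the target alphabet or lacking target-language toponyms.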
In this paper, we investigate a covert labeling cue, namely the probability that a title (taking the Wikipedia titles as our example) is a noun. If this probability is very high, any list like or comparable to the Wikipedia titles can be used as a reliable word-class (or part-of-speech tag) predictor or noun lexicon. This may be especially useful in the case of Low Resource Languages (LRL), where labeled data is scarce, and putatively for Natural Language Processing (NLP) tasks such as Word Sense Disambiguation, Sentiment Analysis and Machine Translation. Profiting from the ease of digital publication on the web as opposed to print, LRL speaker communities produce resources such as Wikipedia and Wiktionary, which can be used for such an assessment. We provide statistical evidence for a strong noun bias in the Wikipedia titles from two corpora (English, Persian) and a dictionary (Japanese), as well as for a typologically balanced set of 17 languages including LRLs. Additionally, we conduct a small experiment on predicting noun tags for out-of-vocabulary items in part-of-speech tagging for English.
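As a rough illustration of how such a title list could serve as a noun predictor for out-of-vocabulary tokens, one might sketch the two steps (estimating the noun bias on a tagged sample, then backing off to NOUN via a lexicon lookup). The data and the simple lookup below are illustrative assumptions, not the paper's actual experiment:

```python
def title_noun_probability(tagged_titles):
    """Estimate the noun bias from a POS-tagged sample of titles,
    given as (title, tag) pairs."""
    nouns = sum(1 for _, tag in tagged_titles if tag == "NOUN")
    return nouns / len(tagged_titles)

def tag_oov(token, title_lexicon, fallback="X"):
    """Predict a tag for an out-of-vocabulary token: if it occurs
    in the title lexicon, bet on NOUN; otherwise use a fallback."""
    return "NOUN" if token.lower() in title_lexicon else fallback

# Toy tagged sample and lexicon:
sample = [("Paris", "NOUN"), ("Berlin", "NOUN"),
          ("Running", "VERB"), ("Gold", "NOUN")]
title_noun_probability(sample)          # 0.75
tag_oov("Paris", {"paris", "berlin"})   # "NOUN"
```

The closer the estimated probability is to 1, the more reliable the lexicon lookup becomes as a tagging back-off.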
TGermaCorp is a German text corpus whose primary sources are German literary texts dating from the sixteenth century to the present. The corpus is intended to represent its target language (German) in its syntactic, lexical, stylistic and chronological diversity. To this end, it is hand-annotated on several linguistic layers, including POS, lemma, named entities, multiword expressions, clauses, sentences and paragraphs. In order to situate TGermaCorp in comparison with more homogeneous corpora of contemporary everyday language, quantitative assessments of its syntactic and lexical diversity are provided. In this respect, TGermaCorp contributes to establishing characterising features for resource descriptions, which are needed for meaningful comparison of the ever-growing number of natural language resources. The assessments confirm the special role of proper names, whose distribution in text may influence lexical and syntactic diversity measures in rather trivial ways. TGermaCorp will be made available via hucompute.org.
We study the role of the second language of bilingual word embeddings in monolingual semantic evaluation tasks. We find strongly to weakly positive correlations between downstream task performance and the second language's similarity to the target language. Additionally, we show how bilingual word embeddings can be employed for the task of semantic language classification, and that joint semantic spaces vary in meaningful ways across second languages. Our results support the hypothesis that semantic language similarity is influenced by both structural similarity and geography/contact.
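A correlation of this kind can be quantified with, for example, the Pearson coefficient between per-language downstream scores and language similarity values. A self-contained sketch with invented numbers (the paper's actual scores and similarity measure are not reproduced here):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    lists of numbers, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example: similarity of each second language to the target
# vs. downstream task score obtained with that bilingual space.
similarity = [0.2, 0.5, 0.7, 0.9]
task_score = [0.41, 0.52, 0.58, 0.66]
pearson(similarity, task_score)  # close to 1: strong positive correlation
```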