Karl Neergaard


2016

pdf
Database of Mandarin Neighborhood Statistics
Karl Neergaard | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In the design of controlled experiments with language stimuli, researchers from psycholinguistic, neurolinguistic, and related fields, require language resources that isolate variables known to affect language processing. This article describes a freely available database that provides word level statistics for words and nonwords of Mandarin, Chinese. The featured lexical statistics include subtitle corpus frequency, phonological neighborhood density, neighborhood frequency, and homophone density. The accompanying word descriptors include pinyin, ascii phonetic transcription (sampa), lexical tone, syllable structure, dominant PoS, and syllable, segment and pinyin lengths for each phonological word. It is designed for researchers particularly concerned with language processing of isolated words and made to accommodate multiple existing hypotheses concerning the structure of the Mandarin syllable. The database is divided into multiple files according to the desired search criteria: 1) the syllable segmentation schema used to calculate density measures, and 2) whether the search is for words or nonwords. The database is open to the research community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.

pdf
EVALution-MAN: A Chinese Dataset for the Training and Evaluation of DSMs
Liu Hongchao | Karl Neergaard | Enrico Santus | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Distributional semantic models (DSMs) are currently being used in the measurement of word relatedness and word similarity. One shortcoming of DSMs is that they do not provide a principled way to discriminate different semantic relations. Several approaches have been adopted that rely on annotated data either in the training of the model or later in its evaluation. In this paper, we introduce a dataset for training and evaluating DSMs on semantic relations discrimination between words, in Mandarin, Chinese. The construction of the dataset followed EVALution 1.0, which is an English dataset for the training and evaluating of DSMs. The dataset contains 360 relation pairs, distributed in five different semantic relations, including antonymy, synonymy, hypernymy, meronymy and nearsynonymy. All relation pairs were checked manually to estimate their quality. In the 360 word relation pairs, there are 373 relata. They were all extracted and subsequently manually tagged according to their semantic type. The relatas’ frequency was calculated in a combined corpus of Sinica and Chinese Gigaword. To the best of our knowledge, EVALution-MAN is the first of its kind for Mandarin, Chinese.

2015

pdf
Graph Theoretic Features of the Adult Mental lexicon Predict Language Production in Mandarin: Clustering Coefficient
Karl Neergaard | Chu-Ren Huang
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters