Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect

Yo Sato; Kevin Heffernan

Abstract

We present a universal, character-based method for representing sentences, from which the distance between any pair of sentences can be calculated. With a small alphabet, the representation can function as a proxy for phonemes. As one of its main uses, we carry out dialect clustering: we cluster a corpus that mixes dialects and sub-languages into sub-groups and check whether these coincide with the conventional boundaries of dialects and sub-languages. Using data from multiple Japanese dialects and multiple Slavic languages, we report how well each group clusters, as a partial response to the question of what separates languages from dialects.

Anthology ID: 2020.lrec-1.124
Volume: Proceedings of the 12th Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Venue: LREC
Publisher: European Language Resources Association
Pages: 985–990
Language: English
URL: https://aclanthology.org/2020.lrec-1.124
Cite (ACL): Yo Sato and Kevin Heffernan. 2020. Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 985–990, Marseille, France. European Language Resources Association.
Cite (Informal): Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect (Sato & Heffernan, LREC 2020)
PDF: https://preview.aclanthology.org/update-css-js/2020.lrec-1.124.pdf
