@inproceedings{kim-etal-2018-word,
    title = "Word-like character n-gram embedding",
    author = "Kim, Geewook  and
      Fukui, Kazuki  and
      Shimodaira, Hidetoshi",
    editor = "Xu, Wei  and
      Ritter, Alan  and
      Baldwin, Tim  and
      Rahimi, Afshin",
    booktitle = "Proceedings of the 2018 {EMNLP} Workshop W-{NUT}: The 4th Workshop on Noisy User-generated Text",
    month = nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/W18-6120/",
    doi = "10.18653/v1/W18-6120",
    pages = "148--152",
    abstract = "We propose a new word embedding method called \textit{word-like character} n\textit{-gram embedding}, which learns distributed representations of words by embedding word-like character n-grams. Our method is an extension of recently proposed \textit{segmentation-free word embedding}, which directly embeds frequent character n-grams from a raw corpus. However, its n-gram vocabulary tends to contain too many non-word n-grams. We solved this problem by introducing an idea of \textit{expected word frequency}. Compared to the previously proposed methods, our method can embed more words, along with the words that are not included in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when the text in the corpus is in unsegmented language and contains many neologisms and informal words (e.g., Chinese SNS dataset). Our experimental results on Sina Weibo (a Chinese microblog service) and Twitter show that the proposed method can embed more words and improve the performance of downstream tasks."
}