@inproceedings{a-augenstein-2020-2kenize,
    title = "2kenize: Tying Subword Sequences for {C}hinese Script Conversion",
    author = "A, Pranav  and
      Augenstein, Isabelle",
    editor = "Jurafsky, Dan  and
      Chai, Joyce  and
      Schluter, Natalie  and
      Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2020.acl-main.648/",
    doi = "10.18653/v1/2020.acl-main.648",
    pages = "7257--7272",
    abstract = "Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have insufficient performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert pretraining dataset for topic classification. An error analysis reveals that our method{'}s particular strengths are in dealing with code mixing and named entities."
}