@inproceedings{lai-winterstein-2020-cifu,
    title = "{C}ifu: a Frequency Lexicon of {H}ong {K}ong {C}antonese",
    author = "Lai, Regine  and
      Winterstein, Gr{\'e}goire",
    editor = "Calzolari, Nicoletta  and
      B{\'e}chet, Fr{\'e}d{\'e}ric  and
      Blache, Philippe  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://preview.aclanthology.org/ingest-emnlp/2020.lrec-1.375/",
    pages = "3069--3077",
    language = "eng",
    ISBN = "979-10-95546-34-4",
    abstract = "This paper introduces Cifu, a lexical database for Hong Kong Cantonese (HKC) that offers phonological and orthographic information, frequency measures, and lexical neighborhood information for lexical items in HKC. Cifu is of use for NLP applications and the design and analysis of psycholinguistics experiments on HKC. We elaborate on the characteristics and challenges specific to HKC that were relevant in the design of Cifu. This includes lexical, orthographic and phonological aspects of HKC, word segmentation issues, the place of HKC in written media, and the availability of data. We discuss the measure of Neighborhood Density (ND), highlighting how the analytic nature of Cantonese and its writing system affect that measure. We justify using six different variations of ND, based on the possibility of inserting or deleting phonemes when searching for neighbors and on the choice of data for retrieving frequencies. Statistics about the four genres (written, adult spoken, children spoken and child-directed) within the dataset are discussed. We find that the lexical diversity of the child-directed speech genre is particularly low, compared to a size-matched written corpus. The correlations of word frequencies of different genres are all high, but generally decrease as word length increases."
}