@inproceedings{nguyen-thanh-bui-doan-2006-word,
    title = "Word Segmentation for {V}ietnamese Text Categorization: An {I}nternet-based Statistic and Genetic Algorithm Approach",
    author = "Nguyen Thanh, Hung  and
      Bui Doan, Khanh",
    editor = "Mertens, Piet  and
      Fairon, C{\'e}drick  and
      Dister, Anne  and
      Watrin, Patrick",
    booktitle = "Actes de la 13{\`e}me conf{\'e}rence sur le Traitement Automatique des Langues Naturelles. Posters",
    month = apr,
    year = "2006",
    address = "Leuven, Belgique",
    publisher = "ATALA",
    url = "https://preview.aclanthology.org/ingest-emnlp/2006.jeptalnrecital-poster.20/",
    pages = "561--570",
    abstract = "This paper suggests a novel Vietnamese segmentation approach for text categorization. Instead of using an annotated training corpus or a lexicon, which are still lacking for Vietnamese, we use both statistical information extracted directly from a commercial search engine and a genetic algorithm to find the optimal routes to segmentation. The extracted information includes document frequency and n-gram mutual information. Our experimental results, obtained on the segmentation and categorization of online news abstracts, are very promising: the approach matches nearly 80{\%} of human judgment on segmentation and achieves over 90{\%} micro-averaged F1 in categorization. The processing time is less than one second per document when statistical information is cached."
}