Khanh Bui Doan




2006

Word Segmentation for Vietnamese Text Categorization: An Internet-based Statistic and Genetic Algorithm Approach
Hung Nguyen Thanh | Khanh Bui Doan
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

This paper proposes a novel Vietnamese word segmentation approach for text categorization. Instead of relying on an annotated training corpus or a lexicon, both of which are still lacking for Vietnamese, we use statistical information extracted directly from a commercial search engine together with a genetic algorithm to search for the optimal segmentation. The extracted information includes document frequency and n-gram mutual information. Our experimental results on the segmentation and categorization of online news abstracts are very promising: the approach matches human judgment on segmentation in nearly 80% of cases and achieves over 90% micro-averaged F1 in categorization. The processing time is less than one second per document when the statistical information is cached.
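
As a rough illustration of how document frequencies could drive segmentation, the sketch below computes a standard pointwise mutual information score for a syllable pair and sums it into a fitness value that a genetic algorithm might use to rank candidate segmentations. This is not the paper's formula: the abstract does not give its exact mutual information definition or fitness function, so `pmi_from_doc_freq`, `fitness`, and all counts are hypothetical.

```python
import math


def pmi_from_doc_freq(df_xy, df_x, df_y, n_docs):
    """Pointwise mutual information (base 2) estimated from document counts.

    Assumption: probabilities are approximated as df / n_docs, where df is
    the number of documents a search engine reports for a query.
    """
    if min(df_xy, df_x, df_y) <= 0:
        return float("-inf")  # the pair (or a syllable) was never observed
    p_xy = df_xy / n_docs
    p_x = df_x / n_docs
    p_y = df_y / n_docs
    return math.log2(p_xy / (p_x * p_y))


def fitness(segmentation, doc_freq, n_docs):
    """Score a candidate segmentation (hypothetical fitness function).

    `segmentation` is a list of tuples of syllables. Only two-syllable
    segments contribute; single syllables score 0 in this simplified sketch.
    """
    total = 0.0
    for seg in segmentation:
        if len(seg) == 2:
            a, b = seg
            total += pmi_from_doc_freq(
                doc_freq.get(seg, 0),
                doc_freq.get((a,), 0),
                doc_freq.get((b,), 0),
                n_docs,
            )
    return total


# Hypothetical cached counts for the syllables of "học sinh" (student).
N_DOCS = 1000
DOC_FREQ = {("học",): 200, ("sinh",): 400, ("học", "sinh"): 100}

joined = [("học", "sinh")]      # one two-syllable word
split = [("học",), ("sinh",)]   # two one-syllable words
best = max([joined, split], key=lambda s: fitness(s, DOC_FREQ, N_DOCS))
```

With these made-up counts the joined segmentation scores higher because the pair co-occurs more often than independence would predict. A genetic algorithm would evaluate such a score over a population of candidate segmentations rather than over two hand-written ones.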