Abstract
In this paper, we present an inferential model for text type and genre identification of Web pages, where text types are inferred using a modified form of Bayes’ theorem, and genres are derived using a few simple if-then rules. As the genre system on the Web is a complex phenomenon, and Web pages are usually more unpredictable and individualized than paper documents, we propose this approach as an alternative to unsupervised and supervised techniques. The inferential model allows a classification that can accommodate genres that are not entirely standardized, and is more capable of reading a Web page, which is mixed, rarely corresponding to an ideal type and often showing a mixture of genres or no genre at all. A proper evaluation of such a model remains an open issue.
- Anthology ID:
- 2006.jeptalnrecital-long.28
- Volume:
- Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
- Month:
- April
- Year:
- 2006
- Address:
- Leuven, Belgique
- Venue:
- JEP/TALN/RECITAL
- Publisher:
- ATALA
- Pages:
- 308–317
- URL:
- https://aclanthology.org/2006.jeptalnrecital-long.28
- Cite (ACL):
- Marina Santini. 2006. Identifying Genres of Web Pages. In Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs, pages 308–317, Leuven, Belgique. ATALA.
- Cite (Informal):
- Identifying Genres of Web Pages (Santini, JEP/TALN/RECITAL 2006)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2006.jeptalnrecital-long.28.pdf