Clustering Web Pages to Identify Emerging Textual Patterns

Marina Santini


Abstract
The Web has triggered many adjustments in many fields. It has also had a strong impact on the genre repertoire. Novel genres have already emerged, e.g. blogs and FAQs. Presumably, other new genres are still in formation, because the Web is still fluid and in constant change. In this paper we present an experiment that explores the possibility of automatically detecting the emerging textual patterns that are slowly taking shape on the Web. Emerging textual patterns can develop into novel Web genres or novel text types in the near future. The experimental setup includes a collection of unclassified web pages, two sets of features and the use of cluster analysis. Results are encouraging and deserve further investigation.
Anthology ID:
2005.jeptalnrecital-recitalcourt.12
Volume:
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues (articles courts)
Month:
June
Year:
2005
Address:
Dourdan, France
Venue:
JEP/TALN/RECITAL
Publisher:
ATALA
Pages:
703–708
URL:
https://aclanthology.org/2005.jeptalnrecital-recitalcourt.12
Cite (ACL):
Marina Santini. 2005. Clustering Web Pages to Identify Emerging Textual Patterns. In Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues (articles courts), pages 703–708, Dourdan, France. ATALA.
Cite (Informal):
Clustering Web Pages to Identify Emerging Textual Patterns (Santini, JEP/TALN/RECITAL 2005)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2005.jeptalnrecital-recitalcourt.12.pdf