Abstract
Web pages do not offer reliable metadata concerning their creation date and time. However, obtaining the document creation time is a necessary step before temporal normalization systems can be applied to web pages. In this paper, we present DCTFinder, a system that parses a web page and extracts its title and creation date from its content. DCTFinder combines heuristic title detection, supervised learning with Conditional Random Fields (CRFs) for document date extraction, and rule-based creation time recognition. Such a system enables further in-depth and efficient temporal analysis of web pages. Evaluation on three corpora of English and French web pages indicates that the tool can extract document creation times with reasonably high accuracy (between 87% and 92%). DCTFinder is freely available at http://sourceforge.net/projects/dctfinder/, together with all resources (vocabulary and annotated documents) built for training and evaluating the system in English and French, and the English trained model itself.
- Anthology ID:
- L14-1270
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2037–2042
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/3_Paper.pdf
- Cite (ACL):
- Xavier Tannier. 2014. Extracting News Web Page Creation Time with DCTFinder. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2037–2042, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- Extracting News Web Page Creation Time with DCTFinder (Tannier, LREC 2014)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/3_Paper.pdf