Abstract
An essential operation in web corpus construction consists in retaining the desired content while discarding the rest. Another challenge finding one’s way through websites. This article introduces a text discovery and extraction tool published under open-source license. Its installation and use is straightforward, notably from Python and on the command-line. The software allows for main text, comments and metadata extraction, while also providing building blocks for web crawling tasks. A comparative evaluation on real-world data also shows its interest as well as the performance of other available solutions. The contributions of this paper are threefold: it references the software, features a benchmark, and provides a meaningful baseline for similar tasks. The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.- Anthology ID:
- 2021.acl-demo.15
- Volume:
- Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Heng Ji, Jong C. Park, Rui Xia
- Venues:
- ACL | IJCNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 122–131
- Language:
- URL:
- https://aclanthology.org/2021.acl-demo.15
- DOI:
- 10.18653/v1/2021.acl-demo.15
- Cite (ACL):
- Adrien Barbaresi. 2021. Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 122–131, Online. Association for Computational Linguistics.
- Cite (Informal):
- Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction (Barbaresi, ACL-IJCNLP 2021)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2021.acl-demo.15.pdf
- Code
- adbar/trafilatura