A Multilingual Information Extraction Pipeline for Investigative Journalism
Abstract
We introduce an advanced information extraction pipeline that automatically processes very large collections of unstructured text for investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. In the target use case, journalists receive a large collection of files, up to several gigabytes in size, with unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or from unofficial data leaks.
- Anthology ID:
- D18-2014
- Volume:
- Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
- Month:
- November
- Year:
- 2018
- Address:
- Brussels, Belgium
- Editors:
- Eduardo Blanco, Wei Lu
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 78–83
- URL:
- https://aclanthology.org/D18-2014
- DOI:
- 10.18653/v1/D18-2014
- Cite (ACL):
- Gregor Wiedemann, Seid Muhie Yimam, and Chris Biemann. 2018. A Multilingual Information Extraction Pipeline for Investigative Journalism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 78–83, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- A Multilingual Information Extraction Pipeline for Investigative Journalism (Wiedemann et al., EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/teach-a-man-to-fish/D18-2014.pdf
- Data
- Polyglot-NER