Abstract
We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files, up to several gigabytes in size, with unknown contents. Collections may originate either from official disclosures of documents, e.g., Freedom of Information Act requests, or from unofficial data leaks.
- Anthology ID:
- D18-2014
- Volume:
- Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
- Month:
- November
- Year:
- 2018
- Address:
- Brussels, Belgium
- Editors:
- Eduardo Blanco, Wei Lu
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 78–83
- URL:
- https://aclanthology.org/D18-2014
- DOI:
- 10.18653/v1/D18-2014
- Cite (ACL):
- Gregor Wiedemann, Seid Muhie Yimam, and Chris Biemann. 2018. A Multilingual Information Extraction Pipeline for Investigative Journalism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 78–83, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- A Multilingual Information Extraction Pipeline for Investigative Journalism (Wiedemann et al., EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/D18-2014.pdf
- Data:
- Polyglot-NER