Odinson: A Fast Rule-based Information Extraction Framework

Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, Dane Bell


Abstract
We present Odinson, a rule-based information extraction framework that couples a simple yet powerful pattern language, which can operate over multiple representations of text, with a runtime system that operates in near real time. In the Odinson query language, a single pattern may combine regular expressions over surface tokens with regular expressions over graphs such as syntactic dependencies. To guarantee rapid matching of these patterns, our framework indexes most of the information needed for matching, including directed graphs such as syntactic dependencies, into a custom Lucene index. Indexing minimizes the amount of expensive pattern matching that must take place at runtime. As a result, the runtime system matches a syntax-based graph traversal in 2.8 seconds over a corpus of more than 134 million sentences, nearly 150,000 times faster than its predecessor.
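The indexing idea in the abstract can be illustrated with a toy sketch. The code below is not Odinson's implementation or API, and it does not use Lucene: it assumes a tiny made-up corpus, plain Python dictionaries standing in for the index, and hypothetical helper names (`build_index`, `find_dependents`). It only shows the general principle of using an index to discard sentences cheaply, so that the costlier dependency-edge check runs on a small set of candidates.

```python
from collections import defaultdict

# Toy corpus (made up for illustration). Each sentence has tokens and
# labeled dependency edges of the form (head index, label, dependent index).
SENTENCES = [
    {"tokens": ["dogs", "chase", "cats"],
     "edges": [(1, "nsubj", 0), (1, "dobj", 2)]},
    {"tokens": ["cats", "sleep"],
     "edges": [(1, "nsubj", 0)]},
    {"tokens": ["the", "weather", "today"],
     "edges": [(1, "det", 0)]},
]


def build_index(sentences):
    """Map each token and each edge label to the ids of sentences containing it."""
    token_index = defaultdict(set)
    label_index = defaultdict(set)
    for sid, sent in enumerate(sentences):
        for tok in sent["tokens"]:
            token_index[tok].add(sid)
        for _, label, _ in sent["edges"]:
            label_index[label].add(sid)
    return token_index, label_index


def find_dependents(sentences, token_index, label_index, head_word, label):
    """Return (sentence id, dependent word) pairs where head_word governs a token via label."""
    # Cheap step: intersect the two id sets to drop sentences that cannot match.
    candidates = token_index.get(head_word, set()) & label_index.get(label, set())
    results = []
    # Costlier step: inspect individual edges, but only in surviving candidates.
    for sid in sorted(candidates):
        sent = sentences[sid]
        for head, lab, dep in sent["edges"]:
            if lab == label and sent["tokens"][head] == head_word:
                results.append((sid, sent["tokens"][dep]))
    return results


token_index, label_index = build_index(SENTENCES)
matches = find_dependents(SENTENCES, token_index, label_index, "chase", "nsubj")
print(matches)  # [(0, 'dogs')]
```

In this sketch only one of the three sentences survives the index lookup for the query ("chase", "nsubj"), so the edge-by-edge check touches a single sentence. A real system would store far richer information in the index than this toy does.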
Anthology ID:
2020.lrec-1.267
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2183–2191
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.267
Cite (ACL):
Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, and Dane Bell. 2020. Odinson: A Fast Rule-based Information Extraction Framework. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2183–2191, Marseille, France. European Language Resources Association.
Cite (Informal):
Odinson: A Fast Rule-based Information Extraction Framework (Valenzuela-Escárcega et al., LREC 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.267.pdf