Query Processing and Optimization for a Custom Retrieval Language

Yakov Kuzin, Anna Smirnova, Evgeniy Slobodkin, George Chernishev


Abstract
Data annotation has been a pressing issue ever since the rise of machine learning and associated areas. It is well known that obtaining high-quality annotated data incurs high costs, be they financial or time-related. In our previous work, we proposed a custom, SQL-like retrieval language for querying collections of short documents, such as chat transcripts or tweets. Its main purpose is to enable a human annotator to select “situations” from such collections, i.e., subsets of documents that are related both thematically and temporally. This language, named Matcher, was prototyped in our custom annotation tool. Entering the next stage of the tool’s development, we tested the prototype implementation. Given the language’s rich semantics, many execution options with differing costs arise. We found that carefully selecting the execution strategy in each particular case yields tangible improvements in speed and memory consumption. In this work, we present the improved algorithms and the proposed optimization methods, as well as a benchmark suite whose results demonstrate the significance of the presented techniques. While this is initial work rather than a full-fledged optimization framework, it nevertheless yields good results, providing up to a tenfold improvement.
Anthology ID:
2022.pandl-1.8
Volume:
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
PANDL
Publisher:
International Conference on Computational Linguistics
Pages:
61–70
URL:
https://aclanthology.org/2022.pandl-1.8
Cite (ACL):
Yakov Kuzin, Anna Smirnova, Evgeniy Slobodkin, and George Chernishev. 2022. Query Processing and Optimization for a Custom Retrieval Language. In Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning, pages 61–70, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal):
Query Processing and Optimization for a Custom Retrieval Language (Kuzin et al., PANDL 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.pandl-1.8.pdf
Code:
yakovypg/chat-corpora-annotator