Query Processing and Optimization for a Custom Retrieval Language
Yakov Kuzin, Anna Smirnova, Evgeniy Slobodkin, George Chernishev
Abstract
Data annotation has been a pressing issue ever since the rise of machine learning and associated areas. It is well-known that obtaining high-quality annotated data incurs high costs, be they financial or time-related. In our previous work, we have proposed a custom, SQL-like retrieval language used to query collections of short documents, such as chat transcripts or tweets. Its main purpose is enabling a human annotator to select “situations” from such collections, i.e. subsets of documents that are related both thematically and temporally. This language, named Matcher, was prototyped in our custom annotation tool. Entering the next stage of development of the tool, we have tested the prototype implementation. Given the language’s rich semantics, many possible execution options with various costs arise. We have found out we could provide tangible improvement in terms of speed and memory consumption by carefully selecting the execution strategy in each particular case. In this work, we present the improved algorithms and proposed optimization methods, as well as a benchmark suite whose results show the significance of the presented techniques. While this is an initial work and not a full-fledged optimization framework, it nevertheless yields good results, providing up to tenfold improvement.
- Anthology ID:
- 2022.pandl-1.8
- Volume:
- Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Editors:
- Laura Chiticariu, Yoav Goldberg, Gus Hahn-Powell, Clayton T. Morrison, Aakanksha Naik, Rebecca Sharp, Mihai Surdeanu, Marco Valenzuela-Escárcega, Enrique Noriega-Atala
- Venue:
- PANDL
- Publisher:
- International Conference on Computational Linguistics
- Pages:
- 61–70
- URL:
- https://aclanthology.org/2022.pandl-1.8
- Cite (ACL):
- Yakov Kuzin, Anna Smirnova, Evgeniy Slobodkin, and George Chernishev. 2022. Query Processing and Optimization for a Custom Retrieval Language. In Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning, pages 61–70, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
- Cite (Informal):
- Query Processing and Optimization for a Custom Retrieval Language (Kuzin et al., PANDL 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/2022.pandl-1.8.pdf
- Code:
- yakovypg/chat-corpora-annotator