Translating Knowledge Representations with Monolingual Word Embeddings: the Case of a Thesaurus on Corporate Non-Financial Reporting

Martín Quesada Zaragoza, Lianet Sepúlveda Torres, Jérôme Basdevant


Abstract
A common method of structuring information extracted from textual data is to use a knowledge model (e.g. a thesaurus) to organise the information semantically. Creating and managing a knowledge model is already costly in terms of human effort, and making it multilingual is costlier still. Multilingual knowledge modelling is a common problem both for transnational organisations and for organisations providing text analytics that want to analyse information in more than one language. Many organisations tend to develop their language resources first in one language (often English). When it comes to analysing data sources in other languages, either considerable effort has to be invested in recreating the same knowledge base in a different language, or the data itself has to be translated into the language of the knowledge model. In this paper, we propose an unsupervised method to automatically induce a version of a given thesaurus in another language using only comparable monolingual corpora. The aim of this proposal is to employ cross-lingual word embeddings to map the set of topics in an existing English thesaurus into Spanish. With this in mind, we describe different approaches to generating the Spanish thesaurus terms. We offer an extrinsic evaluation by using the resulting thesaurus, which covers non-financial topics, in a multi-label document classification task, and we compare the results across these approaches.
Anthology ID: 2020.computerm-1.3
Volume: Proceedings of the 6th International Workshop on Computational Terminology
Month: May
Year: 2020
Address: Marseille, France
Venue: CompuTerm
Publisher: European Language Resources Association
Pages: 17–25
Language: English
URL: https://aclanthology.org/2020.computerm-1.3
Cite (ACL): Martín Quesada Zaragoza, Lianet Sepúlveda Torres, and Jérôme Basdevant. 2020. Translating Knowledge Representations with Monolingual Word Embeddings: the Case of a Thesaurus on Corporate Non-Financial Reporting. In Proceedings of the 6th International Workshop on Computational Terminology, pages 17–25, Marseille, France. European Language Resources Association.
Cite (Informal): Translating Knowledge Representations with Monolingual Word Embeddings: the Case of a Thesaurus on Corporate Non-Financial Reporting (Quesada Zaragoza et al., CompuTerm 2020)
PDF: https://preview.aclanthology.org/auto-file-uploads/2020.computerm-1.3.pdf