Translating Headers of Tabular Data: A Pilot Study of Schema Translation

Kunrui Zhu, Yan Gao, Jiaqi Guo, Jian-Guang Lou


Abstract
Schema translation is the task of automatically translating headers of tabular data from one language to another. High-quality schema translation plays an important role in cross-lingual table search, understanding, and analysis. Despite its importance, schema translation is not well studied in the community, and state-of-the-art neural machine translation models do not work well on this task because of two intrinsic differences between plain text and tabular data: morphological difference and context difference. To facilitate research on this task, we construct the first parallel dataset for schema translation, which consists of 3,158 tables with 11,979 headers written in 6 different languages: English, Chinese, French, German, Spanish, and Japanese. We also propose the first schema translation model, called CAST, a header-to-header neural machine translation model augmented with schema context. Specifically, we model a target header and its context as a directed graph to represent their entity types and relations. CAST then encodes the graph with a relational-aware transformer and uses another transformer to decode the header in the target language. Experiments on our dataset demonstrate that CAST significantly outperforms state-of-the-art neural machine translation models. Our dataset will be released at https://github.com/microsoft/ContextualSP.
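To make the graph idea in the abstract concrete, the following is a minimal Python sketch of how a target header and its schema context might be represented as a directed graph with typed edges. It is an illustration under stated assumptions, not the CAST implementation: the names SchemaGraph and build_schema_graph, the relation labels "context_of" and "has_type", and the example headers and entity types are all hypothetical and do not come from the paper or the released code.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaGraph:
    """Hypothetical container: node labels plus labeled directed edges."""
    nodes: list = field(default_factory=list)  # node labels (headers or entity types)
    edges: list = field(default_factory=list)  # (source index, target index, relation)

    def add_node(self, label):
        # Append a node and return its index.
        self.nodes.append(label)
        return len(self.nodes) - 1


def build_schema_graph(target, context_headers, entity_types):
    """Build a directed graph around one target header.

    The relation labels used here are assumptions chosen for illustration.
    """
    graph = SchemaGraph()
    target_id = graph.add_node(target)

    # Each other header of the same table points at the target header.
    for header in context_headers:
        ctx_id = graph.add_node(header)
        graph.edges.append((ctx_id, target_id, "context_of"))

    # Headers with a known entity type point at a separate type node.
    # list(...) snapshots the header nodes before type nodes are appended.
    for idx, label in enumerate(list(graph.nodes)):
        etype = entity_types.get(label)
        if etype is not None:
            type_id = graph.add_node(etype)
            graph.edges.append((idx, type_id, "has_type"))

    return graph


graph = build_schema_graph(
    "name",
    ["id", "birth date"],
    {"name": "Person", "birth date": "Date"},
)
for source, dest, relation in graph.edges:
    print(graph.nodes[source], "-[" + relation + "]->", graph.nodes[dest])
```

In a model along the lines the abstract describes, an encoder would consume the node labels together with the edge relations, so that the translation of the target header can depend on its neighbors; how CAST actually does this is specified in the paper, not in this sketch.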
Anthology ID:
2021.emnlp-main.5
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
56–66
URL:
https://aclanthology.org/2021.emnlp-main.5
DOI:
10.18653/v1/2021.emnlp-main.5
Cite (ACL):
Kunrui Zhu, Yan Gao, Jiaqi Guo, and Jian-Guang Lou. 2021. Translating Headers of Tabular Data: A Pilot Study of Schema Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 56–66, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Translating Headers of Tabular Data: A Pilot Study of Schema Translation (Zhu et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/update-css-js/2021.emnlp-main.5.pdf
Code:
microsoft/ContextualSP