Minghuan Tan


2021

Learning and Evaluating Chinese Idiom Embeddings
Minghuan Tan | Jing Jiang
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We study the task of learning and evaluating Chinese idiom embeddings. We first construct a new evaluation dataset that contains idiom synonyms and antonyms. Observing that existing Chinese word embedding methods may not be suitable for learning idiom embeddings, we further present a BERT-based method that directly learns embedding vectors for individual idioms. We empirically compare representative existing methods and our method. We find that our method substantially outperforms existing methods on the evaluation dataset we have constructed.
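
To make the synonym/antonym evaluation concrete, here is a minimal sketch (not the authors' code; the vectors and pairs below are hypothetical placeholders) of how idiom embeddings can be scored against such a dataset: a good embedding space should score synonym pairs higher than antonym pairs under cosine similarity.

    import torch
    import torch.nn.functional as F

    # Placeholder vectors; in practice these would come from the learned model.
    embeddings = {
        "一心一意": torch.randn(768),
        "全心全意": torch.randn(768),
        "三心二意": torch.randn(768),
    }
    synonym_pairs = [("一心一意", "全心全意")]
    antonym_pairs = [("一心一意", "三心二意")]

    def mean_cosine(pairs):
        sims = [F.cosine_similarity(embeddings[a], embeddings[b], dim=0)
                for a, b in pairs]
        return torch.stack(sims).mean().item()

    # Embeddings are better when synonym similarity exceeds antonym similarity.
    print("synonyms:", mean_cosine(synonym_pairs))
    print("antonyms:", mean_cosine(antonym_pairs))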

Does BERT Understand Idioms? A Probing-Based Empirical Study of BERT Encodings of Idioms
Minghuan Tan | Jing Jiang
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Understanding idioms is important in NLP. In this paper, we study to what extent a pre-trained BERT model can encode the meaning of a potentially idiomatic expression (PIE) in a given context. We make use of a few existing datasets and perform two probing tasks: PIE usage classification and idiom paraphrase identification. Our experimental results suggest that BERT can indeed separate the literal and idiomatic usages of a PIE with high accuracy. It is also able to encode the idiomatic meaning of a PIE to some extent.
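
A minimal sketch of the PIE usage classification probe, under assumptions not stated in the abstract (frozen BERT, mean pooling over the PIE span, and a linear classifier; the sentence and span indices are hypothetical):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()
    probe = torch.nn.Linear(bert.config.hidden_size, 2)  # literal vs. idiomatic

    sentence = "He finally kicked the bucket after years of illness."
    pie_span = (3, 6)  # hypothetical token indices of "kicked the bucket"

    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # BERT stays frozen; only the probe would be trained
        hidden = bert(**inputs).last_hidden_state
    span_repr = hidden[0, pie_span[0]:pie_span[1]].mean(dim=0)  # pool PIE tokens
    logits = probe(span_repr)
    print(logits.softmax(dim=-1))  # untrained probe, so scores are uninformative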

2020

A BERT-based Dual Embedding Model for Chinese Idiom Prediction
Minghuan Tan | Jing Jiang
Proceedings of the 28th International Conference on Computational Linguistics

Chinese idioms are special fixed phrases usually derived from ancient stories, whose meanings are oftentimes highly idiomatic and non-compositional. The Chinese idiom prediction task is to select the correct idiom from a set of candidate idioms given a context with a blank. We propose a BERT-based dual embedding model to encode the contextual words as well as to learn dual embeddings of the idioms. Specifically, we first match the embedding of each candidate idiom with the hidden representation corresponding to the blank in the context. We then match the embedding of each candidate idiom with the hidden representations of all the tokens in the context through context pooling. We further propose to use two separate idiom embeddings for the two kinds of matching. Experiments on a recently released Chinese idiom cloze test dataset show that our proposed method performs better than the existing state of the art. Ablation experiments also show that both context pooling and dual embedding contribute to the improvement in performance.
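
As a rough illustration only (a sketch with stand-in tensors, not the released implementation; mean pooling is assumed here for context pooling), the two kinds of matching and the dual embeddings can be pictured as follows:

    import torch

    hidden_size, vocab_size = 768, 1000  # stand-in sizes for the idiom vocabulary

    # Two separate embedding tables over the same idiom vocabulary.
    blank_emb = torch.nn.Embedding(vocab_size, hidden_size)
    context_emb = torch.nn.Embedding(vocab_size, hidden_size)

    hidden_states = torch.randn(32, hidden_size)  # stand-in for BERT outputs
    blank_pos = 12                                # position of the blank token
    candidates = torch.randint(0, vocab_size, (7,))  # candidate idiom ids

    h_blank = hidden_states[blank_pos]      # hidden representation at the blank
    h_context = hidden_states.mean(dim=0)   # context pooling (mean, an assumption)

    # Each candidate is scored by both matches; the scores are combined.
    scores = blank_emb(candidates) @ h_blank + context_emb(candidates) @ h_context
    print(candidates[scores.argmax()].item())  # predicted idiom id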