Jimin Hong
2022
Reweighting Strategy Based on Synthetic Data Identification for Sentence Similarity
TaeHee Kim | ChaeHun Park | Jimin Hong | Radhika Dua | Edward Choi | Jaegul Choo
Proceedings of the 29th International Conference on Computational Linguistics
Semantically meaningful sentence embeddings are important for numerous tasks in natural language processing. To obtain such embeddings, recent studies have explored the idea of utilizing synthetically generated data from pretrained language models (PLMs) as a training corpus. However, PLMs often generate sentences that differ from those written by humans. We hypothesize that treating all these synthetic examples equally during training can have an adverse effect on learning semantically meaningful embeddings. To analyze this, we first train a classifier that identifies machine-written sentences and observe that the linguistic features of sentences identified as machine-written differ significantly from those of human-written sentences. Based on this, we propose a novel approach that first trains the classifier to measure the importance of each sentence. The distilled information from the classifier is then used to train a reliable sentence embedding model. Through extensive evaluation on four real-world datasets, we demonstrate that our model trained on synthetic data generalizes well and outperforms the baselines.
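The reweighting idea can be illustrated with a short sketch. This is not the authors' code: it assumes a separately trained binary classifier whose output human_prob scores how human-like each synthetic sentence pair is, and uses that score to weight an in-batch contrastive loss for the embedding model. The function name weighted_embedding_loss and the temperature value are illustrative assumptions.

```python
# Hedged sketch (not the authors' implementation): weighting a sentence-embedding
# loss by a synthetic-data classifier's confidence that an example is human-like.
import torch
import torch.nn.functional as F

def weighted_embedding_loss(anchor_emb, positive_emb, human_prob, temperature=0.05):
    """anchor_emb, positive_emb: (batch, dim) embeddings of paired sentences.
    human_prob: (batch,) classifier probability that each synthetic pair is human-like.
    Returns an in-batch contrastive (InfoNCE-style) loss, reweighted per example."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # Cosine similarity of every anchor against every positive in the batch.
    logits = anchor @ positive.t() / temperature              # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    per_example = F.cross_entropy(logits, targets, reduction="none")
    weights = human_prob / (human_prob.sum() + 1e-8)          # normalize the weights
    return (weights * per_example).sum()

# Usage with random tensors standing in for encoder and classifier outputs:
a, p = torch.randn(8, 768), torch.randn(8, 768)
prob = torch.rand(8)
loss = weighted_embedding_loss(a, p, prob)
```

In this sketch, pairs the classifier flags as machine-like receive smaller weights and therefore contribute less gradient to the sentence encoder.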
2021
AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain
Jimin Hong | TaeHee Kim | Hyesu Lim | Jaegul Choo
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
During the fine-tuning phase of transfer learning, the pretrained vocabulary remains unchanged while the model parameters are updated. The vocabulary generated from the pretraining data is suboptimal for downstream data when a domain discrepancy exists. We propose to treat the vocabulary as an optimizable parameter, allowing us to update it by expanding it with domain-specific vocabulary based on a tokenization statistic. Furthermore, we keep the embeddings of the added words from overfitting to downstream data by utilizing knowledge learned from a pretrained language model with a regularization term. Our method achieves consistent performance improvements on diverse domains (i.e., biomedical, computer science, news, and reviews).
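A rough sketch of the vocabulary-expansion step, under simplifying assumptions: the word-fragmentation heuristic below stands in for the paper's tokenization statistic, and initializing added embeddings from the mean of their original subword embeddings is only a proxy for the paper's regularization term, which is applied during training. Model name and example corpus are illustrative.

```python
# Hedged sketch (not the AVocaDo implementation): expanding a pretrained tokenizer
# with domain words and initializing the new embeddings before fine-tuning.
import torch
from collections import Counter
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

domain_corpus = [
    "the patient presented with acute myocardial infarction",
    "angioplasty was performed without complications",
]

# Pick whole words the pretrained tokenizer fragments heavily
# (a simple proxy for the paper's tokenization statistic).
counts = Counter(w for doc in domain_corpus for w in doc.split())
new_tokens = [w for w in counts if len(tokenizer.tokenize(w)) > 2]
subword_ids = {w: tokenizer(w, add_special_tokens=False)["input_ids"] for w in new_tokens}

tokenizer.add_tokens(new_tokens)               # expand the vocabulary
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

# Initialize each added embedding from the mean of its original subword embeddings,
# so fine-tuning starts from pretrained knowledge rather than random vectors.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for w in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(w)
        emb[new_id] = emb[subword_ids[w]].mean(dim=0)
```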
2020
F^2-Softmax: Diversifying Neural Text Generation via Frequency Factorized Softmax
Byung-Ju Choi | Jimin Hong | David Park | Sang Wan Lee
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Despite recent advances in neural text generation, encoding the rich diversity of human language remains elusive. We argue that sub-optimal text generation is mainly attributable to the imbalanced token distribution, which particularly misdirects the learning model when trained with the maximum-likelihood objective. As a simple yet effective remedy, we propose two novel methods, F^2-Softmax and MefMax, for balanced training even with a skewed frequency distribution. MefMax assigns tokens uniquely to frequency classes, trying to group tokens with similar frequencies and equalize the frequency mass across classes. F^2-Softmax then decomposes the probability of the target token into a product of two conditional probabilities: (1) the frequency class, and (2) the token within that frequency class. Models learn more uniform probability distributions because each softmax is confined to a subset of the vocabulary. Significant gains on seven relevant metrics suggest the superiority of our approach in improving not only the diversity but also the quality of generated texts.
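The factorization can be made concrete with a small PyTorch sketch. It is not the released implementation: the module name FactorizedSoftmax and the buffer class_of_token (a precomputed token-to-frequency-class mapping, e.g. from a MefMax-style grouping) are assumptions; only the two-step decomposition log P(class) + log P(token | class) follows the description above.

```python
# Hedged sketch (not the released F^2-Softmax code): factorizing next-token
# probabilities into P(frequency class) * P(token | frequency class).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedSoftmax(nn.Module):
    def __init__(self, hidden_dim, vocab_size, class_of_token):
        """class_of_token: LongTensor (vocab_size,) mapping each token id to its
        frequency class, e.g. produced by a MefMax-style grouping."""
        super().__init__()
        self.register_buffer("class_of_token", class_of_token)
        num_classes = int(class_of_token.max()) + 1
        self.class_head = nn.Linear(hidden_dim, num_classes)
        self.token_head = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, hidden, target):
        """hidden: (batch, hidden_dim); target: (batch,) token ids.
        Returns log P(class of target) + log P(target | class)."""
        target_class = self.class_of_token[target]
        log_p_class = F.log_softmax(self.class_head(hidden), dim=-1)
        log_p_class = log_p_class.gather(1, target_class.unsqueeze(1)).squeeze(1)

        # Restrict the token softmax to the target's own frequency class.
        token_logits = self.token_head(hidden)
        same_class = self.class_of_token.unsqueeze(0) == target_class.unsqueeze(1)
        token_logits = token_logits.masked_fill(~same_class, float("-inf"))
        log_p_token = F.log_softmax(token_logits, dim=-1)
        log_p_token = log_p_token.gather(1, target.unsqueeze(1)).squeeze(1)
        return log_p_class + log_p_token

# Example: a 100-token vocabulary split into 4 frequency classes.
class_map = torch.randint(0, 4, (100,))
head = FactorizedSoftmax(hidden_dim=32, vocab_size=100, class_of_token=class_map)
nll = -head.log_prob(torch.randn(8, 32), torch.randint(0, 100, (8,))).mean()
```

Because each softmax runs over a frequency-balanced subset rather than the full vocabulary, rare tokens are not drowned out by frequent ones during maximum-likelihood training.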
Co-authors
- Taehee Kim 2
- Jaegul Choo 2
- Chaehun Park 1
- Radhika Dua 1
- Edward Choi 1