Abstract
Text embedding calls for an efficient method of training domain-specific models on limited data, since general models trained on large corpora do not transfer well to highly specialized fields. We therefore introduce VAEGPT-Sim, a synonym-generation model that combines a denoising variational autoencoder with a target-specific discriminator to produce synonymous sentences that closely resemble human language. Even when trained in a fully unsupervised setting, it maintains a balance between semantic similarity and lexical diversity, achieving the highest average scores among comparable generative models under a comprehensive evaluation metric system. When VAEGPT-Sim is used as a module for contrastive learning in text representation, it delivers state-of-the-art results for small-dataset training on STS benchmarks, surpassing ConSERT by 2.8 points. This approach improves text representation despite a limited corpus, marking an advance in domain-specific embedding.
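The abstract describes using generated synonyms as positive pairs for contrastive learning of sentence representations. The following is a minimal sketch of that idea, not the authors' released code: `encode` and `generate_synonym` are hypothetical placeholders for the sentence encoder being trained and for VAEGPT-Sim's synonym generator, and the in-batch InfoNCE objective is a standard choice assumed here for illustration.

```python
# Minimal sketch (assumptions labeled): contrastive training where each
# sentence's positive is a model-generated synonym and the other in-batch
# synonyms serve as negatives. Not the paper's exact implementation.

import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor,
                  positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over a batch: row i of `anchor_emb` is paired with row i of
    `positive_emb` (its generated synonym); off-diagonal pairs are negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature   # (B, B) scaled cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

# Hypothetical usage, with `encode` and `generate_synonym` as placeholders:
# batch = ["a sentence", "another sentence", ...]
# positives = [generate_synonym(s) for s in batch]
# loss = info_nce_loss(encode(batch), encode(positives))
```

Under this setup, improving the generator's balance of semantic similarity and lexical diversity directly controls the quality of the positive pairs, which is the mechanism the abstract credits for the STS gains.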
- Anthology ID: 2024.findings-acl.513
- Volume: Findings of the Association for Computational Linguistics: ACL 2024
- Month: August
- Year: 2024
- Address: Bangkok, Thailand
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 8666–8681
- URL: https://aclanthology.org/2024.findings-acl.513
- DOI: 10.18653/v1/2024.findings-acl.513
- Cite (ACL): Zhenyi Wang, Haiyan Ning, Qing Ling, and Dan Wang. 2024. VAEGPT-Sim: Improving Sentence Representation with Limited Corpus Using Gradually-Denoising VAE. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8666–8681, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal): VAEGPT-Sim: Improving Sentence Representation with Limited Corpus Using Gradually-Denoising VAE (Wang et al., Findings 2024)
- PDF: https://preview.aclanthology.org/autopr/2024.findings-acl.513.pdf