Assessing the Similarity of Cross-Lingual Seq2Seq Sentence Embeddings Using Low-Resource Spectral Clustering

Nelson Moll; Tahseen Rabbani

Abstract

In this work, we study the cross-lingual distance of machine translations by aligning seq2seq representations over small corpora. First, we use the M2M100 model to collect sentence-level representations of the Book of Revelation in several languages. We then perform unsupervised manifold alignment (spectral clustering) between these collections of embeddings. Because verses are not necessarily aligned across translations, our procedure falls under the challenging but more realistic non-correspondence regime. The cost function associated with each alignment is used to rank the relative (machine) similarity of one language to another. We then perform correspondent alignment over another cluster of languages, this time using NLLB model embeddings of the parallel FLORES+ data. Our experiments demonstrate that the representations of closely related languages group closely and are cheap to align with our strategy, requiring fewer than 1,000 sentences.
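To make the idea of ranking languages by an alignment cost concrete, the following is a minimal sketch of the correspondent (paired-sentence) case only. It is not the authors' method: it swaps in orthogonal Procrustes alignment as a simple stand-in for manifold alignment, and the function name `procrustes_cost` and the synthetic arrays are hypothetical. Real inputs would be sentence embeddings from a model such as NLLB; random matrices are used here so the sketch runs without downloading anything.

```python
import numpy as np


def procrustes_cost(X, Y):
    """Residual left after the best orthogonal map of X onto Y.

    X and Y are (n_sentences, dim) arrays whose rows are paired,
    i.e. row i of X and row i of Y embed the same sentence.
    Hypothetical helper: a stand-in for an alignment cost, not the
    cost function used in the paper.
    """
    # Center and scale to unit Frobenius norm so costs are comparable
    # across language pairs.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)
    # Orthogonal Procrustes: with X^T Y = U S V^T, the minimizer of
    # ||X R - Y||_F over orthogonal R is R = U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    return float(np.linalg.norm(X @ R - Y) ** 2)


# Synthetic stand-ins for embeddings (assumption: 200 sentences, 16 dims).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))           # random rotation
close = base @ Q + 0.01 * rng.normal(size=(200, 16))     # "related" language
far = rng.normal(size=(200, 16))                         # "unrelated" language

# Lower cost is read as higher (machine) similarity.
costs = {"close": procrustes_cost(base, close), "far": procrustes_cost(base, far)}
ranking = sorted(costs, key=costs.get)
print(ranking, costs)
```

Under this stand-in, a language whose embeddings are a near-rotation of the reference set receives a small cost and ranks first. The non-correspondence setting described in the abstract, where rows are not paired, would need a different procedure and is not covered by this sketch.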

Anthology ID: 2025.resourceful-1.28
Volume: Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
Month: March
Year: 2025
Address: Tallinn, Estonia
Editors: Špela Arhar Holdt, Nikolai Ilinykh, Barbara Scalvini, Micaella Bruton, Iben Nyholm Debess, Crina Madalina Tudor
Venues: RESOURCEFUL | WS
Publisher: University of Tartu Library, Estonia
Pages: 137–142
URL: https://preview.aclanthology.org/fix-sig-urls/2025.resourceful-1.28/
Cite (ACL): Nelson Moll and Tahseen Rabbani. 2025. Assessing the Similarity of Cross-Lingual Seq2Seq Sentence Embeddings Using Low-Resource Spectral Clustering. In Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025), pages 137–142, Tallinn, Estonia. University of Tartu Library, Estonia.
Cite (Informal): Assessing the Similarity of Cross-Lingual Seq2Seq Sentence Embeddings Using Low-Resource Spectral Clustering (Moll & Rabbani, RESOURCEFUL 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.resourceful-1.28.pdf
