Peiqin Lin


2024

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
Stephen Mayhew | Terra Blevins | Shuheng Liu | Marek Suppa | Hila Gonen | Joseph Marvin Imperial | Börje Karlsson | Peiqin Lin | Nikola Ljubešić | Lester James Miranda | Barbara Plank | Arij Riabi | Yuval Pinter
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 19 datasets annotated with named entities in a cross-lingually consistent schema across 13 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We will release the data, code, and fitted models to the public.

mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
Peiqin Lin | Chengzhi Hu | Zheyu Zhang | Andre Martins | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EACL 2024

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.
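
The abstract above does not give the exact similarity formula, so the following is only a minimal sketch of the general idea: represent each language by embeddings of the same parallel sentences, score a language pair by the mean cosine similarity of aligned sentences, and pick the most similar candidate as the transfer source. The scoring rule, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration; in practice the embeddings would come from a layer of an mPLM.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def language_similarity(emb_a, emb_b):
    """Mean cosine similarity over aligned (parallel) sentence embeddings.

    emb_a[i] and emb_b[i] are assumed to embed translations of the
    same sentence in two different languages.
    """
    scores = [cosine(u, v) for u, v in zip(emb_a, emb_b)]
    return sum(scores) / len(scores)


def select_source(target, candidates, embeddings):
    """Return the candidate language most similar to the target."""
    return max(
        candidates,
        key=lambda lang: language_similarity(embeddings[target], embeddings[lang]),
    )


# Hypothetical toy embeddings for two parallel sentences per language.
embeddings = {
    "tgt": [[1.0, 0.0], [0.0, 1.0]],
    "src_a": [[0.9, 0.1], [0.1, 0.9]],
    "src_b": [[0.0, 1.0], [1.0, 0.0]],
}

best = select_source("tgt", ["src_a", "src_b"], embeddings)
```

Here `src_a` is chosen because its sentence embeddings point in nearly the same directions as those of the target language, while those of `src_b` are orthogonal to them.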

OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining
Yihong Liu | Peiqin Lin | Mingyang Wang | Hinrich Schuetze
Findings of the Association for Computational Linguistics: NAACL 2024

Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually initializes the embeddings of new subwords randomly and introduces substantially more embedding parameters to the model, thus reducing efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which greatly reduces the number of parameters. We show that OFA accelerates the convergence of continued pretraining, which is environmentally friendly because it generates a much smaller carbon footprint. Through extensive experiments, we demonstrate that OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
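
To make the parameter saving from factorizing the embedding matrix concrete, here is a small arithmetic sketch. It compares a full vocabulary-by-hidden embedding table with a product of two lower-dimensional matrices. The vocabulary size, hidden size, and low dimension are hypothetical numbers chosen for illustration, not values reported for OFA.

```python
def full_embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in a single vocab_size x hidden_dim embedding table."""
    return vocab_size * hidden_dim


def factorized_embedding_params(vocab_size: int, hidden_dim: int, low_dim: int) -> int:
    """Parameters when the table is replaced by two matrices:
    vocab_size x low_dim (per-subword vectors) and
    low_dim x hidden_dim (shared projection up to the model size).
    """
    return vocab_size * low_dim + low_dim * hidden_dim


# Hypothetical sizes, for illustration only.
VOCAB_SIZE = 400_000
HIDDEN_DIM = 768
LOW_DIM = 100

full = full_embedding_params(VOCAB_SIZE, HIDDEN_DIM)
factorized = factorized_embedding_params(VOCAB_SIZE, HIDDEN_DIM, LOW_DIM)
saving = 1 - factorized / full  # fraction of embedding parameters removed
```

Under these assumed sizes, the full table has 307,200,000 parameters and the factorized form has 40,076,800, since the shared projection adds only `LOW_DIM * HIDDEN_DIM` parameters regardless of vocabulary size.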

2023

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Ayyoob Imani | Peiqin Lin | Amir Hossein Kargaran | Silvia Severini | Masoud Jalili Sabet | Nora Kassner | Chunlan Ma | Helmut Schmid | André Martins | François Yvon | Hinrich Schütze
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, “help” from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

2020

A Shared-Private Representation Model with Coarse-to-Fine Extraction for Target Sentiment Analysis
Peiqin Lin | Meng Yang
Findings of the Association for Computational Linguistics: EMNLP 2020

Target sentiment analysis aims to detect opinion targets in a sentence and recognize their sentiment polarities. Some models with span-based labeling have achieved promising results in this task. However, the relation between the target extraction task and the target classification task has not been well exploited. Moreover, the span-based target extraction algorithm performs poorly on target phrases because of the maximum target length setting or the length penalty factor. To address these problems, we propose a novel framework, the Shared-Private Representation Model (SPRM), with a coarse-to-fine extraction algorithm. To learn target extraction and classification jointly, we design a Shared-Private Network, which encodes not only shared information for both tasks but also private information for each task. To avoid missing correct target phrases, we also propose a heuristic coarse-to-fine extraction algorithm that first gets the approximate interval of the targets by matching the nearest predicted start and end indexes and then extracts the targets by adopting an extending strategy. Experimental results show that our model achieves state-of-the-art performance.
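
The coarse step of the extraction algorithm, matching the nearest predicted start and end indexes, can be sketched roughly as follows. This is an illustration under stated assumptions and not the authors' implementation: the function name is hypothetical, "nearest" is assumed to mean the closest predicted end at or after each start, and the subsequent extending strategy is omitted because the abstract does not specify it.

```python
def coarse_intervals(starts, ends):
    """Pair each predicted start index with the nearest predicted end
    index at or after it, giving approximate target intervals.

    Starts with no end index at or after them are dropped.
    """
    intervals = []
    for start in sorted(starts):
        following_ends = [end for end in ends if end >= start]
        if following_ends:
            intervals.append((start, min(following_ends)))
    return intervals


# Hypothetical predicted token positions for one sentence.
predicted_starts = [1, 5]
predicted_ends = [2, 8]
intervals = coarse_intervals(predicted_starts, predicted_ends)
```

With these toy predictions the coarse intervals are `(1, 2)` and `(5, 8)`; a fine-grained step would then adjust each interval to recover the full target phrase.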