He Zhang
Other people with similar names: He Zhang
Unverified author pages with similar names: He Zhang
2025
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion
Jianqing Zhu | Huang Huang | Zhihang Lin | Juhao Liang | Zhengyang Tang | Khalid Almubarak | Mosen Alharthi | Bang An | Juncai He | Xiangbo Wu | Fei Yu | Junying Chen | Ma Zhuoheng | Yuhao Du | He Zhang | Saied Alshahrani | Emad A. Alghamdi | Lian Zhang | Ruoyu Sun | Haizhou Li | Benyou Wang | Jinchao Xu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jianqing Zhu | Huang Huang | Zhihang Lin | Juhao Liang | Zhengyang Tang | Khalid Almubarak | Mosen Alharthi | Bang An | Juncai He | Xiangbo Wu | Fei Yu | Junying Chen | Ma Zhuoheng | Yuhao Du | He Zhang | Saied Alshahrani | Emad A. Alghamdi | Lian Zhang | Ruoyu Sun | Haizhou Li | Benyou Wang | Jinchao Xu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper addresses the critical need for democratizing large language models (LLM) in the Arab world, a region that has seen slower progress in developing models comparable to state-of-the-art offerings like GPT-4 or GPT-3.5, due to a predominant focus on mainstream languages (e.g., English and Chinese). One practical objective for Arabic LLMs is to utilize Arabic-specific vocabulary in the tokenizer to accelerate decoding. However, using a different vocabulary often leads to degradation of the model’s learned knowledge, since many words become out-of-vocabulary (OOV) at the beginning of training. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion, which is implemented by a modified BPE algorithm that progressively extends the Arabic subwords in its dynamic vocabulary during training, thereby balancing the OOV ratio at every stage. The ablation study demonstrated the effectiveness of Progressive Vocabulary Expansion.Moreover, AraLLaMA achieves decent performance comparable to the best Arabic LLMs across a variety of Arabic benchmarks. Our model weights are available at: https://github.com/FreedomIntelligence/AraLLaMa.
2024
SumSurvey: An Abstractive Dataset of Scientific Survey Papers for Long Document Summarization
Ran Liu | Ming Liu | Min Yu | He Zhang | Jianguo Jiang | Gang Li | Weiqing Huang
Findings of the Association for Computational Linguistics: ACL 2024
Ran Liu | Ming Liu | Min Yu | He Zhang | Jianguo Jiang | Gang Li | Weiqing Huang
Findings of the Association for Computational Linguistics: ACL 2024
With the popularity of large language models (LLMs) and their ability to handle longer input documents, there is a growing need for high-quality long document summarization datasets. Although many models already support 16k input, current lengths of summarization datasets are inadequate, and salient information is not evenly distributed. To bridge these gaps, we collect a new summarization dataset called SumSurvey, consisting of more than 18k scientific survey papers. With an average document length exceeding 12k and a quarter exceeding 16k, as well as the uniformity metric outperforming current mainstream long document summarization datasets, SumSurvey brings new challenges and expectations to both fine-tuned models and LLMs. The informativeness of summaries and the models supporting the evaluation of long document summarization warrant further attention. Automatic and human evaluation results on this abstractive dataset confirm this view. Our dataset and code are available at https://github.com/Oswald1997/SumSurvey.