2025
pdf
bib
abs
FarSense: A Comprehensive Commonsense Benchmark and Evaluation Framework for the Farsi Language
Kamyar Zeinalipour
|
Neda Jamshidi
|
Seyedehbahareh Hejazi
|
Marco Maggini
|
Monica Bianchini
|
Simone Paoletti
|
Marco Gori
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Although Farsi is widely spoken, no comprehensive benchmark exists for assessing commonsense reasoning in language models. We therefore present FarSense, a 6‐task benchmark for Farsi covering True/False judgment, multiple-choice questions, Explanation, Cause‐Effect inference, Counterfactual reasoning, and Knowledge Completion. Starting from Farsi‐Wikipedia, we filtered noise and retained ~4,210 passages, rewrote them into realistic daily scenarios, and derived the above tasks from each scenario. Scenario and task generation quality was first judged via native‐speaker annotations on outputs from five major LLMs—GPT‐4o, Gemini-2.5-Flash, Mistral-Large, Qwen‐Plus, and DeepSeek‐Chat. Gemini-2.5-Flash demonstrated the highest performance, leading to its use in generating a large-scale dataset, subsequently finalized through meticulous two-step human validation. Using FarSense, we measured the commonsense ability of the same five flagship LLMs and also fine‐tuned six compact models (1B–24B parameters) before re‐evaluating them. To ensure broad applicability, task wording was designed to minimize dialectal, cultural, or religious bias. Experiments show that targeted fine‐tuning yields substantial gains, confirming FarSense as a reliable, openly licensed resource for advancing reproducible commonsense understanding research in Farsi NLP. We publicly release all code and data at https://github.com/KamyarZeinalipour/FarSense.
pdf
bib
abs
PersianMCQ-Instruct: A Comprehensive Resource for Generating Multiple-Choice Questions in Persian
Kamyar Zeinalipour
|
Neda Jamshidi
|
Fahimeh Akbari
|
Marco Maggini
|
Monica Bianchini
|
Marco Gori
Proceedings of the First Workshop on Language Models for Low-Resource Languages
We present PersianMCQ-Instruct, a comprehensive resource that includes a dataset and advanced models for generating multiple-choice questions (MCQs) in standard Iranian Persian, a low-resource language spoken by over 80 million people. This resource features three state-of-the-art models for Persian MCQ generation: PMCQ-Gemma2-9b, PMCQ-Llama3.1-8b, and PMCQ-Mistral-7B. Inspired by the Agent Instruct framework and GPT-4o, we created the dataset by curating over 4,000 unique Persian Wikipedia pages, resulting in three MCQs per page and a total of over 12,000 questions. To ensure the quality of this dataset, we conducted human evaluations and model fine-tuning, both of which demonstrated significant performance improvements in Persian MCQ generation. The dataset and models are publicly available, offering valuable tools for researchers and educators, with particular benefits for advancing Persian-language educational technology.
2024
pdf
bib
abs
Design Proteins Using Large Language Models: Enhancements and Comparative Analyses
Kamyar Zeinalipour
|
Neda Jamshidi
|
Monica Bianchini
|
Marco Maggini
|
Marco Gori
Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available (https://github.com/KamyarZeinalipour/protein-design-LLMs).Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology.