Maaz Amjad

2026

Toward Cross-Domain Automated Feedback: A Comparative Evaluation of Open-Source Models across Diverse Student Assessment Types
Muhammad Haseeb | Min Paing Hmue | Ahmad Imam Amjad | Maaz Amjad | Victor Sheng
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Constructive, personalized, and timely feedback is essential to student learning. However, providing such feedback in large classes remains a major challenge. Large language models (LLMs) offer a way to support instructors by generating relevant feedback that highlights both student strengths and areas for improvement. Nevertheless, most existing LLM-based feedback systems rely on proprietary APIs and are often tailored to specific tasks, which limits their accessibility, flexibility, and applicability in resource-constrained educational settings. In this study, we investigate the potential of two open-source LLMs (DeepSeek R1 and Qwen 3.5) to support automated feedback generation across three domains (programming assignments, essays, and mathematics problems). We evaluate two prompting strategies (unified and multi-agent) across these domains and use human judgment criteria to assess feedback quality. Through this analysis, we examine the potential of open-source models as practical, scalable alternatives to proprietary API-based systems for developing freely accessible feedback-generation tools. Our results show that a single open-source model can generate useful feedback across diverse domains, though with varying effectiveness. DeepSeek R1 performs better on reasoning-intensive tasks such as mathematics, while Qwen 3.5 works best for holistic tasks such as writing, but both models struggle with programming tasks. We find that model architecture matters more than prompting strategy: reasoning-optimized models excel in structured domains, while general-purpose models perform better on holistic tasks. Finally, we conclude that a multi-agent approach does not consistently improve results over a single unified LLM approach.

2025

Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation
Muhammad Ali Shafique | Kanwal Mehreen | Muhammad Arham | Maaz Amjad | Sabur Butt | Hamza Farooq
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought-based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of the Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct on Urdu-specific tasks. It also outperforms leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performing, culturally aligned LLMs for low-resource languages can be developed efficiently using our modified self-instruct approach.

Advances in Auto-Grading with Large Language Models: A Cross-Disciplinary Survey
Tania Amanda Nkoyo Frederick Eneye | Chukwuebuka Fortunate Ijezue | Ahmad Imam Amjad | Maaz Amjad | Sabur Butt | Gerardo Castañeda-Garza
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

With the rise and widespread adoption of Large Language Models (LLMs) in recent years, extensive research has been conducted on their applications across various domains. One such domain is education, where a key area of interest for researchers is investigating the implementation and reliability of LLMs in grading student responses. This review paper examines studies on the use of LLMs in grading across six academic sub-fields: educational assessment, essay grading, natural sciences and technology, social sciences and humanities, computer science and engineering, and mathematics. It explores how different LLMs are applied in automated grading, the prompting techniques employed, the effectiveness of LLM-based grading for both structured and open-ended responses, and the patterns observed in grading performance. Additionally, this paper discusses the challenges associated with LLM-based grading systems, such as inconsistencies and the need for human oversight. By synthesizing existing research, this paper provides insights into the current capabilities of LLMs in academic assessment and serves as a foundation for future exploration in this area.

2020

Data Augmentation using Machine Translation for Fake News Detection in the Urdu Language
Maaz Amjad | Grigori Sidorov | Alisa Zhila
Proceedings of the Twelfth Language Resources and Evaluation Conference

The task of fake news detection is to distinguish legitimate news articles that describe real facts from those that convey deceptive and fictitious information. As the fake news phenomenon is omnipresent across all languages, it is crucial to be able to solve this problem efficiently for languages other than English. A common approach to this task is supervised classification using features of varying complexity. Yet supervised machine learning requires a substantial amount of annotated data. For English and a small number of other languages, annotated data is far more available, whereas for the vast majority of languages, it is scarce. We investigate whether machine translation in its present state can be successfully used as an automated technique for creating and augmenting annotated corpora for fake news detection, focusing on the English-Urdu language pair. We train a fake news classifier for Urdu on (1) the manually annotated dataset originally in Urdu and (2) the machine-translated version of an existing annotated fake news dataset originally in English. We show that, at the present state of machine translation quality for the English-Urdu language pair, fully automated data augmentation through machine translation did not improve fake news detection in Urdu.