Gee-Lyle Wong


2026

Automatic question generation with large language models has advanced rapidly, yet producing assessment-ready items, complete with mark schemes and expected answers, remains challenging, especially when generation must reliably target higher-order cognitive levels in Bloom’s Taxonomy. We propose a multi-agent, multi-stage framework that generates structured assessment tuples for both short-answer questions (SAQs) and scenario-based questions (SBQs), combining Bloom-specialized generation agents with staged decomposition and automated verification. We further introduce a rubric-guided LLM-as-a-judge evaluation framework with Bloom-specific alignment metrics. Experiments on university-level AI course material across five generation pipelines show that prompt-level Bloom conditioning alone is insufficient for reliable control over the targeted cognitive level. In contrast, our structured approach yields consistent and notable improvements over baseline pipelines in Bloom alignment, mark scheme quality, and output yield, particularly at higher-order Bloom levels.