Andreea Dutulescu

2026

ORSO QGen: Odds-Ratio Steerable Optimization for Controlling Question Generation
Andreea Dutulescu | Stefan Ruseti | Mihai Dascalu | Danielle S. McNamara
Findings of the Association for Computational Linguistics: EACL 2026

Question generation plays an important role in educational applications, enabling automated assessment and reading comprehension support. Attribute-controlled question generation aims to produce questions that fit predefined characteristics such as difficulty, focus, or coverage. Existing methods predominantly rely on supervised fine-tuning, which often fails to impose a strong adherence to attribute values, resulting in weak coupling between prompt specifications and model outputs. We introduce Odds-Ratio Steerable Optimization (ORSO), a framework designed to enhance attribute sensitivity in question generation models. Building upon preference-based learning techniques without requiring human-curated preference sets, ORSO employs input-level perturbations to create contrastive training signals. Empirical evaluations on both exhaustive and expert-validated attribute configurations indicate that ORSO performs better in enforcing attribute conformity while maintaining output quality. These results argue for the benefits of explicit attribute-aware optimization in controllable question generation tasks.


EduMUSE: A Multimodal Educational Dataset with Automatically Extracted Instructional Context
Andreea Dutulescu | Stefan Ruseti | Mihai Dascalu | Danielle McNamara
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Research in AI applied to education increasingly relies on large-scale, high-quality datasets to support the development and evaluation of learning analytics and intelligent educational systems. Open educational resources provide a promising foundation, yet few datasets integrate structured instructional content with assessment materials in a multimodal form. In this study, we introduce a large-scale multimodal educational dataset (EduMUSE - Educational Multimodal Understanding & Solution Dataset) constructed from OpenStax undergraduate textbooks across multiple domains. The dataset integrates hierarchically structured instructional text, figures, exercises, and, when available, official solutions. For exercises with solutions, we introduce an automatic method that associates each exercise with a focused instructional subsection rather than entire textbook chapters, estimating subsection relevance via solution likelihood under candidate contexts using a vision–language model. We analyze the impact of contextualization on the behavior of vision–language models across different contexts. Results indicate that subsection-level instructional context has a measurable impact on model performance, with variation across model scales and task formulations. The dataset and code are released as open source at https://github.com/upb-nlp/BEA-EduMUSE/ to support reproducible research in multimodal educational modeling and to facilitate generating similar datasets using our approach.
