Maria Monica Manlises

2026

Zero Shot Phonics: Evaluating Constraint-Adherent Phonics Story Generation in Large Language Models
Maria Monica Manlises | Ethel Ong
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Phonics stories are essential for early literacy, requiring controlled repetition of grapheme-phoneme (GP) patterns while maintaining simplicity, suitability, and quality. Generating such texts poses a challenge for large language models (LLMs), which must balance multiple phonological and pedagogical constraints. We evaluate six LLMs in a zero-shot setting across 16 prompt configurations, producing 8,688 outputs and 39,096 stories. Outputs are assessed using a multi-dimensional framework covering phonological alignment, developmental lexical appropriateness, readability, and narrative quality. Results show that while LLMs generate highly readable and age-appropriate text, they exhibit variability in phoneme control and narrative coherence. Prompt design significantly affects performance, revealing trade-offs among phonological, linguistic, and pedagogical constraints, and model choice also leads to significant differences. These findings highlight the challenges of controllable educational text generation and the importance of prompt design in balancing instructional objectives. We release our prompts, generated stories, and evaluation framework to support future work in phonics-based story generation for early readers.

2025


DLSU at BEA 2025 Shared Task: Towards Establishing Baseline Models for Pedagogical Response Evaluation Tasks
Maria Monica Manlises | Mark Edward Gonzales | Lanz Lim
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

We present our submission for Tracks 3 (Providing Guidance), 4 (Actionability), and 5 (Tutor Identification) of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-Powered Tutors. Our approach sought to investigate the performance of directly using sentence embeddings of tutor responses as input to downstream classifiers (that is, without employing any fine-tuning). To this end, we benchmarked two general-purpose sentence embedding models: gte-modernbert-base (GTE) and all-MiniLM-L12-v2, in combination with two downstream classifiers: XGBoost and multilayer perceptron. Feeding GTE embeddings to a multilayer perceptron achieved macro-F1 scores of 0.4776, 0.5294, and 0.6420 on the official test sets for Tracks 3, 4, and 5, respectively. While overall performance was modest, these results offer insights into the challenges of pedagogical response evaluation and establish a baseline for future improvements.
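For readers unfamiliar with the metric reported above, macro-F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how often it occurs. The following is a minimal sketch of that definition only; the function name and the example labels are hypothetical and are not drawn from the shared-task data or the authors' code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0  # no labels at all: define the score as 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treated as 0.0 when undefined
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only
y_true = ["yes", "no", "yes", "partly"]
y_pred = ["yes", "no", "no", "partly"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because each class contributes equally to the average, a model that performs poorly on a rare class is penalized as much as one that performs poorly on a frequent class.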

