Maria Monica Manlises


2026

Phonics stories are essential for early literacy, requiring controlled repetition of grapheme-phoneme (GP) patterns while maintaining simplicity, suitability, and quality. Generating such texts poses a challenge for large language models (LLMs), which must balance multiple phonological and pedagogical constraints. We evaluate six LLMs in a zero-shot setting across 16 prompt configurations, producing 8,688 outputs and 39,096 stories. Outputs are assessed using a multi-dimensional framework covering phonological alignment, developmental lexical appropriateness, readability, and narrative quality. Results show that while LLMs generate highly readable and age-appropriate text, they vary in phoneme control and narrative coherence. Prompt design significantly affects performance, revealing trade-offs among phonological, linguistic, and pedagogical constraints, and model choice also leads to significant differences. These findings highlight the challenges of controllable educational text generation and the importance of prompt design in balancing instructional objectives. We release our prompts, generated stories, and evaluation framework to support future work in phonics-based story generation for early readers.

2025

We present our submission for Tracks 3 (Providing Guidance), 4 (Actionability), and 5 (Tutor Identification) of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-Powered Tutors. We investigate how well sentence embeddings of tutor responses perform when used directly as input to downstream classifiers, that is, without any fine-tuning. To this end, we benchmarked two general-purpose sentence embedding models, gte-modernbert-base (GTE) and all-MiniLM-L12-v2, in combination with two downstream classifiers: XGBoost and a multilayer perceptron. Feeding GTE embeddings to a multilayer perceptron achieved macro-F1 scores of 0.4776, 0.5294, and 0.6420 on the official test sets for Tracks 3, 4, and 5, respectively. While overall performance was modest, these results offer insights into the challenges of pedagogical response evaluation and establish a baseline for future improvements.
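
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how macro-F1 can be computed: the F1 score is calculated separately for each class and the per-class scores are averaged with equal weight. The function name `macro_f1` and the example labels are illustrative assumptions, not taken from the submission or from the shared task's official scorer.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    # Consider every label that appears in either the gold or predicted list.
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical three-class example (labels are assumptions for illustration).
gold = ["Yes", "Yes", "No", "To some extent"]
pred = ["Yes", "No", "No", "To some extent"]
print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally to the average, a classifier that performs poorly on rare classes receives a low macro-F1 even when its overall accuracy is high.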