Shlok Gilda

2026

Position: Evaluation Scores Are Perishable Knowledge Claims
Sankalp Gilda | Shlok Gilda
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)

Evaluation methodologies for language models increasingly combine multiple signals—automated metrics, LLM-as-judge ratings, human assessments, and benchmark suite results. When these signals are aggregated via averaging, the resulting evaluation confidence can substantially exceed the reliability of the weakest signal: a phenomenon we call trust inflation in evaluation. We argue that evaluation scores should be treated as epistemic claims with three properties: formality (human evaluation provides stronger evidence than an automated metric), scope (a benchmark result applies to the tested distribution, not universally), and validity windows (benchmark results expire as contamination accumulates and distributions shift). Drawing on several converging research traditions—chain-of-thought analysis, possibilistic logic, and algebraic theory—that establish weakest-link aggregation as the conservative endpoint of a parameterized operator family controlled by a single pessimism parameter, and on concrete lessons from building an evaluation harness for agentic AI, we propose that evaluation results carry explicit metadata—formality tier, scope declaration, and expiration date—to make their epistemic status transparent. We illustrate the cost of mean aggregation on the public HELM leaderboard: across 54 frontier models on ten scenarios, the top-five models ranked by mean score and by weakest-link are completely disjoint.
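The gap between mean and weakest-link aggregation described in this abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `aggregate` function, its linear blend between mean and minimum, the `pessimism` argument, and the scores are hypothetical. It is not the operator family or the HELM data from the paper.

```python
from statistics import mean


def aggregate(scores, pessimism):
    """Hypothetical blend of mean and minimum (not the paper's operator).

    pessimism = 0.0 returns the plain mean; pessimism = 1.0 returns the
    weakest signal (weakest-link aggregation).
    """
    if not 0.0 <= pessimism <= 1.0:
        raise ValueError("pessimism must be in [0, 1]")
    return (1.0 - pessimism) * mean(scores) + pessimism * min(scores)


# Made-up per-signal scores for two hypothetical models.
models = {
    "model_a": [0.9, 0.9, 0.3],  # strong on two signals, weak on one
    "model_b": [0.6, 0.6, 0.6],  # uniformly moderate
}


def ranking(pessimism):
    """Return model names ordered from best to worst at a given pessimism."""
    return sorted(models, key=lambda m: aggregate(models[m], pessimism), reverse=True)


by_mean = ranking(0.0)          # ordering under mean aggregation
by_weakest_link = ranking(1.0)  # ordering under weakest-link aggregation
```

With these made-up numbers the mean favors `model_a` while weakest-link favors `model_b`, which is the kind of ranking disagreement the abstract reports at larger scale.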

Children’s English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety
Qian Shen | Fanghua Cao | Min Yao | Shlok Gilda | Bonnie Dorr | Walter Leite
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Large Language Models (LLMs) are widely applied in educational practice, for example to generate children’s stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children’s reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children’s English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be used more broadly by teachers, parents, and children in classrooms and at home to generate engaging English reading stories that reflect children’s interests, with controllable difficulty and safety.
