Robert Stevens


2026

In this work, we propose an entity type-specific and knowledge-augmented token classification framework designed to improve encoder models’ performance on recipe texts. Our empirical analysis shows that this approach achieves state-of-the-art (SOTA) results on 5 out of 7 benchmark recipe datasets, significantly outperforming traditional token classification methods. We introduce a novel methodology that leverages curated domain-specific knowledge contexts to guide encoder models such as BERT and RoBERTa, which we refer to as RecipeBERT-KA and RecipeRoBERTa-KA. Additionally, we release a newly reprocessed entity type-specific and knowledge-enriched dataset that merges seven widely used food datasets, making it the largest annotated food-related dataset to date. Comparative analysis with SOTA large language models (GPT-4o, Mistral-7B, LLaMA 3-13B, and LLaMA 3-70B) highlights the practical advantages of our smaller, specialised models. Finally, we analyse the impact of the different knowledge contexts, our models’ potential for transfer learning, the effect of combining the datasets, and scenarios where traditional token classification may still perform competitively, offering nuanced insight into method selection.
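As a hedged illustration of the knowledge-augmented setup (a minimal sketch, not the paper's exact pipeline): the snippet below prepends a curated knowledge context for a target entity type as a second input segment, runs a HuggingFace-style encoder token classifier, and reads predictions only for the recipe tokens. The base model name, label set, and knowledge sentence are illustrative assumptions.

```python
# Sketch: knowledge-augmented token classification for recipe text.
# Assumptions: "roberta-base" as a stand-in encoder, a toy FOOD label set,
# and an invented knowledge context; the paper's exact setup may differ.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "roberta-base"           # stand-in; the paper fine-tunes BERT/RoBERTa variants
LABELS = ["O", "B-FOOD", "I-FOOD"]    # illustrative subset of entity types

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))
model.eval()

# Curated domain knowledge for the target entity type, paired with the input.
knowledge_words = "Food entities are edible ingredients such as flour, eggs, or basil.".split()
recipe_tokens = ["Whisk", "the", "eggs", "with", "a", "fork", "."]

# Encode knowledge and recipe as a sentence pair; the context only conditions
# the encoder, while predictions are taken from the recipe segment.
enc = tokenizer(knowledge_words, recipe_tokens, is_split_into_words=True,
                return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**enc).logits[0]

# Report one prediction per recipe word (first subword piece of segment 1).
prev = None
for pos, (wid, sid) in enumerate(zip(enc.word_ids(0), enc.sequence_ids(0))):
    if sid == 1 and wid is not None and wid != prev:
        print(recipe_tokens[wid], LABELS[logits[pos].argmax().item()])
        prev = wid
```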
We introduce ReciFine, the largest human-evaluated, finely annotated recipe dataset to date, designed to advance controllable and trustworthy recipe generation. Existing resources, such as RecipeNLG, extract food items only from ingredient lists, overlooking entities expressed in instructions, including tools, chef actions, food and tool states, and durations, which are crucial for realistic and context-aware generation. To address this limitation, we extend RecipeNLG with fine-grained annotations covering over 97 million entities across ten entity types from 2.2 million recipes. We are the first to explore recipe generation with explicit control over multiple entity types, enabling models to generate recipes conditioned not only on ingredients but also on tools, chef actions, cooking durations, and other contextual factors. Large language models fine-tuned or few-shot prompted with ReciFine extractions consistently outperform those trained on ingredient-list data alone in both automatic and human evaluations. ReciFine establishes a foundation for factual, coherent, structured, and controllable recipe generation, and we release a human-annotated benchmark to support future evaluation and model development.
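To make the entity-conditioned generation setting concrete, the sketch below builds a generation prompt from per-entity-type constraints. The entity-type names, prompt wording, and helper function are illustrative assumptions; ReciFine's actual schema and prompting setup may differ.

```python
# Sketch: constructing an entity-conditioned prompt for controllable recipe
# generation. build_prompt and the constraint keys are hypothetical examples,
# not the dataset's actual interface.
def build_prompt(constraints: dict[str, list[str]]) -> str:
    """Turn per-entity-type constraints into a single generation prompt."""
    lines = ["Write a complete recipe satisfying all constraints below."]
    for entity_type, values in constraints.items():
        lines.append(f"{entity_type}: {', '.join(values)}")
    lines.append("Recipe:")
    return "\n".join(lines)

prompt = build_prompt({
    "Ingredients": ["eggs", "flour", "milk"],
    "Tools": ["whisk", "non-stick pan"],
    "Chef actions": ["whisk", "fry", "flip"],
    "Durations": ["2 minutes per side"],
})
print(prompt)  # feed to a fine-tuned or few-shot-prompted LLM
```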
