Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions

Luisa Geiger, Mareike Hartmann, Michael Sullivan, Alexander Koller


Abstract
In this paper, we propose a novel, automatic tree-based evaluation metric for LLM-generated step-by-step assembly instructions that reflects the spatiotemporal aspects of construction more accurately than traditional metrics such as BLEU and BERT similarity scores. We apply the metric to the domain of sewing instructions and show that it correlates better with manually annotated error counts than these traditional metrics, demonstrating its advantage for evaluating the spatiotemporal soundness of sewing instructions. Further experiments show that our metric is more robust than traditional approaches against counterfactual examples constructed specifically to confound metrics that rely on textual similarity.
Anthology ID:
2025.emnlp-main.934
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
18519–18536
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.934/
Cite (ACL):
Luisa Geiger, Mareike Hartmann, Michael Sullivan, and Alexander Koller. 2025. Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18519–18536, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions (Geiger et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.934.pdf
Checklist:
 2025.emnlp-main.934.checklist.pdf