Anna Klezovich
2026
MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
Anna Deichler | Jim O'Regan | Fethiye Irmak Dogan | Anna Klezovich | Lubos Marcinek | Iolanda Leite | Jonas Beskow
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing MM-Conv—speak, point, look—a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types, enabling systematic evaluation of multimodal reference resolution.
How Much Data Is Enough Data? A New Motion Capture Corpus for Probabilistic Sign Language Generation
Anna Klezovich | Johanna Mesch | Gustav Eje Henter | Jonas Beskow
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present STS Mocap v1, a new 4.1-hour, high-quality motion capture dataset for Swedish Sign Language. The dataset comprises multimodal recordings: body motion tracked with markers, finger motion captured with Manus Quantum Metagloves, facial motion recorded with the iPhone Live Link app in MetaHuman Animator mode, and corresponding sentence-level translations into spoken Swedish. Using this dataset, we show that four hours of motion capture data is sufficient for generative modeling of sign language conditioned on 2D pose. By contrast, training the same flow-matching model on only 30 minutes of the data, a size typical of sign language motion capture datasets, yields significantly degraded synthesis quality.
2024
Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data
Fredrik Malmberg | Anna Klezovich | Johanna Mesch | Jonas Beskow
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources