Shubhojit Mallick

2026

Recent work, such as TripCraft and TravelPlanner, has shown the promise of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. However, real-world travel often involves disruptions such as transit cancellations, weather-related closures, or overbooked attractions. To address this gap, we introduce **TripTide**, the first benchmark designed to evaluate the ability of LLMs to revise travel itineraries under realistic disruptions. TripTide models both disruption severity and traveler tolerance, enabling systematic evaluation of how LLMs respond to unexpected travel events. The benchmark simulates scenarios where existing itineraries must be revised while preserving the traveler’s original intent and respecting practical constraints. We conduct a three-fold evaluation of itinerary revision quality: (i) automatic metrics measuring *Preservation of Intent*, *Responsiveness*, and *Adaptability* (semantic, spatial, and sequential), (ii) LLM-as-a-Judge evaluation assessing the quality and plausibility of revised itineraries, and (iii) human evaluation examining overall revision quality and user satisfaction. Our findings show that LLMs generally preserve semantic intent and sequential structure, while spatial deviations are more pronounced in shorter itineraries and diminish for longer ones. However, the ability to handle disruptions degrades as itinerary length increases, highlighting limitations in long-horizon itinerary revision. The TripTide benchmark provides a foundation for systematically evaluating robustness and adaptability in LLM-based travel planning systems.

2025


TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning
Soumyabrata Chaudhuri | Pranav Purkar | Ritwik Raghav | Shubhojit Mallick | Manish Gupta | Abhik Jana | Shreya Ghosh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in probing Large Language Models (LLMs) have explored their latent potential as personalized travel planning agents, though this remains a rather nascent field. Existing benchmarks, such as TravelPlanner and TravelPlanner+, rely on semi-synthetic data and ignore several key components of travel planning, limiting their real-world applicability. Therefore, we introduce TripCraft, a spatio-temporally coherent travel planning dataset incorporating real-world constraints, including public transit schedules, public events, varied attraction categories, and user personas for enhanced personalization. Our dataset enables more detailed trip itinerary generation (including duration spent at each point of interest based on the user’s persona, transit between two points of interest, etc.) while ensuring spatio-temporal consistency. Further, we propose novel evaluation metrics (temporal meal score, attraction score, spatial score, ordering score, and persona score) to assess LLM-generated plans across temporal, spatial, sequential, and personal dimensions, overcoming the limitations of commonsense and hard constraint metrics. Interestingly, our parameter-informed setting significantly enhances meal scheduling, improving our temporal meal score from 61% to 80% in the 7-day scenario, a gain of 19 percentage points. Moreover, TripCraft serves as a high-quality benchmark for advancing personalized LLM-driven travel planning.

Co-authors

Pranav Purkar 1

Ritwik Raghav 1

Venues

ACL 1
Findings 1
