Tao Kong

2025

We introduce ProcWorld, a large-scale benchmark for partially observable embodied spatial reasoning and long-term planning with large language models (LLM) and vision language models (VLM). ProcWorld features a wide range of challenging embodied navigation and object manipulation tasks, covering 16 task types, 5,000 rooms, and over 10 million evaluation trajectories with diverse data distribution. ProcWorld supports configurable observation modes, ranging from text-only descriptions to vision-only observations. It enables text-based actions to control the agent following language instructions. ProcWorld has presented significant challenges for LLMs and VLMs: (1) active information gathering given partial observations for disambiguation; (2) simultaneous localization and decision-making by tracking the spatio-temporal state-action distribution; (3) constrained reasoning with dynamic states subject to physical reachability. Our extensive evaluation of 15 foundation models and 5 reasoning algorithms (with over 1 million rollouts) indicates larger models perform better. However, ProcWorld remains highly challenging for existing state-of-the-art models and in-context learning methods due to constrained reachability and the need for combinatorial spatial reasoning.

2024

Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in processing image inputs and following open-ended instructions. Despite these advancements, there is considerable scope for enhancing open-source multi-modal LLMs, especially in terms of multi-modal understanding accuracy and instruction-following proficiency. In this paper, we conduct a comprehensive study on training GPT4-style models. We introduce Lynx a multi-modal LLM developed through a series of controlled experiments comparing various model variants. This process allowed us to identify and implement an optimal training strategy tailored for multi-modal LLMs. In addition to our model development, we propose a plug-and-play technique designed to augment the instruction-following capabilities of multi-modal LLMs. We have validated the performance of Lynx on multiple benchmarks. Results demonstrate that Lynx not only achieves strong image understanding accuracy but also excels in instruction-following tasks, paving the path for ongoing enhancements in multi-modal LLMs.

2022

pdf bib abs
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng | Tao Kong | Ya Jing | Jiaan Wang | Xiaojie Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Reference Expression Generation (REG) and Comprehension (REC) are two highly correlated tasks. Modeling REG and REC simultaneously for utilizing the relation between them is a promising way to improve both. However, the problem of distinct inputs, as well as building connections between them in a single model, brings challenges to the design and training of the joint model. To address the problems, we propose a unified model for REG and REC, named UniRef. It unifies these two tasks with the carefully-designed Image-Region-Text Fusion layer (IRTF), which fuses the image, region and text via the image cross-attention and region cross-attention. Additionally, IRTF could generate pseudo input regions for the REC task to enable a uniform way for sharing the identical representation space across the REC and REG. We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train UniRef model on multi-granular corpora. The VMLM and TRP are directly related to REG and REC, respectively, but could help each other. We conduct extensive experiments on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg. Experimental results show that our model outperforms previous state-of-the-art methods on both REG and REC.