Tianyi Niu
2026
RotBench: Evaluating Multi-modal Large Language Models on Identifying Image Rotation
Tianyi Niu | Jaemin Cho | Elias Stengel-Eskin | Mohit Bansal
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information—including captions, depth maps, and more—or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270° rotated images. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models’ ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs’ spatial reasoning capabilities and human perception in identifying rotation.
2025
Chameleon LLMs: User Personas Influence Chatbot Personality Shifts
Jane Xing | Tianyi Niu | Shashank Srivastava
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
As large language models (LLMs) integrate into society, their ability to adapt to users is as critical as their accuracy. While prior work has used personality tests to examine the perceived personalities of LLMs, little research has explored whether LLMs adapt their perceived personalities in response to user interactions. We investigate whether and how LLMs exhibit conversational adaptations over prolonged interactions. Using controlled simulations in which a user and chatbot engage in dialogue, we measure the chatbot’s personality shift before and after the conversation. Across multiple models, we find that traits such as Agreeableness, Extraversion, and Conscientiousness are highly susceptible to user influence, whereas Emotional Stability and Intellect remain relatively more stable. Our results suggest that LLMs dynamically adjust their conversational style in response to user personas, raising important implications for AI alignment, trust, and safety.
Probing Neural Network Generalization using Default Patterns
Brandon Prickett | Tianyi Niu | Katya Pertsova
Proceedings of the 22nd SIGMORPHON workshop on Computational Morphology, Phonology, and Phonetics
Whether neural-net models can learn minority-default patterns has been a matter of some controversy. Results based on modeling real human language data are hard to interpret due to complexity. Therefore, we examine the learning of a simple artificial language pattern involving defaults using three computational models: an Encoder-Decoder RNN, a Transformer Encoder, and a Logistic Regression. Overall, we find that the models have the hardest time with minority defaults, but can eventually learn them and apply them to novel words (although they do not always extend them to completely novel segments or novel CV-sequences). Type frequency has the largest effect on learning in all models, trumping the effect of distribution. We examine the weights of two models to provide further insights into how defaults are represented inside the models.