Ruoxi Ning


2025

From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens
Hala Sheta | Eric Haoran Huang | Shuyu Wu | Ilia Alenabi | Jiajun Hong | Ryker Lin | Ruoxi Ning | Daniel Wei | Jialin Yang | Jiawei Zhou | Ziqiao Ma | Freda Shi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and over 30 of their variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-source project to accelerate community efforts in understanding and improving VLMs.
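
As a rough illustration of the kind of workflow the toolkit automates, the sketch below loads a small YAML configuration and extracts the hidden states of one decoder layer from an open-source VLM during a single forward pass. The configuration keys, model checkpoint, and file paths are illustrative assumptions, not VLM-Lens's actual interface.

```python
# Minimal sketch of layer-wise hidden-state extraction from an open-source VLM,
# in the spirit of what VLM-Lens automates behind its YAML interface.
# Config keys, checkpoint, and paths below are assumptions for illustration.
import torch
import yaml
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

config = yaml.safe_load("""
model_name: llava-hf/llava-1.5-7b-hf   # hypothetical choice of VLM
layer_index: 16                        # which decoder layer to inspect
image_path: example.jpg                # assumed local image file
""")

processor = AutoProcessor.from_pretrained(config["model_name"])
model = LlavaForConditionalGeneration.from_pretrained(
    config["model_name"], torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(config["image_path"])
prompt = "USER: <image>\nWhat object is in this picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states holds one tensor per layer (plus the embeddings), each of
# shape (batch, sequence_length, hidden_dim); pick the configured layer.
layer_repr = outputs.hidden_states[config["layer_index"]]
torch.save(layer_repr.cpu(), f"layer_{config['layer_index']}.pt")
print(layer_repr.shape)
```

The saved tensor can then be fed to whatever probing or similarity analysis one prefers; the toolkit's value is in providing this extraction step uniformly across many VLM architectures.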

2024

TAIL: A Toolkit for Automatic and Realistic Long-Context Large Language Model Evaluation
Gefei Gu | Yilun Zhao | Ruoxi Ning | Yanan Zheng | Arman Cohan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

As long-context large language models (LLMs) are attracting increasing attention for their ability to handle context windows exceeding 128k tokens, the need for effective evaluation methods for these models becomes critical. Existing evaluation methods, however, fall short: needle-in-a-haystack (NIAH) and its variants are overly simplistic, while creating realistic benchmarks is prohibitively expensive due to extensive human annotation requirements. To bridge this gap, we propose TAIL, an automatic toolkit for creating realistic evaluation benchmarks and assessing the performance of long-context LLMs. With TAIL, users can customize the building of a long-context, document-grounded QA benchmark and obtain visualized performance metrics of evaluated models. TAIL has the advantage of requiring minimal human annotation and generating natural questions based on user-provided long-context documents. We apply TAIL to construct a benchmark encompassing multiple expert domains, such as finance, law, patent, and scientific literature. We then evaluate four state-of-the-art long-context LLMs using this benchmark. Results show that all LLMs experience varying degrees of performance degradation as context lengths increase.
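
The sketch below gives a rough, hypothetical picture of the benchmark-construction loop such a toolkit automates: a generator model writes a question grounded in a slice of a user-provided document, and the model under evaluation is then queried with the full long context. The model names, prompts, and file names are placeholders and do not reflect TAIL's actual implementation.

```python
# Rough sketch of automatic, document-grounded QA benchmark construction.
# Generator and evaluated model names are placeholders, not TAIL's choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa(passage: str) -> str:
    """Ask a generator model to write one natural, passage-grounded QA pair."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical generator model
        messages=[{
            "role": "user",
            "content": "Write one question that can only be answered from the "
                       "passage below, followed by its answer.\n\n" + passage,
        }],
    )
    return resp.choices[0].message.content

def evaluate(long_document: str, question: str) -> str:
    """Query the model under evaluation with the full long context."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical long-context model under test
        messages=[{
            "role": "user",
            "content": long_document + "\n\nQuestion: " + question,
        }],
    )
    return resp.choices[0].message.content

with open("annual_report.txt") as f:  # assumed user-provided long document
    document = f.read()

# Ground the question in a slice taken from deep inside the document, so that
# answering it requires using that position of the context.
passage = document[200_000:202_000]
print(generate_qa(passage))
```

Repeating this over slices at different depths yields questions whose supporting evidence sits at controlled positions in the context, which is what allows the kind of length-dependent degradation analysis described above.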

2022

Challenges to Open-Domain Constituency Parsing
Sen Yang | Leyang Cui | Ruoxi Ning | Di Wu | Yue Zhang
Findings of the Association for Computational Linguistics: ACL 2022

Neural constituency parsers have reached practical performance on news-domain benchmarks. However, their generalization ability to other domains remains weak. Existing findings on cross-domain constituency parsing cover only a limited number of domains. To address this, we manually annotate a high-quality constituency treebank containing five domains. We analyze challenges to open-domain constituency parsing using a set of linguistic features on various strong constituency parsers. Primarily, we find that 1) BERT significantly increases parsers’ cross-domain performance by reducing their sensitivity to domain-variant features, and 2) compared with single metrics such as unigram distribution and OOV rate, challenges to open-domain constituency parsing arise from complex features, including cross-domain lexical and constituent structure variations.
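
For readers unfamiliar with the "single metrics" referenced above, the toy sketch below computes an out-of-vocabulary (OOV) rate of target-domain text relative to a source-domain vocabulary; the token lists are invented examples, not data from the annotated treebank.

```python
# Toy illustration of OOV rate as a single cross-domain difficulty metric.
def oov_rate(train_tokens, target_tokens):
    """Fraction of target-domain tokens absent from the training vocabulary."""
    vocab = set(train_tokens)
    unseen = [t for t in target_tokens if t not in vocab]
    return len(unseen) / len(target_tokens)

news = "the stock market closed higher on friday".split()    # source domain
forum = "lol the mods nuked my thread again".split()          # target domain
print(f"OOV rate: {oov_rate(news, forum):.2f}")
```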