2023
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering
Fangyu Liu | Francesco Piccinno | Syrine Krichene | Chenxi Pang | Kenton Lee | Mandar Joshi | Yasemin Altun | Nigel Collier | Julian Eisenschlos
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models’ capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.
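For readers curious how such a pretraining mixture could be assembled, the snippet below is a minimal, purely illustrative sketch of sampling chart-derendering and math-reasoning image-to-text examples from a weighted mixture; the task names, weights, file names, and record format are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch only: a MatCha-style pretraining mixture of
# chart-derendering and math-reasoning image-to-text examples.
# All names, weights, and record formats below are assumptions.
import random

def chart_derendering_example():
    # (chart image, underlying data table rendered as text) pair
    return {"image": "chart_0001.png",
            "target": "year | sales\n2020 | 12\n2021 | 18"}

def math_reasoning_example():
    # (rendered math question image, numeric answer) pair
    return {"image": "math_0001.png", "target": "30"}

TASKS = [(chart_derendering_example, 0.6),
         (math_reasoning_example, 0.4)]  # mixture weights are illustrative

def sample_batch(batch_size=8, seed=0):
    rng = random.Random(seed)
    fns, weights = zip(*TASKS)
    return [rng.choices(fns, weights=weights, k=1)[0]()
            for _ in range(batch_size)]

if __name__ == "__main__":
    for ex in sample_batch(4):
        print(ex["image"], "->", ex["target"].splitlines()[0])
```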
DePlot: One-shot visual language reasoning by plot-to-table translation
Fangyu Liu | Julian Eisenschlos | Francesco Piccinno | Syrine Krichene | Chenxi Pang | Kenton Lee | Mandar Joshi | Wenhu Chen | Nigel Collier | Yasemin Altun
Findings of the Association for Computational Linguistics: ACL 2023
Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples, and their reasoning capabilities are still quite limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key to this method is a modality conversion module, named DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on thousands of data points, DePlot+LLM with just one-shot prompting achieves a 29.4% improvement over the finetuned SOTA on human-written queries from the chart QA task.
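As an illustration of the plug-and-play pipeline the abstract describes, the sketch below wires a plot-to-table step into an LLM prompt; `plot_to_table` and `ask_llm` are hypothetical stand-ins for DePlot and a pretrained LLM, not the authors' released code, and the table contents are made up for the example.

```python
# Illustrative sketch only: the two-step DePlot+LLM pipeline from the abstract.
# The helper functions are hypothetical stand-ins, not a real model or API.

def plot_to_table(image_path: str) -> str:
    """Stand-in for DePlot: translate a chart image into a linearized table."""
    # A real implementation would run an image-to-text model here.
    return "year | revenue\n2021 | 10\n2022 | 14\n2023 | 21"

def ask_llm(prompt: str) -> str:
    """Stand-in for a pretrained LLM queried with one-shot prompting."""
    # A real implementation would call an LLM here.
    return "Revenue grew by 7 between 2022 and 2023."

def answer_chart_question(image_path: str, question: str) -> str:
    table = plot_to_table(image_path)           # step 1: plot-to-table translation
    prompt = (f"Table:\n{table}\n\n"
              f"Question: {question}\n"
              "Answer by reasoning over the table:")
    return ask_llm(prompt)                       # step 2: reasoning over text

if __name__ == "__main__":
    print(answer_chart_question("chart.png", "How much did revenue grow last year?"))
```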
2016
A Constituent Syntactic Parse Tree Based Discourse Parser
Zhongyi Li | Hai Zhao | Chenxi Pang | Lili Wang | Huan Wang
Proceedings of the CoNLL-16 shared task