VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

Xianwei Zhuang*, Yuxin Xie*, Yufan Deng*, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
Peking University
xwzhuang@stu.pku.edu.cn

Abilities

Comparative analysis

A comparative analysis of various MLLMs across multiple visual comprehension and generation benchmarks is presented. The CLIP score is employed as the text-to-image visual generation metric, while the remaining metrics are derived from standard visual question-answering benchmarks and multimodal comprehension benchmarks. Notably, our VARGPT model demonstrates significant superiority over the compared baselines across all comprehension benchmarks. Furthermore, it exhibits exceptional instruction-to-image generation capabilities, enhancing its versatility and applicability in diverse visual-linguistic tasks.

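For reference, a CLIP score of this kind can be computed as the cosine similarity between CLIP image and text embeddings. Below is a minimal sketch using the Hugging Face transformers CLIP model; it is illustrative only and not necessarily the exact evaluation script used for VARGPT (the checkpoint name and file path are assumptions).

```python
# Minimal sketch: CLIP score as cosine similarity between CLIP image and
# text embeddings. Illustrative only; not the official VARGPT evaluation code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, prompt: str) -> float:
    """Return the cosine similarity between the image and prompt embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()

# Hypothetical usage: clip_score("generated.png", "a photo of a golden retriever")
```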

Generated samples

Some 256×256 samples generated by VARGPT trained on ImageNet. VARGPT supports text-and-image instructions from the user and outputs mixed text-and-image data simultaneously.

Abstract

We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. VARGPT innovatively extends the LLaVA architecture, achieving efficient scale-wise autoregressive visual generation within MLLMs while seamlessly accommodating mixed-modal input and output in a single model framework. VARGPT undergoes a three-stage unified training process on specially curated datasets, comprising a pre-training phase and two mixed visual instruction-tuning phases. These stages are designed, respectively, to align visual and textual features, to enhance instruction following for both understanding and generation, and to improve visual generation quality. Despite its LLaVA-based architecture for multimodal understanding, VARGPT significantly outperforms LLaVA-1.5 across various vision-centric benchmarks, such as visual question-answering and reasoning tasks. Notably, VARGPT naturally supports autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks.

Video

Methodology

Model Architecture

An illustration of the proposed VARGPT framework, which consists of (1) a large language model, a visual encoder, and an understanding projector for visual understanding; and (2) a visual decoder and dual generation projectors for visual generation. VARGPT employs causal attention in the LLM backbone and block causal attention in the visual decoder.

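The following is a minimal PyTorch-style sketch of how these components could be wired together. Module choices, names, and dimensions are illustrative assumptions for exposition, not the released VARGPT implementation; in particular, simple linear and transformer-encoder layers stand in for the actual visual encoder, causal LLM backbone, and block-causal visual decoder.

```python
# Minimal sketch of the component layout described above. All modules are
# lightweight stand-ins; names and dimensions are assumptions, not the
# released VARGPT implementation.
import torch
from torch import nn

class VARGPTSketch(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, dec_dim=1536):
        super().__init__()
        # Understanding path: visual encoder -> understanding projector -> LLM.
        self.visual_encoder = nn.Linear(3 * 14 * 14, vis_dim)       # stand-in for a ViT patch encoder
        self.understanding_projector = nn.Linear(vis_dim, llm_dim)  # maps image features into the LLM token space
        self.llm = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)  # stand-in for the causal LLM backbone
        # Generation path: dual projectors bridge the LLM and a visual decoder
        # that performs scale-wise (next-scale) autoregressive prediction.
        self.gen_in_projector = nn.Linear(llm_dim, dec_dim)
        self.gen_out_projector = nn.Linear(dec_dim, llm_dim)
        self.visual_decoder = nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True)  # block-causal in the real model

    def forward(self, image_patches, text_embeds):
        # Visual understanding: project image features and prepend them to the text tokens.
        vis_tokens = self.understanding_projector(self.visual_encoder(image_patches))
        hidden = self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
        # Visual generation: route LLM hidden states through the generation projectors
        # and the visual decoder, then map the result back into the LLM embedding space.
        gen_hidden = self.visual_decoder(self.gen_in_projector(hidden))
        return hidden, self.gen_out_projector(gen_hidden)

# Hypothetical shapes: a batch of 196 flattened 14x14 RGB patches and 32 text embeddings.
hidden, gen = VARGPTSketch()(torch.randn(1, 196, 3 * 14 * 14), torch.randn(1, 32, 4096))
```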

Training

The three-stage training pipeline of VARGPT, comprising stage-1 pretraining followed by stage-2 and stage-3 instruction fine-tuning.

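To make the schedule concrete, here is a hedged sketch of the three stages and their respective goals as stated above. The dictionary structure and field names are illustrative assumptions, not the official training configuration.

```python
# Hedged sketch of the three-stage schedule. Stage goals follow the paper
# text; the structure itself is an illustrative assumption, not the
# official training configuration.
TRAINING_STAGES = [
    {"name": "stage-1 pretraining",
     "goal": "align visual and textual features"},
    {"name": "stage-2 instruction fine-tuning",
     "goal": "enhance instruction following for understanding and generation",
     "data": ["LLaVA-1.5", "LLaVA-OneVision", "ImageNet-Instruct-130K"]},
    {"name": "stage-3 instruction fine-tuning",
     "goal": "improve visual generation quality"},
]

for stage in TRAINING_STAGES:
    print(f"{stage['name']}: {stage['goal']}")
```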

Data

We present the distribution of the data we constructed and collected, encompassing: (a) the proportional breakdown of data across the three training stages; and (b) the distribution of the mixed instruction data employed during the stage-2 instruction fine-tuning phase. Our composite dataset for stage-2 training is derived from LLaVA-1.5, LLaVA-OneVision, and ImageNet-Instruct-130K.

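As a rough illustration of how the stage-2 mixture could be assembled from the three sources above, here is a hedged sketch; the file paths and JSON record schema are hypothetical, and the released data pipeline may differ.

```python
# Hedged sketch: building a mixed stage-2 instruction set from the three
# sources named above. Paths and the JSON schema are hypothetical.
import json
import random

SOURCES = {
    "llava-1.5": "data/llava_v1_5_mix.json",                       # hypothetical path
    "llava-onevision": "data/llava_onevision.json",                # hypothetical path
    "imagenet-instruct-130k": "data/imagenet_instruct_130k.json",  # hypothetical path
}

def build_stage2_mixture(seed: int = 0) -> list:
    """Concatenate all instruction samples, tag each with its source, then shuffle."""
    samples = []
    for source, path in SOURCES.items():
        with open(path, "r", encoding="utf-8") as f:
            for record in json.load(f):
                record["source"] = source
                samples.append(record)
    random.Random(seed).shuffle(samples)
    return samples
```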

Experiments

Zero-shot multi-modal evaluation

Zero-shot multimodal evaluation on multimodal benchmarks including MMMU, MME, MMBench, SEEDBench, and POPE (under its random, popular, and adversarial settings). Overall scores are reported; for MMBench we report test-set results. Gen indicates whether a method supports image generation. VARGPT achieves the best overall performance.

Performance comparison on visual question answering tasks

Performance comparison on visual question answering tasks. Models that were trained on a given dataset are grayed out for that benchmark. Gen indicates whether a method supports image generation.

More samples

Visual understanding samples

These visual understanding cases show that VARGPT achieves superior understanding performance.

Generated samples

Some 256×256 samples generated by VARGPT trained on ImageNet. VARGPT supports text instructions from the user and outputs both text and image modalities simultaneously.

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, as well as the open-source projects Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.