Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens correlate significantly more strongly with their spatially or temporally adjacent tokens than with distant ones.
In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension.
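The near-to-far ordering can be illustrated with a minimal sketch. It assumes the initial token sits at the origin (top-left corner) of the token grid, which the text above does not specify, and the function name `manhattan_schedule` is hypothetical rather than taken from the NAR codebase. Tokens that share the same Manhattan distance form one group, and each group would be predicted in a single forward step.

```python
from collections import defaultdict
from itertools import product


def manhattan_schedule(shape, start=None):
    """Group grid positions by Manhattan distance from `start`.

    shape: grid size per dimension, e.g. (rows, cols) or (frames, rows, cols).
    start: position of the initial token; assumed to be the origin if omitted.
    Returns a list of groups; group d holds every position at distance d.
    """
    if start is None:
        start = (0,) * len(shape)
    groups = defaultdict(list)
    for pos in product(*(range(n) for n in shape)):
        dist = sum(abs(p - s) for p, s in zip(pos, start))
        groups[dist].append(pos)
    # Ascending distance = decoding order; each group is one parallel step.
    return [groups[d] for d in sorted(groups)]


schedule = manhattan_schedule((3, 3))
for step, group in enumerate(schedule):
    print(step, group)
```

On this illustrative 3 x 3 grid the decoded region grows along anti-diagonals, so the nine tokens are covered in five groups instead of nine sequential predictions.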
During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the number of model forward steps required for generation. Experiments on ImageNet 256 × 256 and UCF101 demonstrate that NAR achieves 2.4× and 8.6× higher throughput, respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. On the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4% of the training data.
An image can be considered two-dimensional, allowing decoding to proceed along two orthogonal dimensions: rows and columns.
A video can be regarded as three-dimensional, adding a temporal dimension to images, so decoding can be performed along three orthogonal dimensions: time, rows, and columns.
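To make the effect of multi-dimensional decoding concrete, the sketch below counts forward steps for a token grid. It assumes the initial token sits at a corner of the grid and that every token at the same Manhattan distance is decoded in one step; under that assumption the step count is the number of distinct distances, i.e. the sum of (size - 1) over all dimensions, plus one. The helper name `step_counts` and the grid sizes are illustrative and are not the paper's settings.

```python
from math import prod


def step_counts(shape):
    """Return (raster_steps, neighbor_steps) for a token grid of `shape`.

    raster_steps: one token per forward step, so the total number of tokens.
    neighbor_steps: one step per distinct Manhattan distance from a corner.
    """
    raster_steps = prod(shape)
    neighbor_steps = sum(n - 1 for n in shape) + 1
    return raster_steps, neighbor_steps


print(step_counts((16, 16)))     # (256, 31)
print(step_counts((4, 16, 16)))  # (1024, 34)
```

Fewer forward steps do not translate one-to-one into throughput, since each parallel step processes more tokens; the counts only show how the number of sequential steps scales with grid size.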
Class-conditional image generation samples produced by NAR-XXL on ImageNet 256 × 256.
@article{he2025nar,
title={Neighboring Autoregressive Modeling for Efficient Visual Generation},
author={He, Yefei and He, Yuanyu and He, Shaoxuan and Chen, Feng and Zhou, Hong and Zhang, Kaipeng and Zhuang, Bohan},
journal={arXiv preprint arXiv:2503.10696},
year={2025}
}