Neighboring Autoregressive Modeling for Efficient Visual Generation

1Zhejiang University,  2Shanghai AI Laboratory,  3The University of Adelaide
*Equal contribution    Work done during an internship at Shanghai AI Laboratory    Corresponding authors

Figure 1. Generated samples from NAR. Results are shown for 512 × 512 text-guided image generation (1st row), 256 × 256 class-conditional image generation (2nd row), and 128 × 128 class-conditional video generation (3rd row).

Abstract

Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens than with distant ones.

In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension.

During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the number of model forward steps required for generation. Experiments on ImageNet 256 × 256 and UCF-101 demonstrate that NAR achieves 2.4× and 8.6× higher throughput, respectively, while obtaining superior FID/FVD scores for both image and video generation compared to the PAR-4X approach. On the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4% of the training data.

Video

Method


Figure 2. Comparison of different autoregressive visual generation paradigms. The proposed NAR paradigm formulates the generation process as an outpainting procedure, progressively expanding the boundary of the decoded token region. This approach effectively preserves locality, as all tokens near the starting point are consistently decoded before the current token.
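
To make the near-to-far schedule concrete, here is a minimal Python sketch (illustrative only, not the released code; the function name and the choice of the top-left token as the starting point are our assumptions). Tokens are grouped by their Manhattan distance from the initial token; every group lies on the boundary of the already-decoded region and is generated in one step, so an H × W grid needs H + W - 1 steps instead of H × W.

import itertools

def nar_schedule(height, width, start=(0, 0)):
    """Group token coordinates by Manhattan distance from the start token.

    Illustrative sketch of the near-to-far decoding order, not the paper's
    implementation. Tokens in the same group border the already-decoded
    region and can be predicted in parallel, so an H x W grid takes
    (H - 1) + (W - 1) + 1 = H + W - 1 generation steps instead of H * W.
    """
    groups = {}
    for r, c in itertools.product(range(height), range(width)):
        d = abs(r - start[0]) + abs(c - start[1])
        groups.setdefault(d, []).append((r, c))
    return [groups[d] for d in sorted(groups)]

# Example: a 4 x 4 token grid is generated in 4 + 4 - 1 = 7 steps.
for step, coords in enumerate(nar_schedule(4, 4)):
    print(step, coords)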

(1) Image

An image can be considered two-dimensional, allowing decoding to proceed along two orthogonal dimensions: rows and columns.
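
Below is a minimal sketch of one decoding step in the image case (the helper name is illustrative, not from the released code): every decoded token on the boundary proposes its lower neighbor through the row-oriented head and its right neighbor through the column-oriented head, and the union of these proposals is predicted in parallel within a single forward pass.

def expand_frontier_2d(decoded, height, width):
    """One NAR decoding step on an image token grid (illustrative sketch).

    Each decoded token proposes its neighbor along two orthogonal
    dimensions; duplicate proposals from two parents are simply merged here,
    which is a simplification of the actual model. The resulting set is the
    next anti-diagonal, decoded in parallel in one forward pass.
    """
    proposals = set()
    for r, c in decoded:
        if r + 1 < height:
            proposals.add((r + 1, c))   # proposed by the row-oriented head
        if c + 1 < width:
            proposals.add((r, c + 1))   # proposed by the column-oriented head
    return proposals - decoded

# Starting from the top-left token, expand until the 4 x 4 grid is full.
decoded, steps = {(0, 0)}, 0
while len(decoded) < 4 * 4:
    decoded |= expand_frontier_2d(decoded, 4, 4)
    steps += 1
print(steps)  # 6 expansions, plus the initial token: 7 forward passes in total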


(2) Video

A video can be regarded as three-dimensional, adding a temporal dimension to an image, so decoding can proceed along three orthogonal dimensions: time, rows, and columns.
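
The same sketch extends to video (again illustrative, assuming the initial token sits at the top-left of the first frame): three dimension-oriented heads propose neighbors along the temporal, row, and column axes, so a T × H × W token volume is generated in T + H + W - 2 forward passes.

def expand_frontier_3d(decoded, frames, height, width):
    """One NAR decoding step on a video token volume (illustrative sketch).

    Three dimension-oriented heads propose the next token along the temporal,
    row, and column axes; their union (minus decoded positions) is predicted
    in parallel, so a T x H x W volume needs T + H + W - 2 forward passes.
    """
    proposals = set()
    for t, r, c in decoded:
        if t + 1 < frames:
            proposals.add((t + 1, r, c))  # temporal head
        if r + 1 < height:
            proposals.add((t, r + 1, c))  # row head
        if c + 1 < width:
            proposals.add((t, r, c + 1))  # column head
    return proposals - decoded

# Example: 4 frames of 8 x 8 tokens.
decoded, steps = {(0, 0, 0)}, 0
while len(decoded) < 4 * 8 * 8:
    decoded |= expand_frontier_3d(decoded, 4, 8, 8)
    steps += 1
print(steps)  # 17 expansions, plus the initial token: 4 + 8 + 8 - 2 = 18 forward passes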

Visualizations

Class-conditional image generation samples produced by NAR-XXL on ImageNet 256 × 256.


Results

Table 1. Quantitative evaluation on the ImageNet 256 × 256 benchmark. "Step" denotes the number of model forward passes required to generate an image. Throughput is measured with the maximum batch size supported on a single A100 GPU. Classifier-free guidance is set to 2 for our method. We also report the reconstruction FID (rFID) of each method's visual tokenizer, which serves as an upper bound on the generation FID. †: the model denoted as M shares the same hidden dimension as the L model but has 6 fewer layers.


Table 2. Comparison of class-conditional video generation methods on the UCF-101 benchmark. Classifier-free guidance is set to 1.25 for all variants of our method. †: the model denoted as LP shares the same hidden dimension as the XL model but has 6 fewer layers.


Table 3. Quantitative evaluation on the GenEval benchmark.


BibTeX

@article{he2025nar,
  title={Neighboring Autoregressive Modeling for Efficient Visual Generation},
  author={He, Yefei and He, Yuanyu and He, Shaoxuan and Chen, Feng and Zhou, Hong and Zhang, Kaipeng and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2503.10696},
  year={2025}
}