
HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models

Authors: Jizhihui Liu*, Feiyi Du*, Guangdao Zhu*, Niu Lian, Jun Li, Bin Chen†, Weili Guan, Yaowei Wang

Venue: AAAI-26 (Student Abstract) Oral / (Submitted to ACL-26)


Vision-Language Models (VLMs) like LLaVA and Qwen have shown incredible capabilities in multimodal tasks. However, they come with a significant bottleneck: efficiency. VLMs encode images into extremely long sequences of visual tokens, leading to massive computational overhead and slow inference speeds.

But do we really need all those tokens?

In our recent paper, we demonstrate that visual tokens are highly redundant. To address this, we introduce HiPrune, a training-free, model-agnostic token pruning framework that dramatically accelerates VLM inference without sacrificing performance.

Figure 1
Figure 1: Redundancy analyses on visual tokens. (a) The sum of cosine similarities between each token and its 10 neighbouring tokens. (b) The performance of LLaVA-1.5-7B when randomly removing 50% of visual tokens or 5% of text tokens.
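To make the redundancy statistic in Figure 1(a) concrete, here is a minimal sketch of how such a neighbour-similarity score could be computed from visual token embeddings. The function name and the symmetric window handling at the sequence borders are our own illustrative choices, not the paper's exact implementation.

```python
# Sketch of the Figure 1(a) statistic: for each visual token, sum its cosine
# similarity with the 10 neighbouring tokens in the sequence (illustrative).
import torch
import torch.nn.functional as F

def neighbour_similarity(tokens, window=10):
    """tokens: (N, D) visual token embeddings; returns an (N,) redundancy score."""
    normed = F.normalize(tokens, dim=-1)
    sims = normed @ normed.T                      # (N, N) pairwise cosine similarity
    n = tokens.size(0)
    scores = torch.zeros(n)
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        idx = [j for j in range(lo, hi) if j != i]
        scores[i] = sims[i, idx].sum()
    return scores
```

High scores indicate tokens whose neighbours carry nearly the same information, which is exactly the redundancy that pruning can exploit.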

The Key Insight: Hierarchical Attention in Vision Encoders

Most previous pruning methods either require expensive retraining or rely on specific architecture features like the [CLS] token (which modern models like SigLIP lack).

Instead of looking at the LLM's cross-attention, we investigated the internal structure of the vision encoder itself. We discovered a universal hierarchical attention pattern across different ViT-based encoders (CLIP, SigLIP, DeiT, etc.):

  • Shallow layers: Often retain low-level noise.
  • Middle layers: Highly object-centric, naturally focusing on the semantic subjects of the image.
  • Deep layers: Distribute attention uniformly to capture global context.
Figure 2
Figure 2: Attention maps for different layers of SigLIP and CLIP. Patches with higher scores are shown in yellow. We can see that the middle layers are more object-centric.
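To see this pattern yourself, the sketch below pulls per-layer attention maps out of a Hugging Face CLIP vision encoder and prints the most-attended patches at a shallow, a middle, and a deep layer. The checkpoint, the chosen layer indices, and the head-averaged scoring are illustrative assumptions rather than the paper's exact setup.

```python
# Inspect the layer-wise attention pattern of a CLIP vision encoder (sketch).
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14-336"
model = CLIPVisionModel.from_pretrained(
    name,
    attn_implementation="eager",  # ensures attention weights are returned
)
processor = CLIPImageProcessor.from_pretrained(name)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer; index 0 is [CLS].
for layer_idx in (2, 12, 22):  # a shallow, a middle, and a deep layer
    attn = out.attentions[layer_idx].mean(dim=1)[0]   # average over heads
    received = attn[:, 1:].mean(dim=0)                # attention each patch receives
    print(f"layer {layer_idx}: most-attended patches {received.topk(10).indices.tolist()}")
```

With this kind of probe, the middle layers concentrate attention on a few object patches, while the deepest layers spread it far more uniformly.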

How HiPrune Works

Motivated by this hierarchical pattern, HiPrune selectively preserves a compact subset of tokens that collectively retain both fine-grained details and global information. It requires zero retraining and alters nothing about the original VLM structure.

We select three specific types of tokens:

  1. Anchor Tokens: Tokens with the highest attention scores in the middle layers (capturing main objects).
  2. Buffer Tokens: Tokens spatially adjacent to Anchors (preserving spatial continuity and mitigating attention noise).
  3. Register Tokens: Tokens with high attention in the deepest/output layer (providing global contextual summaries).
Figure 3
Figure 3: An overview of HiPrune. We first select anchor tokens that receive the highest attention in layer $l$ and record buffer tokens around them. We then select register tokens directly based on the last-layer attention over the remaining tokens. HiPrune only accesses the attention in the vision encoder while keeping the LLM unchanged.

By combining these three components, HiPrune effectively covers the most critical parts of the image while discarding the redundant background.
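Below is a simplified, self-contained sketch of that three-step selection in PyTorch. It assumes we already have per-token attention scores from a middle layer and from the last layer, and that tokens lie on a square patch grid; the scoring, the budget split `alpha`, and the 3x3 neighbourhood used for buffer tokens are our assumptions, not the exact settings from the paper.

```python
import torch

def hiprune_select(mid_scores, last_scores, grid_size, keep_ratio=1/3, alpha=0.5):
    """Return the indices of visual tokens to keep (illustrative sketch).

    mid_scores  : (N,) attention received by each token at a middle layer
    last_scores : (N,) attention received by each token at the last layer
    grid_size   : patches per side (N == grid_size ** 2)
    keep_ratio  : fraction of visual tokens to keep
    alpha       : fraction of the budget spent on anchor + buffer tokens
    """
    n_tokens = mid_scores.numel()
    budget = int(n_tokens * keep_ratio)

    # 1) Anchor tokens: highest middle-layer attention (object-centric regions).
    n_anchor = max(1, int(budget * alpha) // 2)
    anchors = mid_scores.topk(n_anchor).indices
    keep = set(anchors.tolist())

    # 2) Buffer tokens: the 3x3 spatial neighbours of each anchor on the grid,
    #    preserving spatial continuity around the selected objects.
    for idx in anchors.tolist():
        r, c = divmod(idx, grid_size)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < grid_size and 0 <= nc < grid_size:
                    keep.add(nr * grid_size + nc)

    # 3) Register tokens: among the remaining tokens, keep those with the
    #    highest last-layer attention as global context summaries.
    remaining = torch.tensor([i for i in range(n_tokens) if i not in keep])
    n_register = max(0, budget - len(keep))
    if n_register > 0 and remaining.numel() > 0:
        top = last_scores[remaining].topk(min(n_register, remaining.numel())).indices
        keep.update(remaining[top].tolist())

    return torch.tensor(sorted(keep))

# Example: a 24x24 grid (576 tokens, as in LLaVA-1.5) pruned to roughly 1/3.
mid, last = torch.rand(576), torch.rand(576)
keep_idx = hiprune_select(mid, last, grid_size=24)
# pruned_tokens = visual_tokens[:, keep_idx]   # apply before feeding the LLM
```

Because the selection only reads attention maps that the vision encoder already computes, it adds negligible overhead and leaves the LLM untouched.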

Figure 4
Figure 4: Visualization of tokens retained by HiPrune. Anchor tokens are in yellow, buffer tokens are in red, and register tokens are in cyan. Anchor and buffer tokens focus on the sports player, the aircraft, and the fire extinguisher. We set $\alpha=0.5$ for better visualization. The images, randomly chosen from the COCO val2017 set, are slightly resized.

State-of-the-Art Results

We evaluated HiPrune across various architectures, including LLaVA-1.5, LLaVA-NeXT (fixed-length), and Qwen2.5-VL (dynamic-resolution). The results show a state-of-the-art trade-off between accuracy and efficiency:

  • LLaVA-1.5: Preserves 99.3% of original task performance while keeping only 33.3% of the visual tokens. Even at an extreme 11.1% token budget, it maintains 92.7% performance.
  • LLaVA-NeXT: Actually boosts overall performance by 2.6% when using 22.2% of tokens, while reducing FLOPs by 6×. Maintains 99.5% accuracy with just 11.1% tokens.
  • Efficiency: Reduces inference FLOPs and latency by up to 9×.
Table 1
Table 1: Results on LLaVA-1.5-7B. All methods are training-free. $^\dagger$ denotes results reproduced by us.
Table 2
Table 2: Results on Qwen2.5-VL-3B-Instruct.

Conclusion

HiPrune proves that you don't need complex merging algorithms or special tokens to compress visual information in VLMs. By simply leveraging the inherent layer-wise attention hierarchy already present inside vision encoders, we can achieve highly efficient, plug-and-play inference for universal Vision-Language Models.