SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Jiahui Wang1,* Zuyan Liu1,2,* Yongming Rao2,1 Jiwen Lu1
1Tsinghua University  2Tencent Hunyuan Research  *Equal Contribution 

Abstract

Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (fewer than 5%) of attention heads in LLMs actively contribute to visual understanding, which we term visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate MLLM inference. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual tokens, SparseMM prioritizes retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38× real-time acceleration and 52% memory reduction during generation while maintaining performance parity in the efficiency test.

Chase Visual Head

Chase Visual Head. To investigate how attention heads within the MLLM attend to visual elements and to identify the specific visual heads, we introduce an OCR-based method and define a visual score. For each output token, we first determine its corresponding region within the image based on its (text, bbox) pair. Based on this region, we then identify the associated image tokens in the input sequence. Subsequently, we iterate over all attention heads. For any given head, if the token that receives the highest attention in this head's attention matrix belongs to the set of identified image tokens, a "hit" is recorded for that head and its score is incremented by the inverse of the number of image tokens in the region. A minimal sketch of this scoring rule is shown below.
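
The sketch below illustrates how the per-head visual score could be accumulated for a single output token. It assumes that token's attention weights are available as a tensor of shape (num_layers, num_heads, seq_len) and that the OCR (text, bbox) pair has already been mapped to a list of image-token indices in the input sequence; the function and argument names are illustrative, not part of a released API.

import torch

# Hedged sketch: accumulate head-level visual scores for one output token.
# `scores` has shape (num_layers, num_heads); `attn` holds this token's
# attention over the context, shape (num_layers, num_heads, seq_len);
# `image_token_ids` are the input positions of the image tokens covered by
# the OCR bounding box matched to this output token (illustrative names).
def update_visual_scores(scores, attn, image_token_ids):
    if len(image_token_ids) == 0:
        return scores
    hit_value = 1.0 / len(image_token_ids)            # inverse of the region's token count
    top_token = attn.argmax(dim=-1)                   # most-attended position per head
    region = torch.tensor(image_token_ids, device=top_token.device)
    hits = torch.isin(top_token, region)              # "hit" if the top token lies in the region
    return scores + hits.float() * hit_value

Summing these updates over all output tokens of a probing set yields the head-level visual scores used to rank heads.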

SparseMM for MLLM Acceleration

Illustrations of SparseMM. For multimodal models, visual tokens usually account for a large proportion of the input, which causes the KV-Cache to consume substantial GPU memory. Given a fixed cache budget, we use the visual heads to guide KV-Cache allocation across attention heads. The cache budget of each head consists of three parts: a Uniform-Based Cache, a Local Window Cache, and a Score-Preferred Cache, as sketched below.
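
A minimal sketch of this split, assuming a per-head visual score vector produced by the search described above; the ratios between the three parts are illustrative placeholders rather than the paper's tuned settings.

import torch

# Hedged sketch: split a fixed KV-Cache budget across attention heads.
# `visual_scores` has shape (num_heads,); `total_budget` is the total number
# of cached tokens shared by all heads; the split ratios are assumptions.
def allocate_head_budgets(visual_scores, total_budget,
                          uniform_ratio=0.3, local_ratio=0.2):
    num_heads = visual_scores.numel()
    uniform_part = total_budget * uniform_ratio / num_heads        # Uniform-Based Cache
    local_part = total_budget * local_ratio / num_heads            # Local Window Cache (recent tokens)
    score_pool = total_budget * (1.0 - uniform_ratio - local_ratio)
    weights = visual_scores / visual_scores.sum().clamp(min=1e-6)  # normalize visual scores
    budgets = uniform_part + local_part + score_pool * weights     # Score-Preferred Cache
    return budgets.round().long()

Under the same overall budget, heads with higher visual scores thus retain more cached entries, while every head still keeps a small uniform share and its most recent tokens.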

Benchmark Performance

Main Results. We evaluate on representative multimodal benchmarks and compare SparseMM against mainstream state-of-the-art KV-Cache compression methods.

Efficiency Evaluation

Efficiency Results. We compare SparseMM with the FullKV baseline. The results show that SparseMM achieves superior accuracy-efficiency trade-offs.

Visualization Examples of Visual Head


Citation (BibTeX)


@article{wang2025sparsemm,
  title={SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs},
  author={Wang, Jiahui and Liu, Zuyan and Rao, Yongming and Lu, Jiwen},
  journal={arXiv preprint arXiv:},
  year={2025}
}