Small Vision-Language Models are Smart Compressors for Long Video Understanding
Abstract
Tempo is an efficient framework that compresses long videos for multimodal understanding by using a small vision-language model for temporal compression and adaptive token allocation to maintain intent-aligned representations within strict budgets.
Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits: dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity, discarding decisive moments while wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process that generates compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router: it allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors that preserve the global storyline. Extensive experiments show that our 6B architecture achieves state-of-the-art performance under aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101 s), Tempo scores 52.3 under a strict 8K visual-token budget, outperforming GPT-4o and Gemini 1.5 Pro; scaling the input to 2048 frames raises the score to 53.7. Crucially, Tempo compresses hour-long videos to token counts substantially below theoretical limits, showing that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
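The budgeted allocation that ATA performs can be sketched in a few lines. The following is only an illustrative scheme, not the paper's implementation: the function, its proportional-share rule, and the `min_tokens`/`max_tokens` parameters are hypothetical, and it assumes per-frame relevance scores in [0, 1] already produced by something like the SVLM's zero-shot prior.

```python
def allocate_tokens(relevance, budget, min_tokens=1, max_tokens=16):
    """Hypothetical sketch of budgeted per-frame token allocation.

    Every frame keeps `min_tokens` as a temporal anchor so the global
    storyline survives; the leftover budget is handed out roughly in
    proportion to per-frame relevance, capped at `max_tokens` per frame.
    """
    n = len(relevance)
    alloc = [min_tokens] * n                 # minimal anchors for all frames
    spare = budget - min_tokens * n          # bandwidth left to distribute
    assert spare >= 0, "budget too small for one anchor token per frame"
    total = sum(relevance)
    if total == 0:
        return alloc                         # nothing relevant: anchors only
    remaining = spare
    # Visit frames from most to least relevant so query-critical
    # segments are served first if the budget runs out.
    for i in sorted(range(n), key=lambda i: -relevance[i]):
        share = round(spare * relevance[i] / total)
        give = min(share, max_tokens - min_tokens, remaining)
        alloc[i] += give
        remaining -= give
    return alloc

# Four frames, two query-critical, global budget of 20 visual tokens:
# allocate_tokens([0.9, 0.1, 0.9, 0.1], budget=20) -> [8, 2, 8, 2]
```

Note that the proportional split never exceeds the global budget (it may slightly underuse it after rounding), which matches the "strict budget" constraint the abstract describes.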
Community
🔥 Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding
How do we make MLLMs understand hour-long videos without saturating context windows? Tempo uses an SVLM to actively filter and compress videos via query-aware cross-modal distillation in a single forward pass!
🏆 SOTA Performance: Outperforms other long video MLLMs on the extreme-long LVBench (52.3 at 8K budget).
Everything is open-sourced! Try it out:
- 🤗 Interactive Space: https://huggingface.co/spaces/Vision-CAIR/Tempo
- 💻 GitHub: https://github.com/FeiElysia/Tempo
- 🌐 Project Page: https://feielysia.github.io/tempo-page/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models (2026)
- Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects (2026)
- CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling (2026)
- PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference (2026)
- KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models (2026)
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models (2026)
- Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models (2025)
Get this paper in your agent:

```shell
hf papers read 2604.08120
```

Don't have the latest CLI?

```shell
curl -LsSf https://hf.co/cli/install.sh | bash
```