---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
tags:
  - video-understanding
  - streaming
  - proactive
  - activation-model
  - masked-diffusion
  - multimodal
  - plug-and-play
language:
  - en
pipeline_tag: video-classification
model-index:
  - name: STRIDE-2B
    results:
      - task:
          type: video-classification
          name: Proactive Streaming Activation
        dataset:
          type: custom
          name: OVO-Bench
        metrics:
          - type: accuracy
            value: 59.07
            name: Overall (w/ Qwen3-VL-8B)
      - task:
          type: video-classification
          name: Proactive Streaming Activation
        dataset:
          type: custom
          name: StreamingBench
        metrics:
          - type: accuracy
            value: 59.29
            name: Overall (w/ Qwen3-VL-8B)
      - task:
          type: video-classification
          name: Temporal Grounding
        dataset:
          type: custom
          name: ET-Bench
        metrics:
          - type: f1
            value: 62.8
            name: TVG F1
          - type: f1
            value: 10.7
            name: EPM F1
          - type: f1
            value: 24.6
            name: TAL F1
          - type: f1
            value: 36.5
            name: DVC F1
          - type: f1
            value: 28.5
            name: SLC F1
---

# STRIDE-2B

STRIDE (Structured Temporal Refinement with Iterative DEnoising) is a lightweight proactive activation model for streaming video understanding. It decides when a downstream Video-LLM should respond during a live video stream, without waiting for explicit user queries.


Paper: STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

Junho Kim*, Hosu Lee*, James M. Rehg, Minsu Kim, Yong Man Ro

UIUC, KAIST, Google DeepMind

## What is STRIDE?

Existing streaming Video-LLMs are reactive: they respond only when a user explicitly asks a question. STRIDE makes them proactive by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating when to trigger a response.

The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond now?") but a span-structured sequence modeling problem: the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through masked diffusion over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.
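The iterative refinement can be sketched as follows. This is a toy illustration, not the released implementation: the per-frame scorer (`toy_predictor`, `frame_scores`) and the fixed unmasking schedule are invented stand-ins for the predictions of the Qwen3-VL backbone.

```python
# Toy sketch of iterative masked denoising over a binary activation window.
# STRIDE's real predictor is a Qwen3-VL backbone with a temporal activation
# head; here a hand-written scorer stands in for it.

MASK = -1  # sentinel for a not-yet-denoised position

def toy_predictor(window, frame_scores):
    """Propose (label, confidence) for each still-masked position.

    frame_scores: per-frame evidence in [0, 1] (hypothetical input).
    """
    proposals = {}
    for t, label in enumerate(window):
        if label == MASK:
            s = frame_scores[t]
            proposals[t] = (1 if s >= 0.5 else 0, abs(s - 0.5))
    return proposals

def denoise_window(frame_scores, steps=4):
    """Start fully masked; each step, commit the most confident proposals."""
    window = [MASK] * len(frame_scores)
    per_step = max(1, len(window) // steps)
    while MASK in window:
        proposals = toy_predictor(window, frame_scores)
        ranked = sorted(proposals, key=lambda t: proposals[t][1], reverse=True)
        for t in ranked[:per_step]:
            window[t] = proposals[t][0]
    return window

scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.1]
print(denoise_window(scores))  # → [0, 0, 1, 1, 1, 0, 0, 0]
```

Note how the denoised window forms one coherent span (onset at frame 2, offset after frame 4) rather than isolated per-frame decisions.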

## Two-Stage Architecture

```
Video Stream
  │
  ▼
[STRIDE Activation Model]  ← this model (2B)
  │
  │ trigger (only if active)
  ▼
[Downstream Video-LLM]     ← frozen, any off-the-shelf
  │
  ▼
Response
```

- **Stage 1 – Activation (STRIDE):** Monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
- **Stage 2 – Response (Downstream LLM):** When triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response. STRIDE is fully plug-and-play and compatible with any off-the-shelf Video-LLM.

## Results

### OVO-Bench (Online Video Understanding)

| Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall |
|---|---|---|---|---|
| Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 |
| Dispider | 54.55 | 36.06 | 34.72 | 41.78 |
| TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 |
| QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 |
| StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 |
| STRIDE + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 |
| STRIDE + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 |
| STRIDE + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | 59.07 |

### StreamingBench (Streaming Comprehension)

| Method | Real-Time Visual | Omni-Source | Contextual | Overall |
|---|---|---|---|---|
| Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 |
| VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 |
| Dispider | 67.63 | 35.66 | 33.61 | 53.12 |
| StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 |
| STRIDE + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 |
| STRIDE + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 |
| STRIDE + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | 59.29 |

### ET-Bench (Temporal Grounding, Activation-Only)

| Model | Params | TVG | EPM | TAL | DVC | SLC | Avg |
|---|---|---|---|---|---|---|---|
| *Temporal-Localization Specialized* | | | | | | | |
| VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 |
| TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 |
| VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | 40.2 | 20.8 | 19.0 |
| LITA | 13B | 22.2 | 4.6 | 18.0 | 39.7 | 21.0 | 21.1 |
| ETChat | 5B | 38.6 | 10.2 | 30.8 | 38.4 | 24.4 | 28.5 |
| *Streaming Baselines* | | | | | | | |
| VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 |
| Dispider | 9B | 36.1 | 15.5 | 27.3 | 33.8 | 18.8 | 26.3 |
| StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | 22.6 | – |
| *Ours* | | | | | | | |
| STRIDE | 2B | 62.8 | 10.7 | 24.6 | 36.5 | 28.5 | 32.6 |

STRIDE achieves the best overall average with only 2B parameters, outperforming 7-13B temporal-localization specialized models and streaming baselines.

## Usage

For the full streaming inference pipeline and evaluation scripts, please refer to the STRIDE GitHub repository.

## Training

  • Architecture: Qwen3VLForProactiveMDM (Qwen3-VL backbone with a temporal activation head)
  • Base model: Qwen/Qwen3-VL-2B-Instruct
  • Training data: Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM)

## Model Variants

| Model | Params | Description |
|---|---|---|
| STRIDE-2B (this model) | 2B | Default activation model |
| STRIDE-4B | 4B | Scaled variant with improved accuracy |

## Citation

```bibtex
@article{kim2026stride,
  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
  journal={arXiv preprint arXiv:2603.27593},
  year={2026}
}
```

## License

This model is released under the Apache 2.0 License.