---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
tags:
  - video-understanding
  - streaming
  - proactive
  - activation-model
  - masked-diffusion
  - multimodal
  - plug-and-play
language:
  - en
pipeline_tag: video-classification
model-index:
  - name: STRIDE-2B
    results:
      - task:
          type: video-classification
          name: Proactive Streaming Activation
        dataset:
          type: custom
          name: OVO-Bench
        metrics:
          - type: accuracy
            value: 59.07
            name: Overall (w/ Qwen3-VL-8B)
      - task:
          type: video-classification
          name: Proactive Streaming Activation
        dataset:
          type: custom
          name: StreamingBench
        metrics:
          - type: accuracy
            value: 59.29
            name: Overall (w/ Qwen3-VL-8B)
      - task:
          type: video-classification
          name: Temporal Grounding
        dataset:
          type: custom
          name: ET-Bench
        metrics:
          - type: f1
            value: 62.8
            name: TVG F1
          - type: f1
            value: 10.7
            name: EPM F1
          - type: f1
            value: 24.6
            name: TAL F1
          - type: f1
            value: 36.5
            name: DVC F1
          - type: f1
            value: 28.5
            name: SLC F1
---

# STRIDE-2B

STRIDE (Structured Temporal Refinement with Iterative DEnoising) is a lightweight proactive activation model for streaming video understanding. It decides when a downstream Video-LLM should respond during a live video stream, without waiting for explicit user queries.


Paper: STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

Junho Kim*, Hosu Lee*, James M. Rehg, Minsu Kim, Yong Man Ro

UIUC, KAIST, Google DeepMind

## What is STRIDE?

Existing streaming Video-LLMs are reactive: they respond only when a user explicitly asks a question. STRIDE makes them proactive by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating when to trigger a response.

The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond now?") but a span-structured sequence modeling problem: the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through masked diffusion over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.
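The iterative refinement can be sketched as follows. This is a toy illustration, not the released implementation: the per-frame scorer (`toy_predictor`, `frame_scores`) and the fixed unmasking schedule are invented stand-ins for the predictions of the Qwen3-VL backbone.

```python
# Toy sketch of iterative masked denoising over a binary activation window.
# STRIDE's real predictor is a Qwen3-VL backbone with a temporal activation
# head; here a hand-written scorer stands in for it.

MASK = -1  # sentinel for a not-yet-denoised position

def toy_predictor(window, frame_scores):
    """Propose (label, confidence) for each still-masked position.

    frame_scores: per-frame evidence in [0, 1] (hypothetical input).
    """
    proposals = {}
    for t, label in enumerate(window):
        if label == MASK:
            s = frame_scores[t]
            proposals[t] = (1 if s >= 0.5 else 0, abs(s - 0.5))
    return proposals

def denoise_window(frame_scores, steps=4):
    """Start fully masked; each step, commit the most confident proposals."""
    window = [MASK] * len(frame_scores)
    per_step = max(1, len(window) // steps)
    while MASK in window:
        proposals = toy_predictor(window, frame_scores)
        ranked = sorted(proposals, key=lambda t: proposals[t][1], reverse=True)
        for t in ranked[:per_step]:
            window[t] = proposals[t][0]
    return window

scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.1]
print(denoise_window(scores))  # → [0, 0, 1, 1, 1, 0, 0, 0]
```

Note how the denoised window forms one coherent span (onset at frame 2, offset after frame 4) rather than isolated per-frame decisions.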

## Two-Stage Architecture

```
Video Stream
  │
  ▼
[STRIDE Activation Model]  ← this model (2B)
  │
  │ trigger (only if active)
  ▼
[Downstream Video-LLM]     ← frozen, any off-the-shelf
  │
  ▼
Response
```

- **Stage 1 – Activation (STRIDE):** Monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
- **Stage 2 – Response (Downstream LLM):** When triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response. STRIDE is fully plug-and-play and compatible with any off-the-shelf Video-LLM.

## Results

### OVO-Bench (Online Video Understanding)

| Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall |
|---|---|---|---|---|
| Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 |
| Dispider | 54.55 | 36.06 | 34.72 | 41.78 |
| TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 |
| QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 |
| StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 |
| STRIDE + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 |
| STRIDE + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 |
| STRIDE + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | 59.07 |

### StreamingBench (Streaming Comprehension)

| Method | Real-Time Visual | Omni-Source | Contextual | Overall |
|---|---|---|---|---|
| Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 |
| VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 |
| Dispider | 67.63 | 35.66 | 33.61 | 53.12 |
| StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 |
| STRIDE + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 |
| STRIDE + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 |
| STRIDE + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | 59.29 |

### ET-Bench (Temporal Grounding, Activation-Only)

| Model | Params | TVG | EPM | TAL | DVC | SLC | Avg |
|---|---|---|---|---|---|---|---|
| *Temporal-Localization Specialized* | | | | | | | |
| VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 |
| TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 |
| VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | 40.2 | 20.8 | 19.0 |
| LITA | 13B | 22.2 | 4.6 | 18.0 | 39.7 | 21.0 | 21.1 |
| ETChat | 5B | 38.6 | 10.2 | 30.8 | 38.4 | 24.4 | 28.5 |
| *Streaming Baselines* | | | | | | | |
| VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 |
| Dispider | 9B | 36.1 | 15.5 | 27.3 | 33.8 | 18.8 | 26.3 |
| StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | 22.6 | – |
| *Ours* | | | | | | | |
| STRIDE | 2B | 62.8 | 10.7 | 24.6 | 36.5 | 28.5 | 32.6 |

STRIDE achieves the best overall average with only 2B parameters, outperforming 7-13B temporal-localization specialized models and streaming baselines.

## Usage

For the full streaming inference pipeline and evaluation scripts, please refer to the STRIDE GitHub repository.

## Training

  • Architecture: Qwen3VLForProactiveMDM (Qwen3-VL backbone with a temporal activation head)
  • Base model: Qwen/Qwen3-VL-2B-Instruct
  • Training data: Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM)

## Model Variants

| Model | Params | Description |
|---|---|---|
| STRIDE-2B (this model) | 2B | Default activation model |
| STRIDE-4B | 4B | Scaled variant with improved accuracy |

## Citation

```bibtex
@article{kim2026stride,
  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
  journal={arXiv preprint arXiv:2603.27593},
  year={2026}
}
```

## License

This model is released under the Apache 2.0 License.