# Kimi-K2.5-DFlash
> **Note:** This model is still being trained; expect updated checkpoints.
DFlash is a novel speculative decoding method that uses a lightweight block diffusion model as the drafter. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.

This repository contains the drafter component. It must be used together with the target model moonshotai/Kimi-K2.5.
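DFlash swaps the usual autoregressive drafter for a block diffusion model, but the outer draft-and-verify loop is standard speculative decoding: the drafter proposes a block of tokens, the target verifies them in one forward pass, and the longest matching prefix is accepted plus one token from the target. A minimal toy sketch of that loop (the `draft_block` and `target_next` functions below are stand-ins over a toy vocabulary, not the real models):

```python
# Toy sketch of the draft-and-verify loop behind speculative decoding.
# Both "models" here are hypothetical stand-in functions; the real DFlash
# drafter proposes a whole block in parallel with a diffusion model.

def draft_block(prefix, k):
    """Stand-in drafter: proposes k tokens at once, drifting wrong after 3."""
    return [(prefix[-1] + i + 1) % 10 if i < 3 else 0 for i in range(k)]

def target_next(prefix):
    """Stand-in target model: the 'ground truth' next token."""
    return (prefix[-1] + 1) % 10

def speculative_step(prefix, k=7):
    """Accept the longest drafted prefix the target agrees with,
    then append one token from the target itself."""
    draft = draft_block(prefix, k)
    accepted = []
    for tok in draft:
        if tok == target_next(prefix + accepted):
            accepted.append(tok)
        else:
            break
    # The target always contributes one token (a correction or bonus token),
    # so every verification step emits at least one token.
    accepted.append(target_next(prefix + accepted))
    return accepted

print(speculative_step([3]))  # 3 drafted tokens accepted + 1 target token
```

The "accept length" reported below is the average number of tokens emitted per such step; the higher it is, the fewer target forward passes are needed per generated token.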
## Quick Start

### Installation

**vLLM:**
```shell
uv pip install vllm
# Or use the nightly build:
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
```
Please refer to vLLM PR #39930 for details on using DFlash with vLLM.
**SGLang:**

```shell
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
```
### Launch Server

**vLLM:**

```shell
vllm serve moonshotai/Kimi-K2.5 \
  --speculative-config '{"method": "dflash", "model": "z-lab/Kimi-K2.5-DFlash", "num_speculative_tokens": 7}' \
  --attention-backend flashinfer \
  --max-num-batched-tokens 32768
```
**SGLang:**

```shell
# Optional: enable schedule overlapping (experimental, may not be stable)
# export SGLANG_ENABLE_SPEC_V2=1
# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2.5 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Kimi-K2.5-DFlash \
  --speculative-num-draft-tokens 8 \
  --tp-size 8 \
  --attention-backend trtllm_mla \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.9 \
  --speculative-dflash-draft-window-size 4096 \
  --trust-remote-code
```
> **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter.
## Usage
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```
## Early Results
- Thinking: enabled
- Max new tokens: 4096
- Block size: 8
- Measured with SGLang; vLLM results may differ.
| Dataset | Accept Length |
|---|---|
| GSM8K | 5.4 |
| Math500 | 5.4 |
| HumanEval | 5.0 |
| MBPP | 4.5 |
| MT-Bench | 3.7 |
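Accept length translates roughly into end-to-end speedup: each target verification step emits about `accept_length` tokens instead of one. A back-of-envelope estimate, assuming a hypothetical fixed relative drafting overhead per step (the `draft_overhead` value below is an illustrative assumption, not a measured number):

```python
# Rough speedup estimate from mean accept length.
# draft_overhead is a HYPOTHETICAL relative cost of running the drafter
# per target forward pass; real overhead depends on hardware and batch size.

def estimated_speedup(accept_length, draft_overhead=0.15):
    """Tokens emitted per target step, divided by per-step cost
    relative to plain autoregressive decoding."""
    return accept_length / (1 + draft_overhead)

for name, al in [("GSM8K", 5.4), ("MT-Bench", 3.7)]:
    print(f"{name}: ~{estimated_speedup(al):.1f}x")
```

Under this toy model, an accept length of 5.4 with 15% drafting overhead would yield roughly a 4.7x decoding speedup; actual throughput gains will vary.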