Speculative decoding?

#2
by pa0los - opened

Can this model be used as a draft model for Qwen3.5-122B?
Has anyone measured the performance of speculative decoding with it?

If you're using llama.cpp, you can't until they add support for it; if you're using something else, you're better off using MTP (multi-token prediction).
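For context, the draft-then-verify loop that speculative decoding relies on can be sketched as follows. This is a toy greedy-verification example with hypothetical `draft_next`/`target_next` callables, not llama.cpp internals:

```python
def speculative_step(target_next, draft_next, prefix, k):
    """One round of speculative decoding with greedy verification.

    target_next(seq) -> next token under the (slow) target model
    draft_next(seq)  -> next token under the (fast) draft model
    Returns the tokens appended to `prefix` this round.
    """
    # 1. The draft model proposes k tokens autoregressively.
    seq = list(prefix)
    draft = []
    for _ in range(k):
        t = draft_next(seq)
        seq.append(t)
        draft.append(t)

    # 2. The target model verifies the proposals (one call per position
    #    here for clarity; a real engine scores them in a single pass).
    accepted = []
    seq = list(prefix)
    for t in draft:
        expected = target_next(seq)
        if expected == t:
            accepted.append(t)
            seq.append(t)
        else:
            # First mismatch: keep the target's token and stop early.
            accepted.append(expected)
            return accepted

    # All drafts accepted; the target contributes one bonus token.
    accepted.append(target_next(seq))
    return accepted


# Toy models: "next token" is just the current sequence length.
target = lambda s: len(s)
agreeing_draft = lambda s: len(s)

print(speculative_step(target, agreeing_draft, [1], 3))  # → [1, 2, 3, 4]
```

When the draft model agrees with the target on all k positions, one round yields k+1 tokens for a single (batched) target pass, which is where the speed-up comes from; on a mismatch, the round falls back to the target's own token.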

```shell
llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 \
  --spec-draft-hf unsloth/Qwen3.5-0.8B-GGUF:Q4_0 \
  --spec-type ngram-mod \
  --spec-ngram-mod-n-match 24 \
  --spec-ngram-mod-n-min 48 \
  --spec-ngram-mod-n-max 64
```

https://github.com/ggml-org/llama.cpp/pull/22397#issuecomment-4327711013
