Speculative decoding?

#2
by pa0los - opened

Can this model be used as a draft model for Qwen3.5-122B?
Has anyone measured the performance of speculative decoding with it?

If you're using llama.cpp, you can't until they add support for it; if you're using something else, you're better off using MTP (multi-token prediction).
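For context, the draft-then-verify loop that speculative decoding relies on can be sketched as follows. This is a toy greedy-verification example with hypothetical `draft_next`/`target_next` callables, not llama.cpp internals:

```python
def speculative_step(target_next, draft_next, prefix, k):
    """One round of speculative decoding with greedy verification.

    target_next(seq) -> next token under the (slow) target model
    draft_next(seq)  -> next token under the (fast) draft model
    Returns the tokens appended to `prefix` this round.
    """
    # 1. The draft model proposes k tokens autoregressively.
    seq = list(prefix)
    draft = []
    for _ in range(k):
        t = draft_next(seq)
        seq.append(t)
        draft.append(t)

    # 2. The target model verifies the proposals (one call per position
    #    here for clarity; a real engine scores them in a single pass).
    accepted = []
    seq = list(prefix)
    for t in draft:
        expected = target_next(seq)
        if expected == t:
            accepted.append(t)
            seq.append(t)
        else:
            # First mismatch: keep the target's token and stop early.
            accepted.append(expected)
            return accepted

    # All drafts accepted; the target contributes one bonus token.
    accepted.append(target_next(seq))
    return accepted


# Toy models: "next token" is just the current sequence length.
target = lambda s: len(s)
agreeing_draft = lambda s: len(s)

print(speculative_step(target, agreeing_draft, [1], 3))  # → [1, 2, 3, 4]
```

When the draft model agrees with the target on all k positions, one round yields k+1 tokens for a single (batched) target pass, which is where the speed-up comes from; on a mismatch, the round falls back to the target's own token.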

```shell
llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 \
  --spec-draft-hf unsloth/Qwen3.5-0.8B-GGUF:Q4_0 \
  --spec-type ngram-mod \
  --spec-ngram-mod-n-match 24 \
  --spec-ngram-mod-n-min 48 \
  --spec-ngram-mod-n-max 64
```

https://github.com/ggml-org/llama.cpp/pull/22397#issuecomment-4327711013
