Suggestion: Preserve ssm_alpha and ssm_beta in BF16 for Q8_0 Quants (Fix for Long-Context Degradation)
Hi Unsloth team,
First, thank you for the incredible analysis on layer-wise quantization and precision!
I've been doing some deep-dive testing on the Q8_0 quants for the Qwen 3.5 architecture, specifically focusing on long-context agent workflows, and I wanted to propose a slight tweak to the standard Q8_0 quantization scheme for future updates.
Currently, conventional Q8_0 quantization crushes both the ssm_alpha and ssm_beta weights down to 8 bits. I believe these specific layers should be strictly preserved in their native BF16 precision, even in an otherwise fully 8-bit model, for the following reasons:
- Accumulated Delta in Recurrent States: Because the SSM relies on a recurrent state to carry memory forward, the small rounding errors introduced by Q8_0 compound with every single token. At 2,000 tokens this is invisible, but at 15,000+ tokens the accumulated delta significantly degrades the model's context awareness and causes instruction drift (see the sketch just after this list).
- The Official Qwen Precedent: The official Qwen FP8 quantized models actually validate this exact approach. In their official releases, they deliberately leave these two specific projection weights in native BF16 to prevent context rot:
  • model.language_model.layers.0.linear_attn.in_proj_a.weight [32, 2048] BF16
  • model.language_model.layers.0.linear_attn.in_proj_b.weight [32, 2048] BF16
- Negligible Bandwidth Cost: Because these small matrices are tiny compared to the massive MoE routed experts, keeping them in BF16 consumes virtually zero extra memory bandwidth or VRAM (a quick size calculation is sketched further below).
- The Hardware Reality (APUs): Pure Q8_0 is currently incredibly important for users on specific hardware, like the AMD Strix Halo APU. Q8_0 provides massive throughput advantages (especially via Vulkan compute shaders) because it bypasses the complex bit-shifting required by _K quants.
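To make the first point concrete, here is a minimal, self-contained sketch of how Q8_0 rounding in a decay-style parameter can compound in a recurrent state. This is a toy illustration only: the [32, 2048] shape mirrors the tensors above, but the value range, the scalar recurrence, and the step counts are placeholders I chose, not the actual Qwen SSM math.

```python
import numpy as np

def q8_0_roundtrip(w: np.ndarray) -> np.ndarray:
    """Simulate a Q8_0 round trip: blocks of 32 values, one scale per block, int8 payload."""
    flat = w.reshape(-1, 32)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.round(flat / scale), -127, 127)
    return (q * scale).astype(np.float32).reshape(w.shape)

rng = np.random.default_rng(0)

# Toy stand-in for a per-channel decay/gate parameter; the [32, 2048] shape matches
# the tensors above, but the value range is made up for illustration.
alpha = rng.uniform(0.90, 0.99, size=(32, 2048)).astype(np.float32)
alpha_q = q8_0_roundtrip(alpha)

# Toy linear recurrence h_t = alpha * h_{t-1} + x_t (not the real Qwen update rule),
# run once with the original decay and once with the Q8_0-rounded one.
h_ref = np.zeros_like(alpha)
h_q = np.zeros_like(alpha)
for t in range(1, 16001):
    x = (rng.standard_normal(alpha.shape) * 0.01).astype(np.float32)
    h_ref = alpha * h_ref + x
    h_q = alpha_q * h_q + x
    if t in (2000, 8000, 16000):
        drift = np.abs(h_q - h_ref).mean() / (np.abs(h_ref).mean() + 1e-8)
        print(f"t={t:6d}  mean relative state drift: {drift:.2%}")
```

The point is simply that a sub-1% rounding error on a near-1 decay parameter does not stay sub-1% once the state is rolled forward for thousands of tokens, because the effective memory horizon of the recurrence is very sensitive to that parameter.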
Creating a Q8_0 that surgically protects ssm_alpha and ssm_beta in BF16 would give these users the best of both worlds: fast, flat 8-bit matrix multiplication for the heavy experts, plus full-precision recurrent-state dynamics for massive context windows, which is particularly crucial for multi-turn agentic use.
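To back up the "negligible bandwidth cost" point, here is a rough back-of-the-envelope script. The per-tensor shape comes from the FP8 release listed above; the layer count is a placeholder guess, not the real Qwen 3.5 figure.

```python
# Extra bytes from keeping the two [32, 2048] tensors in BF16 instead of Q8_0.
# NOTE: n_layers is a hypothetical value for illustration, not the actual layer count.
n_layers = 48
elems_per_tensor = 32 * 2048        # shape from the official FP8 release
tensors_per_layer = 2               # ssm_alpha (in_proj_a) + ssm_beta (in_proj_b)

bf16_bytes = elems_per_tensor * 2                  # 2 bytes per element
q8_0_bytes = (elems_per_tensor // 32) * 34         # Q8_0: 32 int8 + one fp16 scale per block

extra_per_layer = tensors_per_layer * (bf16_bytes - q8_0_bytes)
extra_total_mib = n_layers * extra_per_layer / 2**20

print(f"extra per layer: {extra_per_layer} bytes")
print(f"extra for {n_layers} layers: {extra_total_mib:.2f} MiB")
# Roughly 120 KiB per layer, a few MiB for the whole model, versus a
# multi-GB MoE-expert payload: effectively a rounding error.
```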
FYI, here are the t/s numbers from my setup (Strix Halo with 128 GB RAM):

| Quant | Faster backend | t/s |
| --- | --- | --- |
| Conventional Q8_0 | Vulkan | 39-40 |
| UD-Q8_K_XL | negligible difference | 30-31 |
| UD-Q6_K_XL | ROCm | 33-34 |
| AesSedai Q5_K (Q8_0 used for many weights) | Vulkan | 43-44 |
| UD-Q4_K_XL (Q6_K used for many weights instead) | ROCm | 40-41 |
To be clear, my suggestion concerns the Q8_0 quant you are currently working on. I know the UD-Q8 quant already uses BF16 for the SSM weights (and the UD-Q6 quant does too), but its t/s performance is much lower than conventional Q8_0 on my hardware setup, which I suspect is mainly due to the much larger number of BF16 weights it keeps (especially for attention and embeddings).