Suggestion: Preserve ssm_alpha and ssm_beta in BF16 for Q8_0 Quants (Fix for Long-Context Degradation)
Hi Unsloth team,
First, thank you for the incredible analysis on layer-wise quantization and precision!
I've been doing some deep-dive testing on the Q8_0 quants for the Qwen 3.5 architecture, specifically focusing on long-context agent workflows, and I wanted to propose a slight tweak to the standard Q8_0 quantization scheme for future updates.
Currently, conventional Q8_0 quantization crushes both the ssm_alpha and ssm_beta weights down to 8 bits. I believe these specific layers should be strictly preserved in their native BF16 precision, even in an otherwise fully 8-bit model, for the following reasons:
- Accumulated Delta in Recurrent States: Because the SSM relies on a recurrent state to carry memory forward, the small rounding errors introduced by Q8_0 compound with every single token. At 2,000 tokens this is invisible, but at 15,000+ tokens the accumulated delta significantly degrades the model's context awareness and causes instruction drift (see the sketch just after this list).
- The Official Qwen Precedent: The official Qwen FP8 quantized models actually validate this exact approach. In their official releases, they deliberately leave these two specific projection weights in native BF16 to prevent context rot:
  • model.language_model.layers.0.linear_attn.in_proj_a.weight [32, 2048] BF16
  • model.language_model.layers.0.linear_attn.in_proj_b.weight [32, 2048] BF16
- Negligible Bandwidth Cost: Because these small matrices are tiny compared to the massive MoE routed experts, keeping them in BF16 consumes virtually zero extra memory bandwidth or VRAM (a quick size calculation is sketched further below).
- The Hardware Reality (APUs): Pure Q8_0 is currently incredibly important for users on specific hardware, like the AMD Strix Halo APU. Q8_0 provides massive throughput advantages (especially via Vulkan compute shaders) because it bypasses the complex bit-shifting required by _K quants.
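To make the first point concrete, here is a minimal, self-contained sketch of how Q8_0 rounding in a decay-style parameter can compound in a recurrent state. This is a toy illustration only: the [32, 2048] shape mirrors the tensors above, but the value range, the scalar recurrence, and the step counts are placeholders I chose, not the actual Qwen SSM math.

```python
import numpy as np

def q8_0_roundtrip(w: np.ndarray) -> np.ndarray:
    """Simulate a Q8_0 round trip: blocks of 32 values, one scale per block, int8 payload."""
    flat = w.reshape(-1, 32)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.round(flat / scale), -127, 127)
    return (q * scale).astype(np.float32).reshape(w.shape)

rng = np.random.default_rng(0)

# Toy stand-in for a per-channel decay/gate parameter; the [32, 2048] shape matches
# the tensors above, but the value range is made up for illustration.
alpha = rng.uniform(0.90, 0.99, size=(32, 2048)).astype(np.float32)
alpha_q = q8_0_roundtrip(alpha)

# Toy linear recurrence h_t = alpha * h_{t-1} + x_t (not the real Qwen update rule),
# run once with the original decay and once with the Q8_0-rounded one.
h_ref = np.zeros_like(alpha)
h_q = np.zeros_like(alpha)
for t in range(1, 16001):
    x = (rng.standard_normal(alpha.shape) * 0.01).astype(np.float32)
    h_ref = alpha * h_ref + x
    h_q = alpha_q * h_q + x
    if t in (2000, 8000, 16000):
        drift = np.abs(h_q - h_ref).mean() / (np.abs(h_ref).mean() + 1e-8)
        print(f"t={t:6d}  mean relative state drift: {drift:.2%}")
```

The point is simply that a sub-1% rounding error on a near-1 decay parameter does not stay sub-1% once the state is rolled forward for thousands of tokens, because the effective memory horizon of the recurrence is very sensitive to that parameter.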
Creating a Q8_0 that surgically protects ssm_alpha and ssm_beta in BF16 would give these users the best of both worlds: fast, flat 8-bit matrix multiplication for the heavy experts, plus full-precision recurrent-state dynamics for massive context windows, which is particularly crucial for multi-turn agentic use.
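To back up the "negligible bandwidth cost" point, here is a rough back-of-the-envelope script. The per-tensor shape comes from the FP8 release listed above; the layer count is a placeholder guess, not the real Qwen 3.5 figure.

```python
# Extra bytes from keeping the two [32, 2048] tensors in BF16 instead of Q8_0.
# NOTE: n_layers is a hypothetical value for illustration, not the actual layer count.
n_layers = 48
elems_per_tensor = 32 * 2048        # shape from the official FP8 release
tensors_per_layer = 2               # ssm_alpha (in_proj_a) + ssm_beta (in_proj_b)

bf16_bytes = elems_per_tensor * 2                  # 2 bytes per element
q8_0_bytes = (elems_per_tensor // 32) * 34         # Q8_0: 32 int8 + one fp16 scale per block

extra_per_layer = tensors_per_layer * (bf16_bytes - q8_0_bytes)
extra_total_mib = n_layers * extra_per_layer / 2**20

print(f"extra per layer: {extra_per_layer} bytes")
print(f"extra for {n_layers} layers: {extra_total_mib:.2f} MiB")
# Roughly 120 KiB per layer, a few MiB for the whole model, versus a
# multi-GB MoE-expert payload: effectively a rounding error.
```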
FYI, here are the t/s numbers from my setup (Strix Halo with 128 GB RAM):

| Quant | Faster backend | t/s |
| --- | --- | --- |
| Conventional Q8_0 | Vulkan | 39-40 |
| UD-Q8_K_XL | negligible difference | 30-31 |
| UD-Q6_K_XL | ROCm | 33-34 |
| AesSedai Q5_K (Q8_0 used for many weights) | Vulkan | 43-44 |
| UD-Q4_K_XL (Q6_K used for many weights instead) | ROCm | 40-41 |
To be clear, my suggestion concerns the Q8_0 quant you are currently working on. I know the UD-Q8 quant already uses BF16 for the SSM weights (and the UD-Q6 quant does too), but its t/s performance is much lower than conventional Q8_0 on my hardware setup, which I suspect is mainly due to the much larger number of BF16 weights it keeps (especially for attention and embeddings).