# AWQ 4-bit produces repetitive gibberish on long outputs with vLLM v0.15.1: root cause identified
## Summary
This model produces coherent short responses (under ~50 tokens) but degenerates into repetitive gibberish on longer outputs when served via `vllm/vllm-openai:v0.15.1`. The root cause is not the quantization itself: it is a chain of three bugs in vLLM v0.15.1 that results in the wrong text backbone class (`MistralForCausalLM`) being loaded instead of the correct `Ministral3ForCausalLM`.
Full investigation, with exact code paths, config comparisons, and a survey of all alternative quantizations: `GIBBERISH_BUG_REPORT.md`.
## Verbatim gibberish output
Prompt: "Write a Python function that implements binary search on a sorted list. Include type hints and a docstring."
Here's a Python function that implements binary search on a sorted list. The function includes type hints and a docstring.
```python
from typing import TypeVar, List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
```

Short responses like "What is 2+2?" and tool calls work fine. The degeneration is 100% reproducible on any prompt requiring more than ~50-100 tokens of output.
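For reproducibility, this is roughly how the failing request was issued (a minimal sketch; the base URL, API key, and served model name are assumptions based on a default vLLM OpenAI-compatible server):

```python
from openai import OpenAI

# Assumes a local vLLM server started from the Docker image above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="devstral-small-2-awq",  # hypothetical served model name
    messages=[{
        "role": "user",
        "content": "Write a Python function that implements binary search "
                   "on a sorted list. Include type hints and a docstring.",
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)  # degenerates into the loop above
```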
## Root cause: wrong text backbone class
The bug is in `vllm/model_executor/models/mistral3.py` (lines ~340-342). `Mistral3ForConditionalGeneration.__init__` has a special case written for Pixtral-12B:
```python
# NOTE: These are special cases for Pixtral-12B in the HF-format
if (
    config.text_config.architectures is None
    and config.text_config.model_type == "mistral"
):
    config.text_config.architectures = ["MistralForCausalLM"]
```
This fires because:
- `text_config.model_type` must be patched from `"ministral3"` to `"mistral"`: transformers v4.57.6 (bundled in vLLM v0.15.1) does not recognize `"ministral3"` (`KeyError: 'ministral3'`). The `ministral3` model type was added in transformers v5.0.0, but vLLM v0.15.1 is pinned to `transformers >= 4.56.0, < 5`. (See the check after this list.)
- `text_config.architectures` is null in this model's `config.json`; this field is simply not set.
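A quick way to confirm the version gap from inside the container (a sketch; it pokes at transformers' internal `CONFIG_MAPPING`, which is what `AutoConfig` consults to resolve `model_type`):

```python
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

# transformers 4.57.6 knows "mistral" but has no entry for the newer
# "ministral3" model type, so the lookup raises KeyError.
assert "mistral" in CONFIG_MAPPING

try:
    CONFIG_MAPPING["ministral3"]
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'ministral3'
```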
Both conditions are true, so vLLM forces `MistralForCausalLM` (the old Mistral 7B architecture, which inherits from `LlamaForCausalLM`) as the text backbone. The correct class, `Ministral3ForCausalLM`, does not exist in vLLM v0.15.1's model registry:
```python
# Verified inside the vllm/vllm-openai:v0.15.1 container:
from vllm.model_executor.models.registry import _VLLM_MODELS

for arch in sorted(_VLLM_MODELS):
    if "istral" in arch:  # matches both Mistral* and Ministral*
        print(arch, "->", _VLLM_MODELS[arch])
# MistralForCausalLM -> ('mistral', 'MistralForCausalLM')
# Mistral3ForConditionalGeneration -> ('mistral3', 'Mistral3ForConditionalGeneration')
# MistralLarge3ForCausalLM -> ('mistral_large_3', 'MistralLarge3ForCausalLM')
# NO Ministral3ForCausalLM entry
```
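Correspondingly, resolving the needed class through the registry's public API fails while the fallback resolves fine (a sketch; `resolve_model_cls` exists in earlier vLLM releases, but its exact signature in v0.15.1 is an assumption):

```python
from vllm import ModelRegistry

# The fallback architecture vLLM forces on this model resolves fine:
_, arch = ModelRegistry.resolve_model_cls(["MistralForCausalLM"])
print(arch)  # MistralForCausalLM

# The architecture this model actually needs is unknown to the registry:
try:
    ModelRegistry.resolve_model_cls(["Ministral3ForCausalLM"])
except ValueError as exc:
    print(exc)  # "... are not supported" style error
```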
## Why the Mistral-native loading path is also blocked for this model
The official FP8 model (`mistralai/Devstral-Small-2-24B-Instruct-2512`) works on vLLM v0.15.1 with `config-format: "mistral"` + `load-format: "mistral"`, which reads `params.json` and loads `consolidated-*.safetensors`. But this AWQ model ships only HuggingFace-format sharded safetensors (`model-*.safetensors`), and `load-format: "mistral"` requires `consolidated-*.safetensors` files, which llm-compressor does not produce. This forces the HuggingFace config path, which triggers the bug chain above.
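For comparison, this is roughly the Mistral-native invocation that works for the official FP8 checkpoint (a sketch using vLLM's offline `LLM` API; the equivalent server flags are `--config-format mistral --load-format mistral --tokenizer-mode mistral`):

```python
from vllm import LLM

# Works for the official FP8 repo because it ships params.json plus
# consolidated-*.safetensors; this AWQ repo ships neither file, so the
# same settings fail here and the HuggingFace config path is the only option.
llm = LLM(
    model="mistralai/Devstral-Small-2-24B-Instruct-2512",
    config_format="mistral",   # read params.json instead of config.json
    load_format="mistral",     # load consolidated-*.safetensors shards
    tokenizer_mode="mistral",
)
```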
## Potential fix in this model's `config.json`
Adding "architectures": ["MistralForCausalLM"] (or any non-null value) to text_config in this model's config.json would bypass the Pixtral-12B special case if-statement. However, this alone would not fix the gibberish because Ministral3ForCausalLM does not exist in vLLM v0.15.1's registry β there is no correct value to set it to. The real fix requires vLLM to either:
- Merge the transformers v5 bump (vllm-project/vllm#30566, open since December 12, 2025, still unmerged), or
- Add `Ministral3ForCausalLM` to its model registry.
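For completeness, a minimal sketch of the (insufficient) config-side workaround described above; the local path is hypothetical:

```python
import json
from pathlib import Path

# Hypothetical local checkout of this model repo.
cfg_path = Path("Devstral-Small-2-24B-AWQ/config.json")
cfg = json.loads(cfg_path.read_text())

# Any non-null value makes the Pixtral-12B special case a no-op...
cfg["text_config"]["architectures"] = ["MistralForCausalLM"]
cfg_path.write_text(json.dumps(cfg, indent=2))

# ...but vLLM v0.15.1 still has no Ministral3ForCausalLM to route to,
# so the output remains gibberish either way.
```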
## Environment
- GPU: NVIDIA GeForce RTX 5090 (32 GiB GDDR7 VRAM)
- Docker image: `vllm/vllm-openai:v0.15.1` (February 4, 2026)
- transformers inside container: 4.57.6
- Model revision: `da6366ed3bb6d5207c6544ede10df31e4082e027`
## No alternative 4-bit quant works either
I surveyed every quantized variant of Devstral Small 2 24B on HuggingFace (AWQ, GPTQ, NVFP4, BitsAndBytes NF4, GGUF, SINQ, MLX, EXL3). All HuggingFace-format quantizations share the same architecture-routing bug, and gibberish is also reported for the MLX 4-bit community model. Full survey in the bug report.