# AWQ 4-bit produces repetitive gibberish on long outputs with vLLM v0.15.1: root cause identified
## Summary
This model produces coherent short responses (under ~50 tokens) but degenerates into repetitive gibberish on longer outputs when served via `vllm/vllm-openai:v0.15.1`. The root cause is not the quantization itself: it is a chain of three bugs in vLLM v0.15.1 that results in the wrong text backbone class (`MistralForCausalLM`) being loaded instead of the correct `Ministral3ForCausalLM`.
Full investigation, with exact code paths, config comparisons, and a survey of all alternative quantizations: `GIBBERISH_BUG_REPORT.md`.
## Verbatim gibberish output
Prompt: "Write a Python function that implements binary search on a sorted list. Include type hints and a docstring."
Here's a Python function that implements binary search on a sorted list. The function includes type hints and a docstring.
```python
from typing import TypeVar, List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
```

Short responses like "What is 2+2?" and tool calls work fine. The degeneration is 100% reproducible on any prompt requiring more than ~50-100 tokens of output.
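For reproducibility, this is roughly how the failing request was issued (a minimal sketch; the base URL, API key, and served model name are assumptions based on a default vLLM OpenAI-compatible server):

```python
from openai import OpenAI

# Assumes a local vLLM server started from the Docker image above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="devstral-small-2-awq",  # hypothetical served model name
    messages=[{
        "role": "user",
        "content": "Write a Python function that implements binary search "
                   "on a sorted list. Include type hints and a docstring.",
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)  # degenerates into the loop above
```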
## Root cause: wrong text backbone class
The bug is in `vllm/model_executor/models/mistral3.py` (lines ~340-342). `Mistral3ForConditionalGeneration.__init__` has a special case written for Pixtral-12B:
```python
# NOTE: These are special cases for Pixtral-12B in the HF-format
if (
    config.text_config.architectures is None
    and config.text_config.model_type == "mistral"
):
    config.text_config.architectures = ["MistralForCausalLM"]
```
This fires because:
- `text_config.model_type` must be patched from `"ministral3"` to `"mistral"`: transformers v4.57.6 (bundled in vLLM v0.15.1) does not recognize `"ministral3"` (`KeyError: 'ministral3'`). The `ministral3` model type was added in transformers v5.0.0, but vLLM v0.15.1 is pinned to `transformers >= 4.56.0, < 5`. (See the check after this list.)
- `text_config.architectures` is null in this model's `config.json`; this field is simply not set.
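A quick way to confirm the version gap from inside the container (a sketch; it pokes at transformers' internal `CONFIG_MAPPING`, which is what `AutoConfig` consults to resolve `model_type`):

```python
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

# transformers 4.57.6 knows "mistral" but has no entry for the newer
# "ministral3" model type, so the lookup raises KeyError.
assert "mistral" in CONFIG_MAPPING

try:
    CONFIG_MAPPING["ministral3"]
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'ministral3'
```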
Both conditions are true, so vLLM forces `MistralForCausalLM` (the old Mistral 7B architecture, which inherits from `LlamaForCausalLM`) as the text backbone. The correct class, `Ministral3ForCausalLM`, does not exist in vLLM v0.15.1's model registry:
```python
# Verified inside the vllm/vllm-openai:v0.15.1 container:
from vllm.model_executor.models.registry import _VLLM_MODELS

for arch in sorted(_VLLM_MODELS):
    if "istral" in arch:  # matches both Mistral* and Ministral*
        print(arch, "->", _VLLM_MODELS[arch])
# MistralForCausalLM -> ('mistral', 'MistralForCausalLM')
# Mistral3ForConditionalGeneration -> ('mistral3', 'Mistral3ForConditionalGeneration')
# MistralLarge3ForCausalLM -> ('mistral_large_3', 'MistralLarge3ForCausalLM')
# NO Ministral3ForCausalLM entry
```
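Correspondingly, resolving the needed class through the registry's public API fails while the fallback resolves fine (a sketch; `resolve_model_cls` exists in earlier vLLM releases, but its exact signature in v0.15.1 is an assumption):

```python
from vllm import ModelRegistry

# The fallback architecture vLLM forces on this model resolves fine:
_, arch = ModelRegistry.resolve_model_cls(["MistralForCausalLM"])
print(arch)  # MistralForCausalLM

# The architecture this model actually needs is unknown to the registry:
try:
    ModelRegistry.resolve_model_cls(["Ministral3ForCausalLM"])
except ValueError as exc:
    print(exc)  # "... are not supported" style error
```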
## Why the Mistral-native loading path is also blocked for this model
The official FP8 model (`mistralai/Devstral-Small-2-24B-Instruct-2512`) works on vLLM v0.15.1 with `config-format: "mistral"` + `load-format: "mistral"`, which reads `params.json` and loads `consolidated-*.safetensors`. But this AWQ model ships only HuggingFace-format sharded safetensors (`model-*.safetensors`), and `load-format: "mistral"` requires `consolidated-*.safetensors` files, which llm-compressor does not produce. This forces the HuggingFace config path, which triggers the bug chain above.
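For comparison, this is roughly the Mistral-native invocation that works for the official FP8 checkpoint (a sketch using vLLM's offline `LLM` API; the equivalent server flags are `--config-format mistral --load-format mistral --tokenizer-mode mistral`):

```python
from vllm import LLM

# Works for the official FP8 repo because it ships params.json plus
# consolidated-*.safetensors; this AWQ repo ships neither file, so the
# same settings fail here and the HuggingFace config path is the only option.
llm = LLM(
    model="mistralai/Devstral-Small-2-24B-Instruct-2512",
    config_format="mistral",   # read params.json instead of config.json
    load_format="mistral",     # load consolidated-*.safetensors shards
    tokenizer_mode="mistral",
)
```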
## Potential fix in this model's `config.json`
Adding "architectures": ["MistralForCausalLM"] (or any non-null value) to text_config in this model's config.json would bypass the Pixtral-12B special case if-statement. However, this alone would not fix the gibberish because Ministral3ForCausalLM does not exist in vLLM v0.15.1's registry β there is no correct value to set it to. The real fix requires vLLM to either:
- Merge the transformers v5 bump (vllm-project/vllm#30566, open since December 12, 2025, still unmerged), or
- Add `Ministral3ForCausalLM` to its model registry.
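For completeness, a minimal sketch of the (insufficient) config-side workaround described above; the local path is hypothetical:

```python
import json
from pathlib import Path

# Hypothetical local checkout of this model repo.
cfg_path = Path("Devstral-Small-2-24B-AWQ/config.json")
cfg = json.loads(cfg_path.read_text())

# Any non-null value makes the Pixtral-12B special case a no-op...
cfg["text_config"]["architectures"] = ["MistralForCausalLM"]
cfg_path.write_text(json.dumps(cfg, indent=2))

# ...but vLLM v0.15.1 still has no Ministral3ForCausalLM to route to,
# so the output remains gibberish either way.
```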
## Environment
- GPU: NVIDIA GeForce RTX 5090 (32 GiB GDDR7 VRAM)
- Docker image: `vllm/vllm-openai:v0.15.1` (February 4, 2026)
- transformers inside container: 4.57.6
- Model revision: `da6366ed3bb6d5207c6544ede10df31e4082e027`
## No alternative 4-bit quant works either
I surveyed every quantized variant of Devstral Small 2 24B on HuggingFace (AWQ, GPTQ, NVFP4, BitsAndBytes NF4, GGUF, SINQ, MLX, EXL3). All HuggingFace-format quantizations share the same architecture-routing bug, and gibberish is also reported for the MLX 4-bit community model. Full survey in the bug report.