HumanEval: Base vs PolarQuant vs FOEM
Just to compare, I've benchmarked Qwopus3.5-9B-v3 vs Qwopus3.5-9B-v3-PolarQuant-Q5 vs Qwopus3.5-9B-v3-INT4-FOEM and the results are:
| Model Variant | Pass@1 (Mean) | Mean Time | Stability |
|---|---|---|---|
| Base (BF16) | 66.87% | 4:55 min | ±0.6% |
| FOEM (INT4) | 62.80% | 2:58 min | 0.0% var |
| PolarQuant (Q5) | 55.49% | 3:57 min | 0.0% var |
Don't get me wrong, but I suspect something is going wrong between vLLM and PolarQuant Q5.
Settings:
##################################################################
#!/bin/bash
source /opt/llm/human/venv/bin/activate
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=12
export PYTORCH_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.5"
export VLLM_NO_USAGE_STATS=1
export VLLM_DO_NOT_TRACK=1
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export VLLM_USE_FLASHINFER_SAMPLER=1
export HF_ALLOW_CODE_EVAL="1"
taskset -c 0-11 lm_eval --model vllm \
  --model_args pretrained=/opt/llm/modelos/Qwopus3.5-9B-v3,gpu_memory_utilization=0.8,enforce_eager=True,language_model_only=True,max_model_len=16384 \
  --tasks humaneval \
  --batch_size 1
##################################################################
@Arien0 - thanks for running this benchmark, and for being thorough with the methodology.
The 55.49% vs 66.87% gap on HumanEval is concerning and worth investigating.
Likely cause: Our INT4 quantization uses naive round-to-nearest symmetric (no calibration data). FOEM likely uses calibrated quantization (GPTQ-style with representative data), which preserves code generation accuracy better.
For PPL (WikiText-2), naive INT4 performs well (6.56 vs 6.37 BF16) because perplexity averages over many tokens. But HumanEval tests functional correctness - a single wrong token in a function body = fail. This is where calibrated quantization matters.
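To make the distinction concrete, here is a minimal sketch of what naive round-to-nearest symmetric quantization does per weight group (illustrative PyTorch only, not our actual kernel; the group size and INT4 range are the only assumptions):

import torch

def rtn_symmetric_int4(w: torch.Tensor, group_size: int = 64) -> torch.Tensor:
    """Naive round-to-nearest symmetric INT4: one scale per group, no calibration data."""
    rows, cols = w.shape
    wg = w.reshape(rows, cols // group_size, group_size)
    # Symmetric scheme: scale from the max magnitude in each group, zero-point fixed at 0.
    scale = (wg.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)  # INT4 range [-8, 7]
    q = torch.clamp(torch.round(wg / scale), -8, 7)                      # the "round-to-nearest" step
    return (q * scale).reshape(rows, cols)                               # dequantized weights

Every weight is rounded independently of how the layer is actually used, which is exactly why a calibrated method that compensates rounding error against real activations holds up better on functional-correctness benchmarks.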
What we know works well with PQ5/INT4:
- WikiText-2 PPL: +0.19 (6.56 vs 6.37)
- HellaSwag: +2.5% (PQ5 actually beats BF16, confirmed at 27B)
- Winogrande: +0.5%
Where it struggles:
- HumanEval (your result) - code generation is more sensitive to weight precision
Next steps we're considering:
- Calibrated INT4 (GPTQ-style with calibration data) instead of naive round-to-nearest - this should close the gap with FOEM (rough sketch after this list)
- INT8 option for code-sensitive use cases - less compression but better precision
- Test with the MoE 35B which has more parameters and may be more robust to quantization
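On the first point, a rough sketch of what a calibrated pass could look like with the GPTQModel library (model paths, calibration corpus, sample count, and batch size are placeholders, not a tested recipe):

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder calibration set; a code-heavy corpus would likely help HumanEval specifically.
calib = load_dataset("allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz",
                     split="train").select(range(512))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128, sym=True, desc_act=True)
model = GPTQModel.load("/opt/llm/modelos/Qwopus3.5-9B-v3", quant_config)
model.quantize(calib, batch_size=4)   # calibration data drives the per-layer error compensation
model.save("/opt/llm/modelos/Qwopus3.5-9B-v3-GPTQ-INT4")   # placeholder output path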
Thanks for the honest comparison - this is exactly the kind of data that pushes us to improve. The 113 tok/s on RTX 3090 is still a win for inference speed, but quality needs to match.
Update: PolarQuant v7 - INT4 that BEATS BF16
@Arien0 your benchmark was the catalyst for this. After 5 iterative experiments today, we found the optimal INT4 config:
| Method | HumanEval pass@1 | Config |
|---|---|---|
| PolarQuant v7 | 67.07% | GPTQ INT4 gs64 + FOEM |
| BF16 Base | 66.87% | - |
| FOEM INT4 (your test) | 62.80% | GPTQ gs128 + FOEM |
| GPTQ desc_act | 60.98% | GPTQ gs128, no FOEM |
| PolarQuant Q5 (naive) | 55.49% | Round-to-nearest |
The key discovery: group_size=64
Moving from gs128 to gs64 was the dominant factor, jumping from 61.59% to 67.07% (+5.48pp). FOEM added +0.61pp on top. Same Marlin kernel, same inference speed, only +0.1 GB larger.
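Back-of-the-envelope on why gs64 is nearly free: halving the group size doubles the number of stored scales, but the scales are a tiny fraction of the checkpoint (rough numbers for ~9B quantized weights, not measured from the file):

params = 9e9                      # approx. number of weights covered by INT4
for gs in (128, 64):
    scales = params / gs          # one FP16 scale per group (symmetric, so no zero-points)
    print(f"gs{gs}: {scales * 2 / 1e9:.2f} GB of scales")
# gs128: 0.14 GB, gs64: 0.28 GB -> roughly +0.1 GB extra,
# while each scale now covers half as many weights, so per-group error drops.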
What we tried and learned
- Hadamard rotation + GPTQ = 60.98% (no improvement - GPTQ already handles weight distribution)
- FOEM gs128 = 61.59% (+0.61pp over vanilla GPTQ)
- FOEM gs64 = 67.07% (the breakthrough)
Universal config (works for any model)
QuantizeConfig(
    bits=4, group_size=64, sym=True, desc_act=True,        # INT4, 64-weight groups, symmetric, activation-ordered
    foem=FOEMConfig(alpha=0.25, beta=0.2, device="auto"),   # FOEM correction applied on top of GPTQ
)
Model published: Qwopus3.5-9B-v3-PolarQuant-v7-GPTQ
8.7 GB, Marlin native, vLLM ready. Would love if you could verify with your benchmark setup!
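If it saves you a step, here is a minimal vLLM smoke test for the new checkpoint (model path taken from the card above; vLLM should pick up the GPTQ/Marlin settings from the repo config, so no extra quantization flag ought to be needed):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwopus3.5-9B-v3-PolarQuant-v7-GPTQ",
          gpu_memory_utilization=0.8, max_model_len=16384)
out = llm.generate(["def fibonacci(n):"],
                   SamplingParams(temperature=0.0, max_tokens=128))
print(out[0].outputs[0].text)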
Quantization acts as a regularizer - consistent with our earlier finding that PQ5 beats BF16 on HellaSwag (+2.5%) and Winogrande (+0.5%). Now confirmed on HumanEval too.
Try these settings in GPTQModel (those layers are very sensitive to quantization in Qwen3.5 and are better left in BF16):
dynamic={
    "-:.*linear_attn.*": {},
    "-:.*visual.*": {},
    "-:.*mtp.*": {},
    "-:lm_head": {},
    "-:model.language_model.embed_tokens": {},
    "-:.*norm.*": {},
}
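Roughly how that map plugs into a quantization config (a sketch only; the "-:" prefix tells GPTQModel to leave the matching modules unquantized, and the other values just mirror the v7 settings above):

from gptqmodel import QuantizeConfig

# Sketch: the skip map combined with the gs64 config above
# ("-:" = do not quantize, so these modules stay in BF16).
quant_config = QuantizeConfig(
    bits=4, group_size=64, sym=True, desc_act=True,
    dynamic={
        "-:.*linear_attn.*": {},
        "-:.*visual.*": {},
        "-:.*mtp.*": {},
        "-:lm_head": {},
        "-:model.language_model.embed_tokens": {},
        "-:.*norm.*": {},
    },
)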
@Arien0 - this is gold, thank you. The dynamic skip list confirms what we've been piecing together from isolated tests: Qwen3.5's linear_attn.*, mtp.*, lm_head, embed_tokens, and all norm.* are precision-sensitive and should stay in BF16.
Your pattern list is especially useful because:
- linear_attn.* - carries most of the long-context reasoning signal
- mtp.* - we already saw MTP degrades under INT4 (discussion #2)
- lm_head + embed_tokens + norm.* - classic "don't quantize these" for token-level precision
We'll test this as PolarQuant v8 against v7 (gs64 + FOEM, 67.07% HumanEval on 9B standard). If the dynamic skip list lifts us further while preserving Marlin speed, that's the next breakthrough. Will tag you on the v8 model card when it's ready.
Question: did you arrive at this list through ablation, or from observing similar patterns on other Qwen quants? Would be useful to know if the pattern generalizes to Qwen3.5-3B and the MoE variants.
Those lines are in the standard llmcompressor recipe:
ignore=[
    "lm_head",
    "re:.*visual.*",
    "re:.*linear_attn.*",
],
They are found in many Qwen3.5 quants.
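For completeness, roughly where that list sits in an llmcompressor one-shot pass (a sketch following the library's documented GPTQModifier/oneshot flow; model path, dataset, and sample count are placeholders):

from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model = AutoModelForCausalLM.from_pretrained("/opt/llm/modelos/Qwopus3.5-9B-v3", torch_dtype="auto")
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*visual.*", "re:.*linear_attn.*"],   # the skip list above
)
oneshot(model=model, dataset="open_platypus", recipe=recipe,
        max_seq_length=2048, num_calibration_samples=512)
model.save_pretrained("Qwopus3.5-9B-v3-W4A16", save_compressed=True)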
Got it - knowing the recipe is the llmcompressor standard (not a one-off) makes it portable across Qwen3.5 quants, which is useful.
We had independently hit the linear_attn skip from our own ablation: the GDN gate sub-layers in_proj_a / in_proj_b are 0.07% of params but catastrophic to quantize (collapses hybrid attention). Good to see the recipe codifies that plus the layers we hadn't isolated individually (mtp, embed_tokens, norm.*).
v8 with the full explicit dynamic skip is queued. Heads-down on a different track this week so it won't be today, but I'll tag you on the v8 card. If it lifts past v7's 67.07%, that's a clean progression worth documenting.