OOM on an RTX 4090 24GB

#3
by bird0867

Strangely, this happens no matter what context length I try. I should certainly have enough VRAM to run the model, even with a small context window. Would appreciate any advice.

On an RTX 4090, decreasing the context length to 48k will fit in VRAM, e.g.:

vllm serve Zyphra/ZAYA1-8B --port 8010 \
   --mamba-cache-dtype float32 --dtype bfloat16 \
   --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml --max-model-len 48000
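
Once it's up, a quick sanity check against the OpenAI-compatible endpoint (a sketch, assuming the default localhost binding, the port above, and the default served model name) is:

curl http://localhost:8010/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{"model": "Zyphra/ZAYA1-8B", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'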

Lmk if you are still encountering OOM.

This works for me on a 4090 with the full context length:

vllm serve Zyphra/ZAYA1-8B \
   --served-model-name zaya1-8b \
   --no-use-tqdm-on-load \
   --trust-remote-code \
   --mamba-cache-dtype float32 \
   --dtype bfloat16 \
   --max-model-len 131072 \
   --gpu-memory-utilization 0.98 \
   --kv-cache-dtype auto \
   --max-num-seqs 2 \
   --stream-interval 20 \
   --reasoning-parser qwen3 \
   --enable-auto-tool-choice \
   --tool-call-parser zaya_xml

Yeah, you do need to drop --max-num-seqs from the default of 256 or it'll OOM on the dummy inputs.
Otherwise it should work.

The following worked for me on a 4090:

vllm serve Zyphra/ZAYA1-8B --port 8010 \
   --mamba-cache-dtype float32 --dtype bfloat16 \
   --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml --max-num-seqs 8

Zyphra org

@nqgl Yep, IIRC --max-num-seqs ~16 with the full max model length fit for me on a 4090.
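
For reference, a sketch of that combination, reusing the flags from the commands above:

vllm serve Zyphra/ZAYA1-8B --port 8010 \
   --mamba-cache-dtype float32 --dtype bfloat16 \
   --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml \
   --max-model-len 131072 --max-num-seqs 16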
