OOM on an RTX 4090 24GB

#3
by bird0867

Strangely, this happens no matter what context length I try. I should certainly have enough VRAM to run the model, even with a small context window. Would appreciate any advice.

On an RTX 4090, decreasing the context length to 48k will fit in VRAM, e.g.:

vllm serve Zyphra/ZAYA1-8B --port 8010 \
   --mamba-cache-dtype float32 --dtype bfloat16 \
   --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml --max-model-len 48000
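
Once it's up, a quick sanity check against the OpenAI-compatible endpoint (a sketch, assuming the default localhost binding, the port above, and the default served model name) is:

curl http://localhost:8010/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{"model": "Zyphra/ZAYA1-8B", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'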

Lmk if you are still encountering OOM.

This works for me on a 4090 with the full context length:

vllm serve Zyphra/ZAYA1-8B \
   --served-model-name zaya1-8b \
   --no-use-tqdm-on-load \
   --trust-remote-code \
   --mamba-cache-dtype float32 \
   --dtype bfloat16 \
   --max-model-len 131072 \
   --gpu-memory-utilization 0.98 \
   --kv-cache-dtype auto \
   --max-num-seqs 2 \
   --stream-interval 20 \
   --reasoning-parser qwen3 \
   --enable-auto-tool-choice \
   --tool-call-parser zaya_xml

Yeah, you do need to drop --max-num-seqs from the default of 256 or it'll OOM on the dummy inputs.
Otherwise it should work.

The following worked for me on a 4090:

vllm serve Zyphra/ZAYA1-8B --port 8010 \
   --mamba-cache-dtype float32 --dtype bfloat16 \
   --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml --max-num-seqs 8

Zyphra org

@nqgl Yep, IIRC --max-num-seqs ~16 with the full max model length fit for me on a 4090.
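
For reference, a sketch of that combination, reusing the flags from the commands above:

vllm serve Zyphra/ZAYA1-8B --port 8010 \
   --mamba-cache-dtype float32 --dtype bfloat16 \
   --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml \
   --max-model-len 131072 --max-num-seqs 16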
