I get 40 TPS on my RX 7900 XTX
Once the vLLM gptq_marlin kernel issue gets resolved, I'm hoping to get closer to 100 TPS. Until then, this is the best model that has worked for me so far with ROCm 7.2. Also, my desktop is rendered on the iGPU, so the dGPU is fully dedicated to inference.
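Since the unit below passes /dev/dri/renderD128 and card0 through to the container, it's worth confirming those nodes actually belong to the XTX; with an iGPU active, the enumeration order can differ between boards and kernels, so treat renderD128/card0 as values to verify rather than copy. A quick check:

# Map PCI addresses to card*/renderD* nodes; the iGPU and dGPU show up as
# separate entries, so match the PCI bus ID of the 7900 XTX here.
ls -l /dev/dri/by-path/

# Ask ROCm which GPUs it sees and in what order; device 0 here should be
# what HIP_VISIBLE_DEVICES=0 selects, but double-check on your own box.
rocm-smi --showproductname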
Specs (Built February 2026)
CPU: AMD Ryzen 9 9950X (16-core / 32-thread, Zen 5)
Motherboard: ASUS TUF Gaming X870E-PLUS WiFi
RAM: G.Skill Trident Z5 Neo RGB 128GB (2×64GB) DDR5-6000 CL32
GPU: ASUS TUF Gaming RX 7900 XTX OC 24GB
Storage: Samsung PM1733 3.84TB Enterprise NVMe U.2
Case: Fractal Design Meshify 3 XL Solid Black
CPU Cooler: Noctua NH-D15 chromax.black
Power Supply: be quiet! Dark Power 14 1200W 80+ Titanium
Config (systemd unit, CHANGEME fields are placeholders):
[Unit]
Description=CHANGEME vLLM Inference Server
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target
[Service]
Restart=on-failure
RestartSec=10
ExecStart=/usr/bin/docker run --rm \
  --name changeme-vllm \
  --network=host \
  --group-add=video \
  --group-add=render \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device=/dev/kfd \
  --device=/dev/dri/renderD128 \
  --device=/dev/dri/card0 \
  -e HIP_VISIBLE_DEVICES=0 \
  -e HUGGING_FACE_HUB_TOKEN=CHANGEME \
  -v /home/CHANGEME/.cache/huggingface:/root/.cache/huggingface \
  -v /home/CHANGEME/.cache/vllm:/root/.cache/vllm \
  -v /tmp/torchinductor_root:/tmp/torchinductor_root \
  rocm/vllm-dev:nightly \
  python -m vllm.entrypoints.openai.api_server \
  --model OPEA/DeepSeek-R1-Distill-Qwen-32B-int4-gptq-sym-inc \
  --dtype float16 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --reasoning-parser deepseek_r1
ExecStop=/usr/bin/docker stop changeme-vllm
[Install]
WantedBy=multi-user.target
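To use it, save the unit under a name of your choice (changeme-vllm.service below is just a placeholder, like the CHANGEME fields), then reload and enable it. Once the server is up, hitting the OpenAI-compatible endpoint also gives a rough TPS number: divide the completion_tokens the server reports by the wall-clock time.

sudo cp changeme-vllm.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now changeme-vllm
journalctl -u changeme-vllm -f   # watch model load and kernel selection logs

# Rough throughput check: tokens reported in "usage" divided by wall time.
time curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "OPEA/DeepSeek-R1-Distill-Qwen-32B-int4-gptq-sym-inc",
       "prompt": "Explain KV caching in one paragraph.",
       "max_tokens": 256}' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["usage"])'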
