Running Llama-3_3-Nemotron-Super-49B-v1_5 on DGX Spark with NGC vLLM Container
Hardware
- System: NVIDIA DGX Spark
- Memory: 128GB unified memory (CPU+GPU shared)
- GPU: Single GPU (Grace Blackwell architecture)
- Current working model: gpt-oss-120b runs successfully with NGC vLLM container
Current Setup
I'm using the NGC vLLM container for inference: `nvcr.io/nvidia/vllm:25.11-py3` (vLLM 0.11.0).
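To confirm the build, I first checked the vLLM version inside the image (a quick sanity check; I'm assuming the image lets you run an arbitrary command with vLLM on the Python path):

```
# Print the vLLM version bundled in the NGC image
sudo docker run --rm --gpus all nvcr.io/nvidia/vllm:25.11-py3 \
  python3 -c "import vllm; print(vllm.__version__)"
```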
My working gpt-oss-120b configuration:

```
sudo docker run -d \
  --gpus all \
  --ipc=host \
  --shm-size 32g \
  -v /home/data/models/gpt-oss-120b:/model \
  -p 8000:8000 \
  nvcr.io/nvidia/vllm:25.11-py3 \
  vllm serve /model \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 131072 \
  --trust-remote-code \
  --generation-config=vllm
```
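For reference, this is how I sanity-check that server once it is up (the served model name defaults to the path passed to `vllm serve`, here `/model`):

```
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/model", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 16}'
```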
Questions
The HuggingFace example uses `pip install vllm==0.9.2` with `--tensor-parallel-size=8`. I need to adapt this for DGX Spark (single GPU, unified memory).
1. NGC vLLM 0.11.0 Compatibility
- Is `nvcr.io/nvidia/vllm:25.11-py3` (vLLM 0.11.0) compatible with this model?
- Or must I use `vllm==0.9.2` specifically?
2. Required Parameters
Please confirm or correct each parameter for DGX Spark:
| Parameter | My assumption | Correct? |
|---|---|---|
| `--trust-remote-code` | Required | ? |
| `--enforce-eager` | Required | ? |
| `--gpu-memory-utilization` | 0.7 (unified memory constraint; see arithmetic below) | ? |
| `--max-model-len` | 32768 or 65536? | ? |
| `--tensor-parallel-size` | 1 (single GPU) | ? |
| `--generation-config` | vllm | ? |
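On the `--gpu-memory-utilization` row, my back-of-envelope arithmetic (assuming bf16 weights at 2 bytes per parameter): 49B × 2 bytes ≈ 98 GB for the weights alone, versus a budget of 0.7 × 128 GB ≈ 90 GB. If that's right, bf16 weights don't fit at 0.7 on this machine, so either the utilization has to go up or a quantized checkpoint is needed (FP8 would be roughly 49 GB, if one is available). Whatever headroom remains is what bounds `--max-model-len` via the KV cache, which is why I'm unsure between 32768 and 65536.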
3. DeciLMForCausalLM Architecture
- I saw GitHub issues about `DeciLMForCausalLM` not being supported in some vLLM versions.
- Does NGC vLLM 0.11.0 support this architecture natively, or does `--trust-remote-code` handle it?
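Two checks I can run myself, sketched below (assuming `ModelRegistry.get_supported_archs()` is still the public API in this vLLM version):

```
# What architecture does the checkpoint declare?
python3 -c "import json; print(json.load(open('/home/data/models/llama-nemotron-super-49b/config.json'))['architectures'])"

# Is a matching architecture registered in the container's vLLM build?
sudo docker run --rm --gpus all nvcr.io/nvidia/vllm:25.11-py3 \
  python3 -c "from vllm import ModelRegistry; print([a for a in ModelRegistry.get_supported_archs() if 'DeciLM' in a])"
```

If the second command prints an empty list, I don't know whether `--trust-remote-code` alone can cover an unregistered architecture in vLLM, hence the question.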
4. Reasoning Mode
- Does the vLLM deployment support `<think>` tag parsing natively?
- Or is additional configuration needed for the reasoning on/off modes?
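For concreteness, this is the variant I was considering (untested; I'm assuming vLLM's generic `<think>`-tag parser applies to this model's output format, which may be wrong):

```
# Untested guess: deepseek_r1 is vLLM's parser for <think>...</think> blocks
vllm serve /model \
  --host 0.0.0.0 \
  --port 8000 \
  --reasoning-parser deepseek_r1
```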
5. Complete Docker Command
If possible, please provide a tested docker run command for DGX Spark with:
- NGC vLLM container (or specify if pip version is required)
- Single GPU / unified memory configuration
- Recommended context length for 128GB unified memory
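To make the request concrete, here is my untested draft, adapted from the working gpt-oss-120b command above; the memory-utilization and context-length values in particular are guesses:

```
sudo docker run -d \
  --gpus all \
  --ipc=host \
  --shm-size 32g \
  -v /home/data/models/llama-nemotron-super-49b:/model \
  -p 8000:8000 \
  nvcr.io/nvidia/vllm:25.11-py3 \
  vllm serve /model \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 32768 \
  --trust-remote-code \
  --enforce-eager \
  --generation-config vllm
```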
Model Location
`/home/data/models/llama-nemotron-super-49b/`

(Downloaded via `huggingface-cli download nvidia/Llama-3_3-Nemotron-Super-49B-v1_5`; the full invocation is below.)
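For completeness, the full download command with the target directory spelled out (assuming `--local-dir`, which is how the files ended up at the path above):

```
huggingface-cli download nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 \
  --local-dir /home/data/models/llama-nemotron-super-49b
```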
Thank you for any guidance. I want to avoid trial-and-error on production hardware.