Unable to run on SGLang?
#1
by nbaughman - opened
Thanks for creating these quants; they're very much appreciated.
Unfortunately, I've had difficulty running them on SGLang. This applies to both the 8-bit and the 4-bit quants. I've had no issues with vLLM, though. Qwen3-30B-A3B-Instruct-2507-AWQ-4bit works just fine. Any ideas?
```yaml
services:
  sglang:
    image: lmsysorg/sglang:latest
    ipc: "host"
    shm_size: 16GB
    ports:
      - 8001:8001
    volumes:
      - ./models:/models
    command: >
      --model-path /models/Qwen3-Next-80B-A3B-Instruct-AWQ-8bit
      --host 0.0.0.0
      --port 8001
      --log-level info
      --enable-metrics
      --dtype float16
    entrypoint: ["python3", "-m", "sglang.launch_server"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://0.0.0.0:8001/v1/models"]
      interval: 30s
      timeout: 5s
      retries: 20
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
```
Loading safetensors checkpoint shards: 0% Completed | 0/18 [00:00<?, ?it/s]
sglang-1 | [2025-11-05 00:14:14] Scheduler hit an exception: Traceback (most recent call last):
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2786, in run_scheduler_process
sglang-1 | scheduler = Scheduler(
sglang-1 | ^^^^^^^^^^
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 319, in __init__
sglang-1 | self.tp_worker = TpModelWorker(
sglang-1 | ^^^^^^^^^^^^^^
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
sglang-1 | self._model_runner = ModelRunner(
sglang-1 | ^^^^^^^^^^^^
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 319, in __init__
sglang-1 | self.initialize(min_per_gpu_memory)
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 395, in initialize
sglang-1 | self.load_model()
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 749, in load_model
sglang-1 | self.model = get_model(
sglang-1 | ^^^^^^^^^^
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
sglang-1 | return loader.load_model(
sglang-1 | ^^^^^^^^^^^^^^^^^^
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 595, in load_model
sglang-1 | self.load_weights_and_postprocess(
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 603, in load_weights_and_postprocess
sglang-1 | model.load_weights(weights)
sglang-1 | File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_next.py", line 1054, in load_weights
sglang-1 | param = params_dict[name]
sglang-1 | ~~~~~~~~~~~^^^^^^
sglang-1 | KeyError: 'model.layers.5.mlp.shared_expert.gate_gate_up_proj.weight'
sglang-1 |
sglang-1 | [2025-11-05 00:14:14] Received sigquit from a child process. It usually means the child failed.
Loading safetensors checkpoint shards: 0% Completed | 0/18 [00:01<?, ?it/s]
```
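For comparison, this is the kind of quick offline check I use to confirm a quant loads in vLLM at all (a minimal sketch; the path mirrors my compose volume layout, and it only verifies loading and a short generation, not the server setup):

```python
# Sanity check with vLLM's offline API: if this loads and generates,
# the quantized checkpoint itself is fine and the failure above is
# specific to SGLang's weight loading. Path assumed from my setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",
    dtype="float16",
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```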
Same for me :(
Thanks for letting me know. It seems that SGLang and vLLM initialize the model differently, which results in this error. I will look into it and give you an update on any fixes.
Personally, I always use vLLM to run these AWQ models, since llmcompressor is part of the vllm-project.
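From the traceback, `load_weights` in `qwen3_next.py` ends up looking for a parameter name that the instantiated model doesn't have, and the set of expected parameters is derived from the architecture described in config.json. In the meantime, one way to see what the shards actually store for that layer is to list the tensor names directly (a debugging sketch; the model directory is assumed from the compose file above):

```python
# List the shared-expert tensor names stored in layer 5 of the shards,
# to compare against the name the loader fails on. Path is an assumption.
import glob
from safetensors import safe_open

model_dir = "/models/Qwen3-Next-80B-A3B-Instruct-AWQ-8bit"
for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if "layers.5." in name and "shared_expert" in name:
                print(shard, name)
```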
Please redownload the config.json file; it should load with SGLang now :)
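If you'd rather not re-pull all 18 shards, refreshing just config.json should be enough. A minimal sketch with huggingface_hub (the repo_id below is my guess at this repo's id, so substitute the one shown on the model page, and point local_dir at your mounted models folder):

```python
# Re-download only config.json into the local model directory,
# leaving the weight shards untouched. repo_id is an assumption.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-8bit",
    filename="config.json",
    local_dir="./models/Qwen3-Next-80B-A3B-Instruct-AWQ-8bit",
    force_download=True,
)
```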
Hey Cpatonn, thanks mate, that works perfectly! GAFAM => π