Unable to run on SGLang?

#1
by nbaughman - opened

Thanks for creating these quants, it's very much appreciated.

Unfortunately, I've had difficulty running them on SGLang. This applies to both the 8-bit and the 4-bit versions. I've had no issues with vLLM, though; Qwen3-30B-A3B-Instruct-2507-AWQ-4bit works just fine. Any ideas?

services:
    sglang:
        image: lmsysorg/sglang:latest
        ipc: "host"
        shm_size: 16GB
        ports:
            - 8001:8001

        volumes:
            - ./models:/models

        command: >
            --model-path /models/Qwen3-Next-80B-A3B-Instruct-AWQ-8bit
            --host 0.0.0.0
            --port 8001
            --log-level info
            --enable-metrics
            --dtype float16

        entrypoint: ["python3", "-m", "sglang.launch_server"]

        healthcheck:
            test: ["CMD", "curl", "-f", "http://0.0.0.0:8001/v1/models"]
            interval: 30s
            timeout: 5s
            retries: 20

        deploy:
            resources:
                reservations:
                    devices:
                        - driver: nvidia
                          count: all
                          capabilities: [gpu]
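For comparison, an equivalent vLLM service (which, per this thread, loads the model without issue) might look like the sketch below. The image tag and flags are assumptions based on vLLM's standard OpenAI-compatible container, not taken from the thread:

```yaml
services:
    vllm:
        image: vllm/vllm-openai:latest
        ipc: "host"
        ports:
            - 8001:8001
        volumes:
            - ./models:/models
        # The default entrypoint serves an OpenAI-compatible API;
        # only the model path and server flags need to be passed.
        command: >
            --model /models/Qwen3-Next-80B-A3B-Instruct-AWQ-8bit
            --host 0.0.0.0
            --port 8001
            --dtype float16
        deploy:
            resources:
                reservations:
                    devices:
                        - driver: nvidia
                          count: all
                          capabilities: [gpu]
```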
Loading safetensors checkpoint shards:   0% Completed | 0/18 [00:00<?, ?it/s]
sglang-1  | [2025-11-05 00:14:14] Scheduler hit an exception: Traceback (most recent call last):
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2786, in run_scheduler_process
sglang-1  |     scheduler = Scheduler(
sglang-1  |                 ^^^^^^^^^^
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 319, in __init__
sglang-1  |     self.tp_worker = TpModelWorker(
sglang-1  |                      ^^^^^^^^^^^^^^
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
sglang-1  |     self._model_runner = ModelRunner(
sglang-1  |                          ^^^^^^^^^^^^
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 319, in __init__
sglang-1  |     self.initialize(min_per_gpu_memory)
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 395, in initialize
sglang-1  |     self.load_model()
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 749, in load_model
sglang-1  |     self.model = get_model(
sglang-1  |                  ^^^^^^^^^^
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
sglang-1  |     return loader.load_model(
sglang-1  |            ^^^^^^^^^^^^^^^^^^
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 595, in load_model
sglang-1  |     self.load_weights_and_postprocess(
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 603, in load_weights_and_postprocess
sglang-1  |     model.load_weights(weights)
sglang-1  |   File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_next.py", line 1054, in load_weights
sglang-1  |     param = params_dict[name]
sglang-1  |             ~~~~~~~~~~~^^^^^^
sglang-1  | KeyError: 'model.layers.5.mlp.shared_expert.gate_gate_up_proj.weight'
sglang-1  | 
sglang-1  | [2025-11-05 00:14:14] Received sigquit from a child process. It usually means the child failed.
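The doubled `gate_` in the missing key hints at a naming mismatch between the checkpoint (or its config.json) and SGLang's Qwen3-Next weight mapping: `load_weights` looks up every checkpoint tensor name in the model's parameter dict, and any name it cannot map raises a `KeyError`. A minimal sketch of that failure mode, with illustrative names rather than the real state dict:

```python
# Illustrative only: a strict weight loader indexes the parameter dict
# by checkpoint tensor name, so an unmapped name fails with KeyError.
params_dict = {
    "model.layers.5.mlp.shared_expert.gate_up_proj.weight": "param tensor",
}

# The checkpoint produced a doubled "gate_" prefix that the loader
# does not know about.
checkpoint_name = "model.layers.5.mlp.shared_expert.gate_gate_up_proj.weight"

try:
    param = params_dict[checkpoint_name]
except KeyError as exc:
    print(f"KeyError: {exc}")
```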
Loading safetensors checkpoint shards:   0% Completed | 0/18 [00:01<?, ?it/s]

The same for me :(

cyankiwi org

Thanks for letting me know. It seems that SGLang and vLLM initialize the model differently, which results in this error. I will look into this and give you an update on any fixes.

Personally, I always use vLLM to run these AWQ models, since llmcompressor is part of vllm-project.

cyankiwi org

Please re-download the config.json file; it should load with SGLang now :)
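For anyone else hitting this: one way to refresh only the config without re-pulling the whole checkpoint is to fetch it directly from the Hub. The local path mirrors the compose file's volume mount, and the repo id is an assumption inferred from this thread; adjust both to match your setup:

```shell
# Re-download just config.json into the local model directory.
# Repo id below is assumed from the thread; replace with the actual one.
curl -L -o ./models/Qwen3-Next-80B-A3B-Instruct-AWQ-8bit/config.json \
  "https://huggingface.co/cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-8bit/resolve/main/config.json"
```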

Hey Cpatonn, thanks mate, that works perfectly! GAFAM => 😁
