Max Schwartzapfel (dystrio) · PRO
5 followers · 3 following
https://dystrio.com
AI & ML interests
Runtime-agnostic model compilation for efficient inference.
Recent Activity
reacted to their post with 🚀 · about 20 hours ago
**Sculpt: A compression ladder for Qwen 3.5 9B**

These checkpoints were designed to improve deployment efficiency while preserving full dense-transformer compatibility. Across the ladder we observe:

* ~2–10% checkpoint size reduction
* up to 25–27% faster prefill at higher compression tiers
* decode throughput roughly unchanged

Because Sculpt operates before quantization, these structural reductions compound with GPTQ, AWQ, GGUF, and INT8/INT4 pipelines, shifting deployment memory thresholds without changing infrastructure assumptions.

The release includes four checkpoints:

- **Default (kf = 0.95)** — fidelity-preserving baseline replacement
- **Production (kf = 0.90)** — balanced serving tier
- **Throughput (kf = 0.88)** — latency-optimized
- **Experimental (kf = 0.82)** — aggressive compression boundary

All checkpoints load directly with Transformers, vLLM, TGI, and GGUF workflows. No custom runtime or kernels required.

The goal of Sculpt is simple: explore how far structured FFN compression plus teacher-guided distillation can improve efficiency while remaining a drop-in dense replacement.

Models:
- [Qwen3.5-9B-Sculpt-Default](https://huggingface.co/dystrio/Qwen3.5-9B-Sculpt-Default)
- [Qwen3.5-9B-Sculpt-Production](https://huggingface.co/dystrio/Qwen3.5-9B-Sculpt-Production)
- [Qwen3.5-9B-Sculpt-Throughput](https://huggingface.co/dystrio/Qwen3.5-9B-Sculpt-Throughput)
- [Qwen3.5-9B-Sculpt-Experimental](https://huggingface.co/dystrio/Qwen3.5-9B-Sculpt-Experimental)

Curious which tradeoff tier people would choose in practice for serving workloads. We want to Sculpt models that work for you. If you give us your workload, we can make the sculpted models meet your SLOs.
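Since the checkpoints are stated to load directly with standard Transformers, loading one tier might look like the sketch below. The `sculpt_repo_id` helper is illustrative (not from the release); only the repo ids and tier names come from the post, and the dtype/device arguments are the usual Transformers defaults, not release-specific settings.

```python
# Hedged sketch: picking and loading a Sculpt tier with Hugging Face Transformers.
# The sculpt_repo_id helper is a hypothetical convenience, not part of the release.

def sculpt_repo_id(tier: str) -> str:
    """Return the Hub repo id for one of the four released Sculpt tiers."""
    assert tier in {"Default", "Production", "Throughput", "Experimental"}
    return f"dystrio/Qwen3.5-9B-Sculpt-{tier}"

repo = sculpt_repo_id("Production")  # balanced serving tier (kf = 0.90)

# Standard Transformers loading path -- per the post, no custom runtime or
# kernels are required (uncomment to actually download and load):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(repo)
# model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")
```

Because Sculpt is a drop-in dense replacement, the same repo id should also work as the model argument for vLLM or TGI serving commands.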
posted an update · 4 days ago
published a model · 4 days ago
dystrio/Qwen3.5-9B-Sculpt-Throughput
Organizations: none yet
dystrio's activity
liked a model · 10 days ago
dystrio/Qwen2.5-7B-Instruct-sculpt-throughput
Text Generation · 5B · Updated 11 days ago · 372 · 1