Qwen3.5-0.8B Chat — MLX (Apple Silicon)

Text-only extraction of Qwen3.5-0.8B quantized for on-device LLM chat on Apple Silicon via MLX.

Architecture

Qwen3.5 is a hybrid model with 24 layers:

  • 18× DeltaNet — linear attention with gated delta rule recurrence, O(1) memory per step
  • 6× GatedAttention — full scaled dot-product attention with KV cache, partial RoPE (25%)
  • Pattern: [linear, linear, linear, full] × 6
  • Tied word embeddings (lm_head = embed_tokens)
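The repeating [linear, linear, linear, full] pattern fully determines which of the 24 layers carry a KV cache. A minimal sketch of that indexing (the variable names are illustrative, not from the model code):

```python
# 24 layers arranged as [linear, linear, linear, full] repeated 6 times.
PATTERN = ["linear", "linear", "linear", "full"]
layer_types = [PATTERN[i % len(PATTERN)] for i in range(24)]

# DeltaNet (linear) layers need no KV cache; GatedAttention (full) layers do.
full_attention_layers = [i for i, t in enumerate(layer_types) if t == "full"]
# Full attention lands at indices 3, 7, 11, 15, 19, 23 — every fourth layer.
```

This gives 18 linear layers and 6 full-attention layers, matching the counts above, and means only 6 of the 24 layers contribute to KV-cache memory growth with sequence length.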

Variants

Variant   Size     Path
INT4      404 MB   int4/model.safetensors
INT8      786 MB   int8/model.safetensors

Each variant includes config.json, tokenizer.json, and tokenizer_config.json.

Usage

import Qwen3Chat

let model = try await Qwen35MLXChat.fromPretrained(quantization: .int4)
let response = try model.generate(
    messages: [ChatMessage(role: .user, content: "Hello!")],
    sampling: ChatSamplingConfig(temperature: 0.3, maxTokens: 100)
)

Part of the soniqo speech toolkit for Apple Silicon.

Conversion

Quantized directly from Qwen/Qwen3.5-0.8B using mx.quantize() (group_size=64). Text model extracted, vision tower removed. Norm weights adjusted (+1). Conv1d transposed to MLX channels-last format.
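As a rough sanity check on the variant sizes above, the per-weight storage cost of group-wise quantization can be sketched. This assumes one fp16 scale and one fp16 bias stored per group, which matches MLX's default affine quantization scheme; the function name is illustrative:

```python
def quantized_bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    # Each group of `group_size` weights shares one scale and one bias
    # (assumed fp16 each), amortized across the group.
    return bits + (scale_bits + bias_bits) / group_size

quantized_bits_per_weight(4)  # 4.5 bits/weight for INT4, group_size=64
quantized_bits_per_weight(8)  # 8.5 bits/weight for INT8, group_size=64
```

At ~0.8B parameters this lands in the right ballpark for the 404 MB / 786 MB checkpoints, with the remainder explained by unquantized tensors such as norms.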

python scripts/convert_qwen35_chat_mlx.py --output int4/ --bits 4
python scripts/convert_qwen35_chat_mlx.py --output int8/ --bits 8
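The Conv1d transpose mentioned above amounts to a single axis permutation: PyTorch stores Conv1d weights as (out_channels, in_channels, kernel), while MLX expects channels-last (out_channels, kernel, in_channels). A dependency-free sketch of that step (the helper name is illustrative; the actual script presumably does this with array ops):

```python
def conv1d_to_mlx(w):
    # w: nested lists of shape (out_channels, in_channels, kernel).
    # Returns shape (out_channels, kernel, in_channels) — MLX channels-last.
    out_c, in_c, k = len(w), len(w[0]), len(w[0][0])
    return [[[w[o][i][j] for i in range(in_c)]
             for j in range(k)]
            for o in range(out_c)]
```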
