Qwen3-g023-tiny-v1 (GGUF)

A surgically optimized 27-layer Qwen3 variant that outperforms the original 28-layer model.

Created by selectively deleting a harmful layer and swapping adjacent layers for improved information flow. Scores 92.9/100 with 100% factual accuracy, a 5.1-point improvement over the original Qwen3-1.7B baseline (87.8/100).

Available Quantizations

| Quantization | Bits/Weight | Description | Download |
|---|---|---|---|
| Q8_0 | 8.00 | Highest quality, virtually lossless (USE THIS ONE) | Qwen3-g023-tiny-v1-Q8_0.gguf |
| Q6_K | 6.57 | Excellent quality, good compression | Qwen3-g023-tiny-v1-Q6_K.gguf |
| Q4_K_M | 4.85 | Good balance of quality and size | Qwen3-g023-tiny-v1-Q4_K_M.gguf |
| Q3_K_M | 3.91 | High compression, moderate quality loss | Qwen3-g023-tiny-v1-Q3_K_M.gguf |
| Q2_K | 3.35 | Maximum compression, significant quality loss | Qwen3-g023-tiny-v1-Q2_K.gguf |

Model Details

| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Layers | 27 (one deleted from the original 28) |
| Hidden Size | 2,048 |
| Intermediate Size | 6,144 |
| Attention Heads | 16 query / 8 key-value (GQA) |
| Head Dimension | 128 |
| Vocabulary | 151,936 tokens |
| Max Context | 40,960 tokens |
| RoPE θ | 1,000,000 |
| Tied Embeddings | Yes |
| Total Parameters | ~1.67B |
| Precision (source) | bfloat16 |

Surgery Operations

This model was created by applying two surgical operations to Qwen/Qwen3-1.7B:

  1. Delete layer 10: this layer was identified as harmful to model quality. Removing it alone improved the overall score from the 87.8 baseline to 91.4.
  2. Swap layers 11 ↔ 12 (post-deletion indices): swapping these adjacent blocks improved information flow between the model's middle layers, raising the score further to 92.9.
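The two operations above can be sketched as plain list manipulation on a model's layer stack. The layer names here are placeholders; in practice the entries would be the model's actual transformer blocks.

```python
def apply_surgery(layers):
    """Return a new 27-entry layer list: delete original index 10,
    then swap post-deletion indices 11 and 12."""
    layers = list(layers)                            # copy, don't mutate input
    del layers[10]                                   # 28 -> 27 layers
    layers[11], layers[12] = layers[12], layers[11]  # swap adjacent blocks
    return layers

original = [f"layer_{i}" for i in range(28)]
surgically_modified = apply_surgery(original)
# After deletion, index 11 holds layer_12 and index 12 holds layer_13;
# the swap leaves layer_13 before layer_12.
```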

Key Findings

  • Smaller is better: The 27-layer model outperforms both the 28-layer original and various 29โ€“30 layer expanded models.
  • Layer 10 is actively harmful: Removing it alone yields a +3.6 point improvement.
  • Operations compound selectively: Deletion + swap works, but deletion + duplication degrades quality.

Benchmark Results

| Metric | Original (28L) | v1 (27L) | Δ |
|---|---|---|---|
| Overall Score | 87.8 / 100 | 92.9 / 100 | +5.1 |
| Factual Accuracy | 15 / 17 (88%) | 17 / 17 (100%) | +12 pts |
| Avg Perplexity | n/a | 15.70 | n/a |
| Thinking Mode | ✅ | ✅ | |
| Non-Thinking Mode | ✅ | ✅ | |

Evaluated using a comprehensive test suite with 17 factual questions, 2 completion coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.
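The factual-accuracy component of such a suite can be sketched as a simple substring check against expected answers. The question set and matching rule below are illustrative only, not the actual 17-question suite.

```python
def factual_accuracy(qa_pairs, generate):
    """Count questions whose generated answer contains the expected fact."""
    correct = sum(
        1 for question, expected in qa_pairs
        if expected.lower() in generate(question).lower()
    )
    return correct, len(qa_pairs)

# Illustrative stand-in for a real model call
canned = {"What is the capital of France?": "The capital of France is Paris."}
correct, total = factual_accuracy(
    [("What is the capital of France?", "Paris")],
    lambda q: canned.get(q, ""),
)
```

A real harness would also measure perplexity and repetition on held-out text, as described above.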

Features

  • Thinking mode: full <think> / </think> reasoning support, toggled via the enable_thinking parameter
  • Non-thinking mode: Direct responses without chain-of-thought overhead
  • Tool calling: Full function/tool calling support
  • System prompts: Standard system message support
  • Chat template: Qwen3 ChatML template embedded in the GGUF

Usage

With Ollama

# Download the GGUF and create from Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen3-g023-tiny-v1-Q8_0.gguf

PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 45
PARAMETER min_p 0.1
PARAMETER num_ctx 40000
PARAMETER mirostat 2
PARAMETER mirostat_tau 5.0
PARAMETER mirostat_eta 0.1
PARAMETER repeat_last_n 16384
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0.5
PARAMETER frequency_penalty 1.0

TEMPLATE """{{- if .System }}
<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
<|im_start|>user
{{ .Content }}<|im_end|>
{{- else if eq .Role "assistant" }}
<|im_start|>assistant
{{ .Content }}<|im_end|>
{{- end }}
{{- end }}
<|im_start|>assistant
"""
SYSTEM "You are a helpful assistant."
EOF

ollama create qwen3-tiny-v1 -f Modelfile
ollama run qwen3-tiny-v1

With llama.cpp

# Interactive chat
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  --chat-template chatml -cnv

# Thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
  -n 512

# Non-thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
  -n 128

With Python (llama-cpp-python)

from llama_cpp import Llama

model = Llama("Qwen3-g023-tiny-v1-Q8_0.gguf", n_ctx=4096)
response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.6,
)
print(response["choices"][0]["message"]["content"])
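When thinking mode is active, the reply text contains a <think>…</think> block before the final answer. A minimal, regex-based way to separate the two (assuming the output follows that shape):

```python
import re

def split_thinking(text):
    """Split a reply into (reasoning, final_answer); reasoning may be empty."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_thinking("<think>2+2 is 4.</think>\nThe answer is 4.")
```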

System Requirements

| Quantization | RAM (CPU) | VRAM (GPU) |
|---|---|---|
| Q8_0 | ~2.0 GB | ~2.0 GB |
| Q6_K | ~1.7 GB | ~1.7 GB |
| Q4_K_M | ~1.3 GB | ~1.3 GB |
| Q3_K_M | ~1.1 GB | ~1.1 GB |
| Q2_K | ~0.9 GB | ~0.9 GB |
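These figures roughly track file size, which can be estimated from the parameter count and the bits-per-weight rates in the quantization table; actual memory use adds context buffers and runtime overhead not modeled here.

```python
def gguf_size_gb(params, bits_per_weight):
    """Approximate quantized weight size in GiB: params * bpw / 8 bytes."""
    return params * bits_per_weight / 8 / 1024**3

# ~1.67B parameters at the Q4_K_M rate of 4.85 bits/weight
size = gguf_size_gb(1.67e9, 4.85)  # a bit under 1 GiB of weights
```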

v1 vs v2

This model (v1) is the Phase 1 champion, focused on surgical precision with minimal operations.

v1 (this model) v2
Layers 27 30
Parameters ~1.67B ~1.82B
Operations del + swap swap + interpolate + bridge
Score 92.9 / 100 94.3 / 100
Factual 100% (17/17) 94% (16/17)
Perplexity 15.70 15.17
Use Case Max factual accuracy Max overall score

v1 is recommended when factual accuracy is paramount (100% vs 94%). v2 is recommended when overall quality matters more (94.3 vs 92.9).

Methodology

Layer surgery was performed through a systematic, test-driven development process:

  1. Phase 1: Exhaustive search across 150+ configurations testing deletion, duplication, swapping, interpolation, and combined operations
  2. Evaluation: Each configuration was scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
  3. Selection: The champion was selected based on overall score, with factual accuracy as a tiebreaker
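The selection step above amounts to a loop over candidate configurations, keeping the best overall score with factual accuracy as tiebreaker. The candidate names and scores below are placeholders; only the two configurations described in this card are shown.

```python
def select_champion(candidates, evaluate):
    """Pick the config with the best (overall_score, factual_accuracy) pair."""
    best_config, best_key = None, (-1.0, -1.0)
    for config in candidates:
        overall, factual = evaluate(config)
        key = (overall, factual)  # factual accuracy breaks score ties
        if key > best_key:
            best_config, best_key = config, key
    return best_config, best_key

scores = {"del10": (91.4, 1.0), "del10+swap11_12": (92.9, 1.0)}
champion, (score, factual) = select_champion(scores, lambda c: scores[c])
```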

The surgery framework is available in the source repository.

Credits

  • Base model: Qwen/Qwen3-1.7B by the Qwen team at Alibaba
  • Quantization: llama.cpp
  • Surgery: g023

License

Apache 2.0, the same license as the original Qwen3-1.7B model.
