# Qwen3-g023-tiny-v1 – GGUF

A surgically optimized 27-layer Qwen3 variant that outperforms the original 28-layer model.

Created by deleting one harmful layer and swapping two adjacent layers to improve information flow. Scores 92.9/100 with 100% factual accuracy, a 5.1-point improvement over the original Qwen3-1.7B baseline (87.8/100).
## Available Quantizations
| Quantization | Bits/Weight | Description | Download |
|---|---|---|---|
| Q8_0 | 8.00 | Highest quality, virtually lossless (recommended) | Qwen3-g023-tiny-v1-Q8_0.gguf |
| Q6_K | 6.57 | Excellent quality, good compression | Qwen3-g023-tiny-v1-Q6_K.gguf |
| Q4_K_M | 4.85 | Good balance of quality and size | Qwen3-g023-tiny-v1-Q4_K_M.gguf |
| Q3_K_M | 3.91 | High compression, moderate quality loss | Qwen3-g023-tiny-v1-Q3_K_M.gguf |
| Q2_K | 3.35 | Maximum compression, significant quality loss | Qwen3-g023-tiny-v1-Q2_K.gguf |
## Model Details
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Layers | 27 (28 original − 1 deleted) |
| Hidden Size | 2,048 |
| Intermediate Size | 6,144 |
| Attention Heads | 16 query / 8 key-value (GQA) |
| Head Dimension | 128 |
| Vocabulary | 151,936 tokens |
| Max Context | 40,960 tokens |
| RoPE θ | 1,000,000 |
| Tied Embeddings | Yes |
| Total Parameters | ~1.67B |
| Precision (source) | bfloat16 |
## Surgery Operations
This model was created by applying two surgical operations to Qwen/Qwen3-1.7B:
- Delete layer 10 – layer 10 was identified as harmful to model quality; removing it improved the overall score from 87.8 to 91.4.
- Swap layers 11 ↔ 12 (post-deletion indices) – swapping these adjacent decoder blocks optimized information flow between the model's middle layers, further improving the score to 92.9.
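The two operations can be sketched on a plain list of decoder blocks. This is an illustrative reconstruction, not the repository's actual surgery code; a real run would mutate `model.model.layers` (a `torch.nn.ModuleList` in the Hugging Face layout) and update `config.num_hidden_layers` afterward.

```python
# Illustrative sketch of the two surgery operations. Strings stand in for
# decoder blocks; a real run would operate on model.model.layers.

def apply_surgery(layers, delete_idx, swap_pair):
    """Delete one layer, then swap two layers at post-deletion indices."""
    kept = [layer for i, layer in enumerate(layers) if i != delete_idx]
    i, j = swap_pair
    kept[i], kept[j] = kept[j], kept[i]
    return kept

original = [f"layer_{i}" for i in range(28)]          # Qwen3-1.7B has 28 blocks
surgered = apply_surgery(original, delete_idx=10, swap_pair=(11, 12))

print(len(surgered))                 # 27
print(surgered[11], surgered[12])    # layer_13 layer_12
```

Note that the swap indices refer to the 27-layer model: after deleting layer 10, post-deletion index 11 holds the original layer 12.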
## Key Findings
- Smaller is better: The 27-layer model outperforms both the 28-layer original and various 29โ30 layer expanded models.
- Layer 10 is actively harmful: Removing it alone yields a +3.6 point improvement.
- Operations compound selectively: Deletion + swap works, but deletion + duplication degrades quality.
## Benchmark Results

| Metric | Original (28L) | v1 (27L) | Δ |
|---|---|---|---|
| Overall Score | 87.8 / 100 | 92.9 / 100 | +5.1 |
| Factual Accuracy | 15 / 17 (88%) | 17 / 17 (100%) | +12% |
| Avg Perplexity | – | 15.70 | – |
| Thinking Mode | ✓ | ✓ | – |
| Non-Thinking Mode | ✓ | ✓ | – |
Evaluated using a comprehensive test suite with 17 factual questions, 2 completion coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.
## Features

- Thinking mode: Full `<think>`/`</think>` reasoning support; toggle via the `enable_thinking` parameter
- Non-thinking mode: Direct responses without chain-of-thought overhead
- Tool calling: Full function/tool calling support
- System prompts: Standard system message support
- Chat template: Qwen3 ChatML template embedded in the GGUF
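As a quick illustration of the two modes, the ChatML prompt for a single user turn can be hand-built. `chatml_prompt` is a hypothetical helper, not part of any library; pre-opening a `<think>` block or prepending `/no_think` mirrors the llama.cpp invocations shown under Usage.

```python
# Hypothetical helper that hand-builds a ChatML prompt for one user turn.
# thinking=True pre-opens a <think> block; thinking=False prepends /no_think.

def chatml_prompt(user_msg, thinking=True):
    if not thinking:
        user_msg = "/no_think " + user_msg
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if thinking:
        prompt += "<think>\n"
    return prompt

print(chatml_prompt("What is 2+2?", thinking=False))
```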
## Usage

### With Ollama
```bash
# Download the GGUF and create from a Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen3-g023-tiny-v1-Q8_0.gguf
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 45
PARAMETER min_p 0.1
PARAMETER num_ctx 40000
PARAMETER mirostat 2
PARAMETER mirostat_tau 5.0
PARAMETER mirostat_eta 0.1
PARAMETER repeat_last_n 16384
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0.5
PARAMETER frequency_penalty 1.0
TEMPLATE """{{- if .System }}
<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
<|im_start|>user
{{ .Content }}<|im_end|>
{{- else if eq .Role "assistant" }}
<|im_start|>assistant
{{ .Content }}<|im_end|>
{{- end }}
{{- end }}
<|im_start|>assistant
"""
SYSTEM "You are a helpful assistant."
EOF

ollama create qwen3-tiny-v1 -f Modelfile
ollama run qwen3-tiny-v1
```
### With llama.cpp
```bash
# Interactive chat
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  --chat-template chatml -cnv

# Thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
  -n 512

# Non-thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
  -n 128
```
### With Python (llama-cpp-python)
```python
from llama_cpp import Llama

model = Llama("Qwen3-g023-tiny-v1-Q8_0.gguf", n_ctx=4096)

response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.6,
)
print(response["choices"][0]["message"]["content"])
```
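When driving the model programmatically in thinking mode, the reasoning arrives inline in the response text. A small post-processing helper (illustrative, not part of llama-cpp-python) can strip it:

```python
import re

# Illustrative helper: remove <think>...</think> reasoning blocks from a
# thinking-mode response, leaving only the final answer.
def strip_think(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

raw = "<think>\nParis is the capital of France.\n</think>\nParis."
print(strip_think(raw))  # Paris.
```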
## System Requirements
| Quantization | RAM (CPU) | VRAM (GPU) |
|---|---|---|
| Q8_0 | ~2.0 GB | ~2.0 GB |
| Q6_K | ~1.7 GB | ~1.7 GB |
| Q4_K_M | ~1.3 GB | ~1.3 GB |
| Q3_K_M | ~1.1 GB | ~1.1 GB |
| Q2_K | ~0.9 GB | ~0.9 GB |
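The figures above can be sanity-checked with a rough rule of thumb: file size ≈ parameters × bits-per-weight ÷ 8, with runtime RAM/VRAM adding KV cache and framework overhead on top. The values below are estimates derived from this card's numbers, not measurements.

```python
# Rough GGUF file-size estimate from the bits-per-weight column above.
# Runtime RAM/VRAM adds KV cache and overhead, so the table's figures run higher.
PARAMS = 1.67e9  # ~1.67B parameters

for name, bpw in [("Q8_0", 8.00), ("Q6_K", 6.57), ("Q4_K_M", 4.85),
                  ("Q3_K_M", 3.91), ("Q2_K", 3.35)]:
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.2f} GiB on disk")
```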
## v1 vs v2
This model (v1) is the Phase 1 champion, focused on surgical precision with minimal operations.
| | v1 (this model) | v2 |
|---|---|---|
| Layers | 27 | 30 |
| Parameters | ~1.67B | ~1.82B |
| Operations | del + swap | swap + interpolate + bridge |
| Score | 92.9 / 100 | 94.3 / 100 |
| Factual | 100% (17/17) | 94% (16/17) |
| Perplexity | 15.70 | 15.17 |
| Use Case | Max factual accuracy | Max overall score |
v1 is recommended when factual accuracy is paramount (100% vs 94%). v2 is recommended when overall quality matters more (94.3 vs 92.9).
## Methodology
Layer surgery was performed through a systematic, test-driven development process:
- Phase 1: Exhaustive search across 150+ configurations testing deletion, duplication, swapping, interpolation, and combined operations
- Evaluation: Each configuration was scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
- Selection: The champion was selected based on overall score, with factual accuracy as a tiebreaker
The surgery framework is available in the source repository.
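The selection rule from the last step is easy to express in code. The scores below are taken from this card; the configuration names are illustrative.

```python
# Champion selection: highest overall score, factual accuracy as tiebreaker.
# Scores are from this card; configuration names are illustrative.
configs = [
    {"name": "baseline-28L",  "score": 87.8, "factual": 15},
    {"name": "delete10-27L",  "score": 91.4, "factual": 17},
    {"name": "delete10+swap", "score": 92.9, "factual": 17},
]
champion = max(configs, key=lambda c: (c["score"], c["factual"]))
print(champion["name"])  # delete10+swap
```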
## Credits
- Base model: Qwen/Qwen3-1.7B by the Qwen team at Alibaba
- Quantization: llama.cpp
- Surgery: g023
## License

Apache 2.0, the same as the original Qwen3-1.7B model.