# Qwen3-g023-tiny-v1 – GGUF

A surgically optimized 27-layer Qwen3 variant that outperforms the original 28-layer model.

Created by deleting one harmful layer and swapping two adjacent layers to improve information flow. Scores 92.9/100 with 100% factual accuracy, a 5.1-point improvement over the original Qwen3-1.7B baseline (87.8/100).
## Available Quantizations
| Quantization | Bits/Weight | Description | Download |
|---|---|---|---|
| Q8_0 | 8.00 | Highest quality, virtually lossless (recommended) | Qwen3-g023-tiny-v1-Q8_0.gguf |
| Q6_K | 6.57 | Excellent quality, good compression | Qwen3-g023-tiny-v1-Q6_K.gguf |
| Q4_K_M | 4.85 | Good balance of quality and size | Qwen3-g023-tiny-v1-Q4_K_M.gguf |
| Q3_K_M | 3.91 | High compression, moderate quality loss | Qwen3-g023-tiny-v1-Q3_K_M.gguf |
| Q2_K | 3.35 | Maximum compression, significant quality loss | Qwen3-g023-tiny-v1-Q2_K.gguf |
## Model Details
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Layers | 27 (28 original − 1 deleted) |
| Hidden Size | 2,048 |
| Intermediate Size | 6,144 |
| Attention Heads | 16 query / 8 key-value (GQA) |
| Head Dimension | 128 |
| Vocabulary | 151,936 tokens |
| Max Context | 40,960 tokens |
| RoPE θ | 1,000,000 |
| Tied Embeddings | Yes |
| Total Parameters | ~1.67B |
| Precision (source) | bfloat16 |
## Surgery Operations
This model was created by applying two surgical operations to Qwen/Qwen3-1.7B:
- Delete layer 10 – layer 10 was identified as harmful to model quality; removing it improved the overall score from 87.8 to 91.4.
- Swap layers 11 ↔ 12 (post-deletion indices) – swapping these adjacent decoder blocks optimized information flow between the model's middle layers, further improving the score to 92.9.
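The two operations can be sketched on a plain list of decoder blocks. This is an illustrative reconstruction, not the repository's actual surgery code; a real run would mutate `model.model.layers` (a `torch.nn.ModuleList` in the Hugging Face layout) and update `config.num_hidden_layers` afterward.

```python
# Illustrative sketch of the two surgery operations. Strings stand in for
# decoder blocks; a real run would operate on model.model.layers.

def apply_surgery(layers, delete_idx, swap_pair):
    """Delete one layer, then swap two layers at post-deletion indices."""
    kept = [layer for i, layer in enumerate(layers) if i != delete_idx]
    i, j = swap_pair
    kept[i], kept[j] = kept[j], kept[i]
    return kept

original = [f"layer_{i}" for i in range(28)]          # Qwen3-1.7B has 28 blocks
surgered = apply_surgery(original, delete_idx=10, swap_pair=(11, 12))

print(len(surgered))                 # 27
print(surgered[11], surgered[12])    # layer_13 layer_12
```

Note that the swap indices refer to the 27-layer model: after deleting layer 10, post-deletion index 11 holds the original layer 12.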
## Key Findings
- Smaller is better: The 27-layer model outperforms both the 28-layer original and various 29โ30 layer expanded models.
- Layer 10 is actively harmful: Removing it alone yields a +3.6 point improvement.
- Operations compound selectively: Deletion + swap works, but deletion + duplication degrades quality.
## Benchmark Results

| Metric | Original (28L) | v1 (27L) | Δ |
|---|---|---|---|
| Overall Score | 87.8 / 100 | 92.9 / 100 | +5.1 |
| Factual Accuracy | 15 / 17 (88%) | 17 / 17 (100%) | +12% |
| Avg Perplexity | – | 15.70 | – |
| Thinking Mode | ✓ | ✓ | – |
| Non-Thinking Mode | ✓ | ✓ | – |
Evaluated using a comprehensive test suite with 17 factual questions, 2 completion coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.
## Features

- Thinking mode: Full `<think>`/`</think>` reasoning support; toggle via the `enable_thinking` parameter
- Non-thinking mode: Direct responses without chain-of-thought overhead
- Tool calling: Full function/tool calling support
- System prompts: Standard system message support
- Chat template: Qwen3 ChatML template embedded in the GGUF
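As a quick illustration of the two modes, the ChatML prompt for a single user turn can be hand-built. `chatml_prompt` is a hypothetical helper, not part of any library; pre-opening a `<think>` block or prepending `/no_think` mirrors the llama.cpp invocations shown under Usage.

```python
# Hypothetical helper that hand-builds a ChatML prompt for one user turn.
# thinking=True pre-opens a <think> block; thinking=False prepends /no_think.

def chatml_prompt(user_msg, thinking=True):
    if not thinking:
        user_msg = "/no_think " + user_msg
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if thinking:
        prompt += "<think>\n"
    return prompt

print(chatml_prompt("What is 2+2?", thinking=False))
```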
## Usage

### With Ollama
```bash
# Download the GGUF and create from a Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen3-g023-tiny-v1-Q8_0.gguf
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 45
PARAMETER min_p 0.1
PARAMETER num_ctx 40000
PARAMETER mirostat 2
PARAMETER mirostat_tau 5.0
PARAMETER mirostat_eta 0.1
PARAMETER repeat_last_n 16384
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0.5
PARAMETER frequency_penalty 1.0
TEMPLATE """{{- if .System }}
<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
<|im_start|>user
{{ .Content }}<|im_end|>
{{- else if eq .Role "assistant" }}
<|im_start|>assistant
{{ .Content }}<|im_end|>
{{- end }}
{{- end }}
<|im_start|>assistant
"""
SYSTEM "You are a helpful assistant."
EOF

ollama create qwen3-tiny-v1 -f Modelfile
ollama run qwen3-tiny-v1
```
### With llama.cpp
```bash
# Interactive chat
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  --chat-template chatml -cnv

# Thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
  -n 512

# Non-thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
  -n 128
```
### With Python (llama-cpp-python)
```python
from llama_cpp import Llama

model = Llama("Qwen3-g023-tiny-v1-Q8_0.gguf", n_ctx=4096)

response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.6,
)
print(response["choices"][0]["message"]["content"])
```
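When driving the model programmatically in thinking mode, the reasoning arrives inline in the response text. A small post-processing helper (illustrative, not part of llama-cpp-python) can strip it:

```python
import re

# Illustrative helper: remove <think>...</think> reasoning blocks from a
# thinking-mode response, leaving only the final answer.
def strip_think(text: str) -> str:
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

raw = "<think>\nParis is the capital of France.\n</think>\nParis."
print(strip_think(raw))  # Paris.
```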
## System Requirements
| Quantization | RAM (CPU) | VRAM (GPU) |
|---|---|---|
| Q8_0 | ~2.0 GB | ~2.0 GB |
| Q6_K | ~1.7 GB | ~1.7 GB |
| Q4_K_M | ~1.3 GB | ~1.3 GB |
| Q3_K_M | ~1.1 GB | ~1.1 GB |
| Q2_K | ~0.9 GB | ~0.9 GB |
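The figures above can be sanity-checked with a rough rule of thumb: file size ≈ parameters × bits-per-weight ÷ 8, with runtime RAM/VRAM adding KV cache and framework overhead on top. The values below are estimates derived from this card's numbers, not measurements.

```python
# Rough GGUF file-size estimate from the bits-per-weight column above.
# Runtime RAM/VRAM adds KV cache and overhead, so the table's figures run higher.
PARAMS = 1.67e9  # ~1.67B parameters

for name, bpw in [("Q8_0", 8.00), ("Q6_K", 6.57), ("Q4_K_M", 4.85),
                  ("Q3_K_M", 3.91), ("Q2_K", 3.35)]:
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.2f} GiB on disk")
```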
## v1 vs v2
This model (v1) is the Phase 1 champion, focused on surgical precision with minimal operations.
| | v1 (this model) | v2 |
|---|---|---|
| Layers | 27 | 30 |
| Parameters | ~1.67B | ~1.82B |
| Operations | del + swap | swap + interpolate + bridge |
| Score | 92.9 / 100 | 94.3 / 100 |
| Factual | 100% (17/17) | 94% (16/17) |
| Perplexity | 15.70 | 15.17 |
| Use Case | Max factual accuracy | Max overall score |
v1 is recommended when factual accuracy is paramount (100% vs 94%). v2 is recommended when overall quality matters more (94.3 vs 92.9).
## Methodology
Layer surgery was performed through a systematic, test-driven development process:
- Phase 1: Exhaustive search across 150+ configurations testing deletion, duplication, swapping, interpolation, and combined operations
- Evaluation: Each configuration was scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
- Selection: The champion was selected based on overall score, with factual accuracy as a tiebreaker
The surgery framework is available in the source repository.
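The selection rule from the last step is easy to express in code. The scores below are taken from this card; the configuration names are illustrative.

```python
# Champion selection: highest overall score, factual accuracy as tiebreaker.
# Scores are from this card; configuration names are illustrative.
configs = [
    {"name": "baseline-28L",  "score": 87.8, "factual": 15},
    {"name": "delete10-27L",  "score": 91.4, "factual": 17},
    {"name": "delete10+swap", "score": 92.9, "factual": 17},
]
champion = max(configs, key=lambda c: (c["score"], c["factual"]))
print(champion["name"])  # delete10+swap
```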
## Credits
- Base model: Qwen/Qwen3-1.7B by the Qwen team at Alibaba
- Quantization: llama.cpp
- Surgery: g023
## License

Apache 2.0, the same as the original Qwen3-1.7B model.