Introduction

A compact yet capable reasoning model. Built for everyday use, even on limited hardware.

OpenSonnet-Lite

OpenSonnet-Lite is a lightweight language model fine-tuned from Qwen/Qwen3-4B-Thinking-2507, designed to deliver strong Chain-of-Thought (CoT) reasoning without demanding high-end resources. On reasoning tasks it approaches the performance of Claude Sonnet 4.6, a frontier commercial model, while remaining fully open-weights and accessible.

One key improvement over the base model is the restoration of multi-turn reasoning. The original Qwen3-4B-Thinking-2507 loses its reasoning capability across multi-turn conversations due to chat template limitations (see Qwen's best practices). OpenSonnet-Lite addresses this directly through a corrected chat template, enabling consistent, coherent reasoning across long dialogues.
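With a corrected template, multi-turn use follows the usual Qwen-style convention: only the final answer of each assistant turn is carried into the history, never the thinking content. The helper below is a minimal sketch of that bookkeeping; it assumes the raw reply contains a complete `<think>...</think>` block and is independent of any particular chat template.

```python
import re

def strip_thinking(reply: str) -> str:
    """Remove <think>...</think> blocks so only the final answer
    is carried into the conversation history."""
    return re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()

# Build a multi-turn history: append only the cleaned assistant reply,
# then add the next user turn. (The reply string here is illustrative.)
messages = [{"role": "user", "content": "Hello, who are you?"}]
reply = "<think>The user greets me; introduce myself.</think>I am OpenSonnet."
messages.append({"role": "assistant", "content": strip_thinking(reply)})
messages.append({"role": "user", "content": "What can you do?"})
```

Passing `messages` back through `tokenizer.apply_chat_template` then produces a history free of stale reasoning traces, letting the model generate fresh thinking for each new turn.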

With appropriate prompt engineering, the model also handles complex tasks reliably across several domains.

If you need a quick demo, you can try this model for free: it runs on dual T4 GPUs using Kaggle Notebooks.

Model Overview

| Property | Value |
|---|---|
| Architecture | Causal Language Model |
| Total Parameters | 4.0B |
| Non-Embedding Parameters | 3.6B |
| Number of Layers | 36 |
| Attention Heads (GQA) | 32 for Q, 8 for KV |
| Native Context Length | 262,144 tokens |

Training

Infrastructure

| Resource | Details |
|---|---|
| GPU | NVIDIA B200 (180 GB VRAM) |
| Training Duration | 9 hours |
| Estimated Cost | $56.25 (serverless) |

Hyperparameters

The model was trained with supervised fine-tuning (SFT) using parameter-efficient methods, balancing performance with computational efficiency. Key training parameters include:

| Parameter | Value |
|---|---|
| Maximum Sequence Length | 262,144 |
| Per Device Training Batch Size | 64 |
| Number of Training Epochs | 3 |
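As a rough sketch, the published hyperparameters map onto a standard PEFT setup as follows. The card does not publish the adapter settings, so the LoRA rank, alpha, and target modules below are illustrative placeholders, not the values actually used.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Hypothetical PEFT setup: rank, alpha, and target modules are
# illustrative placeholders, not the published configuration.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Published hyperparameters from the table above.
training_args = TrainingArguments(
    output_dir="opensonnet-lite-sft",
    per_device_train_batch_size=64,
    num_train_epochs=3,
    bf16=True,  # assumption: matches the released BF16 weights
)
# The maximum sequence length (262,144) is applied at the
# tokenization/packing stage by the SFT framework in use.
```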

Datasets

A total of 143,335 raw samples were collected across 11 curated datasets. After filtering to remove empty rows, duplicate CoT tags, and malformed examples, 140,765 samples (~140K) were used for the final training run. All filtering is fully automated using a dedicated script to prevent human error.
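The filtering criteria described above can be sketched as a small function. Field names and the exact malformed-tag check are assumptions for illustration; the actual script is not published.

```python
def clean_samples(samples):
    """Drop empty rows and samples with duplicated or unbalanced CoT tags.

    A sample is kept only if its 'text' field is non-empty and contains
    at most one matched <think>...</think> pair. The field name and tag
    checks are illustrative assumptions, not the published script.
    """
    cleaned = []
    for sample in samples:
        text = (sample.get("text") or "").strip()
        if not text:
            continue  # empty row
        opens, closes = text.count("<think>"), text.count("</think>")
        if opens != closes or opens > 1:
            continue  # duplicated or unbalanced CoT tags
        cleaned.append(sample)
    return cleaned
```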

Inference Parameters

Update as of 2026-05-06: These are the stable inference parameters.

For best results, the following sampling configuration is recommended:

| Parameter | Recommended Value | Description |
|---|---|---|
| temperature | 1.0 | Controls randomness in generation |
| top_p | 0.95 | Nucleus sampling threshold |
| top_k | 20 | Top-k sampling cutoff |
| min_p | 0 | Minimum probability threshold |
| repetition_penalty | 1.0 | Penalizes repeated tokens |
| presence_penalty | 1.0 | Encourages introducing new topics |

Max Tokens

| Small Tasks | Medium Tasks | Large Tasks | Complex Tasks |
|---|---|---|---|
| 4096/8192 | 16384 | 32768/81920 | 131072 |
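The recommended sampling configuration and token budgets can be collected into one helper, suitable for passing to an OpenAI-compatible or vLLM endpoint (note that `presence_penalty` is not a Hugging Face `generate()` argument). The task-tier mapping and the choice within each range (e.g. 8192 rather than 4096 for small tasks) are illustrative assumptions.

```python
# Recommended sampling parameters for OpenSonnet-Lite (from the tables above).
RECOMMENDED_SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repetition_penalty": 1.0,
    "presence_penalty": 1.0,
}

# Illustrative assumption: one concrete max_tokens value per task tier.
MAX_TOKENS_BY_TASK = {
    "small": 8192,
    "medium": 16384,
    "large": 32768,
    "complex": 131072,
}

def sampling_params(task_size: str = "medium") -> dict:
    """Return sampling kwargs for an OpenAI-compatible or vLLM endpoint."""
    params = dict(RECOMMENDED_SAMPLING)
    params["max_tokens"] = MAX_TOKENS_BY_TASK[task_size]
    return params
```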

Instruction

You are OpenSonnet, a large language model trained by the Open Source community. You are based on the Qwen3 architecture.

You must think concisely, clearly, quickly, and in a direct manner.

Quickstart

# pip install "transformers>=4.51.0"
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "hadadxyz/OpenSonnet-Lite"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
instruction = "You are OpenSonnet, a large language model trained by the Open Source community. You are based on the Qwen3 architecture.\n\nYou must think concisely, clearly, quickly, and in a direct manner."

prompt = "Hello, who are you?"

messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096
)

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

Bias, Risks, and Hallucinations

As with any language model, users should be aware of the following limitations before deploying OpenSonnet-Lite in production or sensitive contexts.

  • Bias: This model was fine-tuned on datasets distilled from several large commercial models. Any systemic biases present in those source models, including cultural, linguistic, or ideological tendencies, may be partially inherited. The model has not undergone dedicated bias auditing or alignment evaluation beyond standard SFT.

  • Hallucinations: OpenSonnet-Lite can and will generate plausible-sounding but factually incorrect information, particularly on niche topics, recent events, or highly specific technical domains. Extended Chain-of-Thought reasoning reduces this risk but does not eliminate it. Outputs should be verified against authoritative sources when accuracy is critical.

  • Risks: This is an open weights model with no built-in content filter or safety layer. It may produce outputs that are inappropriate, misleading, or harmful in certain contexts. Users and developers are solely responsible for implementing appropriate safeguards, usage policies, and monitoring when deploying this model in any application.

Use of this model implies acceptance of these limitations. It is intended as a research and general-purpose tool, not as a replacement for human judgment in high-stakes decisions.

Citation

If you use this model in your research or applications, please cite both this model and the base model:

@misc{opensonnet-lite,
  author = {hadadxyz},
  title  = {OpenSonnet-Lite},
  year   = {2026},
  url    = {https://huggingface.co/hadadxyz/OpenSonnet-Lite}
}

Acknowledgments

This model was made possible through the combination of multiple high-quality datasets from the community. We acknowledge and thank all dataset creators and the Qwen team for providing the excellent base model.
