
GPT-OSS 20B - Q2_K GGUF (12GB VRAM Optimized)

An aggressively quantized (Q2_K) build of OpenAI's GPT-OSS 20B that fits on 12GB VRAM GPUs, with overflow layers offloaded to CPU.

Why This Exists

The official GPT-OSS 20B release requires 16GB of VRAM. This Q2_K quantization runs comfortably on:

  • RTX 3080 (12GB)
  • RTX 4070 (12GB)
  • RTX 5070 (12GB)
  • Any 12GB+ GPU with CPU offload

Fast inference with a GPU/CPU layer split - not just "it runs" but actually usable for real tasks.

Quick Start

Ollama

# Download and run
ollama run RhinoWithAcape/gpt-oss-20b-Q2_K
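
Once Ollama is running, it also serves a local HTTP API on port 11434 by default. A minimal sketch, assuming a default install (the prompt is just an example):

# Query the local Ollama API; "stream": false returns one JSON response
curl http://localhost:11434/api/generate -d '{
  "model": "RhinoWithAcape/gpt-oss-20b-Q2_K",
  "prompt": "Explain GPU layer offload in two sentences.",
  "stream": false
}'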

llama.cpp

# With GPU offload (adjust layers based on your VRAM)
./llama-cli -m gpt-oss-20b-Q2_K.gguf -ngl 28 -c 4096
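
To serve the model over HTTP instead of chatting in the terminal, llama.cpp's llama-server binary accepts the same model and offload flags. A minimal sketch:

# Serve an OpenAI-compatible API on port 8080 with the same GPU offload
./llama-server -m gpt-oss-20b-Q2_K.gguf -ngl 28 -c 4096 --port 8080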

LM Studio

Just download and load the GGUF - LM Studio will auto-detect suitable settings.

Model Details

Property         Value
Parameters       20.9B
Quantization     Q2_K
File Size        ~11GB
Context Length   131,072 (use 4096-8192 for speed)
Architecture     GPT-OSS (Mixture-of-Experts)

Recommended Settings

num_gpu: 28        # Layers on GPU (adjust for your VRAM)
num_ctx: 4096      # Context window (increase if needed)
temperature: 0.5   # Good balance for most tasks

For 12GB VRAM, num_gpu: 28 leaves headroom for the KV cache and context; reduce it if you need a larger context window.
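
To avoid passing these settings on every run, you can bake them into a derived Ollama model with a Modelfile. A minimal sketch (the gpt-oss-20b-local name is just an example):

# Bake the recommended parameters into a derived model
cat > Modelfile <<'EOF'
FROM RhinoWithAcape/gpt-oss-20b-Q2_K
PARAMETER num_gpu 28
PARAMETER num_ctx 4096
PARAMETER temperature 0.5
EOF
ollama create gpt-oss-20b-local -f Modelfile
ollama run gpt-oss-20b-local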

Performance Notes

  • Q2_K is aggressive quantization - expect some quality loss versus FP16
  • Still capable at coding, reasoning, and general tasks
  • The speed/quality tradeoff is worth it on consumer hardware
  • Works well as a local coding assistant or agent backbone (see the example below)
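
As a concrete example of the agent-backbone use: Ollama exposes an OpenAI-compatible chat endpoint, so most OpenAI-client tooling can point at this model locally. A minimal sketch with curl (the message content is illustrative):

# Chat through Ollama's OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RhinoWithAcape/gpt-oss-20b-Q2_K",
    "messages": [{"role": "user", "content": "Refactor this loop into a list comprehension."}]
  }'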

Original Model

This is a quantized version of openai/gpt-oss-20b.

License

Apache 2.0 (same as original model)

Credits

  • Original model by OpenAI
  • Q2_K quantization by RhinoWithAcape, targeting 12GB VRAM hardware