
GPT-OSS 20B - Q2_K GGUF (12GB VRAM Optimized)

An aggressively quantized (Q2_K) build of OpenAI's GPT-OSS 20B that fits on 12GB VRAM GPUs, with overflow layers offloaded to CPU.

Why This Exists

The official GPT-OSS 20B release requires 16GB of VRAM. This Q2_K quantization runs comfortably on:

  • RTX 3080 (12GB)
  • RTX 4070 (12GB)
  • RTX 5070 (12GB)
  • Any 12GB+ GPU with CPU offload

Fast inference with a GPU/CPU layer split - not just "it runs" but actually usable for real tasks.

Quick Start

Ollama

# Download and run
ollama run RhinoWithAcape/gpt-oss-20b-Q2_K
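
Once Ollama is running, it also serves a local HTTP API on port 11434 by default. A minimal sketch, assuming a default install (the prompt is just an example):

# Query the local Ollama API; "stream": false returns one JSON response
curl http://localhost:11434/api/generate -d '{
  "model": "RhinoWithAcape/gpt-oss-20b-Q2_K",
  "prompt": "Explain GPU layer offload in two sentences.",
  "stream": false
}'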

llama.cpp

# With GPU offload (adjust layers based on your VRAM)
./llama-cli -m gpt-oss-20b-Q2_K.gguf -ngl 28 -c 4096
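
To serve the model over HTTP instead of chatting in the terminal, llama.cpp's llama-server binary accepts the same model and offload flags. A minimal sketch:

# Serve an OpenAI-compatible API on port 8080 with the same GPU offload
./llama-server -m gpt-oss-20b-Q2_K.gguf -ngl 28 -c 4096 --port 8080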

LM Studio

Just download and load the GGUF - LM Studio will auto-detect suitable settings.

Model Details

Property         Value
Parameters       20.9B
Quantization     Q2_K
File Size        ~11GB
Context Length   131,072 (use 4096-8192 for speed)
Architecture     GPT-OSS (Mixture-of-Experts)

Recommended Settings

num_gpu: 28        # Layers on GPU (adjust for your VRAM)
num_ctx: 4096      # Context window (increase if needed)
temperature: 0.5   # Good balance for most tasks

For 12GB VRAM, num_gpu: 28 leaves headroom for the KV cache and context; reduce it if you need a larger context window.
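
To avoid passing these settings on every run, you can bake them into a derived Ollama model with a Modelfile. A minimal sketch (the gpt-oss-20b-local name is just an example):

# Bake the recommended parameters into a derived model
cat > Modelfile <<'EOF'
FROM RhinoWithAcape/gpt-oss-20b-Q2_K
PARAMETER num_gpu 28
PARAMETER num_ctx 4096
PARAMETER temperature 0.5
EOF
ollama create gpt-oss-20b-local -f Modelfile
ollama run gpt-oss-20b-local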

Performance Notes

  • Q2_K is aggressive quantization - expect some quality loss versus FP16
  • Still capable at coding, reasoning, and general tasks
  • The speed/quality tradeoff is worth it on consumer hardware
  • Works well as a local coding assistant or agent backbone (see the example below)
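
As a concrete example of the agent-backbone use: Ollama exposes an OpenAI-compatible chat endpoint, so most OpenAI-client tooling can point at this model locally. A minimal sketch with curl (the message content is illustrative):

# Chat through Ollama's OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RhinoWithAcape/gpt-oss-20b-Q2_K",
    "messages": [{"role": "user", "content": "Refactor this loop into a list comprehension."}]
  }'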

Original Model

This is a quantized version of openai/gpt-oss-20b.

License

Apache 2.0 (same as original model)

Credits

  • Original model by OpenAI
  • Q2_K quantization by RhinoWithAcape, targeting 12GB VRAM hardware