# GPT-OSS 20B - Q2_K GGUF (12GB VRAM Optimized)
Aggressively quantized version of OpenAI's GPT-OSS 20B for 12GB VRAM GPUs with CPU offload.
## Why This Exists
The official GPT-OSS 20B requires 16GB VRAM. This Q2_K quantization runs comfortably on:
- RTX 3080 (12GB)
- RTX 4070 (12GB)
- RTX 5070 (12GB)
- Any 12GB+ GPU with CPU offload
Fast inference with GPU/CPU split - not just "it works" but actually usable for real tasks.
## Quick Start

### Ollama

```bash
# Download and run
ollama run RhinoWithAcape/gpt-oss-20b-Q2_K
```
### llama.cpp

```bash
# With GPU offload (adjust layers based on your VRAM)
./llama-cli -m gpt-oss-20b-Q2_K.gguf -ngl 28 -c 4096
```
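If you'd rather expose a local HTTP endpoint than an interactive CLI, llama.cpp's bundled `llama-server` accepts the same model and offload flags; a minimal sketch (the port is just an example):

```bash
# Serve the model over HTTP; the same -ngl/-c tuning advice applies
./llama-server -m gpt-oss-20b-Q2_K.gguf -ngl 28 -c 4096 --port 8080
```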
### LM Studio

Just download and load the GGUF; LM Studio will auto-detect optimal settings.
## Model Details
| Property | Value |
|---|---|
| Parameters | 20.9B |
| Quantization | Q2_K |
| File Size | ~11GB |
| Context Length | 131,072 (use 4096-8192 for speed) |
| Architecture | GPT-OSS (MoE) |
## Recommended Settings

```yaml
num_gpu: 28       # Layers on GPU (adjust for your VRAM)
num_ctx: 4096     # Context window (increase if needed)
temperature: 0.5  # Good balance for most tasks
```
For 12GB VRAM, `num_gpu: 28` leaves room for context. Reduce it if you need larger context windows.
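To bake these settings into Ollama, you can write them into a Modelfile and build a local tag; a minimal sketch, assuming you pulled the model with the Quick Start command above (the tag `gpt-oss-20b-12gb` is just a placeholder name):

```bash
# Write a Modelfile that applies the recommended settings above
cat > Modelfile <<'EOF'
FROM RhinoWithAcape/gpt-oss-20b-Q2_K
PARAMETER num_gpu 28
PARAMETER num_ctx 4096
PARAMETER temperature 0.5
EOF

# Build and run the tuned local tag (placeholder name)
ollama create gpt-oss-20b-12gb -f Modelfile
ollama run gpt-oss-20b-12gb
```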
## Performance Notes
- Q2_K is an aggressive quantization; expect some quality loss vs FP16
- Still excellent for coding, reasoning, and general tasks
- The speed/quality tradeoff is worth it for consumer hardware
- Works great as a local coding assistant or agent backbone (see the sketch below)
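As a concrete starting point for the assistant/agent use case, here is a minimal request against Ollama's local REST API (Ollama's default port is 11434; the prompt is just an example):

```bash
# Ask the local model a coding question via Ollama's REST API
curl http://localhost:11434/api/generate -d '{
  "model": "RhinoWithAcape/gpt-oss-20b-Q2_K",
  "prompt": "Write a Python function that checks whether a string is a palindrome.",
  "stream": false
}'
```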
## Original Model

This is a quantized version of [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b).
## License
Apache 2.0 (same as original model)
## Credits
- Original model by OpenAI
- Quantization for 12GB VRAM hardware