# Trinity-Large-Thinking APEX GGUF: A Novel MoE-Aware Mixed-Precision Quantization Technique
Brought to you by the LocalAI team, creators of LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.
APEX Technical Report | GitHub Repository | LocalAI
APEX (Adaptive Precision for EXpert Models) is a novel quantization technique for Mixture-of-Experts language models. Unlike uniform quantization methods that apply the same precision to every tensor, APEX introduces a layer-wise precision gradient combined with MoE-aware tensor classification and diverse imatrix calibration to achieve Q8_0-level quality at a fraction of the size. The method was discovered through systematic human-driven, AI-assisted research across 25+ quantization strategies.
This repository contains five APEX GGUF files for arcee-ai/Trinity-Large-Thinking, a 399B-parameter MoE reasoning model with 256 experts + 1 shared expert, 4 active per token (~12B active parameters).
For the full technical details, method description, and reproduction scripts, see the APEX GitHub repository.
## Model Architecture
- Architecture: AfmoE (Adaptive Full MoE) with hybrid attention
- Total Parameters: ~399B
- Active Parameters: ~12B per token (4 routed + 1 shared expert)
- Layers: 60
- Experts: 256 routed + 1 shared per layer
- Attention: Mixed full + sliding window
- Calibration: APEX v1.3 diverse dataset (chat, code, reasoning, tool-calling -- no Wikipedia)
## Available Files
| File | Configuration | Size | Best for |
|---|---|---|---|
| Trinity-Large-Thinking-APEX-Quality.gguf | APEX Quality | 219 GB | Lowest perplexity, maximum fidelity |
| Trinity-Large-Thinking-APEX-I-Quality.gguf | APEX I-Quality | 219 GB | Best accuracy across benchmarks |
| Trinity-Large-Thinking-APEX-Balanced.gguf | APEX Balanced | 267 GB | Interactive use, serving, general purpose |
| Trinity-Large-Thinking-APEX-I-Balanced.gguf | APEX I-Balanced | 267 GB | All-round with lower KL divergence |
| Trinity-Large-Thinking-APEX-Compact.gguf | APEX Compact | 169 GB | Smaller deployments, constrained VRAM |
APEX Quality uses a 3-tier layer-wise precision gradient (Q6_K/Q5_K/IQ4_XS) with Q8_0 shared experts. Edge layers (first and last 5 of 60) receive higher precision to preserve the model's input processing and output generation fidelity.
APEX I-Quality uses the same architecture as Quality but with a diverse imatrix calibrated on chat, code, reasoning, and tool-calling data. I-variants typically achieve higher accuracy on downstream benchmarks and lower KL divergence from the original model.
APEX Balanced uses a 2-tier gradient (Q6_K edges, Q5_K middle) with Q8_0 shared experts. Provides an excellent quality-to-size ratio for general-purpose deployment.
APEX I-Balanced uses the same architecture as Balanced with diverse imatrix calibration for improved accuracy and lower distribution shift.
APEX Compact uses Q4_K edge layers, Q3_K middle layers, and Q6_K shared experts. The most size-efficient option while maintaining strong quality.
## APEX Method
APEX assigns higher precision to edge layers (first/last 5) and lower precision to middle layers, following a gradient pattern that preserves model quality where it matters most:
```
Layers  0-4 : Q8_0 (edge - highest precision)
Layers  5-9 : Q6_K (near-edge)
Layers 10-49: Q5_K/IQ4_XS (middle - base precision, varies by profile)
Layers 50-54: Q6_K (near-edge)
Layers 55-59: Q8_0 (edge - highest precision)
```
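The gradient above can be sketched as a small layer-to-precision mapping. This is an illustrative approximation of the pattern described in this card, not the apex-quant implementation; the function name and `base` parameter are assumptions:

```python
def apex_layer_precision(layer: int, n_layers: int = 60, base: str = "Q5_K") -> str:
    """Map a layer index to a quant type following the APEX edge-to-middle
    gradient: edges get the highest precision, the middle gets the base tier."""
    if layer < 0 or layer >= n_layers:
        raise ValueError("layer index out of range")
    # Distance from the nearest edge (input or output side of the stack).
    dist = min(layer, n_layers - 1 - layer)
    if dist < 5:
        return "Q8_0"   # edge: layers 0-4 and 55-59
    if dist < 10:
        return "Q6_K"   # near-edge: layers 5-9 and 50-54
    return base         # middle: layers 10-49, varies by profile

print([apex_layer_precision(l) for l in (0, 5, 30, 54, 59)])
```

Symmetry around the middle means the same rule covers both the input-side and output-side edge layers.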
Additionally, APEX applies MoE-aware tensor classification:
- Shared expert weights (always active): Q8_0 regardless of layer position
- Gate/router weights: preserved at high precision
- Attention weights: follow the layer gradient
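The classification rules above can be sketched as a role-first, gradient-second decision. The tensor-name substrings (`shexp`, `ffn_gate_inp`) follow llama.cpp-style GGUF naming, but the exact rules here are an assumption for illustration, not the apex-quant code:

```python
def apex_tensor_precision(name: str, layer: int, n_layers: int = 60) -> str:
    """Pick a quant type per tensor: classify by role first, then fall
    back to the layer-wise gradient (toy rules, not the real toolkit)."""
    if "shexp" in name:           # shared expert: always active -> keep Q8_0
        return "Q8_0"
    if "ffn_gate_inp" in name:    # router/gate: tiny but critical -> high precision
        return "F16"
    # Attention and routed-expert weights follow the edge-to-middle gradient.
    dist = min(layer, n_layers - 1 - layer)
    if dist < 5:
        return "Q8_0"
    if dist < 10:
        return "Q6_K"
    return "Q5_K"
```

Checking roles before position is what makes the scheme MoE-aware: a shared expert in a middle layer still gets Q8_0 even though its neighbors drop to the base tier.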
I-variants additionally use importance matrix calibration from diverse data sources, which consistently improves accuracy benchmarks and reduces KL divergence, especially for more aggressively quantized tiers.
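The KL-divergence comparison used to evaluate I-variants can be illustrated on next-token probability distributions. The numbers below are toy values, not measured results for this model:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two next-token probability distributions; a
    lower value means the quantized model drifts less from the original."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

orig  = [0.70, 0.20, 0.10]   # original model's token probabilities (toy numbers)
quant = [0.65, 0.25, 0.10]   # quantized model's probabilities (toy numbers)
print(round(kl_divergence(orig, quant), 5))
```

In practice this is averaged over many tokens from a held-out corpus, which is why better calibration data lowers the reported divergence.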
## How to Download and Use
### With llama.cpp
```shell
# Download
hf download mudler/Trinity-Large-Thinking-APEX-GGUF Trinity-Large-Thinking-APEX-Quality.gguf

# Run (requires ~230 GB VRAM for Quality; multi-GPU recommended)
./llama-cli -m Trinity-Large-Thinking-APEX-Quality.gguf -ngl 99 -p "Your prompt here"
```
### With Ollama

```shell
ollama run hf.co/mudler/Trinity-Large-Thinking-APEX-GGUF:Quality
```
### With LocalAI

```shell
local-ai run hf.co/mudler/Trinity-Large-Thinking-APEX-GGUF:Quality
```
## Hardware Requirements
This is a very large model. Recommended configurations:
| Variant | Min VRAM | Recommended Setup |
|---|---|---|
| Compact (169 GB) | ~180 GB | 2x H100 80GB or 8x RTX 4090 |
| Quality/I-Quality (219 GB) | ~230 GB | 3x H100 80GB or 8x A100 40GB |
| Balanced/I-Balanced (267 GB) | ~280 GB | 4x H100 80GB |
For CPU offloading, use the `-ngl` flag to control how many layers are placed on the GPU. Even partial GPU offload significantly improves performance.
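A back-of-envelope way to pick an `-ngl` value is to divide the file size evenly across layers. This is a rough sketch under assumptions of my own (uniform per-layer size, a guessed fixed overhead for context/KV cache), not a precise sizing tool:

```python
def layers_on_gpu(file_size_gb: float, n_layers: int, vram_gb: float,
                  overhead_gb: float = 4.0) -> int:
    """Rough estimate of how many layers fit in VRAM for llama.cpp's -ngl,
    assuming weights are spread evenly across layers (they are not exactly)
    and reserving a fixed overhead for context/KV cache (a guess)."""
    per_layer = file_size_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable / per_layer))

# e.g. APEX Compact (169 GB, 60 layers) on a single 80 GB GPU:
print(layers_on_gpu(169, 60, 80))
```

Always verify the estimate against actual memory usage; embeddings, the KV cache, and the compute buffers are not uniform across layers.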
## Reproduction
All APEX quantizations can be reproduced using the apex-quant toolkit:
```shell
git clone https://github.com/mudler/apex-quant
cd apex-quant

# Generate configs for the 60-layer model
./scripts/generate_config.sh --layers 60 --profile quality --prefix trinity_large

# Quantize
./scripts/quantize.sh --config configs/trinity_large_quality.txt \
  --model source.gguf --output Trinity-Large-Thinking-APEX-Quality.gguf
```
## Links
Brought to you by the LocalAI team.
Base model: arcee-ai/Trinity-Large-TrueBase