# Trinity-Large-Thinking APEX GGUF: A Novel MoE-Aware Mixed-Precision Quantization Technique
Brought to you by the LocalAI team, creators of LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.
APEX Technical Report | GitHub Repository | LocalAI
APEX (Adaptive Precision for EXpert Models) is a novel quantization technique for Mixture-of-Experts language models. Unlike uniform quantization methods that apply the same precision to every tensor, APEX introduces a layer-wise precision gradient combined with MoE-aware tensor classification and diverse imatrix calibration to achieve Q8_0-level quality at a fraction of the size. The method was discovered through systematic human-driven, AI-assisted research across 25+ quantization strategies.
This repository contains five APEX GGUF files for arcee-ai/Trinity-Large-Thinking, a 399B-parameter MoE reasoning model with 256 experts + 1 shared expert, 4 active per token (~12B active parameters).
For the full technical details, method description, and reproduction scripts, see the APEX GitHub repository.
## Model Architecture
- Architecture: AfmoE (Adaptive Full MoE) with hybrid attention
- Total Parameters: ~399B
- Active Parameters: ~12B per token (4 routed + 1 shared expert)
- Layers: 60
- Experts: 256 routed + 1 shared per layer
- Attention: Mixed full + sliding window
- Calibration: APEX v1.3 diverse dataset (chat, code, reasoning, tool-calling -- no Wikipedia)
## Available Files
| File | Configuration | Size | Best for |
|---|---|---|---|
| Trinity-Large-Thinking-APEX-Quality.gguf | APEX Quality | 219 GB | Lowest perplexity, maximum fidelity |
| Trinity-Large-Thinking-APEX-I-Quality.gguf | APEX I-Quality | 219 GB | Best accuracy across benchmarks |
| Trinity-Large-Thinking-APEX-Balanced.gguf | APEX Balanced | 267 GB | Interactive use, serving, general purpose |
| Trinity-Large-Thinking-APEX-I-Balanced.gguf | APEX I-Balanced | 267 GB | All-round with lower KL divergence |
| Trinity-Large-Thinking-APEX-Compact.gguf | APEX Compact | 169 GB | Smaller deployments, constrained VRAM |
APEX Quality uses a 3-tier layer-wise precision gradient (Q6_K/Q5_K/IQ4_XS) with Q8_0 shared experts. Edge layers (first and last 5 of 60) receive higher precision to preserve the model's input processing and output generation fidelity.
APEX I-Quality uses the same architecture as Quality but with a diverse imatrix calibrated on chat, code, reasoning, and tool-calling data. I-variants typically achieve higher accuracy on downstream benchmarks and lower KL divergence from the original model.
APEX Balanced uses a 2-tier gradient (Q6_K edges, Q5_K middle) with Q8_0 shared experts. Provides an excellent quality-to-size ratio for general-purpose deployment.
APEX I-Balanced uses the same architecture as Balanced with diverse imatrix calibration for improved accuracy and lower distribution shift.
APEX Compact uses Q4_K edge layers, Q3_K middle layers, and Q6_K shared experts. The most size-efficient option while maintaining strong quality.
## APEX Method
APEX assigns higher precision to edge layers (first/last 5) and lower precision to middle layers, following a gradient pattern that preserves model quality where it matters most:
```
Layers  0-4 : Q8_0 (edge - highest precision)
Layers  5-9 : Q6_K (near-edge)
Layers 10-49: Q5_K/IQ4_XS (middle - base precision, varies by profile)
Layers 50-54: Q6_K (near-edge)
Layers 55-59: Q8_0 (edge - highest precision)
```
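The gradient above can be sketched as a small layer-to-precision mapping. This is an illustrative approximation of the pattern described in this card, not the apex-quant implementation; the function name and `base` parameter are assumptions:

```python
def apex_layer_precision(layer: int, n_layers: int = 60, base: str = "Q5_K") -> str:
    """Map a layer index to a quant type following the APEX edge-to-middle
    gradient: edges get the highest precision, the middle gets the base tier."""
    if layer < 0 or layer >= n_layers:
        raise ValueError("layer index out of range")
    # Distance from the nearest edge (input or output side of the stack).
    dist = min(layer, n_layers - 1 - layer)
    if dist < 5:
        return "Q8_0"   # edge: layers 0-4 and 55-59
    if dist < 10:
        return "Q6_K"   # near-edge: layers 5-9 and 50-54
    return base         # middle: layers 10-49, varies by profile

print([apex_layer_precision(l) for l in (0, 5, 30, 54, 59)])
```

Symmetry around the middle means the same rule covers both the input-side and output-side edge layers.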
Additionally, APEX applies MoE-aware tensor classification:
- Shared expert weights (always active): Q8_0 regardless of layer position
- Gate/router weights: preserved at high precision
- Attention weights: follow the layer gradient
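The classification rules above can be sketched as a role-first, gradient-second decision. The tensor-name substrings (`shexp`, `ffn_gate_inp`) follow llama.cpp-style GGUF naming, but the exact rules here are an assumption for illustration, not the apex-quant code:

```python
def apex_tensor_precision(name: str, layer: int, n_layers: int = 60) -> str:
    """Pick a quant type per tensor: classify by role first, then fall
    back to the layer-wise gradient (toy rules, not the real toolkit)."""
    if "shexp" in name:           # shared expert: always active -> keep Q8_0
        return "Q8_0"
    if "ffn_gate_inp" in name:    # router/gate: tiny but critical -> high precision
        return "F16"
    # Attention and routed-expert weights follow the edge-to-middle gradient.
    dist = min(layer, n_layers - 1 - layer)
    if dist < 5:
        return "Q8_0"
    if dist < 10:
        return "Q6_K"
    return "Q5_K"
```

Checking roles before position is what makes the scheme MoE-aware: a shared expert in a middle layer still gets Q8_0 even though its neighbors drop to the base tier.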
I-variants additionally use importance matrix calibration from diverse data sources, which consistently improves accuracy benchmarks and reduces KL divergence, especially for more aggressively quantized tiers.
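The KL-divergence comparison used to evaluate I-variants can be illustrated on next-token probability distributions. The numbers below are toy values, not measured results for this model:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two next-token probability distributions; a
    lower value means the quantized model drifts less from the original."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

orig  = [0.70, 0.20, 0.10]   # original model's token probabilities (toy numbers)
quant = [0.65, 0.25, 0.10]   # quantized model's probabilities (toy numbers)
print(round(kl_divergence(orig, quant), 5))
```

In practice this is averaged over many tokens from a held-out corpus, which is why better calibration data lowers the reported divergence.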
## How to Download and Use
### With llama.cpp
```shell
# Download
hf download mudler/Trinity-Large-Thinking-APEX-GGUF Trinity-Large-Thinking-APEX-Quality.gguf

# Run (requires ~230 GB VRAM for Quality; multi-GPU recommended)
./llama-cli -m Trinity-Large-Thinking-APEX-Quality.gguf -ngl 99 -p "Your prompt here"
```
### With Ollama

```shell
ollama run hf.co/mudler/Trinity-Large-Thinking-APEX-GGUF:Quality
```
### With LocalAI

```shell
local-ai run hf.co/mudler/Trinity-Large-Thinking-APEX-GGUF:Quality
```
## Hardware Requirements
This is a very large model. Recommended configurations:
| Variant | Min VRAM | Recommended Setup |
|---|---|---|
| Compact (169 GB) | ~180 GB | 2x H100 80GB or 8x RTX 4090 |
| Quality/I-Quality (219 GB) | ~230 GB | 3x H100 80GB or 8x A100 40GB |
| Balanced/I-Balanced (267 GB) | ~280 GB | 4x H100 80GB |
For CPU offloading, use the `-ngl` flag to control how many layers are placed on the GPU. Even partial GPU offload significantly improves performance.
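A back-of-envelope way to pick an `-ngl` value is to divide the file size evenly across layers. This is a rough sketch under assumptions of my own (uniform per-layer size, a guessed fixed overhead for context/KV cache), not a precise sizing tool:

```python
def layers_on_gpu(file_size_gb: float, n_layers: int, vram_gb: float,
                  overhead_gb: float = 4.0) -> int:
    """Rough estimate of how many layers fit in VRAM for llama.cpp's -ngl,
    assuming weights are spread evenly across layers (they are not exactly)
    and reserving a fixed overhead for context/KV cache (a guess)."""
    per_layer = file_size_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable / per_layer))

# e.g. APEX Compact (169 GB, 60 layers) on a single 80 GB GPU:
print(layers_on_gpu(169, 60, 80))
```

Always verify the estimate against actual memory usage; embeddings, the KV cache, and the compute buffers are not uniform across layers.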
## Reproduction
All APEX quantizations can be reproduced using the apex-quant toolkit:
```shell
git clone https://github.com/mudler/apex-quant
cd apex-quant

# Generate configs for the 60-layer model
./scripts/generate_config.sh --layers 60 --profile quality --prefix trinity_large

# Quantize
./scripts/quantize.sh --config configs/trinity_large_quality.txt \
  --model source.gguf --output Trinity-Large-Thinking-APEX-Quality.gguf
```
## Links
Brought to you by the LocalAI team.
Base model: arcee-ai/Trinity-Large-TrueBase