Trinity-Large-Thinking APEX GGUF -- A Novel MoE-Aware Mixed-Precision Quantization Technique

Brought to you by the LocalAI team -- the creators of LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.

APEX Technical Report | GitHub Repository | LocalAI

APEX (Adaptive Precision for EXpert Models) is a novel quantization technique for Mixture-of-Experts language models. Unlike uniform quantization methods that apply the same precision to every tensor, APEX introduces a layer-wise precision gradient combined with MoE-aware tensor classification and diverse imatrix calibration to achieve Q8_0-level quality at a fraction of the size. The method was discovered through systematic human-driven, AI-assisted research across 25+ quantization strategies.

This repository contains five APEX GGUF files for arcee-ai/Trinity-Large-Thinking, a 399B-parameter MoE reasoning model with 256 routed experts plus 1 shared expert per layer and 4 routed experts active per token (~12B active parameters).

For the full technical details, method description, and reproduction scripts, see the APEX GitHub repository.

Model Architecture

  • Architecture: AfmoE (Adaptive Full MoE) with hybrid attention
  • Total Parameters: ~399B
  • Active Parameters: ~12B per token (4 routed + 1 shared expert)
  • Layers: 60
  • Experts: 256 routed + 1 shared per layer
  • Attention: Mixed full + sliding window
  • Calibration: APEX v1.3 diverse dataset (chat, code, reasoning, tool-calling -- no Wikipedia)

Available Files

File                                        | Configuration   | Size   | Best for
--------------------------------------------|-----------------|--------|-------------------------------------------
Trinity-Large-Thinking-APEX-Quality.gguf    | APEX Quality    | 219 GB | Lowest perplexity, maximum fidelity
Trinity-Large-Thinking-APEX-I-Quality.gguf  | APEX I-Quality  | 219 GB | Best accuracy across benchmarks
Trinity-Large-Thinking-APEX-Balanced.gguf   | APEX Balanced   | 267 GB | Interactive use, serving, general purpose
Trinity-Large-Thinking-APEX-I-Balanced.gguf | APEX I-Balanced | 267 GB | All-round with lower KL divergence
Trinity-Large-Thinking-APEX-Compact.gguf    | APEX Compact    | 169 GB | Smaller deployments, constrained VRAM

APEX Quality uses a 3-tier layer-wise precision gradient (Q6_K/Q5_K/IQ4_XS) with Q8_0 shared experts. Edge layers (first and last 5 of 60) receive higher precision to preserve the model's input processing and output generation fidelity.

APEX I-Quality uses the same architecture as Quality but with a diverse imatrix calibrated on chat, code, reasoning, and tool-calling data. I-variants typically achieve higher accuracy on downstream benchmarks and lower KL divergence from the original model.

APEX Balanced uses a 2-tier gradient (Q6_K edges, Q5_K middle) with Q8_0 shared experts. Provides an excellent quality-to-size ratio for general-purpose deployment.

APEX I-Balanced uses the same architecture as Balanced with diverse imatrix calibration for improved accuracy and lower distribution shift.

APEX Compact uses Q4_K edge layers, Q3_K middle layers, and Q6_K shared experts. The most size-efficient option while maintaining strong quality.

APEX Method

APEX assigns higher precision to edge layers (first/last 5) and lower precision to middle layers, following a gradient pattern that preserves model quality where it matters most:

Layer 0-4:   Q8_0 (edge - highest precision)
Layer 5-9:   Q6_K (near-edge)
Layer 10-49: Q5_K/IQ4_XS (middle - base precision, varies by profile)
Layer 50-54: Q6_K (near-edge)
Layer 55-59: Q8_0 (edge - highest precision)
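The gradient above can be sketched as a simple layer-to-precision mapping. This is a minimal illustration for the 60-layer layout shown; `precision_for_layer` is a hypothetical helper, not part of the apex-quant toolkit:

```shell
# Hypothetical sketch of the APEX layer-wise gradient for a 60-layer
# model; tier names match the gradient shown above. Middle layers use
# Q5_K here, but the base precision varies by profile (e.g. IQ4_XS).
precision_for_layer() {
  layer=$1
  if [ "$layer" -le 4 ] || [ "$layer" -ge 55 ]; then
    echo "Q8_0"   # edge: highest precision
  elif [ "$layer" -le 9 ] || [ "$layer" -ge 50 ]; then
    echo "Q6_K"   # near-edge
  else
    echo "Q5_K"   # middle: base precision
  fi
}

precision_for_layer 0    # -> Q8_0
precision_for_layer 30   # -> Q5_K
precision_for_layer 52   # -> Q6_K
```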

Additionally, APEX applies MoE-aware tensor classification:

  • Shared expert weights (always active): Q8_0 regardless of layer position
  • Gate/router weights: preserved at high precision
  • Attention weights: follow the layer gradient
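The classification rules above can be sketched as a dispatch on tensor names. This is an illustration only: the name patterns and the F32 router precision are assumptions based on common llama.cpp GGUF naming conventions, not confirmed details of APEX:

```shell
# Illustrative sketch: choose a quantization type from the tensor name.
# Patterns like "ffn_*_shexp" and the F32 router type follow common
# llama.cpp GGUF conventions and are assumptions here.
quant_for_tensor() {
  name=$1          # GGUF tensor name
  layer_quant=$2   # precision from the layer gradient
  case "$name" in
    *ffn_*_shexp*)  echo "Q8_0" ;;          # shared expert: always Q8_0
    *ffn_gate_inp*) echo "F32" ;;           # router: kept at high precision
    *)              echo "$layer_quant" ;;  # attention etc. follow gradient
  esac
}

quant_for_tensor "blk.30.ffn_down_shexp.weight" "Q5_K"   # -> Q8_0
quant_for_tensor "blk.30.attn_q.weight" "Q5_K"           # -> Q5_K
```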

I-variants additionally use importance matrix calibration from diverse data sources, which consistently improves accuracy benchmarks and reduces KL divergence, especially for more aggressively quantized tiers.
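As a reminder of the metric: KL divergence D(P||Q) = sum of p_i * ln(p_i/q_i) measures how far the quantized model's token distribution Q drifts from the original model's distribution P. A toy computation with invented probabilities, purely illustrative:

```shell
# Toy KL divergence over a 3-token distribution; the numbers are
# made up for illustration, not measured from the model.
kl=$(awk 'BEGIN {
  split("0.7 0.2 0.1", p, " ")     # original model P
  split("0.6 0.25 0.15", q, " ")   # quantized model Q
  for (i = 1; i <= 3; i++) d += p[i] * log(p[i] / q[i])
  printf "%.4f", d
}')
echo "$kl"   # -> 0.0227
```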

How to Download and Use

With llama.cpp

# Download
hf download mudler/Trinity-Large-Thinking-APEX-GGUF Trinity-Large-Thinking-APEX-Quality.gguf

# Run (requires ~230GB VRAM for Quality, multi-GPU recommended)
./llama-cli -m Trinity-Large-Thinking-APEX-Quality.gguf -ngl 99 -p "Your prompt here"

With Ollama

ollama run hf.co/mudler/Trinity-Large-Thinking-APEX-GGUF:Quality

With LocalAI

local-ai run hf.co/mudler/Trinity-Large-Thinking-APEX-GGUF:Quality

Hardware Requirements

This is a very large model. Recommended configurations:

Variant                      | Min VRAM | Recommended Setup
-----------------------------|----------|------------------------------
Compact (169 GB)             | ~180 GB  | 2x H100 80GB or 8x RTX 4090
Quality/I-Quality (219 GB)   | ~230 GB  | 3x H100 80GB or 8x A100 40GB
Balanced/I-Balanced (267 GB) | ~280 GB  | 4x H100 80GB

For CPU offloading, use -ngl to control how many layers are placed on the GPU. Even partial GPU offload significantly improves performance over CPU-only inference.
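A rough way to pick an -ngl value is to divide available VRAM by the per-layer size. This is a back-of-the-envelope sketch: it assumes uniform layer sizes (only approximately true given APEX's edge/middle gradient) and an arbitrary 8 GB reserve for KV cache and overhead:

```shell
# Back-of-the-envelope -ngl estimate for the 219 GB Quality file on a
# single 80 GB GPU; all constants here are illustrative assumptions.
model_gb=219; layers=60; vram_gb=80; overhead_gb=8
ngl=$(awk -v m="$model_gb" -v l="$layers" -v v="$vram_gb" -v o="$overhead_gb" \
  'BEGIN { printf "%d", (v - o) / (m / l) }')
echo "$ngl"   # layers to pass as: ./llama-cli -ngl "$ngl" ...
```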

Reproduction

All APEX quantizations can be reproduced using the apex-quant toolkit:

git clone https://github.com/mudler/apex-quant
cd apex-quant

# Generate configs for 60-layer model
./scripts/generate_config.sh --layers 60 --profile quality --prefix trinity_large

# Quantize
./scripts/quantize.sh --config configs/trinity_large_quality.txt \
  --model source.gguf --output Trinity-Large-Thinking-APEX-Quality.gguf

Links

Brought to you by the LocalAI team.
