This repository provides Llama 3.1 8B Instruct compiled and optimized for Mobilint NPU hardware, packaged for use with llama-cli-mblt.
| Branch | Contents | Description |
|---|---|---|
| main | Body model only | Standard autoregressive decoding |
| eagle3 | Body + FC + Draft models | EAGLE3 speculative decoding (~2-4x faster) |
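For intuition, here is a minimal, generic sketch of speculative decoding. It is an illustration only, not the llama-cli-mblt internals (EAGLE3 additionally uses tree-structured drafting): a cheap draft model proposes a few tokens, the target model verifies them, and the longest agreeing prefix is accepted plus one target-chosen token.

```python
def speculative_step(target, draft, ctx, n_draft):
    # 1. Draft phase: propose n_draft tokens autoregressively with the draft model.
    proposed, d_ctx = [], list(ctx)
    for _ in range(n_draft):
        tok = draft(d_ctx)
        proposed.append(tok)
        d_ctx.append(tok)

    # 2. Verify phase: the target model checks each proposal in order.
    #    (Real engines verify all proposals in a single batched forward pass.)
    accepted, v_ctx = [], list(ctx)
    for tok in proposed:
        t_tok = target(v_ctx)
        if t_tok != tok:
            accepted.append(t_tok)   # mismatch: keep the target's token, stop
            break
        accepted.append(tok)         # match: the draft token is accepted
        v_ctx.append(tok)
    else:
        accepted.append(target(v_ctx))  # all drafts accepted: one bonus token

    return accepted

# Toy deterministic "models": the next token is the current context length.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) < 4 else -1  # diverges after 4 tokens

print(speculative_step(target, draft, [0, 0], 3))  # [2, 3, 4]: 3 tokens this step
```

Because verification is parallel while drafting is cheap, each expensive target pass can yield several tokens, which is where the decode speedup comes from.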
The `main` branch contains:

| File | Size | Description |
|---|---|---|
| llama-3.1-8b-instruct-vocab.gguf | 7.5 MB | Tokenizer (vocab-only GGUF) |
| target_emb.bin | 2.0 GB | Body embedding weights (float32) |
| single_Body_Llama-3.1-8B-Instruct.mxq | 3.7 GB | Body model for NPU (W4V8 quantized) |
| config.json | - | Model configuration |
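The ~2.0 GB size of `target_emb.bin` matches a float32 embedding table for Llama 3.1 (128,256-token vocabulary, 4,096 hidden dimensions):

```python
# Sanity check on the reported target_emb.bin size (float32 embedding table).
vocab_size = 128_256    # Llama 3.1 tokenizer vocabulary
hidden_size = 4_096     # Llama 3.1 8B hidden (embedding) dimension
bytes_per_value = 4     # float32

size_bytes = vocab_size * hidden_size * bytes_per_value
print(f"{size_bytes / 1024**3:.2f} GiB")  # 1.96 GiB, listed above as ~2.0 GB
```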
The `eagle3` branch adds:

| File | Size | Description |
|---|---|---|
| single_Fc_Llama-3.1-8B-Instruct.mxq | 49 MB | FC dimension converter model |
| Draft_Llama-3.1-8B-Instruct.mxq | 181 MB | EAGLE3 draft model |
| draft_emb.bin | 2.0 GB | Draft embedding weights |
| d2t.bin | 250 KB | Draft-to-target vocabulary mapping |
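The exact binary layout of `d2t.bin` is not documented here. The sketch below assumes an EAGLE3-style convention in which the draft model uses a reduced vocabulary and the file stores one signed 64-bit offset per draft id, so that `target_id = draft_id + d2t[draft_id]`; both the dtype and the offset convention are assumptions for illustration, not a spec of this file.

```python
from array import array
import os
import tempfile

def load_d2t(path, typecode="q"):
    # Assumed layout: a flat array of int64 offsets, one per draft-vocab id.
    table = array(typecode)
    with open(path, "rb") as f:
        table.frombytes(f.read())
    return table

def draft_to_target(d2t, draft_id):
    # Offset convention (assumed): the stored value is target_id - draft_id.
    return draft_id + d2t[draft_id]

# Demo with a tiny fabricated table: draft ids 0..3 -> target ids 0, 5, 2, 9.
offsets = array("q", [0, 4, 0, 6])
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(offsets.tobytes())
    path = f.name

d2t = load_d2t(path)
print([draft_to_target(d2t, i) for i in range(4)])  # [0, 5, 2, 9]
os.remove(path)
```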
```bash
# Build llama-cli-mblt
cd llama.cpp
cmake -B build -DLLAMA_MOBILINT=ON -DLLAMA_MOBILINT_RUNTIME_DIR=/path/to/qbruntime -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli-mblt -j$(nproc)

# Download model files
huggingface-cli download mobilint/Llama-3.1-8B-Instruct-GGUF --local-dir models/llama-8b

# Run
./build/bin/llama-cli-mblt \
  --gguf models/llama-8b/llama-3.1-8b-instruct-vocab.gguf \
  --embd models/llama-8b/target_emb.bin \
  --mxq models/llama-8b/single_Body_Llama-3.1-8B-Instruct.mxq \
  --core-mode global4 --chat \
  -p "What is the meaning of life?" -n 256
```
```bash
# Download with eagle3 branch
huggingface-cli download mobilint/Llama-3.1-8B-Instruct-GGUF --revision eagle3 --local-dir models/llama-8b-eagle3

# Run with ~2-4x speedup
./build/bin/llama-cli-mblt \
  --gguf models/llama-8b-eagle3/llama-3.1-8b-instruct-vocab.gguf \
  --embd models/llama-8b-eagle3/target_emb.bin \
  --mxq models/llama-8b-eagle3/single_Body_Llama-3.1-8B-Instruct.mxq \
  --mxq-fc models/llama-8b-eagle3/single_Fc_Llama-3.1-8B-Instruct.mxq \
  --mxq-draft models/llama-8b-eagle3/Draft_Llama-3.1-8B-Instruct.mxq \
  --embd-draft models/llama-8b-eagle3/draft_emb.bin \
  --d2t models/llama-8b-eagle3/d2t.bin \
  --core-mode global4 --n-draft 2 --tree-depth 6 --total-tokens 23 \
  --chat --temp 0.0 -p "Explain quantum computing" -n 200
```
```bash
# Interactive chat
./build/bin/llama-cli-mblt \
  --gguf models/llama-8b-eagle3/llama-3.1-8b-instruct-vocab.gguf \
  --embd models/llama-8b-eagle3/target_emb.bin \
  --mxq models/llama-8b-eagle3/single_Body_Llama-3.1-8B-Instruct.mxq \
  --mxq-fc models/llama-8b-eagle3/single_Fc_Llama-3.1-8B-Instruct.mxq \
  --mxq-draft models/llama-8b-eagle3/Draft_Llama-3.1-8B-Instruct.mxq \
  --embd-draft models/llama-8b-eagle3/draft_emb.bin \
  --d2t models/llama-8b-eagle3/d2t.bin \
  --core-mode global4 --n-draft 2 --tree-depth 6 --total-tokens 23 \
  -i -n 256
```
Tested on Mobilint Aries NPU with global4 core mode:
| Mode | Prefill | Decode | Tokens/Step |
|---|---|---|---|
| Simple | ~330 t/s | ~10 t/s | 1.0 |
| EAGLE3 | ~330 t/s | ~23 t/s | ~4.2 |
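A quick check on the table's numbers: EAGLE3 raises decode throughput from ~10 to ~23 t/s, a ~2.3x speedup, within the quoted ~2-4x range. The gap between that and the ~4.2 accepted tokens per step reflects the extra cost of each verification pass:

```python
simple_decode = 10.0   # t/s, Simple mode (from the table above)
eagle3_decode = 23.0   # t/s, EAGLE3 mode
tokens_per_step = 4.2  # average tokens accepted per EAGLE3 decoding step

speedup = eagle3_decode / simple_decode
print(f"{speedup:.1f}x")  # 2.3x decode speedup

# Each EAGLE3 step yields ~4.2 tokens but runs slower than a simple decode
# step; the implied per-step cost ratio explains why 4.2 tokens/step does
# not translate into a 4.2x speedup.
step_cost_ratio = tokens_per_step / speedup
print(f"{step_cost_ratio:.2f}")  # ~1.83x the cost of a simple decode step
```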
This model is compiled and optimized for Mobilint NPU hardware. It is intended to be used with llama-cli-mblt from llama.cpp's mobilint example.
Base model: meta-llama/Llama-3.1-8B