# Llama 3.1 8B Instruct — GGUF + MXQ for llama-cli-mblt

This repository provides Llama 3.1 8B Instruct compiled and optimized for Mobilint NPU hardware, packaged for use with llama-cli-mblt.

## Branches

| Branch | Contents | Description |
| --- | --- | --- |
| `main` | Body model only | Standard autoregressive decoding |
| `eagle3` | Body + FC + Draft models | EAGLE3 speculative decoding (~2-4x faster) |

## Files

### `main` branch

| File | Size | Description |
| --- | --- | --- |
| `llama-3.1-8b-instruct-vocab.gguf` | 7.5 MB | Tokenizer (vocab-only GGUF) |
| `target_emb.bin` | 2.0 GB | Body embedding weights (float32) |
| `single_Body_Llama-3.1-8B-Instruct.mxq` | 3.7 GB | Body model for NPU (W4V8 quantized) |
| `config.json` | — | Model configuration |

### `eagle3` branch (adds)

| File | Size | Description |
| --- | --- | --- |
| `single_Fc_Llama-3.1-8B-Instruct.mxq` | 49 MB | FC dimension converter model |
| `Draft_Llama-3.1-8B-Instruct.mxq` | 181 MB | EAGLE3 draft model |
| `draft_emb.bin` | 2.0 GB | Draft embedding weights |
| `d2t.bin` | 250 KB | Draft-to-target vocabulary mapping |
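Before launching inference it can be worth confirming that every eagle3-branch artifact actually landed in the download directory. The helper below is a hypothetical sketch, not part of `llama-cli-mblt`; the file list mirrors the two tables above:

```shell
# Hypothetical helper (not part of llama-cli-mblt): report which of the
# eagle3-branch artifacts are present in a model directory.
# Returns nonzero if anything is missing.
check_eagle3_files() {
    dir="$1"
    missing=0
    for f in \
        llama-3.1-8b-instruct-vocab.gguf \
        target_emb.bin \
        single_Body_Llama-3.1-8B-Instruct.mxq \
        single_Fc_Llama-3.1-8B-Instruct.mxq \
        Draft_Llama-3.1-8B-Instruct.mxq \
        draft_emb.bin \
        d2t.bin
    do
        if [ -f "$dir/$f" ]; then
            echo "OK      $f"
        else
            echo "MISSING $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: `check_eagle3_files models/llama-8b-eagle3 || echo "re-run the download"`.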

## Quick Start

### Install

```shell
# Build llama-cli-mblt
cd llama.cpp
cmake -B build -DLLAMA_MOBILINT=ON -DLLAMA_MOBILINT_RUNTIME_DIR=/path/to/qbruntime -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli-mblt -j$(nproc)
```

### Simple decoding (`main` branch)

```shell
# Download model files
huggingface-cli download mobilint/Llama-3.1-8B-Instruct-GGUF --local-dir models/llama-8b

# Run
./build/bin/llama-cli-mblt \
    --gguf  models/llama-8b/llama-3.1-8b-instruct-vocab.gguf \
    --embd  models/llama-8b/target_emb.bin \
    --mxq   models/llama-8b/single_Body_Llama-3.1-8B-Instruct.mxq \
    --core-mode global4 --chat \
    -p "What is the meaning of life?" -n 256
```

### EAGLE3 speculative decoding (`eagle3` branch)

```shell
# Download the eagle3 branch
huggingface-cli download mobilint/Llama-3.1-8B-Instruct-GGUF --revision eagle3 --local-dir models/llama-8b-eagle3

# Run with ~2-4x speedup
./build/bin/llama-cli-mblt \
    --gguf  models/llama-8b-eagle3/llama-3.1-8b-instruct-vocab.gguf \
    --embd  models/llama-8b-eagle3/target_emb.bin \
    --mxq   models/llama-8b-eagle3/single_Body_Llama-3.1-8B-Instruct.mxq \
    --mxq-fc    models/llama-8b-eagle3/single_Fc_Llama-3.1-8B-Instruct.mxq \
    --mxq-draft models/llama-8b-eagle3/Draft_Llama-3.1-8B-Instruct.mxq \
    --embd-draft models/llama-8b-eagle3/draft_emb.bin \
    --d2t   models/llama-8b-eagle3/d2t.bin \
    --core-mode global4 --n-draft 2 --tree-depth 6 --total-tokens 23 \
    --chat --temp 0.0 -p "Explain quantum computing" -n 200

# Interactive chat
./build/bin/llama-cli-mblt \
    --gguf  models/llama-8b-eagle3/llama-3.1-8b-instruct-vocab.gguf \
    --embd  models/llama-8b-eagle3/target_emb.bin \
    --mxq   models/llama-8b-eagle3/single_Body_Llama-3.1-8B-Instruct.mxq \
    --mxq-fc    models/llama-8b-eagle3/single_Fc_Llama-3.1-8B-Instruct.mxq \
    --mxq-draft models/llama-8b-eagle3/Draft_Llama-3.1-8B-Instruct.mxq \
    --embd-draft models/llama-8b-eagle3/draft_emb.bin \
    --d2t   models/llama-8b-eagle3/d2t.bin \
    --core-mode global4 --n-draft 2 --tree-depth 6 --total-tokens 23 \
    -i -n 256
```
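The two eagle3 invocations repeat the same model and speculation flags; collecting them in a variable keeps the commands in sync. This is a sketch: `MODEL_DIR` and `EAGLE3_FLAGS` are names chosen here, not options of `llama-cli-mblt`, and the flag values mirror the commands above:

```shell
# Shared flags for the eagle3 runs, collected once.
MODEL_DIR=models/llama-8b-eagle3
EAGLE3_FLAGS="--gguf $MODEL_DIR/llama-3.1-8b-instruct-vocab.gguf \
 --embd $MODEL_DIR/target_emb.bin \
 --mxq $MODEL_DIR/single_Body_Llama-3.1-8B-Instruct.mxq \
 --mxq-fc $MODEL_DIR/single_Fc_Llama-3.1-8B-Instruct.mxq \
 --mxq-draft $MODEL_DIR/Draft_Llama-3.1-8B-Instruct.mxq \
 --embd-draft $MODEL_DIR/draft_emb.bin \
 --d2t $MODEL_DIR/d2t.bin \
 --core-mode global4 --n-draft 2 --tree-depth 6 --total-tokens 23"
```

Then a one-shot run is `./build/bin/llama-cli-mblt $EAGLE3_FLAGS --chat --temp 0.0 -p "Explain quantum computing" -n 200`, and interactive chat is `./build/bin/llama-cli-mblt $EAGLE3_FLAGS -i -n 256`. Leaving `$EAGLE3_FLAGS` unquoted is deliberate so it splits into words; the paths here contain no spaces.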

## Performance

Tested on Mobilint Aries NPU with `global4` core mode:

| Mode | Prefill | Decode | Tokens/Step |
| --- | --- | --- | --- |
| Simple | ~330 t/s | ~10 t/s | 1.0 |
| EAGLE3 | ~330 t/s | ~23 t/s | ~4.2 |
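As a sanity check, the decode columns above imply roughly a 2.3x end-to-end decoding speedup. A quick recomputation (an illustrative sketch; the numbers are taken from the table):

```shell
# Recompute the decode speedup from the table: EAGLE3 emits ~23 t/s
# versus ~10 t/s for simple decoding on the same hardware.
awk 'BEGIN {
    simple = 10   # simple decode, tokens/s
    eagle3 = 23   # EAGLE3 decode, tokens/s
    printf "decode speedup: ~%.1fx\n", eagle3 / simple
}'
```

The ~4.2 tokens/step figure is a separate quantity: the average number of tokens accepted per NPU forward pass, not the wall-clock speedup.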

## About

This model is compiled and optimized for Mobilint NPU hardware and is intended for use with `llama-cli-mblt` from llama.cpp's `mobilint` example.
