# Llama-3.2-3B-Instruct-QCS9075-HTP
This is a pre-compiled version of meta-llama/Llama-3.2-3B-Instruct optimized for the Qualcomm QCS9075 SoC using the Qualcomm Genie SDK.
## Model Details
- Base Model: meta-llama/Llama-3.2-3B-Instruct
- Target Hardware: Qualcomm QCS9075 (IQ-9075 EVK)
- Backend: QnnHtp (NPU)
- Quantization: W4A16
- Compilation: Qualcomm AI Hub (QAIRT 2.42)
## Performance
| Model | Backend | Performance | Size |
|---|---|---|---|
| Llama-3.2-3B-Instruct-QCS9075-HTP | QnnHtp (NPU) | ~18.7 TPS on QCS9075 | 2.5 GB |

*TPS = tokens per second (generation speed).*
## Hardware Requirements
- Device: Qualcomm IQ-9075 EVK or QCS9075-based device
- OS: Ubuntu 22.04 (recommended)
- SDK: Qualcomm Genie SDK
- QAIRT: Version 2.42 or later
## Usage

### Prerequisites
- Install the Qualcomm Genie SDK on your QCS9075 device
- Download all model files from this repository
- Ensure QAIRT 2.42 libraries are available
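A quick sanity check for the first and third prerequisites (the QAIRT install path matches the one used in the rest of this card; adjust it for your SDK version):

```shell
# Verify the QAIRT 2.42 libraries are where this card expects them.
QAIRT_LIB=/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu
if [ -d "$QAIRT_LIB" ]; then
    echo "QAIRT libraries found at $QAIRT_LIB"
else
    echo "QAIRT libraries missing: $QAIRT_LIB (install QAIRT 2.42 first)"
fi
```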
### Environment Setup

For HTP models, the `LD_LIBRARY_PATH` ordering is critical:

```bash
export LD_LIBRARY_PATH=/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs:$LD_LIBRARY_PATH
```
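To confirm the ordering took effect, print the first entries of the search path (the export is repeated here so the snippet stands alone):

```shell
export LD_LIBRARY_PATH=/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs:$LD_LIBRARY_PATH
# The QAIRT directory must be resolved first, ahead of the Genie QNN libs:
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n 2
```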
### Configuration

Create a `genie_config.json` file:

```json
{
  "model_path": "/path/to/model/files",
  "backend": "QnnHtp",
  "device": "0"
}
```
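When provisioning devices from a script, the same file can be generated and checked non-interactively. This is a sketch using the placeholder values from above; replace `model_path` with the real location of the downloaded artifacts:

```shell
# Write the minimal Genie configuration and verify it landed on disk.
cat > genie_config.json <<'EOF'
{
  "model_path": "/path/to/model/files",
  "backend": "QnnHtp",
  "device": "0"
}
EOF
grep -q '"backend": "QnnHtp"' genie_config.json && echo "genie_config.json written"
```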
### Running the Model

```bash
# Using the Genie server
python3 /opt/qcom/aistack/genie/examples/server_persistent.py \
  --config genie_config.json \
  --port 8000
```
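The HTTP API exposed by `server_persistent.py` is not documented in this card, so the probe below only checks that something is listening on the chosen port; it makes no assumptions about endpoint paths:

```shell
# TCP reachability check against the port passed via --port above.
if timeout 2 bash -c 'exec 3<>/dev/tcp/localhost/8000' 2>/dev/null; then
    echo "Genie server reachable on port 8000"
else
    echo "Genie server not reachable on port 8000"
fi
```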
## Kubernetes Deployment

For deployment on Kubernetes clusters with QCS9075 nodes, the following pod specification can serve as a starting point:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: genie-llm-server
spec:
  containers:
    - name: genie
      image: your-registry/genie-runtime:latest
      env:
        - name: LD_LIBRARY_PATH
          value: "/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs"
      volumeMounts:
        - name: model-storage
          mountPath: /models
        - name: qcom-libs
          mountPath: /opt/qcom/aistack
  volumes:
    - name: model-storage
      hostPath:
        path: /mnt/models/llama-3.2-3b-instruct-qcs9075-htp
    - name: qcom-libs
      hostPath:
        path: /opt/qcom/aistack
```
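Save the manifest above as `genie-pod.yaml` (the filename is arbitrary), then deploy and inspect it with standard `kubectl` commands; this assumes `kubectl` is already configured against a cluster whose nodes expose the Qualcomm libraries at the `hostPath` locations shown:

```shell
MANIFEST=genie-pod.yaml
if command -v kubectl >/dev/null 2>&1; then
    kubectl apply -f "$MANIFEST"
    # Wait for the pod to come up, then check its startup output.
    kubectl wait --for=condition=Ready pod/genie-llm-server --timeout=120s
    kubectl logs pod/genie-llm-server
else
    echo "kubectl not found; install it and configure access to your cluster first"
fi
```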
## File Structure
This repository contains:
- Compiled model artifacts (.bin files)
- Configuration files (genie_config.json)
- QNN HTP context binaries
## Benchmarking Notes
- Performance metrics measured on Qualcomm IQ-9075 EVK
- TPS (Tokens Per Second) measured during generation phase
- Results may vary based on prompt length and complexity
- HTP backend utilizes the NPU for acceleration
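For reference, the TPS figure is simply generated tokens divided by generation-phase wall time; a worked illustration with made-up numbers chosen to match the reported ~18.7 TPS:

```shell
# 187 tokens decoded in 10 s of generation time -> 18.7 TPS.
awk 'BEGIN { tokens = 187; seconds = 10; printf "%.1f TPS\n", tokens / seconds }'
```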
## License
This model follows the license of the base model meta-llama/Llama-3.2-3B-Instruct. Please refer to the original model card for license details.
## Acknowledgments
- Base model: meta-llama/Llama-3.2-3B-Instruct
- Compiled using Qualcomm AI Hub with QAIRT 2.42
- Target hardware: Qualcomm QCS9075 SoC
## Support
For issues related to:
- Model compilation: Contact Qualcomm AI Hub support
- Genie SDK: Refer to Qualcomm Genie documentation
- Deployment: Open an issue in this repository
This model is optimized for edge deployment on Qualcomm QCS9075 devices and may not work on other hardware platforms.