Llama-3.2-3B-Instruct-QCS9075-HTP

This is a pre-compiled version of meta-llama/Llama-3.2-3B-Instruct optimized for the Qualcomm QCS9075 SoC using the Qualcomm Genie SDK.

Model Details

  • Base Model: meta-llama/Llama-3.2-3B-Instruct
  • Target Hardware: Qualcomm QCS9075 (IQ-9075 EVK)
  • Backend: QnnHtp (NPU)
  • Quantization: W4A16
  • Compilation: Qualcomm AI Hub (QAIRT 2.42)

Performance

Model | Backend | Performance | Size
Llama-3.2-3B-Instruct-QCS9075-HTP | QnnHtp (NPU) | ~18.7 TPS on QCS9075 | 2.5 GB

TPS = Tokens Per Second (generation speed)

Hardware Requirements

  • Device: Qualcomm IQ-9075 EVK or QCS9075-based device
  • OS: Ubuntu 22.04 (recommended)
  • SDK: Qualcomm Genie SDK
  • QAIRT: Version 2.42 or later

Usage

Prerequisites

  1. Install the Qualcomm Genie SDK on your QCS9075 device
  2. Download all model files from this repository
  3. Ensure QAIRT 2.42 libraries are available

Environment Setup

For HTP models, the ordering of entries in LD_LIBRARY_PATH is critical: the QAIRT libraries must appear before the Genie QNN libraries.

export LD_LIBRARY_PATH=/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs:$LD_LIBRARY_PATH
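As a quick sanity check, a small Python helper (an illustrative sketch, not part of the SDK) can confirm that the QAIRT directory is the first entry in LD_LIBRARY_PATH:

```python
import os

QAIRT_LIB_DIR = "/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu"

def qairt_resolves_first(ld_library_path: str,
                         qairt_dir: str = QAIRT_LIB_DIR) -> bool:
    """Return True if the QAIRT library directory is the first search entry."""
    entries = [p for p in ld_library_path.split(":") if p]
    return bool(entries) and entries[0] == qairt_dir

if __name__ == "__main__":
    # Inspect the current environment on the device.
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print("QAIRT resolves first:", qairt_resolves_first(path))
```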

Configuration

Create a genie_config.json file:

{
  "model_path": "/path/to/model/files",
  "backend": "QnnHtp",
  "device": "0"
}
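Before launching the server, the config can be checked with a short validation helper. This is an illustrative sketch: the required keys simply mirror the example above, and the Genie SDK itself may accept additional fields.

```python
import json

# Keys used by the example genie_config.json in this card.
REQUIRED_KEYS = {"model_path", "backend", "device"}

def validate_genie_config(text: str) -> dict:
    """Parse genie_config.json text and check the keys shown above."""
    cfg = json.loads(text)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if cfg["backend"] != "QnnHtp":
        raise ValueError("this model is compiled for the QnnHtp backend")
    return cfg
```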

Running the Model

# Using the Genie server
python3 /opt/qcom/aistack/genie/examples/server_persistent.py \
  --config genie_config.json \
  --port 8000
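Once the server is up, it can be queried over HTTP. The sketch below is a hypothetical client: the `/generate` route and the `prompt`/`max_tokens` field names are assumptions, not the documented Genie API; consult the `server_persistent.py` example's source for the actual route and request schema.

```python
import json
import urllib.request

def build_generation_request(prompt: str, max_tokens: int = 128) -> bytes:
    """Build a JSON request body; the field names here are illustrative."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")

def query_server(prompt: str, url: str = "http://localhost:8000/generate") -> str:
    # NOTE: the /generate path is a placeholder; check the Genie server
    # example for the real endpoint and response format.
    req = urllib.request.Request(
        url,
        data=build_generation_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```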

Kubernetes Deployment

For deploying on Kubernetes clusters with QCS9075 nodes, the following Pod definition can serve as a starting point:

apiVersion: v1
kind: Pod
metadata:
  name: genie-llm-server
spec:
  containers:
  - name: genie
    image: your-registry/genie-runtime:latest
    env:
    - name: LD_LIBRARY_PATH
      value: "/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs"
    volumeMounts:
    - name: model-storage
      mountPath: /models
    - name: qcom-libs
      mountPath: /opt/qcom/aistack
  volumes:
  - name: model-storage
    hostPath:
      path: /mnt/models/llama-3.2-3b-instruct-qcs9075-htp
  - name: qcom-libs
    hostPath:
      path: /opt/qcom/aistack

File Structure

This repository contains:

  • Compiled model artifacts (.bin files)
  • Configuration files (genie_config.json)
  • QNN HTP context binaries

Benchmarking Notes

  • Performance metrics measured on Qualcomm IQ-9075 EVK
  • TPS (Tokens Per Second) measured during generation phase
  • Results may vary based on prompt length and complexity
  • HTP backend utilizes the NPU for acceleration
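The TPS figure above is simply generated tokens divided by generation wall time, excluding prefill. A minimal sketch of that calculation (the 374-token example is illustrative, chosen to reproduce the table's figure):

```python
def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """TPS over the generation phase only (prefill time excluded)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_tokens / elapsed_seconds

# e.g. 374 tokens generated in 20.0 s of decode time -> 18.7 TPS
```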

License

This model follows the license of the base model meta-llama/Llama-3.2-3B-Instruct. Please refer to the original model card for license details.


Support

For issues related to:

  • Model compilation: Contact Qualcomm AI Hub support
  • Genie SDK: Refer to Qualcomm Genie documentation
  • Deployment: Open an issue in this repository

This model is optimized for edge deployment on Qualcomm QCS9075 devices and may not work on other hardware platforms.
