# Llama-3.2-3B-Instruct-QCS9075-HTP
This is a pre-compiled version of meta-llama/Llama-3.2-3B-Instruct optimized for the Qualcomm QCS9075 SoC using the Qualcomm Genie SDK.
## Model Details
- Base Model: meta-llama/Llama-3.2-3B-Instruct
- Target Hardware: Qualcomm QCS9075 (IQ-9075 EVK)
- Backend: QnnHtp (NPU)
- Quantization: W4A16
- Compilation: Qualcomm AI Hub (QAIRT 2.42)
## Performance
| Model | Backend | Performance | Size |
|---|---|---|---|
| Llama-3.2-3B-Instruct-QCS9075-HTP | QnnHtp (NPU) | ~18.7 TPS on QCS9075 | 2.5 GB |

*TPS = tokens per second (generation speed).*
## Hardware Requirements
- Device: Qualcomm IQ-9075 EVK or QCS9075-based device
- OS: Ubuntu 22.04 (recommended)
- SDK: Qualcomm Genie SDK
- QAIRT: Version 2.42 or later
## Usage

### Prerequisites
- Install the Qualcomm Genie SDK on your QCS9075 device
- Download all model files from this repository
- Ensure QAIRT 2.42 libraries are available
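A quick sanity check for the first and third prerequisites (the QAIRT install path matches the one used in the rest of this card; adjust it for your SDK version):

```shell
# Verify the QAIRT 2.42 libraries are where this card expects them.
QAIRT_LIB=/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu
if [ -d "$QAIRT_LIB" ]; then
    echo "QAIRT libraries found at $QAIRT_LIB"
else
    echo "QAIRT libraries missing: $QAIRT_LIB (install QAIRT 2.42 first)"
fi
```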
### Environment Setup

For HTP models, the `LD_LIBRARY_PATH` ordering is critical:

```bash
export LD_LIBRARY_PATH=/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs:$LD_LIBRARY_PATH
```
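To confirm the ordering took effect, print the first entries of the search path (the export is repeated here so the snippet stands alone):

```shell
export LD_LIBRARY_PATH=/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs:$LD_LIBRARY_PATH
# The QAIRT directory must be resolved first, ahead of the Genie QNN libs:
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n 2
```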
### Configuration

Create a `genie_config.json` file:

```json
{
  "model_path": "/path/to/model/files",
  "backend": "QnnHtp",
  "device": "0"
}
```
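When provisioning devices from a script, the same file can be generated and checked non-interactively. This is a sketch using the placeholder values from above; replace `model_path` with the real location of the downloaded artifacts:

```shell
# Write the minimal Genie configuration and verify it landed on disk.
cat > genie_config.json <<'EOF'
{
  "model_path": "/path/to/model/files",
  "backend": "QnnHtp",
  "device": "0"
}
EOF
grep -q '"backend": "QnnHtp"' genie_config.json && echo "genie_config.json written"
```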
### Running the Model

```bash
# Using the Genie server
python3 /opt/qcom/aistack/genie/examples/server_persistent.py \
  --config genie_config.json \
  --port 8000
```
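The HTTP API exposed by `server_persistent.py` is not documented in this card, so the probe below only checks that something is listening on the chosen port; it makes no assumptions about endpoint paths:

```shell
# TCP reachability check against the port passed via --port above.
if timeout 2 bash -c 'exec 3<>/dev/tcp/localhost/8000' 2>/dev/null; then
    echo "Genie server reachable on port 8000"
else
    echo "Genie server not reachable on port 8000"
fi
```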
## Kubernetes Deployment

For deployment on Kubernetes clusters with QCS9075 nodes, the following pod specification can serve as a starting point:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: genie-llm-server
spec:
  containers:
    - name: genie
      image: your-registry/genie-runtime:latest
      env:
        - name: LD_LIBRARY_PATH
          value: "/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs"
      volumeMounts:
        - name: model-storage
          mountPath: /models
        - name: qcom-libs
          mountPath: /opt/qcom/aistack
  volumes:
    - name: model-storage
      hostPath:
        path: /mnt/models/llama-3.2-3b-instruct-qcs9075-htp
    - name: qcom-libs
      hostPath:
        path: /opt/qcom/aistack
```
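Save the manifest above as `genie-pod.yaml` (the filename is arbitrary), then deploy and inspect it with standard `kubectl` commands; this assumes `kubectl` is already configured against a cluster whose nodes expose the Qualcomm libraries at the `hostPath` locations shown:

```shell
MANIFEST=genie-pod.yaml
if command -v kubectl >/dev/null 2>&1; then
    kubectl apply -f "$MANIFEST"
    # Wait for the pod to come up, then check its startup output.
    kubectl wait --for=condition=Ready pod/genie-llm-server --timeout=120s
    kubectl logs pod/genie-llm-server
else
    echo "kubectl not found; install it and configure access to your cluster first"
fi
```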
## File Structure
This repository contains:
- Compiled model artifacts (.bin files)
- Configuration files (genie_config.json)
- QNN HTP context binaries
## Benchmarking Notes
- Performance metrics measured on Qualcomm IQ-9075 EVK
- TPS (Tokens Per Second) measured during generation phase
- Results may vary based on prompt length and complexity
- HTP backend utilizes the NPU for acceleration
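For reference, the TPS figure is simply generated tokens divided by generation-phase wall time; a worked illustration with made-up numbers chosen to match the reported ~18.7 TPS:

```shell
# 187 tokens decoded in 10 s of generation time -> 18.7 TPS.
awk 'BEGIN { tokens = 187; seconds = 10; printf "%.1f TPS\n", tokens / seconds }'
```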
## License
This model follows the license of the base model meta-llama/Llama-3.2-3B-Instruct. Please refer to the original model card for license details.
## Acknowledgments
- Base model: meta-llama/Llama-3.2-3B-Instruct
- Compiled using Qualcomm AI Hub with QAIRT 2.42
- Target hardware: Qualcomm QCS9075 SoC
## Support
For issues related to:
- Model compilation: Contact Qualcomm AI Hub support
- Genie SDK: Refer to Qualcomm Genie documentation
- Deployment: Open an issue in this repository
This model is optimized for edge deployment on Qualcomm QCS9075 devices and may not work on other hardware platforms.