# SmolVLA LIBERO - QNN CPU

QNN-converted SmolVLA models for CPU inference on x86_64 Linux.
## Model Overview

This repository contains the SmolVLA LIBERO models converted with the Qualcomm Neural Network (QNN) SDK v2.43.0 and optimized for CPU inference on the x86_64-linux-clang architecture.
| Property | Value |
|---|---|
| Source Model | HuggingFaceVLA/smolvla_libero |
| ONNX Intermediate | xpuenabler/smolvla-libero-ONNX |
| QNN SDK Version | v2.43.0 |
| Target Architecture | x86_64-linux-clang (CPU backend) |
| Float Bitwidth | 32-bit (no quantization) |
| Verification | Cosine similarity = 1.000000 vs ONNX baseline |
## Model Architecture
SmolVLA LIBERO is decomposed into 3 QNN components:
1. **Vision Encoder** (`libvision_encoder.so`)
   - Size: 376 MB (.so) + 375 MB (.bin)
   - Purpose: Encodes RGB images to visual embeddings
   - Input: `pixel_values` - RGB images
   - Output: `image_embeddings` - Visual feature vectors

2. **LLM Backbone** (`libllm_backbone.so`)
   - Size: 1.4 GB (.so) + 1.4 GB (.bin)
   - Purpose: Language model processing with cross-attention to vision
   - Inputs: Image embeddings, language tokens, attention masks
   - Outputs: KV cache, prefix padding masks

3. **Action Head v2** (`libaction_head_v2.so`)
   - Size: 375 MB (.so) + 372 MB (.bin)
   - Purpose: Predicts robot action velocities
   - Inputs: Noisy actions, KV cache, prefix masks
   - Output: Action velocity predictions
## Critical I/O Tensor Transposition Requirements

⚠️ **IMPORTANT**: QNN models require specific tensor layouts. The following transpositions are MANDATORY (a numpy sketch follows the tables below):
### Vision Encoder

**Input (`pixel_values`):**
- ONNX format: `[1, 3, 512, 512]` (channels-first)
- QNN format: `[1, 512, 512, 3]` (channels-last)
- Transpose: `(0, 2, 3, 1)`

**Output (`image_embeddings`):**
- ONNX format: `[1, 64, 960]`
- QNN format: `[1, 64, 960]` (no transpose needed)
### LLM Backbone

**Input (`image_embeddings`):**
- ONNX format: `[1, 64, 960]`
- QNN format: `[1, 960, 64]`
- Transpose: `(0, 2, 1)`

**Inputs (`language_tokens`, `attention_masks`):**
- ONNX format: int64
- QNN format: float32 (CRITICAL: QNN SDK v2.43 converts all int64 to float32)

**Outputs (`kv_cache`, `prefix_pad_masks`):**
- Format: float32 (no transpose needed)
### Action Head v2

**Input (`noisy_actions`):**
- ONNX format: `[1, 8, 7]`
- QNN format: `[1, 7, 8]`
- Transpose: `(0, 2, 1)`

**Inputs (`kv_keys`, `kv_values`):**
- ONNX format: `[1, num_heads, seq_len, head_dim]`
- QNN format: `[1, seq_len, head_dim, num_heads]`
- Transpose: `(0, 2, 3, 4, 1)`

**Input (`prefix_pad_masks`):**
- Format: float32

**Output (`velocity`):**
- ONNX format: `[1, 8, 7]`
- QNN format: `[1, 7, 8]`
- Transpose back: `(0, 2, 1)`
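The host-side layout conversions above can be expressed with plain numpy transposes. The following is a minimal, illustrative sketch only (the bundled `infer_libero_episode_qnn_cpu.py` applies the equivalent conversions internally):

```python
# Illustrative sketch of the ONNX -> QNN layout conversions listed above.
import numpy as np

# Vision encoder input: channels-first -> channels-last
pixel_values_onnx = np.zeros((1, 3, 512, 512), dtype=np.float32)
pixel_values_qnn = pixel_values_onnx.transpose(0, 2, 3, 1)     # [1, 512, 512, 3]

# LLM backbone input: image embeddings [1, 64, 960] -> [1, 960, 64]
image_embeddings = np.zeros((1, 64, 960), dtype=np.float32)
image_embeddings_qnn = image_embeddings.transpose(0, 2, 1)     # [1, 960, 64]

# Action head: noisy actions in, velocities out, both swapped on the last two axes
noisy_actions = np.zeros((1, 8, 7), dtype=np.float32)
noisy_actions_qnn = noisy_actions.transpose(0, 2, 1)           # [1, 7, 8]
velocity_qnn = np.zeros((1, 7, 8), dtype=np.float32)           # raw QNN output
velocity = velocity_qnn.transpose(0, 2, 1)                     # back to ONNX layout [1, 8, 7]
```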
## Critical QNN SDK v2.43 Behavior

⚠️ **KNOWN LIMITATION**: QNN SDK v2.43.0 automatically converts ALL int64 tensors to float32 during model conversion. This affects:

- Language token inputs (originally int64 in ONNX)
- Attention mask inputs (originally int64 in ONNX)

**Solution**: Cast these inputs to float32 before passing them to the QNN models, as sketched below. See `infer_libero_episode_qnn_cpu.py` for the full implementation.
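A minimal illustration of the cast (variable names and the `[1, 48]` shapes are taken from the conversion commands later in this document, not from the script itself):

```python
# Illustrative only: tokenizers emit int64 ids/masks, but the converted QNN
# graphs expect float32 for these inputs.
import numpy as np

lang_tokens = np.zeros((1, 48), dtype=np.int64)    # token ids (int64)
lang_masks = np.ones((1, 48), dtype=np.int64)      # attention mask (int64)

lang_tokens_qnn = lang_tokens.astype(np.float32)   # cast before feeding the LLM backbone
lang_masks_qnn = lang_masks.astype(np.float32)
```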
## Verification Results

All 3 models have been verified against their ONNX counterparts:

- Vision Encoder: cosine_similarity = 1.000000 ✅
- LLM Backbone: cosine_similarity = 1.000000 ✅
- Action Head v2: cosine_similarity = 1.000000 ✅

Verification script: `scripts/compare_onnx_vs_qnn.py`
## ONNX Graph Fixes Applied

The action_head required 3 critical ONNX graph fixes before QNN conversion:

- RMSNorm Fix: Converted RMSNorm operations to equivalent layer norm patterns
- Slice → Gather Fix: Replaced dynamic slice operations with gather operations
- Constants → Initializers Fix: Converted constant nodes to initializer tensors

These fixes are documented in `scripts/fix_action_head_all.py`; a sketch of the third fix follows.
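For illustration only, a minimal sketch of the Constants → Initializers pattern using the standard `onnx` Python API (it assumes the Constant nodes carry a `value` tensor attribute; the output path is hypothetical). The complete implementation, including the RMSNorm and Slice → Gather rewrites, is in `scripts/fix_action_head_all.py`:

```python
# Sketch: promote interior Constant nodes to graph initializers so the QNN
# converter can see their values.
import onnx

model = onnx.load("onnx_models/action_head.onnx")
graph = model.graph

for node in [n for n in graph.node if n.op_type == "Constant"]:
    # Constant nodes store their payload in the "value" attribute (a TensorProto).
    tensor = next(attr.t for attr in node.attribute if attr.name == "value")
    tensor.name = node.output[0]        # initializer name must match the consumer's input
    graph.initializer.append(tensor)    # promote the constant to a graph initializer
    graph.node.remove(node)             # drop the now-redundant Constant node

onnx.save(model, "onnx_models/action_head_constants_fixed.onnx")  # hypothetical output path
```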
## File Structure

```
xpuenabler/smolvla-libero-QNN-CPU/
├── qnn_models/
│   ├── libvision_encoder.so           # Vision encoder QNN library
│   ├── libllm_backbone.so             # LLM backbone QNN library
│   ├── libaction_head_v2.so           # Action head QNN library
│   ├── vision_encoder.bin             # Vision encoder weights
│   ├── llm_backbone.bin               # LLM backbone weights
│   ├── action_head_v2.bin             # Action head weights
│   ├── vision_encoder_net.json        # Vision encoder graph definition
│   ├── llm_backbone_net.json          # LLM backbone graph definition
│   └── action_head_v2_net.json        # Action head graph definition
├── config.json                        # Model configuration
├── policy_preprocessor.json           # Input preprocessing config
├── policy_postprocessor.json          # Output postprocessing config
├── policy_preprocessor_step_5_normalizer_processor.safetensors
├── policy_postprocessor_step_1_unnormalizer_processor.safetensors
├── convert_onnx_to_qnn.sh             # ONNX → QNN conversion script (all 3 models)
├── scripts/
│   ├── infer_libero_episode_qnn_cpu.py   # Main inference script
│   ├── fix_action_head_all.py            # ONNX graph fixes for action_head
│   ├── compare_onnx_vs_qnn.py            # ONNX vs QNN verification script
│   └── export_smolvla_to_onnx.py         # PyTorch → ONNX export utility
├── videos/                            # LIBERO evaluation episode videos
└── README.md                          # This file
```
## Installation & Setup

### Requirements

- QNN SDK: v2.43.0 or later
- Python: 3.10+
- OS: Linux x86_64
- Dependencies: numpy, torch, transformers, huggingface_hub

### Environment Setup
```bash
# Set QNN SDK path
export QNN_SDK_ROOT=/path/to/qnn-sdk-v2.43.0

# Add QNN libraries to library path
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH

# Verify QNN tools are available
which qnn-net-run
```
### Python Dependencies

```bash
pip install numpy torch transformers huggingface_hub
```
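To fetch the converted model files locally, one option is `huggingface_hub.snapshot_download` (the repo id below is the one shown in the File Structure section; the target directory is arbitrary):

```python
# Optional: download the QNN model repository from the Hub before running inference.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="xpuenabler/smolvla-libero-QNN-CPU",
    local_dir="./smolvla-libero-QNN-CPU",  # qnn_models/, config.json, scripts/ land here
)
print(local_dir)
```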
## Usage

### Basic Inference
```python
from infer_libero_episode_qnn_cpu import QNNSmolVLAInference

# Initialize inference engine
inference = QNNSmolVLAInference(
    qnn_models_dir="./qnn_models",
    config_path="./config.json",
    preprocessor_config="./policy_preprocessor.json",
    postprocessor_config="./policy_postprocessor.json"
)

# Run inference on an image and language instruction
image = ...  # PIL Image or numpy array [H, W, 3]
instruction = "pick up the red cube"
action = inference.predict(image, instruction)

print(f"Predicted action: {action}")  # [8, 7] velocity vector
```
### Command-Line Inference

```bash
python scripts/infer_libero_episode_qnn_cpu.py \
    --image_path /path/to/image.jpg \
    --instruction "pick up the red cube" \
    --qnn_models_dir ./qnn_models \
    --config_path ./config.json
```
### Verification Against ONNX

```bash
python scripts/compare_onnx_vs_qnn.py \
    --onnx_model_path /path/to/onnx/model \
    --qnn_models_dir ./qnn_models \
    --num_samples 100
```
## QNN Model Inspection

### View Model Graph

```bash
qnn-net-run \
    --model qnn_models/vision_encoder_net.json \
    --input_list input_list.txt \
    --debug
```
### Extract Model Information

```bash
python -c "
import json
with open('qnn_models/vision_encoder_net.json') as f:
    graph = json.load(f)
print('Inputs:', [n['name'] for n in graph['graph_nodes'] if n['op'] == 'INPUT'])
print('Outputs:', [n['name'] for n in graph['graph_nodes'] if n['op'] == 'OUTPUT'])
"
```
## Performance Characteristics

| Model | Size | Latency (CPU) | Memory |
|---|---|---|---|
| Vision Encoder | 751 MB | ~50-100 ms | ~1.5 GB |
| LLM Backbone | 2.8 GB | ~200-300 ms | ~5.6 GB |
| Action Head v2 | 747 MB | ~50-100 ms | ~1.5 GB |
| **Total** | 4.3 GB | ~300-500 ms | ~8.6 GB |

*Latencies measured on an Intel Xeon CPU @ 2.20 GHz with 8 cores.*
## Known Issues & Workarounds

### Issue 1: int64 → float32 Conversion

- **Problem**: QNN SDK v2.43 converts all int64 tensors to float32
- **Workaround**: Cast language tokens and masks to float32 before inference
- **Status**: Implemented in `infer_libero_episode_qnn_cpu.py`

### Issue 2: Tensor Layout Mismatches

- **Problem**: QNN uses channels-last format, ONNX uses channels-first
- **Workaround**: Apply the required transpositions (see I/O Tensor Transposition Requirements)
- **Status**: Implemented in the inference script

### Issue 3: Dynamic Shapes

- **Problem**: QNN models are compiled with a fixed batch size (1)
- **Workaround**: Only batch_size=1 is supported; reshape inputs as needed (see the sketch below)
- **Status**: Limitation of the current conversion
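A minimal illustration of the fixed-batch constraint (shapes only, not repository code):

```python
# Illustrative only: the converted graphs expect a leading batch dimension of 1,
# so a single sample must be expanded before being fed to the QNN models.
import numpy as np

image = np.zeros((512, 512, 3), dtype=np.float32)   # one channels-last image
batched = image[np.newaxis, ...]                     # -> [1, 512, 512, 3]
assert batched.shape[0] == 1                         # larger batches are not supported
```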
## Troubleshooting

### Error: "Cannot find libvision_encoder.so"

```bash
# Ensure LD_LIBRARY_PATH includes the QNN SDK libraries
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH
```

### Error: "Input shape mismatch"

```bash
# Check the tensor transposition requirements above
# Verify input shapes match the expected dimensions
# Use compare_onnx_vs_qnn.py to debug
```

### Error: "QNN SDK version mismatch"

```bash
# Ensure QNN SDK v2.43.0 is installed
# Models may not be compatible with other versions
qnn-net-run --version
```
## LIBERO Evaluation Results

Evaluation on LIBERO benchmark tasks (n_action_steps=10, max_steps=520):

| Task ID | Task Description | Steps | Success |
|---|---|---|---|
| 0 | put_both_the_alphabet_soup_and_the_tomato_sauce_in_the_basket | 520 | ❌ |
| 1 | put_both_the_cream_cheese_box_and_the_butter_in_the_basket | 520 | ❌ |
| 2 | put_both_the_alphabet_soup_and_the_cream_cheese_box_in_the_basket | 520 | ❌ |
| 3 | put_the_black_bowl_in_the_back_compartment_of_the_caddy | 221 | ✅ |
| 4 | put_the_butter_at_the_back_of_the_top_shelf | 520 | ❌ |

QNN CPU Success Rate: 1/5 (20%)

Episode videos are available in the `videos/` directory.
## ONNX → QNN Conversion

### Conversion Script

A fully reproducible conversion script is provided: `convert_onnx_to_qnn.sh`
```bash
# Prerequisites
export QNN_SDK=/path/to/qairt/2.43.0.260128
export QNN_PYTHON=/path/to/python3.10
export PATH=/path/to/llvm/bin:$PATH   # clang++ required for .so build

# Run conversion (all 3 models)
bash convert_onnx_to_qnn.sh \
    --onnx-dir ./onnx_models \
    --output-dir ./qnn_models
```
The script performs 3 stages per model:

1. `qnn-onnx-converter` → ONNX → QNN graph (.cpp) + weights (.bin)
2. `qnn-model-lib-generator` → QNN graph → shared library (.so)
3. `qnn-net-run` (optional) → smoke test with the CPU backend

For the action_head, it also runs `fix_action_head_all.py` first to apply the required ONNX graph fixes.
### Conversion Pipeline

```
PyTorch (HuggingFaceVLA/smolvla_libero)
  |
  v  export_smolvla_to_onnx.py
ONNX (xpuenabler/smolvla-libero-ONNX)
  |
  +-- vision_encoder.onnx -----------> qnn-onnx-converter -> qnn-model-lib-generator -> libvision_encoder.so
  +-- llm_backbone.onnx -------------> qnn-onnx-converter -> qnn-model-lib-generator -> libllm_backbone.so
  +-- action_head.onnx -> fix_all.py -> qnn-onnx-converter -> qnn-model-lib-generator -> libaction_head_v2.so
```
### Per-Model Conversion Commands

#### 1. Vision Encoder

```bash
# I/O: pixel_values [1,3,512,512] -> image_embeddings [1,64,960]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/vision_encoder.onnx \
    --input_dim pixel_values 1,3,512,512 \
    --output_path qnn_models/vision_encoder \
    --float_bitwidth 32

$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/vision_encoder.cpp \
    -b qnn_models/vision_encoder.bin \
    -o qnn_models -t x86_64-linux-clang
```
#### 2. LLM Backbone

```bash
# I/O: image_embs[1,64,960] x2 + lang_tokens[1,48] + lang_masks[1,48] + state[1,32]
#      -> kv_keys/values[32,1,177,5,64] + prefix_pad_masks[1,177]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/llm_backbone.onnx \
    --input_dim image_embs_1 1,64,960 \
    --input_dim image_embs_2 1,64,960 \
    --input_dim lang_tokens 1,48 \
    --input_dim lang_masks 1,48 \
    --input_dim state 1,32 \
    --output_path qnn_models/llm_backbone \
    --float_bitwidth 32

$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/llm_backbone.cpp \
    -b qnn_models/llm_backbone.bin \
    -o qnn_models -t x86_64-linux-clang
```
#### 3. Action Head (requires ONNX fixes first)

```bash
# Step 1: Apply 3 ONNX graph fixes (RMSNorm + Slice->Gather + Constants->Initializers)
python3 scripts/fix_action_head_all.py \
    --input onnx_models/action_head.onnx \
    --output onnx_models/action_head_qnn_v2.onnx

# Step 2: Convert fixed ONNX to QNN
# I/O: noisy_actions[1,50,32] + timestep[1] + prefix_pad_masks[1,177]
#      + kv_keys/values[32,1,177,5,64] -> velocity[1,50,32]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/action_head_qnn_v2.onnx \
    --input_dim noisy_actions 1,50,32 \
    --input_dim timestep 1 \
    --input_dim prefix_pad_masks 1,177 \
    --input_dim kv_keys 32,1,177,5,64 \
    --input_dim kv_values 32,1,177,5,64 \
    --output_path qnn_models/action_head_v2 \
    --float_bitwidth 32

# Step 3: Build shared library
$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/action_head_v2.cpp \
    -b qnn_models/action_head_v2.bin \
    -o qnn_models -t x86_64-linux-clang
```
### Action Head ONNX Fixes (Detail)

The original action_head.onnx contains 3 patterns incompatible with QNN:

| Fix | Count | Problem | Solution |
|---|---|---|---|
| RMSNorm | 64 | Div output shape [1,50,1] rejected (expects [1,50,480]) | Rewrite `Div(x,n)*w` to `Div(x*w,n)` |
| Slice → Gather | 65 | QNN StridedSlice validation fails at runtime | Replace with static-index Gather |
| Constants → Initializers | all | QNN converter ignores interior Constant nodes | Move to graph initializers |

All fixes are in `scripts/fix_action_head_all.py`. The fixed model produces identical outputs (cosine similarity = 1.000000).
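As a sanity check on the RMSNorm rewrite in the table above, the two forms are algebraically identical. A quick numerical illustration (not repository code; shapes chosen to match the table):

```python
# Illustrative check that the RMSNorm rewrite preserves the result:
# (x / n) * w  ==  (x * w) / n  elementwise.
import numpy as np

x = np.random.randn(1, 50, 480).astype(np.float32)
w = np.random.randn(480).astype(np.float32)
n = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)  # RMS denominator, shape [1, 50, 1]

original = (x / n) * w
rewritten = (x * w) / n
print(np.allclose(original, rewritten, atol=1e-5))  # True
```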
### QNN SDK Patches Required

QNN SDK v2.43.0 has known bugs that must be patched before conversion:

| File | Issue | Fix |
|---|---|---|
| `onnx_model_api.py` | `spawn_process_and_exec()` corrupts numpy data | Replace with direct function call |
| `op_adapter.py` (ReshapeOp) | C++ binding corrupts shape data | Pure Python override |
| `op_adapter.py` (TransposeOp) | Permutation order gets corrupted | `list(perm)` conversion |
## Citation

If you use these models, please cite:

```bibtex
@article{smolvla,
  title={SmolVLA: Efficient Vision Language Models for Embodied AI},
  author={...},
  journal={...},
  year={2024}
}

@article{libero,
  title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author={...},
  journal={...},
  year={2023}
}
```
## License
These QNN-converted models are provided under the same license as the original SmolVLA models. See the original repository for license details.
## Support & Issues

For issues related to:

- QNN conversion: See `scripts/fix_action_head_all.py` and the conversion logs
- Inference: Check `scripts/infer_libero_episode_qnn_cpu.py` for implementation details
- Verification: Run `scripts/compare_onnx_vs_qnn.py` to validate outputs
- ONNX export: See `scripts/export_smolvla_to_onnx.py`
## Related Repositories
- ONNX Version: xpuenabler/smolvla-libero-ONNX
- Original Model: HuggingFaceVLA/smolvla_libero
- LIBERO Benchmark: LIBERO GitHub
**Last Updated**: March 2025
**QNN SDK Version**: v2.43.0
**Verification Status**: ✅ All models verified (cosine_similarity = 1.000000)