# SmolVLA LIBERO - QNN CPU

QNN-converted SmolVLA models for CPU inference on x86_64 Linux.
## Model Overview

This repository contains the SmolVLA LIBERO models converted with the Qualcomm Neural Network (QNN) SDK v2.43.0 and optimized for CPU inference on the x86_64-linux-clang architecture.
| Property | Value |
|---|---|
| Source Model | HuggingFaceVLA/smolvla_libero |
| ONNX Intermediate | xpuenabler/smolvla-libero-ONNX |
| QNN SDK Version | v2.43.0 |
| Target Architecture | x86_64-linux-clang (CPU backend) |
| Float Bitwidth | 32-bit (no quantization) |
| Verification | Cosine similarity = 1.000000 vs ONNX baseline |
## Model Architecture
SmolVLA LIBERO is decomposed into 3 QNN components:
1. **Vision Encoder** (`libvision_encoder.so`)
   - Size: 376 MB (.so) + 375 MB (.bin)
   - Purpose: Encodes RGB images to visual embeddings
   - Input: `pixel_values` - RGB images
   - Output: `image_embeddings` - Visual feature vectors

2. **LLM Backbone** (`libllm_backbone.so`)
   - Size: 1.4 GB (.so) + 1.4 GB (.bin)
   - Purpose: Language model processing with cross-attention to vision
   - Inputs: Image embeddings, language tokens, attention masks
   - Outputs: KV cache, prefix padding masks

3. **Action Head v2** (`libaction_head_v2.so`)
   - Size: 375 MB (.so) + 372 MB (.bin)
   - Purpose: Predicts robot action velocities
   - Inputs: Noisy actions, KV cache, prefix masks
   - Output: Action velocity predictions
## Critical I/O Tensor Transposition Requirements

⚠️ **IMPORTANT**: QNN models require specific tensor layouts. The following transpositions are MANDATORY (a numpy sketch follows the tables below):
### Vision Encoder

**Input (`pixel_values`):**
- ONNX format: `[1, 3, 512, 512]` (channels-first)
- QNN format: `[1, 512, 512, 3]` (channels-last)
- Transpose: `(0, 2, 3, 1)`

**Output (`image_embeddings`):**
- ONNX format: `[1, 64, 960]`
- QNN format: `[1, 64, 960]` (no transpose needed)
### LLM Backbone

**Input (`image_embeddings`):**
- ONNX format: `[1, 64, 960]`
- QNN format: `[1, 960, 64]`
- Transpose: `(0, 2, 1)`

**Inputs (`language_tokens`, `attention_masks`):**
- ONNX format: int64
- QNN format: float32 (CRITICAL: QNN SDK v2.43 converts all int64 to float32)

**Outputs (`kv_cache`, `prefix_pad_masks`):**
- Format: float32 (no transpose needed)
### Action Head v2

**Input (`noisy_actions`):**
- ONNX format: `[1, 8, 7]`
- QNN format: `[1, 7, 8]`
- Transpose: `(0, 2, 1)`

**Inputs (`kv_keys`, `kv_values`):**
- ONNX format: `[1, num_heads, seq_len, head_dim]`
- QNN format: `[1, seq_len, head_dim, num_heads]`
- Transpose: `(0, 2, 3, 4, 1)`

**Input (`prefix_pad_masks`):**
- Format: float32

**Output (`velocity`):**
- ONNX format: `[1, 8, 7]`
- QNN format: `[1, 7, 8]`
- Transpose back: `(0, 2, 1)`
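The host-side layout conversions above can be expressed with plain numpy transposes. The following is a minimal, illustrative sketch only (the bundled `infer_libero_episode_qnn_cpu.py` applies the equivalent conversions internally):

```python
# Illustrative sketch of the ONNX -> QNN layout conversions listed above.
import numpy as np

# Vision encoder input: channels-first -> channels-last
pixel_values_onnx = np.zeros((1, 3, 512, 512), dtype=np.float32)
pixel_values_qnn = pixel_values_onnx.transpose(0, 2, 3, 1)     # [1, 512, 512, 3]

# LLM backbone input: image embeddings [1, 64, 960] -> [1, 960, 64]
image_embeddings = np.zeros((1, 64, 960), dtype=np.float32)
image_embeddings_qnn = image_embeddings.transpose(0, 2, 1)     # [1, 960, 64]

# Action head: noisy actions in, velocities out, both swapped on the last two axes
noisy_actions = np.zeros((1, 8, 7), dtype=np.float32)
noisy_actions_qnn = noisy_actions.transpose(0, 2, 1)           # [1, 7, 8]
velocity_qnn = np.zeros((1, 7, 8), dtype=np.float32)           # raw QNN output
velocity = velocity_qnn.transpose(0, 2, 1)                     # back to ONNX layout [1, 8, 7]
```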
## Critical QNN SDK v2.43 Behavior

⚠️ **KNOWN LIMITATION**: QNN SDK v2.43.0 automatically converts ALL int64 tensors to float32 during model conversion. This affects:

- Language token inputs (originally int64 in ONNX)
- Attention mask inputs (originally int64 in ONNX)

**Solution**: Cast these inputs to float32 before passing them to the QNN models, as sketched below. See `infer_libero_episode_qnn_cpu.py` for the full implementation.
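A minimal illustration of the cast (variable names and the `[1, 48]` shapes are taken from the conversion commands later in this document, not from the script itself):

```python
# Illustrative only: tokenizers emit int64 ids/masks, but the converted QNN
# graphs expect float32 for these inputs.
import numpy as np

lang_tokens = np.zeros((1, 48), dtype=np.int64)    # token ids (int64)
lang_masks = np.ones((1, 48), dtype=np.int64)      # attention mask (int64)

lang_tokens_qnn = lang_tokens.astype(np.float32)   # cast before feeding the LLM backbone
lang_masks_qnn = lang_masks.astype(np.float32)
```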
## Verification Results

All 3 models have been verified against their ONNX counterparts:

- Vision Encoder: cosine_similarity = 1.000000 ✅
- LLM Backbone: cosine_similarity = 1.000000 ✅
- Action Head v2: cosine_similarity = 1.000000 ✅

Verification script: `scripts/compare_onnx_vs_qnn.py`
## ONNX Graph Fixes Applied

The action_head required 3 critical ONNX graph fixes before QNN conversion:

- RMSNorm Fix: Converted RMSNorm operations to equivalent layer norm patterns
- Slice → Gather Fix: Replaced dynamic slice operations with gather operations
- Constants → Initializers Fix: Converted constant nodes to initializer tensors

These fixes are documented in `scripts/fix_action_head_all.py`; a sketch of the third fix follows.
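For illustration only, a minimal sketch of the Constants → Initializers pattern using the standard `onnx` Python API (it assumes the Constant nodes carry a `value` tensor attribute; the output path is hypothetical). The complete implementation, including the RMSNorm and Slice → Gather rewrites, is in `scripts/fix_action_head_all.py`:

```python
# Sketch: promote interior Constant nodes to graph initializers so the QNN
# converter can see their values.
import onnx

model = onnx.load("onnx_models/action_head.onnx")
graph = model.graph

for node in [n for n in graph.node if n.op_type == "Constant"]:
    # Constant nodes store their payload in the "value" attribute (a TensorProto).
    tensor = next(attr.t for attr in node.attribute if attr.name == "value")
    tensor.name = node.output[0]        # initializer name must match the consumer's input
    graph.initializer.append(tensor)    # promote the constant to a graph initializer
    graph.node.remove(node)             # drop the now-redundant Constant node

onnx.save(model, "onnx_models/action_head_constants_fixed.onnx")  # hypothetical output path
```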
## File Structure

```
xpuenabler/smolvla-libero-QNN-CPU/
├── qnn_models/
│   ├── libvision_encoder.so           # Vision encoder QNN library
│   ├── libllm_backbone.so             # LLM backbone QNN library
│   ├── libaction_head_v2.so           # Action head QNN library
│   ├── vision_encoder.bin             # Vision encoder weights
│   ├── llm_backbone.bin               # LLM backbone weights
│   ├── action_head_v2.bin             # Action head weights
│   ├── vision_encoder_net.json        # Vision encoder graph definition
│   ├── llm_backbone_net.json          # LLM backbone graph definition
│   └── action_head_v2_net.json        # Action head graph definition
├── config.json                        # Model configuration
├── policy_preprocessor.json           # Input preprocessing config
├── policy_postprocessor.json          # Output postprocessing config
├── policy_preprocessor_step_5_normalizer_processor.safetensors
├── policy_postprocessor_step_1_unnormalizer_processor.safetensors
├── convert_onnx_to_qnn.sh             # ONNX → QNN conversion script (all 3 models)
├── scripts/
│   ├── infer_libero_episode_qnn_cpu.py   # Main inference script
│   ├── fix_action_head_all.py            # ONNX graph fixes for action_head
│   ├── compare_onnx_vs_qnn.py            # ONNX vs QNN verification script
│   └── export_smolvla_to_onnx.py         # PyTorch → ONNX export utility
├── videos/                            # LIBERO evaluation episode videos
└── README.md                          # This file
```
## Installation & Setup

### Requirements

- QNN SDK: v2.43.0 or later
- Python: 3.10+
- OS: Linux x86_64
- Dependencies: numpy, torch, transformers, huggingface_hub

### Environment Setup
```bash
# Set QNN SDK path
export QNN_SDK_ROOT=/path/to/qnn-sdk-v2.43.0

# Add QNN libraries to library path
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH

# Verify QNN tools are available
which qnn-net-run
```
### Python Dependencies

```bash
pip install numpy torch transformers huggingface_hub
```
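To fetch the converted model files locally, one option is `huggingface_hub.snapshot_download` (the repo id below is the one shown in the File Structure section; the target directory is arbitrary):

```python
# Optional: download the QNN model repository from the Hub before running inference.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="xpuenabler/smolvla-libero-QNN-CPU",
    local_dir="./smolvla-libero-QNN-CPU",  # qnn_models/, config.json, scripts/ land here
)
print(local_dir)
```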
## Usage

### Basic Inference
```python
from infer_libero_episode_qnn_cpu import QNNSmolVLAInference

# Initialize inference engine
inference = QNNSmolVLAInference(
    qnn_models_dir="./qnn_models",
    config_path="./config.json",
    preprocessor_config="./policy_preprocessor.json",
    postprocessor_config="./policy_postprocessor.json"
)

# Run inference on an image and language instruction
image = ...  # PIL Image or numpy array [H, W, 3]
instruction = "pick up the red cube"
action = inference.predict(image, instruction)

print(f"Predicted action: {action}")  # [8, 7] velocity vector
```
### Command-Line Inference

```bash
python scripts/infer_libero_episode_qnn_cpu.py \
    --image_path /path/to/image.jpg \
    --instruction "pick up the red cube" \
    --qnn_models_dir ./qnn_models \
    --config_path ./config.json
```
### Verification Against ONNX

```bash
python scripts/compare_onnx_vs_qnn.py \
    --onnx_model_path /path/to/onnx/model \
    --qnn_models_dir ./qnn_models \
    --num_samples 100
```
## QNN Model Inspection

### View Model Graph

```bash
qnn-net-run \
    --model qnn_models/vision_encoder_net.json \
    --input_list input_list.txt \
    --debug
```
### Extract Model Information

```bash
python -c "
import json
with open('qnn_models/vision_encoder_net.json') as f:
    graph = json.load(f)
print('Inputs:', [n['name'] for n in graph['graph_nodes'] if n['op'] == 'INPUT'])
print('Outputs:', [n['name'] for n in graph['graph_nodes'] if n['op'] == 'OUTPUT'])
"
```
## Performance Characteristics

| Model | Size | Latency (CPU) | Memory |
|---|---|---|---|
| Vision Encoder | 751 MB | ~50-100 ms | ~1.5 GB |
| LLM Backbone | 2.8 GB | ~200-300 ms | ~5.6 GB |
| Action Head v2 | 747 MB | ~50-100 ms | ~1.5 GB |
| **Total** | 4.3 GB | ~300-500 ms | ~8.6 GB |

*Latencies measured on an Intel Xeon CPU @ 2.20 GHz with 8 cores.*
## Known Issues & Workarounds

### Issue 1: int64 → float32 Conversion

- **Problem**: QNN SDK v2.43 converts all int64 tensors to float32
- **Workaround**: Cast language tokens and masks to float32 before inference
- **Status**: Implemented in `infer_libero_episode_qnn_cpu.py`

### Issue 2: Tensor Layout Mismatches

- **Problem**: QNN uses channels-last format, ONNX uses channels-first
- **Workaround**: Apply the required transpositions (see I/O Tensor Transposition Requirements)
- **Status**: Implemented in the inference script

### Issue 3: Dynamic Shapes

- **Problem**: QNN models are compiled with a fixed batch size (1)
- **Workaround**: Only batch_size=1 is supported; reshape inputs as needed (see the sketch below)
- **Status**: Limitation of the current conversion
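A minimal illustration of the fixed-batch constraint (shapes only, not repository code):

```python
# Illustrative only: the converted graphs expect a leading batch dimension of 1,
# so a single sample must be expanded before being fed to the QNN models.
import numpy as np

image = np.zeros((512, 512, 3), dtype=np.float32)   # one channels-last image
batched = image[np.newaxis, ...]                     # -> [1, 512, 512, 3]
assert batched.shape[0] == 1                         # larger batches are not supported
```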
## Troubleshooting

### Error: "Cannot find libvision_encoder.so"

```bash
# Ensure LD_LIBRARY_PATH includes the QNN SDK libraries
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH
```

### Error: "Input shape mismatch"

```bash
# Check the tensor transposition requirements above
# Verify input shapes match the expected dimensions
# Use compare_onnx_vs_qnn.py to debug
```

### Error: "QNN SDK version mismatch"

```bash
# Ensure QNN SDK v2.43.0 is installed
# Models may not be compatible with other versions
qnn-net-run --version
```
## LIBERO Evaluation Results

Evaluation on LIBERO benchmark tasks (n_action_steps=10, max_steps=520):

| Task ID | Task Description | Steps | Success |
|---|---|---|---|
| 0 | put_both_the_alphabet_soup_and_the_tomato_sauce_in_the_basket | 520 | ❌ |
| 1 | put_both_the_cream_cheese_box_and_the_butter_in_the_basket | 520 | ❌ |
| 2 | put_both_the_alphabet_soup_and_the_cream_cheese_box_in_the_basket | 520 | ❌ |
| 3 | put_the_black_bowl_in_the_back_compartment_of_the_caddy | 221 | ✅ |
| 4 | put_the_butter_at_the_back_of_the_top_shelf | 520 | ❌ |

QNN CPU Success Rate: 1/5 (20%)

Episode videos are available in the `videos/` directory.
## ONNX → QNN Conversion

### Conversion Script

A fully reproducible conversion script is provided: `convert_onnx_to_qnn.sh`
```bash
# Prerequisites
export QNN_SDK=/path/to/qairt/2.43.0.260128
export QNN_PYTHON=/path/to/python3.10
export PATH=/path/to/llvm/bin:$PATH   # clang++ required for .so build

# Run conversion (all 3 models)
bash convert_onnx_to_qnn.sh \
    --onnx-dir ./onnx_models \
    --output-dir ./qnn_models
```
The script performs 3 stages per model:

1. `qnn-onnx-converter` → ONNX → QNN graph (.cpp) + weights (.bin)
2. `qnn-model-lib-generator` → QNN graph → shared library (.so)
3. `qnn-net-run` (optional) → smoke test with the CPU backend

For the action_head, it also runs `fix_action_head_all.py` first to apply the required ONNX graph fixes.
### Conversion Pipeline

```
PyTorch (HuggingFaceVLA/smolvla_libero)
  |
  v  export_smolvla_to_onnx.py
ONNX (xpuenabler/smolvla-libero-ONNX)
  |
  +-- vision_encoder.onnx -----------> qnn-onnx-converter -> qnn-model-lib-generator -> libvision_encoder.so
  +-- llm_backbone.onnx -------------> qnn-onnx-converter -> qnn-model-lib-generator -> libllm_backbone.so
  +-- action_head.onnx -> fix_all.py -> qnn-onnx-converter -> qnn-model-lib-generator -> libaction_head_v2.so
```
### Per-Model Conversion Commands

#### 1. Vision Encoder

```bash
# I/O: pixel_values [1,3,512,512] -> image_embeddings [1,64,960]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/vision_encoder.onnx \
    --input_dim pixel_values 1,3,512,512 \
    --output_path qnn_models/vision_encoder \
    --float_bitwidth 32

$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/vision_encoder.cpp \
    -b qnn_models/vision_encoder.bin \
    -o qnn_models -t x86_64-linux-clang
```
#### 2. LLM Backbone

```bash
# I/O: image_embs[1,64,960] x2 + lang_tokens[1,48] + lang_masks[1,48] + state[1,32]
#      -> kv_keys/values[32,1,177,5,64] + prefix_pad_masks[1,177]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/llm_backbone.onnx \
    --input_dim image_embs_1 1,64,960 \
    --input_dim image_embs_2 1,64,960 \
    --input_dim lang_tokens 1,48 \
    --input_dim lang_masks 1,48 \
    --input_dim state 1,32 \
    --output_path qnn_models/llm_backbone \
    --float_bitwidth 32

$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/llm_backbone.cpp \
    -b qnn_models/llm_backbone.bin \
    -o qnn_models -t x86_64-linux-clang
```
#### 3. Action Head (requires ONNX fixes first)

```bash
# Step 1: Apply 3 ONNX graph fixes (RMSNorm + Slice->Gather + Constants->Initializers)
python3 scripts/fix_action_head_all.py \
    --input onnx_models/action_head.onnx \
    --output onnx_models/action_head_qnn_v2.onnx

# Step 2: Convert fixed ONNX to QNN
# I/O: noisy_actions[1,50,32] + timestep[1] + prefix_pad_masks[1,177]
#      + kv_keys/values[32,1,177,5,64] -> velocity[1,50,32]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/action_head_qnn_v2.onnx \
    --input_dim noisy_actions 1,50,32 \
    --input_dim timestep 1 \
    --input_dim prefix_pad_masks 1,177 \
    --input_dim kv_keys 32,1,177,5,64 \
    --input_dim kv_values 32,1,177,5,64 \
    --output_path qnn_models/action_head_v2 \
    --float_bitwidth 32

# Step 3: Build shared library
$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/action_head_v2.cpp \
    -b qnn_models/action_head_v2.bin \
    -o qnn_models -t x86_64-linux-clang
```
### Action Head ONNX Fixes (Detail)

The original action_head.onnx contains 3 patterns incompatible with QNN:

| Fix | Count | Problem | Solution |
|---|---|---|---|
| RMSNorm | 64 | Div output shape [1,50,1] rejected (expects [1,50,480]) | Rewrite `Div(x,n)*w` to `Div(x*w,n)` |
| Slice → Gather | 65 | QNN StridedSlice validation fails at runtime | Replace with static-index Gather |
| Constants → Initializers | all | QNN converter ignores interior Constant nodes | Move to graph initializers |

All fixes are in `scripts/fix_action_head_all.py`. The fixed model produces identical outputs (cosine similarity = 1.000000).
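As a sanity check on the RMSNorm rewrite in the table above, the two forms are algebraically identical. A quick numerical illustration (not repository code; shapes chosen to match the table):

```python
# Illustrative check that the RMSNorm rewrite preserves the result:
# (x / n) * w  ==  (x * w) / n  elementwise.
import numpy as np

x = np.random.randn(1, 50, 480).astype(np.float32)
w = np.random.randn(480).astype(np.float32)
n = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)  # RMS denominator, shape [1, 50, 1]

original = (x / n) * w
rewritten = (x * w) / n
print(np.allclose(original, rewritten, atol=1e-5))  # True
```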
### QNN SDK Patches Required

QNN SDK v2.43.0 has known bugs that must be patched before conversion:

| File | Issue | Fix |
|---|---|---|
| `onnx_model_api.py` | `spawn_process_and_exec()` corrupts numpy data | Replace with direct function call |
| `op_adapter.py` (ReshapeOp) | C++ binding corrupts shape data | Pure Python override |
| `op_adapter.py` (TransposeOp) | Permutation order gets corrupted | `list(perm)` conversion |
## Citation

If you use these models, please cite:

```bibtex
@article{smolvla,
  title={SmolVLA: Efficient Vision Language Models for Embodied AI},
  author={...},
  journal={...},
  year={2024}
}

@article{libero,
  title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author={...},
  journal={...},
  year={2023}
}
```
## License
These QNN-converted models are provided under the same license as the original SmolVLA models. See the original repository for license details.
## Support & Issues

For issues related to:

- QNN conversion: See `scripts/fix_action_head_all.py` and the conversion logs
- Inference: Check `scripts/infer_libero_episode_qnn_cpu.py` for implementation details
- Verification: Run `scripts/compare_onnx_vs_qnn.py` to validate outputs
- ONNX export: See `scripts/export_smolvla_to_onnx.py`
## Related Repositories
- ONNX Version: xpuenabler/smolvla-libero-ONNX
- Original Model: HuggingFaceVLA/smolvla_libero
- LIBERO Benchmark: LIBERO GitHub
**Last Updated**: March 2025
**QNN SDK Version**: v2.43.0
**Verification Status**: ✅ All models verified (cosine_similarity = 1.000000)