
SmolVLA LIBERO - QNN CPU

QNN-converted SmolVLA models for CPU inference on x86_64 Linux

Model Overview

This repository contains the SmolVLA LIBERO models converted with the Qualcomm Neural Network (QNN) SDK v2.43.0 and optimized for CPU inference on the x86_64-linux-clang architecture.

| Property | Value |
|---|---|
| Source Model | HuggingFaceVLA/smolvla_libero |
| ONNX Intermediate | xpuenabler/smolvla-libero-ONNX |
| QNN SDK Version | v2.43.0 |
| Target Architecture | x86_64-linux-clang (CPU backend) |
| Float Bitwidth | 32-bit (no quantization) |
| Verification | Cosine similarity = 1.000000 vs ONNX baseline |

Model Architecture

SmolVLA LIBERO is decomposed into 3 QNN components:

1. Vision Encoder (libvision_encoder.so)

  • Size: 376 MB (.so) + 375 MB (.bin)
  • Purpose: Encodes RGB images to visual embeddings
  • Input: pixel_values - RGB images
  • Output: image_embeddings - Visual feature vectors

2. LLM Backbone (libllm_backbone.so)

  • Size: 1.4 GB (.so) + 1.4 GB (.bin)
  • Purpose: Language model processing with cross-attention to vision
  • Inputs: Image embeddings (two camera views), language tokens, attention masks, robot state
  • Outputs: KV cache, prefix padding masks

3. Action Head v2 (libaction_head_v2.so)

  • Size: 375 MB (.so) + 372 MB (.bin)
  • Purpose: Predicts robot action velocities
  • Inputs: Noisy actions, denoising timestep, KV cache, prefix masks
  • Output: Action velocity predictions
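
To make the dataflow between the three components concrete, here is a minimal Python sketch of one inference pass. The run_qnn() helper is hypothetical (a stand-in for executing a QNN .so on the CPU backend), the two-view encoding follows the image_embs_1/image_embs_2 inputs of the conversion commands below, and tensor-layout transposes plus the iterative denoising loop are omitted; the real logic lives in scripts/infer_libero_episode_qnn_cpu.py.

import numpy as np

def run_qnn(model_so, **inputs):
    """Hypothetical placeholder for executing one QNN model library on the
    CPU backend; see scripts/infer_libero_episode_qnn_cpu.py for the real
    loading and execution logic."""
    raise NotImplementedError

def predict_action_chunk(image_view_1, image_view_2, lang_tokens, lang_masks, state):
    # 1. Vision encoder: each RGB camera view -> visual embeddings [1, 64, 960]
    embs_1 = run_qnn("libvision_encoder.so", pixel_values=image_view_1)
    embs_2 = run_qnn("libvision_encoder.so", pixel_values=image_view_2)

    # 2. LLM backbone: fuse vision, language, and robot state into the KV cache
    #    reused by the action head
    kv_keys, kv_values, prefix_pad_masks = run_qnn(
        "libllm_backbone.so",
        image_embs_1=embs_1, image_embs_2=embs_2,
        lang_tokens=lang_tokens, lang_masks=lang_masks, state=state,
    )

    # 3. Action head: predict the velocity that denoises a noisy action chunk
    #    (chunk shape per the conversion commands below; timestep value is illustrative)
    noisy_actions = np.random.randn(1, 50, 32).astype(np.float32)
    velocity = run_qnn(
        "libaction_head_v2.so",
        noisy_actions=noisy_actions,
        timestep=np.array([1.0], dtype=np.float32),
        prefix_pad_masks=prefix_pad_masks,
        kv_keys=kv_keys, kv_values=kv_values,
    )
    return velocity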

Critical I/O Tensor Transposition Requirements

⚠️ IMPORTANT: QNN models require specific tensor layouts. The following transpositions are MANDATORY:

Vision Encoder

Input (pixel_values):
  ONNX format: [1, 3, 512, 512]  (channels-first)
  QNN format:  [1, 512, 512, 3]  (channels-last)
  Transpose:   (0, 2, 3, 1)

Output (image_embeddings):
  ONNX format: [1, 64, 960]
  QNN format:  [1, 64, 960]  (no transpose needed)

LLM Backbone

Input (image_embeddings):
  ONNX format: [1, 64, 960]
  QNN format:  [1, 960, 64]  (transpose 0,2,1)
  Transpose:   (0, 2, 1)

Input (language_tokens, attention_masks):
  ONNX format: int64
  QNN format:  float32  (CRITICAL: QNN SDK v2.43 converts all int64 to float32)
  
Output (kv_cache, prefix_pad_masks):
  Format:      float32 (no transpose needed)

Action Head v2

Input (noisy_actions):
  ONNX format: [1, 8, 7]
  QNN format:  [1, 7, 8]  (transpose 0,2,1)
  Transpose:   (0, 2, 1)

Input (kv_keys, kv_values):
  ONNX format: [num_layers, 1, seq_len, num_heads, head_dim]  (e.g. [32, 1, 177, 5, 64])
  QNN format:  [num_layers, seq_len, num_heads, head_dim, 1]
  Transpose:   (0, 2, 3, 4, 1)

Input (prefix_pad_masks):
  Format:      float32

Output (velocity):
  ONNX format: [1, 8, 7]
  QNN format:  [1, 7, 8]  (transpose back 0,2,1)
  Transpose:   (0, 2, 1)
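
For reference, a minimal numpy sketch of the layout conversions listed above (ONNX-layout arrays in, QNN-layout arrays out). Shapes follow the tables in this section and are illustrative only:

import numpy as np

# Vision encoder input: channels-first -> channels-last
pixel_values_onnx = np.zeros((1, 3, 512, 512), dtype=np.float32)
pixel_values_qnn = pixel_values_onnx.transpose(0, 2, 3, 1)        # [1, 512, 512, 3]

# LLM backbone input: image embeddings
image_embeddings_onnx = np.zeros((1, 64, 960), dtype=np.float32)
image_embeddings_qnn = image_embeddings_onnx.transpose(0, 2, 1)   # [1, 960, 64]

# Action head inputs: noisy actions and KV cache
noisy_actions_onnx = np.zeros((1, 8, 7), dtype=np.float32)
noisy_actions_qnn = noisy_actions_onnx.transpose(0, 2, 1)         # [1, 7, 8]
kv_keys_onnx = np.zeros((32, 1, 177, 5, 64), dtype=np.float32)
kv_keys_qnn = kv_keys_onnx.transpose(0, 2, 3, 4, 1)               # [32, 177, 5, 64, 1]

# Action head output: transpose the velocity back to ONNX layout
velocity_qnn = np.zeros((1, 7, 8), dtype=np.float32)
velocity_onnx = velocity_qnn.transpose(0, 2, 1)                   # [1, 8, 7]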

Critical QNN SDK v2.43 Behavior

⚠️ KNOWN LIMITATION: QNN SDK v2.43.0 automatically converts ALL int64 tensors to float32 during model conversion. This affects:

  • Language token inputs (originally int64 in ONNX)
  • Attention mask inputs (originally int64 in ONNX)

Solution: Cast these inputs to float32 before passing to QNN models. See infer_libero_episode_qnn_cpu.py for implementation.
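
A minimal sketch of the required cast, assuming the tokenizer produces int64 numpy arrays of shape [1, 48] (matching the lang_tokens/lang_masks dimensions in the conversion commands below):

import numpy as np

lang_tokens = np.zeros((1, 48), dtype=np.int64)   # tokenizer output (int64 in ONNX)
lang_masks = np.ones((1, 48), dtype=np.int64)     # attention mask (int64 in ONNX)

# QNN SDK v2.43 compiled these inputs as float32, so cast before running the model
lang_tokens_qnn = lang_tokens.astype(np.float32)
lang_masks_qnn = lang_masks.astype(np.float32)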

Verification Results

All 3 models have been verified against their ONNX counterparts:

Vision Encoder:    cosine_similarity = 1.000000 ✓
LLM Backbone:      cosine_similarity = 1.000000 ✓
Action Head v2:    cosine_similarity = 1.000000 ✓

Verification script: scripts/compare_onnx_vs_qnn.py
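
The metric is ordinary cosine similarity over flattened outputs; a minimal sketch of the computation (the full comparison, including sample generation, lives in scripts/compare_onnx_vs_qnn.py):

import numpy as np

def cosine_similarity(onnx_out: np.ndarray, qnn_out: np.ndarray) -> float:
    # Flatten both outputs and compare direction, not magnitude
    a = onnx_out.reshape(-1).astype(np.float64)
    b = qnn_out.reshape(-1).astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))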

ONNX Graph Fixes Applied

The action_head required 3 critical ONNX graph fixes before QNN conversion:

  1. RMSNorm Fix: Rewrote RMSNorm subgraphs into an equivalent QNN-compatible pattern (see Action Head ONNX Fixes below)
  2. Slice→Gather Fix: Replaced dynamic slice operations with gather operations
  3. Constants→Initializers Fix: Converted constant nodes to initializer tensors

These fixes are documented in scripts/fix_action_head_all.py.

File Structure

xpuenabler/smolvla-libero-QNN-CPU/
├── qnn_models/
│   ├── libvision_encoder.so           # Vision encoder QNN library
│   ├── libllm_backbone.so             # LLM backbone QNN library
│   ├── libaction_head_v2.so           # Action head QNN library
│   ├── vision_encoder.bin             # Vision encoder weights
│   ├── llm_backbone.bin               # LLM backbone weights
│   ├── action_head_v2.bin             # Action head weights
│   ├── vision_encoder_net.json        # Vision encoder graph definition
│   ├── llm_backbone_net.json          # LLM backbone graph definition
│   └── action_head_v2_net.json        # Action head graph definition
├── config.json                        # Model configuration
├── policy_preprocessor.json           # Input preprocessing config
├── policy_postprocessor.json          # Output postprocessing config
├── policy_preprocessor_step_5_normalizer_processor.safetensors
├── policy_postprocessor_step_1_unnormalizer_processor.safetensors
├── convert_onnx_to_qnn.sh             # ONNX → QNN conversion script (all 3 models)
├── scripts/
│   ├── infer_libero_episode_qnn_cpu.py    # Main inference script
│   ├── fix_action_head_all.py             # ONNX graph fixes for action_head
│   ├── compare_onnx_vs_qnn.py             # ONNX vs QNN verification script
│   └── export_smolvla_to_onnx.py          # PyTorch → ONNX export utility
├── videos/                            # LIBERO evaluation episode videos
└── README.md                          # This file

Installation & Setup

Requirements

  • QNN SDK: v2.43.0 (models converted and verified with this version; other versions may be incompatible, see Troubleshooting)
  • Python: 3.10+
  • OS: Linux x86_64
  • Dependencies: numpy, torch, transformers, huggingface_hub

Environment Setup

# Set QNN SDK path
export QNN_SDK_ROOT=/path/to/qnn-sdk-v2.43.0

# Add QNN libraries to library path
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH

# Verify QNN tools are available
which qnn-net-run

Python Dependencies

pip install numpy torch transformers huggingface_hub
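
To fetch the QNN artifacts locally, the standard huggingface_hub snapshot download can be used (a sketch; the repo id matches this repository, and the default cache location is used unless you pass local_dir):

from huggingface_hub import snapshot_download

# Downloads qnn_models/, configs, and scripts/ into the local HF cache
local_dir = snapshot_download(repo_id="xpuenabler/smolvla-libero-QNN-CPU")
print(f"QNN models available under: {local_dir}/qnn_models")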

Usage

Basic Inference

from infer_libero_episode_qnn_cpu import QNNSmolVLAInference

# Initialize inference engine
inference = QNNSmolVLAInference(
    qnn_models_dir="./qnn_models",
    config_path="./config.json",
    preprocessor_config="./policy_preprocessor.json",
    postprocessor_config="./policy_postprocessor.json"
)

# Run inference on an image and language instruction
image = ...  # PIL Image or numpy array [H, W, 3]
instruction = "pick up the red cube"

action = inference.predict(image, instruction)
print(f"Predicted action: {action}")  # [8, 7] velocity vector

Command-Line Inference

python scripts/infer_libero_episode_qnn_cpu.py \
    --image_path /path/to/image.jpg \
    --instruction "pick up the red cube" \
    --qnn_models_dir ./qnn_models \
    --config_path ./config.json

Verification Against ONNX

python scripts/compare_onnx_vs_qnn.py \
    --onnx_model_path /path/to/onnx/model \
    --qnn_models_dir ./qnn_models \
    --num_samples 100

QNN Model Inspection

View Model Graph

qnn-net-run \
    --model qnn_models/vision_encoder_net.json \
    --input_list input_list.txt \
    --debug

Extract Model Information

python -c "
import json
with open('qnn_models/vision_encoder_net.json') as f:
    graph = json.load(f)
    print('Inputs:', [n['name'] for n in graph['graph_nodes'] if n['op'] == 'INPUT'])
    print('Outputs:', [n['name'] for n in graph['graph_nodes'] if n['op'] == 'OUTPUT'])
"

Performance Characteristics

| Model | Size | Latency (CPU) | Memory |
|---|---|---|---|
| Vision Encoder | 751 MB | ~50-100 ms | ~1.5 GB |
| LLM Backbone | 2.8 GB | ~200-300 ms | ~5.6 GB |
| Action Head v2 | 747 MB | ~50-100 ms | ~1.5 GB |
| Total | 4.3 GB | ~300-500 ms | ~8.6 GB |

Latencies measured on an Intel Xeon CPU @ 2.20 GHz with 8 cores.

Known Issues & Workarounds

Issue 1: int64 β†’ float32 Conversion

Problem: QNN SDK v2.43 converts all int64 tensors to float32
Workaround: Cast language tokens and masks to float32 before inference
Status: Implemented in infer_libero_episode_qnn_cpu.py

Issue 2: Tensor Layout Mismatches

Problem: QNN uses channels-last format, ONNX uses channels-first
Workaround: Apply the required transpositions (see I/O Tensor Transposition Requirements)
Status: Implemented in the inference script

Issue 3: Dynamic Shapes

Problem: QNN models are compiled with a fixed batch size (1)
Workaround: Only batch_size=1 is supported; reshape inputs as needed
Status: Limitation of the current conversion

Troubleshooting

Error: "Cannot find libvision_encoder.so"

# Ensure LD_LIBRARY_PATH includes QNN SDK libraries
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH

Error: "Input shape mismatch"

# Check tensor transposition requirements above
# Verify input shapes match expected dimensions
# Use compare_onnx_vs_qnn.py to debug

Error: "QNN SDK version mismatch"

# Ensure QNN SDK v2.43.0 is installed
# Models may not be compatible with other versions
qnn-net-run --version

LIBERO Evaluation Results

Evaluation on LIBERO benchmark tasks (n_action_steps=10, max_steps=520):

| Task ID | Task Description | Steps | Success |
|---|---|---|---|
| 0 | put_both_the_alphabet_soup_and_the_tomato_sauce_in_the_basket | 520 | ❌ |
| 1 | put_both_the_cream_cheese_box_and_the_butter_in_the_basket | 520 | ❌ |
| 2 | put_both_the_alphabet_soup_and_the_cream_cheese_box_in_the_basket | 520 | ❌ |
| 3 | put_the_black_bowl_in_the_back_compartment_of_the_caddy | 221 | ✅ |
| 4 | put_the_butter_at_the_back_of_the_top_shelf | 520 | ❌ |

QNN CPU Success Rate: 1/5 (20%)

Episode videos are available in the videos/ directory.

ONNX → QNN Conversion

Conversion Script

A fully reproducible conversion script is provided: convert_onnx_to_qnn.sh

# Prerequisites
export QNN_SDK=/path/to/qairt/2.43.0.260128
export QNN_PYTHON=/path/to/python3.10
export PATH=/path/to/llvm/bin:$PATH   # clang++ required for .so build

# Run conversion (all 3 models)
bash convert_onnx_to_qnn.sh \
    --onnx-dir ./onnx_models \
    --output-dir ./qnn_models

The script performs 3 stages per model:

  1. qnn-onnx-converter: ONNX → QNN graph (.cpp) + weights (.bin)
  2. qnn-model-lib-generator: QNN graph → shared library (.so)
  3. qnn-net-run (optional): Smoke test with CPU backend

For the action_head, it also runs fix_action_head_all.py first to apply the required ONNX graph fixes.

Conversion Pipeline

PyTorch (HuggingFaceVLA/smolvla_libero)
   |
   v  export_smolvla_to_onnx.py
ONNX (xpuenabler/smolvla-libero-ONNX)
   |
   +-- vision_encoder.onnx -----------> qnn-onnx-converter -> qnn-model-lib-generator -> libvision_encoder.so
   +-- llm_backbone.onnx -------------> qnn-onnx-converter -> qnn-model-lib-generator -> libllm_backbone.so
   +-- action_head.onnx -> fix_all.py -> qnn-onnx-converter -> qnn-model-lib-generator -> libaction_head_v2.so

Per-Model Conversion Commands

1. Vision Encoder

# I/O: pixel_values [1,3,512,512] -> image_embeddings [1,64,960]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/vision_encoder.onnx \
    --input_dim pixel_values 1,3,512,512 \
    --output_path qnn_models/vision_encoder \
    --float_bitwidth 32

$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/vision_encoder.cpp \
    -b qnn_models/vision_encoder.bin \
    -o qnn_models -t x86_64-linux-clang

2. LLM Backbone

# I/O: image_embs[1,64,960] x2 + lang_tokens[1,48] + lang_masks[1,48] + state[1,32]
#   -> kv_keys/values[32,1,177,5,64] + prefix_pad_masks[1,177]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/llm_backbone.onnx \
    --input_dim image_embs_1 1,64,960 \
    --input_dim image_embs_2 1,64,960 \
    --input_dim lang_tokens 1,48 \
    --input_dim lang_masks 1,48 \
    --input_dim state 1,32 \
    --output_path qnn_models/llm_backbone \
    --float_bitwidth 32

$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/llm_backbone.cpp \
    -b qnn_models/llm_backbone.bin \
    -o qnn_models -t x86_64-linux-clang

3. Action Head (requires ONNX fixes first)

# Step 1: Apply 3 ONNX graph fixes (RMSNorm + Slice->Gather + Constants->Initializers)
python3 scripts/fix_action_head_all.py \
    --input onnx_models/action_head.onnx \
    --output onnx_models/action_head_qnn_v2.onnx

# Step 2: Convert fixed ONNX to QNN
# I/O: noisy_actions[1,50,32] + timestep[1] + prefix_pad_masks[1,177]
#      + kv_keys/values[32,1,177,5,64] -> velocity[1,50,32]
$QNN_PYTHON $QNN_SDK/bin/x86_64-linux-clang/qnn-onnx-converter \
    --input_network onnx_models/action_head_qnn_v2.onnx \
    --input_dim noisy_actions 1,50,32 \
    --input_dim timestep 1 \
    --input_dim prefix_pad_masks 1,177 \
    --input_dim kv_keys 32,1,177,5,64 \
    --input_dim kv_values 32,1,177,5,64 \
    --output_path qnn_models/action_head_v2 \
    --float_bitwidth 32

# Step 3: Build shared library
$QNN_SDK/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn_models/action_head_v2.cpp \
    -b qnn_models/action_head_v2.bin \
    -o qnn_models -t x86_64-linux-clang

Action Head ONNX Fixes (Detail)

The original action_head.onnx contains 3 patterns incompatible with QNN:

| Fix | Count | Problem | Solution |
|---|---|---|---|
| RMSNorm | 64 | Div output shape [1,50,1] rejected (expects [1,50,480]) | Rewrite Div(x,n)*w to Div(x*w,n) |
| Slice to Gather | 65 | QNN StridedSlice validation fails at runtime | Replace with static-index Gather |
| Constants to Initializers | all | QNN converter ignores interior Constant nodes | Move to graph initializers |

All fixes are in scripts/fix_action_head_all.py. The fixed model produces identical outputs (cosine similarity = 1.000000).
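
As an illustration of the third rewrite, here is a minimal sketch of moving interior Constant nodes into graph initializers with the onnx Python package. The output filename is hypothetical; the real implementation, which also applies the RMSNorm and Slice-to-Gather rewrites, is scripts/fix_action_head_all.py.

import onnx
from onnx import helper

def constants_to_initializers(model: onnx.ModelProto) -> onnx.ModelProto:
    graph = model.graph
    kept_nodes = []
    for node in graph.node:
        if node.op_type == "Constant" and node.attribute and node.attribute[0].name == "value":
            # Move the constant's tensor payload into graph.initializer so the
            # QNN converter sees a static weight instead of an interior node
            tensor = helper.get_attribute_value(node.attribute[0])
            tensor.name = node.output[0]
            graph.initializer.append(tensor)
        else:
            kept_nodes.append(node)
    del graph.node[:]
    graph.node.extend(kept_nodes)
    return model

model = onnx.load("onnx_models/action_head.onnx")
onnx.save(constants_to_initializers(model), "onnx_models/action_head_constfold.onnx")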

QNN SDK Patches Required

QNN SDK v2.43.0 has known bugs that must be patched before conversion:

| File | Issue | Fix |
|---|---|---|
| onnx_model_api.py | spawn_process_and_exec() corrupts numpy data | Replace with direct function call |
| op_adapter.py (ReshapeOp) | C++ binding corrupts shape data | Pure Python override |
| op_adapter.py (TransposeOp) | Permutation order gets corrupted | list(perm) conversion |

Citation

If you use these models, please cite:

@article{smolvla,
  title={SmolVLA: Efficient Vision Language Models for Embodied AI},
  author={...},
  journal={...},
  year={2024}
}

@article{libero,
  title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author={...},
  journal={...},
  year={2023}
}

License

These QNN-converted models are provided under the same license as the original SmolVLA models. See the original repository for license details.

Support & Issues

For issues related to:

  • QNN conversion: See scripts/fix_action_head_all.py and conversion logs
  • Inference: Check scripts/infer_libero_episode_qnn_cpu.py for implementation details
  • Verification: Run scripts/compare_onnx_vs_qnn.py to validate outputs
  • ONNX export: See scripts/export_smolvla_to_onnx.py

Related Repositories

  • HuggingFaceVLA/smolvla_libero: source PyTorch model
  • xpuenabler/smolvla-libero-ONNX: ONNX intermediate export

Last Updated: March 2025
QNN SDK Version: v2.43.0
Verification Status: ✓ All models verified (cosine_similarity = 1.000000)
