Octen-Embedding-0.6B – FP32 ONNX
FP32 ONNX re-export of Octen/Octen-Embedding-0.6B using only standard ONNX operators (opset 18). Compatible with any ONNX Runtime execution provider, including CoreML, CUDA, DirectML, and CPU.
Why this export?
The original ONNX export uses `MatMulBnb4` (bitsandbytes INT4) operators from the `com.microsoft` domain. These are not supported by most execution providers (CoreML, TensorRT, etc.), which in practice limits the model to CPU-only inference.
This re-export replaces all non-standard ops with standard ONNX `MatMul` in float32, enabling hardware-accelerated inference across all major platforms at full precision.
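You can verify this directly: the sketch below lists the op types in the graph and checks that every node lives in the standard domain (it assumes the files from the Files section below are in the working directory).

```python
# Sketch: confirm the graph uses only standard-domain ONNX ops.
# load_external_data=False skips reading the 4.4 GB weights file.
import onnx

model = onnx.load("model.fp32.onnx", load_external_data=False)
domains = {node.domain for node in model.graph.node}
assert domains <= {"", "ai.onnx"}, f"non-standard domains: {domains}"
print(sorted({node.op_type for node in model.graph.node}))
```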
Differences from the original
| | Original (INT4) | This export (FP32) |
|---|---|---|
| Precision | 4-bit (bitsandbytes) | float32 |
| Weight size | 533 MB | 4.4 GB |
| ONNX ops | `com.microsoft.MatMulBnb4` | Standard `MatMul` only |
| CoreML EP | Not supported | Supported |
| CUDA EP | Limited | Supported |
Files
- `model.fp32.onnx` – ONNX graph (4.8 MB)
- `model.fp32.onnx.data` – External weights (4.4 GB)
- `tokenizer.json` – HuggingFace BPE tokenizer (11 MB)
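One way to fetch all three files (including the external-data file the graph references) is `huggingface_hub`; a minimal sketch, assuming the repo id geoffsee/octen-embedding-0.6b-onnx-fp32:

```python
# Sketch: download the graph, external weights, and tokenizer together.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("geoffsee/octen-embedding-0.6b-onnx-fp32")
print(local_dir)  # model.fp32.onnx, model.fp32.onnx.data, tokenizer.json
```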
Usage
Python (ONNX Runtime)
```python
import onnxruntime as ort
from tokenizers import Tokenizer
import numpy as np

tokenizer = Tokenizer.from_file("tokenizer.json")
session = ort.InferenceSession("model.fp32.onnx")

# Tokenize and build int64 tensors of shape (1, seq_len)
encoding = tokenizer.encode("hello world", add_special_tokens=True)
input_ids = np.array([encoding.ids], dtype=np.int64)
attention_mask = np.array([encoding.attention_mask], dtype=np.int64)

# Mean pooling and L2 normalization are baked into the graph,
# so the first output is the final sentence embedding
embeddings = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
})[0]

print(f"Shape: {embeddings.shape}")  # (1, 1024)
```
Rust (ort)
```rust
use int4_runner::EmbeddingModel;

fn main() {
    let tok = std::fs::read("tokenizer.json").unwrap();
    let model = EmbeddingModel::from_file("model.fp32.onnx", &tok).unwrap();
    let embedding = model.embed("hello world").unwrap();
    println!("dimensions: {}", embedding.values.len()); // 1024
}
```
Model details
- Architecture: Qwen3 transformer, 28 layers, 1024 hidden dim, 16 attention heads
- Max sequence length: 32,768 tokens (model maximum); 512 tokens is typical in practice
- Output dimension: 1024
- Tokenizer: Qwen BPE (vocab size 151,669)
- Pooling: Mean pooling + L2 normalization
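Because pooling and normalization happen inside the graph, every output vector has unit norm, so cosine similarity reduces to a plain dot product. A minimal sketch, reusing `session` and `tokenizer` from the Python example above:

```python
# Sketch: outputs are L2-normalized in-graph, so a dot product
# is already the cosine similarity.
import numpy as np

def embed(text: str) -> np.ndarray:
    enc = tokenizer.encode(text, add_special_tokens=True)
    return session.run(None, {
        "input_ids": np.array([enc.ids], dtype=np.int64),
        "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
    })[0][0]

a, b = embed("hello world"), embed("hi there")
print(float(np.linalg.norm(a)))  # ~1.0 (unit norm)
print(float(a @ b))              # cosine similarity
```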
Attribution
- Original model: Octen/Octen-Embedding-0.6B by Octen
- Base model: Qwen/Qwen3-Embedding-0.6B
- License: Apache-2.0
Changes from original
This is a precision and format conversion only: the weights were cast from bfloat16 to float32 and the model re-exported with `torch.onnx.export` (opset 18), with mean pooling and L2 normalization baked into the graph. No fine-tuning or other weight modification was performed.
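For reference, a minimal sketch of that recipe. The wrapper module, tensor names, and dynamic-axes choices here are illustrative assumptions, not the actual export script:

```python
# Sketch of the export recipe described above (hypothetical names).
import torch
from transformers import AutoModel, AutoTokenizer

class EmbeddingWrapper(torch.nn.Module):
    """Backbone + mean pooling + L2 normalization inside the graph."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return torch.nn.functional.normalize(pooled, p=2, dim=-1)

tok = AutoTokenizer.from_pretrained("Octen/Octen-Embedding-0.6B")
backbone = AutoModel.from_pretrained("Octen/Octen-Embedding-0.6B",
                                     torch_dtype=torch.bfloat16)
model = EmbeddingWrapper(backbone.float()).eval()  # bfloat16 -> float32 cast

enc = tok("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (enc["input_ids"], enc["attention_mask"]),
    "model.fp32.onnx",
    opset_version=18,
    input_names=["input_ids", "attention_mask"],
    output_names=["embeddings"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
)
```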