---
language:
- en
license: mit
library_name: transformers
tags:
- multimodal
- moe
- text-to-image
- image-editing
- image-to-video
- text-to-video
- video-editing
- text-to-speech
- speech-to-text
- speech-to-speech
- image-to-text
- video-to-text
- agentic
- tool-use
- flow-matching
- 3d-rope
- titok
- vidtok
- dual-stream-attention
- zero-shot-voice-cloning
- bigvgan
- snake-activation
- multi-receptive-field-fusion
pipeline_tag: any-to-any
inference: false
datasets:
- m-a-p/Code-Feedback
- iamtarun/python_code_instructions_18k_alpaca
- codeparrot/codeparrot-clean
- bigcode/humanevalpack
- loubnabnl/github-jupyter-code-to-text
- saurabh5/rlvr-code-data-Swift
- finbarr/rlvr-code-data-swift-code-edit
- ExAi/Code-Golang-QA-2k
- smcleod/golang-coder
- databricks/databricks-dolly-15k
- OpenAssistant/oasst1
- HuggingFaceH4/no_robots
- Open-Orca/OpenOrca
- abhi227070/conversation-to-summarization-dataset
- allenai/WildChat-1M
- THUDM/AgentInstruct
- glaiveai/glaive-code-assistant-v2
- stingning/ultrachat
- RyokoAI/ShareGPT52K
- AlicanKiraz0/Agentic-Chain-of-Thought-Coding-SFT-Dataset
- Locutusque/function-calling-chatml
- driaforall/pythonic-function-calling
- argilla/Synth-APIGen-v0.1
- interstellarninja/tool-calls-singleturn
- interstellarninja/tool-calls-multiturn
- Naveengo/flickr8k
- ybelkada/football-dataset
- jmhessel/newyorker_caption_contest
- derek-thomas/ScienceQA
- HuggingFaceM4/WebSight
- lmms-lab/Video-MME
- MBZUAI/VideoInstruct-100K
- Gustavosta/Stable-Diffusion-Prompts
- FredZhang7/stable-diffusion-prompts-2.47M
- succinctly/midjourney-prompts
- osunlp/MagicBrush
- timbrooks/instructpix2pix-clip-filtered
- Rapidata/sora-video-generation-physics-likert-scoring
- Rapidata/sora-video-generation-style-likert-scoring
- Rapidata/sora-video-generation-alignment-likert-scoring
- Rapidata/text-2-video-human-preferences
- Rapidata/text-2-video-human-preferences-sora-2
- TempoFunk/webvid-10M
- multimodalart/panda-70m
- nkp37/OpenVid-1M
- WenhaoWang/VidProM
- WenhaoWang/TIP-I2V
- jovianzm/img2vid-pexels-350k
- TencentARC/MiraData
- APRIL-AIGC/UltraVideo
- Mutonix/Vript
- Rapidata/image-to-video-human-preference-seedance-1-pro
- openslr/librispeech_asr
- blabble-io/libritts_r
- parler-tts/mls_eng_10k
- MikhailT/hifi-tts
- renjiepi/medium_20000-file_operations_n100k1
---
# Xoron-Dev: State-of-the-Art Multimodal MoE
Xoron-Dev is the definitive open-source architecture for Omni-Modal Artificial Intelligence. Unlike legacy models that treat vision and audio as plugins, Xoron-Dev is designed for native, high-fidelity perception across every major sensory dimension.
## Why Xoron-Dev?
Xoron-Dev represents a massive leap in multimodal reasoning, combining cutting-edge Sparse MoE architecture with a refined sensory stack.
### 1. SOTA Vision (SigLIP-2 & TiTok)
Xoron-Dev exclusively uses SigLIP-2 for superior zero-shot performance and semantic alignment.
- TiTok 1D VAE: Images are compressed into 256 ultra-dense tokens, allowing Xoron to "see" high-resolution scenes with unprecedented efficiency.
- 2D-RoPE: Integrated positional embeddings that maintain spatial relationships regardless of aspect ratio.
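The card does not publish the model's actual 2D-RoPE implementation, so the following is only a minimal NumPy sketch of the general idea: half of each feature vector is rotated by the patch's row index and half by its column index, so relative spatial offsets are encoded identically at any aspect ratio. All shapes and the frequency base are illustrative assumptions.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied over the last dimension of x."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-channel frequencies
    angles = pos[..., None] * freqs             # (..., half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, rows, cols):
    """2D RoPE: rotate half the channels by row position, half by column."""
    d = x.shape[-1]
    return np.concatenate(
        [rope_1d(x[..., : d // 2], rows), rope_1d(x[..., d // 2 :], cols)],
        axis=-1,
    )

# Toy 4x6 patch grid with 8-dim features per patch.
h, w, d = 4, 6, 8
x = np.random.randn(h * w, d)
rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
out = rope_2d(x, rr.ravel(), cc.ravel())
print(out.shape)  # (24, 8)
```

Because every step is a pure rotation, token norms are preserved, which is what lets attention scores depend only on relative positions.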
### 2. Native Video Intelligence (VidTok)
Our custom VidTok encoder uses 3D Volumetric Compression to ingest up to 32 frames of high-definition video natively. Xoron doesn't just see a sequence of images; it understands motion, causality, and temporal context.
### 3. Raw PCM Audio (Conformer + BigVGAN)
Xoron-Dev processes raw 16 kHz PCM audio directly, with no Mel spectrograms or other lossy filterbank front end.
- Micro-Latency S2S: True Speech-to-Speech interactions (<200ms) for natural, fluid conversations.
- Zero-Shot Voice Cloning: Instantly clone any voice from a 5-second sample for high-fidelity personalized output.
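The exact Conformer front end is not described in this card; the sketch below only illustrates what "raw PCM ingestion" means in practice: the 16 kHz waveform is sliced into overlapping frames and fed to the encoder as-is, with no Mel filterbank in between. The 25 ms window and 10 ms hop are illustrative assumptions.

```python
import numpy as np

def frame_pcm(pcm, frame_len=400, hop=160):
    """Slice raw 16 kHz PCM into overlapping frames (25 ms window, 10 ms hop).
    No FFT or Mel filterbank: frames go straight to the encoder stack."""
    n = 1 + max(0, (len(pcm) - frame_len) // hop)
    # Index matrix: row i selects samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return pcm[idx]

# One second of synthetic 16 kHz audio.
pcm = np.random.randn(16000).astype(np.float32)
frames = frame_pcm(pcm)
print(frames.shape)  # (98, 400)
```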
## The Brain: Aux-Loss-Free MoE & 128K Ring Attention
A sophisticated Mixture-of-Experts (MoE) backbone dynamically routes every token through specialized hardware-aware sub-networks.
### Deep Expert Hierarchy
Unlike standard MoE models with uniform experts, Xoron-Dev implements a specialized Deep Expert system.
- Expert Pool: 16 Experts Total (8 Standard + 8 Deep).
- Variable Logical Depth: Deep Experts possess internal depths scaling from 2 up to 9 layers.
- Expert Penalty Routing: A soft utilization penalty ($Cost \propto Depth$) ensures that the model only invokes deeper computation for tasks requiring maximum logical precision, maintaining high inference throughput for simpler tokens.
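The routing scheme above can be sketched in a few lines. This is a minimal NumPy illustration of depth-penalized top-k routing over the 8 + 8 expert pool, not the model's actual router; the penalty coefficient `alpha` and top-k value are hypothetical.

```python
import numpy as np

# Hypothetical expert pool: 8 standard experts (depth 1) + 8 deep (depths 2..9).
EXPERT_DEPTHS = np.array([1] * 8 + list(range(2, 10)))

def route_with_depth_penalty(logits, k=2, alpha=0.1):
    """Top-k routing with a soft utilization penalty proportional to expert
    depth: deeper experts must win by a larger margin to be selected."""
    penalized = logits - alpha * EXPERT_DEPTHS        # cost grows with depth
    topk = np.argsort(penalized)[..., -k:]            # indices of the k winners
    # Softmax over the k winning scores gives the combine weights.
    w = np.take_along_axis(penalized, topk, axis=-1)
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    return topk, w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16))                     # 4 tokens, 16 experts
experts, weights = route_with_depth_penalty(logits)
print(experts.shape, weights.shape)  # (4, 2) (4, 2)
```

With the penalty applied, a depth-9 expert only wins a token when its raw router score exceeds a shallow expert's by more than `alpha * 8`, which is what keeps simple tokens on the cheap path.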
### Reasoning Acceleration: Fast Ponder
Xoron-Dev features a dedicated FastPonderBlock for near-instant latent deliberation.
- Attention-Free Reasoning: By bypassing the $O(N^2)$ Self-Attention stack during thought loops, the Depth-3 reasoning block propagates logic at 120+ thoughts/sec.
- Dynamic Halting: A learned `halt_head` monitors latent entropy. Once the model reaches a decision (entropy drops below the 0.2 threshold), it breaks the ponder loop and returns to token decoding, reducing unnecessary FLOPs by up to 90%.
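The halting mechanism described above amounts to a simple loop. The sketch below is an illustrative, self-contained version with toy dynamics; the real `FastPonderBlock`, step function, and halting head are not published here, so every name and number besides the 0.2 entropy threshold is an assumption.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector, in nats."""
    p = np.clip(p, 1e-9, 1.0)
    return float(-(p * np.log(p)).sum())

def ponder(state, step_fn, halt_fn, max_steps=16, threshold=0.2):
    """Attention-free ponder loop: refine the latent state until the halting
    head's distribution is confident (entropy < threshold) or the step
    budget runs out. Returns the final state and steps taken."""
    for step in range(1, max_steps + 1):
        state = step_fn(state)
        if entropy(halt_fn(state)) < threshold:
            break                      # decision reached: stop pondering
    return state, step

# Toy dynamics: each step sharpens a 4-way latent decision.
def step_fn(s):
    return s * 1.8                     # amplify the logits each iteration

def halt_fn(s):
    e = np.exp(s - s.max())
    return e / e.sum()                 # softmax -> halting distribution

state, steps = ponder(np.array([0.5, 0.1, -0.2, 0.0]), step_fn, halt_fn)
print(steps)
```

Because the loop never touches the attention stack, each ponder step costs only the block's own FLOPs, which is where the claimed throughput comes from.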
### 128K Ring-Attention Context

Using Ring Attention, Xoron-Dev can analyze books, hour-long videos, or massive codebases with native 128K-token context support.
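Ring Attention's key property is that key/value blocks can be consumed one at a time with an online softmax, so no participant ever materializes the full N x N score matrix. The single-process NumPy sketch below demonstrates that blockwise update (block count and tensor shapes are illustrative; the real implementation shards blocks across devices in a ring):

```python
import numpy as np

def ring_attention(q, k, v, n_blocks=4):
    """Blockwise attention with an online softmax, as each participant in
    the ring would compute it: K/V blocks arrive one at a time while a
    running max and denominator keep the result numerically exact."""
    ks, vs = np.split(k, n_blocks), np.split(v, n_blocks)
    m = np.full(q.shape[0], -np.inf)        # running max per query
    l = np.zeros(q.shape[0])                # running softmax denominator
    o = np.zeros_like(q)                    # running weighted output
    for kb, vb in zip(ks, vs):
        s = q @ kb.T / np.sqrt(q.shape[-1])       # scores for this block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)                 # rescale previous partials
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        o = o * scale[:, None] + p @ vb
        m = m_new
    return o / l[:, None]

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(8, 16)) for _ in range(3))
out = ring_attention(q, k, v)
# Matches dense softmax attention computed in one shot.
```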
## Get Started with Xorfice

The easiest way to experience Xoron-Dev is via the xorfice engine, the SOTA orchestrator for multimodal deployment.
### Installation

```bash
pip install xorfice
```
### High-Fidelity Interaction

```python
from xorfice import XoronEngine

# The engine automatically handles weight download and optimizations.
engine = XoronEngine(model_path="Backup-bdg/Xoron-Dev-MultiMoe")

# Start an omni-modal conversation.
response = engine.generate(
    prompt="Who is this person and what are they doing?",
    images="https://example.com/interview.jpg",
    videos="https://example.com/interview.mp4",
)
print(response["text"])
```
## SOTA Features
| Feature | Xoron-Dev |
|---|---|
| Vision Backbone | SigLIP-2 |
| Video Compression | VidTok 3D |
| Audio Ingestion | Raw PCM |
| Inference Efficiency | Sparse MoE (5B) |
| Context Window | 128K (Ring) |
## Creative Generation
Fully integrated with MobileDiffusion, Xoron-Dev doesn't just understand; it creates.
- Text-to-Video (T2V)
- Image-to-Video (I2V)
- Text-to-Image (T2I)
- Image-to-Image (I2I)
- Video-to-Video (V2V)
## Join the Revolution

Xoron-Dev is more than a model; it's a vision for the future of AI. Build your own multimodal agent today.

*Powered by the Xoron-Dev Team*