---
language:
- en
license: mit
library_name: transformers
tags:
- multimodal
- moe
- text-to-image
- image-editing
- image-to-video
- text-to-video
- video-editing
- text-to-speech
- speech-to-text
- speech-to-speech
- image-to-text
- video-to-text
- agentic
- tool-use
- flow-matching
- 3d-rope
- titok
- vidtok
- dual-stream-attention
- zero-shot-voice-cloning
- bigvgan
- snake-activation
- multi-receptive-field-fusion
pipeline_tag: any-to-any
inference: false
datasets:
- m-a-p/Code-Feedback
- iamtarun/python_code_instructions_18k_alpaca
- codeparrot/codeparrot-clean
- bigcode/humanevalpack
- loubnabnl/github-jupyter-code-to-text
- saurabh5/rlvr-code-data-Swift
- finbarr/rlvr-code-data-swift-code-edit
- ExAi/Code-Golang-QA-2k
- smcleod/golang-coder
- databricks/databricks-dolly-15k
- OpenAssistant/oasst1
- HuggingFaceH4/no_robots
- Open-Orca/OpenOrca
- abhi227070/conversation-to-summarization-dataset
- allenai/WildChat-1M
- THUDM/AgentInstruct
- glaiveai/glaive-code-assistant-v2
- stingning/ultrachat
- RyokoAI/ShareGPT52K
- AlicanKiraz0/Agentic-Chain-of-Thought-Coding-SFT-Dataset
- Locutusque/function-calling-chatml
- driaforall/pythonic-function-calling
- argilla/Synth-APIGen-v0.1
- interstellarninja/tool-calls-singleturn
- interstellarninja/tool-calls-multiturn
- Naveengo/flickr8k
- ybelkada/football-dataset
- jmhessel/newyorker_caption_contest
- derek-thomas/ScienceQA
- HuggingFaceM4/WebSight
- lmms-lab/Video-MME
- MBZUAI/VideoInstruct-100K
- Gustavosta/Stable-Diffusion-Prompts
- FredZhang7/stable-diffusion-prompts-2.47M
- succinctly/midjourney-prompts
- osunlp/MagicBrush
- timbrooks/instructpix2pix-clip-filtered
- Rapidata/sora-video-generation-physics-likert-scoring
- Rapidata/sora-video-generation-style-likert-scoring
- Rapidata/sora-video-generation-alignment-likert-scoring
- Rapidata/text-2-video-human-preferences
- Rapidata/text-2-video-human-preferences-sora-2
- TempoFunk/webvid-10M
- multimodalart/panda-70m
- nkp37/OpenVid-1M
- WenhaoWang/VidProM
- WenhaoWang/TIP-I2V
- jovianzm/img2vid-pexels-350k
- TencentARC/MiraData
- APRIL-AIGC/UltraVideo
- Mutonix/Vript
- Rapidata/image-to-video-human-preference-seedance-1-pro
- openslr/librispeech_asr
- blabble-io/libritts_r
- parler-tts/mls_eng_10k
- MikhailT/hifi-tts
- renjiepi/medium_20000-file_operations_n100k1
---
# Xoron-Dev: State-of-the-Art Multimodal MoE
Xoron-Dev is the definitive open-source architecture for Omni-Modal Artificial Intelligence. Unlike legacy models that treat vision and audio as plugins, Xoron-Dev is designed for native, high-fidelity perception across every major sensory dimension.
## Why Xoron-Dev?
Xoron-Dev represents a massive leap in multimodal reasoning, combining cutting-edge Sparse MoE architecture with a refined sensory stack.
### 1. SOTA Vision (SigLIP-2 & TiTok)
Xoron-Dev exclusively uses SigLIP-2 for superior zero-shot performance and semantic alignment.
- TiTok 1D VAE: Images are compressed into 256 ultra-dense tokens, allowing Xoron to "see" high-resolution scenes with unprecedented efficiency.
- 2D-RoPE: Integrated positional embeddings that maintain spatial relationships regardless of aspect ratio.
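The card does not publish the model's actual 2D-RoPE implementation, so the following is only a minimal NumPy sketch of the general idea: half of each feature vector is rotated by the patch's row index and half by its column index, so relative spatial offsets are encoded identically at any aspect ratio. All shapes and the frequency base are illustrative assumptions.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied over the last dimension of x."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-channel frequencies
    angles = pos[..., None] * freqs             # (..., half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, rows, cols):
    """2D RoPE: rotate half the channels by row position, half by column."""
    d = x.shape[-1]
    return np.concatenate(
        [rope_1d(x[..., : d // 2], rows), rope_1d(x[..., d // 2 :], cols)],
        axis=-1,
    )

# Toy 4x6 patch grid with 8-dim features per patch.
h, w, d = 4, 6, 8
x = np.random.randn(h * w, d)
rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
out = rope_2d(x, rr.ravel(), cc.ravel())
print(out.shape)  # (24, 8)
```

Because every step is a pure rotation, token norms are preserved, which is what lets attention scores depend only on relative positions.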
### 2. Native Video Intelligence (VidTok)
Our custom VidTok encoder uses 3D Volumetric Compression to ingest up to 32 frames of high-definition video natively. Xoron doesn't just see a sequence of images; it understands motion, causality, and temporal context.
### 3. Raw PCM Audio (Conformer + BigVGAN)
Xoron-Dev processes raw 16 kHz PCM audio directly, with no Mel spectrograms or other lossy filterbank front end.
- Micro-Latency S2S: True Speech-to-Speech interactions (<200ms) for natural, fluid conversations.
- Zero-Shot Voice Cloning: Instantly clone any voice from a 5-second sample for high-fidelity personalized output.
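The exact Conformer front end is not described in this card; the sketch below only illustrates what "raw PCM ingestion" means in practice: the 16 kHz waveform is sliced into overlapping frames and fed to the encoder as-is, with no Mel filterbank in between. The 25 ms window and 10 ms hop are illustrative assumptions.

```python
import numpy as np

def frame_pcm(pcm, frame_len=400, hop=160):
    """Slice raw 16 kHz PCM into overlapping frames (25 ms window, 10 ms hop).
    No FFT or Mel filterbank: frames go straight to the encoder stack."""
    n = 1 + max(0, (len(pcm) - frame_len) // hop)
    # Index matrix: row i selects samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return pcm[idx]

# One second of synthetic 16 kHz audio.
pcm = np.random.randn(16000).astype(np.float32)
frames = frame_pcm(pcm)
print(frames.shape)  # (98, 400)
```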
## The Brain: Aux-Loss-Free MoE & 128K Ring Attention
A sophisticated Mixture-of-Experts (MoE) backbone dynamically routes every token through specialized hardware-aware sub-networks.
### Deep Expert Hierarchy
Unlike standard MoE models with uniform experts, Xoron-Dev implements a specialized Deep Expert system.
- Expert Pool: 16 Experts Total (8 Standard + 8 Deep).
- Variable Logical Depth: Deep Experts possess internal depths scaling from 2 up to 9 layers.
- Expert Penalty Routing: A soft utilization penalty ($Cost \propto Depth$) ensures that the model only invokes deeper computation for tasks requiring maximum logical precision, maintaining high inference throughput for simpler tokens.
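The routing scheme above can be sketched in a few lines. This is a minimal NumPy illustration of depth-penalized top-k routing over the 8 + 8 expert pool, not the model's actual router; the penalty coefficient `alpha` and top-k value are hypothetical.

```python
import numpy as np

# Hypothetical expert pool: 8 standard experts (depth 1) + 8 deep (depths 2..9).
EXPERT_DEPTHS = np.array([1] * 8 + list(range(2, 10)))

def route_with_depth_penalty(logits, k=2, alpha=0.1):
    """Top-k routing with a soft utilization penalty proportional to expert
    depth: deeper experts must win by a larger margin to be selected."""
    penalized = logits - alpha * EXPERT_DEPTHS        # cost grows with depth
    topk = np.argsort(penalized)[..., -k:]            # indices of the k winners
    # Softmax over the k winning scores gives the combine weights.
    w = np.take_along_axis(penalized, topk, axis=-1)
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    return topk, w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16))                     # 4 tokens, 16 experts
experts, weights = route_with_depth_penalty(logits)
print(experts.shape, weights.shape)  # (4, 2) (4, 2)
```

With the penalty applied, a depth-9 expert only wins a token when its raw router score exceeds a shallow expert's by more than `alpha * 8`, which is what keeps simple tokens on the cheap path.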
### Reasoning Acceleration: Fast Ponder
Xoron-Dev features a dedicated FastPonderBlock for near-instant latent deliberation.
- Attention-Free Reasoning: By bypassing the $O(N^2)$ Self-Attention stack during thought loops, the Depth-3 reasoning block propagates logic at 120+ thoughts/sec.
- Dynamic Halting: A learned `halt_head` monitors latent entropy. Once the model reaches a decision (entropy drops below the 0.2 threshold), it breaks the ponder loop and returns to token decoding, reducing unnecessary FLOPs by up to 90%.
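The halting mechanism described above amounts to a simple loop. The sketch below is an illustrative, self-contained version with toy dynamics; the real `FastPonderBlock`, step function, and halting head are not published here, so every name and number besides the 0.2 entropy threshold is an assumption.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector, in nats."""
    p = np.clip(p, 1e-9, 1.0)
    return float(-(p * np.log(p)).sum())

def ponder(state, step_fn, halt_fn, max_steps=16, threshold=0.2):
    """Attention-free ponder loop: refine the latent state until the halting
    head's distribution is confident (entropy < threshold) or the step
    budget runs out. Returns the final state and steps taken."""
    for step in range(1, max_steps + 1):
        state = step_fn(state)
        if entropy(halt_fn(state)) < threshold:
            break                      # decision reached: stop pondering
    return state, step

# Toy dynamics: each step sharpens a 4-way latent decision.
def step_fn(s):
    return s * 1.8                     # amplify the logits each iteration

def halt_fn(s):
    e = np.exp(s - s.max())
    return e / e.sum()                 # softmax -> halting distribution

state, steps = ponder(np.array([0.5, 0.1, -0.2, 0.0]), step_fn, halt_fn)
print(steps)
```

Because the loop never touches the attention stack, each ponder step costs only the block's own FLOPs, which is where the claimed throughput comes from.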
### 128K Ring-Attention Context

Using Ring Attention, Xoron-Dev can analyze books, hour-long videos, or massive codebases with native 128K-token context support.
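Ring Attention's key property is that key/value blocks can be consumed one at a time with an online softmax, so no participant ever materializes the full N x N score matrix. The single-process NumPy sketch below demonstrates that blockwise update (block count and tensor shapes are illustrative; the real implementation shards blocks across devices in a ring):

```python
import numpy as np

def ring_attention(q, k, v, n_blocks=4):
    """Blockwise attention with an online softmax, as each participant in
    the ring would compute it: K/V blocks arrive one at a time while a
    running max and denominator keep the result numerically exact."""
    ks, vs = np.split(k, n_blocks), np.split(v, n_blocks)
    m = np.full(q.shape[0], -np.inf)        # running max per query
    l = np.zeros(q.shape[0])                # running softmax denominator
    o = np.zeros_like(q)                    # running weighted output
    for kb, vb in zip(ks, vs):
        s = q @ kb.T / np.sqrt(q.shape[-1])       # scores for this block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)                 # rescale previous partials
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        o = o * scale[:, None] + p @ vb
        m = m_new
    return o / l[:, None]

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(8, 16)) for _ in range(3))
out = ring_attention(q, k, v)
# Matches dense softmax attention computed in one shot.
```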
## Get Started with Xorfice

The easiest way to experience Xoron-Dev is via the xorfice engine, the SOTA orchestrator for multimodal deployment.
### Installation

```bash
pip install xorfice
```
### High-Fidelity Interaction

```python
from xorfice import XoronEngine

# The engine automatically handles weight download and optimizations.
engine = XoronEngine(model_path="Backup-bdg/Xoron-Dev-MultiMoe")

# Start an omni-modal conversation.
response = engine.generate(
    prompt="Who is this person and what are they doing?",
    images="https://example.com/interview.jpg",
    videos="https://example.com/interview.mp4",
)
print(response["text"])
```
## SOTA Features
| Feature | Xoron-Dev |
|---|---|
| Vision Backbone | SigLIP-2 |
| Video Compression | VidTok 3D |
| Audio Ingestion | Raw PCM |
| Inference Efficiency | Sparse MoE (5B) |
| Context Window | 128K (Ring) |
## Creative Generation
Fully integrated with MobileDiffusion, Xoron-Dev doesn't just understand; it creates.
- Text-to-Video (T2V)
- Image-to-Video (I2V)
- Text-to-Image (T2I)
- Image-to-Image (I2I)
- Video-to-Video (V2V)
## Join the Revolution

Xoron-Dev is more than a model; it's a vision for the future of AI. Build your own multimodal agent today.

*Powered by the Xoron-Dev Team*