Aetheris: Hybrid Mamba-MoE Multilingual Model

Aetheris is a ~536M-parameter hybrid SSM-MoE language model distilled from CohereLabs/tiny-aya-global (3.35B). It supports 67 languages at roughly 6.3x compression relative to its teacher.

Architecture

  • Type: Hybrid Mamba-MoE (interleaved SSM + Sparse MoE layers)
  • Layers: 24 (12 SSM + 12 MoE)
  • Hidden dim: 1024
  • Experts: 4 (top-1 routing)
  • Vocab: 80,000 tokens (pruned from the 255K-token Aya tokenizer)
  • Parameters: 536M (pruned from 722M via vocabulary pruning)
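
The top-1 routing listed above means each token is processed by exactly one of the 4 experts, so per-token compute stays constant as experts are added. A minimal sketch of that routing decision (names and toy logits are illustrative, not the actual aetheris API):

```python
# Illustrative sketch of top-1 MoE routing: each token is sent to the single
# expert with the highest router score. With 4 experts and top-1 routing,
# per-token FFN compute equals that of one expert, regardless of expert count.
NUM_EXPERTS = 4

def route_top1(router_scores):
    """Pick the highest-scoring expert index for each token.

    router_scores: list of per-token logit lists, one logit per expert.
    """
    return [max(range(NUM_EXPERTS), key=lambda e: scores[e])
            for scores in router_scores]

# Toy example: 3 tokens, 4 router logits each.
scores = [[0.1, 0.9, 0.0, 0.2],
          [0.5, 0.1, 0.8, 0.3],
          [0.7, 0.2, 0.1, 0.6]]
print(route_top1(scores))  # [1, 2, 0]
```

In a full model the selected expert's feed-forward block would then be applied to that token's hidden state; a load-balancing auxiliary loss is typically needed so the router does not collapse onto one expert.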

Compression

  Stage   Technique                Before    After   Savings
  1       Knowledge Distillation   3,350M    722M    4.6x
  2       Vocabulary Pruning       722M      536M    25.7%
  Total                            3,350M    536M    6.3x
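
The ratios follow directly from the parameter counts; a quick sanity check (small rounding differences against the table come from the rounded millions used here):

```python
# Reproduce the compression figures from the parameter counts above.
teacher, distilled, pruned = 3350, 722, 536  # parameters in millions

stage1 = teacher / distilled                # knowledge distillation ratio
stage2 = (distilled - pruned) / distilled   # vocabulary-pruning savings
total = teacher / pruned                    # end-to-end compression

print(f"{stage1:.1f}x")   # 4.6x
print(f"{stage2:.1%}")    # 25.8% (the table's 25.7% reflects exact counts)
print(f"{total:.2f}x")    # 6.25x, reported as ~6.3x
```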

Vocabulary Pruning Details

  • Original vocab: 255,000 tokens → Pruned: 80,000 tokens
  • Dead tokens removed: 131,231 (never used by any of the 67 target languages)
  • Per-language coverage preserved via frequency-based keep-list union
  • Mean fertility increase: <5% across all languages
  • Weight tying preserved (embedding = lm_head)
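
A minimal sketch of the frequency-based keep-list union described above, under the assumption that per-language token frequencies are available; the data, budgets, and function names here are toy illustrations, not the actual pruning pipeline:

```python
# Illustrative sketch: keep the union of each language's most frequent
# token IDs, then renumber the survivors into a dense pruned vocabulary.
from collections import Counter

def build_keep_list(per_lang_counts, per_lang_budget):
    """Union of each language's top-N token IDs by frequency.

    per_lang_counts: {lang: Counter({token_id: count})}
    per_lang_budget: how many top tokens to keep per language
    """
    keep = set()
    for counts in per_lang_counts.values():
        keep.update(tid for tid, _ in counts.most_common(per_lang_budget))
    return sorted(keep)  # position in this list becomes the new pruned ID

# Toy example: two languages sharing token 11.
counts = {
    "en": Counter({10: 50, 11: 40, 12: 1}),
    "fr": Counter({11: 30, 20: 25, 21: 2}),
}
keep = build_keep_list(counts, per_lang_budget=2)
print(keep)         # [10, 11, 20]
old_to_new = {old: new for new, old in enumerate(keep)}
print(old_to_new)   # {10: 0, 11: 1, 20: 2}
```

Because the union is taken per language, a token rare globally but frequent in one language survives, which is what keeps the mean fertility increase small. The same old-to-new index list would be used to slice the embedding rows, and weight tying means the lm_head is pruned for free.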

Training

  • Stage 1: CKA-guided layer alignment (10K steps)
  • Stage 2: KL divergence distillation, T=2.0, alpha=0.7 (20K steps, best loss=2.73)
  • Stage 3: SFT fine-tuning (pending)
  • Teacher: CohereLabs/tiny-aya-global (3.35B)
  • Data: ClimbMix (NVIDIA)
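
The Stage-2 objective blends a temperature-scaled KL term against the teacher with hard-label cross-entropy. A pure-Python sketch for clarity (a real implementation would use torch's log_softmax/kl_div; whether alpha weights the KL term or the CE term is an assumption here):

```python
# Illustrative sketch of a distillation loss with T=2.0 and alpha=0.7,
# as listed for Stage 2. Pure Python on a single token's logits.
import math

T, ALPHA = 2.0, 0.7

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, target_idx):
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student); the T^2 factor keeps gradient scale comparable
    kl = T * T * sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    # Standard cross-entropy against the hard label at temperature 1
    ce = -math.log(softmax(student_logits)[target_idx])
    return ALPHA * kl + (1 - ALPHA) * ce
```

When student and teacher logits match, the KL term is zero and only the (1 - alpha)-weighted cross-entropy remains.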

Usage

import sys
import torch

sys.path.insert(0, ".")  # make the local aetheris package importable
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE

config = AetherisConfig.from_yaml("config.yaml")
model = HybridMambaMoE(config)

# Load the distilled, vocab-pruned weights on CPU
sd = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(sd)
model.eval()  # inference mode

Note: This model uses a pruned vocabulary. Use the vocab_mapping.json file to map between original Aya tokenizer IDs and pruned model IDs.
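
How the mapping is applied might look like the sketch below. The actual schema of vocab_mapping.json is not documented in this card; this assumes a flat {original_id: pruned_id} dict with JSON string keys, and the unk fallback is an assumption:

```python
# Illustrative only: remap original Aya tokenizer IDs to pruned model IDs.
# Assumes vocab_mapping.json is a {original_id: pruned_id} dict.
import json

def load_mapping(path="vocab_mapping.json"):
    with open(path) as f:
        raw = json.load(f)
    old_to_new = {int(k): v for k, v in raw.items()}   # JSON keys are strings
    new_to_old = {v: k for k, v in old_to_new.items()}  # for decoding
    return old_to_new, new_to_old

def remap_ids(token_ids, old_to_new, unk_id=0):
    # Tokens that were pruned from the vocabulary fall back to unk_id
    # (hypothetical choice; the model's real unk handling may differ).
    return [old_to_new.get(t, unk_id) for t in token_ids]
```

The inverse map would be applied to the model's output IDs before detokenizing with the original Aya tokenizer.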

Wayy Research

People for research, research for people. Buffalo, NY, est. 2024
