Aetheris: Hybrid Mamba-MoE Multilingual Model

Aetheris is a ~536M-parameter hybrid SSM-MoE language model distilled from CohereLabs/tiny-aya-global (3.35B). It supports 67 languages at roughly 6.3x compression relative to its teacher.

Architecture

  • Type: Hybrid Mamba-MoE (interleaved SSM + Sparse MoE layers)
  • Layers: 24 (12 SSM + 12 MoE)
  • Hidden dim: 1024
  • Experts: 4 (top-1 routing)
  • Vocab: 80,000 tokens (pruned from the 255K-token Aya tokenizer)
  • Parameters: 536M (pruned from 722M via vocabulary pruning)
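
The top-1 routing listed above means each token is processed by exactly one of the 4 experts, so per-token compute stays constant as experts are added. A minimal sketch of that routing decision (names and toy logits are illustrative, not the actual aetheris API):

```python
# Illustrative sketch of top-1 MoE routing: each token is sent to the single
# expert with the highest router score. With 4 experts and top-1 routing,
# per-token FFN compute equals that of one expert, regardless of expert count.
NUM_EXPERTS = 4

def route_top1(router_scores):
    """Pick the highest-scoring expert index for each token.

    router_scores: list of per-token logit lists, one logit per expert.
    """
    return [max(range(NUM_EXPERTS), key=lambda e: scores[e])
            for scores in router_scores]

# Toy example: 3 tokens, 4 router logits each.
scores = [[0.1, 0.9, 0.0, 0.2],
          [0.5, 0.1, 0.8, 0.3],
          [0.7, 0.2, 0.1, 0.6]]
print(route_top1(scores))  # [1, 2, 0]
```

In a full model the selected expert's feed-forward block would then be applied to that token's hidden state; a load-balancing auxiliary loss is typically needed so the router does not collapse onto one expert.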

Compression

  Stage   Technique                Before    After   Savings
  1       Knowledge Distillation   3,350M    722M    4.6x
  2       Vocabulary Pruning       722M      536M    25.7%
  Total                            3,350M    536M    6.3x
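
The ratios follow directly from the parameter counts; a quick sanity check (small rounding differences against the table come from the rounded millions used here):

```python
# Reproduce the compression figures from the parameter counts above.
teacher, distilled, pruned = 3350, 722, 536  # parameters in millions

stage1 = teacher / distilled                # knowledge distillation ratio
stage2 = (distilled - pruned) / distilled   # vocabulary-pruning savings
total = teacher / pruned                    # end-to-end compression

print(f"{stage1:.1f}x")   # 4.6x
print(f"{stage2:.1%}")    # 25.8% (the table's 25.7% reflects exact counts)
print(f"{total:.2f}x")    # 6.25x, reported as ~6.3x
```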

Vocabulary Pruning Details

  • Original vocab: 255,000 tokens → Pruned: 80,000 tokens
  • Dead tokens removed: 131,231 (never used by any of the 67 target languages)
  • Per-language coverage preserved via frequency-based keep-list union
  • Mean fertility increase: <5% across all languages
  • Weight tying preserved (embedding = lm_head)
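
A minimal sketch of the frequency-based keep-list union described above, under the assumption that per-language token frequencies are available; the data, budgets, and function names here are toy illustrations, not the actual pruning pipeline:

```python
# Illustrative sketch: keep the union of each language's most frequent
# token IDs, then renumber the survivors into a dense pruned vocabulary.
from collections import Counter

def build_keep_list(per_lang_counts, per_lang_budget):
    """Union of each language's top-N token IDs by frequency.

    per_lang_counts: {lang: Counter({token_id: count})}
    per_lang_budget: how many top tokens to keep per language
    """
    keep = set()
    for counts in per_lang_counts.values():
        keep.update(tid for tid, _ in counts.most_common(per_lang_budget))
    return sorted(keep)  # position in this list becomes the new pruned ID

# Toy example: two languages sharing token 11.
counts = {
    "en": Counter({10: 50, 11: 40, 12: 1}),
    "fr": Counter({11: 30, 20: 25, 21: 2}),
}
keep = build_keep_list(counts, per_lang_budget=2)
print(keep)         # [10, 11, 20]
old_to_new = {old: new for new, old in enumerate(keep)}
print(old_to_new)   # {10: 0, 11: 1, 20: 2}
```

Because the union is taken per language, a token rare globally but frequent in one language survives, which is what keeps the mean fertility increase small. The same old-to-new index list would be used to slice the embedding rows, and weight tying means the lm_head is pruned for free.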

Training

  • Stage 1: CKA-guided layer alignment (10K steps)
  • Stage 2: KL divergence distillation, T=2.0, alpha=0.7 (20K steps, best loss=2.73)
  • Stage 3: SFT fine-tuning (pending)
  • Teacher: CohereLabs/tiny-aya-global (3.35B)
  • Data: ClimbMix (NVIDIA)
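
The Stage-2 objective blends a temperature-scaled KL term against the teacher with hard-label cross-entropy. A pure-Python sketch for clarity (a real implementation would use torch's log_softmax/kl_div; whether alpha weights the KL term or the CE term is an assumption here):

```python
# Illustrative sketch of a distillation loss with T=2.0 and alpha=0.7,
# as listed for Stage 2. Pure Python on a single token's logits.
import math

T, ALPHA = 2.0, 0.7

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, target_idx):
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student); the T^2 factor keeps gradient scale comparable
    kl = T * T * sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    # Standard cross-entropy against the hard label at temperature 1
    ce = -math.log(softmax(student_logits)[target_idx])
    return ALPHA * kl + (1 - ALPHA) * ce
```

When student and teacher logits match, the KL term is zero and only the (1 - alpha)-weighted cross-entropy remains.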

Usage

import sys
import torch

sys.path.insert(0, ".")  # make the local aetheris package importable
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE

config = AetherisConfig.from_yaml("config.yaml")
model = HybridMambaMoE(config)

# Load the distilled, vocab-pruned weights on CPU
sd = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(sd)
model.eval()  # inference mode

Note: This model uses a pruned vocabulary. Use the vocab_mapping.json file to map between original Aya tokenizer IDs and pruned model IDs.
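
How the mapping is applied might look like the sketch below. The actual schema of vocab_mapping.json is not documented in this card; this assumes a flat {original_id: pruned_id} dict with JSON string keys, and the unk fallback is an assumption:

```python
# Illustrative only: remap original Aya tokenizer IDs to pruned model IDs.
# Assumes vocab_mapping.json is a {original_id: pruned_id} dict.
import json

def load_mapping(path="vocab_mapping.json"):
    with open(path) as f:
        raw = json.load(f)
    old_to_new = {int(k): v for k, v in raw.items()}   # JSON keys are strings
    new_to_old = {v: k for k, v in old_to_new.items()}  # for decoding
    return old_to_new, new_to_old

def remap_ids(token_ids, old_to_new, unk_id=0):
    # Tokens that were pruned from the vocabulary fall back to unk_id
    # (hypothetical choice; the model's real unk handling may differ).
    return [old_to_new.get(t, unk_id) for t in token_ids]
```

The inverse map would be applied to the model's output IDs before detokenizing with the original Aya tokenizer.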

Wayy Research

People for research, research for people. Buffalo, NY, est. 2024
