MoE Notes
How to create a mixture of experts from models of different architectures and sizes?
Based on the mergekit codebase, creating a MoE from different architectures like Llama 3.1 8B and Mistral Nemo 12B is currently not supported due to strict architecture compatibility checks. Here's what would need to be implemented to enable this:
Current Limitations
The MoE system requires all input models to share an identical architecture. Each output architecture class (MixtralMoE, DeepseekMoE, QwenMoE, Qwen3MoE) enforces this in its supports_config method by checking that every model reports the same model_type.
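The homogeneity check described above can be sketched as follows (a simplified illustration, not mergekit's actual code; `supports_config_sketch` is a hypothetical stand-in):

```python
def supports_config_sketch(expert_model_types: list[str]) -> bool:
    """Return True only if every expert reports the same model_type."""
    return len(set(expert_model_types)) == 1

ok = supports_config_sketch(["llama", "llama"])        # True: homogeneous
mixed = supports_config_sketch(["llama", "mistral"])   # False: mixed archs rejected
```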
Required Changes to Enable Cross-Architecture MoE
1. New Heterogeneous MoE Architecture Class
Create a new MoEOutputArchitecture subclass that can handle different architectures:
```python
class HeterogeneousMoE(MoEOutputArchitecture):
    def supports_config(self, config, explain=False, trust_remote_code=False):
        # Allow experts with different model_types
        return True

    def write_model(self, out_path, config, merge_options,
                    router_weights, shared_router_weights=None):
        # Implement heterogeneous expert handling here
        ...
```
2. Tensor Shape Alignment System
The current system assumes identical tensor shapes. You'd need to implement:
- Shape detection: Analyze tensor shapes from each expert model
- Alignment strategy: Either pad smaller tensors or project to common dimensions
- Router adaptation: Handle experts with different hidden sizes
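The padding strategy from the list above can be sketched like this (illustrative only; `pad_to_shape` is a hypothetical helper, shown with NumPy for clarity even though mergekit itself operates on torch tensors):

```python
import numpy as np

def pad_to_shape(tensor: np.ndarray, target_shape: tuple[int, ...]) -> np.ndarray:
    """Zero-pad a weight tensor up to target_shape (no dimension may shrink)."""
    if any(t < s for t, s in zip(target_shape, tensor.shape)):
        raise ValueError("target_shape must be >= tensor.shape in every dim")
    pad_widths = [(0, t - s) for t, s in zip(target_shape, tensor.shape)]
    return np.pad(tensor, pad_widths, mode="constant", constant_values=0.0)

# e.g. lift a 3x2 expert weight into a 4x3 slot in the target model
w = np.ones((3, 2))
w_big = pad_to_shape(w, (4, 3))
```

Zero-padding preserves the original weights exactly but leaves the padded rows and columns inert, which is the main argument for a learned projection instead.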
3. Architecture-Agnostic Weight Mapping
Modify the weight copying logic in mergekit/moe/common.py to handle architecture differences:
```python
def copy_tensor_out_heterogeneous(weight_info, expert_loader, writer,
                                  expert, target_shape):
    # Project or reshape tensors to match the target architecture
    tensor = expert_loader.get_tensor(weight_info.name)
    if tensor.shape != target_shape:
        tensor = project_tensor(tensor, target_shape)
    writer.save_tensor(weight_info.name, tensor)
```
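The `project_tensor` helper is left undefined above; one simple (and lossy) realization is a per-axis truncate-or-pad, sketched here with NumPy. A learned linear projection would preserve more information, but this shows the mechanics:

```python
import numpy as np

def project_tensor(tensor: np.ndarray, target_shape: tuple[int, ...]) -> np.ndarray:
    """Crudely fit `tensor` into `target_shape`: truncate oversized axes,
    zero-pad undersized ones."""
    out = np.zeros(target_shape, dtype=tensor.dtype)
    # Copy the overlapping region; everything outside it stays zero
    slices = tuple(slice(0, min(t, s)) for t, s in zip(target_shape, tensor.shape))
    out[slices] = tensor[slices]
    return out
```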
4. Dynamic Router Configuration
The router weights need to accommodate experts with different dimensions. This requires:
- Per-expert input/output dimension tracking
- Custom router initialization that handles heterogeneous experts
- Modified gating logic in the output model configuration
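The routing requirements above can be made concrete with a small sketch: a shared router produces gate probabilities, and per-expert input projections map the shared hidden state into each expert's native width. Everything here (`heterogeneous_route`, the projection matrices) is hypothetical illustration, not mergekit code:

```python
import numpy as np

def heterogeneous_route(hidden, gate_w, expert_in_projs):
    """Route a hidden vector to experts of differing hidden sizes.

    gate_w:          (num_experts, shared_dim) router weights
    expert_in_projs: one (expert_dim, shared_dim) matrix per expert,
                     mapping the shared hidden state into that expert's width
    """
    logits = gate_w @ hidden
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # softmax gate over experts
    # Each expert sees the hidden state projected to its own dimension
    expert_inputs = [p @ hidden for p in expert_in_projs]
    return probs, expert_inputs

rng = np.random.default_rng(0)
shared_dim = 8
gate_w = rng.normal(size=(2, shared_dim))
projs = [rng.normal(size=(8, shared_dim)),    # e.g. a smaller expert's width
         rng.normal(size=(12, shared_dim))]   # e.g. a larger expert's width
probs, inputs = heterogeneous_route(rng.normal(size=shared_dim), gate_w, projs)
```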
5. Configuration Schema Extensions
Extend MoEMergeConfig to include architecture-specific parameters:
```python
from typing import Optional

from pydantic import BaseModel

from mergekit.common import ModelReference


class Expert(BaseModel):
    source_model: ModelReference
    # New fields for heterogeneous support
    target_architecture: Optional[str] = None
    projection_method: Optional[str] = "linear"  # "linear", "pad", "interpolate"
```
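To show the extension in use, here is a self-contained pydantic sketch with a plain string standing in for ModelReference, so it runs without mergekit installed:

```python
from typing import Optional

from pydantic import BaseModel

class ExpertSketch(BaseModel):
    source_model: str  # stand-in for mergekit's ModelReference
    target_architecture: Optional[str] = None
    projection_method: Optional[str] = "linear"

# A hypothetical heterogeneous expert entry
e = ExpertSketch(source_model="mistralai/Mistral-Nemo-Base-2407",
                 target_architecture="llama",
                 projection_method="pad")
```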
Implementation Challenges
- Memory Efficiency: Different model sizes would require careful memory management
- Inference Compatibility: The resulting model would need custom inference code
- Performance Optimization: Heterogeneous experts may impact routing efficiency
- Tokenizer Alignment: Different vocabularies would need to be reconciled or remapped before experts could share inputs
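The tokenizer-alignment challenge can be quantified with a simple vocabulary comparison, sketched below with toy token-to-id maps (real tokenizers have tens of thousands of entries, and even shared tokens can map to different ids, as "the" does here):

```python
def vocab_overlap(vocab_a: dict[str, int], vocab_b: dict[str, int]) -> float:
    """Fraction of tokens shared between two tokenizer vocabularies."""
    shared = vocab_a.keys() & vocab_b.keys()
    return len(shared) / max(len(vocab_a), len(vocab_b))

# Toy vocabularies standing in for two different model families' tokenizers
llama_vocab = {"<s>": 0, "the": 1, "▁cat": 2, "▁sat": 3}
nemo_vocab = {"<s>": 0, "the": 5, "▁cat": 9, "▁dog": 7}
print(vocab_overlap(llama_vocab, nemo_vocab))
```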
Alternative Approach
Instead of full architectural heterogeneity, consider:
- Using models with the same architecture family (e.g., both Llama-based)
- Implementing size adaptation through intermediate projections
- Creating separate MoE layers for different architecture groups
The current architecture system is designed for homogeneous experts to ensure the MoE routing mechanism functions correctly. Enabling true cross-architecture MoE would require significant changes throughout the mergekit MoE system.
Notes
The current MoE implementation in mergekit is intentionally restrictive to ensure model compatibility and proper routing functionality. The architecture compatibility checks prevent runtime errors that would occur from mismatched tensor shapes and layer structures. Any implementation of heterogeneous MoE would need to address these fundamental compatibility issues at the tensor operation level.