MoE Notes

#3
by Naphula - opened

How to create mixture of experts from different archs and sizes?

Based on the mergekit codebase, creating a MoE from different architectures like Llama 3.1 8B and Mistral Nemo 12B is currently not supported due to strict architecture compatibility checks. Here's what would need to be implemented to enable this:

Current Limitations

The MoE system requires all input models to have identical architectures. Each output architecture class (MixtralMoE, DeepseekMoE, QwenMoE, Qwen3MoE) enforces this in its supports_config method by checking that all models report the same model_type.
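As a rough illustration (a stand-in, not mergekit's actual code), the homogeneity check described above boils down to:

```python
def supports_homogeneous_config(expert_model_types):
    """Accept a merge only when every expert reports the same model_type.
    Illustrative stand-in for the supports_config checks described above."""
    return len(set(expert_model_types)) == 1
```

For example, `supports_homogeneous_config(["llama", "mistral"])` would reject a Llama 3.1 + Mistral Nemo merge.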

Required Changes to Enable Cross-Architecture MoE

1. New Heterogeneous MoE Architecture Class

Create a new MoEOutputArchitecture subclass that can handle different architectures:

class HeterogeneousMoE(MoEOutputArchitecture):
    def supports_config(self, config, explain=False, trust_remote_code=False):
        # Allow experts with differing model_types
        return True

    def write_model(self, out_path, config, merge_options,
                    router_weights, shared_router_weights=None):
        # Heterogeneous expert handling would be implemented here
        raise NotImplementedError

2. Tensor Shape Alignment System

The current system assumes identical tensor shapes. You'd need to implement:

  • Shape detection: Analyze tensor shapes from each expert model
  • Alignment strategy: Either pad smaller tensors or project to common dimensions
  • Router adaptation: Handle experts with different hidden sizes
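The shape-detection step can be sketched as follows, assuming a zero-padding alignment strategy; the function name and the per-expert dict format are invented for illustration:

```python
def compute_target_shapes(expert_shapes):
    """Given one {tensor_name: shape} dict per expert model, return the
    elementwise-max shape per tensor -- the padding target under a
    zero-fill alignment strategy."""
    targets = {}
    for shapes in expert_shapes:
        for name, shape in shapes.items():
            prev = targets.get(name)
            targets[name] = tuple(shape) if prev is None else tuple(
                max(a, b) for a, b in zip(prev, shape))
    return targets
```

Under this strategy an 8B expert's tensors would be padded up to the 12B expert's dimensions wherever they disagree.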

3. Architecture-Agnostic Weight Mapping

Modify the weight copying logic in mergekit/moe/common.py to handle architecture differences:

def copy_tensor_out_heterogeneous(weight_info, expert_loader, writer, expert, target_shape):
    # Project or reshape tensors to match the target architecture's shapes.
    # project_tensor is a new helper that would need to be written.
    tensor = expert_loader.get_tensor(weight_info.name)
    if tensor.shape != target_shape:
        tensor = project_tensor(tensor, target_shape)
    writer.save_tensor(weight_info.name, tensor)
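`project_tensor` does not exist in mergekit; a crude stand-in that truncates oversized axes and zero-pads undersized ones (a real implementation would more likely use a learned or pseudo-inverse linear map) could look like:

```python
import numpy as np

def project_tensor(tensor, target_shape):
    """Resize each axis independently: truncate if too large, zero-pad if too small.
    Illustrative only -- naive truncation/padding discards or dilutes information."""
    out = np.asarray(tensor)
    for axis, (src, tgt) in enumerate(zip(out.shape, target_shape)):
        if src > tgt:
            out = np.take(out, range(tgt), axis=axis)   # truncate this axis
        elif src < tgt:
            pads = [(0, 0)] * out.ndim
            pads[axis] = (0, tgt - src)                 # zero-pad this axis
            out = np.pad(out, pads)
    return out
```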

4. Dynamic Router Configuration

The router weights need to accommodate experts with different dimensions. This requires:

  • Per-expert input/output dimension tracking
  • Custom router initialization that handles heterogeneous experts
  • Modified gating logic in the output model configuration
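The requirements above could be combined into a router of roughly the following shape; class name, initialization, and API are all invented for this sketch, and real gating would of course use trained weights:

```python
import numpy as np

class HeterogeneousRouter:
    """Sketch: gate over experts whose hidden sizes differ, projecting the
    shared hidden state into each selected expert's input dimension."""

    def __init__(self, hidden_dim, expert_dims, seed=0):
        rng = np.random.default_rng(seed)
        self.gate = rng.standard_normal((len(expert_dims), hidden_dim))
        # One projection matrix per expert: hidden_dim -> expert_dim
        self.proj = [rng.standard_normal((d, hidden_dim)) for d in expert_dims]

    def route(self, h, top_k=2):
        logits = self.gate @ h
        top = np.argsort(logits)[-top_k:][::-1]       # highest-scoring experts
        w = np.exp(logits[top] - logits[top].max())
        weights = w / w.sum()                         # softmax over selected experts
        inputs = [self.proj[i] @ h for i in top]      # per-expert projected inputs
        return top, weights, inputs
```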

5. Configuration Schema Extensions

Extend MoEMergeConfig to include architecture-specific parameters:

class Expert(BaseModel):
    source_model: ModelReference
    # New fields for heterogeneous support
    target_architecture: Optional[str] = None
    projection_method: Optional[str] = "linear"  # "linear", "pad", "interpolate"
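A self-contained version of the extended schema, with a plain dataclass standing in for pydantic's BaseModel and a string standing in for mergekit's ModelReference (both swapped out so the example runs standalone):

```python
from dataclasses import dataclass
from typing import Optional

VALID_PROJECTIONS = {"linear", "pad", "interpolate"}

@dataclass
class Expert:
    source_model: str                          # stand-in for ModelReference
    target_architecture: Optional[str] = None  # new field for heterogeneous support
    projection_method: str = "linear"

    def __post_init__(self):
        if self.projection_method not in VALID_PROJECTIONS:
            raise ValueError(f"unknown projection_method: {self.projection_method!r}")
```

Validating the enum-like `projection_method` at construction time mirrors what pydantic would give for free.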

Implementation Challenges

  1. Memory Efficiency: Different model sizes would require careful memory management
  2. Inference Compatibility: The resulting model would need custom inference code
  3. Performance Optimization: Heterogeneous experts may impact routing efficiency
  4. Tokenizer Alignment: Different vocabularies would need to be reconciled into a common tokenizer
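For the tokenizer-alignment challenge, one simple (and lossy) approach is to map token ids across vocabularies wherever the token string is shared; a sketch with invented toy vocabularies:

```python
def shared_vocab_map(vocab_a, vocab_b):
    """Map ids in vocab_a to ids in vocab_b for tokens both vocabularies share.
    Tokens unique to vocab_a are simply dropped -- hence lossy."""
    return {id_a: vocab_b[tok] for tok, id_a in vocab_a.items() if tok in vocab_b}
```

A production approach would more likely retrain or extend embeddings for the union vocabulary rather than drop tokens.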

Alternative Approach

Instead of full architectural heterogeneity, consider:

  • Using models with the same architecture family (e.g., both Llama-based)
  • Implementing size adaptation through intermediate projections
  • Creating separate MoE layers for different architecture groups

The current architecture system is designed for homogeneous experts to ensure the MoE routing mechanism functions correctly. Enabling true cross-architecture MoE would require significant changes throughout the mergekit MoE system.

Notes

The current MoE implementation in mergekit is intentionally restrictive to ensure model compatibility and proper routing functionality. The architecture compatibility checks prevent runtime errors that would occur from mismatched tensor shapes and layer structures. Any implementation of heterogeneous MoE would need to address these fundamental compatibility issues at the tensor operation level.
