← All terms

Mixture of Experts (MoE)

An architecture with many expert subnetworks, of which only a few activate per token.

MoE replaces the dense feed-forward layer with multiple expert MLPs. A learned router selects a small subset of experts (top-k, typically 2 to 8) for each token. Total parameters can be huge (DeepSeek V3 has 671B) while active parameters per token stay much smaller (37B for V3; 13B for Mixtral 8x7B, which routes each token to 2 of its 8 experts). Memory cost is governed by total parameters; compute per token, and hence speed, by active parameters.