Scaling Smart: How Meituan's LongCat-Flash Rewrites the MoE Playbook
LongCat-Flash's intelligent compute allocation (ZCE), architectural fusion (ScMoE), and many system-level efficiencies
I am in awe!! Hats off, and very well done to the Meituan team!! Wow!!
They threw the entire (un)imaginable "optimization-at-every-level" book at this model:
TLDR:
1) New Architecture
Two attention blocks per layer + FFN + MoE
→ This is captured under Novel Architecture Designs (MLA + dual FFN/MoE per block, only 28 layers total).
Zero-computation expert (sink for easy tokens)
→ Covered in the Zero-Computation Experts section.
Load balancing with dsv3-like aux loss + decay schedule
→ This is included under Computational Budget Control (PID bias adjustment, loss-free load balancing) and Load Balance Control.
2) Scaling
Variance alignment for MLA/MoE init
→ See below under Variance Alignment Design for Scalability.
Model Growth Initialization (stacking a smaller model)
→ Included under Model Growth Initialization.
Hyperparameter transfer with SP instead of muP
→ Covered under Hyperparameter Transfer, referencing the theory behind parameter scaling rules.
3) Stability
Gradient Norm Ratio & expert cosine similarity for load balancing
→ Included below under router-gradient balancing and monitoring metrics as part of the Training Stability Suite.
Hidden-state z-loss (to avoid large activations)
→ Explicitly in the Training Stability Suite.
Adam epsilon set to 1e-16, tuned relative to RMS range
→ Captured in the stability discussion.
4) Others
Data pipeline: context extraction, quality filtering, deduplication
→ Covered under systematic data pipelines in the pre/mid/post-training breakdown.
Long-context training: 20T tokens pretraining, multi-T mid-training, 100B long-context extension
→ Covered under Long Context Extension with explicit 128K tokens.
Benchmark comparisons (MMLU, GSM8K, top-k expert allocations)
→ Present in the Competitive and Optimized Performance section.
Two new benchmarks (Meeseeks, VitaBench)
→ Explicitly included in the Chat Model Performance list.
Infra/inference details: speculative decoding acceptance, quantization, deployment, kernel optimization, overlap
→ These are under Inference/Deployment Optimizations (MTP speculative decoding, custom kernels, quantization, SBO scheduling).
⚡ LongCat-Flash – Optimized at Every Level
Massive yet Efficient
560B total parameters.
Only 18.6B–31.3B (avg. ~27B) dynamically activated per token → compute efficiency without sacrificing capability.
Dynamic Computation Mechanism
Context-aware activation of experts.
Allocates more compute to “hard” tokens, less to “easy” ones.
Shortcut-Connected MoE (ScMoE)
Expands computation–communication overlap window.
Removes major bottleneck of large MoE models.
Delivers over 100 tokens/sec inference speed at low cost.
Comprehensive Training & Scaling Strategy
Stable and reproducible training across tens of thousands of accelerators.
Specialized initialization, variance alignment, and stability controls → no collapse at scale.
Tailored Data Strategies
Reasoning-intensive, multilingual, and code-rich datasets.
Extended to long context (128k tokens).
Competitive Performance
Released as LongCat-Flash-Chat, a strong non-thinking foundation model.
State-of-the-art results in agentic tasks (reasoning, tool use, coding).
The technical report is one of the most detailed (amongst OSS releases) I have seen to date. It would, however, be remiss not to acknowledge all the prior releases and technical reports for the various models whose optimizations were borrowed, stacked, and learned from - DeepSeek's V3, R1, and V3.1, the Qwen3 series, Zhipu's GLM-4.5, Moonshot's Kimi-K2, etc. Thank you for contributing so very much towards OSS!

Standing on the Shoulders of Giants
Chinese AI model developers, like those behind LongCat-Flash, have indeed built upon and integrated techniques from recent open-source and in-house models (e.g., DeepSeek V3/R1/V3.1, Kimi-K2, GLM-4.5, Tülu). There are, however, many genuine innovations, including LongCat's architecture, that make it another one to watch among the long list of China-based OSS models. Here's how they've thrown the "optimizations book" at every level while borrowing from and integrating proven innovations:
The Few Key Building Blocks and Their Lineage
1. Mixture-of-Experts with Sparse Activation
DeepSeek-V3 scaled MoE to a huge parameter pool with only a subset activated per token (37B active of 671B total).
Kimi-K2 uses MoE with 32B active of 1T total, trained with the new MuonClip optimizer and reporting zero loss spikes.
GLM-4.5 and GLM-4.5-Air follow the same large-total/small-active pattern (32B active of 355B total, and 12B of 106B, respectively).
LongCat-Flash similarly uses MoE plus zero-computation experts to dynamically manage activated parameters (~27B on average).
2. Multi-Token Prediction (Speculative Decoding)
DeepSeek-V3 pioneered multi-token prediction objectives in MoE models, boosting inference efficiency.
LongCat-Flash uses a similar MTP design for speculative decoding with ~90% acceptance rates.
3. Large Context Windows
DeepSeek-V3.1 provides massive 128K token context support with a hybrid thinking/non-thinking mode.
GLM-4.5 family also supports contextual switching and reasoning features on large contexts.
Tülu-3 achieved competitive performance across benchmarks with large-scale fine-tuning and inference optimizations.
LongCat-Flash matches them with MLA for efficient 128K contexts with compressed KV cache.
4. Inference Efficiency and Mode Switching
DeepSeek-V3.1 implements a hybrid Think/Non-Think inference mode that improves tool use and agentic behavior.
GLM-4.5 offers dual reasoning modes for flexible task handling.
Kimi-K2 Instruct provides post-trained, agent-optimized behavior out of the box.
LongCat-Flash incorporates ScMoE + SBO to keep inference efficient alongside its dynamic routing.
5. Training Stability, Data Strategies
Kimi-K2 used the MuonClip optimizer with novel QK clipping and large-scale multi-stage training to avoid loss crashes.
Tülu-3 employs open-source recipes, data pipelines, and vLLM inference tricks for stability.
LongCat-Flash packages similar ideas into deterministic kernels, router balancing, a hidden-state z-loss, half-scale growth initialization, and systematic data pipelines.
Shared Foundations, Local Refinements
Ok, I admit, I now sound almost like one of those banking ads. Hah! Anyway, seriously though: Chinese models are engaged in a cycle of cumulative innovation, taking baseline advances from open sources and from each other, then differentiating via:
ZCE routing and MoE enhancements
Efficient attention for long contexts
Speculative decoding with MTP
Hybrid inference modes
Stability at extreme scales
Dense / sparse training strategies
LongCat-Flash is built atop these shoulders—but injects fresh ideas (e.g., shortcut-connected MoE, aggressive infrastructure optimizations, deterministic kernels), making it a standout in computational and agentic efficiency.
To truly appreciate what went into this, here is a non-exhaustive list of the optimizations I am highlighting - my personal-knowledge guide, basically. The "before" in each entry describes the general/classical mechanism present in many older models; the "after" describes the tweak introduced by Meituan. For those who don't need the details, appreciating the impact is enough to grasp the level of elegant fine-tuning undertaken by the team - hats off, once again:
🔧 LongCat-Flash Optimizations — Comprehensive Before vs After Analysis
A. Architecture-Level Innovations
1. Zero-Computation Experts (ZCE)
Before:
Classic MoE models require routing every token to k FFN experts.
All experts perform heavy matrix multiplications regardless of whether the token is “simple” or “difficult.”
Leads to wasted FLOPs on trivial tokens (e.g., punctuation, common words) and occasional overloading on tokens requiring deeper reasoning.
After:
LongCat-Flash introduces ZCEs — experts that return the input unchanged (no FLOPs).
Router can assign a fraction of tokens to these ZCEs, effectively reducing active computation per token.
A PID controller dynamically adjusts routing scores so the number of active parameters stays near the target budget (~27B vs. 560B total).
Impact:
Reduces unnecessary computation without sacrificing representational power.
Allows flexible compute allocation: hard tokens get more FFN experts, easy tokens get fewer.
Results: During pre-training, mean active experts converged to 8 (~27B params), variance across tokens stayed high → efficient dynamic adaptation.
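To make the mechanism concrete, here is a minimal, hedged sketch of ZCE-style routing in PyTorch. It is not the official implementation: the expert counts, top-k value, and the simple proportional controller standing in for the report's PID are illustrative assumptions; only the idea (identity experts absorb easy tokens, and a feedback bias keeps the average number of real experts near a budget) is taken from the paper.

```python
import torch
import torch.nn as nn


class ZCERouter(nn.Module):
    """Toy router over real FFN experts plus identity ("zero-computation") experts."""

    def __init__(self, d_model=64, n_ffn_experts=8, n_zero_experts=4, top_k=4,
                 target_real_per_token=2.0, kp=0.05):
        super().__init__()
        self.gate = nn.Linear(d_model, n_ffn_experts + n_zero_experts, bias=False)
        self.n_ffn = n_ffn_experts
        self.top_k = top_k
        self.target = target_real_per_token
        self.kp = kp
        # Bias added to the zero-expert logits; nudged up/down to hit the compute budget.
        self.register_buffer("zero_bias", torch.zeros(1))

    def forward(self, x):                       # x: [tokens, d_model]
        logits = self.gate(x)
        logits[:, self.n_ffn:] = logits[:, self.n_ffn:] + self.zero_bias
        topk = logits.topk(self.top_k, dim=-1).indices            # experts chosen per token
        real_per_token = (topk < self.n_ffn).float().sum(-1).mean()
        # Proportional controller (the report uses a PID): too many real experts active
        # -> make zero experts more attractive, and vice versa.
        self.zero_bias += self.kp * (real_per_token - self.target)
        return topk, real_per_token


tokens = torch.randn(256, 64)
router = ZCERouter()
with torch.no_grad():
    for _ in range(300):
        _, real = router(tokens)
print(f"avg real (FFN) experts per token after adaptation: {real.item():.2f}")
```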
2. Shortcut-Connected MoE (ScMoE)
Before:
MoE layers require token dispatch (all-to-all comms) before FFN compute.
Communication time often exceeded compute, leaving accelerators idle.
After:
ScMoE adds a cross-layer shortcut: the dense FFN of the previous layer is executed in parallel with MoE dispatch/combine.
This enlarges the overlap window between computation and communication.
Impact:
Training throughput: removes major communication bottleneck at large scale.
Inference: enables Single Batch Overlap (SBO) scheduling, hiding comms overhead inside compute pipeline.
Reported result: nearly 50% reduction in theoretical time-per-token in inference; training loss curves identical to baseline (quality-neutral).
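A hedged structural sketch of the ScMoE dataflow, in PyTorch. The block shapes, top-1 routing, and single-block layout are simplifications of mine; the point it illustrates is that the MoE path consumes a shortcut input from earlier in the stream, so its communication-heavy dispatch can in principle be launched while the local attention + dense FFN compute is still running.

```python
import torch
import torch.nn as nn


class ScMoEBlock(nn.Module):
    def __init__(self, d=64, n_experts=4):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.dense_ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d, n_experts)

    def moe(self, x):
        # Top-1 routing for brevity; in the real model dispatch/combine are all-to-all comms.
        idx = self.gate(x).argmax(-1)                 # [batch, seq] expert id per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

    def forward(self, x, shortcut):
        # 1) The MoE consumes the *shortcut* input, so its dispatch could be launched now,
        #    overlapping with the attention + dense FFN compute below.
        moe_out = self.moe(shortcut)
        # 2) Meanwhile the local path runs attention and the dense FFN.
        h = self.norm(x)
        h = x + self.attn(h, h, h)[0]
        h = h + self.dense_ffn(h)
        # 3) Combine once both paths are done.
        return h + moe_out


x = torch.randn(2, 16, 64)
block = ScMoEBlock()
y = block(x, shortcut=x)    # in a full stack, `shortcut` would be an earlier layer's output
print(y.shape)
```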
3. Variance Alignment Designs
Before:
Low-rank MLA attention and fine-grained expert partitioning reduced variance in outputs.
Misaligned variances caused unstable activations and degraded convergence when scaling up.
After:
Scale correction factors (αq, αkv) applied to Q/K/V projections in MLA.
Scaling factor (γ) applied to outputs of fine-grained experts to restore variance lost from gating dilution.
Impact:
Stabilized initialization → smoother convergence.
Prevents scale-up degradation seen in earlier MoE models.
Quality at 560B matches or exceeds extrapolations from smaller-scale training.
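A tiny numerical illustration of the γ correction (a toy model, not the report's derivation): combining k roughly unit-variance fine-grained expert outputs with gating weights that sum to 1 dilutes the output variance to about 1/k, and multiplying by γ = √k restores it.

```python
import torch

torch.manual_seed(0)
k, d, tokens = 8, 512, 4096
expert_outputs = torch.randn(k, tokens, d)       # pretend unit-variance expert outputs
gates = torch.full((k,), 1.0 / k)                # uniform gating weights, summing to 1

mixed = (gates[:, None, None] * expert_outputs).sum(0)
gamma = k ** 0.5                                  # variance-restoring scale factor

print(f"diluted mixture variance : {mixed.var().item():.3f}")            # ~1/k
print(f"after gamma rescale      : {(gamma * mixed).var().item():.3f}")  # ~1 again
```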
B. Training Optimizations
4. Dynamic Computation Allocation
Before:
Fixed expert count → static FLOP budget per token.
Inefficient when token complexity varied widely.
After:
ZCE routing allowed per-token variation: 18.6B–31.3B params active (avg. 27B).
Impact:
Computational efficiency improved (same quality with fewer FLOPs).
Model achieves performance parity with larger activated-param peers (e.g., DeepSeek-V3 Base 37B active).
5. Hyperparameter Transfer
Before:
Hyperparameter tuning at scale required expensive sweeps, often unstable.
After:
Used scaling rules: predicted optimal lr, batch size, optimizer settings from smaller proxy models.
Impact:
Reduced compute wasted on tuning.
Allowed reliable jump to trillion-token training without instability.
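A hedged illustration of the transfer idea (the report uses its own SP-based rules; the 1/width scaling rule and exponent below are my assumption, not theirs): tune on a cheap narrow proxy, then map the learning rate to the target width with a deterministic rule instead of re-sweeping at full scale.

```python
def transfer_lr(proxy_lr: float, proxy_width: int, target_width: int,
                exponent: float = 1.0) -> float:
    """Map a learning rate tuned on a narrow proxy model to a wider target model.

    exponent=1.0 corresponds to the common lr ~ 1/width heuristic for hidden layers;
    the exact rule and exponent here are assumptions, not taken from the LongCat report.
    """
    return proxy_lr * (proxy_width / target_width) ** exponent


# Example: learning rate swept on a 1024-wide proxy, transferred to a 6144-wide target.
print(f"target lr ~= {transfer_lr(3e-3, 1024, 6144):.2e}")
```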
6. Model Growth Initialization
Before:
Random init at 560B scale produced gradient explosion/collapse.
After:
Pre-trained half-scale checkpoint grown to full size using layer-stacking.
Impact:
Faster convergence.
Higher baseline quality from start.
Ablations: full-scale random init had higher loss; growth-init curve tracked smoothly.
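A minimal sketch of growth by layer stacking, assuming the simplest possible scheme (duplicate the trained half-depth stack to double the depth); the report's exact stacking order may differ.

```python
import copy
import torch.nn as nn


def make_block(d=64):
    return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))


# Pretend this half-depth stack has already been trained for a while.
half_depth_model = nn.ModuleList([make_block() for _ in range(4)])

# Stack: every trained block is deep-copied, so the grown model starts from learned weights
# instead of a random init at full depth.
grown_model = nn.ModuleList(
    [copy.deepcopy(block) for block in list(half_depth_model) + list(half_depth_model)]
)
print(len(half_depth_model), "->", len(grown_model), "layers")
```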
7. Training Stability Suite
Before:
MoE models prone to “expert collapse” (router sends tokens to only a few experts).
Loss spikes → irrecoverable training crashes.
After:
Router-gradient balancing: equalizes backprop signal across experts.
Hidden z-loss: penalizes over-confident logits, suppresses instabilities.
Adam ε tuned to 1e-16 to prevent numerical blow-ups.
Impact:
Training runs stable across 20T tokens.
No collapse events reported.
Consistent convergence across scales.
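A hedged sketch of two of these knobs. The hidden z-loss form below (a squared-log-RMS penalty on hidden states) is an illustrative assumption rather than the report's exact formula; the Adam epsilon of 1e-16 is as reported.

```python
import torch


def hidden_z_loss(hidden: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    # Penalize hidden-state RMS values that drift far from ~1, keeping activations bounded.
    rms = hidden.pow(2).mean(dim=-1).add(1e-12).sqrt()
    return coeff * torch.log(rms).pow(2).mean()


model = torch.nn.Linear(64, 64)
# eps=1e-16 keeps Adam's denominator from swamping tiny-but-meaningful gradient RMS values.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, eps=1e-16)

x = torch.randn(8, 64)
hidden = model(x)
loss = hidden.pow(2).mean() + hidden_z_loss(hidden)
loss.backward()
optimizer.step()
```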
8. Deterministic Training + Custom Kernels
Before:
Non-deterministic ops (e.g., ScatterAdd, GEMM accumulation) → run-to-run drift.
Silent data corruption (SDC) undetectable at 10k+ GPU scale.
After:
Custom deterministic kernels:
FlashAttention Gradients
ScatterAdd
Grouped GEMM
Fused GemmAdd (adds in FP32 inside GEMM epilogue)
Nearly same throughput as non-deterministic kernels.
Impact:
Exact reproducibility.
SDC detection possible.
Confidence in large-scale multi-week training jobs.
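For context, the stock PyTorch switches below give the same run-to-run reproducibility guarantee at small scale; LongCat's contribution is custom kernels that achieve it with almost no throughput loss.

```python
import os
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # needed for deterministic cuBLAS GEMMs
torch.manual_seed(1234)
torch.use_deterministic_algorithms(True)            # raise on ops with no deterministic impl
torch.backends.cudnn.benchmark = False              # autotuning would break bitwise repeatability

# From here on, any op without a deterministic implementation (e.g. some CUDA scatter_add
# paths) errors out instead of silently introducing run-to-run drift.
```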
9. Distributed Training Strategy
Before:
Memory usage exceeded 80GB per GPU.
Communication latency from all-gathers and reduce-scatters.
After:
Multi-dimensional parallelism: EP + TP + PP + DP + Context Parallelism.
V-ZB memory balancing reduced peak memory to below 60GB.
Impact:
Enabled training on H800-80GB GPUs.
98.5% training availability across 20T tokens in 30 days.
C. Inference & Deployment
10. Multi-head Latent Attention (MLA)
Before:
Quadratic complexity attention (O(n²)) → infeasible for 128k contexts.
After:
MLA compresses K/V into a compact latent, so the cost that actually limits 128k inference (KV-cache memory and bandwidth) drops sharply.
Splits RoPE/NOPE dims to balance positional encoding.
Impact:
Supports 128k context efficiently.
KV cache significantly smaller → less bandwidth per step.
11. KV-Projection Factorization (LoRA-style, from config)
In the Technical Report (PDF):
Describes Multi-head Latent Attention (MLA) and variance alignment mechanisms.
Explains the use of scale correction factors (αq, αkv) to stabilize initialization and balance variance in Q/K projections.
Highlights that MLA reduces KV-cache size and bandwidth pressure, enabling 128k context.
❌ Does not mention LoRA explicitly.
In the Hugging Face configuration (configuration_longcat_flash.py):
Confirms that low-rank factorization of Q/K/V projections is actually implemented.
Parameters:
q_lora_rank = 1536
kv_lora_rank = 512
This shows the design is LoRA-style factorization, reducing KV-cache dimensionality while preserving representational power.
Before:
Full-rank Q/K/V projections → large KV cache, memory bandwidth bottleneck at long contexts.
After:
Factorized (LoRA-style) Q/K/V → smaller cache, reduced memory footprint, less communication overhead.
Impact:
Practical feasibility of 128k-token context windows without prohibitive memory use.
Higher throughput and lower latency in inference.
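A hedged sketch of the LoRA-style compression path for K/V. The kv_lora_rank of 512 comes from the published config; the model width, head count, and head dimension below are illustrative assumptions used only to show how much smaller the cached latent is than full K/V.

```python
import torch
import torch.nn as nn

d_model, kv_lora_rank, n_heads, head_dim = 6144, 512, 64, 128


class LowRankKV(nn.Module):
    """Cache a small latent c_kv per token; expand to full K/V only when attending."""

    def __init__(self):
        super().__init__()
        self.down = nn.Linear(d_model, kv_lora_rank, bias=False)             # compress
        self.up_k = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)  # decompress K
        self.up_v = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)  # decompress V

    def forward(self, x):
        c_kv = self.down(x)              # this latent is what the KV cache actually stores
        return c_kv, self.up_k(c_kv), self.up_v(c_kv)


x = torch.randn(1, 4, d_model)
_, k, v = LowRankKV()(x)

full_cache = 2 * n_heads * head_dim      # floats per token for uncompressed K+V
latent_cache = kv_lora_rank              # floats per token when caching only c_kv
print(f"K shape {tuple(k.shape)}, cache per token: {full_cache} -> {latent_cache} floats "
      f"(~{full_cache / latent_cache:.0f}x smaller)")
```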
12. Multi-Token Prediction (MTP)
Before:
Each step predicted 1 token → slow decoding.
After:
Lightweight MTP head (single dense layer) predicts multiple tokens.
~90% acceptance rate in speculative decoding pipeline.
Impact:
Significant latency reduction in generation.
Maintains near-baseline accuracy.
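A toy sketch of the verify-and-accept loop that speculative decoding builds on (greedy matching only; the real pipeline uses the model's lightweight MTP head as the drafter and a proper acceptance rule):

```python
def speculative_step(target_next_tokens, draft_tokens):
    """Return the accepted prefix of the draft plus the first corrected token."""
    accepted = []
    for draft_tok, target_tok in zip(draft_tokens, target_next_tokens):
        if draft_tok == target_tok:
            accepted.append(draft_tok)       # draft agreed with the target -> free token
        else:
            accepted.append(target_tok)      # first disagreement: keep target's token, stop
            break
    return accepted


draft = [42, 7, 19, 3]                       # tokens proposed cheaply by the MTP head
target = [42, 7, 19, 99]                     # what the full model would have produced
print(speculative_step(target, draft))       # -> [42, 7, 19, 99]; 3 of 4 drafts accepted
```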
13. Single-Batch Overlap (SBO)
Before:
Compute and communication sequential; latency compounded.
After:
Reorders execution; overlaps NVLink intra-node bandwidth with RDMA inter-node comm.
Impact:
100 tokens/sec throughput on H800.
Cost per million tokens: $0.70.
14. TVD Fusing & Scheduler Optimizations
Before:
Separate CUDA graph launches for Target, Verification, Draft passes.
CPU sync overhead.
After:
TVD fused into a single CUDA graph.
Multi-step overlapped scheduler → continuous GPU occupancy.
Impact:
TTFT reduced under high QPS.
Safe dynamic KV cache allocation.
15. Custom Kernels
Before:
Standard GEMM inefficient for irregular expert batches.
Communication limited by NCCL/MSCCL++.
After:
MoE GEMM: SwapAB trick improves tensor-core usage.
Fused GemmAdd reduces memory traffic.
NVLink Sharp kernels accelerate all-to-all ops.
Impact:
Higher tensor utilization, reduced latency.
Communication faster than NCCL/MSCCL++ baselines.
16. Quantization
Before:
BF16/FP16 → high memory & compute.
After:
Blockwise quantization: [1,44] activations, [44,44] weights.
Layer-wise mixed precision (some FP8).
Impact:
Lower memory.
Maintained accuracy with faster inference.
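A hedged sketch of blockwise quantization with per-block scales, applied to a weight matrix. int8 stands in for the FP8 formats mentioned above, and only the weight-side [44,44] blocking is shown; activations would use the [1,44] row blocks.

```python
import torch


def blockwise_quant(w: torch.Tensor, block: int = 44, qmax: float = 127.0):
    """Quantize each (block x block) tile of `w` with its own scale factor."""
    rows, cols = w.shape
    q = torch.empty_like(w, dtype=torch.int8)
    scales = torch.empty(rows // block, cols // block)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scale = tile.abs().max() / qmax          # per-tile scale limits outlier damage
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = torch.round(tile / scale).to(torch.int8)
    return q, scales


w = torch.randn(44 * 4, 44 * 4)
q, scales = blockwise_quant(w)
w_hat = q.float() * scales.repeat_interleave(44, dim=0).repeat_interleave(44, dim=1)
print(f"max reconstruction error: {(w - w_hat).abs().max().item():.4f}")
```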
17. Deployment Architecture (PD-Disaggregated)
Before:
Prefill and decode competed on same hardware.
After:
Prefill and decode disaggregated into separate nodes.
KV cache transmitted layer-wise.
Impact:
Lower TTFT under high load.
Wide EP with DeepEP support for ZCE minimizes comm overhead.
D. System & Infrastructure
18. DeepEP
Before:
MoE expert comms bottlenecked at scale.
After:
Custom comm backend optimized for MoE.
Impact:
Higher throughput at 1000s of GPUs.
19. NVLS (Collectives Library)
Before:
NCCL/MSCCL++ limited throughput at large cluster sizes.
After:
NVLS optimized collectives for LongCat workloads.
Impact:
Reduced collective comm bottlenecks.
20. NVLink Multimem
Before:
Memory replication to multiple GPUs required sequential copies.
After:
Hardware-level broadcast.
Impact:
Faster GPU-to-GPU communication.
21. Sharp-enabled Switch Reductions
Before:
All reductions executed on GPUs.
After:
In-switch reductions offload compute to network switch.
Impact:
Lower latency sync across nodes.
22. Multi-dimensional Parallelism
Before:
Training limited by single parallel dimension.
After:
Uses TP + EP + PP + DP + CP.
Impact:
Near-linear scaling across tens of thousands of accelerators.
✅ High-Level Overview
Training Stability: router-balancing, z-loss, deterministic kernels = stable at 560B.
Efficiency: ZCE, ScMoE, MLA, LoRA, quantization = less compute per token, same quality.
Throughput: SBO, MTP, fused kernels = >100 TPS on H800.
Scaling: Hyperparam transfer, growth init, multi-parallelism = linear scaling across 10k+ GPUs.
Reliability: 98.5% availability with automated recovery.
Context Window: MLA+LoRA enable 128k tokens.
📊 LongCat-Flash Optimizations — Master Comparison Table
🌟 LongCat-Flash: Top 5 Breakthrough Optimizations
📌 Why these matter most:
ScMoE + SBO = System-level efficiency, directly drives training/inference cost savings.
ZCE = Smarter compute usage, more “bang for FLOP.”
MLA + KV factorization = Unlocks 128k context without exploding memory.
MTP = Real-world inference latency gains.
Stability Suite = Makes training at this insane scale actually possible.
As a mark of respect and much appreciation for the amazing teamwork, I would additionally like to recommend the following "watch". I find the expression of views rational and open-minded. Basically, logical:
Elie Bakouch (on X)

References:
Deploying Longcat-Flash: https://lmsys.org/blog/2025-09-01-sglang-longcat-flash/
Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat
GitHub: https://github.com/meituan-longcat/LongCat-Flash-Chat
LongCat-Flash Technical Report: https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/tech_report.pdf
Elie Bakouch (on X): https://x.com/eliebakouch/status/1961999252311204147









