GLM MoE DSA — Sparse Mixture-of-Experts with DeepSeek Sparse Attention

Architecture behind GLM-5.1 (and GLM-5): 744B total parameters with only 40B active per token, using DeepSeek Sparse Attention to reduce deployment cost while preserving long-context capacity. Relevant because this is the frontier of what can run on the M3 Ultra (512GB).

Key Info

  • Developer: Zhipu AI (zai-org)
  • Architecture: glm_moe_dsa — MoE with DeepSeek Sparse Attention + Multi-head Latent Attention (MLA)
  • Total params: 744B (GLM-5.1: 754B per HF)
  • Active params: 40B per token
  • Training data: 28.5T tokens (up from GLM-4.5's 23T)
  • License: MIT

Sources

  • HuggingFace model card — April 7, 2026
  • GitHub repo — architecture details
  • llama.cpp PR #19460 — GGUF conversion support (merged Feb 13, 2026)
  • mlx-lm issue #879 / PR #881 — full MLX support (merged Feb 2026)
  • CC Sam's research session — April 7, 2026

Why It Matters

This is the largest open-weight model that could plausibly run on our hardware. At Q4 quantization (~361GB for IQ4_XS), it fits on the M3 Ultra with room for KV cache. The MoE architecture means only 40B params are active per token — comparable to running a 40B dense model in terms of compute, but with 744B parameters' worth of knowledge encoded in the expert weights.
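The active-vs-total split comes from MoE routing: a gate scores every expert for each token, and only the top-k experts actually run. A minimal sketch with toy shapes (numpy; names and dimensions are illustrative, not GLM's actual router):

```python
import numpy as np

def moe_route(x, gate_w, experts, k=8):
    """Toy top-k MoE routing: only k of the experts run per token.

    x        : (d,) token hidden state
    gate_w   : (n_experts, d) router weights
    experts  : list of n_experts weight matrices, each (d, d)
    Shapes and names are illustrative, not GLM's actual config.
    """
    logits = gate_w @ x                           # router score per expert
    topk = np.argsort(logits)[-k:]                # indices of the k best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                      # softmax over the selected k only
    # Only the selected experts' parameters touch this token:
    return sum(w * (experts[i] @ x) for w, i in zip(weights, topk))

rng = np.random.default_rng(0)
d, n_experts = 64, 32
out = moe_route(rng.normal(size=d),
                rng.normal(size=(n_experts, d)),
                [rng.normal(size=(d, d)) for _ in range(n_experts)],
                k=8)
```

Per-token compute scales with k, not with the total expert count — which is why a 744B-total model can cost roughly as much as a 40B dense one at inference time.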

The DSA (DeepSeek Sparse Attention) component is particularly interesting: it learns which KV positions to attend to instead of attending to everything, which is what enables long context at this scale. Without DSA, a 744B model with long context would be impractical even on datacenter hardware.
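The mechanism can be sketched as: a cheap indexer scores every cached KV position, and full attention then runs only over the top-k selected positions. A toy single-query version (numpy; the function name, shapes, and the precomputed `idx_scores` are illustrative assumptions, not GLM's implementation):

```python
import numpy as np

def sparse_attend(q, K, V, idx_scores, k=64):
    """Sketch of DSA-style sparse attention for one query.

    A lightweight "indexer" has already scored every cached KV position
    (idx_scores); we attend over only the top-k positions instead of all
    of them, so attention cost scales with k, not context length.
    """
    keep = np.argsort(idx_scores)[-k:]            # positions the indexer selects
    scores = K[keep] @ q / np.sqrt(q.shape[0])    # attention logits, selected KV only
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # softmax over k positions
    return probs @ V[keep]                        # weighted sum of k values

rng = np.random.default_rng(1)
d, ctx = 32, 1024
out = sparse_attend(rng.normal(size=d),
                    rng.normal(size=(ctx, d)),
                    rng.normal(size=(ctx, d)),
                    rng.normal(size=ctx), k=64)
```

Here attention touches 64 of 1024 cached positions; in a real long-context run the gap between k and context length is far larger, which is where the savings come from.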

For our fleet: this is the model that could make the M3 Ultra's 512GB genuinely differentiated. No API cost, no rate limits, running the frontier of open-source locally.

Key Ideas

  • MoE + Sparse Attention = two levels of sparsity (expert selection AND attention selection)
  • Only 40B active per token despite 744B total — inference cost comparable to a 40B dense model
  • DSA selects top-K key-value positions per query instead of full attention
  • MLA (Multi-head Latent Attention) uses compressed KV representations — further memory savings
  • BF16 is ~1.5TB, FP8 is ~744GB, Q4 is ~361GB (IQ4_XS) — only Q4 and below fit the M3 Ultra
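Those sizes follow directly from params × bits-per-weight ÷ 8. The Q4 figure assumes ~3.9 effective bits per weight for IQ4_XS, which is an approximation on my part:

```python
def model_bytes_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (1e9 bytes): params * bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

total = 744  # total parameters, in billions
print(f"BF16: {model_bytes_gb(total, 16) / 1000:.2f} TB")  # prints "BF16: 1.49 TB"
print(f"FP8 : {model_bytes_gb(total, 8):.0f} GB")          # prints "FP8 : 744 GB"
print(f"~Q4 : {model_bytes_gb(total, 3.9):.0f} GB")        # ~3.9 bpw assumed for IQ4_XS
```

Note this counts weights only — the KV cache and activations come on top, which is why the ~361GB Q4 build still needs headroom within 512GB.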

Inference Stack Status (April 2026)

  • llama.cpp: GGUF support merged, but the DSA indexer is NOT implemented. The model runs, but output quality is suboptimal.
  • mlx-lm: Full support including DSA indexer, MLA, and MoE routing. The better path for Apple Silicon.
  • LM Studio: Has both llama.cpp 2.12.0 and mlx-lm 1.5.0 runtimes. Model not yet indexed in the catalog (just released). The MLX backend is the right choice when available.
  • vLLM/SGLang/KTransformers: Full support for GPU inference.

Connections

  • LM Studio — inference runtime
  • M3 Ultra — target hardware (512GB unified memory)
  • GLM-4.7 Flash — predecessor, already running on M3 Ultra (30B, glm4_moe_lite)

Timeline

  • 2026-04-07 | GLM-5.1 released on HuggingFace. CC Sam researched hardware compatibility, inference stack status, and quantization options for M3 Ultra. Download started via LM Studio (~9 hour ETA). [Source: CC Sam research session, Telegram]