Releases: NVIDIA/Megatron-LM

NVIDIA Megatron Core 0.15.0

05 Dec 01:35
Tag: core_v0.15.0 (commit 0112156)

  • Features
    • Performance
      • Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% E2E) (MR !3912); a sketch of the caching idea follows this feature list
      • Use new TE interface for user buffers (MR !3886)
      • Add CPU activation offloading via TE (MR !4286)
      • Add configurable double buffering (MR !4026)
      • Add Muon optimizer and distributed optimizer support (MR !4106)
      • Add setting to support Adam or AdamW optimizer (MR !3866)
    • MoE
      • Add DTensor support for EP and DSv3 modules (MR !3955)
      • Add HybridEP backend to Flex Dispatcher (MR !4237)
      • Support FP8 recomputation for MoE components (MR !4030)
      • Implement NVFP4 Zero Padding for MoE (MR !4225)
      • Compute shared experts before router (MR !4068)
      • Enable bias in expert MLP (MR !3858)
    • Model support
      • Add YaRN support for GPT-OSS (MR !4044)
      • Add support for Qwen3-Next arguments (MR !4070)
      • Add FP8 init for MTP (MR !3958)
      • Add fp8_dpa option for FP8 scaling (MR !4053)
      • Add RADIO-g support to converter and tester (MR !4371)
      • Add audio semantic reasoning data for voice chat and speech instructions (MR !4397)
    • FSDP
      • Enable joint training of parallel modules (MR !3850)
      • Add support for multimodule communication (MR !4235)
    • Inference
      • Add CUDA Graph runner lookup table cache (up to 2x E2E speedup) (MR !4082)
      • Add MoE dropping and padding router for CUDA Graph + decode (MR !3816)
      • Dynamic audio shapes with variable sequence lengths (2.5x throughput improvement) (MR !4274)
      • Integrate unified memory for dynamic inference context (MR !3985)
    • Post-training
      • Add GPT-OSS ModelOpt support with quantization, import/export (MR !4169)
      • Enable knowledge-distillation (KD) support with hybrid training loop (MR !4021)
      • Add ModelOpt pruning example (MR !4022)
    • RL
      • Add importance sampling and partial rollouts to Megatron RL (MR !4000)
      • Add sequence packing for RL (MR !4191)
    • Ease of use
      • Handle CUDA absence during import (MR !4120)
      • Add granary dataloader functionality (MR !4291)
      • Enable mixing sliding-window attention (SWA) with standard attention (MR !3855)
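
The fused QKV preprocessing above leans on precomputing the RoPE rotation tables once and reusing them every step. A minimal sketch of that caching idea in plain PyTorch; the class and shapes are illustrative, not Megatron's internal API:

```python
import torch

class RoPECache:
    """Precompute cos/sin tables once so applying RoPE per step is a
    cheap slice-and-multiply instead of recomputing angles each pass."""

    def __init__(self, head_dim: int, max_seq_len: int, base: float = 10000.0):
        inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
        angles = torch.outer(torch.arange(max_seq_len).float(), inv_freq)
        emb = torch.cat([angles, angles], dim=-1)   # [max_seq_len, head_dim]
        self.cos, self.sin = emb.cos(), emb.sin()

    def apply(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq, heads, head_dim]; slice the cached tables to the live length.
        seq, half = x.shape[0], x.shape[-1] // 2
        cos, sin = self.cos[:seq, None, :], self.sin[:seq, None, :]
        rotated = torch.cat([-x[..., half:], x[..., :half]], dim=-1)
        return x * cos + rotated * sin

cache = RoPECache(head_dim=128, max_seq_len=4096)  # built once, reused every step
q_rot = cache.apply(torch.randn(16, 8, 128))       # [seq, heads, head_dim]
```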
  • Bug fixes
    • Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
    • Fix loss mask cloning to prevent incorrect updates (MR !4164)
    • Fix metadata loss in checkpoints (MR !4182)
    • Fix FSDP grad accum fusion support (MR !4018)
    • Fix non-TE optimizer checkpoint issue (MR !3931)
    • Fix BERT virtual pipeline parallelism (MR !3993)
    • Fix gc.freeze() slowdown by adding gc.collect() on the last layer (MR !4003); a minimal illustration follows this bug-fix list
    • Fix full iteration CUDA graph non-tensor handling (MR !4019)
    • Fix mis-set model_auto_sync and add gradient assertion (MR !4062)
    • Fix HF import dtype and checkpoint loading issues (MR !4095)
    • Fix missing initialization in ProcessGroupCollection (MR !4159)
    • Fix sink attention TP (MR !4173)
    • Fix num_microbatches calculation (MR !4199)
    • Fix 1f1b overlap unit tests for MTP standalone (MR !4210)
    • Fix stale state dict handling (MR !4226)
    • Fix dataset divergence with tokenizer PAD handling (MR !4231)
    • Fix parameter initialization (MR !4296)
    • Ensure tensor-parallel attributes are set regardless of the initialization flag (MR !4312)
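
The gc.freeze() fix follows a known CPython pattern: collect first, then freeze, so dead objects are not promoted into the permanent generation. A minimal illustration of the ordering (where exactly Megatron triggers it, on the last layer, is specific to MR !4003):

```python
import gc

def freeze_after_setup():
    # Collect first so garbage produced during setup is actually freed ...
    gc.collect()
    # ... then move the survivors into the permanent generation, which the
    # collector never scans again -- avoiding repeated full-heap passes
    # (and the slowdown) during steady-state training.
    gc.freeze()
```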
  • Known issues

NVIDIA Megatron Core 0.14.0

08 Oct 15:04

  • Features
    • Inference
      • Add async support for DynamicInferenceEngine (MR !3187)
      • Pad input tensors and enable FP8 weights for FP8 inference (MR !3341)
      • Force inference to always gather logits with tensor parallelism (MR !3442)
      • Multi-batch-size CUDA Graphs for dynamic inference (MR !3402)
    • Post-training
      • ModelOpt updates (MR !3268)
        • Add speculative decoding AR validation feature
        • Add DeepSeek and Qwen model configs
    • Performance
      • ModelCommProcessGroup integration (MR !3391)
      • Add HyperCommGrid: an N-dimensional communication grid for model parallelism (MR !3398)
        • Flexible creation and management of communication groups; a generic sketch of the grid idea follows this feature list
      • Add support for Spike No More embedding initializations and weight decay skipping (MR !3500)
    • MoE
      • We are actively optimizing large-scale, fine-grained MoE performance on the Blackwell platform.
      • Memory optimization
        • Support recomputation for FP8 layernorm/moe_act/shared_experts (MR !3465)
        • Support optimizer offloading for DSv3 FP8 training (MR !3659)
      • Bug fixes
        • Fix router input jitter dtype (MR !3774)
    • Ease of use
      • Add uv support for source installs (MR !3615)
      • Automated weekly prereleases (MR !3574)
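
These notes do not show HyperCommGrid's API, so here is a generic sketch of the underlying idea: arrange all ranks in an N-dimensional grid and build one process group per slice along each named dimension, using plain torch.distributed (all names here are illustrative, not the HyperCommGrid interface):

```python
import torch
import torch.distributed as dist

def make_grid_groups(dims: dict[str, int]) -> dict[str, dist.ProcessGroup]:
    """Arrange all ranks in an N-D grid and create one process group per
    slice along each named dimension (e.g. tp / dp / pp). Assumes the
    default process group is already initialized (e.g. via torchrun)."""
    shape = list(dims.values())
    grid = torch.arange(dist.get_world_size()).reshape(shape)
    my_groups = {}
    for axis, name in enumerate(dims):
        # Move the target axis last and flatten the rest:
        # each resulting row is the rank list of one group.
        rows = grid.movedim(axis, -1).reshape(-1, shape[axis])
        for ranks in rows.tolist():
            group = dist.new_group(ranks=ranks)  # collective: all ranks call it
            if dist.get_rank() in ranks:
                my_groups[name] = group
    return my_groups

# With 16 ranks, a (tp=2, dp=4, pp=2) grid:
#   groups = make_grid_groups({"tp": 2, "dp": 4, "pp": 2})
```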
  • Bug fixes
    • Use mscale_all_dim for softmax_factor (MR !2800)
    • Fix FP8 param blockwise scaling unit test (MR !3480)
    • Fix unit test blockwise scaling (MR !3491)
    • Optimize prefill for token-less requests (MR !3499)
    • Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
    • Fix CUDA graph logic for flexible pp layout (MR !3505)
    • Load FP8 models with strict=False (MR !3508); a loading sketch follows this bug-fix list
    • Skip RoPE check for torch < 1.4.0 (MR !3528)
    • Disable Apex tests for stability (MR !3539)
    • Fix typo in parallel_state expert parallelism (MR !3548)
    • Guard modelopt on macOS (MR !3549)
    • Retry on CUDA function failure (MR !3554)
    • Fix NCCL mem pool creation error (MR !3557)
    • Fix get_rotary_seq_len return type (MR !3559)
    • Retry on CUDA function failure (MR !3560)
    • Fix NCCL allocator attribute error (MR !3565)
    • Ensure multi-prompt inference works (MR !3568)
    • Fix MD5 on FIPS systems (MR !3577)
    • Fix dynamic context and inference bugs (MR !3582)
    • Fix TE version for interleaved fused RoPE (MR !3586)
    • Fix MTP with MoE and TP logging (MR !3594)
    • Guard TE import fix (MR !3596)
    • Add assertion for NCCL UB case (MR !3599)
    • Remove encoder-PP-related functions (MR !3604)
    • Fix segfaults in tests (MR !3605)
    • Fix TE error in distributed optimizer (MR !3625)
    • Remove redundant barrier in checkpoint flow (MR !3626)
    • Support VPP MTP, fix logging (MR !3630)
    • Add retry mechanism for free(): invalid pointer errors (MR !3632)
    • Fix test_replication.py issues (MR !3633)
    • Fix typo in parallel_state (MR !3634)
    • Fix CUDA graph logic determination (MR !3635)
    • Fix TE installation error (MR !3636)
    • Ensure correct sharding type in local tests (MR !3643)
    • Fix cudagraphed backward buffer reuse for last layer (MR !3645)
    • Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
    • Fix dynamic example script errors (MR !3653)
    • Guard TE import fix (MR !3666)
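
The strict=False change relies on standard PyTorch behavior: load_state_dict reports mismatched keys instead of raising on them, which helps when FP8 checkpoints carry extra scaling state. A runnable toy illustration (the model and the fp8_scale key are hypothetical stand-ins):

```python
import torch
from torch import nn

model = nn.Linear(8, 8)  # stand-in for the real FP8 model

# Simulate a checkpoint carrying an extra entry, as FP8 metadata often does.
state_dict = model.state_dict()
state_dict["fp8_scale"] = torch.ones(1)  # hypothetical extra key

# strict=False tolerates keys present on only one side and returns them
# for inspection instead of raising a RuntimeError.
result = model.load_state_dict(state_dict, strict=False)
print("missing:", result.missing_keys)        # []
print("unexpected:", result.unexpected_keys)  # ['fp8_scale']
```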
  • Breaking changes:
    • megatron.core.distributed.custom_fsdp has been refactored into megatron.core.distributed.fsdp.src.megatron_fsdp; call sites must update their imports (see the sketch below)
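
A minimal compatibility shim for the path change; the imported symbol FullyShardedDataParallel is illustrative, so substitute whatever your code pulled from custom_fsdp:

```python
# Works across the 0.13 -> 0.14 rename; the symbol name is illustrative.
try:
    # New location (Megatron Core >= 0.14):
    from megatron.core.distributed.fsdp.src.megatron_fsdp import FullyShardedDataParallel
except ImportError:
    # Old location (Megatron Core <= 0.13):
    from megatron.core.distributed.custom_fsdp import FullyShardedDataParallel
```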
  • Known issues

25.09-alpha.rc1

03 Oct 14:41

Add fp8 attn knobs

NVIDIA Megatron Core 0.13.1

12 Aug 18:33

Merge branch 'cherry-pick-f36e1705' into 'core_r0.13.0'

Cherry-pick 'Use ruff linter (3627)' into 'core_r0.13.0'

See merge request ADLR/megatron-lm!3793

NVIDIA Megatron Core 0.14.0rc5

11 Aug 04:12

Prerelease: NVIDIA Megatron Core 0.14.0rc5 (2025-08-11)

NVIDIA Megatron Core 0.12.3

12 Aug 18:12

Merge branch 'chtruong/cherry-pick-3627' into 'core_r0.12.0'

Cherry-pick 'use yaml safe load  (3627)' into 'core_r0.12.0'

See merge request ADLR/megatron-lm!3795

NVIDIA Megatron Core 0.14.0rc4

04 Aug 04:12

Prerelease: NVIDIA Megatron Core 0.14.0rc4 (2025-08-04)

NVIDIA Megatron Core 0.14.0rc3

28 Jul 04:13

Prerelease: NVIDIA Megatron Core 0.14.0rc3 (2025-07-28)

NVIDIA Megatron Core 0.13.0

25 Jul 18:04

  • Support bf16 dtype for optimizer states to use the precision-aware optimizer in TransformerEngine
  • MoE
    • Features:
      • Flexible Asymmetric Virtual Pipeline Parallelism with Custom Pipeline Layout (--pipeline-model-parallel-layout)
      • Add support for passing custom parallelism groups to MoE modules
      • Add Hybrid Shard Data-Parallel support for MoE models (--num-distributed-optimizer-instances)
      • Support EP + custom FSDP training for DeepSeek-V3
      • FP8 support for Multi-Token-Prediction
    • Memory Optimization
      • Fine-grained recomputation to reduce activation memory (--recompute-modules with --recompute-granularity selective)
      • Memory-efficient token permutation, achieved by moving the probs multiplication from unpermutation into the activation function of GroupedMLP
    • Performance Optimization
      • MLA RoPE fusion kernel and YaRN embedding cache
      • FP8 padding optimization for MoE models by padding the routing map
    • Bug fixes:
      • Fix the aux loss calculation when expert_bias or group-limited routing is used; this changes load_balancing_loss values relative to previous versions (a baseline sketch of the aux loss follows these notes)
      • Fix packed sequence support for MLA
    • Known Issues:
      • MTP is not compatible with the flexible pipeline layout; to be fixed in !3594
      • MTP convergence issue with TP2; to be fixed in !3594
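
For context on the aux-loss fix above, the baseline load-balancing auxiliary loss (Switch Transformer style) multiplies each expert's dispatched-token fraction by its mean router probability. Megatron's variant with expert_bias and group-limited routing differs in detail, so treat this as a reference sketch:

```python
import torch

def load_balancing_aux_loss(router_logits: torch.Tensor, top_k: int = 1) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss:
        loss = num_experts * sum_i f_i * P_i
    where f_i is the fraction of tokens dispatched to expert i and
    P_i is the mean router probability assigned to expert i."""
    num_tokens, num_experts = router_logits.shape
    probs = router_logits.softmax(dim=-1)                     # [tokens, experts]
    top = probs.topk(top_k, dim=-1).indices                   # [tokens, top_k]
    dispatch = torch.zeros_like(probs).scatter_(-1, top, 1.0)
    f = dispatch.sum(dim=0) / (num_tokens * top_k)            # dispatched fraction
    p = probs.mean(dim=0)                                     # mean router probability
    return num_experts * torch.sum(f * p)

# A perfectly uniform router attains the minimum value of 1.0:
print(load_balancing_aux_loss(torch.zeros(128, 8)))  # tensor(1.)
```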

NVIDIA Megatron Core 0.14.0rc2

21 Jul 04:12

Prerelease: NVIDIA Megatron Core 0.14.0rc2 (2025-07-21)