Releases · NVIDIA/Megatron-LM

05 Dec 01:35

ko3n1g

core_v0.15.0

0112156

NVIDIA Megatron Core 0.15.0 Latest

Latest

Features
- Performance
  - Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% E2E) (MR !3912)
  - Use new TE interface for user buffers (MR !3886)
  - Add CPU activation offloading via TE (MR !4286)
  - Add configurable double buffering (MR !4026)
  - Add Muon optimizer and distributed optimizer support (MR !4106)
  - Add setting to support Adam or AdamW optimizer (MR !3866)
- MoE
  - Add DTensor support for EP and DSv3 modules (MR !3955)
  - Add HybridEP backend to Flex Dispatcher (MR !4237)
  - Support FP8 recomputation for MoE components (MR !4030)
  - Implement NVFP4 Zero Padding for MoE (MR !4225)
  - Compute shared experts before router (MR !4068)
  - Enable bias in expert MLP (MR !3858)
- Model support
  - Add YaRN support for GPT-OSS (MR !4044)
  - Add support for Qwen3-Next arguments (MR !4070)
  - Add FP8 init for MTP (MR !3958)
  - Add fp8_dpa option for FP8 scaling (MR !4053)
  - Add RADIO-g support to converter and tester (MR !4371)
  - Add audio semantic reasoning data for voice chat and speech instructions (MR !4397)
- FSDP
  - Enable joint training of parallel modules (MR !3850)
  - Add support for multimodule communication (MR !4235)
- Inference
  - Add CUDA Graph runner lookup table cache (up to 2x E2E speedup) (MR !4082)
  - Add MoE dropping and padding router for CUDA Graph + decode (MR !3816)
  - Dynamic audio shapes with variable sequence lengths (2.5x throughput improvement) (MR !4274)
  - Integrate unified memory for dynamic inference context (MR !3985)
- Post-training
  - Add GPT-OSS ModelOpt support with quantization, import/export (MR !4169)
  - Enable KD support with hybrid training loop (MR !4021)
  - Add ModelOpt pruning example (MR !4022)
- RL
  - Add importance sampling and partial rollouts to Megatron RL (MR !4000)
  - Add sequence packing for RL (MR !4191)
- Ease of use
  - Handle CUDA absence during import (MR !4120)
  - Add granary dataloader functionality (MR !4291)
  - Enable SWA mixing with attention (MR !3855)
Bug fixes
- Fix convergence bug in MXFP8 parameter gradient buffer reuse (MR !3999)
- Fix loss mask cloning to prevent incorrect updates (MR !4164)
- Fix metadata loss in checkpoints (MR !4182)
- Fix FSDP grad accum fusion support (MR !4018)
- Fix non-TE optimizer checkpoint issue (MR !3931)
- Fix BERT virtual pipeline parallelism (MR !3993)
- Fix gc.freeze() slowdown by adding gc.collect() on last layer (MR !4003)
- Fix full iteration CUDA graph non-tensor handling (MR !4019)
- Fix model_auto_sync mis-set and add gradient assertion (MR !4062)
- Fix HF import dtype and checkpoint loading issues (MR !4095)
- Fix missing initialization in ProcessGroupCollection (MR !4159)
- Fix sink attention TP (MR !4173)
- Fix num_microbatches calculation (MR !4199)
- Fix 1f1b overlap unit tests for MTP standalone (MR !4210)
- Fix stale state dict handling (MR !4226)
- Fix dataset divergence with tokenizer PAD handling (MR !4231)
- Fix parameter initialization (MR !4296)
- Ensure tensor-parallel attributes set regardless of initialization flag (MR !4312)
Known issues

NVIDIA Megatron Core 0.14.0

Features
- Inference
  - Add async support for DynamicInferenceEngine (MR !3187)
  - Pad input tensors and enable FP8 weights for FP8 inference (MR !3341)
  - Force inference to always gather logits with tensor parallelism (MR !3442)
  - Multi batch size CUDA Graphs for Dynamic Inference (MR !3402)
- Post-training
  - ModelOpt updates (MR !3268)
    - Add speculative decoding AR validation feature
    - Add DeepSeek and Qwen model configs
- Performance
  - ModelCommProcessGroup integration (MR !3391)
  - Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (MR !3398)
    - Flexible creation and management of communication groups
  - Add support for Spike No More embedding initializations and weight decay skipping (MR !3500)
- MoE
  - We're actively optimizing large-scale fine-grained MoE performance on Blackwell Platform.
  - Features:
    - Support Expert Parallel A2A Overlapping (MR !3470; MR !3074)
    - Support CP and recompute for MTP (MR !3330)
    - Add support for global aux loss (MR !3318)
  - Memory Optimization
    - Support recomputati...

Assets 2

08 Oct 15:04

ko3n1g

core_v0.14.0

23e00ed

NVIDIA Megatron Core 0.14.0

Features
- Inference
  - Add async support for DynamicInferenceEngine (MR !3187)
  - Pad input tensors and enable FP8 weights for FP8 inference (MR !3341)
  - Force inference to always gather logits with tensor parallelism (MR !3442)
  - Multi batch size CUDA Graphs for Dynamic Inference (MR !3402)
- Post-training
  - ModelOpt updates (MR !3268)
    - Add speculative decoding AR validation feature
    - Add DeepSeek and Qwen model configs
- Performance
  - ModelCommProcessGroup integration (MR !3391)
  - Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (MR !3398)
    - Flexible creation and management of communication groups
  - Add support for Spike No More embedding initializations and weight decay skipping (MR !3500)
- MoE
  - We're actively optimizing large-scale fine-grained MoE performance on Blackwell Platform.
  - Features:
    - Support Expert Parallel A2A Overlapping (MR !3470; MR !3074)
    - Support CP and recompute for MTP (MR !3330)
    - Add support for global aux loss (MR !3318)
  - Memory Optimization
    - Support recomputation for FP8 layernorm/moe_act/shared_experts (MR !3465)
    - Support optimizer offloading for DSV3 FP8 training (MR !3659)
  - Performance Optimization
    - Add MoE router fusion (MR !3809)
    - Updates for MoE cudagraph (MR !3631)
  - Bug fixes:
    - Fix router input jitter dtype (MR !3774)
- Model support
  - Add MiMo video VLM train example (MR !3543)
  - Add AVLM for MIMO (MR !3624)
- Ease of use
  - Add uv support for source installs (MR !3615)
  - Automated weekly prereleases (MR !3574)
Bug fixes
- Use mscale_all_dim for softmax_factor (MR !2800)
- Fix FP8 param blockwise scaling unit test (MR !3480)
- Fix unit test blockwise scaling (MR !3491)
- Optimize prefill for token-less requests (MR !3499)
- Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
- Fix CUDA graph logic for flexible pp layout (MR !3505)
- Load FP8 models with strict=False (MR !3508)
- Skip rope check for torch < 1.4.0 (MR !3528)
- Disable Apex tests for stability (MR !3539)
- Fix typo in parallel_state expert parallelism (MR !3548)
- Guard modelopt on macOS (MR !3549)
- Retry on CUDA function failure (MR !3554)
- Fix NCCL mem pool creation error (MR !3557)
- Fix get_rotary_seq_len return type (MR !3559)
- Retry on CUDA function failure (MR !3560)
- Fix NCCL allocator attribute error (MR !3565)
- Ensure multi-prompt inference works (MR !3568)
- Fix MD5 on FIPS systems (MR !3577)
- Fixes dynamic context and inference bugs (MR !3582)
- Fix TE version for interleaved fused RoPE (MR !3586)
- Fix MTP with MoE and TP logging (MR !3594)
- Guard TE import fix (MR !3596)
- Add assertion for NCCL UB case (MR !3599)
- Remove Encoder PP related Functions (MR !3604)
- Fix segfaults in tests (MR !3605)
- Fix TE error in distributed optimizer (MR !3625)
- Remove redundant barrier in checkpoint flow (MR !3626)
- Support VPP MTP, fix logging (MR !3630)
- Retry mechanism for free(): invalid pointer errors (MR !3632)
- Fix test_replication.py issues (MR !3633)
- Fix typo in parallel_state (MR !3634)
- Fix CUDA graph logic determination (MR !3635)
- Fix TE installation error (MR !3636)
- Ensure correct sharding type in local tests (MR !3643)
- Fix cudagraphed backward buffer reuse for last layer (MR !3645)
- Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
- Fix dynamic example script errors (MR !3653)
- Guard TE import fix (MR !3666)
Breaking changes:
- megatron.core.distributed.custom_fsdp refactored as breaking change to megatron.core.distributed.fsdp.src.megatron_fsdp
Known issues

Assets 2

03 Oct 14:41

mmarcinkiewicz

25.09-alpha.rc1

d339190

25.09-alpha.rc1

Add fp8 attn knobs

Assets 2

12 Aug 18:33

ko3n1g

core_v0.13.1

ea651a3

NVIDIA Megatron Core 0.13.1

Merge branch 'cherry-pick-f36e1705' into 'core_r0.13.0'

Cherry-pick 'Use ruff linter (3627)' into 'core_r0.13.0'

See merge request ADLR/megatron-lm!3793

Assets 2

11 Aug 04:12

ko3n1g

core_v0.14.0rc5

d338252

NVIDIA Megatron Core 0.14.0rc5 Pre-release

Pre-release

Prerelease: NVIDIA Megatron Core 0.14.0rc5 (2025-08-11)

Assets 2

12 Aug 18:12

ko3n1g

core_v0.12.3

3ea68ad

NVIDIA Megatron Core 0.12.3

Merge branch 'chtruong/cherry-pick-3627' into 'core_r0.12.0'

Cherry-pick 'use yaml safe load  (3627)' into 'core_r0.12.0'

See merge request ADLR/megatron-lm!3795

Assets 2

04 Aug 04:12

ko3n1g

core_v0.14.0rc4

7f7439f

NVIDIA Megatron Core 0.14.0rc4 Pre-release

Pre-release

Prerelease: NVIDIA Megatron Core 0.14.0rc4 (2025-08-04)

Assets 2

28 Jul 04:13

ko3n1g

core_v0.14.0rc3

adbab64

NVIDIA Megatron Core 0.14.0rc3 Pre-release

Pre-release

Prerelease: NVIDIA Megatron Core 0.14.0rc3 (2025-07-28)

Assets 2

25 Jul 18:04

ko3n1g

core_v0.13.0

c550cf6

NVIDIA Megatron Core 0.13.0

Support bf16 dtype for optimizer states to use precision-aware optimizer in TransformerEngine
MoE
- Features:
  - Flexible Asymmetric Virtual Pipeline Parallelism with Custom Pipeline Layout (--pipeline-model-parallel-layout)
  - Add support to pass custom parallelism groups to MoE modules.
  - Add Hybrid Shard Data-Parallel support for MoE models (--num-distributed-optimizer-instances)
  - Support EP + custom FSDP training for DeepSeek-V3
  - FP8 support for Multi-Token-Prediction
- Memory Optimization
  - Fine-grained recomputation to reduce activation memory. (--recompute-modules with --recompute-granularity selective)
  - Memory efficient token permutation by moving the probs multiplication from unpermutation to activation function of GroupedMLP.
- Performance Optimization
  - MLA RoPE fusion kernel and YARN embedding cache.
  - FP8 padding optimization of MoE models by padding the routing map.
- Bug fixes:
  - Fix the aux loss calculation when expert_bias or group limited routing is used. This leads to load_balancing_loss values change compared to the previous version.
  - Fix packed sequence support for MLA
- Known Issues:
  - MTP is not compatible with flexible pipeline layout, will be fixed at !3594.
  - MTP convergence issue with TP2, will be fixed at !3594.

Assets 2

21 Jul 04:12

ko3n1g

core_v0.14.0rc2

db41707

NVIDIA Megatron Core 0.14.0rc2 Pre-release

Pre-release

Prerelease: NVIDIA Megatron Core 0.14.0rc2 (2025-07-21)

Assets 2

Releases: NVIDIA/Megatron-LM

NVIDIA Megatron Core 0.15.0

NVIDIA Megatron Core 0.14.0

Uh oh!

NVIDIA Megatron Core 0.14.0

Uh oh!

25.09-alpha.rc1

Uh oh!

NVIDIA Megatron Core 0.13.1

Uh oh!

NVIDIA Megatron Core 0.14.0rc5

Uh oh!

NVIDIA Megatron Core 0.12.3

Uh oh!

NVIDIA Megatron Core 0.14.0rc4

Uh oh!

NVIDIA Megatron Core 0.14.0rc3

Uh oh!

NVIDIA Megatron Core 0.13.0

Uh oh!

NVIDIA Megatron Core 0.14.0rc2

Uh oh!