[QUESTION] bf16 Parameters and fp32 Gradients #1186

pluiez · 2024-04-30T09:02:21Z

pluiez
Apr 30, 2024

In the README for the distributed optimizer, it is mentioned that when using bf16 training, a combination of bf16 model parameters and fp32 model grads is employed, and the distributed optimizer's fp32 main gradients are the same as the model's fp32 gradients. However, I am aware that in PyTorch, after the forward and backward passes, the gradients after forward+backward typically match the data type of the parameters. So, there should be always bf16 model grads given bf16 mdoel params, and this is apparently true in the case of fp16 training where an extra copy of fp32 main grads in the optimizer is necessary.

Could you please explain how it is possible to have bf16 parameters with fp32 gradients in the context of bf16 training? I am wondering why is there a difference between fp16 and bf16 training.

2024-06-29T18:20:25Z

github-actions[bot]
bot Jun 29, 2024

Marking as stale. No activity in 60 days.

0 replies

nastya236 · 2025-08-15T16:08:56Z

nastya236
Aug 15, 2025

I was thinking about this recently and based on all that I read it seems that there are 2 main strategies for bfloat16 training without accuracy loss:

BF16 weights + BF16 grads + FP32 master copy + FP32 optimizer state → (2 + 2 + 4 + 8) = 16 bytes/param (as in NeMo) and I assume pytorch does it in the following way: we keep a master copy in float32 and use automatic mixed precision to perform computation in bfloat16.
BF16 weights + FP32 grad accumulation + FP32 optimizer state→ (2 + 4 + 8) = 14 bytes/param (I haven’t found a a reference, but it appears Megatrom-LM does this).

I then questioned the need for FP32 grad accumulation and came across this paper (page 6, Section 5.1 “Precision options”), case D−MW. In this setup, gradients are not accumulated in FP32, so pure bf16 , but according to Table 3, performance is worse compared to the first option (and I’d assume options 1 and 2 have similar accuracy).
So my take is the following: you either do:

FP32 copy + BF16 weights + BF16 grads + 32OPT = (4 + 2 + 2 + 4*2) = 16 byte per parameter
or:
no copy + BF16 weights + FP32 grads + 32 OPT = 14 per parameter

Probably, master copy in case of BF16 is an overkill, because the range of values is comparable..
But I am not entirely sure that my observations are correct, so I would really appreciate if someone corrects me.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[QUESTION] bf16 Parameters and fp32 Gradients #1186

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

[QUESTION] bf16 Parameters and fp32 Gradients #1186

Uh oh!

pluiez Apr 30, 2024

Replies: 2 comments

Uh oh!

github-actions[bot] bot Jun 29, 2024

Uh oh!

nastya236 Aug 15, 2025

pluiez
Apr 30, 2024

github-actions[bot]
bot Jun 29, 2024

nastya236
Aug 15, 2025