Feature request:
Paper: https://arxiv.org/abs/2510.11696
Reference code: https://github.com/NVlabs/QeRL
TL;DR: QeRL supports reinforcement learning for 32B LLMs on a single H100 GPU while maintaining full-parameter fine-tuning performance!
The Speedup and Final Evaluation of QeRL (Qwen2.5-7B-Instruct)
QeRL achieves faster RL rollouts and end-to-end training while delivering performance superior to vanilla LoRA and QLoRA, and comparable to full-parameter RL, on mathematical benchmarks.
Rollout throughput of Qwen2.5-14B/32B-Instruct models under different LoRA ranks.
Its dual benefits in memory efficiency and training speed make QeRL highly effective for end-to-end RL workflows, especially in scenarios requiring extensive rollouts. QeRL maintains strong rollout performance across LoRA ranks, achieving over 2× rollout speedups on the 14B and 32B models.
Motivation:
Multi-step reasoning is essential for large language models to solve complex tasks, yet prevailing training paradigms face key limitations. Supervised fine-tuning can teach explicit chains of thought but tends to promote imitation rather than genuine reasoning, whereas reinforcement learning provides verifiable rewards and supports exploration of diverse reasoning paths, albeit at high computational cost.
In RL, large model sizes inflate memory use and a multistage pipeline—spanning rollouts, reward computation, logit evaluation, and gradient updates—slows training, with long-sequence rollouts being especially expensive. Our experiments indicate a promising remedy: quantized LLMs combined with LoRA not only reduce training resources but also outperform vanilla LoRA in reward growth and evaluation, challenging the conventional view from SFT that quantization degrades training. We further observe that quantization error behaves like beneficial parameter noise, increasing entropy and broadening exploration. These findings motivate leveraging low-bit quantization in RL to improve both efficiency and reasoning quality.
Quantization noise yields higher initial policy entropy, which encourages exploration during RL training and accelerates reward growth.
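As a toy illustration of the "quantization error behaves like parameter noise" observation above (this is not the paper's NVFP4 pipeline), the sketch below quantizes a weight matrix with a simple block-wise symmetric 4-bit scheme and inspects the resulting error, which is approximately zero-mean noise added to the weights:

```python
# Toy sketch (a stand-in 4-bit scheme, not NVFP4): quantize a weight matrix
# block-wise and inspect the error, illustrating that quantization error
# behaves like zero-mean parameter noise on the weights.
import torch

def fake_quantize_4bit(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Block-wise symmetric 4-bit round-to-nearest (illustrative only)."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 7.0   # int4 range [-7, 7]
    q = torch.clamp(torch.round(flat / scale), -7, 7)
    return (q * scale).reshape(w.shape)

torch.manual_seed(0)
w = torch.randn(4096, 4096) * 0.02          # typical scale of LLM weights
err = fake_quantize_4bit(w) - w             # the "quantization noise"

print(f"error mean: {err.mean():.2e}")      # ~0: behaves like zero-mean noise
print(f"error std : {err.std():.2e}")       # noise magnitude
print(f"weight std: {w.std():.2e}")
```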
Summary:
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for LLMs. While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, which enhances exploration and enables the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training.
Experiments demonstrate that QeRL delivers over 1.5× speedup in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while also delivering end-to-end training speedups. It further achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) with the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
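As a rough illustration of where this would plug into TRL, here is a minimal QLoRA-style sketch using the existing GRPOTrainer, with a bitsandbytes NF4 base as a stand-in for NVFP4; QeRL would swap the NF4 base weights for NVFP4 served by fast kernels and add AQN on top. The dataset, reward function, and hyperparameters below are placeholders, not the paper's settings.

```python
# Hedged sketch: QLoRA-style GRPO with TRL, as a stand-in for a QeRL setup.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import GRPOConfig, GRPOTrainer

# 4-bit quantized base model (NF4 here; QeRL would use NVFP4 weights instead).
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)

# LoRA adapters trained on top of the frozen quantized base.
peft_config = LoraConfig(
    r=32, lora_alpha=64, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

def reward_len(completions, **kwargs):
    # Placeholder reward; a real run would use a verifiable math reward.
    return [-abs(len(c) - 200) / 200 for c in completions]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="qerl-sketch", num_generations=8),
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
    peft_config=peft_config,
)
trainer.train()
```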
Key Results:
Training reward of the 7B/14B-Instruct models on the BigMath dataset (levels 3-5) and of the 32B-Instruct model (levels 4-5).
QeRL achieves a rapid reward increase within 200 steps (left), while vanilla LoRA requires over 500 steps to show improvement. On the hardest training data for the 32B model (right), QeRL still shows a higher accuracy-reward curve than vanilla LoRA.
Reward visualization under different RL algorithms. With both GRPO and DAPO, the quantized model delivers a better converged reward and faster reward growth.
Comparison with Relevant Baselines on GSM8K.
We report the GSM8K training results of the 3B and 7B models using GRPO.
Comparison with Relevant Baselines on More Benchmarks.
We report the BigMath (levels 3-5) training results of other models using DAPO.
Memory Saving and End-to-End Training Speedup on 7B and 14B Models.
We compare the quantized model sizes and end-to-end RL training speedup of these PEFT methods, with all experiments conducted on a single NVIDIA H100-80GB GPU.
Ablation of Learning Rate. When the learning rate is increased to 3e-5, the larger update magnitude in the adapter results in faster reward growth and quicker model convergence. However, in 16-bit models, the excessive update magnitude leads to instability, often causing the training process to collapse. In contrast, QeRL demonstrates remarkable robustness to larger learning rates due to the presence of NVFP4 quantization noise, which helps stabilize updates.
Your contribution
QeRL combines NVFP4 quantization with LoRA, reducing memory and enabling faster RL while matching full-parameter fine-tuning performance thanks to Adaptive Quantization Noise (AQN). AQN dynamically adjusts quantization noise with an exponential scheduler, enhancing exploration. We can help build QeRL support and integrate it into TRL based on our code :-)
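A minimal sketch of the kind of exponential noise schedule AQN describes: the standard deviation of the injected quantization-like noise decays from a start value to an end value over training. The hyperparameter names and values here are illustrative, not the paper's.

```python
# Hedged sketch of an exponential noise schedule (illustrative values).
def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 5e-2, sigma_end: float = 5e-4) -> float:
    """Exponentially interpolate the noise std between sigma_start and sigma_end."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return sigma_start * (sigma_end / sigma_start) ** t

# Larger noise early in training encourages exploration;
# smaller noise later lets the policy converge.
for step in (0, 100, 250, 500):
    print(f"step {step:>3}: sigma = {aqn_sigma(step, 500):.2e}")
```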