Feature request:
Paper: https://arxiv.org/abs/2510.11696
Reference code: https://github.com/NVlabs/QeRL
TL;DR: QeRL supports reinforcement learning for 32B LLMs on a single H100 GPU while maintaining full-parameter fine-tuning performance!
The Speedup and Final Evaluation of QeRL (Qwen2.5-7B-Instruct)
QeRL achieves faster RL rollouts and end-to-end training while delivering performance superior to vanilla LoRA and QLoRA, and comparable to full-parameter RL, on mathematical benchmarks.
Rollout throughput of Qwen2.5-14B/32B-Instruct models under different LoRA ranks.
Its dual benefits in memory efficiency and training speed make QeRL highly effective for end-to-end RL workflows, especially in scenarios requiring extensive rollouts. QeRL maintains strong rollout performance across LoRA ranks, achieving over 2× rollout speedups on the 14B and 32B models.
Motivation:
Multi-step reasoning is essential for large language models to solve complex tasks, yet prevailing training paradigms face key limitations. Supervised fine-tuning can teach explicit chains of thought but tends to promote imitation rather than genuine reasoning, whereas reinforcement learning provides verifiable rewards and supports exploration of diverse reasoning paths, albeit at high computational cost.
In RL, large model sizes inflate memory use and a multistage pipeline—spanning rollouts, reward computation, logit evaluation, and gradient updates—slows training, with long-sequence rollouts being especially expensive. Our experiments indicate a promising remedy: quantized LLMs combined with LoRA not only reduce training resources but also outperform vanilla LoRA in reward growth and evaluation, challenging the conventional view from SFT that quantization degrades training. We further observe that quantization error behaves like beneficial parameter noise, increasing entropy and broadening exploration. These findings motivate leveraging low-bit quantization in RL to improve both efficiency and reasoning quality.
Quantization noise yields higher initial policy entropy, which encourages exploration during RL training and accelerates reward growth.
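As a toy illustration of the "quantization error behaves like parameter noise" observation above (this is not the paper's NVFP4 pipeline), the sketch below quantizes a weight matrix with a simple block-wise symmetric 4-bit scheme and inspects the resulting error, which is approximately zero-mean noise added to the weights:

```python
# Toy sketch (a stand-in 4-bit scheme, not NVFP4): quantize a weight matrix
# block-wise and inspect the error, illustrating that quantization error
# behaves like zero-mean parameter noise on the weights.
import torch

def fake_quantize_4bit(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Block-wise symmetric 4-bit round-to-nearest (illustrative only)."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 7.0   # int4 range [-7, 7]
    q = torch.clamp(torch.round(flat / scale), -7, 7)
    return (q * scale).reshape(w.shape)

torch.manual_seed(0)
w = torch.randn(4096, 4096) * 0.02          # typical scale of LLM weights
err = fake_quantize_4bit(w) - w             # the "quantization noise"

print(f"error mean: {err.mean():.2e}")      # ~0: behaves like zero-mean noise
print(f"error std : {err.std():.2e}")       # noise magnitude
print(f"weight std: {w.std():.2e}")
```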
Summary:
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for LLMs. While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, which enhances exploration and enables the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training.
Experiments demonstrate that QeRL delivers over 1.5× speedup in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while also delivering end-to-end training speedups. It further achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) with the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
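As a rough illustration of where this would plug into TRL, here is a minimal QLoRA-style sketch using the existing GRPOTrainer, with a bitsandbytes NF4 base as a stand-in for NVFP4; QeRL would swap the NF4 base weights for NVFP4 served by fast kernels and add AQN on top. The dataset, reward function, and hyperparameters below are placeholders, not the paper's settings.

```python
# Hedged sketch: QLoRA-style GRPO with TRL, as a stand-in for a QeRL setup.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import GRPOConfig, GRPOTrainer

# 4-bit quantized base model (NF4 here; QeRL would use NVFP4 weights instead).
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)

# LoRA adapters trained on top of the frozen quantized base.
peft_config = LoraConfig(
    r=32, lora_alpha=64, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

def reward_len(completions, **kwargs):
    # Placeholder reward; a real run would use a verifiable math reward.
    return [-abs(len(c) - 200) / 200 for c in completions]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="qerl-sketch", num_generations=8),
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
    peft_config=peft_config,
)
trainer.train()
```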
Key Results:
Training reward of the 7B/14B-Instruct models on the BigMath dataset (levels 3-5) and of the 32B-Instruct model (levels 4-5).
QeRL achieves a rapid reward increase within 200 steps (left), while vanilla LoRA requires over 500 steps to show improvement. On the hardest training data for the 32B model (right), QeRL still shows a higher accuracy-reward curve than vanilla LoRA.
Reward visualization under different RL algorithms. With both GRPO and DAPO, the quantized model delivers a better converged reward and faster reward growth.
Comparison with Relevant Baselines on GSM8K.
We report the GSM8K training results of the 3B and 7B models using GRPO.
Comparison with Relevant Baselines on More Benchmarks.
We report the BigMath (levels 3-5) training results of other models using DAPO.
Memory Saving and End-to-End Training Speedup on 7B and 14B Models.
We compare the quantized model sizes and end-to-end RL training speedup of these PEFT methods, with all experiments conducted on a single NVIDIA H100-80GB GPU.
Ablation of Learning Rate. When the learning rate is increased to 3e-5, the larger update magnitude in the adapter results in faster reward growth and quicker model convergence. However, in 16-bit models, the excessive update magnitude leads to instability, often causing the training process to collapse. In contrast, QeRL demonstrates remarkable robustness to larger learning rates due to the presence of NVFP4 quantization noise, which helps stabilize updates.
Your contribution
QeRL combines NVFP4 quantization with LoRA, reducing memory and enabling faster RL while matching full-parameter fine-tuning performance thanks to Adaptive Quantization Noise (AQN). AQN dynamically adjusts quantization noise with an exponential scheduler, enhancing exploration. We can help build QeRL support and integrate it into TRL based on our code :-)
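A minimal sketch of the kind of exponential noise schedule AQN describes: the standard deviation of the injected quantization-like noise decays from a start value to an end value over training. The hyperparameter names and values here are illustrative, not the paper's.

```python
# Hedged sketch of an exponential noise schedule (illustrative values).
def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 5e-2, sigma_end: float = 5e-4) -> float:
    """Exponentially interpolate the noise std between sigma_start and sigma_end."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return sigma_start * (sigma_end / sigma_start) ** t

# Larger noise early in training encourages exploration;
# smaller noise later lets the policy converge.
for step in (0, 100, 250, 500):
    print(f"step {step:>3}: sigma = {aqn_sigma(step, 500):.2e}")
```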