
Mar 17, 2025

CePO Update: Turbocharging Reasoning Models' Capabilities Using Test-Time Planning

We are happy to announce an upcoming update to CePO (the Cerebras Planning and Optimization framework) [1] that turbocharges the capabilities of two open-source reasoning models: DeepSeek R1 [2] and QwQ 32B [3]. These recently released models already demonstrate very impressive accuracy on complex tasks using advanced reasoning behaviors such as backtracking; CePO elevates them to even higher accuracy through test-time planning and verification. We also demonstrate the efficiency of the CePO framework by showcasing its superior accuracy when operating under context-length limitations.
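To make the mechanism concrete, the sketch below shows a plan-then-verify loop in the spirit of CePO's test-time planning and verification. It is an illustration, not the Cerebras implementation: the `complete` callable, the prompts, and the candidate-selection step are all assumptions.

```python
from typing import Callable, List

def cepo_answer(
    question: str,
    complete: Callable[[str], str],  # assumed LLM completion call: prompt -> text
    n_plans: int = 3,
) -> str:
    """Minimal plan -> execute -> verify loop illustrating test-time planning."""
    candidates: List[str] = []
    for _ in range(n_plans):
        # Planning: ask the model for an explicit step-by-step plan first.
        plan = complete(f"Outline a step-by-step plan to solve:\n{question}")
        # Execution: solve the problem by following the generated plan.
        candidates.append(complete(
            f"Question:\n{question}\nPlan:\n{plan}\n"
            "Follow the plan and give the final answer."
        ))
    # Verification: have the model judge the candidates and pick one.
    numbered = "\n".join(f"[{i}] {a}" for i, a in enumerate(candidates))
    choice = complete(
        f"Question:\n{question}\nCandidate answers:\n{numbered}\n"
        "Reply with only the index of the best-supported answer."
    )
    try:
        return candidates[int(choice.strip())]
    except (ValueError, IndexError):
        return candidates[0]  # fall back if the judge's reply is malformed
```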

Results

DeepSeek R1 Llama [2] and Qwen QwQ 32B [3] are two recently released reasoning models that achieve extremely impressive accuracy, outperforming many community models that are several times larger, including Llama 3.1 405B [4], GPT-4o [5], and Claude Sonnet 3.5 [6]. Our results show that CePO decisively boosts the accuracy of these state-of-the-art reasoning models, by more than 20 accuracy points in some instances!

Accuracy under Context Length Limitations

In scenarios where hosting a long-context model is a challenge, QwQ's and DeepSeek's accuracy drops significantly because completions run out of context. By breaking complex problems into subtasks and steps, CePO provides consistent and substantial accuracy gains for these reasoning models even when the maximum context length is limited to 16k tokens. The unanswered rate caused by the model running out of context drops by more than 10 percentage points after applying CePO on AIME and GPQA, leading to accuracy gains of more than 5 percentage points, as shown in the table below.
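The sketch below illustrates this kind of decomposition under a fixed token budget. The splitting prompt, the budget arithmetic, and the rolling three-note context are illustrative assumptions, not the exact CePO pipeline.

```python
from typing import Callable, List

def solve_with_budget(
    question: str,
    complete: Callable[[str, int], str],  # assumed call: (prompt, max_tokens) -> text
    max_context: int = 16_000,            # total budget, matching the 16k setting above
) -> str:
    """Break a problem into subtasks so no single completion must hold the
    whole reasoning trace, keeping each step within the context budget."""
    plan = complete(f"List, one per line, the subtasks needed to solve:\n{question}",
                    max_context // 4)
    steps = [s.strip() for s in plan.splitlines() if s.strip()]
    per_step = max_context // (len(steps) + 1)  # split the budget across steps
    notes: List[str] = []
    for step in steps:
        recent = "\n".join(notes[-3:])  # carry only recent conclusions forward
        notes.append(complete(
            f"Problem: {question}\nKnown so far:\n{recent}\nNow do: {step}",
            per_step))
    return complete(
        f"Problem: {question}\nFindings:\n" + "\n".join(notes) +
        "\nState the final answer.", per_step)
```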

Update 04/10

In this update, we've enhanced CePO to work effectively with reasoning models while staying within the maximum context length, significantly boosting the final achieved accuracy. Reasoning models such as Qwen QwQ-32B make effective use of a large context window to reason their way to a response, but this also limits their effectiveness in multi-step reasoning workflows such as CePO. We updated key steps in the CePO pipeline to allow reasoning models to fully leverage the available context, achieving performance on par with or surpassing DeepSeek-R1 on LiveCodeBench and AIME, and driving accuracy closer to state-of-the-art levels.
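One way such a multi-step pipeline can reclaim context between steps is to pass only each step's final answer forward and discard the long reasoning trace. The helper below is a minimal sketch of that idea, assuming the `<think>...</think>` delimiters emitted by QwQ- and DeepSeek-R1-style models; we do not know the exact mechanism used in CePO.

```python
def strip_reasoning(completion: str, close_tag: str = "</think>") -> str:
    """Keep only the text after the reasoning trace, so downstream pipeline
    steps consume the conclusion rather than the full chain of thought.

    Assumes the model wraps its reasoning in <think>...</think>; other
    models may use different delimiters, hence the configurable tag.
    """
    _, sep, tail = completion.partition(close_tag)
    return tail.strip() if sep else completion.strip()
```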

Conclusion

CePO empowers the recently released reasoning models to achieve unprecedented accuracy even under context-length limitations. We will be updating the open-source CePO GitHub repository with the above enhancements to support community-driven development of inference-time optimization techniques. For updates on the release and to try CePO, follow us on Twitter or join our Discord.

References

[1] Cerebras Systems team. "CePO: Empowering Llama with Reasoning Using Test-Time Compute." https://cerebras.ai/blog/cepo (2024).

[2] Guo, Daya, et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv preprint arXiv:2501.12948 (2025).

[3] Qwen Team. "QwQ-32B: Embracing the Power of Reinforcement Learning." https://qwenlm.github.io/blog/qwq-32b/ (2025).

[4] Grattafiori, Aaron, et al. "The Llama 3 Herd of Models." arXiv preprint arXiv:2407.21783 (2024).

[5] Hurst, Aaron, et al. "GPT-4o System Card." arXiv preprint arXiv:2410.21276 (2024).

[6] Anthropic team. "Claude 3.5 Sonnet." https://www.anthropic.com/news/claude-3-5-sonnet (2024).

[7] DeepSeek-R1 repository. https://github.com/deepseek-ai/deepseek-r1