Mar 17 2025
LongCePO: Empowering LLMs to efficiently leverage infinite context
Cerebras continues to push the boundaries of large language model capabilities with the introduction of LongCePO (Long-Context Cerebras Planning and Optimization), an extension of our CePO framework [1]. Building upon the original CePO’s ability to enhance the reasoning prowess of the Llama family through sophisticated test-time computation, LongCePO now addresses the critical challenge of context length limitations.
LongCePO leverages planning to let LLMs manage and process information beyond their native context windows, supporting effectively infinite context at inference time. It achieves this by strategically decomposing complex tasks and using iterative planning steps to retrieve and integrate relevant information from extended data sources.
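The post does not spell out LongCePO's planning procedure, but the general shape of processing a document far larger than the model's window can be sketched as a divide-and-conquer loop: split the source into chunks that fit a fixed token budget, query each chunk for task-relevant information, then answer from the combined notes. The sketch below is illustrative only; `call_llm`, the token-budget heuristic, and the prompts are assumptions, not the LongCePO implementation.

```python
# Illustrative sketch only: LongCePO's actual planning algorithm is not
# described in this post. This shows one generic way to answer a query over
# a document longer than the model's window: chunk to a ~8K-token budget,
# extract relevant notes per chunk, then answer from the merged notes.
# `call_llm` is a hypothetical stand-in for any chat-completion API.

from typing import Callable, List

CONTEXT_BUDGET_TOKENS = 8000   # per-call window assumed in the post (~8K)
TOKENS_PER_WORD = 1.3          # rough budgeting heuristic, not exact

def chunk_document(text: str,
                   budget_tokens: int = CONTEXT_BUDGET_TOKENS) -> List[str]:
    """Split text into word-based chunks that fit the token budget."""
    words = text.split()
    words_per_chunk = int(budget_tokens / TOKENS_PER_WORD)
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def answer_long_context(query: str, document: str,
                        call_llm: Callable[[str], str]) -> str:
    """Map-reduce style question answering over an arbitrarily long document."""
    notes = []
    for chunk in chunk_document(document):
        notes.append(call_llm(
            f"Extract facts relevant to the question.\n"
            f"Question: {query}\nText: {chunk}"))
    combined = "\n".join(notes)
    return call_llm(
        f"Answer the question using these notes.\n"
        f"Question: {query}\nNotes: {combined}")
```

Each individual call stays within the fixed window, so total readable context grows with the number of calls rather than the model's native limit; an iterative planner (as LongCePO describes) would additionally decide which chunks to revisit.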
Just as the original CePO empowered Llama to excel at challenging reasoning tasks, LongCePO unlocks a new dimension of capability: leveraging extended data sources. Applying LongCePO to Llama 3.3 70B Instruct [2] yields a new high score on LongBench v2 [3], a challenging recently released benchmark for evaluating the long-context capabilities of LLMs. Notably, LongCePO delivers this improvement in Llama's long-context accuracy while limiting the model's context window to ~8K tokens at inference time. When run on the Cerebras inference service, LongCePO lets users perform natural language tasks such as question answering and reasoning over their custom data sources in real time.
Results
Meta released Llama 3.3 70B with a 128K-token context window to equip it for long inputs. Even with this long window, however, Llama 3.3 70B barely scores above random guessing on LongBench v2's long-context tasks. LongCePO lifts the model to frontier-level quality on these tasks, raising its accuracy from 27.0 to 39.5 on medium-length tasks (up to 128K words). The gains are even larger in the long-context setting (>128K words), where accuracy improves from 24.1 to 38.9. With LongCePO, Llama 3.3 70B outperforms the larger Mistral-Large-Instruct-2411 model [4] and o1-mini-2024-09-12 [5] by a wide margin, putting its performance in the same category as Claude 3.5 Sonnet [6].

Conclusion
LongCePO demonstrates a significant leap forward in addressing the limitations of context window sizes, enabling models like Llama 3.3 70B to achieve frontier-level performance on long-context tasks. By employing strategic planning to access and integrate external information, LongCePO effectively circumvents the need for massive context windows during inference. We will be open-sourcing LongCePO as an update to the already available CePO release on GitHub, fostering community-driven development of inference-time optimization techniques for long-context applications. For updates on the release and to explore LongCePO, follow us on Twitter and join our Discord!
References
[1] Cerebras Systems team. CePO: Empowering Llama with Reasoning using Test-Time Compute. https://cerebras.ai/blog/cepo (2024).
[2] Grattafiori, Aaron, et al. “The Llama 3 Herd of Models.” arXiv preprint arXiv:2407.21783 (2024).
[3] Bai, Yushi, et al. “LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks.” arXiv preprint arXiv:2412.15204 (2024).
[4] Mistral AI team, Mistral-Large-Instruct-2411, https://huggingface.co/mistralai/Mistral-Large-Instruct-2411 (2024).
[5] OpenAI team, OpenAI o1-mini, https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning (2024).
[6] Anthropic team, Claude 3.5 Sonnet, https://www.anthropic.com/news/claude-3-5-sonnet (2024).