
Chinese AI startup DeepSeek has published a landmark peer-reviewed paper in Nature detailing the training methods behind its R1 reasoning model and disclosing that the reinforcement-learning stage of R1’s training cost just $294,000; adding the estimated $6 million spent on the underlying base model brings the overall training expense to roughly $6.3 million. This figure stands in stark contrast to the tens of millions of dollars reported for comparable large language models (LLMs) developed by major U.S. technology firms (Nature).
Breakthrough in Pure Reinforcement Learning
DeepSeek’s key innovation lies in its use of pure reinforcement learning (RL) to cultivate advanced reasoning capabilities, rather than relying predominantly on supervised fine-tuning with human-annotated examples. In the pure RL paradigm, the model receives rewards for reaching correct solutions, incentivizing it to develop its own problem-solving strategies and self-verification mechanisms.
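To make that reward signal concrete, here is a minimal sketch of a rule-based accuracy reward in Python. The \boxed{...} answer convention and the parsing logic are illustrative assumptions; the paper describes rule-based accuracy and format rewards, not this exact implementation.

```python
import re

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Rule-based accuracy reward: 1.0 if the model's final answer
    matches the reference, else 0.0.

    Assumes (hypothetically) that the model marks its final answer with
    the \\boxed{...} convention; DeepSeek's actual reward rules are not
    published at this level of detail.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer earns no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(accuracy_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```

Because the reward depends only on the final answer, the model is free to discover whatever intermediate reasoning reaches it, which is where behaviors like self-verification can emerge.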
According to the Nature article, DeepSeek introduced a novel algorithm called Group Relative Policy Optimization (GRPO), which lets R1 judge the quality of its generated reasoning paths by comparing each sampled response’s reward against the average of its sampling group, eliminating the need for a separate critic model. The paper notes that R1 can “self-score” its attempts and prioritize higher-quality reasoning sequences, leading to more robust performance on complex tasks. AI researcher Huan Sun from Ohio State University remarked, “Almost all RL work in LLMs in 2025 appears to have been inspired by R1 in some way” (Phil Schmid).
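The core of GRPO can be sketched in a few lines: sample a group of responses per prompt, then use the group’s own reward statistics as the baseline for each response’s advantage. The snippet below is a simplified illustration of that advantage step, not DeepSeek’s training code.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt.

    GRPO samples a group of responses per prompt and normalizes each
    response's reward against the group, so no learned critic/value
    model is needed. Simplified sketch of the advantage step only.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()        # the group mean serves as the baseline
    scale = rewards.std() + 1e-8     # normalize by group spread (avoid /0)
    return (rewards - baseline) / scale

# Four sampled answers to one problem, rewarded 1.0 if correct:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```

Because the baseline comes from the sampling group itself, GRPO drops the separately trained value network that PPO-style RL requires, one of the paper’s main cost savings.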
Five-Stage Training Process
DeepSeek’s R1 training regimen comprises five stages that alternate between supervised fine-tuning and pure reinforcement learning (a sketch of the rejection-sampling step follows the list):
- Cold-Start Fine-Tuning: DeepSeek’s V3-Base model is first fine-tuned on thousands of diverse “cold-start” examples covering writing tasks, factual questions, and logic puzzles.
- RL Enhancement: Using GRPO, the model undergoes pure RL to refine reasoning skills—receiving rewards for correct final answers without direct human guidance.
- Rejection Sampling: Near convergence, R1 generates synthetic training data by filtering and selecting its highest-scoring RL outputs.
- Supervised Merge: The synthetic data is merged with the original supervised dataset to reinforce both human-provided and model-generated high-quality examples.
- Final RL Polishing: A second RL phase further fine-tunes reasoning strategies, enabling R1 to self-correct and verify intermediate steps.
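A minimal sketch of the rejection-sampling stage referenced above: sample many candidates from the near-converged policy, score them, and keep only the best as synthetic fine-tuning data. Here `policy` and `reward_fn` are hypothetical callables standing in for the RL model and its rule-based scorer.

```python
def rejection_sample(prompt, policy, reward_fn, n_samples=16):
    """Filter a policy's samples down to synthetic training data.

    `policy(prompt)` is assumed to stochastically generate one response;
    `reward_fn(prompt, response)` is assumed to return a scalar score.
    Both are placeholders, not DeepSeek's actual interfaces.
    """
    candidates = [policy(prompt) for _ in range(n_samples)]
    scored = sorted(
        ((reward_fn(prompt, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best_reward = scored[0][0]
    # keep only the top-scoring generations (e.g. fully correct answers)
    return [c for r, c in scored if r == best_reward and r > 0]
```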
This cyclic approach allowed DeepSeek to maximize learning efficiency while containing compute costs: R1 was reportedly trained on a relatively modest cluster of NVIDIA H800 GPUs, the China-market chip that has since been barred from sale to China under U.S. export controls (Nature Volume 645, Issue 8081).
First Major LLM Peer-Reviewed in Nature
R1 is the first major LLM to undergo rigorous peer review in a top-tier journal, setting a new standard for transparency in AI research. Machine-learning engineer Lewis Tunstall of Hugging Face, one of the paper’s reviewers, praised DeepSeek’s openness: “Publishing detailed training processes is critical to assessing model risks and reproducibility.” In response to reviewer feedback, DeepSeek clarified data sources, safety measures, and benchmark protocols in the final manuscript (Nature).
Performance and Community Adoption
Since its public release on Hugging Face in January 2025, R1 has amassed over 10.9 million downloads, becoming the platform’s most popular model for complex reasoning tasks. On the AIME 2024 mathematics benchmark, R1 achieved a pass@1 score of 79.8%, marginally surpassing OpenAI’s o1 model at 79.2%. The model also performs competitively on logical reasoning, code synthesis, and scientific QA benchmarks, all while being fully open-source.
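For context on the headline metric: pass@1 is conventionally estimated with the unbiased pass@k formula of Chen et al. (2021), which for k = 1 reduces to mean per-sample accuracy. The snippet below is a generic sketch of that estimator, not DeepSeek’s evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k responses drawn from n samples (c of them correct)
    solves the problem. For k = 1 this is simply c / n."""
    if n - c < k:
        return 1.0  # too few incorrect samples for all k draws to fail
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=12, k=1))  # 0.75
```

Averaging this per-problem value over a benchmark yields scores like R1’s 79.8% on AIME 2024.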
Implications for AI Development
DeepSeek’s success challenges prevailing assumptions about the necessity of colossal training budgets and hardware resources. By demonstrating that pure RL, paired with strategic fine-tuning and synthetic data generation, can yield state-of-the-art reasoning at a fraction of the cost, DeepSeek has sparked debate about future directions for cost-effective, transparent AI research. The company’s reliance on export-restricted H800 GPUs underscores the complex interplay between geopolitics and AI innovation.
DeepSeek’s R1 model offers a compelling blueprint for democratizing advanced AI: modest training budgets, transparent methodologies, and community-driven development may become hallmarks of the next generation of large language models.