DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao1,2∗, Peiyi Wang1,3∗, Qihao Zhu1,3∗, Runxin Xu1, Junxiao Song1, Xiao Bi1, Haowei Zhang1, Mingchuan Zhang1, Y.K. Li1, Y. Wu1, Daya Guo1∗
1 DeepSeek-AI, 2 Tsinghua University, 3 Peking University
{zhihongshao,wangpeiyi,zhuqh,guoday}@deepseek.com https://github.com/deepseek-ai/DeepSeek-Math
Abstract
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pretraining DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
- Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a) and 9 times the size of the recently released OpenWebMath (Paster et al., 2023).
- Our pre-trained base model DeepSeekMath-Base 7B achieves comparable performance with Minerva 540B (Lewkowycz et al., 2022a), indicating that the number of parameters is not the only key factor in mathematical reasoning capability. A smaller model pre-trained on high-quality data can achieve strong performance as well.
- We share our findings from math training experiments. Code training prior to math training improves models' ability to solve mathematical problems both with and without tool use. This offers a partial answer to the long-standing question: does code training improve reasoning abilities? We believe it does, at least for mathematical reasoning.
- Although training on arXiv papers is common, especially in many math-related papers, it brings no notable improvement on any of the mathematical benchmarks adopted in this paper.
- We introduce Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources compared to Proximal Policy Optimization (PPO).
- We demonstrate that GRPO significantly enhances the performance of our instruction-tuned model DeepSeekMath-Instruct, solely using the instruction-tuning data. Furthermore, we observe enhancements in out-of-domain performance during the reinforcement learning process.
- We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online vs. offline training, outcome vs. process supervision, single-turn vs. iterative reinforcement learning, and so on, to deeply investigate the essential elements of this paradigm.
- Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize several potential directions to achieve more effective reinforcement learning of LLMs.
- English and Chinese Mathematical Reasoning: We conduct comprehensive assessments of our models on English and Chinese benchmarks, covering mathematical problems
- Formal Mathematics: We evaluate DeepSeekMath-Base using the informal-to-formal theorem proving task from (Jiang et al., 2022) on miniF2F (Zheng et al., 2021), with Isabelle (Wenzel et al., 2008) chosen as the proof assistant. DeepSeekMath-Base demonstrates strong few-shot autoformalization performance.
- Natural Language Understanding, Reasoning, and Code: To build a comprehensive profile of models' general understanding, reasoning, and coding capabilities, we evaluate DeepSeekMath-Base on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020), which encompasses 57 multiple-choice tasks covering diverse subjects; BIG-Bench Hard (BBH) (Suzgun et al., 2022), which consists of 23 challenging tasks that mostly require multi-step reasoning to solve; as well as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), which are widely used to evaluate code language models. Math pre-training benefits both language understanding and reasoning performance.
- MathPile (Wang et al., 2023c): a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv;
- OpenWebMath (Paster et al., 2023): CommonCrawl data filtered for mathematical content, totaling 13.6B tokens;
- Proof-Pile-2 (Azerbayev et al., 2023): a mathematical corpus consisting of OpenWebMath, AlgebraicStack (10.3B tokens of mathematical code), and arXiv papers (28.0B tokens). When experimenting on Proof-Pile-2, we follow Azerbayev et al. (2023) to use an arXiv:Web:Code ratio of 2:4:1.
- High-quality: We evaluate downstream performance on 8 mathematical benchmarks using few-shot chain-of-thought prompting (Wei et al., 2022). As shown in Table 1, there is a clear performance lead of the model trained on the DeepSeekMath Corpus. Figure 3 shows that the model trained on the DeepSeekMath Corpus demonstrates better performance than
- Multilingual: The DeepSeekMath Corpus encompasses data in multiple languages, predominantly featuring English and Chinese as the two most represented languages. As shown in Table 1, training on the DeepSeekMath Corpus enhances mathematical reasoning performance in both English and Chinese. In contrast, existing mathematical corpora, which are primarily English-centric, show limited improvement and may even hinder performance in Chinese mathematical reasoning.
- Large-scale: The DeepSeekMath Corpus is several times larger than existing mathematical corpora. As depicted in Figure 3, DeepSeek-LLM 1.3B, when trained on the DeepSeekMath Corpus, shows a steeper learning curve along with more lasting improvements. In contrast, the baseline corpora are much smaller and have already been repeated multiple rounds during training, with the resulting model performance quickly reaching a plateau.
- English mathematical datasets: We annotate GSM8K and MATH problems with tool-integrated solutions, and adopt a subset of MathInstruct (Yue et al., 2023) along with the training set of Lila-OOD (Mishra et al., 2022), where problems are solved with CoT or PoT. Our English collection covers diverse fields of mathematics, e.g., algebra, probability, number theory, calculus, and geometry.
- Chinese mathematical datasets: We collect Chinese K-12 mathematical problems spanning 76 sub-topics such as linear equations, with solutions annotated in both CoT and tool-integrated reasoning formats.
- Closed-source models include: (1) the GPT family, among which GPT-4 (OpenAI, 2023) and GPT-4 Code Interpreter are the most capable ones, (2) Gemini Ultra and Pro (Anil et al., 2023), (3) Inflection-2 (Inflection AI, 2023), (4) Grok-1, as well as models recently released by Chinese companies, including (5) Baichuan-3 and (6) the latest GLM-4 from the GLM family (Du et al., 2022). These models are for general purposes, most of which have undergone a series of alignment procedures.
- Open-source models include: general models like (1) DeepSeek-LLM-Chat 67B (DeepSeek-AI, 2024), (2) Qwen 72B (Bai et al., 2023), (3) SeaLLM-v2 7B (Nguyen et al., 2023), and (4)
Algorithm 1 (Iterative Group Relative Policy Optimization), steps:
1: policy model $\pi_\theta \leftarrow \pi_{\theta_{init}}$
2: for iteration = 1, ..., I do
3:   reference model $\pi_{ref} \leftarrow \pi_\theta$
4:   for step = 1, ..., M do
5:     Sample a batch $\mathcal{D}_b$ from $\mathcal{D}$
6:     Update the old policy model $\pi_{\theta_{old}} \leftarrow \pi_\theta$
7:     Sample $G$ outputs $\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(\cdot \mid q)$ for each question $q \in \mathcal{D}_b$
8:     Compute rewards $\{r_i\}_{i=1}^{G}$ for each sampled output $o_i$ by running $r_\varphi$
9:     Compute $\hat{A}_{i,t}$ for the $t$-th token of $o_i$ through group relative advantage estimation
10:    for GRPO iteration = 1, ..., $\mu$ do
11:      Update the policy model $\pi_\theta$ by maximizing the GRPO objective (Equation 21)
12:  Update $r_\varphi$ through continuous training using a replay mechanism
1. Introduction
Large language models (LLM) have revolutionized the approach to mathematical reasoning in artificial intelligence, spurring significant advancements in both the quantitative reasoning benchmark (Hendrycks et al., 2021) and the geometry reasoning benchmark (Trinh et al., 2024). Moreover, these models have proven instrumental in assisting humans in solving complex mathematical problems (Tao, 2023). However, cutting-edge models such as GPT-4 (OpenAI, 2023) and Gemini-Ultra (Anil et al., 2023) are not publicly available, and the currently accessible open-source models considerably trail behind in performance.
In this study, we introduce DeepSeekMath, a domain-specific language model that significantly outperforms the mathematical capabilities of open-source models and approaches the performance level of GPT-4 on academic benchmarks. To achieve this, we create the DeepSeekMath Corpus, a large-scale high-quality pre-training corpus comprising 120B math tokens. This dataset is extracted from the Common Crawl (CC) using a fastText-based classifier (Joulin et al., 2016). In the initial iteration, the classifier is trained using instances from OpenWebMath (Paster et al., 2023) as positive examples, while incorporating a diverse selection of other web pages to serve as negative examples. Subsequently, we employ the classifier to mine additional positive instances from the CC, which are further refined through human annotation. The classifier is then updated with this enhanced dataset to improve its performance. The evaluation results indicate that the large-scale corpus is of high quality, as our base model DeepSeekMath-Base 7B achieves 64.2% on GSM8K (Cobbe et al., 2021) and 36.2% on the competition-level MATH dataset (Hendrycks et al., 2021), outperforming Minerva 540B (Lewkowycz et al., 2022a). In addition, the DeepSeekMath Corpus is multilingual, so we notice an improvement in Chinese mathematical benchmarks (Wei et al., 2023; Zhong et al., 2023). We believe that our experience in mathematical data processing is a starting point for the research community, and there is significant room for improvement in the future.
DeepSeekMath-Base is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), as we notice that starting from a code-trained model is a better choice compared to a general LLM. Furthermore, we observe that math training also improves model capability on the MMLU (Hendrycks et al., 2020) and BBH (Suzgun et al., 2022) benchmarks, indicating that it not only enhances the model's mathematical abilities but also amplifies general reasoning capabilities.
After pre-training, we apply mathematical instruction tuning to DeepSeekMath-Base with chain-of-thought (Wei et al., 2022), program-of-thought (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning (Gou et al., 2023) data. The resulting model DeepSeekMath-Instruct 7B beats all 7B counterparts and is comparable with 70B open-source instruction-tuned models.
Furthermore, we introduce Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. Using solely a subset of English instruction-tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, on both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase. We also provide a unified paradigm to understand different methods, such as Rejection Sampling Fine-Tuning (RFT) (Yuan et al., 2023a), Direct Preference Optimization (DPO) (Rafailov et al., 2023), PPO, and GRPO. Based on such a unified paradigm, we find that all these methods can be conceptualized as either direct or simplified RL techniques. We also conduct extensive experiments, e.g., online vs. offline training, outcome vs. process supervision, single-turn vs. iterative RL, and so on,
to deeply investigate the essential elements of this paradigm. Finally, we explain why our RL boosts the performance of instruction-tuned models, and further summarize potential directions to achieve more effective RL based on this unified paradigm.
1.1. Contributions
Our contributions include scalable math pre-training and the exploration and analysis of reinforcement learning.
Math Pre-Training at Scale
Exploration and Analysis of Reinforcement Learning
1.2. Summary of Evaluations and Metrics
from grade-school level to college level. English benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), SAT (Azerbayev et al., 2023), OCW Courses (Lewkowycz et al., 2022a), MMLU-STEM (Hendrycks et al., 2020). Chinese benchmarks include MGSM-zh (Shi et al., 2023), CMATH (Wei et al., 2023), Gaokao-MathCloze (Zhong et al., 2023), and Gaokao-MathQA (Zhong et al., 2023). We evaluate models' ability to generate self-contained text solutions without tool use, and also the ability to solve problems using Python.
On English benchmarks, DeepSeekMath-Base is competitive with the closed-source Minerva 540B (Lewkowycz et al., 2022a), and surpasses all open-source base models (e.g., Mistral 7B (Jiang et al., 2023) and Llemma-34B (Azerbayev et al., 2023)), regardless of whether they've undergone math pre-training or not, often by a significant margin. Notably, DeepSeekMath-Base is superior on Chinese benchmarks, likely because we don't follow previous works (Azerbayev et al., 2023; Lewkowycz et al., 2022a) to collect English-only math pre-training data, and also include high-quality non-English ones. With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demonstrate strong performance, obtaining an accuracy of over 50% on the competition-level MATH dataset for the first time within the open-source community.
2. Math Pre-Training
2.1. Data Collection and Decontamination
In this section, we outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2, we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related data). It is worth noting that this approach is also applicable to other domains, such as coding.
First, we choose OpenWebMath (Paster et al., 2023), a collection of high-quality mathematical web texts, as our initial seed corpus. Using this corpus, we train a fastText model (Joulin et al., 2016) to recall more OpenWebMath-like mathematical web pages. Specifically, we randomly select 500,000 data points from the seed corpus as positive training examples and another 500,000 web pages from Common Crawl as negative ones. We employ an open-source library 1 for training, configuring the vector dimension to 256, learning rate to 0.1, the maximum length
of word n-gram to 3, the minimum number of word occurrences to 3, and the number of training epochs to 3. To reduce the size of the original Common Crawl, we employ URL-based deduplication and near-deduplication techniques, resulting in 40B HTML web pages. We then recall mathematical web pages from deduplicated Common Crawl with the fastText model. To filter out low-quality mathematical content, we rank the collected pages according to their scores predicted by the fastText model, and only preserve the top-ranking ones. The volume of data preserved is assessed through pre-training experiments on the top 40B, 80B, 120B, and 160B tokens. In the first iteration, we choose to keep the top 40B tokens.
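As an illustration (not the authors' code), the classifier described above could be trained with the open-source fastText library using the reported hyperparameters; the file names and label scheme below are assumptions.

```python
# Illustrative sketch: training a fastText classifier to recall math-related web pages,
# with the hyperparameters reported above (dim 256, lr 0.1, word 3-grams, min count 3, 3 epochs).
import fasttext

# Assumed input format: one page per line, prefixed with __label__math or __label__other;
# the file name "seed_corpus_train.txt" is hypothetical.
model = fasttext.train_supervised(
    input="seed_corpus_train.txt",  # 500K OpenWebMath positives + 500K Common Crawl negatives
    dim=256,         # vector dimension
    lr=0.1,          # learning rate
    wordNgrams=3,    # maximum length of word n-grams
    minCount=3,      # minimum number of word occurrences
    epoch=3,         # number of training epochs
)

# Score a candidate Common Crawl page; pages are then ranked by this score and only
# the top-ranking ones are preserved.
labels, probs = model.predict("Prove that sqrt(2) is irrational ...", k=1)
print(labels[0], probs[0])
```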
After the first iteration of data collection, numerous mathematical web pages remain uncollected, mainly because the fastText model is trained on a set of positive examples that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model. Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration. Domains where over 10% of the web pages have been collected are classified as math-related (e.g., mathoverflow.net ). Subsequently, we manually annotate the URLs associated with mathematical content within these identified domains (e.g., mathoverflow.net/questions ). Web pages linked to these URLs, yet uncollected, will be added to the seed corpus. This approach enables us to gather more positive examples, thereby training an improved fastText model capable of recalling more mathematical data in the subsequent iteration. After four iterations of data collection, we end up with 35.5M mathematical web pages, totaling 120B tokens. In the fourth iteration, we notice that nearly 98% of the data has already been collected in the third iteration, so we decide to cease data collection.
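A minimal sketch of the domain-level statistics behind this seed-expansion step is shown below; it assumes simple URL lists and uses the 10% threshold described above, with all data structures being illustrative.

```python
# Illustrative sketch: flag domains as math-related when more than 10% of their pages
# were recalled in the previous iteration, so annotators can mark math URL paths.
from collections import defaultdict
from urllib.parse import urlparse

def flag_math_domains(all_urls, collected_urls, threshold=0.10):
    """all_urls: URLs in the deduplicated Common Crawl;
    collected_urls: set of URLs recalled by the fastText model in the last iteration."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for url in all_urls:
        domain = urlparse(url).netloc  # base URL, e.g. "mathoverflow.net"
        total[domain] += 1
        if url in collected_urls:
            hit[domain] += 1
    return {d for d in total if hit[d] / total[d] > threshold}

# Uncollected pages under manually annotated math paths (e.g. mathoverflow.net/questions)
# are then added to the seed corpus as new positive examples.
```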
To avoid benchmark contamination, we follow Guo et al. (2024) to filter out web pages containing questions or answers from English mathematical benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al., 2023). The filtering criteria are as follows: any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from our math training corpus. For benchmark texts that are shorter than 10 grams but have at least 3 grams, we employ exact matching to filter out contaminated web pages.
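The filtering rule above can be summarized with the following sketch; the word-level n-gram definition and helper names are assumptions, not the authors' implementation.

```python
# Illustrative sketch of the decontamination rule: remove any training text sharing an exact
# 10-gram with a benchmark, and use exact matching for benchmark strings of 3 to 9 grams.
def ngrams(text, n):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_contamination_index(benchmark_texts, n=10, min_n=3):
    long_grams, short_texts = set(), set()
    for t in benchmark_texts:
        toks = t.split()
        if len(toks) >= n:
            long_grams |= ngrams(t, n)
        elif len(toks) >= min_n:
            short_texts.add(t.strip())
    return long_grams, short_texts

def is_contaminated(doc, long_grams, short_texts, n=10):
    return bool(ngrams(doc, n) & long_grams) or any(s in doc for s in short_texts)
```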
2.2. Validating the Quality of the DeepSeekMath Corpus
We run pre-training experiments to investigate how the DeepSeekMath Corpus compares with the recently released math-training corpora:
2.2.1. Training Setting
We apply math training to a general pre-trained language model with 1.3B parameters, which shares the same framework as the DeepSeek LLMs (DeepSeek-AI, 2024), denoted as DeepSeek-LLM 1.3B. We separately train a model on each mathematical corpus for 150B tokens. All experiments are conducted using the efficient and light-weight HAI-LLM (High-flyer, 2023) training framework. Following the training practice of DeepSeek LLMs, we use the AdamW optimizer (Loshchilov and Hutter, 2017) with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight_decay = 0.1, along with a multi-step learning rate schedule where the learning rate reaches the peak after 2,000 warmup steps, decreases to 31.6% of the peak after 80% of the training process, and further decreases to 10.0% of the peak after 90% of the training process. We set the maximum value of the learning rate to 5.3e-4, and use a batch size of 4M tokens with a 4K context length.
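For clarity, the schedule can be expressed as follows; this is a minimal PyTorch sketch, not the HAI-LLM implementation, and the placeholder module and the total step count (150B tokens / 4M tokens per batch ≈ 37,500 steps) are assumptions.

```python
# Illustrative sketch of the AdamW setup and multi-step learning-rate schedule described above.
import torch

def multi_step_lr(step, total_steps, peak_lr=5.3e-4, warmup=2000):
    if step < warmup:
        return peak_lr * step / warmup          # linear warmup to the peak
    if step < 0.8 * total_steps:
        return peak_lr                          # constant at the peak
    if step < 0.9 * total_steps:
        return peak_lr * 0.316                  # 31.6% of the peak after 80% of training
    return peak_lr * 0.10                       # 10.0% of the peak after 90% of training

model = torch.nn.Linear(8, 8)  # placeholder module standing in for the 1.3B LM
optimizer = torch.optim.AdamW(model.parameters(), lr=5.3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: multi_step_lr(step, total_steps=37500) / 5.3e-4)
```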
2.2.2. Evaluation Results
The DeepSeekMath Corpus is of high quality, covers multilingual mathematical content, and is the largest in size.
Proof-Pile-2 at 50B tokens (1 full epoch of Proof-Pile-2), indicating the average quality of DeepSeekMath Corpus is higher.
2.3. Training and Evaluating DeepSeekMath-Base 7B
In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B
(Guo et al., 2024) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese. We mainly adopt the training setting specified in Section 2.2.1, except that we set the maximum value of the learning rate to 4.2e-4 and use a batch size of 10M tokens.
We conduct a comprehensive assessment of the mathematical capabilities of DeepSeekMath-Base 7B, focusing on its ability to produce self-contained mathematical solutions without relying on external tools, solve mathematical problems using tools, and conduct formal theorem proving. Beyond mathematics, we also provide a more general profile of the base model, including its performance in natural language understanding, reasoning, and programming.
Mathematical Problem Solving with Step-by-Step Reasoning We evaluate DeepSeekMath-Base's performance in solving mathematical problems using few-shot chain-of-thought prompting (Wei et al., 2022), across eight benchmarks in English and Chinese. These benchmarks encompass quantitative reasoning (e.g., GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMATH (Wei et al., 2023)) and multiple-choice problems (e.g., MMLU-STEM (Hendrycks et al., 2020) and Gaokao-MathQA (Zhong et al., 2023)), covering diverse fields of mathematics from elementary to college-level complexity.
As shown in Table 2, DeepSeekMath-Base 7B leads in performance across all eight benchmarks among the open-source base models (including the widely-used general model Mistral 7B (Jiang et al., 2023) and the recently released Llemma 34B (Azerbayev et al., 2023), which underwent math training on Proof-Pile-2 (Azerbayev et al., 2023)). Notably, on the competition-level MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a), a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b) and is further trained on mathematical texts.
Mathematical Problem Solving with Tool Use We evaluate program-aided mathematical reasoning on GSM8K and MATH using few-shot program-of-thought prompting (Chen et al., 2022; Gao et al., 2023). Models are prompted to solve each problem by writing a Python program where libraries such as math and sympy can be utilized for intricate computations. The execution result of the program is evaluated as the answer. As shown in Table 3, DeepSeekMath-Base 7B outperforms the prior state-of-the-art Llemma 34B.
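To illustrate the program-of-thought format, the snippet below shows the kind of program a model is expected to emit; the problem itself is an invented example rather than one drawn from GSM8K or MATH.

```python
# Illustrative program-of-thought style solution for the invented problem:
# "What is the sum of the roots of x^2 - 5x + 6 = 0?"
# The model writes a Python program, and its execution result is graded as the answer.
from sympy import symbols, solve

x = symbols("x")
roots = solve(x**2 - 5*x + 6, x)   # [2, 3]
answer = sum(roots)
print(answer)                       # 5
```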
Table 3 | Few-shot evaluation of base models' ability to solve mathematical problems using tools and the ability to conduct informal-to-formal theorem proving in Isabelle.
Formal Mathematics Formal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, with increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022) which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, and an informal proof. We evaluate on miniF2F (Zheng et al., 2021), a benchmark for formal Olympiad-level mathematics, and generate a formal proof in Isabelle for each problem with few-shot prompting. Following Jiang et al. (2022), we leverage models to generate proof sketches, and execute the off-the-shelf automated prover Sledgehammer (Paulson, 2010) to fill in the missing details. As shown in Table 3, DeepSeekMath-Base 7B demonstrates strong performance in proof autoformalization.
Natural Language Understanding, Reasoning, and Code We evaluate model performance of natural language understanding on MMLU (Hendrycks et al., 2020), reasoning on BBH (Suzgun et al., 2022), and coding capabilities on HumanEval (Chen et al., 2021) and MBPP (Austin et al.,
2021). As shown in Table 4, DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024), illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks. Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023) on the three reasoning and coding benchmarks.
3. Supervised Fine-Tuning
3.1. SFT Data Curation
We construct a mathematical instruction-tuning dataset covering English and Chinese problems from different mathematical fields and of varying complexity levels: problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022), program-of-thought (PoT) (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning format (Gou et al., 2023). The total number of training examples is 776K.
3.2. Training and Evaluating DeepSeekMath-Instruct 7B
In this section, we introduce DeepSeekMath-Instruct 7B which undergoes mathematical instruction tuning based on DeepSeekMath-Base. Training examples are randomly concatenated until reaching a maximum context length of 4K tokens. We train the model for 500 steps with a batch size of 256 and a constant learning rate of 5e-5.
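The example-packing scheme used here can be sketched as follows; the tokenized-example representation and helper name are assumptions for illustration only.

```python
# Minimal sketch of packing randomly ordered SFT examples into 4K-token training sequences.
import random

def pack_examples(tokenized_examples, max_len=4096):
    random.shuffle(tokenized_examples)
    packed, current = [], []
    for ids in tokenized_examples:
        if len(current) + len(ids) > max_len and current:
            packed.append(current)              # flush the full sequence
            current = []
        current.extend(ids[:max_len - len(current)])  # truncate over-long examples
    if current:
        packed.append(current)
    return packed  # each entry is one training sequence of at most 4,096 token ids
```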
We evaluate models' mathematical performance both without and with tool use, on 4 quantitative reasoning benchmarks in English and Chinese. We benchmark our model against the leading models of the time:
ChatGLM3 6B (ChatGLM3 Team, 2023), as well as models with enhancements in mathematics, including (5) InternLM2-Math 20B, which builds on InternLM2 and underwent math training followed by instruction tuning, (6) Math-Shepherd-Mistral 7B, which applies PPO training (Schulman et al., 2017) to Mistral 7B (Jiang et al., 2023) with a process-supervised reward model, (7) the WizardMath series (Luo et al., 2023), which improves mathematical reasoning in Mistral 7B and Llama-2 70B (Touvron et al., 2023) using evol-instruct (i.e., a version of instruction tuning that uses AI-evolved instructions) and PPO training, with training problems primarily sourced from GSM8K and MATH, (8) MetaMath 70B (Yu et al., 2023), which is Llama-2 70B fine-tuned on an augmented version of GSM8K and MATH, (9) ToRA 34B (Gou et al., 2023), which is CodeLlama 34B fine-tuned to do tool-integrated mathematical reasoning, and (10) MAmmoTH 70B (Yue et al., 2023), which is Llama-2 70B instruction-tuned on MathInstruct.
As shown in Table 5, under the evaluation setting where tool use is disallowed, DeepSeekMath-Instruct 7B demonstrates strong step-by-step reasoning performance. Notably, on the competition-level MATH dataset, our model surpasses all open-source models and the majority of proprietary models (e.g., Inflection-2 and Gemini Pro) by at least 9% absolute. This is true even for models that are substantially larger (e.g., Qwen 72B) or have been specifically enhanced through math-focused reinforcement learning (e.g., WizardMath-v1.1 7B). While DeepSeekMath-Instruct rivals the Chinese proprietary models GLM-4 and Baichuan-3 on MATH, it still underperforms GPT-4 and Gemini Ultra.
Under the evaluation setting where models are allowed to integrate natural language reasoning and program-based tool use for problem solving, DeepSeekMath-Instruct 7B approaches an accuracy of 60% on MATH, surpassing all existing open-source models. On the other benchmarks, our model is competitive with DeepSeek-LLM-Chat 67B, the prior state-of-the-art that is 10 times larger.
4. Reinforcement Learning
4.1. Group Relative Policy Optimization
Reinforcement learning (RL) has been proven to be effective in further improving the mathematical reasoning ability of LLMs after the Supervised Fine-Tuning (SFT) stage (Luo et al., 2023; Wang et al., 2023b). In this section, we introduce our efficient and effective RL algorithm, Group Relative Policy Optimization (GRPO).
4.1.1. From PPO to GRPO
Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang et al., 2022). In particular, it optimizes LLMs by maximizing the following surrogate objective:
$$\mathcal{J}_{PPO}(\theta)=\mathbb{E}\big[q\sim P(Q),\,o\sim\pi_{\theta_{old}}(O\,|\,q)\big]\,\frac{1}{|o|}\sum_{t=1}^{|o|}\min\!\left[\frac{\pi_\theta(o_t\,|\,q,o_{<t})}{\pi_{\theta_{old}}(o_t\,|\,q,o_{<t})}A_t,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_t\,|\,q,o_{<t})}{\pi_{\theta_{old}}(o_t\,|\,q,o_{<t})},1-\varepsilon,1+\varepsilon\right)A_t\right], \quad (1)$$
where $\pi_\theta$ and $\pi_{\theta_{old}}$ are the current and old policy models, and $q$, $o$ are questions and outputs sampled from the question dataset and the old policy $\pi_{\theta_{old}}$, respectively. $\varepsilon$ is a clipping-related hyper-parameter introduced in PPO for stabilizing training. $A_t$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based
on the rewards $\{r_{\geq t}\}$ and a learned value function $V_\psi$. Thus, in PPO, a value function needs to be trained alongside the policy model, and to mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a reference model in the reward at each token (Ouyang et al., 2022), i.e.,
$$r_t = r_\varphi(q,o_{\leq t}) - \beta\,\log\frac{\pi_\theta(o_t\,|\,q,o_{<t})}{\pi_{ref}(o_t\,|\,q,o_{<t})}, \quad (2)$$
where $r_\varphi$ is the reward model, $\pi_{ref}$ is the reference model, which is usually the initial SFT model, and $\beta$ is the coefficient of the KL penalty.
As the value function employed in PPO is typically another model of comparable size as the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. In the LLM context, however, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token. To address this, as shown in Figure 4, we propose Group Relative Policy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the following objective:
$$\mathcal{J}_{GRPO}(\theta)=\mathbb{E}\big[q\sim P(Q),\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O\,|\,q)\big]\,\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\!\left[\frac{\pi_\theta(o_{i,t}\,|\,q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}\,|\,q,o_{i,<t})}\hat{A}_{i,t},\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_{i,t}\,|\,q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}\,|\,q,o_{i,<t})},1-\varepsilon,1+\varepsilon\right)\hat{A}_{i,t}\right]-\beta\,\mathbb{D}_{KL}\big[\pi_\theta\,\|\,\pi_{ref}\big]\right\}, \quad (3)$$
where $\varepsilon$ and $\beta$ are hyper-parameters, and $\hat{A}_{i,t}$ is the advantage calculated based on the relative rewards of the outputs inside each group only, which will be detailed in the following subsections. The group-relative way in which GRPO calculates the advantages aligns well with the comparative nature of reward models, as reward models are typically trained on datasets of comparisons between outputs for the same question. Also note that, instead of adding the KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of $\hat{A}_{i,t}$.
Algorithm 1 Iterative Group Relative Policy Optimization
Input: initial policy model $\pi_{\theta_{init}}$; reward models $r_\varphi$; task prompts $\mathcal{D}$; hyperparameters $\varepsilon$, $\beta$, $\mu$
Output: $\pi_\theta$
And different from the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020):
$$\mathbb{D}_{KL}\big[\pi_\theta\,\|\,\pi_{ref}\big]=\frac{\pi_{ref}(o_{i,t}\,|\,q,o_{i,<t})}{\pi_\theta(o_{i,t}\,|\,q,o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t}\,|\,q,o_{i,<t})}{\pi_\theta(o_{i,t}\,|\,q,o_{i,<t})}-1, \quad (4)$$
which is guaranteed to be positive.
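For concreteness, the following is a minimal PyTorch-style sketch of the per-question GRPO objective in Equation (3) combined with the unbiased KL estimator of Equation (4). The tensor layout, masking scheme, and default values of the clipping parameter and KL coefficient are illustrative assumptions; the group-relative advantages are computed as described in the next subsections.

```python
# Minimal sketch of the GRPO loss for one group of G sampled outputs (not the authors' code).
import torch

def grpo_loss(logp, logp_old, logp_ref, advantages, mask, eps=0.2, beta=0.04):
    """All tensors have shape [G, T]: per-token log-probs of the sampled tokens under the
    current, old (detached), and reference (detached) policies, group-relative advantages,
    and a 0/1 validity mask over tokens."""
    ratio = torch.exp(logp - logp_old)                     # pi_theta / pi_theta_old
    surr = torch.min(ratio * advantages,
                     torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    # Unbiased KL estimator of Eq. (4): pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 >= 0
    log_ratio_ref = logp_ref - logp
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    per_token = surr - beta * kl
    # Average over tokens within each output, then over the group; negate to minimize.
    per_output = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return -per_output.mean()
```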
4.1.2. Outcome Supervision RL with GRPO
Formally, for each question $q$, a group of outputs $\{o_1, o_2, \cdots, o_G\}$ are sampled from the old policy model $\pi_{\theta_{old}}$. A reward model is then used to score the outputs, yielding $G$ rewards $\mathbf{r} = \{r_1, r_2, \cdots, r_G\}$ correspondingly. Subsequently, these rewards are normalized by subtracting the group average and dividing by the group standard deviation. Outcome supervision provides the normalized reward at the end of each output $o_i$ and sets the advantage $\hat{A}_{i,t}$ of all tokens in the output to the normalized reward, i.e., $\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})}$, and then optimizes the policy by maximizing the objective defined in Equation (3).
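A small sketch of this outcome-supervision advantage computation is given below; the input representation and the numerical-stability epsilon are assumptions.

```python
# Sketch of outcome-supervision group-relative advantages: every token of output o_i
# receives the group-normalized final reward (r_i - mean(r)) / std(r).
import torch

def outcome_advantages(rewards, lengths):
    """rewards: list of G scalar rewards for one question's outputs;
    lengths: list of G token counts, one per output."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    norm = (r - r.mean()) / (r.std() + 1e-8)               # group-relative normalization
    T = int(max(lengths))
    adv = torch.zeros(len(rewards), T)
    for i, L in enumerate(lengths):
        adv[i, :L] = norm[i]                               # broadcast to all tokens of o_i
    return adv                                             # plays the role of A_hat_{i,t} in Eq. (3)
```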
4.1.3. Process Supervision RL with GRPO
Outcome supervision only provides a reward at the end of each output, which may not be sufficient and efficient to supervise the policy in complex mathematical tasks. Following Wang et al. (2023b), we also explore process supervision, which provides a reward at the end of each reasoning step. Formally, given the question $q$ and $G$ sampled outputs $\{o_1, o_2, \cdots, o_G\}$, a process reward model is used to score each step of the outputs, yielding corresponding rewards: $R = \{\{r_1^{index(1)}, \cdots, r_1^{index(K_1)}\}, \cdots, \{r_G^{index(1)}, \cdots, r_G^{index(K_G)}\}\}$, where $index(j)$ is the end token index of the $j$-th step, and $K_i$ is the total number of steps in the $i$-th output. We also normalize these rewards with the average and the standard deviation, i.e., $\tilde{r}_i^{index(j)} = \frac{r_i^{index(j)} - \mathrm{mean}(R)}{\mathrm{std}(R)}$. Subsequently, process supervision calculates the advantage of each token as the sum of the normalized rewards from the following steps, i.e., $\hat{A}_{i,t} = \sum_{index(j) \geq t} \tilde{r}_i^{index(j)}$, and then optimizes the policy by maximizing the objective defined in Equation (3).
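The per-token advantage under process supervision can be sketched as follows; the nested-list inputs and helper name are illustrative assumptions.

```python
# Sketch of process-supervision advantages: normalize all step rewards across the group,
# then set A_hat_{i,t} to the sum of normalized rewards of steps ending at or after token t.
import torch

def process_advantages(step_rewards, step_ends, max_len):
    """step_rewards[i][j]: reward of the j-th step of output i;
    step_ends[i][j]: end-token index of that step; max_len: length of the longest output."""
    flat = torch.tensor([r for steps in step_rewards for r in steps], dtype=torch.float32)
    mean, std = flat.mean(), flat.std() + 1e-8
    G = len(step_rewards)
    adv = torch.zeros(G, max_len)
    for i in range(G):
        for r, end in zip(step_rewards[i], step_ends[i]):
            norm_r = (r - mean) / std
            adv[i, : end + 1] += norm_r     # every token t <= end receives this step's reward
    return adv
```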
4.1.4. Iterative RL with GRPO
As the reinforcement learning training process progresses, the old reward model may not be sufficient to supervise the current policy model. Therefore, we also explore the iterative RL with GRPO. As shown in Algorithm 1, in iterative GRPO, we generate new training sets for the reward model based on the sampling results from the policy model and continually train the old reward model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model.
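The replay step can be pictured with the small sketch below, which only illustrates how new comparison data and 10% of the accumulated historical data might be mixed for each reward-model update; the data representation is assumed.

```python
# Illustrative sketch of the replay mechanism for iterative GRPO reward-model training.
import random

def build_rm_training_set(new_data, history, replay_frac=0.10):
    replay = random.sample(history, k=int(replay_frac * len(history))) if history else []
    history.extend(new_data)            # keep growing the historical pool in place
    return new_data + replay            # data used to continually train the reward model
```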
4.2. Training and Evaluating DeepSeekMath-RL
We conduct RL based on DeepSeekMath-Instruct 7B. The training data of RL are chain-of-thought-format questions related to GSM8K and MATH from the SFT data, which consists of around 144K questions. We exclude other SFT questions to investigate the impact of RL on benchmarks that lack data throughout the RL phase. We construct the training set of reward models following (Wang et al., 2023b). We train our initial reward model based on DeepSeekMath-Base 7B with a learning rate of 2e-5. For GRPO, we set the learning rate of the policy model to 1e-6. The KL coefficient is 0.04. For each question, we sample 64 outputs. The max length is set to 1024, and the training batch size is 1024. The policy model only has a single update following each exploration stage. We evaluate DeepSeekMath-RL 7B on benchmarks following DeepSeekMath-Instruct 7B. For DeepSeekMath-RL 7B, GSM8K and MATH with chain-of-thought reasoning can be regarded as in-domain tasks, and all the other benchmarks can be regarded as out-of-domain tasks.
Table 5 demonstrates the performance of open- and closed-source models with both chain-of-thought and tool-integrated reasoning on English and Chinese benchmarks. We find that: 1) DeepSeekMath-RL 7B attains accuracies of 88.2% and 51.7% on GSM8K and MATH, respectively, utilizing chain-of-thought reasoning. This performance surpasses that of all open-source models in the 7B to 70B range, as well as the majority of closed-source models. 2) Crucially, DeepSeekMath-RL 7B is only trained on chain-of-thought-format instruction tuning data of GSM8K and MATH, starting from DeepSeekMath-Instruct 7B. Despite the constrained scope of its training data, it outperforms DeepSeekMath-Instruct 7B across all evaluation metrics, showcasing the effectiveness of reinforcement learning.
5. Discussion
In this section, we will share our findings in pre-training and RL experiments.
5.1. Lessons Learnt in Pre-Training
We first share our experience in pre-training. Unless otherwise specified, we will adhere to the training settings outlined in Section 2.2.1. It is worth noting that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process.
5.1.1. Code Training Benefits Mathematical Reasoning
A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial response to this, particularly within the mathematical domain: code training