---
license: llama2
model-index:
- name: Xwin-Math-70B-V1.0
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 64.51
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/Xwin-Math-70B-V1.0
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 84.88
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/Xwin-Math-70B-V1.0
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 66.2
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/Xwin-Math-70B-V1.0
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 51.58
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/Xwin-Math-70B-V1.0
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 81.53
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/Xwin-Math-70B-V1.0
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.0
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/Xwin-Math-70B-V1.0
      name: Open LLM Leaderboard
---

# Xwin-Math

Xwin-Math is a series of powerful SFT LLMs for math problem solving, based on LLaMA-2.
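Since the checkpoints are LLaMA-2-based models hosted on the Hub, they load with the standard `transformers` causal-LM API. Below is a minimal inference sketch; note that the prompt format is an assumption, since this card does not document an official template.

```python
# Minimal inference sketch for the Xwin-Math checkpoints (LLaMA-2 based,
# so the standard transformers causal-LM API applies). The prompt format
# is an assumption -- this card does not specify an official template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Xwin-LM/Xwin-Math-7B-V1.0"  # or the 13B / 70B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",
)

prompt = (
    "A train travels at 60 miles per hour for 2.5 hours. "
    "How many miles does it travel? Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```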
## 🔥 News

- 💥 [Nov, 2023] The [Xwin-Math-70B-V1.0](https://huggingface.co/Xwin-LM/Xwin-Math-70B-V1.0) model achieves **31.8 pass@1 on the MATH benchmark** and **87.0 pass@1 on the GSM8K benchmark**. This performance places it first among all open-source models!
- 💥 [Nov, 2023] The [Xwin-Math-7B-V1.0](https://huggingface.co/Xwin-LM/Xwin-Math-7B-V1.0) and [Xwin-Math-13B-V1.0](https://huggingface.co/Xwin-LM/Xwin-Math-13B-V1.0) models achieve **66.6 and 76.2 pass@1 on the GSM8K benchmark**, ranking first among all LLaMA-2-based 7B and 13B open-source models, respectively!

## ✨ Model Card

| Model | GSM8K | MATH | Checkpoint | License |
|:-:|:-:|:-:|:-:|:-:|
| Xwin-Math-7B-V1.0  | 66.6 | 17.4 | 🤗 [HF Link](https://huggingface.co/Xwin-LM/Xwin-Math-7B-V1.0)  | Llama 2 License |
| Xwin-Math-13B-V1.0 | 76.2 | 21.7 | 🤗 [HF Link](https://huggingface.co/Xwin-LM/Xwin-Math-13B-V1.0) | Llama 2 License |
| Xwin-Math-70B-V1.0 | 87.0 | 31.8 | 🤗 [HF Link](https://huggingface.co/Xwin-LM/Xwin-Math-70B-V1.0) | Llama 2 License |

## 🚀 Benchmarks

### Xwin-Math performance on [MATH](https://github.com/hendrycks/math) and [GSM8K](https://github.com/openai/grade-school-math)

Xwin-Math-70B-V1.0 achieves **31.8% on MATH** and **87.0% on GSM8K**. These scores are **5.3** and **3.1** points higher, respectively, than the previous state-of-the-art open-source models, MetaMath and LEMAv1.

| **Model** | **MATH (our test)** | **GSM8K (our test)** |
|:-:|:-:|:-:|
| GPT-4 (zero-shot)     | 52.4 | 94.8 |
| GPT-35-Turbo (8-shot) | 37.1 | 81.0 |
|  |  |  |
| WizardMath-70B | 23.9 | 81.1 |
| MAmmoTH-70B    | 20.8 | 72.6 |
| MetaMath-70B   | 26.5 | 82.0 |
| LEMAv1-70B     | 25.9 | 83.9 |
| **Xwin-Math-70B-V1.0** | **31.8** | **87.0** |
|  |  |  |
| WizardMath-13B | 15.0 | 63.7 |
| MAmmoTH-13B    | 12.3 | 56.2 |
| MetaMath-13B   | 22.7 | 70.9 |
| LEMAv1-13B     | 13.6 | 65.0 |
| **Xwin-Math-13B-V1.0** | 21.7 | 76.2 |
|  |  |  |
| WizardMath-7B | 10.9 | 55.0 |
| MAmmoTH-7B    |  9.6 | 50.2 |
| MetaMath-7B   | 20.1 | 66.6 |
| LEMAv1-7B     | 10.0 | 54.7 |
| **Xwin-Math-7B-V1.0** | 17.4 | 66.6 |

We obtained these results using our flexible evaluation strategy. Owing to differences in environment and hardware, the numbers may differ from previously reported results, but we have made the evaluation as accurate and fair as possible.

### Xwin-Math performance on other math benchmarks

Our 70B model shows strong mathematical generalization across these benchmarks, leading all open-source models; on some benchmarks it even approaches or surpasses GPT-35-Turbo.

| **Model** | SVAMP | ASDiv | NumGlue | Algebra | MAWPS | **Average** |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| GPT-35-Turbo (8-shot) | 80.6 | 84.1 | 81.8 | 90.5 | 91.7 | 85.7 |
|  |  |  |  |  |  |  |
| WizardMath-70B | 80.2 | 75.8 | 71.4 | 64.0 | 74.9 | 73.3 |
| MAmmoTH-70B    | 71.2 | 73.9 | 62.7 | 58.1 | 72.2 | 67.6 |
| MetaMath-70B   | 85.8 | 81.1 | 77.5 | 79.7 | 81.4 | 81.1 |
| LEMAv1-70B-MATH\* | 81.6 | 77.1 | 72.1 | 69.4 | 81.8 | 76.5 |
| **Xwin-Math-70B-V1.0** | 84.0 | 84.1 | 81.3 | 78.4 | 90.8 | 83.7 |

\* LEMAv1 has two models; we report the stronger LEMAv1-70B-MATH model on these benchmarks.

## 🔨 Evaluation

To evaluate a model's mathematical capabilities more flexibly and to ensure a fair comparison of results, particularly on the MATH benchmark, we have developed a new evaluation tool. We have also used it to assess the pass@1 results of recent models on the MATH and GSM8K benchmarks, which yields more accurate numbers. We hope this toolkit can benefit the open-source community by providing more accurate insights and conclusions.
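As a rough illustration of what such flexible checking involves, a sketch follows. It is not the Xwin-Math toolkit itself (the real tool is linked in the next paragraph); every function here is hypothetical, and a production evaluator must handle many more answer formats (LaTeX expressions, intervals, sets, and so on).

```python
# Illustrative sketch of flexible answer checking for math benchmarks.
# NOT the official Xwin-Math toolkit -- just the general idea: extract a
# final answer from the generated solution, then compare it with the
# reference by mathematical value rather than by exact string match.
import re
from fractions import Fraction

def extract_final_answer(solution: str) -> str | None:
    """Take the last number-like token in the solution as the final answer."""
    matches = re.findall(r"-?\d+(?:/\d+)?(?:\.\d+)?", solution.replace(",", ""))
    return matches[-1] if matches else None

def to_number(text: str):
    """Parse integers, decimals, and simple fractions into an exact value."""
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None

def is_correct(solution: str, reference: str) -> bool:
    """Flexible check: '0.5', '1/2', and '  1/2' all count as equal."""
    predicted = extract_final_answer(solution)
    if predicted is None:
        return False
    p, r = to_number(predicted), to_number(reference.strip())
    return p is not None and r is not None and p == r

def pass_at_1(solutions: list[str], references: list[str]) -> float:
    """pass@1 with one greedy sample per problem reduces to accuracy (%)."""
    hits = sum(is_correct(s, r) for s, r in zip(solutions, references))
    return 100.0 * hits / len(references)
```

The point of the value-level comparison is that a mathematically correct response is not penalized for writing `1/2` where the reference says `0.5`.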
For a deeper understanding of our evaluation tool and methods, please visit [here](https://github.com/Xwin-LM/Xwin-LM/tree/main/Xwin-Math/eval).

* "Report" refers to the accuracy stated in the original papers.
* "Repro" indicates results we reproduced by generating responses and evaluating them with the respective open-source models and scripts.
* "Strict" and "Flex" denote the results we obtained by applying our two answer-extraction strategies to the same responses as "Repro"; a sketch of the two strategies follows the table below.

| Model | MATH
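To make the "Strict"/"Flex" distinction concrete, here is a hypothetical sketch: a strict extractor accepts only an explicitly marked final answer, while a flexible one falls back to the last number-like token in the response. This is not the actual toolkit (the real extraction logic lives at the link above), and the marker patterns below are invented for illustration.

```python
# Hypothetical illustration of the "Strict" vs. "Flex" extraction idea.
# NOT the official toolkit -- the marker patterns are invented examples.
import re

def extract_strict(response: str) -> str | None:
    """Strict: accept only an explicitly marked final answer,
    e.g. 'The answer is 42' or a boxed answer like '\\boxed{42}'."""
    m = re.search(r"(?:[Tt]he answer is|\\boxed\{)\s*(-?[\d./]+)", response)
    return m.group(1).rstrip(".") if m else None

def extract_flex(response: str) -> str | None:
    """Flex: fall back to the last number-like token anywhere in the text."""
    matches = re.findall(r"-?\d+(?:/\d+)?(?:\.\d+)?", response.replace(",", ""))
    return matches[-1] if matches else None

response = "48 + 24 = 72, so she sold 72 clips in total"
print(extract_strict(response))  # None -- no explicit final-answer marker
print(extract_flex(response))    # '72' -- the last number is recovered
```

Scoring the same responses under both extractors shows how much of the gap between reported numbers comes from answer formatting rather than from actual mathematical ability.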