Gemma-3-1B-IT Math RS-SFT (Best Model)
A Korean math model trained with a two-stage SFT → Rejection Sampling → SFT pipeline. Best-performing model.
Performance
| Benchmark | Score |
|---|---|
| HRM8K eval GSM8K (264 problems, Korean) | ~46.6% avg, 48.9% best run |
| HRM8K eval MATH (577 problems, Korean) | ~17% |
⚠️ Even at temperature=0, vLLM inference shows a run-to-run variance of ±2-4%p. The scores above are averaged over 3 evaluation runs.
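To put the ±2-4%p band in context, the sketch below converts percentage points into problem counts for the two eval sets. It uses only the set sizes from the table (264 and 577); the function names are illustrative.

```python
def points_per_problem(n_problems: int) -> float:
    """Percentage points that a single problem contributes to accuracy."""
    return 100.0 / n_problems


def problems_in_band(n_problems: int, band_pp: float) -> float:
    """Number of problems corresponding to a swing of `band_pp` percentage points."""
    return band_pp * n_problems / 100.0


if __name__ == "__main__":
    for name, n in [("GSM8K", 264), ("MATH", 577)]:
        print(
            f"{name}: 1 problem = {points_per_problem(n):.2f}%p, "
            f"2%p = {problems_in_band(n, 2):.1f} problems, "
            f"4%p = {problems_in_band(n, 4):.1f} problems"
        )
```

On the 264-problem GSM8K split one problem is worth about 0.38%p, so a ±2-4%p swing corresponds to roughly 5 to 11 problems changing outcome between runs.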
Data generation pipeline
Stage 1: SFT data (teacher distillation)
Identical to the data used for the SFT model: 7,473 GSM8K problems → 26,254 Korean solutions generated with Qwen3-30B.
Stage 2: RS data (on-policy sampling)
RS sampling
RS data filtering
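The sampling and filtering code is not preserved in this card, so the following is only a minimal sketch of a typical rejection-sampling filter: keep the SFT model's own samples whose final answer matches the reference. `extract_final_answer` (which takes the last number in the text) and `filter_samples` are hypothetical helpers, not the author's scripts, and the removal of exact duplicates is an added assumption.

```python
import re
from typing import Dict, List, Optional

# Matches integers and decimals, optionally with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_answer(solution: str) -> Optional[str]:
    """Hypothetical extractor: treat the last number in the text as the answer."""
    matches = NUMBER.findall(solution)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def same_number(a: str, b: str) -> bool:
    """Compare two numeric strings by value; non-numeric input never matches."""
    try:
        return float(a) == float(b)
    except ValueError:
        return False


def filter_samples(question: str, gold: str, samples: List[str]) -> List[Dict[str, str]]:
    """Keep only samples whose extracted answer equals the gold answer."""
    gold = gold.replace(",", "")
    kept, seen = [], set()
    for text in samples:
        pred = extract_final_answer(text)
        if pred is None or not same_number(pred, gold):
            continue  # rejected: no answer found, or wrong answer
        if text in seen:
            continue  # assumption: drop exact duplicates
        seen.add(text)
        kept.append({"question": question, "answer": text})
    return kept
```

The accepted records use the same question/answer fields as the SFT data, so they can be fed to the same training script.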
RS-SFT training data composition (key!)
Replay is essential: training on RS data alone makes the model forget the teacher's solution patterns, and performance drops (catastrophic forgetting).
| Replay ratio | GSM8K | Notes |
|---|---|---|
| 0x (RS only) | 46.2% | forgetting |
| 2x | 46.6% | insufficient |
| 3x | 48.5% | good |
| 5x | 48.9% | best |
| max (all) | 47.3% | RS diluted |
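The card does not define how the replay ratio is computed. The sketch below assumes that "Nx" means adding Stage 1 SFT examples amounting to N times the number of RS examples, and that "max" means adding all of them. Both the interpretation and the helper `build_rs_sft_mix` are assumptions for illustration.

```python
import random
from typing import Dict, List, Optional

Record = Dict[str, str]


def build_rs_sft_mix(
    rs_data: List[Record],
    sft_data: List[Record],
    replay_ratio: Optional[float],
    seed: int = 0,
) -> List[Record]:
    """Mix RS data with replayed SFT data.

    replay_ratio=None is treated as "max": replay every SFT example.
    Otherwise replay_ratio * len(rs_data) SFT examples are sampled
    without replacement (capped at the size of the SFT set).
    """
    rng = random.Random(seed)
    if replay_ratio is None:
        replay = list(sft_data)
    else:
        n_replay = min(len(sft_data), int(replay_ratio * len(rs_data)))
        replay = rng.sample(list(sft_data), n_replay)
    mixed = list(rs_data) + replay
    rng.shuffle(mixed)  # interleave RS and replay examples
    return mixed
```

Under this reading, a larger ratio lowers the share of RS examples in the mix, which matches the "RS diluted" note on the max row.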
RS-SFT training data format
Same question/answer JSON as SFT. The difference is that answer is a correct solution generated by the student model (SFT) itself.
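As an illustration of the format, the sketch below serializes and validates one record. Only the question and answer field names come from the card. The one-record-per-line layout, the helper names, and the sample text are assumptions made up for this example.

```python
import json


def make_record(question: str, answer: str) -> str:
    """Serialize one training example; ensure_ascii=False keeps Korean readable."""
    return json.dumps({"question": question, "answer": answer}, ensure_ascii=False)


def is_valid_record(line: str) -> bool:
    """Check that a line is a JSON object with non-empty string question/answer."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(
        isinstance(obj.get(key), str) and obj[key].strip()
        for key in ("question", "answer")
    )


if __name__ == "__main__":
    # Made-up example text, not taken from the actual dataset.
    line = make_record(
        "사과 3개와 배 4개가 있습니다. 과일은 모두 몇 개입니까?",
        "3 + 4 = 7이므로 과일은 모두 7개입니다.",
    )
    print(line, is_valid_record(line))
```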
Training settings
Stage 1: SFT
Stage 2: RS-SFT
How to reproduce
INFO 03-19 14:53:13 [init.py:216] Automatically detected platform cuda.
(APIServer pid=3428638) INFO 03-19 14:53:19 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=3428638) INFO 03-19 14:53:19 [utils.py:233] non-default args: {'model_tag': './sft_model', 'model': './sft_model', 'dtype': 'bfloat16', 'max_model_len': 4096, 'gpu_memory_utilization': 0.85}
INFO 03-19 14:53:25 [init.py:216] Automatically detected platform cuda.
(APIServer pid=3428911) INFO 03-19 14:53:31 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=3428911) INFO 03-19 14:53:31 [utils.py:233] non-default args: {'model_tag': './rs_sft_model', 'model': './rs_sft_model', 'dtype': 'bfloat16', 'max_model_len': 4096, 'gpu_memory_utilization': 0.85}
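The original launch commands are not shown, only the server logs. The non-default args in those logs correspond to options of the `vllm serve` command, so a launch command can be reconstructed as sketched below. `build_serve_cmd` is a hypothetical helper, and the reconstruction is an inference from the log rather than the author's script.

```python
import shlex
from typing import List


def build_serve_cmd(model_path: str) -> List[str]:
    """Assemble a `vllm serve` command matching the logged non-default args."""
    return [
        "vllm", "serve", model_path,
        "--dtype", "bfloat16",
        "--max-model-len", "4096",
        "--gpu-memory-utilization", "0.85",
    ]


if __name__ == "__main__":
    # Model directories as they appear in the logs.
    for path in ("./sft_model", "./rs_sft_model"):
        print(shlex.join(build_serve_cmd(path)))
```

The printed lines can be pasted into a shell, or the list can be passed to `subprocess.Popen` to start the server from Python.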
Failed approaches (for reference)
- Iterative RS (another RS round on top of the RS model): always regressed
- DPO (10 variants tried): none helped (the 1B model lacks capacity)
- GRPO (2 variants tried): within baseline variance
- Other teacher models: large drop caused by style mismatch
Files
- : Stage 1 SFT training
- : RS sampling script (requires a running vLLM server)
- : Stage 2 RS-SFT training (with replay)
- : HRM8K evaluation