Gemma-3-1B-IT Math RS-SFT (Best Model)

A Korean math model trained with a two-stage pipeline: SFT → Rejection Sampling → SFT. This is the best-performing model.

Performance

| Benchmark | Score |
| --- | --- |
| HRM8K eval GSM8K (264 problems, Korean) | ~46.6% avg, 48.9% best run |
| HRM8K eval MATH (577 problems, Korean) | ~17% |

⚠️ Even at temperature=0, vLLM inference varies by ±2-4 %p between runs. The figures above are averages over 3 evaluation runs.
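Because single runs can differ by a few points, it helps to report the mean together with the spread. The helper below is a minimal sketch of that aggregation; `summarize_runs` is a hypothetical name and is not part of this repository.

```python
from statistics import mean


def summarize_runs(accuracies):
    """Return (mean, min, max) of per-run accuracies given as fractions in [0, 1]."""
    if not accuracies:
        raise ValueError("need at least one evaluation run")
    return mean(accuracies), min(accuracies), max(accuracies)
```

Pass one accuracy per evaluation run; the min and max make the run-to-run spread visible next to the average.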

데이터 생성 νŒŒμ΄ν”„λΌμΈ

Stage 1: SFT 데이터 (ꡐ사 증λ₯˜)

μœ„ SFT λͺ¨λΈκ³Ό 동일. GSM8K 7,473문제 β†’ Qwen3-30B둜 ν•œκ΅­μ–΄ 풀이 26,254개 생성.

Stage 2: RS 데이터 (On-policy μƒ˜ν”Œλ§)

RS μƒ˜ν”Œλ§

RS 데이터 필터링
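The exact filtering rules are not given here. As a rough sketch of how rejection sampling is commonly filtered, the code below keeps only sampled solutions whose final number matches the reference answer and drops exact duplicates. The function names, the "last number in the text" extraction rule, and the tolerance are all assumptions for illustration, not the author's script.

```python
import re

# Matches integers and decimals, allowing thousands separators such as 1,250.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def filter_correct(question, gold_answer, samples, tol=1e-6):
    """Keep unique sampled solutions whose final number equals the gold answer."""
    gold = float(str(gold_answer).replace(",", ""))
    kept, seen = [], set()
    for solution in samples:
        pred = extract_final_number(solution)
        if pred is None or abs(pred - gold) > tol:
            continue  # reject: no answer found, or the answer is wrong
        if solution in seen:
            continue  # reject: exact duplicate of an accepted sample
        seen.add(solution)
        kept.append({"question": question, "answer": solution})
    return kept
```

The accepted records use the same question/answer keys as the SFT data, so they can feed directly into Stage 2 training.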

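RS-SFT training data composition (key!)

Replay is the key: training on RS data alone makes the model forget the teacher's solution patterns, and performance drops (catastrophic forgetting).

| Replay ratio | GSM8K | Notes |
| --- | --- | --- |
| 0x (RS only) | 46.2% | forgetting |
| 2x | 46.6% | insufficient |
| 3x | 48.5% | good |
| 5x | 48.9% | best |
| max (all) | 47.3% | RS signal diluted |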
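One way to build such a mixture is sketched below. It assumes that "Nx" means N replayed teacher (Stage 1) examples per RS example and that "max" means all teacher examples; the source does not define the ratio, so treat both readings, and the function name `build_rs_sft_mix`, as assumptions.

```python
import random


def build_rs_sft_mix(rs_examples, teacher_examples, replay_ratio=5, seed=0):
    """Mix RS examples with a replayed subset of teacher examples.

    replay_ratio: teacher examples per RS example (assumed meaning of "5x").
    Pass None to replay every teacher example (assumed meaning of "max").
    """
    rng = random.Random(seed)
    teacher_examples = list(teacher_examples)
    if replay_ratio is None:
        replay = teacher_examples
    else:
        # Cap at the pool size so a high ratio never oversamples.
        n_replay = min(len(teacher_examples), int(replay_ratio * len(rs_examples)))
        replay = rng.sample(teacher_examples, n_replay)
    mixed = list(rs_examples) + replay
    rng.shuffle(mixed)  # interleave RS and replay examples
    return mixed
```

With `replay_ratio=0` the function returns the RS examples alone, which corresponds to the "RS only" row of the table.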

RS-SFT training data format

The same question/answer JSON as for SFT. The difference is that each answer is a correct solution generated by the student model (the SFT model) itself.
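A record in this format can be written and read as sketched below. Only the `question` and `answer` keys come from the description above; storing one record per line (JSONL) and the helper names are assumptions.

```python
import json


def to_jsonl_line(question, answer):
    """Serialize one training record; ensure_ascii=False keeps Korean text readable."""
    return json.dumps({"question": question, "answer": answer}, ensure_ascii=False)


def load_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]
```

For RS-SFT data, `answer` would hold a solution sampled from the SFT model that passed the correctness filter, rather than a teacher-written solution.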

Training settings

Stage 1: SFT

Stage 2: RS-SFT

How to reproduce

INFO 03-19 14:53:13 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=3428638) INFO 03-19 14:53:19 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=3428638) INFO 03-19 14:53:19 [utils.py:233] non-default args: {'model_tag': './sft_model', 'model': './sft_model', 'dtype': 'bfloat16', 'max_model_len': 4096, 'gpu_memory_utilization': 0.85}
INFO 03-19 14:53:25 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=3428911) INFO 03-19 14:53:31 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=3428911) INFO 03-19 14:53:31 [utils.py:233] non-default args: {'model_tag': './rs_sft_model', 'model': './rs_sft_model', 'dtype': 'bfloat16', 'max_model_len': 4096, 'gpu_memory_utilization': 0.85}

Failed approaches (for reference)

  • Iterative RS (another round of RS on top of the RS model): always regressed
  • DPO (10 attempts): none had any effect (insufficient capacity in the 1B model)
  • GRPO (2 attempts): within the base model's variance range
  • Other teacher models: large drop due to style mismatch

Files

  • : Stage 1 SFT training
  • : RS sampling script (requires vLLM serving)
  • : Stage 2 RS-SFT training (includes replay)
  • : HRM8K evaluation