Reproducibility
#12
by RoflanVglorius - opened
Hi! I'm trying to reproduce your results on IMO-AnswerBench and consistently getting higher accuracy (69–73%) than what you report in the paper.
I'm using your vLLM fork on H200 GPUs. Is it possible the numbers in the paper correspond to IMO-ProofBench rather than IMO-AnswerBench? Or is there something else I might be missing in my setup?