Reproducibility

#12
by RoflanVglorius - opened

Hi! I'm trying to reproduce your results on IMO-AnswerBench and am consistently getting higher accuracy (69–73%) than the paper reports.
I'm using your vLLM fork on H200 GPUs. Is it possible the numbers in the paper correspond to IMO-ProofBench rather than IMO-AnswerBench? Or is there something else I might be missing in my setup?
