Reproducibility

#12

by RoflanVglorius - opened about 17 hours ago

Hi! I'm trying to reproduce your results on IMO-AnswerBench and consistently getting higher accuracy (69–73%) than what you report in the paper.
I'm using your vLLM fork on H200 GPUs. Is it possible the numbers in the paper correspond to IMO-ProofBench rather than IMO-AnswerBench? Or is there something else I might be missing in my setup?
