TRUEBench: Explore and compare language model performance across categories and languages