igormolybog's Collection: evals
Holistic Evaluation of Text-To-Image Models (arXiv:2311.04287)

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (arXiv:2311.07463)

Trusted Source Alignment in Large Language Models (arXiv:2311.06697)

DiLoCo: Distributed Low-Communication Training of Language Models (arXiv:2311.08105)

Instruction-Following Evaluation for Large Language Models (arXiv:2311.07911)

GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv:2311.12022)

GAIA: a benchmark for General AI Assistants (arXiv:2311.12983)

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models (arXiv:2312.04724)

Evaluation of Large Language Models for Decision Making in Autonomous Driving (arXiv:2312.06351)

PromptBench: A Unified Library for Evaluation of Large Language Models (arXiv:2312.07910)

TrustLLM: Trustworthiness in Large Language Models (arXiv:2401.05561)

OLMo: Accelerating the Science of Language Models (arXiv:2402.00838)

Can Large Language Models Understand Context? (arXiv:2402.00858)

Design2Code: How Far Are We From Automating Front-End Engineering? (arXiv:2403.03163)

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (arXiv:2403.14624)

Long-context LLMs Struggle with Long In-context Learning (arXiv:2404.02060)

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation (arXiv:2410.05363)

LongGenBench: Long-context Generation Benchmark (arXiv:2410.04199)

GLEE: A Unified Framework and Benchmark for Language-based Economic Environments (arXiv:2410.05254)