Andrew Reed
andrewrreed
AI & ML interests
Applied ML, Practical AI, Inference & Deployment, LLMs, Multi-modal Models, RAG
LLM as a Judge
Curated resources that support the use of LLMs to serve as automatic evaluators of other LLM outputs.
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  Paper • 2310.17631 • Published • 35
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  Paper • 2310.08491 • Published • 57
- Generative Judge for Evaluating Alignment
  Paper • 2310.05470 • Published • 1
- Calibrating LLM-Based Evaluator
  Paper • 2309.13308 • Published • 12
Hallucination Detection
- vectara/hallucination_evaluation_model
  Text Classification • Updated • 78.5k • 348
- notrichardren/HaluEval
  Viewer • Updated • 35k • 91
- TRUE: Re-evaluating Factual Consistency Evaluation
  Paper • 2204.04991 • Published • 1
- Fine-grained Hallucination Detection and Editing for Language Models
  Paper • 2401.06855 • Published • 4
Eval Leaderboards
- Arena Leaderboard
  Running • 4.84k
  View the LMArena language model leaderboard
- Open LLM Leaderboard
  Running on CPU Upgrade • 13.9k
  Track, rank and evaluate open LLMs and chatbots
- MTEB Leaderboard
  Running on CPU Upgrade • 7.25k
  Embedding Leaderboard
- LLM-Perf Leaderboard
  Running • Featured • 586
  Explore LLM performance across hardware configurations
Small, but mighty chat models
AI x Audio
Awesome Spaces
- StableDesign
  Running on Zero • 117
  Generate interior designs from empty room photos
- IllusionDiffusion
  Running on Zero • Featured • 5.4k
  Generate stunning high quality illusion artwork
- InstantMesh
  Runtime error • Featured • 1.57k
  Create a 3D model from an image in 10 seconds!
- Sing an idea ➡️ Music
  Runtime error • Featured • 184
  Bring song ideas to life