Thomas Broadley

tbroadley

AI & ML interests

None yet

Recent Activity

updated a dataset about 13 hours ago

metr-evals/apps-with-input-validation

new activity about 13 hours ago

metr-evals/apps-with-input-validation:Unify verification scripts into verify.py, add AGENTS.md

new activity about 13 hours ago

metr-evals/apps-with-input-validation:Fix trailing empty strings in 94 train strs samples

updated a dataset about 13 hours ago

metr-evals/apps-with-input-validation

Preview • Updated about 2 hours ago • 208

New activity in metr-evals/apps-with-input-validation about 13 hours ago

Unify verification scripts into verify.py, add AGENTS.md

#12 opened about 13 hours ago by

Fix trailing empty strings in 94 train strs samples

#11 opened about 13 hours ago by

Fix trailing spaces in expected test outputs (89 samples across train and test)

#10 opened about 14 hours ago by

New activity in metr-evals/apps-with-input-validation about 14 hours ago

Fix trailing spaces in expected test outputs (95 samples across train and test)

#8 opened about 15 hours ago by

Re-apply fixes from PR #2 and PR #3 that were accidentally reverted by PR #5

#9 opened about 15 hours ago by

New activity in metr-evals/apps-with-input-validation about 15 hours ago

Fix trailing spaces in test output for sample 3341

#7 opened about 15 hours ago by

New activity in metr-evals/apps-with-input-validation 4 days ago

Remove counterexample sample (id=737) from train split

#6 opened 4 days ago by

New activity in metr-evals/apps-with-input-validation 28 days ago

Fix expected outputs to match golden solutions

#5 opened 28 days ago by

Fix expected outputs to match golden solutions

#4 opened 28 days ago by

New activity in metr-evals/apps-with-input-validation about 1 month ago

Remove 615 problems with ambiguous outputs

#3 opened about 1 month ago by

New activity in metr-evals/apps-with-input-validation about 2 months ago

Fix test set: whitespace issues, dependency errors, and Python 3.11+ compatibility

#2 opened about 2 months ago by

Fix whitespace discrepancies in expected test outputs

#1 opened about 2 months ago by

upvoted a paper 3 months ago

Measuring AI Ability to Complete Long Tasks

Paper • 2503.14499 • Published Mar 18, 2025 • 16

authored a paper 12 months ago

Measuring AI Ability to Complete Long Tasks

Paper • 2503.14499 • Published Mar 18, 2025 • 16