AI Benchmarking
Posts tagged AI Benchmarking (most relevant first)
2
24
Broken Benchmark: MMLU
awg
1y
5
1
33
Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Arjun Panickssery, agg
10mo
0
1
24
Improving Model-Written Evals for AI Safety Benchmarking
Ω
Sunishchal Dev, Marius Hobbhahn
1mo
0
1
20
Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents
Ω
Sam F. Brown, BasilLabib, Codruta (Coco) Lugoj, Sai Sasank Y
4mo
0
1
18
MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures
Ω
corey morris
1y
2
1
3
LLM Psychometrics: A Speculative Approach to AI Safety
pskl
10mo
4