lemonhope — LessWrong

Many can write faster asm than the compiler, yet don't. Why?

I hope that when episodic memory is cracked (ie cheap), it helps with this all-or-nothing problem. That would be really great.

Any evidence or reason to expect a multiverse / Everett branches?

lemonhope3d20

Could one do Efficient Pilot Wave? Mathematically equivalent but you hardly need to compute any of it?

lemonhope's Shortform

lemonhope8d20

I have been thinking in terms of this fwdperbkwd model of measuring AI agent progress for a few months. This is the first time I actually spelled it out, I just realized. https://www.lesswrong.com/posts/2RwDgMXo6nh42egoC/how-to-game-the-metr-plot?commentId=6fXFqMFxJKdtcZzbs

How to game the METR plot

lemonhope8d40

To clarify my point, I'm saying

A. Standalone/unattended (!) AI agents are clearly much worse than nothing, for both writing code (with a goal chosen in advance) and hacking (a prechosen target), they will clearly only mess up your plans.

If you want to build whatever or hack whoever, and you don't care what, that's a different story.

B. If your AI can make productive incremental progress, without messing up old code, writing false notes, citing misleading evidence, breaking infra, alerting the target, falsifying or ruining the tests, fooling the judge, etc, etc, then you can scale that to do basically anything. If you can lay a brick then you can make a cathedral.

If your whole fancy best-of-n-with-rollbacks system is able to lay down bricks, that also counts as OK.

C. I'm betting against the amazing-do-it-all-at-once-from-scratch AI being either a serious risk vector or driver of economic progress.

At least, if you do take over the world on Tuesday morning, it only counts as a victory if you don't also accidentally give it all back on Tuesday afternoon.

D. I do think these task executions measure something meaningful, and are valuable to conduct, and we should try to extrapolate, but "task horizon length" is definitely not the thing to focus on here. Maybe something like the "zero fatal mistake" rate and the "net-forward-progress geometric mean" or something.

The world just does not look like "task horizons". The world is made up of steps forward and steps backwards. A scientist or a CEO or an engineer or a horse or a door is measured by how many steps forward it brings you, per step backwards it takes you (which you weren't able to spot and avoid obviously).

Fwdperbkwd is harder to measure than task horizons, maybe twice as hard, but not four times harder.

The strongest way I could put the punchline here is that I think AI agents will be useless until their code helps more than it hurts, then AI agents will replace coders completely. And you can only measure this by comparing agentless vs agentful approaches to the same task maybe (hey! METR did that too!)

And more ai-live-assistant is-it-worse-than-nothing-or-not experiments would be highly informative in my opinion!

How to game the METR plot

lemonhope11d5-10

Horizon length makes no sense fundamentally. The 9s of reliability don't matter for anything but driving a car. Software and hacking and science and engineering are not a series of sequential steps. It is more like adding stuff to your house. If you can make a tiny improvement then you are net-helpful and you can build a cathedral. If you make improvements more often than you cause problems, anything is possible. If you break stuff more often than you fix it, then nothing is possible. It is a binary question. If you make big improvements, then small mistakes don't matter. If you mess up stuff bigtime and can't fix it, then big improvements are useless.

In defence of the human agency: "Curing Cancer" is the new "Think of the Children"