Went here to post something similar. My quick guess is that such a march would get a better combination of attendance and direction by expanding the scope a fair bit, but I realize that the creators have certain preferences here.
Update:
I've recently been experimenting with a small llm-generated wiki, mostly with Claude Code.
I was curious about the above, so I had it generate pages that cover Sam Altman's history and his prediction track record in much more depth.
https://www.longtermwiki.com/knowledge-base/people/sam-altman/
https://www.longtermwiki.com/knowledge-base/people/track-records/sam-altman-predictions/
I don't think there's anything incredibly novel there, but I think it's a decent overview.
I think the main Sam Altman page is more interesting than the prediction page, because his actions were a lot more interesting/informative than his predictions.
Thanks for letting me know!
I'm not very attached to the name. Some people seem to like it a lot, some dislike it.
I don't feel very confident about how it will mostly be used 4-12 months from now, so my plan at this point is to wait and see how it gets used, then consider renaming later as the situation gets clearer.
Thanks!
I tried adding a "Clarity Coach" earlier. I think this is the sort of area where RoastMyPost probably wouldn't have a massive advantage over using custom LLM prompts directly, but we might be able to make some things easier.
It would be very doable to add custom evaluators to give tips along these lines. Doing a great job would likely involve a fair bit of prompting, evaluation, and iteration. I might take a stab at it, and if so I'll get back to you on this.
(One plus is that I'd assume this would be parallelizable, so at least it could be fast.)
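To illustrate the parallelization point, here's a minimal Python sketch of running several custom evaluator prompts concurrently over the same document. The evaluator names and the stubbed `run_evaluator` call are made up for illustration, not RoastMyPost's actual internals.

```python
import asyncio

# Hypothetical evaluator prompts, just for illustration.
EVALUATORS = {
    "clarity-coach": "Point out sentences that are hard to parse and suggest rewrites.",
    "jargon-check": "Flag unexplained jargon and acronyms.",
}

async def run_evaluator(name: str, prompt: str, document: str) -> tuple[str, str]:
    # Stand-in for one LLM call; swap in a real async client here.
    await asyncio.sleep(0.1)  # simulates network latency
    return name, f"[{name}] feedback on a {len(document.split())}-word document"

async def run_all(document: str) -> dict[str, str]:
    # The evaluators are independent, so they can run concurrently;
    # total latency is roughly that of the slowest single evaluator.
    results = await asyncio.gather(
        *(run_evaluator(name, prompt, document) for name, prompt in EVALUATORS.items())
    )
    return dict(results)

print(asyncio.run(run_all("Some draft post text...")))
```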
Thanks for reporting your findings!
As I stated here, the Fact Checker has a bunch of false positives, and you've noted some.
The Fact Checker (and the other checkers too) has trouble telling which claims are genuine and which are part of fictional scenarios, a la AI-2027.
The Fallacy Checker is overzealous and doesn't use web search (which would add costs), so it will particularly make mistakes on claims about events after the models' training cutoff.
There's clearly more work to do to make better evals. Right now I recommend using this as a way to flag potential errors, and feel free to add any specific evaluator AIs that you think would be a fit for certain documents.
I experimented with Opus 4.5 a bit for the Fallacy Check. Results did seem a bit better, but costs were much higher.
I think the main way I could picture putting more money into this is to add some agentic setup that does a deep review of a given paper and presents a summary. I could see the marginal costs of this being maybe $10 to $50 per 5k words or so, using a top model like Opus. That said, the fixed costs of doing a decent job seem frustrating, especially because we're still lacking easy API access to existing agents (my preferred method would be a high-level Claude Code API, but that doesn't really exist yet).
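For what it's worth, here's the kind of back-of-the-envelope arithmetic behind that marginal-cost guess. The token counts and per-token prices below are placeholder assumptions, not quoted rates.

```python
# Back-of-the-envelope estimate for a deep agentic review of a ~5k-word paper.
# All numbers are placeholder assumptions, not quoted prices.

TOTAL_INPUT_TOKENS = 1_000_000   # context gets re-read/re-sent across many agent turns and tool calls
TOTAL_OUTPUT_TOKENS = 50_000     # notes, intermediate drafts, final summary

PRICE_IN_PER_MTOK = 15.0         # assumed $ per million input tokens for a top model
PRICE_OUT_PER_MTOK = 75.0        # assumed $ per million output tokens

cost = (TOTAL_INPUT_TOKENS * PRICE_IN_PER_MTOK
        + TOTAL_OUTPUT_TOKENS * PRICE_OUT_PER_MTOK) / 1_000_000
print(f"~${cost:.2f} per review")  # ~$19 here; plausible variation in these inputs
                                   # covers roughly the $10-$50 range
```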
I've been thinking of having competitions here, for people to make their own reviews, which we could then compare against a few researchers and LLMs. I think this area could allow for a lot of cleverness and innovation.
I also added a table back to this post that gives a better summary.
There's more going on, though that doesn't mean it will necessarily be better for you than your own system.
There are some "custom" evaluators that are basically just what you're describing, but with specific prompts. Even in these cases, though, note that there's extra functionality for users to re-run evaluators, see the histories of runs, and see a specific agent's evals across many different documents.
The "system" evaluators typically have more specialized code. They have short readmes, and you can see more on their pages:
https://www.roastmypost.org/evaluators/system-fact-checker
https://www.roastmypost.org/evaluators/system-fallacy-check
https://www.roastmypost.org/evaluators/system-forecast-checker
https://www.roastmypost.org/evaluators/system-link-verifier
https://www.roastmypost.org/evaluators/system-math-checker
https://www.roastmypost.org/evaluators/system-spelling-grammar
Some of these just split a post into chunks and then run analysis on each chunk. Others work differently; the link verifier, for example, works without any AI at all.
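As an illustration of the no-AI case, here's a minimal sketch of what a link verifier can do (extract URLs and test whether they resolve). This is just an illustrative script, not the actual system-link-verifier implementation.

```python
import re
import urllib.request

URL_PATTERN = re.compile(r"https?://[^\s)\"'>]+")

def check_links(text: str, timeout: float = 10.0) -> dict[str, str]:
    """Extract URLs from a document and report which ones fail to resolve.
    A minimal sketch, not RoastMyPost's actual implementation."""
    results = {}
    for url in set(URL_PATTERN.findall(text)):
        try:
            # HEAD keeps the request cheap; some servers reject it, which shows up as BROKEN.
            req = urllib.request.Request(url, method="HEAD",
                                         headers={"User-Agent": "link-checker"})
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[url] = f"OK ({resp.status})"
        except Exception as exc:  # 404s, timeouts, DNS failures, etc.
            results[url] = f"BROKEN ({exc})"
    return results

print(check_links("See https://www.roastmypost.org/evaluators/system-link-verifier for details."))
```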
One limitation of these systems is that they're not very customizable. So if you're making something fairly specific to your system, this might be tricky.
My quick recommendation is to run all the system evaluators on at least some docs so you can get a feel for them (or just look at their outputs on other docs).
Agreed!
The workflow we have does include a step for this. The specific workflow:
1. Chunks the document.
2. Runs analysis on each chunk, producing a long combined list of comments.
3. Feeds all of those comments into a final step that sees the full post, removes a bunch of the comments, and writes a summary.
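Here's a minimal sketch of that shape of workflow, with a stubbed `llm()` call and illustrative prompts (not the actual RoastMyPost code):

```python
# Sketch of the chunk -> per-chunk analysis -> filter/summarize workflow.
# llm() is a stub; swap in a real model call. Prompts and chunk sizes are illustrative.

def llm(prompt: str) -> str:
    return "placeholder model output"  # stand-in for a real LLM API call

def chunk(document: str, max_words: int = 500) -> list[str]:
    # Step 1: split the document into word-count-bounded chunks.
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def analyze_chunk(chunk_text: str) -> list[str]:
    # Step 2: each chunk is analyzed independently, producing candidate comments.
    raw = llm(f"List factual or reasoning issues in this passage:\n\n{chunk_text}")
    return [line.strip() for line in raw.splitlines() if line.strip()]

def finalize(document: str, comments: list[str]) -> str:
    # Step 3: a final pass sees the full post plus all candidate comments,
    # drops weak or duplicate ones, and writes a summary.
    return llm(
        "Here is a full post and a list of candidate comments on it.\n"
        "Discard comments that are weak, duplicated, or wrong in context, "
        "then write a short overall summary.\n\n"
        f"POST:\n{document}\n\nCOMMENTS:\n" + "\n".join(comments)
    )

def review(document: str) -> str:
    comments = [c for ch in chunk(document) for c in analyze_chunk(ch)]
    return finalize(document, comments)

print(review("Some long post text ..."))
```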
I think it could use a lot more work and tuning. Generally, I've found these workflows fairly tricky and time-intensive to work on so far. I assume they will get easier in the next year or so.
I appreciate that you're engaging with this question, but I have a hard time taking much away from this.
Quickly:
1. I think we're both probably skeptical of the approach Joshua Achiam is taking. My hunch is that he's representing a particularly niche view on things.
2. It feels to me like there's a gigantic challenge in choosing at what scale to approach this problem. You seem to be looking at AGI and its effects from a very high level. Like, we'd need to run evals on massive agents exploring the real world, and if that's too complex for more regular robust engineering, then we'd better throw out the idea of robust engineering. To me this seems a bit like complaining that "robust engineering won't work for vehicle safety because it won't be able to tackle the complex society-wide effects of how vehicles will be used at scale."
I think the case for using more robust methods for AI doesn't look like straightforwardly solving all global-scale, large-agent challenges, but instead like achieving high reliability on much narrower tasks. For example, can we make sure that a single LLM pass has a 99.99% success rate at certain kinds of hacks?
This wouldn't solve all the challenges that would come from complex large systems, similar to how vehicle engineering doesn't solve all challenges with vehicles. But it might still be a big part of the solution.
You might then argue that a complex system composed of robust parts wouldn't benefit much from that robustness, because all the 'interesting' stuff happens at the larger scales. My quick guess is that this is one area where we would disagree; perhaps there's a crux here.
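For a rough sense of what verifying something like that 99.99% figure would involve (assuming independent trials, which is a strong simplification), the "rule of three" gives a quick estimate of the evaluation burden:

```python
# Rule of three: if n independent trials all succeed, a ~95% upper confidence
# bound on the failure rate is roughly 3/n. Independence is a strong assumption here.

target_failure_rate = 1e-4   # i.e. a 99.99% per-pass success rate
trials_needed = 3 / target_failure_rate
print(f"~{trials_needed:,.0f} consecutive clean trials")  # ~30,000
# That's a lot, but it's feasible for automated evals on a narrow task,
# in a way it wouldn't be for society-scale deployment effects.
```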