I'm currently researching forecasting and epistemics as part of the Quantified Uncertainty Research Institute.
I was pretty surprised to see so little engagement / net-0 upvotes (besides mine). Any feedback is appreciated, I'm curious how to do better going forward.
I spent a while on this and think there's a fair bit here that would be useful to some community members, though perhaps the jargon or some key aspects didn't land well.
Yea
I'd flag that I think it's very possible TSMC will be badly hurt or destroyed if China takes control. There's been a bit of discussion of this.
I'd suspect China might fix this after some years, but would expect it would be tough for a while.
https://news.ycombinator.com/item?id=40426843
Quick point - a "benefit corporation" seems almost identical to a "corporation" to me, from what I understand. I think many people assume it's a much bigger deal than it actually is.
My impression is that, practically speaking, this just gives the execs more power to do whatever they feel they can sort of justify, without shareholders having legal grounds to stop them. I'm not sure if this is a good thing in the case of OpenAI. (Would we prefer that Sam A / the board have more power, or that the shareholders have more power?)
I think B-Corps make it harder for them to get sued for not optimizing for shareholders. Hypothetically, it makes it easier for them to be sued for not optimizing their other goals, but I'm not sure if this ever/frequently actually happens.
Really wishing the new Agent Foundations team the best. (MIRI too, but its position seems more secure)
Naively, I feel pretty good about this potential split. If MIRI is doing much more advocacy work, that work just seems very different from Agent Foundations research.
This could allow MIRI to be more controversial and risk-taking without tying things to the Agent Foundations research, and that research could hypothetically more easily get funding from groups that otherwise disagree with MIRI's political views.
I hope that team finds good operations support or a different nonprofit sponsor of some kind.
Thinking about this more, it seems like there are some key background assumptions that I'm missing.
Some assumptions I often hear presented on this topic are things like:
1. "A misaligned AI will explicitly try to give us hard-to-find vulnerabilities, so verifying arbitrary statements from these AIs will be incredibly hard."
2. "We need to generally have incredibly high assurances to build powerful systems that don't kill us".
My obvious counter-arguments would be:
1. Sure, but smart agents would have a reasonable prior that other agents might be misaligned, and they would give those agents tasks that are particularly easy to verify. Any action a smart overseer actually takes, using information provided by an agent with a known probability M of being misaligned, should be EV-positive. With some creativity, there are likely many ways of structuring things (using systems unlikely to be misaligned, using more verifiable questions) so that many resulting actions are heavily EV-positive.
2. Again, my argument in (1). Second, we can build these systems gradually, and with a lot of help from people/AIs that won't require such high assurances. (This is similar to the HCH / oversight arguments.)
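To make the EV claim in (1) concrete, here's a toy calculation. All the probabilities and payoffs below are made-up illustrative numbers, not estimates from the argument above; the point is just that easy-to-verify tasks cap the downside, so acting can stay EV-positive even with a substantial misalignment probability M.

```python
def ev_of_acting(p_misaligned: float, gain_if_honest: float,
                 loss_if_misaligned: float) -> float:
    """EV of an overseer acting on an agent's output, given a known
    probability M that the agent is misaligned."""
    return (1 - p_misaligned) * gain_if_honest - p_misaligned * loss_if_misaligned

# Easy-to-verify task: verification bounds the downside, so even with
# M = 0.25 the action is clearly worth taking.
easy = ev_of_acting(p_misaligned=0.25, gain_if_honest=8.0, loss_if_misaligned=4.0)
print(easy)  # 5.0

# Hard-to-verify task: same agent, but a large unverified downside
# flips the decision.
hard = ev_of_acting(p_misaligned=0.25, gain_if_honest=8.0, loss_if_misaligned=40.0)
print(hard)  # -4.0
```

Structuring the setup (more verifiable questions, more trusted subsystems) amounts to pushing `loss_if_misaligned` down until acting is positive.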
First, I want to flag that I really appreciate how you're making these deltas clear and (fairly) simple.
I like this, though I feel like there's probably a great deal more clarity/precision to be had here (as is often the case).
> Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better.
I'm not sure what "bad" means exactly. Do you basically mean, "if I were to spend resources R evaluating this object, I could identify some ways for it to be significantly improved?" If so, I assume we'd all agree that this is true for some amount R; the key question is how large that amount is.
I also would flag that you draw attention to the issue with air conditioners. But for the issue of personal items, I'd argue that when I learn more about popular items, most of what I learn are positive things I didn't realize. Like with Chesterton's fence - when I get many well-reviewed or popular items, my impression is generally that there were many clever ideas or truths behind those items that I don't at all have time to understand, let alone invent myself. A related example is cultural knowledge - a la The Secret of Our Success.
When I work on software problems, my first few attempts don't go well, for reasons I didn't predict. The very fact that "it works in tests, and it didn't require doing anything crazy" is a significant update.
Sure, with enough resources R, one could very likely make significant improvements to any item in question - but as a purchaser, I only have resources r << R to make my decisions. My goal is to buy items to make my life better, it's fine that there are potential other gains to be had by huge R values.
> “verification is easier than generation”
I feel like this isn't very well formalized. I think I agree with this comment on that post. I feel like you're saying, "It's easier to generate a simple thing than to verify all possible things", but Paul and co are saying something more like, "It's easier to verify/evaluate a thing of complexity C than to generate a thing of complexity C, in many important conditions", or, "There are ways of delegating many tasks where the evaluation work required would be less than that of doing the work yourself, in order to get a result of a certain level of quality."
I think that Paul's take (as I understand it) seems like a fundamental aspect about the working human world. Humans generally get huge returns from not inventing the wheel all the time, and deferring to others a great deal. This is much of what makes civilization possible. It's not perfect, but it's much better than what individual humans could do by themselves.
> Under my current models, I expect that, shortly after AIs are able to autonomously develop, analyze and code numerical algorithms better than humans, there’s going to be some pretty big (like, multiple OOMs) progress in AI algorithmic efficiency (even ignoring a likely shift in ML/AI paradigm once AIs start doing the AI research)
I appreciate the precise prediction, but don't see how it exactly follows. This seems more like a question of "how much better will early AIs be compared to current humans", than one deeply about verification/generation. Also, I'd flag that in many worlds, I'd expect that pre-AGI AIs could do a lot of this code improvement - or they already have - so it's not clear exactly how big a leap the "autonomously" is doing here.
---
I feel like there are probably several wins to be had by better formalizing these concepts. They seem fairly cruxy/high-delta in the debates on this topic.
I would naively approach some of this with some simple expected value/accuracy lens. There are many assistants (including AIs) that I'd expect would improve the expected accuracy on key decisions, like knowing which AI systems to trust. In theory, it's possible to show a bunch of situations where delegation would be EV-positive.
That said, a separate observer could of course claim that someone using the process above would be so wrong as to be committing self-harm. Like, "I think that when you try to use delegation, your estimates of impact are predictably wrong in ways that would lead to you losing." But this seems mainly like a question about "are humans going to be predictably overconfident in a certain domain, as judged by other specific humans".
My hunch is that this is arguably bad insofar as it helps OpenAI / SOTA LLMs, but otherwise a positive thing?
I think we want to see people start deploying weak AIs on massive scales, for many different sorts of tasks. The sooner we do this, the sooner we get a better idea of what real problems will emerge, and the sooner engineers will work on figuring out ways of fixing those problems.
On-device AIs generally seem safer than server LLMs, mainly because they're far less powerful. I think we want a world where we can really maximize the value we get from small, secure AI.
If this does explode in a thousand ways, I assume it would be shut down soon enough. I assume Apple will roll out some of these features gradually and carefully. I'll predict that damages caused by AI failures with this won't be catastrophic. (Let's say, < ~$30B in value, over 2 years).
Out of the big tech companies (FAANG), I think I might trust Apple the most to do a good job on this.
And, while the deal does bring attention to ChatGPT, it comes across to me as a temporary and limited thing, rather than a deep integration. I wouldn't expect this to dramatically boost OpenAI's market cap. The future of Apple / server LLM integration still seems very unclear.
Thanks for that explanation.
E.g., if the hypothesis is true, I can imagine that "do a lot of RLHF, and then ramp up the AI's intelligence" could just work. Similarly for "just train the AI to not be deceptive".
Thanks, this makes sense to me.
Yea, I guess I'm unsure about that '[Inference step missing here.]'. My guess is that such a system would be able to recognize situations where things that score highly with respect to its ontology would score lowly, or would be likely to score lowly, using a human ontology. Like, it would be able to simulate a human deliberating on this for a very long time and coming to some conclusion.
I imagine that the cases where this would be scary are some narrow ones (though perhaps likely ones) where the system is both dramatically intelligent in specific ways, but incredibly inept in others. This ineptness isn't severe enough to stop it from taking over the world, but it is enough to stop it from being at all able to maximize goals - and it also doesn't take basic risk measures like "just keep a bunch of humans around and chat to them a whole lot, when curious", or "try to first make a better AI that doesn't have these failures, before doing huge unilateralist actions" for some reason.
It's very hard for me to imagine such an agent, but that doesn't mean it's not possible, or perhaps likely.
Seeking feedback on this AI Safety proposal:
(I don't have experience in AI experimentation)
I'm interested in the question, "How can we use smart AIs to help humans at strategic reasoning?"
We don't want the solution to be, "AIs just tell humans exactly what to do without explaining themselves." We'd prefer situations where smart AIs can explain to humans how to think about strategy, and this information makes humans much better at doing strategy.
One proposal to make progress on this is to set a benchmark for having smart AIs help out dumb AIs by providing them with strategic information.
Or more specifically, we find methods of having GPT4 give human-understandable prompts to GPT2, that would allow GPT2 to do as well as possible on specific games like chess.
A solution would likely involve search processes where GPT4 experiments with a large space of potential English prompts, and tests them over the space of potential chess moves. I assume that reinforcement learning could be done here, but perhaps some LLM-heavy mechanism could work better. I'd assume that good strategies would be things like, "In cluster of situations X, you need to focus on optimizing Y." So the "smart agent" would need to be able to make clusters of different situations, and solve for a narrow prompt for many of them.
It's possible that the best "strategies" would be things like long decision-trees. One of the key things to learn about is what sorts/representations of information wind up being the densest and most useful.
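As a rough sketch of that search loop - where the candidate prompts, the hard-coded win rates, and all function names are hypothetical placeholders (a real version would condition GPT2 on each prompt and measure its actual performance over many games):

```python
# Hypothetical sketch: a strong model proposes human-readable strategy
# prompts; each is scored by how much it helps a weak model's play.
# The prompts and win rates below are invented placeholder values.

CANDIDATE_PROMPTS = [
    "Castle early to keep your king safe.",
    "When ahead in material, trade pieces, not pawns.",
    "In the opening, develop pieces before attacking.",
]

def weak_model_win_rate(prompt: str) -> float:
    """Stand-in for: condition the weak model (e.g. GPT2) on `prompt`,
    play many chess games, and return the measured win rate."""
    simulated = {
        "Castle early to keep your king safe.": 0.31,
        "When ahead in material, trade pieces, not pawns.": 0.38,
        "In the opening, develop pieces before attacking.": 0.35,
    }
    return simulated[prompt]

def best_prompt(candidates: list[str]) -> str:
    """One step of the proposed search: keep the advice that most helps
    the weak player. A real version would iterate - mutating prompts,
    clustering game situations, and fitting a prompt to each cluster."""
    return max(candidates, key=weak_model_win_rate)

print(best_prompt(CANDIDATE_PROMPTS))
# → When ahead in material, trade pieces, not pawns.
```

The interesting engineering would be in the proposal and scoring steps stubbed out here; the outer loop itself is just search over prompts against an empirical objective.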
Zooming out, if we had AIs that we knew could give AIs and humans strong and robust strategic advice in test cases, I imagine we could use some of this for real-life cases - perhaps most importantly, to strategize about AI safety.