I think it seems like a fine possibility in principle, actually; sorry to have given the wrong impression! It's not my central hope, since strategy-stealing seems like it should make many human-augmentations "available" to AI systems as well. This is notably not true for things involving, e.g., BCIs or superbabies.

Reply

Not all capabilities will be created equal: focus on strategically superhuman agents

benwr8d10

When I'm thinking about this, it seems kind of fine if the goalposts move - human strategic capacity will certainly move over time no matter what, right? Like, someone invented crowdfunding and suddenly we could do types of coordination that we previously couldn't do.

Reply

Biological humans collectively exert at most 400 gigabits/s of control over the world.

benwr2mo10

Nate Soares points out that the first paragraph is not quite right: Imagine writing a program that somehow implements an aligned superintelligence, giving it as an objective, "maximize utility according to the person who pressed the 'go' button", and pressing the 'go' button.

There's some sense in which, by virtue of existing in the world, you're already kind of "lucky" by this metric: It can take a finite amount of information to instantiate an agent that takes unbounded actions on your behalf.

Reply

benwr's unpolished thoughts

benwr2mo10

I asked Deep Research to see if there are existing treatments of this basic idea in the literature. It seems most closely related to the concept of "empowerment" in RL, which I'm surprised I hadn't heard of: https://en.m.wikipedia.org/wiki/Empowerment_(artificial_intelligence)

The Wikipedia article makes it seem like this might also be how RL people think about instrumental convergence?

Reply

benwr's unpolished thoughts

benwr2mo*10

Human information throughput is allegedly only about 10-50 bits per second. This implies an interesting upper bound, in that the information throughput of biological humanity as a whole can't be higher than around 50 * 10^10 = 500Gbit/s. I.e., if all distinguishable actions made by humans were perfectly independent, biological humanity as a whole would have at most 500Gbit/s of "steering power".

I need to think more about the idea of "steering power" (e.g. some obvious rough edges around amplifying your steering power using external information processing / decision systems), but I have some intuition that one might actually be able to come up with a not-totally-useless concept that lets us say something like "humanity can't stay in 'meaningful control' if we have an unaligned artificial agent with more steering power than humanity, expressed in bits/s".

Reply

Not all capabilities will be created equal: focus on strategically superhuman agents

benwr2mo10

I think you may have missed, or at least not taken literally, at least one of these things in the post:

The expansion of "superhuman strategic agent" is not "agent that's better than humans at strategic reasoning", it's "agent that is better than the best groups of humans at taking (situated) strategic action"
Strategic action is explicitly context-dependent, e.g. an AI system that's inside a mathematically perfect simulated world that can have no effect on the rest of the physical world and vice versa, has zero strategic power in this sense. Also e.g. in the FAQ, "Capabilities and controls are relevant to existential risks from agentic AI insofar as they provide or limit situated strategic power." So, yes, an agent that lives on your laptop is only strategically superhuman if it has the resources to actually take strategic action rivaling the most strategically capable groups of humans.
"increasingly accurately" is meant to point out that we don't need to understand or limit the capabilities of things that are obviously much strategically worse than us.

Reply

benwr's unpolished thoughts

benwr3mo104

I think it probably makes sense for ~everyone to have an explicit list of "things I'd like AI to do for me", especially around productivity and/or things that could help you with world-saving. If you have a list like this, and we happen to hit a relevant capability threshold before we lose, you should probably avoid wasting time on that thing as quickly as possible.

Reply

Bounty for Evidence on Some of Palisade Research's Beliefs

benwr7mo20

Thanks everyone for thoughts so far! I do want to emphasize that we're actually highly interested in collecting even the most "obvious" evidence in favor of or against these ideas. In fact, in many ways we're more interested in the obvious evidence than in reframes or conceptual problems in the ideas here; of course we want to be updating our beliefs, but we also want to get a better understanding of the existing state of concrete evidence on these questions. This is partly because we consider it part of our mission to expand the amount and quality of relevant evidence on these beliefs, and are trying to ensure that we're aware of existing work.

Reply

benwr's unpolished thoughts

benwr10mo50

Surprisingly to me, Claude 3.5 Sonnet is much more consistent in its answer! It is still not perfect, but it usually says the same thing (9/10 times it gave the same answer).

Reply

1

benwr's unpolished thoughts

benwr10mo167

From the "obvious-but-maybe-worth-mentioning" file:

ChatGPT (4 and 4o at least) cheats at 20 questions:

If you ask it "Let's play a game of 20 questions. You think of something, and I ask up to 20 questions to figure out what it is.", it will typically claim to "have something in mind", and then appear to play the game with you.

But it doesn't store hidden state between messages, so when it claims to "have something in mind", either that's false, or at least it has no way of following the rule that it's thinking of a consistent thing throughout the game. i.e. its only options are to cheat or refuse to play.

You can verify this by responding "Actually, I don't have time to play the whole game right now. Can you just tell me what it was you were thinking of?", and then "refreshing" its answer. When I did this 10 times, I got 9 different answers and only one repeat.

Reply

5