But you really aren't assuming that; you're doing something much stranger.
Either the actual opponent is a rock, in which case it gains nothing from "winning" the game (and there's no such thing as being more or less rational than something without preferences), or the actual opponent is the agent who wrote the number on the rock and put it in front of the LDT agent, in which case the example fails because the game actually started with an agent explicitly trying to manipulate the LDT agent into underperforming.
This post is a stronger argument against Drexlerian nanomachines that outperform biology in general, one which doesn't rely on the straw man.
As I commented on your other post, there are two possibilities: either the universe is Murphy-like and pessimizing your outcome, in which case sure, you might be in a worst-case universe and there is a bound on how well you can do, or some agent sent the rock, in which case you are playing the game against that agent and would know that fact. (Or, as I mentioned, an ultrapowerful system can create situations where you lose because it can fool you, lying to you perfectly, but that is equivalent to a worst-case universe.)
As I commented on another post, it seems Eliezer already addressed the specific claim you made here via probabilistic LDT solutions, as Mikhail explained there and in a comment here. (And the quoted solution was written before you wrote this post.)
Is there a version of the problem that the modification explained there fails to address?
I think your post misses the point made here.
What about a rock with $9 painted on it? The LDT agent in the problem reasons that the best action is to take the $1, so the rock gets $9.
Thus, the $9 rock is more rational than LDT in this problem.
The solution above addresses this: by playing probabilistically against the $9 demand, the LDT agent leaves the rock with a payoff somewhat less than $5 in expectation, so the rock does worse than an LDT agent would.
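For concreteness, here is a minimal sketch in Python of the kind of probabilistic response strategy being referenced, assuming the standard $10 ultimatum framing; the constants, epsilon, and function names are my own illustrative choices, not anything from the original posts. The idea is simply that unfair demands are accepted just rarely enough that the greedy proposer's expected payoff falls below the fair split.

```python
import random

TOTAL = 10          # total amount being split
FAIR_SHARE = 5      # payoff each side gets under a fair split
EPSILON = 0.001     # small margin so greedy demands strictly underperform

def acceptance_probability(demand: float) -> float:
    """Probability with which the responder accepts a given demand.

    Fair (or generous) demands are always accepted. Greedy demands are
    accepted just rarely enough that the proposer's expected payoff,
    demand * p, falls slightly below the fair share, so greed never pays.
    """
    if demand <= FAIR_SHARE:
        return 1.0
    return (FAIR_SHARE - EPSILON) / demand

def play_round(demand: float) -> tuple[float, float]:
    """Return (proposer_payoff, responder_payoff) for one round."""
    if random.random() < acceptance_probability(demand):
        return demand, TOTAL - demand
    return 0.0, 0.0

# The $9 rock's expected payoff is 9 * (5 - 0.001) / 9 = 4.999 < 5,
# so painting "$9" on a rock does worse against this strategy than
# simply demanding the fair $5.
rock_expected = 9 * acceptance_probability(9)
print(f"$9 rock expected payoff: {rock_expected:.3f}")
```

Under this sketch, any demand above $5 yields strictly less than $5 in expectation, which removes the incentive to send a greedy rock in the first place.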
I'm a bit confused about how this is a problem.
Either there is an agent that stands to benefit from my acceding to a threat, or there is not. If an agent "sufficiently" turns itself into a rock for a single interaction, but reaps the benefit as an agent, it's a full-fledged agent. Same if it sends a minion: the relevant agent is the one who sent the rock, not the rock. And if we have uncertainty about the situation, that's part of the game.
If the question is whether other players can deceive you about the nature of the game or the probabilities, sure, that is a possibility, but it is not really a question about LDT. It's a question about whether we should expand every decision into a recursive web of uncertainties about all other possible agents. I suspect the conclusion would be that smarter agents can likely fool you, and that you shouldn't allow others with misaligned incentives to manipulate your information environment, especially if they have more optimization power than you do. But as we all should know, once we make misaligned superintelligent systems, we stop being meaningful players anyway.
In this world, maybe you want to suppose the agent's terminal value is to cause me to pay some fixed cost, and that it permanently disables itself to that end. That, however, makes it either a minion sent by something else or a natural feature of a Murphy-like universe where you started out screwed, in which case you should treat the natural environment as an adversary. But that's not our situation, again, at least until ASI shows up.
cc: @Mikhail Samin - does that seem right to you?
Eight hours of clock time for an expert seems likely to be enough to do anything humans can do; people rarely work productively in chunks longer than that, and as long as we assume models are capable of task breakdown and planning (which seems like a non-trivial issue, but an easier one than the scaling itself), that should allow them to parallelize and serialize chunks to do larger human-type tasks.
But it's unclear whether alignment can be solved by humans at all, and even if it can, there is of course no reason to think these capabilities would scale as well for alignment as for capabilities and self-improvement, let alone better, so this is not at all reassuring to me.
If there are models that much better than SOTA models, would they be posting to LW? It seems unlikely, but if so, and they generate good enough content, that seems mostly fine, albeit deeply concerning on the secretly-more-capable-models front.