I think I've found a new argument, which I'll call X, against Paul Christiano's "indirect normativity" approach to FAI goals. I just discussed X with Paul, who agreed that it's serious.

This post won't describe X in detail because it's based on basilisks, which are a forbidden topic on LW, and I respect Eliezer's requests despite sometimes disagreeing with them. If you understand Paul's idea and understand basilisks, figuring out X should take you about five minutes (there's only one obvious way to combine the two ideas), so you might as well do it now. If you decide to discuss X here, please try to follow the spirit of LW policy.

In conclusion, I'd like to ask Eliezer to rethink his position on secrecy. If more LWers understood basilisks, somebody might have come up with X earlier.


I don't believe that Paul's approach to indirect normativity is on the right track. I also have no idea which of the possible problems you might be talking about. PM me. I'd put a very high probability on it being either not a problem, or something I thought of years ago.

Yes, blackmail can enable a remote attacker to root your AI if your AI was not designed with this in mind and does not have a no-blackmail equilibrium (which nobody knows how to describe yet). This is true for any AI that ends up with a logical decision theory, indirectly normative or otherwise, and also for CDT AIs which encounter other agents that can make credible precommitments. I figured that out I don't even recall how long ago (remember: I'm the guy who first wrote down an equation for that issue; also, I wouldn't be bothering with TDT at all if it weren't relevant to some sort of existential risk). Didn't talk about it at the time for obvious reasons. The existence of N fiddly little issues like this, any one of which can instantly kill you with zero warning if you haven't reasoned through something moderately complex in advance and without any advance observations to hit you over the head, is why I engage in sardonic laughter whenever someone suggests that the likes of current AGI developers would be able to handle the problem at their current level of caring. Anyway, MIRI workshops are actively working on advancing our understanding of blackmail to the point where we can eventually derive a robust no-blackmail equilibrium, which is all that anyone can or should be doing AFAICT.
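For readers who haven't seen the intuition spelled out, here is a minimal toy sketch of a blackmail game. The payoffs and function names are made-up illustrative assumptions, not Eliezer's or MIRI's formalism: a blackmailer that conditions its threat on the victim's predicted policy only bothers to threaten agents whose policy is to give in, so the agent whose policy is to refuse never gets threatened in the first place.

```python
# Toy blackmail game, a minimal sketch (all payoffs are made up for illustration).
# A blackmailer decides whether to issue a threat based on a prediction of the
# victim's policy; carrying out the threat is costly for the blackmailer too.

PAY = 5          # what the victim hands over if they give in
CARRY_OUT = -2   # cost to the blackmailer of actually executing the threat
HARM = -10       # harm to the victim if the threat is executed

def blackmailer_payoff(victim_gives_in: bool) -> float:
    """Blackmailer's payoff from issuing a threat against a known victim policy."""
    return PAY if victim_gives_in else CARRY_OUT

def victim_payoff(gives_in: bool, threatened: bool) -> float:
    """Victim's payoff given their policy and whether a threat was issued."""
    if not threatened:
        return 0.0
    return -PAY if gives_in else HARM

for policy in (True, False):  # True = "always give in", False = "never give in"
    # A predictor-blackmailer only threatens when threatening beats not threatening (payoff 0).
    threatens = blackmailer_payoff(policy) > 0
    print(f"victim policy give_in={policy}: threatened={threatens}, "
          f"victim payoff={victim_payoff(policy, threatens)}")
```

In this toy setup the refuse-everything policy comes out ahead (payoff 0 versus -5), because the blackmailer never finds it worthwhile to threaten. The open problem the comment above points at is turning that intuition into a robust no-blackmail equilibrium for agents that actually reason about each other's decision procedures.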

PM sent.

I agree that solving blackmail in general would make things easier, and it's good that MIRI is working on this.

Here is a reference for indirect normativity and the associated Less Wrong discussion.

Yep. One of my comments in the discussion thread describes about half of X; I didn't figure out the other half until today, and others didn't notice the problem back then.

For future reference, whenever anybody says "I've got a secret" I strongly desire to learn that secret, especially if it's prefaced with "ooh, and the secret is potentially dangerous" (because "dangerous" translates to "cool"), and I expect that I'm not alone in this.

Yeah, I feel the same way. Fortunately, this particular secret is not hard to figure out once you know it's there; several LWers have done so already. Knowing your work on LW, I would expect you of all people to solve it in about 30 seconds. That's assuming you already understand UDT, indirect normativity, and Roko's basilisk. If you don't, it's easy to find explanations online. If you understand the prerequisites but are still stumped, PM me and I'll explain.

Surely EY won't mind if you put a trigger warning and rot13 the argument. Or post it on the LW subreddit.

It's really rather confusing and painful to have a discussion without talking about what you are discussing, using only veiled words, indirect hints, and private messages. May I suggest that we instead move the discussion to the uncensored thread, if we are going to discuss it at all?

I believe that the whole point of the uncensored thread is to provide a forum for the discussion of topics, such as basilisks, that Eliezer is uncomfortable with us discussing on the main site.

[anonymous]

I don't know of anyone who's seriously thinking about this particular problem and would want to discuss it there. Maybe someone will prove me wrong, though.

[This comment is no longer endorsed by its author]

How is the location of the discussion of any relevance? Discussing it over there is no harder than anywhere else, and is in fact even easier, seeing as one wouldn't need to go through the whole dance of PMing one another.

And the exact purpose of that thread was for discussions like this one. If discussions like this one are not held there, why does that thread even exist?

Is this relevant for my variant of "indirect normativity" (i.e. allowing human-designed WBEs, no conceptually confusing tricks in constructing the goal definition)?

(I'm generally skeptical about anything that depends on the concept of "blackmail" distinct from general bargaining, as it seems to be hard to formulate the distinction. It seems to be mostly the affect of being bargained against really unfairly, which might happen with a smart opponent if you are bad at bargaining and are vulnerable to accepting unfair deals, so the solution appears to be to figure out how to bargain well, and avoid bargaining against strong opponents until you do figure that out.)
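As a rough illustration of the point in the parenthetical, here is a toy ultimatum-style sketch, with made-up numbers and hypothetical names; it is only meant to show why the line is hard to draw, not to capture anyone's actual proposal. A proposer who knows the responder's acceptance threshold offers exactly that much, so a responder known to accept very unfair deals gets very unfair deals, whether the interaction is labeled trade, bargaining, or blackmail.

```python
# Toy ultimatum-style game, a minimal sketch (numbers are made up for illustration).
# The point: "being bargained against really unfairly" and "being blackmailed"
# look the same from the responder's side -- both come down to what offers
# your policy is known to accept.

PIE = 10  # total surplus to be split

def proposer_best_offer(min_acceptable: int) -> int:
    """A proposer who knows the responder's threshold offers exactly that much."""
    return min(min_acceptable, PIE)

for threshold in (1, 3, 5):
    offer = proposer_best_offer(threshold)
    print(f"responder accepts >= {threshold}: gets {offer}, proposer keeps {PIE - offer}")
```

On this picture the work is in choosing the threshold well (i.e. bargaining well), not in classifying the opponent's move as blackmail versus ordinary trade.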

Yeah, I think it's relevant for your variant as well.

I don't see how something like this could be a natural problem in my setting. It all seems to depend on how the goal definition (i.e. the WBE research team evaluated by the outer AGI) thinks such issues through, e.g. making sure that they don't participate in any bargaining when they are not ready, at human level or using poorly-understood tools. Whatever problem you can notice now, they could also notice while being evaluated by the outer AGI (if evaluation of the goal definition does happen), so the critical issues for the goal definition are the ones that can't be solved at all. It's more plausible to get decision theory in the goal-evaluating AGI wrong, so that it can itself lose bargaining games and end up effectively abandoning goal evaluation, which is a clue in favor of it being important to understand bargaining/blackmail pre-AGI. PM/email me?

I agree that if the WBE team (who already know that a powerful AI exists and that they're in a simulation) can resist all blackmail, then the problem goes away.

I think I understand X, and it seems like a legitimate problem, but the comment I think you're referring to here seems to contain (nearly) all of X and not just half of it. So I'm confused and think I don't completely understand X.

Edit: I think I found the missing part of X. Ouch.

Yeah. The idea came when I was lying in a hammock, half asleep after dinner; it really woke me up :-) Now I wonder what approach could overcome such problems, even in principle.

If the basilisk is correct,* it seems any indirect approach is doomed, but I don't see how it prevents a direct approach. But that has its own set of probably-insurmountable problems, I'd wager.

* I remain highly uncertain about that, but it's not something I can claim to have a good grasp on or to have thought a lot about.

[anonymous]

My position on the basilisk: if someone comes to me worrying about it, I can probably convince them not to worry (I've done that several times), but if someone comes up with an AI idea that seems to suffer from basilisks, I hope that AI doesn't get built. Unfortunately we don't know very much. IMO open discussion would help.

[This comment is no longer endorsed by its author]
[anonymous]

You probably understand it correctly. I say "half" because Paul didn't consider the old version serious; we hadn't yet made the connection with basilisks.

[This comment is no longer endorsed by its author]
[anonymous]

I thought of this while reading Paul's original post and assumed he knew about it. I guess I should update in the direction of saying obvious things, but I have trouble distinguishing things that are stated explicitly from inferences that I make quickly. (I also have the opposite problem where if something is obviously wrong and reading it doesn't make me update at all, I don't notice that the point was made and think that the writer is just asserting something.)

Should I make any other updates? Does anyone have experience with reducing the frequency of this sort of error?

[This comment is no longer endorsed by its author]

If you understand Paul's idea and understand basilisks, figuring out X should take you about five minutes (there's only one obvious way to combine the two ideas), so you might as well do it now.

I agree that X (as well as some related non-basilisk issues) is a weakness of the indirect normativity approach.

In conclusion, I'd like to ask Eliezer to rethink his position on secrecy. If more LWers understood basilisks, somebody might have come up with X earlier.

Unless I am mistaken about what you mean by X, they did come up with X earlier. They just couldn't speak of it here at the risk of provoking Yudkowskian Hysterics.

Can you tell me who specifically came up with X and when? PM if you like.

Can you tell me who specifically came up with X and when? PM if you like.

Just me specifically. But I had better add the disclaimer that I considered an X' that seems to fit the censored context, assuming that I have correctly mapped Paul's 'indirect normativity' onto my informal conception of that context. I am of course curious as to what specifically your X is. I am wary of assuming I correctly understand you; people misunderstand each other easily even when the actual arguments are overt. If you have written it up already, would you mind PMing it?

If the NBC (No Blackmail Conjecture) is correct, then that shouldn't be a problem.

Can you state the conjecture or link to a description?

By that term I simply mean Eliezer's idea that the correct decision theory ought to use a maximization vantage point with a no-blackmail equilibrium.

Maybe a more scary question isn't whether we can stop our AIs from blackmailing us, but whether we want to. If the AI has an opportunity to blackmail Alice for a dollar to save Bob from some suffering, do we want the AI to do that, or let Bob suffer? Eliezer seems to think that we obviously don't want our FAI to use certain tactics, but I'm not sure why he thinks that.

Hm, I don't see it. Unless you're referring to the case where the humans aren't smart enough to abide by something like UDT. Maybe you estimated it as too obvious :)

EDIT: okay, I see a more general version now. But I think it's speculative enough that it's actually not the biggest problem with Paul's particular proposal. So it's probably still not what you expected me to think of :D