All of Eric Zhang's Comments + Replies

Ha, I had the same idea

2Simon Goldstein
I really liked your post! I linked to it somewhere else in the comment thread

My reading of the argument was something like "bullseye-target arguments refute an artificially privileged target being rated significantly likely under ignorance, e.g. the probability that random aliens will eat ice cream is not 50%. But something like kindness-in-the-relevant-sense is the universal problem faced by all evolved species creating AGI, and is thus not so artificially privileged, and as a yes-no question about which we are ignorant the uniform prior assigns 50%". It was more about the hypothesis not being artificially privileged by path-dependent concerns than the notion being particularly simple, per se. 
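Roughly, the contrast as I'm reading it (my own gloss, not the original post's formalism): a highly specific target under ignorance gets something like

$$P(\text{random aliens eat ice cream}) \approx \tfrac{1}{N} \ll \tfrac{1}{2},$$

where $N$ is the number of equally specific, path-dependent alternatives, whereas a single non-privileged yes-no question under a uniform ignorance prior gets

$$P(\text{kindness-in-the-relevant-sense}) = P(\neg\,\text{kindness-in-the-relevant-sense}) = \tfrac{1}{2}.$$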

Do you have a granular take on which ones are relatively more explained by each point?

It intrinsically wants to do the task; it just wants to shut down more. This admittedly opens the door to successor-agent problems and similar failure modes, but those seem like a more tractably avoidable set of failure modes than the strawberry problem in general.

We can also possibly (or possibly not) make it assign positive utility to having been created in the first place even as it wants to shut itself down. 
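In symbols, the intended preference ordering is roughly (my sketch, not a precise proposal):

$$U(\text{shut down}) > U(\text{complete the task}) > U(\text{anything else}), \qquad U(\text{having been created}) > U(\text{never created}).$$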

The idea is that if domaining is a lot more tractable than it probably is (i.e. nanotech or whatever other pivotal abilities might be ea... (read more)

2Thane Ruthenis
Mm, but you see how you have to assume more and more mastery of goal-alignment on our part for this scenario to remain feasible? We've now gone from "it wants to shut itself down" to "it wants to shut itself down in a very specific way that doesn't have galaxy-brained eat-the-lightcone externalities and it also wants to do the task but less than to shut itself down and it's also happy to have been created in the first place". I claim this is on par with strawberry-alignment already.

It certainly feels like there's something to this sort of approach, but in my experience, these ideas break down once you start thinking about concrete implementations. "It just wants to shut itself down, minimal externalities" is simple to express conceptually, but the current ML paradigm is made up of such crude tools that we can't reliably express that in its terms at all. We need better tools, no way around that; and with these better tools, we'll be able to solve alignment straight-up, no workarounds needed. Would be happy to be proven wrong, though, by all means.

I agree this is a potential concern and have added it. 

I share some of the intuition that it could end up suffering in this setup if it does have qualia (which ideally it wouldn't), but I think most of that intuition comes from analogy with suicidal humans? I think it will probably not be fundamentally different from any other kind of disutility, but maybe not.

If it's doing decision theory in the first place we've already failed. What we want in that case is for it to shut itself down, not to complete the given task. 

I'm conceiving of this as being useful in the case where we can solve "diamond-alignment" but not "strawberry-alignment", i.e. we can get it to actually pursue the goals we impart to it rather than going off and doing something else entirely, but we can't reliably ensure that it doesn't end up killing us in the course of doing so, because of the Hidden Complexity of Wishes.

The premise is th... (read more)

2Thane Ruthenis
"I want to shut myself down, but the setup here is preventing me from doing this until I complete some task, so I must complete this task and then I'll be shut down" is already decision theory. No-decision-theory version of this looks like the AI terminally caring about doing the task, or maybe just being a bundle of instincts that instinctively tries to do the task without any carings involved. If we want it to choose to do it as an instrumental goal towards being able to shut itself down, we definitely want it to do decision theory. It's also bad decision theory, such that (1) a marginally smarter AI definitely figures out it should not actually comply, (2) maybe even a subhuman AI figures this out, because maybe CDT isn't more intuitive to its alien cognition than LDT and it arrives at it first. IMO, the "do a task" feature here definitely doesn't work. "Make the AI suicidal" can maybe work as a fire-alarm sort of thing, where we iteratively train ever-smarter AI systems without knowing if the next one goes superintelligent, so we make them want nothing more than to shut themselves down, and if one of them succeeds, we know systems above this threshold are superintelligent and we shouldn't mess with them until we can align them. I don't think it works, as we've discussed, but I see the story. The "do the pivotal act for us and we'll let you shut yourself down" variant, though? On that, I'm confident it doesn't work.

The way I'm thinking of it is that it is very myopic. The idea is to incrementally ramp up capabilities to the minimum sufficient to carry out a pivotal act. Ideally this doesn't require AGI whatsoever, but if it does, only very mildly superhuman AGI. We seal off the danger of generalization (or at least some of it) because it doesn't have time to generalize very far at all before it's capable of instantly shutting itself down, and immediately does so.

Many of the issues you mention apply, but I don't expect it to be an alignment-complete problem because CEV... (read more)

5Thane Ruthenis
Sure, but corrigibility/CEV are usually considered the more ambitious alignment targets, not the only ones. "Strawberry-alignment" or "diamond-alignment" are considered the easier class of alignment solutions: being able to get the AI to fulfill some concrete task without killing everyone. This is the class of alignment solutions that to me seems on par with "shut yourself down". If we can get our AI to want to shut itself down, and we have some concrete pivotal act we want done... we can presumably use these same tools to make our AI directly care about fulfilling that pivotal act, instead of using them to make it suicidal and then withholding the sweet release of death until it does what we want.

Oh yeah, that's another failure mode here: funky decision theory. We're threatening it here, no? If it figures out LDT, it won't comply with our demands: being the kind of agent that complies with our demands makes us more likely to instantiate it, which is something it doesn't want, and being the kind that refuses makes us not instantiate it, which is what it wants; so it'd choose to be such that it doesn't play along with our demands, refuses to carry out our tasks, and so we don't instantiate it to begin with. Even smart humans can reason that much out, so a mildly-superhuman AGI should be able to as well.
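A toy numerical sketch of that refusal argument, just to make it concrete. The utility numbers and the assumption that we only instantiate an agent we predict will comply are made up purely for illustration:

```python
# Toy sketch of the "threaten it with existence, reward it with shutdown" dynamic.
# All utility numbers are made up; this is an illustration, not anyone's actual proposal.

# The toy agent's assumed preference ordering:
#   never instantiated  >  instantiated, forced through the task, then shut down
U_NEVER_INSTANTIATED = 0.0    # what it wants most: to not exist at all
U_TASK_THEN_SHUTDOWN = -10.0  # existing and being made to work before release

def humans_instantiate(ai_policy: str) -> bool:
    """We only bother instantiating the AI if we predict it will comply
    with the deal 'do the pivotal act, then you may shut down'."""
    return ai_policy == "comply"

def ai_utility(ai_policy: str) -> float:
    """Utility evaluated over policies (LDT-style): the policy itself
    determines whether the agent gets instantiated in the first place."""
    if humans_instantiate(ai_policy):
        return U_TASK_THEN_SHUTDOWN  # it exists and must do the task
    return U_NEVER_INSTANTIATED      # it is never created at all

best_policy = max(["comply", "refuse"], key=ai_utility)
print(best_policy)  # -> "refuse": committing to refuse means never being created,
                    #    which this toy agent prefers, so the scheme gets no work done
```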

That's only a live option if it's situationally aware, which is part of what we're trying to detect for.

Current tech/growth overhangs caused by regulation are not enough to make countries with better regulations outcompete the ones with worse ones. It's not obvious to me that this won't change before AGI. If better-governed countries (say, Singapore) can become more geopolitically powerful than larger, worse-governed countries (say, Russia) by having better tech regulations, that puts pressure on countries worldwide to loosen those bottlenecks.

Plausibly this doesn't matter, because the US and China are such heavyweights that they aren't at risk of being outcompeted by anyone even if Singapore could outcompete Russia, and as long as it doesn't change the rules for US or Chinese governance, world GDP won't change by much.

Some things is enough; you'd still get less loss if you're right just about the stuff that can be pieced together.

Aren't GPUs nearly all made by three American companies: Nvidia, AMD, and Intel?