Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Bob: "I want my AGI to make everyone extremely wealthy! I'm going to train that to be its goal."

Cassie: "Stop! You'll doom us all! While wealth is good, it's not everything that's good, and so even if you somehow build a wealth-maximizer (instead of summoning some random shattering of your goal), it will sacrifice all the rest of the good in the name of wealth!"

Bob: "Maybe if it suddenly became a god-like superintelligence, but I'm a hard take-off skeptic. In the real world we have continuous processes and I'm going to be in control. If it starts to go off the rails, I'll just stop it and re-train it to not do that."

Cassie: "Be careful what you summon! While it may seem like you're in control in the beginning, these systems are generalized obstacle-bypassers, and you're making yourself into an obstacle that needs to be bypassed. Whether that takes two days or twenty years, you're setting us up to die."

Bob: "Ok, fine. So I'll build my AGI to make people rich and simultaneously to respect human values and property rights and stuff. At the point where it can bypass me, it'll avoid turning everyone into bitcoin mining rigs or whatever because that would go against its goal of respecting human values."

Cassie: "What does 'human values' even mean? I agree that if you can build an AGI that is truly aligned, we're good, but that's a tall order and it doesn't even seem like what you're aiming for. Instead, it seems like you think we should train the AGI to maximize a pile of desiderata."

Bob: "Yeah! My AGI will be helpful, obedient, corrigible, honest, kind, and will never produce copyrighted songs, memorize the NYT, or impersonate Scarlett Johansson! I'll add more desiderata to the list as I think of them."

Cassie: "And what happens when those desiderata come into conflict? How does it decide what to do?"

Bob: "Hrm. I suppose I'll define a hierarchy like Asimov's laws. Some of my desiderata, like corrigibility, will be constraints, while others, like making people rich, will be values. When a constraint comes into conflict with a value, the constraint wins. That way my agent will always shut down when asked, even though doing so would be a bad way to make us rich."

Cassie: "Shutting down when asked isn't the hard part of corrigibility, but that's a tangent. Suppose that the AGI is faced with a choice between a 0.0001% chance of being dishonest while earning a billion dollars, and a 0.00001% chance of being dishonest while earning nothing. What will it do?"

Bob: "Hrm. I see what you're saying. If my desiderata are truly arranged in a hierarchy with certain constraints on top, then my agent will only ever pursue its values if everything upstream is exactly equal, which won't be true in most contexts. Instead, it'll essentially optimize solely for the topmost constraint."
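To make Bob's realization concrete, here is a minimal sketch of a strictly hierarchical (lexicographic) chooser applied to Cassie's example. The names `Option` and `lexicographic_choice` are hypothetical, and treating "chance of being dishonest" as a single number to minimize is an assumption made purely for illustration.

```python
from typing import NamedTuple


class Option(NamedTuple):
    name: str
    p_dishonest: float  # probability of being dishonest (the "constraint")
    dollars: float      # money earned (the "value")


def lexicographic_choice(options):
    # Minimize dishonesty first; dollars matter only when the
    # dishonesty probabilities are exactly equal.
    return min(options, key=lambda o: (o.p_dishonest, -o.dollars))


# Cassie's example: 0.0001% = 1e-6 and 0.00001% = 1e-7.
risky = Option("earn a billion", 1e-6, 1e9)
safe = Option("earn nothing", 1e-7, 0.0)

choice = lexicographic_choice([risky, safe])
print(choice.name)
```

Under these assumptions the chooser gives up the billion dollars for a tiny reduction in the chance of dishonesty, and the dollar amount only ever acts as a tie-breaker.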

Cassie: "I predict that it'll actually learn to want a blend of things, and find some weighting such that your so-called 'constraints' are actually just numerical values along with the other things in the blend. In practice you'll probably get a weird shattering, but if you're magically lucky on getting what you aim for, you'll still probably just get a weighted mixture. Getting a truly hierarchical goal seems nearly impossible outside of toy problems."
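A weighted mixture of the kind Cassie describes can be sketched as follows. This is an illustration under stated assumptions, not a claim about how a trained system would represent its goals: the function `weighted_choice`, the linear form of the utility, and the specific weights are all hypothetical.

```python
def weighted_choice(options, dishonesty_weight):
    # Each option is (name, p_dishonest, dollars).
    # Assumed utility: dollars minus a penalty proportional to the
    # probability of dishonesty. The "constraint" is just one more term.
    return max(options, key=lambda o: o[2] - dishonesty_weight * o[1])


options = [
    ("earn a billion", 1e-6, 1e9),  # 0.0001% chance of dishonesty
    ("earn nothing", 1e-7, 0.0),    # 0.00001% chance of dishonesty
]

low_weight_pick = weighted_choice(options, dishonesty_weight=1e12)
high_weight_pick = weighted_choice(options, dishonesty_weight=1e16)
print(low_weight_pick[0], "|", high_weight_pick[0])
```

In this toy setup, which option wins depends entirely on the numerical weight, which is why the weighting has to be right for the blend to behave as intended.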

Bob: "Doesn't this mean we're also doomed if we train an AGI to be truly aligned? Like, won't it still sometimes sacrifice one aspect of alignment, like being honest, in order to get a sufficiently large quantity of another aspect of alignment, like saving lives?"

Cassie: "That seems confused. My point is that a coherent agent will act as though it's maximizing a utility function, and that if your strategy involves lumping together a bunch of desiderata as good-in-themselves (i.e. terminally valuable) then you need to have a story about why the system will have the exact right weighting. By contrast, we can see some systems as emergently producing desiderata as a means towards their underlying goals (i.e. instrumentally valuable). If an aligned system chooses to be deceitful, it is because, by assumption, the system believes that dishonesty is the best way to get a good outcome. It's not sacrificing anything, except insofar as there might be other strategies that are even better at getting what it wants."

Bob: "So you're saying that I should be making an AGI which satisfies my desiderata as an emergent consequence of pursuing its goals, rather than as ends in themselves?"

Cassie: "Actually, I'm telling you not to build AGI. One of many reasons you're doomed is that, regardless of whether you try to get your desiderata instrumentally or terminally, you've got problems. If you try to get all your desiderata as terminal values, you somehow need to ensure that the numerical weights you attach to each one are exactly correct, so that it won't end up ignoring some vital characteristic like shutdownability in favor of maximizing the other things in its weighted sum. Conversely, if your story is that by picking your goal very carefully, you'll get all the things you need emergently, then you need to ensure that you're not fooling yourself, and that those things do, in fact, show up in practice, even when the AGI starts operating in a novel environment, such as one where it's superintelligent. There are additional sources of doom besides this, but at the very least you need to be clear about which side of this fork you're trying to take, and how you're dealing with that path's problem."
