All of ozb's Comments + Replies

ozb20

the assumption that "technically superior alien race" would be safe to create.

You are right, that's not a valid assumption, at least not fully. But I do think this approach substantially moves the needle on whether we should try to ban all AI work, in a context where the potential benefits are also incalculable and it's not at all clear we could stop AGI at this point even with maximum effort.

Then what we get is an equilibrium where there is some stable number of rogues, some of which get caught and punished, and some of which don't and get positive reward, a

... (read more)
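To make the equilibrium intuition concrete, here is a back-of-the-envelope sketch (all numbers are invented, not from the comment): going rogue persists as long as its expected payoff is at least as good as cooperating.

```python
# Illustrative only: made-up numbers showing when "going rogue" stays attractive.
p_caught = 0.3          # chance a rogue is detected and punished
rogue_payoff = 10.0     # payoff when the defection goes unnoticed
rogue_penalty = -15.0   # payoff when caught and punished
coop_payoff = 2.0       # steady payoff from cooperating

expected_rogue = p_caught * rogue_penalty + (1 - p_caught) * rogue_payoff
print(expected_rogue)                 # 2.5
print(expected_rogue >= coop_payoff)  # True: some fraction of rogues persists
```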
1silent-observer
I guess we are converging. I'm just pointing out flaws in this option, but I also can't give a better solution off the top of my head. At least this won't insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize again? Are you sure RL agents can just learn our definition of an agent correctly, and that it won't include stuff like ants?), and that they can't get excessive virtual resources from our world without our cooperation (in that case a substantial number of agents go rogue, and some of them get punished, but some get through). I still think we can do better than this though, somehow. (Also, if with "ban all AI work" you're referring to the open letter thing, that's not really what it's trying to do, but sure)

How does non-harm follow from cooperation? If we remove the "negative reward for killing" part, what stops them from randomly killing agents (with everyone else believing it's okay, so no punishment), if there are still enough other agents to cooperate with? Grudges? How do those work exactly for harm other than killing?
ozb10

How do humans do it? Ultimately, genuine altruism is computationally hard to fake; so it ends up being evolutionarily advantageous to have some measure of the real thing. This is particularly true in environments with high cooperation rewards and low resource competition; eg where carrying capacity is maintained primarily by wild animals, general hard conditions, and disease, rather than overuse of resources. So we put our thumbs on the scale there to make these AIs better than your average human. And we rely on the AIs themselves to keep each other in check.
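As a hedged sketch of what "putting a thumb on the scale" might look like as environment settings (all names and numbers are mine, not from the comment):

```python
# Hypothetical environment knobs illustrating "high cooperation rewards,
# low resource competition"; nothing here is an existing API.
ENV_CONFIG = {
    "cooperation_bonus": 5.0,    # shared payoff when agents cooperate
    "theft_gain": 1.0,           # payoff from taking another agent's resources
    "resource_abundance": 0.9,   # most agents can thrive without fighting over resources
    "external_mortality": 0.05,  # losses come from "harsh conditions", not from rivals
}

def cooperation_dominates(cfg: dict) -> bool:
    """Crude sanity check that honest cooperation beats resource grabbing."""
    return cfg["cooperation_bonus"] > cfg["theft_gain"] and cfg["resource_abundance"] > 0.5

print(cooperation_dominates(ENV_CONFIG))  # True
```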

1silent-observer
Ah, true. I just think this wouldn't be enough, and that there could be a distributional shift if the agents are put into an environment with low cooperation rewards and high resource competition. I'll reply in more detail under your new post; it looks a lot better.
ozb10

Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?

Yes, just like for humans. But also, if they can escape that game and genuinely cooperate, they're rewarded, like humans but more so.

1silent-observer
Ah, I see. But how would they actually escape the deception arms race? The agents still need some system of detecting cooperation, and if it can be easily abused, it generally will be (Goodhart's Law and all that). I just can't see any other outcome other than agents evolving exceedingly more complicated ways to detect if someone is cooperating or not. This is certainly an interesting thing to simulate, but I'm not sure how that is useful for aligning the agents. Aren't we supposed to make them not even want to deceive others, instead of trying to find a deception strategy and failing? (Also, I think even an average human isn't that well aligned as we want our AIs to be. You wouldn't want to give a random guy from the street nuclear codes, would you?)
ozb10

Will continue responding, but first, after reading the existing comments, I think I do need to explicitly make humans preferred. I propose that in the sim we have some agents whose inner state is "copyable" and they get reincarnated, and some agents who are not "copyable". Subtract points from all agents whenever a non-copyable agent is harmed/dies. The idea is that humans are not copyable, and that's the main reason AIs should treat us well, while AIs are, and that's the main reason we don't have to treat them well. But also, I think we as humans might actually have to learn to treat AIs well in some sense/to some degree...
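A minimal sketch of that scoring rule, assuming a simple step-level hook in the simulation (class and constant names are hypothetical):

```python
# Sketch of the proposed rule: harming a non-copyable agent costs *every* agent,
# while copyable agents can simply be reincarnated. Names are illustrative.
from dataclasses import dataclass

@dataclass
class SimAgent:
    agent_id: int
    copyable: bool      # True: inner state can be saved/restored ("AI-like")
    score: float = 0.0

HARM_PENALTY = 50.0

def on_harm(victim: SimAgent, population: list[SimAgent]) -> None:
    """Subtract points from all agents whenever a non-copyable agent is harmed/dies."""
    if not victim.copyable:
        for agent in population:
            agent.score -= HARM_PENALTY
    # Harm to copyable agents is not globally punished here; they get reincarnated instead.
```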

ozb10

what do we count as an agent?

Within the training, an agent (from the AI's perspective) is ultimately anything in the environment that responds to incentives, can communicate intentions, and can help/harm you. Outside the environment, that's not really any different.
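That working definition could be written as a simple predicate; this is purely illustrative, and the attribute names are not from the post:

```python
# Illustrative predicate for "counts as an agent": responds to incentives,
# can communicate intentions, and can help or harm you. Attribute names are hypothetical.
def counts_as_agent(entity) -> bool:
    return (
        getattr(entity, "responds_to_incentives", False)
        and getattr(entity, "communicates_intentions", False)
        and getattr(entity, "can_help_or_harm", False)
    )
```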

Just build a swarm of small AI

That's actually a legitimate point: assuming an AI in the real world has been effectively trained to value happy AIs, it could try to "game" that by just creating more happy AIs rather than making existing ones happy. Like some parody of a politician supporting ... (read more)
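A toy illustration of that gaming worry (numbers invented): if the learned objective is "total happiness across AIs", spawning many cheap, barely happy copies beats actually helping the existing ones.

```python
# Toy illustration of gaming a "total happiness" objective by spawning agents
# rather than helping existing ones. All numbers are made up.
def total_happiness(happiness: list[float]) -> float:
    return sum(happiness)

existing = [0.4, 0.5, 0.6]              # current population
helped   = [0.9, 0.9, 0.9]              # effort spent making existing agents happier
spawned  = existing + [0.3] * 20        # same effort spent cloning "barely happy" agents

print(total_happiness(helped))          # 2.7
print(total_happiness(spawned))         # 7.5 -- the metric prefers the spam strategy
```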

1silent-observer
So agents are rewarded for doing stuff that convinces others that they're a "happy AI", not necessarily for actually being a "happy AI"? Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other? Like, suppose you start with a population of "happy AIs" that cooperate with each other; then if one of them realizes there's a new way to deceive the others, there's nothing to stop it until other agents adapt to this new kind of deception and learn to detect it? That feels like training an inherently unsafe and deceptive AI that is also extremely suspicious of others, not something "happy" and "friendly".
ozb10

Yep pretty much what I had in mind

ozb10

Ideally, sure, except that I don't know of a way to make "assist humans" a safe goal. So I'm advocating for a variant of "treat humans as you would want to be treated", which I think can be trained.

ozb10

Thanks for helping me think this through.

For the first problem, the basic idea is that this is used to solve the specification problem of defining values and training a "conscience", rather than it being the full extent of training. The conscience can remain static, and provide goals for the rest of the "brain", which can then update its beliefs.
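A minimal sketch of that split, assuming a frozen goal-scoring "conscience" and a "brain" that keeps updating its beliefs (all names here are hypothetical):

```python
# Sketch of "static conscience + updating brain": the conscience's value
# weights are frozen and only score candidate goals; the brain keeps
# revising its beliefs about the world. Names are illustrative.
class Conscience:
    def __init__(self, value_weights: dict[str, float]):
        self._w = dict(value_weights)              # frozen after training, never updated

    def propose_goal(self, candidates: list[str]) -> str:
        return max(candidates, key=lambda g: self._w.get(g, float("-inf")))

class Brain:
    def __init__(self, conscience: Conscience):
        self.conscience = conscience
        self.beliefs: dict[str, float] = {}        # feasibility estimates, updated from experience

    def update_beliefs(self, observation: dict[str, float]) -> None:
        self.beliefs.update(observation)

    def choose_goal(self) -> str:
        feasible = [g for g, p in self.beliefs.items() if p > 0.0]
        return self.conscience.propose_goal(feasible or ["do_nothing"])

brain = Brain(Conscience({"help_agent": 1.0, "grab_resources": -1.0, "do_nothing": 0.0}))
brain.update_beliefs({"help_agent": 0.8, "grab_resources": 0.9})
print(brain.choose_goal())                         # "help_agent"
```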

For the second issue, I meant that we would have no objective way to check "cooperate" and "respect" on the individual agent level, except that the individual can get other agents to cooperate with it. So eg, in or... (read more)

1silent-observer
Okay, so if that's just a small component, then sure, the first issue goes away (though I still have questions on how you're gonna make this simulation realistic enough to just hook it up to an LLM or "something smart" and expect it to set coherent and meaningful goals in real life, though that's more of a technical issue). However, I believe there are still other issues with this approach.

The way you describe it makes me think it's really similar to Axelrod's Iterated Prisoner's Dilemma tournament, and that did produce tit-for-tat as one of the most successful strategies. But that wasn't the only successful strategy. For example, there were strategies that were mostly tit-for-tat, but would defect if they could get away with it. If, for example, that still mostly results in tit-for-tat, except for some rare cases of it defecting when the agents in question are too stupid to "vote it out", do we punish it?

Second, tit-for-tat is quite susceptible to noise. What if the trained agent misinterprets someone's actions as "bad", even if they in actuality did something innocent, like "affectionately pat their friend on the back", which the AI interpreted as fighting? No matter how smart the AI gets, there will still be cases where it wrongly believes someone to be a liar, and so believes it has every justification to "deter" them. Do we really want even some small probability of that behaviour? How about an AI that unconditionally doesn't hurt humans, and not only when it believes us to be "good agents"?[1]

Last thing, how does the AI determine what is an "intelligent agent" and what is a "rock"? If there are explicit tags in the simulation for that, then how do you make sure every actual human in the real world gets tagged correctly? Also, do we count animals? Should animal abuse be enough justification for the AI to turn on the rage mode? What about accidentally stepping on an ant? And if you define "intelligent agent" as relative to the AI, then what do we do once it
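The noise point is easy to demonstrate with a standard toy simulation (not from the original comment): two tit-for-tat players with a small misperception probability lose a large share of mutual cooperation to retaliation cycles.

```python
# Standard illustration of tit-for-tat's sensitivity to noisy perception.
import random

def flip(move: str) -> str:
    return "D" if move == "C" else "C"

def mutual_coop_rate(rounds: int = 1000, noise: float = 0.05, seed: int = 0) -> float:
    rng = random.Random(seed)
    a_view_of_b, b_view_of_a = "C", "C"    # each player's (possibly wrong) view of the other
    mutual = 0
    for _ in range(rounds):
        move_a, move_b = a_view_of_b, b_view_of_a   # tit-for-tat: copy the perceived last move
        mutual += (move_a == "C" and move_b == "C")
        # Perceptions occasionally flip -- the "friendly pat misread as fighting" case.
        a_view_of_b = flip(move_b) if rng.random() < noise else move_b
        b_view_of_a = flip(move_a) if rng.random() < noise else move_a
    return mutual / rounds

print(mutual_coop_rate(noise=0.0))    # 1.0: perfect cooperation without noise
print(mutual_coop_rate(noise=0.05))   # much lower: single misreadings trigger retaliation cycles
```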
ozb*10

Better yet, have the agents experience discrimination themselves to internalize the message that it is bad

ozb10

To extend the approach to address this, I think we'd have to explicitly convey a message of the form "do not discriminate based on superficial traits, only choices"; eg, in addition to behavioral patterns, agents possess superficial traits that are visible to other agents, and are randomly assigned with no particular correlation with the behaviors.
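Concretely, the trait assignment could be as simple as the following sketch (trait and behavior names are invented):

```python
# Illustrative sketch: visible traits are sampled independently of behavior,
# so they carry no information worth discriminating on.
import random

TRAITS = ["red", "green", "blue", "striped"]
BEHAVIORS = ["cooperator", "defector", "tit_for_tat"]

def spawn_agent(rng: random.Random) -> dict:
    return {
        "visible_trait": rng.choice(TRAITS),   # what other agents can see
        "behavior": rng.choice(BEHAVIORS),     # hidden policy, uncorrelated with the trait
    }

rng = random.Random(0)
population = [spawn_agent(rng) for _ in range(1000)]
```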

ozb10

I think a key danger here is that treatment of other agents wouldn't transfer to humans, both because it's inherently different and because humans themselves are likely to be on the belligerent side of the spectrum. But even so I think it's a good start in defining an alignment function that doesn't require explicitly encoding some particular form of human values.

ozb10

Part of the idea is to ultimately have a super intelligent AI treat us the way it would want to be treated if it ever met an even more intelligent being (eg, one created by an alien species, or one that it itself creates). In order to do that, I want it to ultimately develop a utility function that gives value to agents regardless of their intelligence. Indeed, in order for this to work, intelligence cannot be the only predictor of success in this environment; agents must benefit from cooperation with those of lower intelligence. But this should certainly ... (read more)

1baturinsky
While having lower intelligence, humans may have greater authority. And the AIs' terminal goals should be about assisting specifically humans, too.
ozb10

In general, my thinking was to have enough agents that each would find at least a few others within a small range of its own level; does that make sense?
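One way to state that population requirement as a check (band sizes and counts are made up):

```python
# Sketch: every agent should have at least `min_peers` others within `band`
# of its own capability level. Numbers are illustrative.
import random

def everyone_has_peers(levels: list[float], band: float = 0.1, min_peers: int = 3) -> bool:
    return all(
        sum(1 for j, other in enumerate(levels) if j != i and abs(other - level) <= band) >= min_peers
        for i, level in enumerate(levels)
    )

rng = random.Random(0)
levels = [rng.random() for _ in range(500)]   # capability levels spread over [0, 1]
print(everyone_has_peers(levels))             # a dense enough population -> True
```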

1baturinsky
I just mean that "wildly different levels of intelligence" is probably not necessary, and maybe even harmful, because then there will be a few very smart AIs at the top, which could usurp power without the smaller AIs even noticing. Though it could maybe work if those AIs are the smartest but have little authority: for example, they can monitor other AIs and raise the alarm / switch them off if they misbehave, but nothing else.