There is a line of alignment-related thinking that looks for ways in which agents will tend to be similar. An early example is convergent instrumental goals, and a later one is natural abstractions. Those two ideas share an important attribute: they treat something "mental" (values, abstractions) as at least partially grounded in the objective environment. My goal in this post is to present and discuss another family of mechanisms for convergence of agents. These mechanisms are different in that they arise from interaction between the agents and make them "synchronize" with each other, rather than adapt to a similar non-agentic environment. As a result, the convergence is around things that are "socially constructed" and somewhat arbitrary, rather than "objective" things. I'll then briefly touch on another family of mechanisms and conclude with some short points about relevance to alignment.

 

Instrumental Value Synchronization

Depending on their specific situations and values, agents may care about different aspects of the environment to different degrees. A not-too-ambitious paperclip maximizer (Pam for short) may care more about controlling metal on Earth than about controlling rocks on the Moon. A huge-pyramids-on-the-Moon maximizer (Pym) may have the opposite priorities. But they both need uranium to power their projects, and so they get into conflicts over it. It then makes sense for Pym to care about controlling metal too – to have some leverage over Pam, which may then be used to obtain more uranium. It may want the ability to give Pam metal in exchange for uranium, or to promise to create paperclips in exchange for uranium, or to retaliate with paperclip-destruction if Pam tries to take its uranium. Either way, the result is that Pym is now more like Pam: it cares about metal and paperclips, and about the things that are relevant to those.
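To put a rough number on the shift in Pym's priorities, here is a minimal sketch; the exchange rate and utilities are made-up assumptions, not anything from the post. Metal has no terminal value for Pym, yet the mere presence of a trading partner who wants it gives it positive instrumental value.

```python
# Toy sketch of instrumental value synchronization. All names and numbers are
# illustrative assumptions: metal has zero terminal value for Pym, but once
# Pam is around to trade with, it acquires instrumental value measured in the
# uranium (and hence pyramids) it can buy.

PYRAMID_UTILITY_PER_URANIUM = 2.0  # Pym's terminal utility per unit of uranium
URANIUM_PER_METAL_TRADE = 0.5      # exchange rate Pam would plausibly accept

def pym_value_of_metal(pam_present: bool) -> float:
    """Instrumental value of one unit of metal, from Pym's point of view."""
    if not pam_present:
        # Pym's terminal values don't mention metal at all.
        return 0.0
    # With Pam around, metal can be traded for uranium, which buys pyramids.
    return URANIUM_PER_METAL_TRADE * PYRAMID_UTILITY_PER_URANIUM

print(pym_value_of_metal(pam_present=False))  # 0.0
print(pym_value_of_metal(pam_present=True))   # 1.0 -- Pym now "cares about" metal
```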

 

Money

Money is a strange thing, leading people to say strange things about how it works. The most common ones are "It only has value because we believe it does" and "It only has value because we all agree to accept it". But I do not believe that money has intrinsic value, nor do I recall anyone asking me to consent to its having value. My belief that it has value is only my well-founded expectation that people will want it in exchange for goods, and my agreement to accept it for my work is based solely on this expectation. And even if everyone else had the same attitude, money would not lose its value. The reason is that humans are part of the territory, and it is a fact about the territory that we participate in a robust Nash equilibrium in which money has exchange value. Furthermore, the game setup for which such a Nash equilibrium exists is pretty general, resulting in variations of money being re-invented again and again. On the other hand, it is true that money doesn't have to be green pictures of George Washington and creepy pyramids. This game has many, many equilibria, and the one we happen to participate in is indeed arbitrary.

Money is what happens when the self-referential nature of instrumental value synchronization lets it run wild.
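To make the Nash-equilibrium claim concrete, here is a minimal sketch of a toy acceptance game; the number of agents, the acceptance cost, and the trade value are illustrative assumptions, not anything from the post. Each agent decides whether to accept a token, and accepting pays off only in proportion to how many others accept it. Both "everyone accepts" and "nobody accepts" survive the unilateral-deviation check, so the token's value is self-sustaining without anyone's belief or consent doing the work.

```python
# Toy acceptance game: each of N agents decides whether to accept a token as
# payment. Accepting is only worth anything to the extent that others accept
# it too, and carries a small handling cost.

N = 100       # number of agents
COST = 0.1    # small inconvenience of accepting the token
VALUE = 1.0   # value of being able to trade with everyone who accepts it

def payoff(accepts: bool, others_accepting: int) -> float:
    """One agent's payoff, given how many of the other N-1 agents accept."""
    if not accepts:
        return 0.0
    return VALUE * others_accepting / (N - 1) - COST

def is_equilibrium(profile: list) -> bool:
    """Nash condition: no single agent gains by switching their choice."""
    for accepts in profile:
        others = sum(profile) - (1 if accepts else 0)
        if payoff(not accepts, others) > payoff(accepts, others):
            return False
    return True

print(is_equilibrium([True] * N))   # True: "money has value" sustains itself
print(is_equilibrium([False] * N))  # True: so does "this token is worthless"
```

The same structure repeats for every candidate token, which is the sense in which the particular green pictures we coordinate on are arbitrary even though their value is a hard fact about the territory.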

 

Communicative Synchronization: Words

A very similar thing happens with words, with very similar confusions. People say that we "agree" to give a word a specific meaning, though as a matter of fact we were never asked to agree. Or that we "believe" that the meaning is such and such, though no metaphysical commitment is really required. When I use the word "dog" to refer to my dog, I act on the expectation that you will understand me, because I expect you to expect me to only use this word in this context when talking about dogs. Those expectations are rationally acquired, and everybody behaves rationally given those expectations. That "dog" means dog is a fact about the Nash equilibrium of the English-speaking community.
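The same kind of check works here. Below is a toy naming game whose meanings, nonsense words, and payoff rule are all illustrative assumptions: each player commits to a dictionary assigning a distinct word to each meaning, and a pair of players scores a point for every meaning on which their dictionaries agree.

```python
# Toy naming game: strategies are dictionaries assigning one distinct word to
# each meaning (a deliberately simplified strategy space); communication pays
# off on every meaning where the two dictionaries agree.
from itertools import permutations

MEANINGS = ["dog", "cat", "tree"]
WORDS = ["blick", "dax", "wug"]

DICTIONARIES = [dict(zip(MEANINGS, ws)) for ws in permutations(WORDS)]

def payoff(mine: dict, yours: dict) -> int:
    return sum(mine[m] == yours[m] for m in MEANINGS)

def is_equilibrium(speaker: dict, listener: dict) -> bool:
    # Nash condition: no unilateral switch to another dictionary helps either player.
    current = payoff(speaker, listener)
    return all(payoff(alt, listener) <= current for alt in DICTIONARIES) and \
           all(payoff(speaker, alt) <= current for alt in DICTIONARIES)

shared_conventions = [d for d in DICTIONARIES if is_equilibrium(d, d)]
print(len(shared_conventions))  # 6 -- every shared convention sustains itself
```

Every shared dictionary is an equilibrium, and there are several of them; which one a linguistic community ends up in is exactly as arbitrary as which token counts as money.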

Conceptual Synchronization: Races and Gods

You may think that the concept of "race" as used in the US is nonsensical – either because any concept of race would be nonsensical, or because the specific racial categories don't make biological sense. But you simply cannot be racially blind – because races exist, as a matter of social fact. Refusing to know someone's race is refusing to know too many socially relevant things: how other people see them, how they see themselves, how they may expect you to see them, and what you may do that would accidentally confirm those expectations. Once people make a stupid categorization that keeps track of nothing real, the categorization starts to keep track of something very real – the way that people respond to those categories.

A similar thing goes for religions. You may not be a Christian. The story in your mind about God becoming flesh and dying for our sins may not have the label TRUE written next to it. But the story is written in your mind, because the existence of that story is an important fact about the world's culture.

The lesson is easy to generalize: everything in the worldview of surrounding agents – even if not endorsed – is potentially relevant to your own worldview.

 

Computationally Parasitic Mechanisms

There are more interaction patterns that result in similarity between agents in the same environment. Two closely related mechanisms that I want to mention for completeness’ sake are imitation and prediction-fulfillment. I put those two together as they are both basically using other agents’ brain power.

 

Imitation

We were all born ignorant and unpracticed, into a very complicated world. Reinventing and rediscovering everything from scratch would have been hard. Fortunately, it was not necessary, as we were surrounded by many adults who had already mastered at least the basics. Doing what they do and believing what they believe was a very good first approximation. Many people will never grow beyond that first approximation. Even the smartest people are mostly smart because they found teachers better than their parents (including books) later in life. They are therefore mostly smart in ways that were already discovered, or that are at most 90 brain-years away from things that were already discovered.

It stands to reason that even a young AGI will have much to learn from humans that would be harder to learn alone, and that at least in this phase the AGI will not be as alien as one might expect.

 

Prediction Fulfillment

Another way to use others' brain power is to deduce what they think you should do, rather than what they do themselves. To the degree that their simulation of you is accurate, they are basically doing the thinking instead of you. If they overestimate you – even better!

I personally find myself using this heuristic more than I like to acknowledge, and I bet that many of you use it too. It probably results in us becoming more like how others see us. [An ironic doomsday scenario: an AGI is born with good objectives, and then reads on LessWrong that it should probably take over the world and kill all humans.]

 

Counter Arguments

It is important to be careful about this kind of "generally, agents will tend to X" argument, as such arguments often depend on many hidden assumptions. We should be extra careful given how epistemically comfortable it would be if they turned out to be true. Under the right conditions, the above mechanisms may easily not just fail, but work in the opposite direction.

Specifically, I think that all the above mechanisms assume agents of similar power and some modest potential to coordinate: 

  • Having control over something that someone else cares about may be a very bad thing when that someone is much more powerful. For many nations, having gold was a great curse.
  • Having concepts and communication protocols like those of other agents may be undesirable in some adversarial settings, where not being predictable is more important than predicting others.
  • There is no point in a child imitating a chess player whose plans she does not even understand and cannot successfully carry out.
  • If you don't have a human-like body, you aren't going to learn from humans how to walk.

I don't think this makes the arguments useless, just that the hidden assumptions should not stay hidden. We should then think about which of these mechanisms we do or don't want to activate, and try to create the right conditions.

Comments

I think it could be way easier to achieve robust Alignment in a system with a lot of individually weak agents. Because then, to do something, they would need to cooperate, both implicitly and explicitly. So they would need some common values and a common cultural code. Which, with some nudging from people, could converge into Alignment.