Joel Z. Leibo

Joel is a senior staff research scientist at Google DeepMind and visiting professor at King's College London. He obtained his PhD from MIT where he studied computational neuroscience and machine learning with Tomaso Poggio. Joel is interested in reverse engineering human biological and cultural evolution to inform the development of artificial intelligence that is simultaneously human-like and human-compatible. In particular, Joel believes that theories of cooperation from fields like cultural evolution and institutional economics can be fruitfully applied to inform the development of safe, ethical, and effective artificial intelligence technology.

www.jzleibo.com


Comments


I'm glad you mentioned the orthogonality hypothesis (that goals are orthogonal to intelligence). Part of our argument can be seen as reconceptualizing orthogonality, and rejecting a strong version of it. 

So it's interesting to me that you see orthogonality as leading folks to conflict theory over mistake theory. I am not sure it has to lead either way.

In our language, we say that it's better to have an "endogenous preferences" theory than an "exogenous preferences" theory, meaning that it's better to be able to construct models where preferences can change as a function of the mechanisms and processes being modeled. This is not true for most rational actor and RL models; they typically assume preferences at the start (i.e. "exogenously").
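To make the "exogenous" case concrete in code, here's a toy sketch (just my illustration, not code from any of our papers; the `ExogenousAgent` class and its reward values are made up) of the standard assumption: the agent's preferences are handed in as a fixed reward function, and nothing that happens inside the model can change them.

```python
# Exogenous preferences: the utility/reward function is supplied from outside
# the model at construction time and is never modified by anything inside it.
class ExogenousAgent:
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn  # fixed preferences, assumed at the start

    def choose(self, options):
        # Behavior is just maximization of the given, unchanging preferences.
        return max(options, key=self.reward_fn)

# The preference ordering is an input to the model, not something it explains.
agent = ExogenousAgent(reward_fn=lambda option: {"a": 1.0, "b": 0.3}[option])
print(agent.choose(["a", "b"]))  # -> "a", by assumption
```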

Here's a paragraph making this more concrete, pulled from a shorter summary of https://arxiv.org/abs/2412.19010 that I've been working on.

"Consider an individual's preference for certain types of music, say, rock versus classical. We need not assume they begin with a scalar utility or ``taste'' for one over the other. Instead, in our theory, their preferences are shaped by their experiences: perhaps they grow up hearing rock music frequently at home, attend rock concerts with friends, and rarely encounter classical music. These experiences are encoded as memories. When prompted by the question "What kind of person am I?" and summarizing these memories, the individual's pattern completion network p might generate the assembly "I am the kind of person who likes rock music". This self-attribution, influenced by past behavior and exposure, becomes part of the global workspace context. Subsequently, when faced with a choice like "What music should I listen to?", p, conditioned on the self-description "I like rock music", is more likely to complete the pattern with "Listen to rock music". Through this process, the individual's behavior (listening to rock) is influenced by their inferred identity, which itself arose from past behavior and exposure. This example illustrates how preferences are constructed endogenously through the aggregation and consolidation of experience, rather than being static, external inputs to the model."

Nice work! I like this approach very much. It seems we have been thinking along very similar and compatible lines.

I posted a related one last week: Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt

"there still is semi-agreement (not just within MIRI) that the alignment problem is due to difficulty aligning the AI's goals with human goals, rather than difficulty finding the universe's objective morality to program the AI to follow."

 

Right, that's part of the consensus we wrote the post to dispute. Even though people say they recognize a difference between "align the AI's goals with human goals" and "find the universe's objective morality", they may have implicitly assumed the two are identical as a consequence of other parts of the paradigm's logic (like CEV, the mistake theory framing, and various other related foundational ideas).

Of course individuals have complex views, and many who have thought about this have resolved it for themselves in one way or another. But we still think it's an overall blind spot for the field, especially since there isn't consensus that it's even an issue.

We are saying in the post that we find it more helpful to start from a different set of assumptions.

I agree that one part of our conclusion is related to corrigibility. We didn't use that term here, but we have done so elsewhere, in older work from before we had the new theory. We could probably still use it. It might be a distraction for this audience, though, since corrigibility looks rather different in a multi-agent context, and it's not the focus here.

  1. Not sure what you mean by this comment.
  2. Doesn't this hindsight-based definition of CEV that you offer here preclude using it as an objective to coordinate around? You can't coordinate around something that you'll only know in hindsight. And however you try to estimate it, if you try to get people to coordinate around your estimated CEV, then some groups will feel like it doesn't represent their view of how to make the world a better place, or it doesn't prioritize X over Y in the way they would, etc.

In the context of your Twitter comments about biorisk, the relevant question for this post and the appropriateness theory is probably just whether it would be better to conceptualize these risks within a theory that includes a single objective for all humanity ("make the world a better place") or within one that doesn't accept that. Our claim is that it's easier to think about 'stability of society' and misuse-type harms using our theory. Nothing you said on Twitter is an argument for why it's important to have anything resembling a single objective.

I do think you're right to point out that the sentence you extracted from the long paper where we speculate about the future doesn't really follow from the theory itself.

Well, the fact that fetal alcohol syndrome is associated with poor emotional regulation is apparently true in the sense that you can measure it, and it seems helpful to know if you're treating patients with fetal alcohol syndrome or living with them.

But our theory is formal; it's not just a collection of true statements. It's not really clear how we could use it to model the effect of fetal alcohol syndrome on emotional regulation. So there's a sense in which we don't capture that "feature of reality". But how important is it to capture? All theories are wrong in some ways; the only way to judge them is in terms of whether they are useful. In this context, we are arguing that our theory of appropriateness is a useful improvement on the rational actor theory, which, by the way, also has trouble accounting for the effect of fetal alcohol syndrome on emotional regulation. So if you need a theory of human behavior that can accommodate that effect, you probably shouldn't use either of these.

I don't think our story is ultimately about "making chatbots polite"; it's just that impoliteness and awkwardness are common manifestations of AIs acting in contextually inappropriate ways, and these happen all the time, right now, today.

However, people are now starting to give AIs more decision-making authority in various domains, and as we continue doing that, the scope of possible harm from contextually inappropriate behavior will grow massively. I don't think the scope of possible harm envisioned here differs between our theory and others in the AI safety community.

We are just saying that we think it's useful to conceptualize the possible harms as stemming from contextually inappropriate behavior. Our claim is that there are valuable lessons about governance to learn from the "content moderation" view of already-deployed chatbots, and that these lessons will transfer to other kinds of AI systems that take actions and have much greater capacity to produce both direct harms and stability-of-society-type harms.

"Values are free parameters in rational behavior"

 

Right, so this is the part we reject. In the long theory of appropriateness paper, instead of having a model with exogenous preferences (that's the term we use for the assumption that values are free parameters in rational behavior), we say that it's better to have a theory where preferences are endogenous so they can change as a function of the social mechanisms being modeled.

So, in our theory, personal values are caused by social conventions, norms, and institutions.

Combine this with the contrast between thick and thin morality, which we did mention in the post, and you get the conclusion that it's very difficult for individuals to tell which of their personal values are part of the 'thin' morality that applies cross-culturally and which are just part of their own culture's 'thick' morality. Another way of saying this: we're surrounded by moral rules and morally-laden preferences, and it's very difficult to tell from inside any given culture which of those rules are important and which are silly. From the inside perspective they look exactly the same. Transgressions are punished in the same way, etc.

Since we, the AI safety community, are ourselves inside a particular culture, when we talk about CEV as being "about acting where human values are coherent", we still mean that with an implicit "as measured and understood by us, here and now". But from the perspective of someone in a different culture, that makes it indistinguishable from "imposing coherence out of nowhere".

You could reply that I'm talking about practical operationalizations of CEV, not the abstract concept of CEV itself. And OK, sure, fair enough. But the abstract concept doesn't do anything on its own. You always have to operationalize it in some practical way. And all practical operationalizations will have this problem.
