Note: I'm cross-posting this from EA Forum (where I posted it on Sept 8, 2022), in case anybody on LessWrong or the AI Alignment Forum is interested in commenting; note that there were some very helpful suggested readings in the replies to this: https://forum.effectivealtruism.org/posts/DXuwsXsqGq5GtmsB3/ai-alignment-with-humans-but-with-which-humans
Updated tldr: If humans aren't aligned with each other (and we aren't, at any level of social organization above the individual), then it'll be very hard for any AI systems to be aligned with 'humans in general'.
Caveat: This post probably raises a naive question; I assume there's at least a 70% chance it's been considered (if not answered) exhaustively elsewhere already; please provide links if so. I've studied evolutionary psych & human nature for 30 years, but am a relative newbie to AI safety research. Anyway....
When AI alignment researchers talk about 'alignment', they often seem to have a mental model where either (1) there's a single relevant human user whose latent preferences the AI system should become aligned with (e.g. a self-driving car with a single passenger); or (2) there are all 7.8 billion humans that the AI system should be aligned with, so it doesn't impose global catastrophic risks. In those relatively simple cases, I could imagine various current alignment strategies, such as cooperative inverse reinforcement learning (CIRL), being useful, or at least a vector in a useful direction.
However, there are large numbers of intermediate-level cases where an AI system that serves multiple humans would need to become aligned with diverse groups of users or subsets of humanity. And within each such group, the humans will have partly-overlapping but partly-conflicting interests.
Example 1: a smart home/domestic robot AI might be serving a family consisting of a mom, a dad, an impulsive teenage kid, a curious toddler, and an elder grandparent with Alzheimer's. Among these five humans, whose preferences should the AI try to align with? It can't please all of them all the time. They may have genuinely diverging interests and incommensurate preferences. So it may find itself in much the same position as a traditional human domestic servant (maid, nanny, butler) trying to navigate the household's minefield of conflicting interests, hidden agendas, family dramas, seething resentments, etc. Such challenges, of course, provide much of the entertainment value and psychological complexity of TV series such as 'Downton Abbey', or the P.G. Wodehouse 'Jeeves' novels.
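The household conflict can be made concrete with a toy model (all members, actions, and utility numbers here are invented for illustration): score a few candidate actions against each family member's preferences, and notice that no single action is best for everyone, and that any aggregation rule the AI uses is itself a value judgment about whose preferences count.

```python
# Toy sketch (invented utilities): a smart-home AI scoring candidate actions
# for a five-person household. No action maximizes every member's utility,
# so "align with the household" is underspecified without an aggregation rule.

ACTIONS = ["play loud music", "enforce quiet hours", "lock the snack cabinet"]

# Hypothetical per-member utilities for each action.
UTILITIES = {
    "mom":         {"play loud music": -1, "enforce quiet hours":  2, "lock the snack cabinet":  1},
    "dad":         {"play loud music":  0, "enforce quiet hours":  1, "lock the snack cabinet":  1},
    "teenager":    {"play loud music":  3, "enforce quiet hours": -2, "lock the snack cabinet": -3},
    "toddler":     {"play loud music":  1, "enforce quiet hours": -1, "lock the snack cabinet": -2},
    "grandparent": {"play loud music": -3, "enforce quiet hours":  3, "lock the snack cabinet":  0},
}

def favorite(member: str) -> str:
    """The action this member would most prefer."""
    return max(ACTIONS, key=lambda a: UTILITIES[member][a])

# Members disagree about the best action.
favorites = {m: favorite(m) for m in UTILITIES}
print(favorites)

# A utilitarian sum picks one winner -- but it simply overrides the teenager
# and the toddler. Choosing this rule (rather than, say, a veto or a weighted
# sum) is itself an alignment decision nobody in the house voted on.
total = {a: sum(u[a] for u in UTILITIES.values()) for a in ACTIONS}
best = max(ACTIONS, key=total.get)
print(best)  # -> "enforce quiet hours"
```

The point of the sketch is not the numbers but the structure: any multi-user AI must embed some aggregation rule, and the choice of rule is exactly the 'which humans?' question in disguise.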
Example 2: a tactical advice AI might be serving a US military platoon deployed near hostile forces, doing information-aggregation and battlefield-simulation services. The platoon includes a lieutenant commanding 3-4 squads, each with a sergeant commanding 6-10 soldiers. The battlefield also includes a few hundred enemy soldiers, and a few thousand civilians. Which humans should this AI be aligned with? The Pentagon procurement office might have intended for the AI to maximize the likelihood of 'victory' while minimizing 'avoidable casualties'. But the Pentagon isn't there to do the cooperative inverse reinforcement learning (or whatever preference-alignment tech the AI uses) with the platoon. The battlefield AI may be doing its CIRL in interaction with the commanding lieutenant and their sergeants -- who may be somewhat aligned with each other in their interests (achieve victory, avoid death), but who may be quite mis-aligned with each other in their specific military career agendas, family situations, and risk preferences. The ordinary soldiers have their own agendas. And they are all constrained, in principle, by various rules of engagement and international treaties regarding enemy combatants and civilians -- whose interests may or may not be represented in the AI's alignment strategy.
Examples 3 through N could include AIs serving various roles in traffic management, corporate public relations, political speech-writing, forensic tax accounting, factory farm inspections, crypto exchanges, news aggregation, or any other situation where groups of humans affected by the AI's behavior have highly divergent interests and constituencies.
The behavioral and social sciences focus on these ubiquitous conflicts of interest and diverse preferences and agendas that characterize human life. This is the central stuff of political science, economics, sociology, psychology, anthropology, and media/propaganda studies. I think that to most behavioral scientists, the idea that an AI system could become aligned simultaneously with multiple diverse users, in complex nested hierarchies of power, status, wealth, and influence, would seem highly dubious.
Likewise, in evolutionary biology, and its allied disciplines such as evolutionary psychology, evolutionary anthropology, Darwinian medicine, etc., we use 'mid-level theories' such as kin selection theory, sexual selection theory, multi-level selection theory, etc., to describe the partly-overlapping, partly-divergent interests of different genes, individuals, groups, and species. The idea that AI could become aligned with 'humans in general' would seem impossible, given these conflicts of interest.
In both the behavioral sciences and the evolutionary sciences, the best insights into animal and human behavior, motivations, preferences, and values often involve some game-theoretic modeling of conflicting interests. And ever since von Neumann and Morgenstern (1944), it's been clear that when strategic games include lots of agents with different agendas, payoffs, risk profiles, and choice sets, and they can self-assemble into different groups, factions, tribes, and parties with shifting allegiances, the game-theoretic modeling gets very complicated very quickly. Probably too complicated for a CIRL system, however cleverly constructed, to handle.
So, I'm left wondering what AI safety researchers are really talking about when they talk about 'alignment'. Alignment with whoever bought the AI? Whoever uses it most often? Whoever might be most positively or negatively affected by its behavior? Whoever the AI's company's legal team says would impose the highest litigation risk?
I don't have any answers to these questions, but I'd value your thoughts, and links to any previous work that addresses this issue.
When In Rome
Thank you for posting this Geoffrey. I myself have recently been considering posting the question, “Aligned with which values exactly?”
TL;DR - Could an AI be trained to deduce a default set and system of human values by reviewing all human constitutions, laws, policies and regulations in the manner of AlphaGo?
I come at this from a very different angle than you do. I am not an academic but rather am retired after a thirty year career in IT systems management at the national and provincial (Canada) levels.
Aside from my career my lifelong personal interest has been, well let’s call it “Human Nature”. So long before I had any interest in AI I was reading about anthropology, archeology, philosophy, psychology, history and so on but during the last decade mostly focused on human values. Schwartz and all that. In a very unacademic way, I came to the conclusion that human values seem to explain everything with regards to what individual people feel, think, say and do and the same goes for groups.
Now that I’m retired I write hard science fiction novellas and short stories about social robots. I don’t write hoping for publication but rather to explore issues of human nature both social (e.g. justice) and personal (e.g. purpose). Writing about how and why social robots might function, and with the theory of convergent evolution in mind, I came to the conclusion that social robots would have to have an operating system based on values.
From my reading up to this point I had gained the impression that the study of human values was largely considered a pseudoscience (my apologies if you feel otherwise). Given my view of the foundational importance of values, I found this attitude and the accompanying lack of hard scientific research into values frustrating.
However as I did the research into artificial intelligence that was necessary to write my stories I realized that my sense of the importance of values was about to be vindicated. The opening paragraph of one of my chapters is as follows…
During the great expansionist period of the Republic, it was not the fashion to pursue an interest in philosophy. There was much practical work to be done. Science, administration, law and engineering were well regarded careers. The questions of philosophy popular with young people were understandable and tolerated but were expected to be put aside upon entering adulthood.
All that changed with the advent of artificial intelligence.
As I continued to explore the issues of an AI values based operating system the enormity of the problem became clear and is expressed as follows in another chapter…
Until the advent of artificial intelligence the study of human values had not been taken seriously. Values had been spoken of for millennia however scientifically no one actually knew what they were, whether they had any physical basis or how they worked as a system. Yet it seemed that humans based most if not all of their decisions on values and a great deal of the brain’s development between the ages of five and twenty five had to do with values. When AI researchers began to investigate the process by which humans made decisions based on values they found some values seemed to be genetically based but they could not determine in what way, some were learned yet could be inherited and the entire genetic, epigenetic and extra-genetic system of values interacted in a manner that was a complete mystery.
They slowly realized they faced one of the greatest challenges in scientific history.
I’ve come to the conclusion that values are too complex a system to be understood by our current sciences. I believe in this regard that we are about where the ancient Greeks were regarding the structure of matter or where genetics was around the time of Gregor Mendel.
Expert systems, and even our most advanced mathematics, are not going to be enough, nor even suitable approaches to solving the problem. Something new will be required. I reviewed Stuart Russell's approach, which I interpret as "learning by example", and felt it glossed over some significant issues; for example, children learn many things from their parents, not all of them good.
So in answer to your question, “AI alignment with humans... but with which humans?” might I suggest another approach? Could an AI be trained to deduce a default set and system of human values by reviewing all human constitutions, laws, policies and regulations in the manner of AlphaGo? In every culture and region, constitutions, law, policies and regulations represent our best attempts to formalize and institutionalize human values based on our ideas of ethics and justice.
I do appreciate the issue of values conflict that you raise. The Nazis passed some laws. But that’s where the AI and the system it develops comes in. Perhaps we don’t currently have an AI that is up to the task but it appears we are getting there.
This approach, it seems, would solve three problems: 1) the problem of "which humans" (because it includes source material from all cultures etc.), 2) the problem of "which values" for the same reason, and 3) your examples of the contextual problem of "which values apply in which situations", with the approach of "When in Rome, do as the Romans do".
Netcentrica - thanks for this thoughtful comment.
I agree that the behavioral sciences, social sciences, and humanities need more serious (quantitative) research on values; there is some in fields such as political psychology, social psychology, cultural anthropology, comparative religion, etc -- but often such research is a bit pseudo-scientific and judgmental, biased by the personal/political views of the researchers.
However, all these fields seem to agree that there are often much deeper and more pervasive differences in values across people ...