I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize. E.g. the fact that different humans have relatively similar levels of power to each other seems important; we aren't very aligned to agents much less powerful than us like animals, and I wouldn't expect a human who had been given all the power in the world all their life such that they've learned they can solve any conflict by destroying their opposition to be very aligned.
I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn't generalize.
I disagree both with this conclusion and the process that most people use to reach it.
The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, you are not in an epistemic position to draw strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.
E.g., there are no birds in the world able to lift even a single ton of weight. Despite this fact, the aerodynamic principles underlying bird flight still ended up allowing for vastly more capable flying machines. Until you understand exactly why (some) humans end up caring about each other and why (some) humans end up caring about animals, you can't say whether a similar process can be adapted to make AIs care about humans.
The conclusion: Humans vary wildly in their degrees of alignment to each other and to less powerful agents. People often take this as a bad thing, that humans aren't "good enough" for us to draw useful insights from. I disa...
I don't think I've ever seen a truly mechanistic, play-by-play and robust explanation of how anything works in human psychology. At least not by how I would label things, but maybe you are using the labels differently; can you give an example?
The principles from the post can still be applied. Some humans do end up aligned to animals - particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.
Also, remember that the problem is not to align an entire civilization of naturally evolved organisms to weaker entities. The problem is to align exactly one entirely artificial organism to weaker entities. This is much simpler, and as mentioned entirely possible by just figuring out how already existing people of that sort end up that way - but your use of "we" here seems to imply that you think the entirety of human civilization is the thing we ought to be using as inspiration for the AGI, which is not the case.
By the way: at least part of the explanation for why I personally am aligned to animals is that I have a strong tendency to be moved by the Care/Harm moral foundation - see this summary of The Righteous Mind for more details. It is unclear exactly how it is implemented in the brain, but it is suspected to be a generalization of the...
There must have been some reason(s) why organisms exhibiting empathy were selected for during our evolution. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.
The human learning process (somewhat) consistently converges to empathy. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to empathy, but it still built such a learning process.
It therefore seems very worthwhile to understand what part of the human learning process allows for empathy to emerge in humans. We may not be able to replicate the selection pressures that caused evolution to build an empathy-producing learning process, but it’s not clear we need to. We still have an example of such a learning process to study. The Wright brothers didn't need to re-evolve birds to create their flying machine.
The "Humans do X because evolution" argument does not actually explain anything about mechanisms. I keep seeing people make this argument, but it's a non sequitur to the points I'm making in this post. You're explaining how the behavior may have gotten there, not how the behavior is implemented. I think that "because selection pressure" is a curiosity-stopper, plain and simple.
AGI won't be subjected to the same evolutionary pressures, so every alignment strategy relying on empathy or social reward functions is, in my opinion, hopelessly naive.
This argument proves too much, since it implies that planes can't work because we didn't subject them to evolutionary pressures for flight. It's locally invalid.
I'm just pointing out that a lot of people have already thought about mechanisms and concluded that the mechanisms they could come up with would be unlikely to scale.
In my experience, researchers tend to stop at "But humans are hacky kludges" (what do they think they know, and why do they think they know it?). Speaking for myself, I viewed humans as complicated hacks which didn't offer substantial evidence about alignment questions. This "humans as alignment-magic" or "the selection pressure down the street did it" view seems quite common (but not universal).
AFAICT, most researchers do not appreciate the importance of asking questions with guaranteed answers.
AFAICT, most alignment-produced thinking about humans is about their superficial reliability (e.g. will they let an AI out of the box) or the range of situations in which their behavior will make sense (e.g. how hard is it to find adversarial examples which make a perfect imitation of a human). I think these questions are relatively unimportant to alignment.
I don't have anything especially insightful to contribute, but I wanted to thank you (TurnTrout and Quintin) for this post. I agree with it, and I often find myself thinking things like this when I read alignment posts by others on LW/AF.
When people present frameworks for thinking about AGIs or generic "intelligent agents," I often want to ask them: "are humans expressible in your framework?" Often it seems like the answer is "no."
And a common symptom of this is that the framework cannot express entities with human-level capabilities that are as well aligned with other such agents as humans are with one another. Deception, for example, is much less of a problem for humans in practice than it is claimed to be for AGIs in theory. Yes, we do engage in it sometimes, but we could do it a lot more than (most of us) do. Since this state of affairs is possible, and since it's desirable, it seems important to know how it can be achieved.
To add to this, I think that paying attention to your own thought processes can also be helpful when you're trying to formulate theories about how cognition in ML models works.
I like many aspects of this post.
Curated. I'm not sure I endorse all the specific examples, but the general principles make sense to me as considerations to help guide alignment research directions.
I strongly disagree with your notion of how privileging the hypothesis works. It's not absurd to think that techniques for making AIXI-tl value diamonds despite ontological shifts could be adapted for other architectures. I agree that there are other examples of people working on solving problems within a formalisation that seem rather formalisation-specific, but you seem to have cast the net too wide.
It is not that human values are particularly stable. It's that humans themselves are pretty limited. Within that context, we identify the stable parts of ourselves as "our human values".
If we lift that stability - if we allow humans arbitrary self-modification and intelligence increase - the parts of us that are stable will change, and will likely not include much of our current values. New entities, new attractors.
I haven't been studying alignment for that long, but pretty obsessively for the past 9 months. I've read about a lot of different approaches. If this way of looking at human value formation has been studied previously, I think it's at least been under-written about on these forums.
Your sequence is certainly giving me a new way of thinking about alignment. Thanks, looking forward to your next post.
(Transcribed in part from Eleuther discussion and DMs.)
My understanding of the argument here is that you're using the fact that you care about diamonds as evidence that whatever the brain is doing is worth studying, with the hope that it might help us with alignment. I agree with that part. However, I disagree with the part where you claim that things like CIRL and ontology identification aren't as worthy of being elevated to consideration. I think there exist lines of reasoning from which these fall out naturally as subproblems, and the fact that they fall out ...
TLDR+question:
I appreciate you for writing that article. Humans seem bad at choosing what to work on. Is there a sub-field in AI alignment where a group of researchers solely focus on finding the most relevant questions to work on, make a list, and others pick from that list?
Do my values bind to objects in reality, like dogs, or do they bind to my mental representations of those objects at the current timestep?
You might say: You value the dog's happiness over your mental representation of it, since if I gave you a button which made the dog sad, but made you believe the dog was happy, and another button which made the dog happy, but made you believe the dog was sad, you'd press the second button.
I say in response: You've shown that I value my current timestep estimation of the dog's future happiness over my current timestep est...
I find myself confused about what point this post is trying to make even after reading through it twice. Can you summarize your central point in 100 words or less?
If the title is meant to be a summary of the post, I think that would be analogous to someone saying "nuclear forces provide an untapped wealth of energy". It's true, but the reason the energy is untapped is because nobody has come up with a good way of tapping into it. A post which tried to address engineering problems around energy production by "we need to look closely at how to extract energy...
If the title is meant to be a summary of the post, I think that would be analogous to someone saying "nuclear forces provide an untapped wealth of energy". It's true, but the reason the energy is untapped is because nobody has come up with a good way of tapping into it.
The difference is that people have been trying hard to harness nuclear forces for energy, while people have not been trying hard to research humans for alignment in the same way. Even accounting for the alignment field being far smaller, there hasn't been a real effort as far as I can see. Most people immediately respond with "AGI is different from humans for X, Y, Z reasons" (which are true) and then proceed to throw out the baby with the bathwater by not looking into human value formation at all.
Planes don't fly like birds, but we sure as hell studied birds to make them.
If you come up with a strategy for how to do this then I'm much more interested, and that's a big reason why I'm asking for a summary since I think you might have tried to express something like this in the post that I'm missing.
This is their current research direction: the shard theory of human values, which they're currently making posts on.
There's apparently some controversy over what the Wright brothers were able to infer from studying birds. From Wikipedia:
...On the basis of observation, Wilbur concluded that birds changed the angle of the ends of their wings to make their bodies roll right or left.[34] The brothers decided this would also be a good way for a flying machine to turn – to "bank" or "lean" into the turn just like a bird – and just like a person riding a bicycle, an experience with which they were thoroughly familiar. Equally important, they hoped this method would enable recovery when the wind tilted the machine to one side (lateral balance). They puzzled over how to achieve the same effect with man-made wings and eventually discovered wing-warping when Wilbur idly twisted a long inner-tube box at the bicycle shop.[35]
Other aeronautical investigators regarded flight as if it were not so different from surface locomotion, except the surface would be elevated. They thought in terms of a ship's rudder for steering, while the flying machine remained essentially level in the air, as did a train or an automobile or a ship at the surface. The idea of deliberately leaning, or rolling, to one side seemed either u
How can I, a person who is better at introspection than basically anything else, help you with the shard theory project? I actually can explain in detail - at least, the kind of detail accessible to me, which doesn't include e.g. neuron firing patterns - how I developed some of my values, or I can at least use reliable methods to figure out good hypotheses on the matter.
I am skeptical of your premise. I know of zero humans who terminally value “diamonds” as defined by their chemical constitution.
Indeed, diamonds are widely considered to be a fake scarce good, elevated to their current position by deceptive marketing and monopolistic practices. So this seems more like a case study of how humans’ desires to own scarce symbols of wealth have been manipulated to lead to an outcome that is misaligned with the original objective.
My model of how human values form:
Precondition: The brain has already figured out how the body works and some rough world model, say at the level of a small child. It has concepts of space and actions that can meet its basic needs by, e.g., looking for food and getting and eating it. It has a concept of other agents but no concept for interacting with them yet.
The brain learns to predict that other agents (parents, siblings...) will act to (help) get its needs met by acting in certain ways, e.g., by smiling, crying, or what else works.
The brain learns to p...
After more discussion with bmk, I appended the following edit:
...In this post, I wrote about the Arbital article's unsupported jump from "Build an AI which cares about a simple object like diamonds" to "Let's think about ontology identification for AIXI-tl." The point is not that there is no valid reason to consider the latter, but that the jump, as written, seemed evidence-starved. For separate reasons, I currently think that ontology identification is unattractive in some ways, but this post isn't meant to argue against that framing in general. The main poi
Looks like some of the protein computers ended up with your values, even. Small universe, huh?
I've noticed that this "protein computers" framing makes it intuitively a lot easier to think about where humans are situated in the space of intelligent algorithms.
E.g., it's intuitively harder to think about an unaligned AGI manipulating its way past humans than it is to think about unaligned AGI optimizing the arrangement of protein computers in its vicinity. In the "humans" framing, killing all the humans is the central turning point in the takeover story. In ...
I would be the last person to dismiss the potential relevance that understanding value formation and management in the human brain might have for AI alignment research, but I think there are good reasons to assume that the solutions our evolution has resulted in would be complex and not sufficiently robust.
Humans are [Mesa-Optimizers](https://www.alignmentforum.org/tag/mesa-optimization) and the evidence is solid that as a consequence, our alignment with the implicit underlying utility function (reproductive fitness) is rather brittle (i.e. sex with contracepti...
I suspect that the underlying mechanism of how humans can be aligned isn’t something that’s particularly useful applied to AI. One explanation for human alignment is that our values are mostly just a rationalization layer on top of the delicate balance of short term reward heuristics that correlate well with long term genetic fitness.
Maybe the real secret sauce isn’t any specific mechanism but rather the fact that the reward heuristics are finely tuned by eons of evolution to counterbalance each other well enough to form a semi-stable local optimum wh...
I feel like this connects with what Max Tegmark was talking about in his recent 80k hours podcast interview. The idea that we need hierarchical alignment across groups of humans (companies, governments, sets of humans + the programs they've written plus their ML models?) as well as just within AI systems. I think if you carefully design an experiment which would generalize from humans to AGI systems, you could potentially learn some valuable lessons.
A couple of points. In certain, limited expressions of human international relations, we do know what the decision making process was to reach major decisions, such as the drafting and signing of the Geneva Conventions, some of humanity's best and most successful policy documents. This is because the archived UK and US decision making processes were very well documented and are now declassified.
In other words, we can actually deconstruct human decision making principles (probably conforming instrumentalism in the case of the Geneva Conventions) and t...
This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, and Libsyn.
TL;DR: To even consciously consider an alignment research direction, you should have evidence to locate it as a promising lead. As best I can tell, many directions seem interesting but do not have strong evidence of being “entangled” with the alignment problem such that I expect them to yield significant insights.
For example, “we can solve an easier version of the alignment problem by first figuring out how to build an AI which maximizes the number of real-world diamonds” has intuitive appeal and plausibility, but this claim doesn’t have to be true and this problem does not necessarily have a natural, compact solution. In contrast, there do in fact exist humans who care about diamonds. Therefore, there are guaranteed-to-exist alignment insights concerning the way people come to care about e.g. real-world diamonds.
“Consider how humans navigate the alignment subproblem you’re worried about” is a habit which I (TurnTrout) picked up from Quintin Pope. I wrote the post, he originated the tactic.
I find this problem interesting, both in terms of wanting to know how to solve a reframed version of it, and in terms of what I used to think about the problem. I used to[1] think, “yeah, ‘diamond’ is relatively easy to define. Nice problem relaxation.” It felt like the diamond maximizer problem let us focus on the challenge of making the AI’s values bind to something at all which we actually intended (e.g. diamonds), in a way that’s robust to ontological shifts and that doesn’t collapse into wireheading or tampering with e.g. the sensors used to estimate the number of diamonds.
Although the details are mostly irrelevant to the point of this blog post, the Arbital article suggests some solution ideas and directions for future research, including:
Do you notice anything strange about these three ideas? Sure, the ideas don’t seem workable, but they’re good initial thoughts, right?
The problem isn’t that the ideas aren’t clever enough. Eliezer is pretty dang clever, and these ideas are reasonable stabs given the premise of “get some AIXI variant to maximize diamond instead of reward.”
The problem isn’t that it’s impossible to specify a mind which cares about diamonds. We already know that there are intelligent minds who value diamonds. You might be dating one of them, or you might even be one of them! Clearly, the genome + environment jointly specify certain human beings who end up caring about diamonds.
One problem is this: where is the evidence required to locate these ideas? Why should I even find myself thinking about diamond maximization and AIXI and Turing machines and utility functions in this situation? It’s not that there’s no evidence. For example, utility functions ensure the agent can’t be exploited in some dumb ways. But I think that the supporting evidence is not commensurate with the specificity of these three ideas or with the specificity of the “ontology identification” problem framing.
Here’s an exaggeration of how these ideas feel to me when I read them:
I recently made a similar point about Cooperative Inverse Reinforcement Learning:
Now, if you are confused about a problem, it can be better to explore some guesses than no guesses—perhaps it’s better to think about Turing machines than to stare helplessly at the wall (but perhaps not). Your best guess may be wrong (e.g. write a utility function which scans Turing machines for atomic representations of diamonds), but you sometimes still learn something by spelling out the implications of your best guess (e.g. the ontology identifier stops working when AIXI Bayes-updates to non-atomic physical theories). This can be productive, as long as you keep in mind the wrongness of the concrete guess, so as to not become anchored on that guess or on the framing which originated it (e.g. build a diamond maximizer).
However, in this situation, I want to look elsewhere. When I confront a confusing, difficult problem (e.g. how do you create a mind which cares about diamonds?), I often first look at reality (e.g. are there any existing minds which care about diamonds?). Even if I have no idea how to solve the problem, if I can find an existing mind which cares about diamonds, then since that mind is real, that mind has a guaranteed-to-exist causal mechanistic play-by-play origin story for why it cares about diamonds. I thereby anchor my thinking to reality; reality is sturdier than “what if” and “maybe this will work”; many human minds do care about diamonds.
In addition to “there’s a guaranteed causal story for humans valuing diamonds, and not one for AIXI valuing diamonds”, there’s a second benefit to understanding how human values bind to the human’s beliefs about real-world diamonds. This second benefit is practical: I’m pretty sure the way that humans come to care about diamonds has nearly nothing to do with the ways AIXI-tl might be motivated to maximize diamonds. This matters, because I expect that the first AGI’s value formation will be far more mechanistically similar to within-lifetime human value formation, than to AIXI-tl’s value alignment dynamics.
Next, it can be true that the existing minds are too hard for us to understand in ways relevant to alignment. One way this could be true is that human values are a "mess", that "our brains are kludges slapped together by natural selection." If human value formation were sufficiently complex, with sufficiently many load-bearing parts such that each part drastically affects human alignment properties, then we might instead want to design simpler human-comprehensible agents and study their alignment properties.
While I think that human values are complex, I think the evidence for human value formation’s essential complexity is surprisingly weak, all things reconsidered in light of modern, post-deep learning understanding. Still... maybe humans are too hard to understand in alignment-relevant ways!
But, I mean, come on. Imagine an alien[2] visited and told you:
Ignoring the weird implications of the aliens existing and talking to you like this, and considering only the alignment implications—The absolute top priority of many alignment researchers should be figuring out how the hell the aliens got as far as they did.[3] Whether or not you know if their approach scales to further intelligence levels, whether or not their approach seems easy to understand, you have learned that these computers are physically possible, practically trainable entities. These computers have definite existence and guaranteed explanations. Next to these actually existent computers, speculation like “maybe attainable utility preservation leads to cautious behavior in AGIs” is dreamlike, unfounded, and untethered.
If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature universality hypothesis for neural networks, about natural abstractions.
Because, given my life’s present ambition (solve AI alignment), that’s what it makes sense for me to do—at each major new insight, to reconsider my models[4] of the single known empirical example of general intelligences with values, to scour the Earth for every possible scrap of evidence that humans provide about alignment. We may not get much time with human-level AI before we get to superhuman AI. But we get plenty of time with human-level humans, and we get plenty of time being a human-level intelligence.
The way I presently see it, the godshatter of human values—the rainbow of desires, from friendship to food—is only unpredictable relative to a class of hypotheses which fail to predict the shattering.[5] But confusion is in the map, not the territory. I do not consider human values to be “unpredictable” or “weird”, I do not view them as a “hack” or a “kludge.” Human value formation may or may not be messy (although I presently think not). Either way, human values are, of course, part of our lawful reality. Human values are reliably produced by within-lifetime processes within the brain. This has an explanation, though I may be ignorant of it. Humans usually bind their values to certain objects in reality, like dogs. This, too, has an explanation.
And, to be clear, I don’t want to black-box outside-view extrapolate from the “human datapoint”; I don’t want to focus on thoughts like “Since alignment ‘works well’ for dogs and people, maybe it will work well for slightly superhuman entities.” I aspire for the kind of alignment mastery which lets me build a diamond-producing AI, or if that didn’t suit my fancy, I’d turn around and tweak the process and the AI would press green buttons forever instead, or—if I were playing for real—I’d align that system of mere circuitry with humane purposes.
For that ambition, the inner workings of those generally intelligent apes are invaluable evidence about the mechanistic within-lifetime process by which those apes form their values, and, more generally, about how intelligent minds can form values at all. What factors matter for the learned values, what factors don’t, and what we should do for AI. Maybe humans have special inductive biases or architectural features, and without those, they’d grow totally different kinds of values. But if that were true, wouldn’t that be important to know?
If I knew how to interpret the available evidence, I probably would understand how I came to weakly care about diamonds, and what factors were important to that process (which reward circuitry had to fire at which frequencies, what concepts I had to have learned in order to grow a value around “diamonds”, how precisely activated the reward circuitry had to be in order for me to end up caring about diamonds).
Humans provide huge amounts of evidence, properly interpreted—and therein lies the grand challenge upon which I am presently fixated. In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans. (EDIT 1/1/23: See this shortform comment.)
Thanks to Logan Smith and Charles Foster for feedback. Spiritually related to but technically distinct from The First Sample Gives the Most Information.
EDIT: In this post, I wrote about the Arbital article's unsupported jump from "Build an AI which cares about a simple object like diamonds" to "Let's think about ontology identification for AIXI-tl." The point is not that there is no valid reason to consider the latter, but that the jump, as written, seemed evidence-starved. For separate reasons, I currently think that ontology identification is unattractive in some ways, but this post isn't meant to argue against that framing in general. The main point of the post is that humans provide tons of evidence about alignment, by virtue of containing guaranteed-to-exist mechanisms which produce e.g. their values around diamonds.
Appendix: One time I didn’t look for the human mechanism
Back in 2018, I had a clever-seeming idea. We don’t know how to build an aligned AI; we want multiple tries; it would be great if we could build an AI which “knows it may have been incorrectly designed”; so why not have the AI simulate its probable design environment over many misspecifications, and then not do plans which tend to be horrible for most initial conditions. While I drew some inspiration from how I would want to reason in the AI’s place, I ultimately did not think thoughts like:
Instead, I was trying out clever, off-the-cuff ideas in order to solve e.g. Eliezer’s formulation of the hard problem of corrigibility. However, my idea and his formulation suffered a few disadvantages, including:
I wrote this post as someone who previously needed to read it.
[1] I now think that diamond’s physically crisp definition is a red herring. More on that in future posts.
[2] This alien is written to communicate my current belief state about how human value formation works, so as to make it clear why, given my beliefs, this value formation process is so obviously important to understand.
[3] There is an additional implication present in the alien story, but not present in the evolutionary production of humans. The aliens are implied to have purposefully aligned some of their protein computers with human values, while evolution is not similarly “purposeful.” This implication is noncentral to the key point, which is that the human-values-having protein computers exist in reality.
[4] Well, I didn’t even have a detailed picture of human value formation back in 2021. I thought humans were hopelessly dumb and messy and we want a nice clean AI which actually is robustly aligned.
[5] Suppose we model humans as the "inner agent" and evolution as the "outer optimizer" (I think this is, in general, the wrong framing, but let's roll with it for now). I would guess that Eliezer believes that human values are an unpredictable godshatter with respect to the outer criterion of inclusive genetic fitness. This means that if you reroll evolution many times with perturbed initial conditions, you get inner agents with dramatically different values each time; it means that human values are akin to a raindrop which happened to land in some location for no grand reason. I notice that I have medium-strength objections to this claim, but let's just say that he is correct for now.
I think this unpredictability-to-evolution doesn't matter. We aren't going to reroll evolution to get AGI. Thus, for a variety of reasons too expansive for this margin, I am little moved by analogy-based reasoning along the lines of "here's the one time inner alignment was tried in reality, and evolution failed horribly." I think that historical fact is mostly irrelevant, for reasons I will discuss later.