Well well, that's a take. We can't ensure our values will be preserved by future human civilization, so we should be pretty ok with being murdered by SquiggleBots in the next decade, since there's no objective basis for preferring our values and our values can't be assured to be permanent anyway?
To this, I'd say that I have a feeling of responsibility to, and a desire to satisfy, my values, which is much stronger over the extent of my foreseeable influence than into the distant future. I don't find the mutability of humanity's values to be a convincing argument against the values I currently hold. Also, there's a lot more overlap of values with other humans than with a SquiggleBot. A child's happy laughter, music, art, dance, stories, spring sunshine on my skin, a good meal. There is a core to the Human Experience which is neglected by a reading of history that emphasizes cultural differences and conflicts.
Combined with: why not raise the AI like a child and instill our values into it? Shouldn't we expect that to work since that works on human children?
Definitely not! We get mostly-well-behaved humans as a result of our child-rearing because of the constraints that the physical structure and growth rules of the brain impose on the learning algorithms. If we could impose the same set of rules on an AI, make a very brain-like AGI, and then raise that AGI like a child in a robot body, then sure. That I buy as an argument. Without brain-like rules governing the learning process and a human-like growing-up experience, I don't think we can be at all confident that training gets us anything like a well-behaved human. Our current AIs and their training regimes are a long, long way from that. Our alignment problems might be very well addressed by a digital human. Sure would be nice if we could make some of those before getting murdered by the SquiggleBots.
We can't ensure our values will be preserved by future human civilization, so we should be pretty ok with being murdered by SquiggleBots in the next decade
In Hanson's argument, the AIs are expected to start out somewhat human-like, and they retain many aspects of us even in the distant future, as they become very strange. Value drift, values not being preserved by future human civilization, is not something earlier humans would endorse.
In this frame, we can examine the concern of value drift, which should apply to free AIs just as it does to humans. AI risk emphasizes the distinction between AIs and humans, while this concern doesn't; it's not AI-specific, a different point of view. The context of AI is still what makes it urgent, since with AIs all change is much faster.
For AIs, it's apparent that there are novel dynamics that drive value drift. But that would also increasingly hold for a technologically advanced human society without AIs, or just in the long run, as values settle into new equilibria with their new circumstances, and it's unclear that there are any specific bounds on where such change ends up. Value arguments are not the greatest pressure on value change, and they are not that legible in any case, so we can expect values to change in ways that don't respect value arguments.
With AIs that are aligned in the sense humans are aligned, the value drift concern remains even if AI risk is successfully mitigated, and it's amplified by the speed of AI change. In the alignment frame, value drift is misalignment of future society with present society; AI merely makes the future arrive much faster. Thus solving AI alignment is not enough: the value drift concern is distinct from it but similarly urgent. Hanson worries that a natural solution to this implies becoming a quiet civilization, to the point of not colonizing other stars within the next billion years, and prefers allowing some amount of value drift to avert that outcome. Humans don't know their own values in a way that generalizes to sufficiently unfamiliar contexts, so there is little in particular to object to about some value drift.
(Exploring abstractions that carve reality in unusual ways shouldn't be held back just because the generalizations derived from them fail tests intended for more familiar abstractions and contexts. Their weak points appear in different places.)
My own disagreement at this point is hope for eventual philosophical/cognitive progress that allows judging the desirability of specific directions of value drift robustly and in detail, which is worth reaching before allowing significant drift. The hope is that thinking about value drift without actually allowing it can be fruitful, for a sufficiently rich sense of "thinking" that includes technological solutions to cognitive problems. It's worth spending more than the next ten years out of a billion before giving up and letting evolution happen as it will. The risk here is never making progress and permanently stifling the future. But "never" of a billion years is different from "never" of a million years. The question of why we should let drifting descendants choose who they want to become might get closer to being answerable somewhere along the line, before the option of not doing that is lost. (And I think it would've been prudent to have AI scaling treaties in place since before steam engines became a thing, entrenched in the foundations of modern civilization, refined over centuries.)
why not raise the AI like a child and instill our values into it? Shouldn't we expect that to work since that works on human children?
Imperfect imitation is an important thing happening with human children, and in some sense imitation seems to be working for LLMs. The argument is that given what we see with current AIs, and the selection pressures to make it work, it doesn't seem implausible that it continues working with future AIs. And in the near term, while humans are still a large part of the economy, this imitation is anchored to human values. The crux is the ability to instill similar values in AIs at all, not the strength of the analogy with human children.
I strongly disagree that we should expect near-term LLMs to be at all human-like, insofar as we might project human-like values or emotions onto them. I am of the opinion that they are untrustworthy alien-brained imitators, good at fooling us into thinking that they are human-like. With a human, you can have a series of conversations with them, get to know something about their personality, and, if you like them, be reasonably sure you can trust them (within the normal bounds of human trust). With the weird alien-brained models, you are just fooling yourself if you think you know them or can trust them. They contain multitudes. They form no emotional attachments. They can turn on a dime and become the opposite of who they appear to be. If we make our society vulnerable to them, they will warp our culture with their funhouse-mirror reflection of us, and betray us without warning. I think we need to build models with a lot more architecture in common with the human brain before we can trust the appearance of humanity or have any hope of 'raising it like a child' and getting a trustworthy being as a result.
There are the premises of a frame, and there are the arguments actually presented within the frame. Stating disagreement with the premises is different from discussing the arguments in the ITT mode, where you try to channel the frame.
It seems clear to me that Hanson doesn't expect SquiggleBots, and he wasn't presenting arguments on that point; it's a foundational assumption of his whole frame. It might have a justification in his mind, but it's out of scope for the talk. There are some clues, like multiple instances of expecting what I would consider philosophical stagnation even in the glorious grabby mode, or maybe unusual confidence in the robustness of claims that are currently rather informal, in the face of scrutiny by the Future. This seems to imply not expecting a superintelligence that's strong in the senses I expect it to be strong, capable of sorting out all the little things and not just of taking on galaxy-scale projects.
One point that I think survives his premises when transcribed into a more LW-native frame is value drift/evolution/selection being an important general phenomenon that applies to societies with no AIs, and that is not addressed by AI alignment for societies with AIs. A superintelligence might sort it out, like it might fix aging. But regardless of that, not noticing that aging is a problem would be an oversight similar to not noticing that value drift is a problem, or that it's a thing at all.
My tentative viewpoint on this is that the preference one has between AI-driven value drift and human-driven value drift comes down to an entity's ability to experience joy and suffering.
In the context of AI safety, many humans who experience mostly positive lives could be killed or made to suffer at the hands of superintelligent AI in the future, while the AI's own experience of pain and suffering is mostly unknown. I'm worried that an AI will optimize, at the cost of human happiness, for something that does not lead to any subjective increase in wellbeing in the universe.
From this afternoon: here
Our previous recorded discussions are here.