Additional source on the subject I recommend: https://m.youtube.com/watch?embeds_referring_euri=https%3A%2F%2Fwww.lesswrong.com%2F&source_ve_path=MTY0OTksMjg2NjQsMTY0NTAz&v=V7AyriUcXZQ
Thanks! I read the paper and used it as material for a draft article on evidence for NAH. But I haven't seen this video before.
AISafety.info writes AI safety intro content. We'd appreciate any feedback.
Introduction
The natural abstraction hypothesis (NAH) claims:
If the NAH is true, AI alignment could be dramatically simplified, as it implies that any cognition a very powerful AI uses will be in terms of concepts that humans can understand.[1]
Explanation of the natural abstraction hypothesis
Let's unpack that definition. First, what do we mean by “our physical world abstracts well”? Just that for most things in the world, the information that describes how the thing interacts with other stuff “far away” from it is much lower-dimensional (i.e., described by fewer numbers) than the thing itself. “Far away” can refer to many kinds of separation, including physical[2], conceptual, or causal separation.
For example, a wheel can be understood without considering the position and velocity of every atom in it. We only need to know a few large-scale properties like its shape, how it rotates, etc. to know how a wheel interacts with other parts of the world. This is a handful of numbers compared to an atomically precise description, which would require over 10^26 numbers! In this sense, the wheel is an abstraction of the atoms that compose it. Or consider a rock: you don't need to keep track of its chemical composition if you’re chucking it at someone. You just need to know how hard and heavy it is.
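To make “a handful of numbers” concrete, here is a minimal sketch under assumptions of our own choosing: a hypothetical object made of 2,000 unit-mass particles, Newtonian gravity with G = 1, and an observer 100 units away. None of these specifics come from the NAH literature; the point is only that a 4-number summary (total mass and center of mass) predicts the object's far-away influence almost as well as the full 8,000-number description.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical "object": 2,000 unit-mass particles scattered in a cube of side 2.
# Each particle is (mass, x, y, z), so the full description is 8,000 numbers.
particles = [
    (1.0, random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(-1, 1))
    for _ in range(2000)
]

def pull_at(point, bodies):
    """Newtonian pull (with G = 1) that `bodies` exert on a unit mass at `point`."""
    gx = gy = gz = 0.0
    for mass, x, y, z in bodies:
        dx, dy, dz = x - point[0], y - point[1], z - point[2]
        dist_cubed = (dx * dx + dy * dy + dz * dz) ** 1.5
        gx += mass * dx / dist_cubed
        gy += mass * dy / dist_cubed
        gz += mass * dz / dist_cubed
    return (gx, gy, gz)

def summarize(bodies):
    """Compress the whole object into 4 numbers: total mass and center of mass."""
    total = sum(b[0] for b in bodies)
    cx = sum(b[0] * b[1] for b in bodies) / total
    cy = sum(b[0] * b[2] for b in bodies) / total
    cz = sum(b[0] * b[3] for b in bodies) / total
    return (total, cx, cy, cz)

far_point = (100.0, 0.0, 0.0)              # far away relative to the object's size
exact = pull_at(far_point, particles)      # uses every particle
summary = summarize(particles)
approx = pull_at(far_point, [summary])     # treats the object as one point mass

rel_error = math.dist(exact, approx) / math.hypot(*exact)
print(f"numbers in full description: {4 * len(particles)}")
print(f"numbers in summary:          {len(summary)}")
print(f"relative error far away:     {rel_error:.2e}")
```

The summary throws away almost everything about the object, yet the discarded detail barely matters at a distance. Close to the object the approximation degrades, which is why the definition is phrased in terms of “far away”.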
The NAH claims that different minds will converge to the same set of abstractions because they are the most efficient representations of all relevant info that reaches the mind from “far away”. And many parts of the world that are far from a mind will influence things the mind cares about, so a mind will be incentivized to learn these abstractions. So, for instance, if a mind mostly cares about building great cars, then things like “Hertzian Zones” may affect its ability to build great cars despite being conceptually far from car design. So that mind would plausibly have to learn what high-pressure phase transitions are.
Moreover, NAH claims that the abstractions that humans usually use are approximately natural abstractions. That is, any mind that looks at and uses car wheels successfully will have learned what a circle is in approximately the same way as a human. Or if some aliens about the size of humans, born on a planet similar to our own, were to come up with a theory of motion, they’d land on Newtonian physics. Or perhaps General Relativity if they were more sophisticated.
Note how strong a claim NAH is! It applies to aliens, to superintelligences, and even to alien superintelligences! But before we investigate whether it is true, why does NAH matter for alignment?
Why the natural abstraction hypothesis is important for alignment
Alignment is probably easier if NAH is true than if it isn't. If superintelligences will reliably use approximately the same concepts humans use, then there's no fundamental barrier to doing mechanistic interpretability on superintelligences, and maybe even editing their goals to be human-compatible.
If we are lucky, human values, or other alignment targets like “niceness” or corrigibility or property rights, are themselves natural abstractions. If these abstractions are represented in a simple way in most advanced AI systems, then alignment, or control, is simply a matter of locating these abstractions within the AI's mind and forming a goal from them like “be corrigible to your creator”. A crude but remarkably effective technique in this vein is activation steering[3]. If these values are natural abstractions, then even if they are not represented anywhere in the AI’s cognition, they could still be taught to the AI using little data.
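As a rough illustration of “locating an abstraction and then using it”, here is a toy sketch of one common form of activation steering: take the difference of mean activations with and without a concept, then add that vector to a hidden state. Everything here is an assumption made for illustration. The 8-dimensional “hidden states”, the `toy_activation` generator, and the `concept_score` readout are hypothetical stand-ins for a real network, and the sketch assumes the concept is represented along a single linear direction. Golden Gate Claude, as we understand it, worked by clamping a learned feature rather than by the difference-of-means recipe shown here.

```python
import random

random.seed(0)
DIM = 8  # size of the toy hidden state (hypothetical)

# Assumption: the concept lives along one hidden direction, unknown to the "researcher".
concept_direction = [1.0 if i == 2 else 0.0 for i in range(DIM)]

def toy_activation(has_concept):
    """Hypothetical hidden state: noise, plus the concept direction if present."""
    noise = [random.gauss(0, 0.1) for _ in range(DIM)]
    shift = 1.0 if has_concept else 0.0
    return [n + shift * c for n, c in zip(noise, concept_direction)]

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Step 1: locate the concept as a difference of mean activations.
with_concept = [toy_activation(True) for _ in range(200)]
without_concept = [toy_activation(False) for _ in range(200)]
steering_vector = [
    a - b for a, b in zip(mean_vector(with_concept), mean_vector(without_concept))
]

def steer(hidden, strength):
    """Step 2: nudge a hidden state along the located direction."""
    return [h + strength * v for h, v in zip(hidden, steering_vector)]

def concept_score(hidden):
    """Stand-in for downstream behavior: projection onto the true direction."""
    return sum(h * c for h, c in zip(hidden, concept_direction))

neutral = toy_activation(False)
steered = steer(neutral, strength=3.0)
print(f"score before steering: {concept_score(neutral):.2f}")
print(f"score after steering:  {concept_score(steered):.2f}")
```

The recipe only works here because the toy concept has a simple, linear representation. That is the kind of structure the paragraph above hopes real alignment targets will have.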
Different alignment targets look more or less plausible as natural abstractions. Any specific conception of value a human has — e.g., natural law deontology or ancient Athenian virtue ethics — is unlikely to be a natural abstraction[4]. But there are some parts of human values, and inputs to them, that are plausibly natural abstractions. If an AI used those abstractions, that would make it easier for a training process to instill values that depend on them into the AI.
Is NAH true?
We don't know. The truth of the NAH is ultimately an empirical question, and we have few distinct kinds of minds we can converse with, or manually inspect, to see if their abstractions are natural. For the few tests we can do on different minds — i.e., humans, some animals, and AI — the data are consistent with NAH.
Humans can quickly share abstractions,[5] and use roughly the same ones in the same environment. Our abstractions continue to work even in drastically different environments from where we acquired them. For example, F=ma still works on the moon. As far as we can tell with our crude ability to measure abstractions, very different AIs trained in different ways on different data develop basically the same abstractions, and the more capable the AIs, the more similar their abstractions appear to be.
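One crude way to measure whether two systems use “the same” abstractions is to compare their internal representations of the same inputs. The sketch below uses linear centered kernel alignment (CKA), a similarity score often used for this purpose. It is our choice of example, not a method this article's sources are claimed to rely on. The three “models” are hypothetical: `model_a` and `model_b` encode the same two latent factors in different coordinates (rotated and rescaled), while `model_c` encodes unrelated noise.

```python
import math
import random

random.seed(0)

def center(rows):
    """Subtract each column's mean, so features are compared around zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b, for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total

def linear_cka(reps_a, reps_b):
    """Linear CKA between two sets of representations of the same inputs."""
    a, b = center(reps_a), center(reps_b)
    denom = math.sqrt(cross_gram_sq_norm(a, a) * cross_gram_sq_norm(b, b))
    return cross_gram_sq_norm(a, b) / denom

# 300 hypothetical inputs, each described by two underlying latent factors.
latents = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(300)]

# model_a represents the factors directly.
model_a = [[x, y] for x, y in latents]

# model_b represents the same factors, rotated by 0.7 rad and scaled by 2.5.
c, s = math.cos(0.7), math.sin(0.7)
model_b = [[2.5 * (c * x - s * y), 2.5 * (s * x + c * y)] for x, y in latents]

# model_c represents something unrelated to the latent factors.
model_c = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in latents]

same = linear_cka(model_a, model_b)
different = linear_cka(model_a, model_c)
print(f"CKA(model_a, model_b) = {same:.3f}")       # same content, different coordinates
print(f"CKA(model_a, model_c) = {different:.3f}")  # unrelated content
```

Linear CKA ignores rotations and uniform rescaling, so it scores `model_a` and `model_b` as equivalent even though their raw numbers differ. Measures like this are part of why the evidence is described as crude: they detect shared linear structure, not shared meaning.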
But we have no data for generally superhuman systems. This is where some theories of natural abstractions would have to come into play. Then we might test theories against existing data and use the best to predict what will occur for superhuman systems. Alas, the theory of natural abstractions is far from developed enough to do such things. We do not even have a good technical definition yet, which is why the hypothesis is framed informally.[6]
The work is ongoing.
[1] If Tolstoy had invented the natural abstraction hypothesis, it would say: “Good representations are all alike; every bad representation is bad in its own way.”
[2] “Far away” is relative to the size of the system: for a fly it might mean a few centimeters, while for the sun it might mean thousands of kilometers.
[3] See for instance Golden Gate Claude.
[4] If human values aren't natural abstractions, it doesn't follow that they have nothing to do with natural abstractions. Human values may have inputs which are natural abstractions, which can significantly constrain their type signature, making them easier to find. This may even make them good enough proxies for natural abstractions that, in some training regimes, they get found by default.
[5] Note that you've never needed 1TB of data to describe an idea to someone, let alone to convince them that something is a rock.
[6] It is somewhat sloppy to say “the” natural abstraction hypothesis, as there are various formulations, and of course there might be a few distinct natural abstractions corresponding to a given human abstraction, rather than one. Some of the formulations have different implications for alignment. This is why this article’s exposition has to be fuzzy enough to accommodate most of these variants.