Thanks to Garrett Baker and Evžen Wybitul for discussions and feedback on this post.
Imagine that you meet an 18th century altruist. They tell you “So, I’ve been thinking about whether or not to eat meat. Do you know whether animals have souls?” How would you answer, assuming you actually do want to be helpful?
One option is to spend a lot of time explaining why “soul” isn’t actually the thing in the territory they care about, and talk about moral patienthood and theories of welfare and moral status. If they haven’t walked away from you in the first thirty seconds this may even work, though I wouldn’t bet on it.
Another option is to just say “yes” or “no”, to try and answer what their question was pointing at. If they ask further questions, you can either dig in deeper and keep translating your real answers into their ontology or at some point try to retarget their questions’ pointers toward concepts that do exist in the territory.
Low-fidelity pointers
The problem you’re facing in the above situation is that the person you’re talking to is using an inaccurate ontology to understand reality. The things they actually care about correspond to quite different objects in the territory. Those objects currently don’t have very good pointers in their map. Trying to directly redirect their questions without first covering a fair amount of context and inferential distance over what these objects are probably wouldn’t work very well.
So, the reason this is relevant to alignment:
Representations of things within the environment are learned by systems up to the level of fidelity that’s required for the learning objective. This is true even if you assume a weak version of the natural abstraction hypothesis to be true; the general point isn’t that there wouldn’t be concepts corresponding to what we care about, but that they could be very fuzzy.
For example, let’s say that you try to retarget an internal general-purpose search process. The post proposing this idea describes the following approach:
- Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
- Identify the retargetable internal search process.
- Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target.
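As a rough illustration of what these three steps assume, here is a toy sketch in Python. Everything in it is hypothetical: the grid world, the `concepts` dictionary standing in for "internal concepts", and the `search` function standing in for a general-purpose search process. It is not meant to describe how a real model is structured, only the idealized case where the target is a clean, swappable input.

```python
from collections import deque

def search(start, is_goal, neighbors):
    """Breadth-first search: shortest path from start to any state satisfying is_goal."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if is_goal(state):
            return path
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

def grid_neighbors(state, size=4):
    """Moves available on a small size x size grid (a hypothetical toy environment)."""
    x, y = state
    steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in steps if 0 <= a < size and 0 <= b < size]

# Step 1 (assumed solved): the "internal concepts" we managed to identify.
concepts = {
    "top_right": lambda s: s == (3, 3),
    "bottom_right": lambda s: s == (3, 0),
}

# Step 2 (assumed solved): `search` is the retargetable search process.
original = search((0, 0), concepts["top_right"], grid_neighbors)

# Step 3: retarget by swapping which concept the search takes as its goal.
retargeted = search((0, 0), concepts["bottom_right"], grid_neighbors)

print(original[-1], retargeted[-1])
```

In this toy, retargeting is free because the search has no target-specific machinery and each concept is a crisp predicate. Neither of those is guaranteed in a learned system.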
There are, very broadly and abstracting over a fair amount of nuance, three problems with this:
- You need to have interpretability tools that are able to robustly identify human-relevant alignment properties from the AI’s internals[1]. This isn’t so much a problem with the approach as it is the hard thing you have to solve for it to work.
- It doesn’t seem obvious that existentially dangerous models are going to look like they’re doing fully-retargetable search. Learning heuristics that are specialized to the environment, task, or target is likely to make your search much more efficient[2]. These heuristics can be selectively learned and used for different contexts. This imposes a cost on arbitrary retargeting, because you have to relearn those heuristics for the new target.
- The concept corresponding to the alignment target you want is not very well-specified. Retargeting your model to this concept probably would make it do the right things for a while. However, as the model starts to learn more abstractions relating to this new target, you run into an under-specification problem where the pointer can generalize in one of several ways.
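The under-specification in the third problem can be made concrete with a small toy sketch. The labeled examples and candidate concepts below are invented purely for illustration: several candidate refinements agree with the coarse pointer on every familiar case, yet disagree once the inputs move outside that range.

```python
# Hypothetical familiar cases on which the coarse concept was learned:
# input -> whether the concept applies.
labeled = {1: False, 2: False, 3: True, 4: True}

# Hypothetical higher-fidelity concepts the pointer could end up meaning.
candidates = {
    "greater_than_2": lambda x: x > 2,
    "between_3_and_9": lambda x: 3 <= x <= 9,
    "is_3_or_4": lambda x: x in (3, 4),
    "is_even": lambda x: x % 2 == 0,
}

# Keep only the candidates that match every familiar case.
consistent = {
    name: concept
    for name, concept in candidates.items()
    if all(concept(x) == label for x, label in labeled.items())
}

# Compare how the surviving candidates generalize to unfamiliar inputs.
novel = [5, 50]
predictions = {
    name: [concept(x) for x in novel] for name, concept in consistent.items()
}

for name, preds in predictions.items():
    print(name, preds)
```

Three of the four candidates fit the familiar cases equally well, and each gives a different answer on the novel inputs. Nothing in the familiar data picks out which of them the original pointer "really" meant.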
The first problem (or at least, some version of it) seems unavoidable to me in any solution to alignment. What you want in the end is to interact with the things in your system that would assure you of its safety. There may be simpler ways to go about it, however. The second problem is somewhat related to the third, and would, I think, be solved by similar methods.
Notably, the third problem is still a problem in this setup even if you have a robust specification of the target yourself. Because the setup doesn’t involve going in and improving the model’s capabilities / ontology / world model fidelity, you’re upper-bounded by how good a specification it already has. You could modify the setup to involve updating the model’s ontology, but that seems like a much harder problem if we wanted to do it in the same direct way.
If we didn’t have to do it in the same way, however, there are some promising threads. The one that seems most straightforward to me is designing a training process such that the retargeting is integrated at every step, for instance as part of a reward signal on what you want the search target to be. This would allow the concepts you care about to always be closely downstream of the target you’re training on, and steer the model’s capabilities toward learning higher-fidelity representations of them[3].
This does mean that you have to have a good specification of the alignment target yourself; training on an under-specified signal or weak proxy is pretty risky. I don’t think this necessarily makes the problem harder because we may need such a specification to retarget the model at all, or generally do model editing, and so on. But it’s a consideration I don’t often see mentioned, that feeds into some aspects of what I work on, and how I think of good alignment solutions.
This post was in part inspired by this Twitter thread from last year mentioning this problem. I left a short reply on how I thought it could be solved, but figured it was worth writing up the entire thing in more detail at some point.
[2] I think Mark Xu’s example of quick sort vs random sorting is a good intuition pump here.
[3] The original proposal for retargeting the search also mentioned training-on-the-fly, but mainly in the context of not getting a misaligned AI before we do the retargeting.
Directly, no. But the process of science (like any use of Bayesian reasoning) is intended to gradually make our ontology a better fit to more of reality. If that is working as intended, then we would expect it to require more and more effort to produce the evidence needed to cause a significant further paradigm shift across a significant area of science, because there are fewer and fewer major large-scale misconceptions left to fix. Over the last century, we have had more and more people working as scientists, publishing more and more papers, yet the rate of significant paradigm shifts that affect a significant area of science has been dropping. From this I deduce that our ontology is probably a significantly better fit to reality now than it was a century ago, let alone three centuries ago, back in the 18th century that this post discusses. Certainly the size and detail of our scientific ontology have both increased dramatically.
Is this proof? No, as you correctly observe, proof would require knowing the truth about reality. It's merely suggestive supporting evidence. It's possible to contrive other explanations. For instance, perhaps for some reason (maybe related to social or educational changes) all of those people working in science now are much stupider, more hidebound, or less original thinkers than the scientists of a century ago, and that's why dramatic paradigm shifts are slower. Personally, I think this is very unlikely.
It is also quite possible that this is more true in certain areas of science that are amenable to the mental capabilities and research methods of human researchers, and that there are other areas that are resistant to these approaches (so our lack of progress in these areas is caused by inability, not by our approaching the goal), but where the different capabilities of an AI might allow it to make rapid progress. In such an area, the AI's ontology might well be a significantly better fit to reality than ours.