I think that the AI's internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn't surprise me to find distinct thoughts in there about electrons. As the internal ontology goes to be more about affordances and actions, I expect to find increasing disalignment. As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI's internals, I expect to find much larger differences -- not just that the AI has a different concept boundary around "easy to understand", say, but that it maybe doesn't have any such internal notion as "easy to understand" at all, because easiness isn't in the environment and the AI doesn't have any such thing as "effort". Maybe it's got categories around yieldingness to seven different categories of methods, and/or some general notion of "can predict at all / can't predict at all", but no general notion that maps onto human "easy to understand" -- though "easy to understand" is plausibly general-enough that I wouldn't be surprised to find a mapping after all.
Corrigibility and actual human values ar...
So, would you also say that two random humans are likely to have similar misalignment problems w.r.t. each other? E.g. my brain is different from yours, so the concepts I associate with words like "be helpful" and "don't betray Eliezer" and so forth are going to be different from the concepts you associate with those words, and in some cases there might be strings of words that are meaningful to you but totally meaningless to me, and therefore if you are the principal and I am your agent, and we totally avoid problem #2 (in which you give me instructions and I just don't follow them, even the as-interpreted-by-me version of them) you are still screwed? (Provided the power differential between us is big enough?)
I assumed the idea here was that AGI has a different mind architecture and thus also has different internal concepts for reflection. E.g. where a human might think about a task in terms of required willpower, an AGI might instead have internal concepts for required power consumption or compute threads or something.
Since human brains all share more or less the same architecture, you'd only expect significant misalignment between them if specific brains differed a lot from one another: e.g. someone with brain damage vs. a genius, or (as per an ACX post) a normal human vs. some one-of-a-kind person who doesn't experience suffering due to some genetic quirk.
Or suppose we could upload people: then a flesh-and-blood human with a physical brain would have a different internal architecture from a digital human with a digital brain simulated on physical computer hardware. In which case their reflective concepts might diverge insofar as the simulation was imperfect and leaked details about the computer hardware and its constraints.
Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI's internal ontology at training time. My guess is that more of the disagreement lies here.
I doubt much disagreement between you and me lies there, because I do not expect ML-style training to robustly point an AI in any builder-intended direction. My hopes generally don't route through targeting via ML-style training.
I do think my deltas from many other people lie there - e.g. that's why I'm nowhere near as optimistic as Quintin - so that's also where I'd expect much of your disagreement with those other people to lie.
There isn't really one specific thing, since we don't yet know what the next ML/AI paradigm will look like, other than that some kind of neural net will probably be involved somehow. (My median expectation is that we're ~1 transformers-level paradigm shift away from the things which take off.) But as a relatively-legible example of a targeting technique my hopes might route through: Retargeting The Search.
> Corrigibility and actual human values are both heavily reflective concepts. If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment -- which of course most people can't do because they project the category boundary onto the environment...
Actual human values depend on human internals, but predictions about systems that strongly couple to human behavior depend on human internals as well. I thus expect efficient representations of systems that strongly couple to human behavior to include human values as somewhat explicit variables. I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans.
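As a toy illustration of that heuristic (a minimal sketch with made-up numbers and a hypothetical goal set, not anything the comment above commits to): predicting a human's next action by modeling them as noisily optimizing a small set of goals is far cheaper than modeling them at any lower level.

```python
import math

def predict_action(actions: list[str], utility: dict[str, float], beta: float = 2.0) -> dict[str, float]:
    """Boltzmann-rational model of a human: P(action) is proportional to exp(beta * utility(action))."""
    weights = {a: math.exp(beta * utility[a]) for a in actions}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# If we model Alice as valuing "finish report" over "browse web", we predict
# she'll probably work -- no neuron-level simulation required.
print(predict_action(["finish report", "browse web"],
                     {"finish report": 1.0, "browse web": 0.2}))
```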
At lower confidence, I also think human expected-value-trajectory-under-additional-somewhat-coherent-reflection would show up explicitly in the thoughts of AIs that try to predict systems strongly coupled to humans. I think this because humans seem to change their values enough over time in a sufficiently coherent fa...
> I expect this because humans seem agent-like enough that modeling them as trying to optimize for some set of goals is a computationally efficient heuristic in the toolbox for predicting humans.
Sure, but the sort of thing that people actually optimize for (revealed preferences) tends to be very different from what they proclaim to be their values. This is a point not often raised in polite conversation, but to me it's a key reason why the thing people call "value alignment" is incoherent in the first place.
Oh, sure, I agree that an ASI would understand all of that well enough, but even if it wanted to, it wouldn't be able to give us either all of what we think we want or what we would endorse in some hypothetical enlightened way, because neither of those things comprises a coherent framework that robustly generalizes far out-of-distribution for human circumstances, even for one person, never mind the whole of humanity.
The best we could hope for is that some-true-core-of-us-or-whatever would generalize in such a way, and that the AI would recognize this and propagate it while sacrificing the inessential contradictory parts. But given that our current state of moral philosophy is hopelessly out of its depth relative to this, to the extent that people rarely even acknowledge these issues, trusting that the AI would get this right seems like a desperate gamble to me, even granting that we somehow could make it want to.
Of course, it doesn't look like we would get to choose not to be subjected to a gamble of this sort even if more people were aware of it, so maybe it's better for them to remain in blissful ignorance for now.
> This post might turn into a sequence if there's interest; I already have another one written for Christiano, and people are welcome to suggest others they'd be interested in.
Consider this my vote to turn it into a sequence, and to go on for as long as you can. I would be interested in one for Chris Olah, as well as the AI Optimists.
The AI Optimists (i.e. the people in the associated Discord server) have a lot of internal disagreement[1], to the point that I don't think it's meaningful to talk about the delta between John and them. That said, I would be interested in specific deltas e.g. with @TurnTrout, in part because he thought we'd get death by default and now doesn't think that, has distanced himself from LW, and, if he replies, is more likely to have a productive argument w/ John than Quintin Pope or Nora Belrose would. Not because he's better, but because I think he and John would be more legible to each other.
Source: I'm on the AI Optimists Discord server and haven't seen much to alter my prior belief that ~ everyone in alignment disagrees with everyone else.
I meant turn the "delta compared to X" into a sequence, which was my understanding of the sentence in the OP.
Consider my vote for Vanessa Kosoy and Scott Garrabrant deltas. I don't really know what their models are. I can guess what the deltas between you and Evan Hubinger are, but that would also be interesting. All of these would be less interesting than Christiano deltas though.
I'm trying to understand this debate, and probably failing.
> human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.
I assume we all agree that the system can understand the human ontology, though? This is at least necessary for communicating and reasoning about humans, which LLMs can clearly already do to some extent.
There's a lot of work around mapping ontologies, and this is known to be difficult, but very possible - especially for a superhuman intelligence.
So, I fail to see what exactly the problem is. If this smarter system can understand and reason about human ways of thinking about the world, I assume it could optimize for these ways if it wanted to. I assume the main question is if it wants to - but I fail to understand how this is an issue of ontology.
If a system really couldn't reason about human ontologies, then I don't see how it would understand the human world at all.
I'd appreciate any posts that clarify this question.
This would probably need a whole additional post to answer fully, but I can kinda gesture briefly in the right direction.
Let's use a standard toy model: an AI which models our whole world using quantum fields directly. Does this thing "understand the human ontology"? Well, the human ontology is embedded in its model in some sense (since there are quantum-level simulations of humans embedded in its model), but the AI doesn't actually factor any of its cognition through the human ontology. So if we want to e.g. translate some human instructions or human goals or some such into that AI's ontology, we need a full quantum-level specification of the instructions/goals/whatever.
Now, presumably we don't actually expect a strong AI to simulate the whole world at the level of quantum fields, but that example at least shows what it could look like for an AI to be highly capable, including able to reason about and interact with humans, but not use the human ontology at all.
Thanks for that, but I'm left just as confused.
I assume that this AI agent would be able to have conversations with humans about our ontologies. I strongly assume it would need to be able to do the work of "thinking through our eyes/ontologies" to do this.
I'd imagine the situation would be something like,
1. The agent uses quantum simulations almost all of the time.
2. In the case it needs to answer human questions, like answer AP Physics problems, it easily understands how to make these human-used models/ontologies in order to do so.
Similar to how graduate physicists can still do mechanics questions without considering special relativity or quantum effects, if asked.
So I'd assume that the agent/AI could do the work of translation - we wouldn't need to.
I guess, here are some claims:
1) Humans would have trouble policing a being way smarter than us.
2) Humans would have trouble understanding AIs with much more complex ontologies.
3) AIs with more complex ontologies would have trouble understanding humans.
#3 seems the most suspect to me, though 1 and 2 also seem questionable.
> I strongly assume it would need to be able to do the work of "thinking through our eyes/ontologies" to do this.
Why would an AI need to do that? It can just simulate what happens conditional on different sounds coming from its speaker or whatever, and then emit the sounds which result in the outcomes which it wants.
A human ontology is not obviously the best tool, even for e.g. answering mostly-natural-language questions on an exam. Heck, even today's exam help services will often tell you to guess which answer the graders will actually mark as correct, rather than taking questions literally or whatever. Taken to the extreme, an exam-acing AI would plausibly perform better by thinking about the behavior of the physical system which is a human grader (or a human recording the "correct answers" for an automated grader to use), rather than trying to reason directly about the semantics of the natural language as a human would interpret it.
(To be clear, my median model does not disagree with you here, but I'm playing devil's advocate.)
If it's able to function as well as it would if it understood our ontology, if not better, then why does it matter that it doesn't use our ontology?
I assume a system you're describing could still be used by humans to do (basically) all of the important things. Like, we could ask it "optimize this company, in a way that we would accept, after a ton of deliberation", and it could produce a satisfying response.
> But why would it? What objective, either as the agent's internal goal or as an outer optimization signal, would incentivize the agent to bother using a human ontology at all, when it could instead use the predictively-superior quantum simulator?
I mean, if it can always act just as well as if it could understand human ontologies, then I don't see the benefit of it "technically understanding human ontologies". This seems like it is tending toward some semantic argument or something.
If an agent can trivially act as if it understands Ontology X, where/why does it actually matter that it doesn't technically "understand" ontology X?
I assume that the argument that "this distinction matters a lot" would functionally play out in there being some concrete things that it can't do.
(feel free to stop replying at any point, sorry if this is annoying)
> Like, you ask the AI "optimize this company, in a way that we would accept, after a ton of deliberation", and it has a very-different-off-distribution notion than you about what constitutes the "company", and counts as you "accepting", and what it's even optimizing the company for.
I'd assume that when we tell it, "optimize this company, in a way that we would accept, after a ton of deliberation", this could instead be described as, "optimize this company, in a way that we would accept, after a ton of deliberation, where these terms are described using our ontology".
It seems like the AI can trivially figure out what humans would regard as the "company" or "accepting". Like, it could generate any question like "Would X qualify as the 'company', if asked to a human?" and get an accurate response.
I agree that we would have a tough time understanding its goal / specifications, but I expect that it would be capable of answering questions about its goal in our ontology.
My take:
> I assume we all agree that the system can understand the human ontology, though? This is at least necessary for communicating and reasoning about humans, which LLMs can clearly already do to some extent.
Can we reason about a thermostat's ontology? Only sort of. We can say things like "The thermostat represents the local temperature. It wants that temperature to be the same as the set point." But the thermostat itself is only very loosely approximating that kind of behavior - imputing any sort of generalizability to it that it doesn't actually have is an anthropomorphic fiction. And it's blatantly a fiction, because there's more than one way to do it - you can suppose the thermostat wants only the temperature sensor to be at the right temperature vs. it wants the whole room vs. the whole world to be at that temperature, or that it's "changing its mind" when it breaks vs. it would want to be repaired, etc.
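To make the "more than one way to do it" point concrete, here is a minimal sketch (hypothetical code, obviously not how any real thermostat is written) showing that the same control logic is compatible with several incompatible imputed goals:

```python
def thermostat_step(sensor_temp: float, set_point: float) -> str:
    """Turn the heater on iff the sensed temperature is below the set point."""
    return "heater_on" if sensor_temp < set_point else "heater_off"

# Story A: "it wants the sensor reading to reach the set point."
# Story B: "it wants the whole room at the set point."
# Story C: "it wants the whole world at the set point."
# All three stories predict the same behavior in the situations the thermostat
# actually encounters; they only come apart off-distribution.
```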
To the superintelligent AI, we are the thermostat. You cannot be aligned to humans purely by being smart, because finding "the human ontology" is an act of interpretation, of story-telling, not just a question of fact. Helping an AI narrow down how to interpret humans a...
Simplifying somewhat: I think that my biggest delta with John is that I don't think the natural abstraction hypothesis holds. (E.g., if I believed it held, I would become more optimistic about single-agent alignment, to the point of viewing Moloch as the higher priority.) At the same time, I believe that powerful AIs will be able to understand humans just fine. My vague attempt at reconciling these two is something like this:
Humans have some ontology, in which they think about the world. This corresponds to a world model. This world model has a certain amount of prediction errors.
The powerful AI wants to have much lower prediction error than that. When I say "natural abstraction hypothesis is false", I imagine something like: If you want to have a much lower prediction error than that, you have to use a different ontology / world-model than humans use. And in fact if you want sufficiently low error, then all ontologies that can achieve that are very different from our ontology --- either (reasonably) simple and different, or very complex (and, I guess, therefore also different).
So when the AI "understands humans perfectly well", that means something like: The AI can visualise the flawed...
Oddly, while I was at MIRI I thought the ontology identification problem was hard and absolutely critical, and it seemed Eliezer was more optimistic about it; he thought it would probably get solved along the way in AI capabilities development, because e.g. the idea of carbon atoms in diamond is a stable concept, and "you don't forget how to ride a bike". (Not sure if his opinion has changed)
> We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.
For context, I'm familiar with this view from the ELK report. My understanding is that this is part of the "worst-case scenario" for alignment that ARC's agenda is hoping to solve (or, at least, still hoped to solve a ~year ago).
> ...The paradigmatic example of an ontology mismatch is a deep change in our understanding of the physical world. For example, you might imagine humans who think about the world in terms of rigid bodies and Newtonian fluids and “complicated stuff we don’t quite understand,” while an AI thinks of the world in terms of atoms and the void. Or we might imagine humans who think in terms of the standard model of physics, while an AI understands reality as vibrations of strings. We think that this kind of deep physical mismatch is a useful mental picture, and it can be a fruitful source of simp...
My model (which is pretty similar to my model of Eliezer's model) does not match your model of Eliezer's model. Here's my model, and I'd guess that Eliezer's model mostly agrees with it:
Insofar as an AI cares-as-a-terminal-goal about keeping humans around, it will care about its own alien conception of “humans” which does not match ours, and will happily replace us with less resource-intensive (or otherwise preferable) things which we would not consider “human”.
As a side note, I'm not sure about this. It seems plausible to me that the super-stimulus-of-a-human-according-to-an-alien-AI-value-function is a human in the ways that I care about, in the same way that an em is in some ways extremely different from a biological human, but is also a human in the ways I care about.
I'm not sure that I should write off as valueless a future dominated by AIs that care about a weird alien abstraction of "human", one that admits extremely weird edge cases.
I also think that the natural abstraction hypothesis holds with current AI. The architecture of LLMs is based on modeling ontology as vectors in a space of thousands of dimensions, and there are experiments showing that this generalizes and that directions in that space have somewhat interpretable meanings (even if they are not easy to interpret at scales above toy models). Like in that toy example where you take the embedding vector of the word "king", subtract the vector of "man", add the vector of "woman", and land near the position of ...
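For reference, that toy example can be reproduced in a few lines. This is a minimal sketch using gensim's pretrained GloVe vectors (the specific model name is just one readily available choice, not anything the comment above commits to):

```python
import gensim.downloader as api

# Load a small pretrained embedding model (downloads on first use).
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman lands near "queen" for most standard embedding models.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```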
> In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked
I'm not really sure what it would mean for the natural abstraction hypothesis to turn out to be true, or false. The hypothesis itself seems insufficiently clear to me.
On your view, if there are no "natural abstractions," then we should predict that AIs will "generalize off-distribution" in ways that are catastrophic for human welfare. Okay, fine. I would prefer to just talk directly about the probability that AIs will generalize in catastrophic ways. I don't see any re...
Curated. I appreciate posts that attempt to tease out longstanding disagreements. I like both this post and its followup about Wentworth/Christiano diffs. But I find this one a bit more interesting on the margin, because Wentworth and Yudkowsky are people I normally think of as "roughly on the same page", so teasing out the differences feels more like having a conversation we actually haven't had much of in the public discourse.
Very nice! Strong vote for a sequence. Understanding deltas between experts is a good way to both understand their thinking, and to identify areas of uncertainty that need more work/thought.
On natural abstractions, I think the hypothesis is more true for some abstractions than others. I'd think there's a pretty clear natural abstraction for a set of carbon atoms arranged as diamond. But much less of a clear natural abstraction for the concept of a human. Different people mean different things by "human", and will do this even more when we can make variatio...
Edit: see https://www.lesswrong.com/posts/q8uNoJBgcpAe3bSBp/my-ai-model-delta-compared-to-yudkowsky?commentId=CixonSXNfLgAPh48Z and ignore the below.
This is not a doom story I expect Yudkowsky would tell or agree with.
> a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.
This hypothetical suggests to me that the AI might not be very good at e.g. manipulating humans in an AI-box experiment, since it just doesn't understand how humans think all that well.
I wonder what MIRI thinks about this 2013 post ("The genie knows, but doesn't care") nowadays. Seems like the argument is less persuasive now, with AIs that seem to learn representations first, and later are given...
If I've understood you correctly, you consider your only major delta with Eliezer Yudkowsky to be whether or not natural abstractions basically always work, or reliably exist harnessably, to put it in different terms. Is that a fair restatement?
If so, I'm (specifically) a little surprised that that's all. I would have expected whatever reasoning the two of you did differently or whatever evidence the two of you weighted differently (or whatever else) would have also given you some other (likely harder to pin down) generative-disagreements (else maybe i...
> All of the technical alignment hopes are out, unless we posit some objective natural enough that it can be faithfully and robustly translated into the AI’s internal ontology despite the alien-ness.
Right. One possible solution is that if we are in a world without natural abstraction, a more symmetric situation where various individual entities try to respect each other's rights and try to maintain this mutual respect for each other's rights might still work OK.
Basically, assume that there are many AI agents on different and changing levels of capabilities, a...
As a counterargument, consider mapping our ontology onto that of a baby. We can, kind of, explain some things in baby terms and, to that extent, a baby could theoretically see our neurons mapping to similar concepts in their ontology lighting up when we do or say things related to that ontology. At the same time our true goals are utterly alien to the baby.
Alternatively, imagine that you were sent back to the time of the pharaohs and had a discussion with Cheops/Khufu about the weather and forthcoming harvest - even trying to explain it in terms of chaos th...
I get the feeling that "Given you mostly believe the natural abstraction hypothesis is true, why aren't you really optimistic about AI alignment (are you?) and/or think doom is very unlikely?" is a question people have. I think it would be useful for you to answer this.
I think 99% is within the plausible range of doom, but I think there's a 100% chance that I have no capacity to change that (I'm going to take that as part of the definition of doom). The non-doom possibility is then worth all my attention, since there's some chance of increasing the probability of this favorable outcome. Indeed, of the two, this is by definition the only chance for survival.
Said another way, it looks to me like this is moving too fast and powerfully and in too many quarters to expect it to be turned around. The most dangerous corners of the...
I am curious for which possible universes you expect natural abstractions to hold.
Would you expect the choice of physics to decide the abstractions that arise? Or is it more fundamental categories like "physics abstractions" that instantiate from a universal template and "mind/reasoning/sensing abstractions" where the latter is mostly universally identical?
My current best guess is that spacetime locality of physics is the big factor - i.e. we'd get a lot of similar high-level abstractions (including e.g. minds/reasoning/sensing) in other universes with very different physics but similar embedding of causal structure into 4 dimensional spacetime.
Given that Anthropic basically extracted the abstractions from the middle layer of Claude Sonnet, and OpenAI recently did the same for models up to GPT-4, and that most of the results they found were obvious natural abstractions to a human, I'd say we now have pretty conclusive evidence that you're correct and that (your model of) Eliezer is mistaken on this. Which isn't really very surprising for models whose base model was trained on the task of predicting text from the Internet: they were distilled from humans and they think similarly.
Note that for your arg...
(I'm the first author of the linked paper on GPT-4 autoencoders.)
I think many people are heavily overrating how human-explainable SAEs today are, because it's quite subtle to determine whether a feature is genuinely explainable. SAE features today, even in the best SAEs, are generally not explainable with simple human understandable explanations. By "explainable," I mean there is a human understandable procedure for labeling whether the feature should activate on a given token (and also how strong the activation should be, but I'll ignore that for now), such that your procedure predicts an activation if and only if the latent actually activates.
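As a toy sketch of that standard (made-up data, not anything from the paper): an explanation only counts if the activations it predicts match the latent's actual activations in both directions, i.e. both precision and recall are high.

```python
def score_explanation(predicted: list[bool], actual: list[bool]) -> dict[str, float]:
    """Compare an explanation's predicted firings against the latent's actual firings."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    precision = true_pos / max(sum(predicted), 1)  # predicted firings that really happen
    recall = true_pos / max(sum(actual), 1)        # actual firings the explanation catches
    return {"precision": precision, "recall": recall}

# Toy example: the explanation "fires on the token 'king'" over six tokens.
predicted = [True, False, False, True, False, False]
actual    = [True, False, True,  True, False, False]  # the latent also fires on token 2
print(score_explanation(predicted, actual))           # perfect precision, imperfect recall
```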
There are a few problems with interpretable-looking features:
I do not think SAE results to date contribute very strong evidence in either direction. "Extract all the abstractions from a layer" is not obviously an accurate statement of what they do, and the features they do find do not obviously faithfully and robustly map to human concepts, and even if they did it's not clear that they compose in human-like ways. They are some evidence, but weak.
(In fact, we know that the fraction of features extracted is probably quite small - for example, the 16M latent GPT-4 autoencoder only captures 10% of the downstream loss in terms of equivalent pretraining compute.)
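For readers unfamiliar with the technique under discussion, here is a minimal sketch of a sparse autoencoder (toy sizes and random stand-in activations, not the actual setup from either lab's paper; real ones, like the 16M-latent GPT-4 autoencoder mentioned above, are trained on a model's actual residual-stream activations):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct activations through an overcomplete, sparsity-penalized layer."""
    def __init__(self, d_model: int = 64, d_latent: int = 512):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, activations: torch.Tensor):
        latents = torch.relu(self.encoder(activations))  # candidate "features"
        return self.decoder(latents), latents

sae = SparseAutoencoder()
acts = torch.randn(8, 64)  # stand-in for a model's internal activations
recon, latents = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * latents.abs().mean()  # reconstruction + L1 sparsity
```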
Preamble: Delta vs Crux
I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.
Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world.
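As a minimal sketch of that picture (the update rule and numbers below are made up purely for illustration): two copies of the same toy "program", differing only in that one parameter, quickly end up disagreeing about essentially every downstream variable.

```python
def run_model(parameter: float, steps: int = 10) -> list[float]:
    """Toy world-model: repeatedly apply one arbitrary update rule."""
    state = 1.0
    trajectory = []
    for _ in range(steps):
        state = parameter * state * (1.0 - 0.1 * state)  # arbitrary update rule
        trajectory.append(round(state, 3))
    return trajectory

print(run_model(5.0))  # "my" model: the trajectory quickly diverges
print(run_model(0.3))  # "your" model: the trajectory quietly decays toward zero
```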
If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.
That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.
For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
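A minimal sketch of that analogy (toy structural causal model, all mechanisms made up): two models that differ only in one mechanism give very different answers, but agree once we intervene with do() on the variable that mechanism produces.

```python
def scm(alpha: float, u: float, do_x: float | None = None) -> dict[str, float]:
    """Tiny SCM: U -> X -> Y, where only X's mechanism depends on alpha."""
    x = alpha * u if do_x is None else do_x  # do(X=x) overrides X's usual mechanism
    y = 2.0 * x + 1.0                        # shared downstream mechanism
    return {"X": x, "Y": y}

print(scm(5.0, u=1.0), scm(0.3, u=1.0))                      # the models disagree downstream
print(scm(5.0, u=1.0, do_x=1.0), scm(0.3, u=1.0, do_x=1.0))  # identical after do(X=1)
```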
This post is about my current best guesses at the delta between my AI models and Yudkowsky's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models basically look like Yudkowsky’s as far as I can tell. That said, note that this is not an attempt to pass Eliezer's Intellectual Turing Test; I'll still be using my own usual frames.
This post might turn into a sequence if there's interest; I already have another one written for Christiano, and people are welcome to suggest others they'd be interested in.
My AI Model Delta Compared To Yudkowsky
Best guess: Eliezer basically rejects the natural abstraction hypothesis. He mostly expects AI to use internal ontologies fundamentally alien to the ontologies of humans, at least in the places which matter. Lethality #33 lays it out succinctly:
What do my models look like if I propagate that delta? In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked, and a 99% probability of doom strikes me as entirely reasonable and justified.
Here’s one oversimplified doom argument/story in a world where natural abstraction fails hard:
Note that the “oversimplification” of the argument mostly happened at step 2; the actual expectation here would be that a faithful and robust translation of human concepts is long in the AI’s internal language, which means we would need very high precision in order to instill the translation. But that gets into a whole other long discussion.
By contrast, in a world where natural abstraction basically works, the bulk of human concepts can be faithfully and robustly translated into the internal ontology of a strong AI (and the translation isn't super-long). So, all those technical alignment possibilities are back on the table.
That hopefully gives a rough idea of how my models change when I flip the natural abstraction bit. It accounts for most of the currently-known-to-me places where my models diverge from Eliezer’s. I put nontrivial weight (maybe about 10-20%) on the hypothesis that Eliezer is basically correct on this delta, though it’s not my median expectation.
[1] particular time = particular point in the unrolled execution of the program