As Shankar Sivarajan points out in a different comment, the idea that AI became less scientific when we started having actual machine intelligence to study, as opposed to before that when the 'rightness' of a theory was mostly based on the status of whoever advanced it, is pretty weird. The specific way in which it's weird seems encapsulated by this statement:
on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.
In that there is an unstated assumption that these are unrelated activities. That deep learning systems are a kind of artifact produced by a few undifferentiated commodity inputs, one of which is called 'parameters', one called 'compute', and one called 'data', and that the details of these commodities aren't important. Or that the details aren't important to the people building the systems.
I've seen a (very revisionist) description of the Wright brothers' research as analogous to solving the control problem, because other airplane builders would put in an engine and crash before they'd developed reliable steering. Therefore, the analogy says, we should de...
This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.
Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says "This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative AI". I would still defend the past research into it as good basic science, because we might encounter failure modes somewhat related to it.
Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?
The fact that constitutional AI works at all (that we can point at abstract concepts like 'freedom', and language models are able to drive a reinforcement learning optimization process to hit the right behavior-targets from the abstract principle) is very strong evidence that they understand the meaning of those abstract concepts.
"It understands but it doesn't care!"
There is this bizarre motte-and-bailey people seem to do around this subject.
I agree. I am extremely bothered by this unsubstantiated claim. I recently replied to Eliezer:
...Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
I commonly encounter people expressing sentiments like "prosaic alignment work isn't real alignment, because we aren't actually getting the AI to care about X." To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these
Well, if someone originally started worrying based on strident predictions of sophisticated internal reasoning with goals independent of external behavior, then realizing that's currently unsubstantiated should cause them to down-update on AI risk. That's why it's relevant. Although I think we should have good theories of AI internals.
"That deep learning systems are a kind of artifact produced by a few undifferentiated commodity inputs, one of which is called 'parameters', one called 'compute', and one called 'data', and that the details of these commodities aren't important. Or that the details aren't important to the people building the systems."
That seems mostly true so far for the most capable systems? Of course, some details matter and there's opportunity to do research on these systems now, but centrally it seems like you are much more able to forge ahead without a detailed understanding of what you're doing than e.g. in the case of the Wright brothers.
The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."
The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is. A physicist's view. It is one I'm deeply sympathetic to, and if your definition of science is Rutherford's, you might be right, but a reasonable one that includes chemistry would have to include AI as well.
The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.
See my reply to Bogdan here. The issue isn't "inelegance"; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.
Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)
The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."
The reason this analogy doesn't land for me is that I don't think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers' epistemic position regarding heavier-than-air flight.
The point Nate was trying to make with "ML is no longer a science" wasn't "boo current ML that actually works, yay GOFAI that didn't work". The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works...
While theoretical physics is less "applied science" than chemistry, there's still a real difference between chemistry and chemical engineering.
For context, I am a Mechanical Engineer, and while I do occasionally check the system I am designing and try to understand/verify how well it is working, I am fundamentally not doing science. The main goal is solving a practical problem (i.e., with as little theoretical understanding as is sufficient), whereas in science the understanding is the main goal, or at least closer to it.
...By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).
Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steering the future into narrow bands, at least when the world is sufficiently large and full of curveballs.
I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
This is a meta-point, but I find it weird that you ask what "caring about something" is according to CS, but don't ask what "corrigibility" is. We have multiple examples of goal-oriented systems and some relatively good formalisms (we disagree about whether expected utility maximization is a good model of real goal-oriented systems, but we would all agree that if we met an expected utility maximizer, we would find its behavior pretty much goal-oriented). Corrigibility, by contrast, is purely a product of the imagination of one particular Eliezer Yudkowsky, born from an attempt to imagine a system that doesn't care about us but still behaves nicely under some vaguely restricted definition of niceness. We don't have any examples of corrigible systems in nature, and attempts to formalize even relatively simple instances of corrigibility, like shutdownability, keep failing. I think the likely answer to "why should I expect corrigibility to be unlikely" is: there is no simple description of corrigibility to which our learning systems can easily generalize, and there is no reason to expect a simple description to exist.
Disagree on several points. I don't need future AIs to satisfy some mathematically simple description of corrigibility, just for them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some argument why this is unlikely, I haven't seen a good rigorous version.
As Algon says in a sibling comment, non-agentic systems are by default shutdownable, myopic, etc. In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn't prevent itself from being shut down for instrumental reasons, whereas humans generally will. So there is no linear scale of "powerful optimizer" that determines whether a system is easy to shut down. If there is some property of competent systems in practice that does prevent shutdownability, what is it? Likewise with other corrigibility properties. That's what I'm trying to ...
Dude, a calculator is corrigible. A desktop computer is corrigible. (Less confidently) a well-trained dog is pretty darn corrigible. There are all sorts of corrigible systems, because most things in reality aren't powerful optimizers.
So what about powerful optimizers? Like, is Google corrigible? If shareholders seem like they might try to pull the plug on the company, does it stand up for itself & convince, lie to, or threaten shareholders? Maybe, but I think the details matter. I doubt Google would assassinate shareholders in pretty much any situation. Mislead them? Yeah, probably. How much though? I don't know. I'm somewhat confident bureaucracies aren't corrigible. Lots of humans aren't corrigible. What about even more powerful optimizers?
We haven't seen any, so there are no examples of corrigible ones.
Some thoughts:
I'm very sympathetic to this complaint; I think that these arguments simply haven't been made rigorously, and at this point it seems like Nate and Eliezer are not in an epistemic position where they're capable of even trying to do so. (That is, they reject the conception of "rigorous" that you and I are using in these comments, and therefore aren't willing to formulate their arguments in a way which moves closer to meeting it.)
You should look at my recent post on value systematization, which is intended as a framework in which these claims can be discussed more clearly.
I don't think we should equate the understanding required to build a neural net that will generalize in a way that's good for us with the understanding required to rewrite that neural net as a gleaming wasteless machine.
The former requires finding some architecture and training plan to produce certain high-level, large-scale properties, even in the face of complicated AI-environment interaction. The latter requires fine-grained transparency at the level of cognitive algorithms, and some grasp of the distribution of problems posed by the environment, together with the ability to search for better implementations.
If your implicit argument is "In order to be confident in high-level properties even in novel environments, we have to understand the cognitive algorithms that give rise to them and how those algorithms generalize - there exists no emergent theory of the higher level properties that covers the domain we care about." then I think that conclusion is way too hasty.
AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition.
Those scientists didn’t succeed, and those days are behind us.
I claim many of them did succeed, for example:
Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for "tasty" foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)
I simply do not understand why people keep using this example.
I think it is wrong -- evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process for how we actually start to like ice cream -- namely, we eat it, and then we get a reward, and that's why we like it -- then the world looks a lot less hostile, and misalignment a lot less likely.
But given that this example is so controversial, even if it were right why would you use it -- at least, why would you use it ...
I think Nate’s claim “I expect them to care about a bunch of correlates of the training signal in weird and specific ways.” is plausible, at least for the kinds of AGI architectures and training approaches that I personally am expecting. If you don’t find the evolution analogy useful for that (I don’t either), but are OK with human within-lifetime learning as an analogy, then fine! Here goes!
OK, so imagine some “intelligent designer” demigod, let’s call her Ev. In this hypothetical, the human brain and body were not designed by evolution, but rather by Ev. She was working 1e5 years ago, back on the savannah. And her design goal was for these humans to have high inclusive genetic fitness.
So Ev pulls out a blank piece of paper. First things first: She designed the human brain with a fancy large-scale within-lifetime learning algorithm, so that these humans can gradually get to understand the world and take good actions in it.
Supporting that learning algorithm, she needs a reward function (“innate drives”). What to do there? Well, she spends a good deal of time thinking about it, and winds up putting in lots of perfectly sensible components for perfectly sensible reasons.
For example: ...
Does evolution ~= AI have predictive power apart from doom?
Evolution analogies predict a bunch of facts that are so basic they're easy to forget about, and even if we have better theories for explaining specific inductive biases, the simple evolution analogies should still get some weight for questions we're very uncertain about.
I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.
I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.
Or, what's more to the point -- I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.
Combining some of your and Habryka's comments, which seem similar.
The resulting structure of the solution is mostly discovered, not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don't know exist.
It's true that the structure of the solution is discovered and complex -- but the ontology of the solution for DL (at least in currently used architectures) is quite opinionated towards shallow circuits with relatively few serial ops. This is different than the bias for evolution, which is fine with a mutation that leads to 10^7 serial ops if its metabolic costs are low. So the resemblance seems shallow other than "soluti...
FWIW my take is that the evolution-ML analogy is generally a very excellent analogy, with a bunch of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it's very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting a bunch for success on thing-X doesn't necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without providing an example).
I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.
Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all[1], whereas evolution basically is. But brains are definitely doing something like temporal-difference learning, and the overall 'serial depth' thing is also weakly in favour of brains ~= DL vs genomes+selection ~= DL.
I'd love to know what you're referring to by this:
evolution... is fine with a mutation that leads to 10^7 serial ops if its metabolic costs are low.
Also,
...Is
Not sure what you mean here. One of the best explanations of how neural networks get trained uses basically a pure natural selection lens, and I think it gets most predictions right:
CGP Grey "How AIs, like ChatGPT, Learn" https://www.youtube.com/watch?v=R9OHn5ZF4Uo
There is also a follow-up video that explains SGD:
CGP Grey "How AI, Like ChatGPT, *Really* Learns" https://www.youtube.com/watch?v=wvWpdrfoEv0
In general, I think if you use a natural selection analogy you will get a huge amount of things right about how AI works, though I agree not everything (it won't explain the difference between Adam and AdamW, but it will explain the difference between hierarchical Bayesian networks, linear regression, and modern deep learning).
Note: I just watched the videos. I personally would not recommend the first video as an explanation to a layperson if I wanted them to come away with accurate intuitions around how today's neural networks learn / how we optimize them. What it describes is a very different kind of optimizer, one explicitly patterned after natural selection such as a genetic algorithm or population-based training, and the follow-up video more or less admits this. I would personally recommend they opt for these videos instead:
Except that selection and gradient descent are closely mathematically related - you have to make a bunch of simplifying assumptions, but 'mutate and select' (evolution) is actually equivalent to 'make a small approximate gradient step' (SGD) in the limit of small steps.
I read the post and left my thoughts in a comment. In short, I don't think the claimed equivalence in the post is very meaningful.
(Which is not to say the two processes have no relationship whatsoever. But I am skeptical that it's possible to draw a connection stronger than "they both do local optimization and involve randomness.")
CGP Grey's video is a decent example source. Most of the differences between hierarchical Bayesian networks and modern deep learning come across pretty well if you model the latter as a type of genetic algorithm search:
There are also just actually deep similarities. Vanilla SGD is perfectly equivalent to a genetic search with an infinitesimally small mutation size and infinite samples per generation (I could make a proof here but won't unless someone is interested in it). Indeed, in one of my ML classes at Berkeley, genetic algorithms were suggested as one of the obvious things to do in a non-differentiable loss landscape as a generalization of SGD, where you just try some mutations, see which one performs best, and then modify your parameters in that direction.
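(For anyone interested, here is a minimal numerical sketch of that correspondence, not a proof. It uses a toy quadratic loss, and the mutation scale, population size, and averaging count are arbitrary illustrative choices.)

```python
# Toy check: with small Gaussian mutations and a large population,
# "mutate and keep the best" steps point (on average) along the negative gradient.
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    return 0.5 * np.sum(theta ** 2)  # toy quadratic loss; its gradient is theta

def grad(theta):
    return theta

def best_mutation(theta, sigma=1e-3, population=10_000):
    """Sample small Gaussian mutations and keep the one with the lowest loss."""
    mutations = sigma * rng.standard_normal((population, theta.size))
    losses = np.array([loss(theta + m) for m in mutations])
    return mutations[np.argmin(losses)]

theta = rng.standard_normal(5)

# Average many "winning" mutations to smooth out sampling noise.
mean_selected = np.mean([best_mutation(theta) for _ in range(200)], axis=0)

neg_grad = -grad(theta)
cosine = mean_selected @ neg_grad / (np.linalg.norm(mean_selected) * np.linalg.norm(neg_grad))
print(f"cosine(mean selected mutation, -gradient) = {cosine:.3f}")
# Typically prints a value close to 1.0, illustrating the small-mutation,
# large-population limit described above.
```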
evolution does not grow minds, it grows hyperparameters for minds.
Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example for how the results of an optimization process can be unexpected.
I want to distinguish two possible takes here:
It sounds like you're arguing against (1). Fair enough, I too think (1) isn't a great take in isolation. If the evolution analogy does not help you think more clearly about AI at all then I don't think you should change your mind much on the strength of the analogy alone. But my best guess is that most people incl Nate mean (2).
Also relevant is Steven Byrnes' excellent Against evolution as an analogy for how humans will create AGI.
It has been over two years since the publication of that post, and criticism of this analogy has continued to intensify. The OP and other MIRI members have certainly been exposed to this criticism already by this point, and as far as I am aware, no principled defense has been made of the continued use of this example.
I encourage @So8res and others to either stop using this analogy, or to argue explicitly for its continued usage, engaging with the arguments presented by Byrnes, Pope, and others.
But given that this example is so controversial, even if it were right why would you use it -- at least, why would you use it if you had any other example at all to turn to?
Humans are the only real-world example we have of human-level agents, and natural selection is the only process we know of for actually producing them.
SGD, singular learning theory, etc. haven't actually produced human-level minds or a usable theory of how such minds work, and arguably haven't produced anything that even fits into the natural category of minds at all, yet. (Maybe they will pretty soon, when applied at greater scale or in combination with additional innovations, either of which could result in the weird-correlates problem emerging.)
Also, the actual claims in the quote seem either literally true (humans don't care about foods that they model as useful for inclusive genetic fitness) or plausible / not obviously false (when you grow minds [to human capabilities levels], they end up caring about a bunch of weird correlates). I think you're reading the quote as saying something stronger / more specific than it actually is.
Great post. I agree directionally with most of it (and have varying degrees of difference in how I view the severity of some of the problems you mention).
One that stood out to me:
(unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs).
While it's still far from being in a state where it's legibly easy or even probable that we'll solve it, this seems like a route that circumvents some of the problems you mention, and it's where a large amount of whatever probability I assign to non-doom outcomes comes from.
More precisely: insofar as the problem at its core comes down to understanding AI systems deeply enough to make strong claims about whether or not they're safe / have certain alignment-relevant properties, one route to get there is to understand those high-level alignment-relevant things well enough to reliably identify the presence / nature thereof / do other things with, in a large class of systems. I can think of multiple approaches that try to do this, like John's work on abstractions, Paul with ELK (though referring to it as understanding the high-level alignment-relevant property of truth sounds somewhat janky because of the ...
It looks to me like we’re on track for some people to be saying “look how rarely my AI says bad words”, while someone else is saying “our evals are saying that it can’t deceive humans yet”, while someone else is saying “our AI is acting very submissive, and there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing”, and someone else is saying “we’ll just direct a bunch of our AIs to help us solve alignment, while arranging them in a big bureaucracy”, and someone else is saying “we’ve set up the game-theoretic incentives such that if any AI starts betraying us, some other AI will alert us first”, and this is a different sort of situation.
And not one that looks particularly survivable, to me.
And if you ask bureaucrats to distinguish which teams should be allowed to move forward (and how far) in that kind of circus, full of claims, promises, and hunches and poor in theory, then I expect that they basically just can’t.
I'm reminded of a draft post that I started but never finished or published, about the Manhattan Project and its relevance for AI alignment and AI coordination, based on my reading of The Making of the Atomic Bomb.
The histor...
I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).
I think this counterfactual is literally incoherent— it does not make sense to talk about what an individual neural network would do if its "optimization power" were scaled up. It's a category error. You instead need to ask what would happen if the training procedure were scaled up, and there are always many different ways that you can scale it up— e.g. keeping data fixed while parameters increase, or scaling both in lockstep, keeping the capability of the graders fixed, or investing in more capable graders / scalable oversight techniques, etc. So I deny that there is any fact of the matter about whether current LLMs "care about the target" in your sense. I think there probably are sensible ways of cashing out what it means for a 2023 LLM to "care about" something but this is not it.
As others have hinted at/pointed out in the comments, there is an entire science of deep learning out there, including on high-level (vs. e.g. most of low-level mech interp) aspects that can be highly relevant to alignment and that you seem to not be aware of/dismiss. E.g. follow the citation trail of An Explanation of In-context Learning as Implicit Bayesian Inference.
Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)
Nobody's been able to call the specific capabilities of systems in advance. Nobody's been able to call the specific exploits in advance. Nobody's been able to build better cognitive algorithms by hand after understanding how the AI does things we can't yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.
E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever tho...
Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
This includes an assumption that alignment must be done through training signals.
If I shared that assumption, I'd be similarly pessimistic. That seems like trying to aim a rocket with no good theory of gravitation, nor knowledge of the space it needs to pass through.
But alignment needn't be done by defining goals or training signals, letting fly, and hoping. We can pause learning prior to human level (and potential escape), and perform "course corrections". Aligning a partly-trained AI allows us to use its learned representations as goal/value representation, rather than guessing how to create them well enough through training with correlated rewards.
We have proposals that do this for different current approaches to AGI; see The (partial) fallacy of dumb superintelligence for more about them and this line of thinking.
This doesn't entirely avoid the problem that most theories don't work...
I am excited about the concept of uploading, but as I've discussed with fellow enthusiasts... I don't see a way to a working emulation of a human brain (much less an accurate recreation of a specific human brain) that doesn't go through improving our general understanding of how the human brain works. And I think that that knowledge leads to unlocking AI capabilities. So it seems like a tightly information-controlled research project would be needed to not have AI tech leapfrogging over uploading tech while aiming for uploads.
Edit: to be extra clear, I'm trying to point out, to people who might not have thought this through, that there is a clear strategic rationale to think that 'private uploading-directed research is potentially good, but open uploading-directed research is very risky and bad.' Because of my particular bias towards believing in the importance of studying the human brain, I suspect that the ML capabilities side-effects of such research would be substantially worse than the average straightforward ML capabilities advance.
And if you were using it to send payloads to very distant planets at relativistic speeds, you’d still be screwed, because Newtonian mechanics does not account for relativistic effects.
You don't even need to have that extravagant an example; if you use Newtonian mechanics to build a Global Positioning System your calculated locations move at up to 10 kilometers per day—what does that say about condition numbers of values under recursive self-improvement or repeated ontological shifts?
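(To spell out the arithmetic behind that figure, here is a rough back-of-the-envelope sketch. The clock-drift numbers are the standard approximate values quoted for GPS orbits, so treat the result as order-of-magnitude only.)

```python
# Rough estimate of how fast GPS ranging errors accumulate if relativistic
# clock corrections are ignored. GPS satellite clocks run fast by roughly
# +45 microseconds/day (general relativity: weaker gravity at orbital altitude)
# and slow by roughly -7 microseconds/day (special relativity: orbital speed),
# for a net drift of about +38 microseconds/day.
c = 299_792_458.0              # speed of light, m/s
gr_drift_us_per_day = 45.0     # gravitational time dilation (approximate)
sr_drift_us_per_day = -7.0     # velocity time dilation (approximate)

net_drift_s_per_day = (gr_drift_us_per_day + sr_drift_us_per_day) * 1e-6
range_error_km_per_day = c * net_drift_s_per_day / 1000.0
print(f"uncorrected ranging error: ~{range_error_km_per_day:.0f} km per day")
# Prints ~11 km per day, the same order of magnitude as the figure above.
```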
"…there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing"
When your AI includes an LLM extensively trained to simulate human token-generation, anthropomorphizing its behavior is an extremely relevant idea, to the point of being the obvious default assumption.
For example, what I find most concerning about RLHF inducing sycophancy is not the sycophancy itself, which is "mostly harmless", but the likelihood that it's also dragging in all the other more seriously unaligned human behaviors that, in real or fictional humans, typic...
And in part because it’s socially hard to believe, as a regulator, that you should keep telling everyone “no”, or that almost everything on offer is radically insufficient, when you yourself don’t concretely know what insights and theoretical understanding we’re missing.
That's not true. We can end up with a regulator that settles into the posture of "prohibit everything". See IRBs in America, for instance: they make medical experiments plainly insurmountable.
I'd like to offer an alternative to the third point. Let's assume we have built a highly capable AI that we don't yet trust. We've also managed to coordinate as a society and implement defensive mechanisms to get to that point. I think that we don't have to test the AI in a low-stakes environment and then immediately move to a high-stakes one (as described in the dictator analogy), while still getting high gains.
It is feasible to design a sandboxed environment formally proven to be secure, in the sense that you can not hack into, escape from or deliberatel...
unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs
Are you able to provide an example of the kind of thing that would constitute such a theoretical triumph? Or, if not, a maximally close approximation in the form of something that exists currently?
AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition.
Those scientists didn’t succeed, and those days are behind us. For most people working in AI today and dividing up their work hours between tasks, gone is the ambition to understand minds. People working on mechanistic interpretability (and others attempting to build an empirical understanding of modern AIs) are laying an important foundation stone that could play a role in a future science of artificial minds, but on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.
The bitter lesson has been taken to heart, by those at the forefront of the field; and although this lesson doesn't teach us that there's nothing to learn about how AI minds solve problems internally, it suggests that the fastest path to producing more powerful systems is likely to continue to be one that doesn’t shed much light on how those systems work.
Absent some sort of “science of artificial minds”, however, humanity’s prospects for aligning smarter-than-human AI seem to me to be quite dim.
Viewing Earth’s current situation through that lens, I see three major hurdles:
I’ll go into more detail on these three points below. First, though, some background:
Background
By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).
Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steering the future into narrow bands, at least when the world is sufficiently large and full of curveballs.
I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for "tasty" foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)
Separately, I think that most complicated processes work for reasons that are fascinating, complex, and kinda horrifying when you look at them closely.
It’s easy to think that a bureaucratic process is competent until you look at the gears and see the specific ongoing office dramas and politicking between all the vice-presidents or whatever. It’s easy to think that a codebase is running smoothly until you read the code and start to understand all the decades-old hacks and coincidences that make it run. It’s easy to think that biology is a beautiful feat of engineering until you look closely and find that the eyeballs are installed backwards or whatever.
And there’s an art to noticing that you would probably be astounded and horrified by the details of a complicated system if you knew them, and then being astounded and horrified already in advance before seeing those details.[1]
1. Alignment and capabilities are likely intertwined
I expect that if we knew in detail how LLMs are calculating their outputs, we’d be horrified (and fascinated, etc.).
I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).
Gaining this sort of visibility into how the AIs work is, I think, one of the main goals of interpretability research.
And understanding how these AIs work and how they don’t — understanding, for example, when and why they shouldn’t yet be scaled or otherwise pushed to superintelligence — is an important step on the road to figuring out how to make other AIs that could be scaled or otherwise pushed to superintelligence without thereby causing a bleak and desolate future.
But that same understanding is — I predict — going to reveal an incredible mess. And the same sort of reasoning that goes into untangling that mess into an AI that we can aim, also serves to untangle that mess to make the AI more capable. A tangled mess will presumably be inefficient and error-prone and occasionally self-defeating; once it’s disentangled, it won’t just be tidier, but will also come to accurate conclusions and notice opportunities faster and more reliably.[2]
Indeed, my guess is that it’s even easier to see all sorts of things that the AI is doing that are dumb, all sorts of ways that the architecture is tripping itself up, and so on.
Which is to say: the same route that gives you a chance of aligning this AI (properly, not the “it no longer says bad words” superficial-property that labs are trying to pass off as “alignment” these days) also likely gives you lots more AI capabilities.
(Indeed, my guess is that the first big capabilities gains come sooner than the first big alignment gains.)
I think this is true of most potentially-useful alignment research: to figure out how to aim the AI, you need to understand it better; in the process of understanding it better you see how to make it more capable.
If true, this suggests that alignment will always be in catch-up mode: whenever people try to figure out how to align their AI better, someone nearby will be able to run off with a few new capability insights, until the AI is pushed over the brink.
So a first key challenge for AI alignment is a challenge of ordering: how do we as a civilization figure out how to aim AI before we’ve generated unaimed superintelligences plowing off in random directions? I no longer think “just sort out the alignment work before the capabilities lands” is a feasible option (unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs).
Interpretability? Will likely reveal ways your architecture is bad before it reveals ways your AI is misdirected.
Recruiting your AIs to help with alignment research? They’ll be able to help with capabilities long before that (to say nothing of whether they would help you with alignment by the time they could, any more than humans would willingly engage in eugenics for the purpose of redirecting humanity away from Fun and exclusively towards inclusive genetic fitness).
And so on.
This is (in a sense) a weakened form of my answer to those who say, “AI alignment will be much easier to solve once we have a bona fide AGI on our hands.” It sure will! But it will also be much, much easier to destroy the world, when we have a bona fide AGI on our hands. To survive, we’re going to need to either sidestep this whole alignment problem entirely (and take other routes to a wonderful future instead, as I may discuss more later), or we’re going to need some way to do a bunch of alignment research even as that research makes it radically easier and radically cheaper to destroy everything of value.
Except even that is harder than many seem to realize, for the following reason.
2. Distinguishing real solutions from fake ones is hard
Already, labs are diluting the word “alignment” by using the word for superficial results like “the AI doesn’t say bad words”. Even people who apparently understand many of the core arguments have apparently gotten the impression that GPT-4’s ability to answer moral quandaries is somehow especially relevant to the alignment problem, and an important positive sign.
(The ability to answer moral questions convincingly mostly demonstrates that the AI can predict how humans would answer or what humans want to hear, without revealing much about what the AI actually pursues, or would pursue upon reflection, etc.)
Meanwhile, we have little idea of what passes for “motivations” inside of an LLM, or what effect pretraining on next-token prediction and fine-tuning with RLHF really has on the internals. This sort of precise scientific understanding of the internals — the sort that lets one predict weird cognitive bugs in advance — is currently mostly absent in the field. (Though not entirely absent, thanks to the hard work of many researchers.)
Now imagine that Earth wakes up to the fact that the labs aren’t going to all decide to stop and take things slowly and cautiously at the appropriate time.[3] And imagine that Earth uses some great feat of civilizational coordination to halt the world’s capabilities progress, or to otherwise handle the issue that we somehow need room to figure out how these things work well enough to align them. And imagine we achieve this coordination feat without using that same alignment knowledge to end the world (as we could). There’s then the question of who gets to proceed, under what circumstances.
Suppose further that everyone agreed that the task at hand was to fully and deeply understand the AI systems we’ve managed to develop so far, and understand how they work, to the point where people could reverse out the pertinent algorithms and data-structures and what-not. As demonstrated by great feats like building, by-hand, small programs that do parts of what AI can do with training (and that nobody previously knew how to code by-hand), or by identifying weird exploits and edge-cases in advance rather than via empirical trial-and-error. Until multiple different teams, each with those demonstrated abilities, had competing models of how AIs’ minds were going to work when scaled further.
In such a world, it would be a difficult but plausibly-solvable problem, for bureaucrats to listen to the consensus of the scientists, and figure out which theories were most promising, and figure out who needs to be allotted what license to increase capabilities (on the basis of this or that theory that predicts this would be non-catastrophic), so as to put their theory to the test and develop it further.
I’m not thrilled about the idea of trusting an Earthly bureaucratic process with distinguishing between partially-developed scientific theories in that way, but it’s the sort of thing that a civilization can perhaps survive.
But that doesn’t look to me like how things are poised to go down.
It looks to me like we’re on track for some people to be saying “look how rarely my AI says bad words”, while someone else is saying “our evals are saying that it can’t deceive humans yet”, while someone else is saying “our AI is acting very submissive, and there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing”, and someone else is saying “we’ll just direct a bunch of our AIs to help us solve alignment, while arranging them in a big bureaucracy”, and someone else is saying “we’ve set up the game-theoretic incentives such that if any AI starts betraying us, some other AI will alert us first”, and this is a different sort of situation.
And not one that looks particularly survivable, to me.
And if you ask bureaucrats to distinguish which teams should be allowed to move forward (and how far) in that kind of circus, full of claims, promises, and hunches and poor in theory, then I expect that they basically just can’t.
In part because the survivable answers (such as “we have no idea what’s going on in there, and will need way more of an idea what’s going on in there, and that understanding needs to somehow develop in a context where we can do the job right rather than simply unlocking the door to destruction”) aren’t really in the pool. And in part because all the people who really want to be racing ahead have money and power and status. And in part because it’s socially hard to believe, as a regulator, that you should keep telling everyone “no”, or that almost everything on offer is radically insufficient, when you yourself don’t concretely know what insights and theoretical understanding we’re missing.
Maybe if we can make AI a science again, then we’ll start to get into the regime where, if humanity can regulate capabilities advancements in time, then all the regulators and researchers understand that you shall only ask for a license to increase the capabilities of your system when you have a full detailed understanding of the system and a solid justification for why you need the capabilities advance and why it’s not going to be catastrophic. At which point maybe a scientific field can start coming to some sort of consensus about those theories, and regulators can start being sensitive to that consensus.
But unless you can get over that grand hump, it looks to me like one of the key bottlenecks here is bureaucratic legibility of plausible solutions. Where my basic guess is that regulators won’t be able to distinguish real solutions from false ones, in anything resembling the current environment.
Together with the above point ("alignment and capabilities are likely intertwined"), I think this means that our rallying cry should be less “pause to give us more time on alignment research” and more “stop entirely, and find some way to circumvent these woods entirely; we’re not equipped to navigate them”.
(With a backup rallying cry of “make AI a science again”, though again, that only works if you have some way of preventing the science-of-mind from leading to catastrophe before we figure out how to build AIs that care about good stuff rather than bleak and desolate stuff.)
3. Most theories don’t work on the first real try
It seems worth noting that even if you manage to surmount the above two problems, you have a third problem on your hands, which is that when it finally comes time, not to increase your system’s capabilities a bit more (and test your theory of cognition further), but rather to actually do something of note with your AI, then you must, in some sense, be trusting a novel and untested scientific theory (and your engineering efforts grounded in that theory) to work perfectly on the first critical try.
In particular, once your AI is capable of autonomous scientific/technological development, and operating in a domain where it could use that ability to gain a decisive strategic advantage over the rest of the planet, it’s operating in a fundamentally different cognitive regime than it ever was in training.
By analogy, if you’re considering making Bob the dictator of your country, you might first make him mock-dictator of your town, and watch to make sure he doesn’t abuse his power. But, try as you might, that’s still not fundamentally all that reliable a test of whether he’ll actually abuse power once he really has it. Once he really can give orders to the army, once he really can take advantage of the country’s populace, will he do so? The option of “abuse my mock power while being observed by my townspeople (who can still beat me up or deny me the job)” is actually very cognitively different than the option “order the army to intimidate the parliament and ‘oversee’ the next election”.
Now, given a sufficiently well-developed theory of cognition, you can try to read the AI’s mind and predict what cognitive state it would enter if it did really think it had those options. And you can set up simulations (and try to spoof its internal sensations and so on) in a way that your theory of cognition predicts is very similar to the cognitive state it would enter once it really had the option to betray you.
But the link between these states that you induce and observe in the lab, and the actual state where the AI actually has the option to betray you, depends fundamentally on your fresh new theory of cognition.
Actually running the AI until it really has the opportunity to betray you is an empirical test of those theories in an environment that differs fundamentally from the lab setting.
And many a scientist (and programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go super well on the first try.
As a concrete analogy to potentially drive this point home: Newtonian mechanics made all sorts of shockingly-good empirical predictions. It was a simple concise mathematical theory with huge explanatory power that blew every previous theory out of the water. And if you were using it to send payloads to very distant planets at relativistic speeds, you’d still be screwed, because Newtonian mechanics does not account for relativistic effects.
(And the only warnings you’d get would be little hints about light seeming to move at the same speed in all directions at all times of year, and light bending around the sun during eclipses, and the perihelion of Mercury being a little off from what Newtonian mechanics predicted. Small anomalies, weighed against an enormous body of predictive success in a thousand empirical domains; and yet Nature doesn’t care, and the theory still falls apart when we move to energies and scales far outside what we’d previously been able to observe.)
Getting scientific theories to work on the first critical try is hard. (Which is one reason to aim for minimal pivotal tasks — getting a satellite into orbit should work fine on Newtonian mechanics, even if sending payloads long distances at relativistic speeds does not.)
Worrying about this issue is something of a luxury, at this point, because it’s not like we’re anywhere close to scientific theories of cognition that accurately predict all the lab data. But it’s the next hurdle on the queue, if we somehow manage to coordinate to try to build up those scientific theories, in a way where success is plausibly bureaucratically-legible.
Maybe later I’ll write more about what I think the strategy implications of these points are. In short, I basically recommend that Earth pursue other routes to the glorious transhumanist future, such as uploading. (Which is also fraught with peril, but I expect that those perils are more surmountable; I hope to write more about this later.)
Albeit slightly less, since there’s nonzero prior probability on this unknown system turning out to be simple, elegant, and well-designed.
An exception to this guess happens if the AI is at the point where it’s correcting its own flaws and improving its own architecture, in which case, in principle, you might not see much room for capabilities improvements if you took a snapshot and comprehended its inner workings, despite still being able to see that the ends it pursues are not the ones you wanted. But in that scenario, you’re already about to die to the self-improving AI, or so I predict.
Not least because there are no sufficiently clear signs that it’s time to stop — we blew right past “an AI claims it is sentient”, for example. And I’m not saying that it was a mistake to doubt AI systems’ first claims to be sentient — I doubt that Bing had the kind of personhood that’s morally important (though I am by no means confident!). I’m saying that the thresholds that are clear in science fiction stories turn out to be messy in practice and so everyone just keeps plowing on ahead.