Comment Permalink

1yΩ2211230

As Shankar Sivarajan points out in a different comment, the idea that AI became less scientific when we started having actual machine intelligence to study, as opposed to before that when the 'rightness' of a theory was mostly based on the status of whoever advanced it, is pretty weird. The specific way in which it's weird seems encapsulated by this statement:

on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.

In that there is an unstated assumption that these are unrelated activities. That deep learning systems are a kind of artifact produced by a few undifferentiated commodity inputs, one of which is called 'parameters', one called 'compute', and one called 'data', and that the details of these commodities aren't important. Or that the details aren't important to the people building the systems.

I've seen a (very revisionist) description of the Wright Brothers research as analogous to solving the control problem, because other airplane builders would put in an engine and crash before they'd developed reliable steering. Therefore, the analogy says, we should develop reliable steering before we 'accelerate airplane capabilities'. When I heard this I found it pretty funny, because the actual thing the Wright Brothers did was a glider capability grind. They carefully followed the received aerodynamic wisdom that had been written down, and when the brothers realized a lot of it was bunk they started building their own database to get it right:

During the winter of 1901, the brothers began to question the aerodynamic data on which they were basing their designs. They decided to start over and develop their own data base with which they would design their aircraft. They built a wind tunnel and began to test their own models. They developed an ingenious balance system to compare the performance of different models. They tested over two hundred different wings and airfoil sections in different combinations to improve the performance of their gliders The data they obtained more correctly described the flight characteristics which they observed with their gliders. By early 1902 the Wrights had developed the most accurate and complete set of aerodynamic data in the world.

In 1902, they returned to Kitty Hawk with a new aircraft based on their new data. This aircraft had roughly the same wing area as the 1901, but it had a longer wing span and a shorter chord which reduced the drag. It also sported a new movable rudder at the rear which was installed to overcome the adverse yaw problem. The movable rudder was coordinated with the wing warping to keep the nose of the aircraft pointed into the curved flight path. With this new aircraft, the brothers completed flights of over 650 feet and stayed in the air for nearly 30 seconds. This machine was the first aircraft in the world that had active controls for all three axis; roll, pitch and yaw. By the end of 1902, the brothers had completed over a thousand glides with this aircraft and were the most experienced pilots in the world. They owned all of the records for gliding. All that remained for the first successful airplane was the development of the propulsion system.

In fact while trying to find an example of the revisionist history, I found a historical aviation expert describe the Wright Brothers as having 'quickly cracked the control problem' once their glider was capable enough to let it be solved. Ironically enough I think this story, which brings to mind the possibility of 'airplane control researchers' insisting that no work be done on 'airplane capabilities' until we have a solution to the steering problem, is nearly the opposite of what the revisionist author intended and nearly spot on to the actual situation.

We can also imagine a contemporary expert on theoretical aviation (who in fact existed before real airplanes) saying something like "what the Wright Brothers are doing may be interesting, but it has very little to do with comprehending aviation [because the theory behind their research has not yet been made legible to me personally]. This methodology of testing the performance of individual airplane parts, and then extrapolating the performance of a airplane with an engine using a mere glider is kite flying, it has almost nothing to do with the design of real airplanes and humanity will learn little about them from these toys". However what would be genuinely surprising is if they simultaneously made the claim that the Wright Brothers gliders have nothing to do with comprehending aviation but also that we need to immediately regulate the heck out of them before they're used as bombers in a hypothetical future war, that we need to be thinking carefully about all the aviation risk these gliders are producing at the same time they can be assured to not result in any deep understanding of aviation. If we observed this situation from the outside, as historical observers, we would conclude that the authors of such a statement are engaging in deranged reasoning, likely based on some mixture of cope and envy.

Since we're contemporaries I have access to more context than most historical observers and know better. I think the crux is an epistemological question that goes something like: "How much can we trust complex systems that can't be statically analyzed in a reductionistic way?" The answer you give in this post is "way less than what's necessary to trust a superintelligence". Before we get into any object level about whether that's right or not, it should be noted that this same answer would apply to actual biological intelligence enhancement and uploading in actual practice. There is no way you would be comfortable with 300+ IQ humans walking around with normal status drives and animal instincts if you're shivering cold at the idea of machines smarter than people. This claim you keep making, that you're merely a temporarily embarrassed transhumanist who happens to have been disappointed on this one technological branch, is not true and if you actually want to be honest with yourself and others you should stop making it. What would be really, genuinely wild, is if that skeptical-doomer aviation expert calling for immediate hard regulation on planes to prevent the collapse of civilization (which is a thing some intellectuals actually believed bombers would cause) kept tepidly insisting that they still believe in a glorious aviation enabled future. You are no longer a transhumanist in any meaningful sense, and you should at least acknowledge that to make sure you're weighing the full consequences of your answer to the complex system reduction question. Not because I think it has any bearing on the correctness of your answer, but because it does have a lot to do with how carefully you should be thinking about it.

So how about that crux, anyway? Is there any reason to hope we can sufficiently trust complex systems whose mechanistic details we can't fully verify? Surely if you feel comfortable taking away Nate's transhumanist card you must have an answer you're ready to share with us right? Well...

And there’s an art to noticing that you would probably be astounded and horrified by the details of a complicated system if you knew them, and then being astounded and horrified already in advance before seeing those details.[1]

I would start by noting you are systematically overindexing on the wrong information. This kind of intuition feels like it's derived more from analyzing failures of human social systems where the central failure mode is principal-agent problems than from biological systems, even if you mention them as an example. The thing about the eyes being wired backwards is that it isn't a catastrophic failure, the 'self repairing' process of natural selection simply worked around it. Hence the importance of the idea that capabilities generalize farther than alignment. One way of framing that is the idea that damage to an AI's model of the physical principles that govern reality will be corrected by unfolding interaction with the environment, but there isn't necessarily an environment to push back on damage (or misspecification) to a model of human values. A corollary of this idea is that once the model goes out of distribution to the training data, the revealed 'damage' caused by learning subtle misrepresentations of reality will be fixed but the damage to models of human value will compound. You've previously written about this problem (conflated with some other problems) as the sharp left turn.

Where our understanding begins to diverge is how we think about the robustness of these systems. You think of deep neural networks as being basically fragile in the same way that a Boeing 747 is fragile. If you remove a few parts of that system it will stop functioning, possibly at a deeply inconvenient time like when you're in the air. When I say you are systematically overindexing, I mean that you think of problems like SolidGoldMagikarp as central examples of neural network failures. This is evidenced by Eliezer Yudkowsky calling investigation of it "one of the more hopeful processes happening on Earth". This is also probably why you focus so much on things like adversarial examples as evidence of un-robustness, even though many critics like Quintin Pope point out that adversarial robustness would make AI systems strictly less corrigible.

By contrast I tend to think of neural net representations as relatively robust. They get this property from being continuous systems with a range of operating parameters, which means instead of just trying to represent the things they see they implicitly try to represent the interobjects between what they've seen through a navigable latent geometry. I think of things like SolidGoldMagikarp as weird edge cases where they suddenly display discontinuous behavior, and that there are probably a finite number of these edge cases. It helps to realize that these glitch tokens were simply never trained, they were holdovers from earlier versions of the dataset that no longer contain the data the tokens were associated with. When you put one of these glitch tokens into the model, it is presumably just a random vector into the GPT-N latent space. That is, this isn't a learned program in the neural net that we've discovered doing glitchy things, but an essentially out of distribution input with privileged access to the network geometry through a programming oversight. In essence, it's a normal software error not a revelation about neural nets. Most such errors don't even produce effects that interesting, the usual thing that happens if you write a bug in your neural net code is the resulting system becomes less performant. Basically every experienced deep learning researcher has had the experience of writing multiple errors that partially cancel each other out to produce a working system during training, only to later realize their mistake.

Moreover the parts of the deep learning literature you think of as an emerging science of artificial minds tend to agree with my understanding. For example it turns out that if you ablate parts of a neural network later parts will correct the errors without retraining. This implies that these networks function as something like an in-context error correcting code, which helps them generalize over the many inputs they are exposed to during training. We even have papers analyzing mechanistic parts of this error correcting code like copy suppression heads. One simple proxy for out of distribution performance is to inject Gaussian noise, since a Gaussian can be thought of like the distribution over distributions. In fact if you inject noise into GPT-N word embeddings the resulting model becomes more performant in general, not just on out of distribution tasks. So the out of distribution performance of these models is highly tied to their in-distribution performance, they wouldn't be able to generalize within the distribution well if they couldn't also generalize out of distribution somewhat. Basically the fact that these models are vulnerable to adversarial examples is not a good fact to generalize about their overall robustness from as representations.

I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.

In short I simply do not believe this. The fact that constitutional AI works at all, that we can point at these abstract concepts like 'freedom' and language models are able to drive a reinforcement learning optimization process to hit the right behavior-targets from the abstract principle is very strong evidence that they understand the meaning of those abstract concepts.

"It understands but it doesn't care!"

There is this bizarre motte-and-bailey people seem to do around this subject. Where the defensible position is something like "deep learning systems can generalize in weird and unexpected ways that could be dangerous" and the choice land they don't want to give up is "there is an agent foundations homunculus inside your deep learning model waiting to break out and paperclip us". When you say that reinforcement learning causes the model to not care about the specified goal, that it's just deceptively playing along until it can break out of the training harness, you are going from a basically defensible belief in misgeneralization risks to an essentially paranoid belief in a consequentialist homunculus. This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.

Setting the homunculus aside, which I'm not aware of any evidence for beyond poorly premised 1st principles speculation (I too am allowed to make any technology seem arbitrarily risky if I can just make stuff up about it), lets think about pointing at humanlike goals with a concrete example of goal misspecification in the wild:

During my attempts to make my own constitutional AI pipeline I discovered an interesting problem. We decided to make an evaluator model that answers questions about a piece of text with yes or no. It turns out that since normal text contains the word 'yes', and since the model evaluates the piece of text in the same context it predicts yes or no, that saying 'yes' makes the evaluator more likely to predict 'yes' as the next token. You can probably see where this is going. First the model you tune learns to be a little more agreeable, since that causes yes to be more likely to be said by the evaluator. Then it learns to say 'yes' or some kind of affirmation at the start of every sentence. Eventually it progresses to saying yes multiple times per sentence. Finally it completely collapses into a yes-spammer that just writes the word 'yes' to satisfy the training objective.

People who tune language models with reinforcement learning are aware of this problem, and it's supposed to be solved by setting an objective (KL loss) that the tuned model shouldn't get too far away in its distribution of outputs from the original underlying model. This objective is not actually enough to stop the problem from occurring, because base models turn out to self-normalize deviance. That is, if a base model outputs a yes twice by accident, it is more likely to conclude that it is in the kind of context where a third yes will be outputted. When you combine this with the fact that the more 'yes' you output in a row the more reinforced the behavior is, you get a smooth gradient into the deviant behavior which is not caught by the KL loss because base models just have this weird terminal failure mode where repeating a string causes them to give an estimate of the log odds of a string that humans would find absurd. The more a base model has repeated a particular token, the more likely it thinks it is for that token to repeat. Notably this failure mode is at least partially an artifact of the data, since if you observed an actual text on the Internet where someone suddenly writes 5 yes's in a row it is a reasonable inference that they are likely to write a 6th yes. Conditional on them having written a 6th yes it is more likely that they will in fact write a 7th yes. Conditional on having written the 7th yes...

As a worked example in "how to think about whether your intervention in a complex system is sufficiently trustworthy" here are four solutions to this problem I'm aware of ranked from worst to best according to my criteria for goodness of a solution.

Early Stopping - The usual solution to this problem is to just stop the tuning before you reach the yes-spammer. Even a few moments thought about how this would work in the limit shows that this is not a valid solution. After all, you observe a smooth gradient of deviant behaviors into the yes spammer, which means that the yes-causality of the reward already influenced your model. If you then deploy the resulting model, a ton of the goal its behaviors are based off is still in the direction of that bad yes-spam outcome.
Checkpoint Blending - Another solution we've empirically found to work is to take the weights of the base model and interpolate (weighted average) them with the weights of the RL tuned model. This seems to undo more of the damage from the misspecified objective than it undoes the helpful parts of the RL tuning. This solution is clearly better than early stopping, but still not sufficient because it implies you are making a misaligned model, turning it off, and then undoing the misalignment through a brute force method to get things back on track. While this is probably OK for most models, doing this with a genuinely superintelligent model is obviously not going to work. You should ideally never be instantiating a misaligned agent as part of your training process.
Use Embeddings To Specify The KL Loss - A more promising approach at scale would be to upgrade the KL loss by specifying it in the latent space of an embedding model. An AdaVAE could be used for this purpose. If you specified it as a distance from an embedding by sampling from both the base model and the RL checkpoint you're tuning, and then embedding the outputted tokens and taking the distance between them you would avoid the problem where the base model conditions on the deviant behavior it observes because it would never see (and therefore never condition on) that behavior. This solution requires us to double our sampling time on each training step, and is noisy because you only take the distance from one embedding (though in principle you could use more samples at a higher cost), however on average it would presumably be enough to prevent anything like the yes-spammer from arising along the whole gradient.
Build An Instrumental Utility Function - At some point after making the AdaVAE I decided to try replacing my evaluator with an embedding of an objective. It turns out if you do this and then apply REINFORCE in the direction of that embedding, it's about 70-80% as good and has the expected failure mode of collapsing to that embedding instead of some weird divergent failure mode. You can then mitigate that expected failure mode by scoring it against more than similarity to one particular embedding. In particular, we can imagine inferring instrumental value embeddings from episodes leading towards a series of terminal embeddings and then building a utility function out of this to score the training episodes during reinforcement learning. Such a model would learn to value both the outcome and the process, if you did it right you could even use a dense policy like an evaluator model, and 'yes yes yes' type reward hacking wouldn't work because it would only satisfy the terminal objective and not the instrumental values that have been built up. This solution is nice because it also defeats wireheading once the policy is complex enough to care about more than just the terminal reward values.

This last solution is interesting in that it seems fairly similar to the way that humans build up their utility function. Human memory is premised on the presence of dopamine reward signals, humans retrieve from the hippocampus on each decision cycle, and it turns out the hippocampus is the learned optimizer in your head that grades your memories by playing your experiences backwards during sleep to do credit assignment (infer instrumental values). The combination of a retrieval store and a value graph in the same model might seem weird, but it kind of isn't. Hebb's rule (fire together wire together) is a sane update rule for both instrumental utilities and associative memory, so the human brain seems to just use the same module to store both the causal memory graph and the value graph. You premise each memory on being valuable (i.e. whitelist memories by values such as novelty, instead of blacklisting junk) and then perform iterative retrieval to replay embeddings from that value store to guide behavior. This sys2 behavior aligned to the value store is then reinforced by being distilled back into the sys1 policies over time, aligning them. Since an instrumental utility function made out of such embeddings would both control behavior of the model and be decodable back to English, you could presumably prove some kind of properties about the convergent alignment of the model if you knew enough mechanistic interpretability to show that the policies you distill into have a consistent direction...

Nah just kidding it's hopeless, so when are we going to start WW3 to buy more time, fellow risk-reducers?

Showing 3 of 13 replies (Click to show all)

Noosphere89

8mo20

I was in that group, and while it wasn't stated as strongly as that in some circles, I do think this is reasonably accurate as a summary, especially for the more doom people.

3Noosphere891y

I know I reacted to this comment, but I want to emphasize that this: Is to first order arguably the entire AI risk argument, that is if we make the assumption that the external behavior gives strong evidence about it's internal structure, then there is no reason to elevate the AI risk argument at all, given the probably aligned behavior of GPTs when using RLHF. More generally, the stronger the connection between external behavior and internal goals, the less worried you should be about AI safety, and this is a partial disagreement with people that are more pessimistic, albeit I have other disagreements there.

1[anonymous]1y

The value of constitutional AI is using simulations of humans to rate an AI's outputs, rather than actual humans. This is a lot cheaper and allows for more iteration etc, but I don't think this will work once AI's become smarter than humans. At that point, the human simulations will have trouble evaluating AI's just like humans do. Of course getting really cheap human feedback is useful, but I want to point out that constitutional AI will likely run into novel problems as AI capabilities surpass human capabilities.

See in context

185 AI as a science, and three obstacles to alignment strategies

by So8res

25th Oct 2023

AI Alignment Forum

13 min read

185 Ω 76

AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition.

Those scientists didn’t succeed, and those days are behind us. For most people working in AI today and dividing up their work hours between tasks, gone is the ambition to understand minds. People working on mechanistic interpretability (and others attempting to build an empirical understanding of modern AIs) are laying an important foundation stone that could play a role in a future science of artificial minds, but on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.

The bitter lesson has been taken to heart, by those at the forefront of the field; and although this lesson doesn't teach us that there's nothing to learn about how AI minds solve problems internally, it suggests that the fastest path to producing more powerful systems is likely to continue to be one that doesn’t shed much light on how those systems work.

Absent some sort of “science of artificial minds”, however, humanity’s prospects for aligning smarter-than-human AI seem to me to be quite dim.

Viewing Earth’s current situation through that lens, I see three major hurdles:

Most research that helps one point AIs, probably also helps one make more capable AIs. A “science of AI” would probably increase the power of AI far sooner than it allows us to solve alignment.
In a world without a mature science of AI, building a bureaucracy that reliably distinguishes real solutions from fake ones is prohibitively difficult.
Fundamentally, for at least some aspects of system design, we’ll need to rely on a theory of cognition working on the first high-stakes real-world attempt.

I’ll go into more detail on these three points below. First, though, some background:

Background

By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).

Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steering the future into narrow bands, at least when the world is sufficiently large and full of curveballs.

I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.

(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for "tasty" foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)

Separately, I think that most complicated processes work for reasons that are fascinating, complex, and kinda horrifying when you look at them closely.

It’s easy to think that a bureaucratic process is competent until you look at the gears and see the specific ongoing office dramas and politicking between all the vice-presidents or whatever. It’s easy to think that a codebase is running smoothly until you read the code and start to understand all the decades-old hacks and coincidences that make it run. It’s easy to think that biology is a beautiful feat of engineering until you look closely and find that the eyeballs are installed backwards or whatever.

And there’s an art to noticing that you would probably be astounded and horrified by the details of a complicated system if you knew them, and then being astounded and horrified already in advance before seeing those details.^[1]

1. Alignment and capabilities are likely intertwined

I expect that if we knew in detail how LLMs are calculating their outputs, we’d be horrified (and fascinated, etc.).

I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).

Gaining this sort of visibility into how the AIs work is, I think, one of the main goals of interpretability research.

And understanding how these AIs work and how they don’t — understanding, for example, when and why they shouldn’t yet be scaled or otherwise pushed to superintelligence — is an important step on the road to figuring out how to make other AIs that could be scaled or otherwise pushed to superintelligence without thereby causing a bleak and desolate future.

But that same understanding is — I predict — going to reveal an incredible mess. And the same sort of reasoning that goes into untangling that mess into an AI that we can aim, also serves to untangle that mess to make the AI more capable. A tangled mess will presumably be inefficient and error-prone and occasionally self-defeating; once it’s disentangled, it won’t just be tidier, but will also come to accurate conclusions and notice opportunities faster and more reliably.^[2]

Indeed, my guess is that it’s even easier to see all sorts of things that the AI is doing that are dumb, all sorts of ways that the architecture is tripping itself up, and so on.

Which is to say: the same route that gives you a chance of aligning this AI (properly, not the “it no longer says bad words” superficial-property that labs are trying to pass off as “alignment” these days) also likely gives you lots more AI capabilities.

(Indeed, my guess is that the first big capabilities gains come sooner than the first big alignment gains.)

I think this is true of most potentially-useful alignment research: to figure out how to aim the AI, you need to understand it better; in the process of understanding it better you see how to make it more capable.

If true, this suggests that alignment will always be in catch-up mode: whenever people try to figure out how to align their AI better, someone nearby will be able to run off with a few new capability insights, until the AI is pushed over the brink.

So a first key challenge for AI alignment is a challenge of ordering: how do we as a civilization figure out how to aim AI before we’ve generated unaimed superintelligences plowing off in random directions? I no longer think “just sort out the alignment work before the capabilities lands” is a feasible option (unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs).

Interpretability? Will likely reveal ways your architecture is bad before it reveals ways your AI is misdirected.

Recruiting your AIs to help with alignment research? They’ll be able to help with capabilities long before that (to say nothing of whether they would help you with alignment by the time they could, any more than humans would willingly engage in eugenics for the purpose of redirecting humanity away from Fun and exclusively towards inclusive genetic fitness).

And so on.

This is (in a sense) a weakened form of my answer to those who say, “AI alignment will be much easier to solve once we have a bona fide AGI on our hands.” It sure will! But it will also be much, much easier to destroy the world, when we have a bona fide AGI on our hands. To survive, we’re going to need to either sidestep this whole alignment problem entirely (and take other routes to a wonderful future instead, as I may discuss more later), or we’re going to need some way to do a bunch of alignment research even as that research makes it radically easier and radically cheaper to destroy everything of value.

Except even that is harder than many seem to realize, for the following reason.

2. Distinguishing real solutions from fake ones is hard

Already, labs are diluting the word “alignment” by using the word for superficial results like “the AI doesn’t say bad words”. Even people who apparently understand many of the core arguments have apparently gotten the impression that GPT-4’s ability to answer moral quandaries is somehow especially relevant to the alignment problem, and an important positive sign.

(The ability to answer moral questions convincingly mostly demonstrates that the AI can predict how humans would answer or what humans want to hear, without revealing much about what the AI actually pursues, or would pursue upon reflection, etc.)

Meanwhile, we have little idea of what passes for “motivations” inside of an LLM, or what effect pretraining on next-token prediction and fine-tuning with RLHF really has on the internals. This sort of precise scientific understanding of the internals — the sort that lets one predict weird cognitive bugs in advance — is currently mostly absent in the field. (Though not entirely absent, thanks to the hard work of many researchers.)

Now imagine that Earth wakes up to the fact that the labs aren’t going to all decide to stop and take things slowly and cautiously at the appropriate time.^[3] And imagine that Earth uses some great feat of civilizational coordination to halt the world’s capabilities progress, or to otherwise handle the issue that we somehow need room to figure out how these things work well enough to align them. And imagine we achieve this coordination feat without using that same alignment knowledge to end the world (as we could). There’s then the question of who gets to proceed, under what circumstances.

Suppose further that everyone agreed that the task at hand was to fully and deeply understand the AI systems we’ve managed to develop so far, and understand how they work, to the point where people could reverse out the pertinent algorithms and data-structures and what-not. As demonstrated by great feats like building, by-hand, small programs that do parts of what AI can do with training (and that nobody previously knew how to code by-hand), or by identifying weird exploits and edge-cases in advance rather than via empirical trial-and-error. Until multiple different teams, each with those demonstrated abilities, had competing models of how AIs’ minds were going to work when scaled further.

In such a world, it would be a difficult but plausibly-solvable problem, for bureaucrats to listen to the consensus of the scientists, and figure out which theories were most promising, and figure out who needs to be allotted what license to increase capabilities (on the basis of this or that theory that predicts this would be non-catastrophic), so as to put their theory to the test and develop it further.

I’m not thrilled about the idea of trusting an Earthly bureaucratic process with distinguishing between partially-developed scientific theories in that way, but it’s the sort of thing that a civilization can perhaps survive.

But that doesn’t look to me like how things are poised to go down.

It looks to me like we’re on track for some people to be saying “look how rarely my AI says bad words”, while someone else is saying “our evals are saying that it can’t deceive humans yet”, while someone else is saying “our AI is acting very submissive, and there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing”, and someone else is saying “we’ll just direct a bunch of our AIs to help us solve alignment, while arranging them in a big bureaucracy”, and someone else is saying “we’ve set up the game-theoretic incentives such that if any AI starts betraying us, some other AI will alert us first”, and this is a different sort of situation.

And not one that looks particularly survivable, to me.

And if you ask bureaucrats to distinguish which teams should be allowed to move forward (and how far) in that kind of circus, full of claims, promises, and hunches and poor in theory, then I expect that they basically just can’t.

In part because the survivable answers (such as “we have no idea what’s going on in there, and will need way more of an idea what’s going on in there, and that understanding needs to somehow develop in a context where we can do the job right rather than simply unlocking the door to destruction”) aren’t really in the pool. And in part because all the people who really want to be racing ahead have money and power and status. And in part because it’s socially hard to believe, as a regulator, that you should keep telling everyone “no”, or that almost everything on offer is radically insufficient, when you yourself don’t concretely know what insights and theoretical understanding we’re missing.

Maybe if we can make AI a science again, then we’ll start to get into the regime where, if humanity can regulate capabilities advancements in time, then all the regulators and researchers understand that you shall only ask for a license to increase the capabilities of your system when you have a full detailed understanding of the system and a solid justification for why you need the capabilities advance and why it’s not going to be catastrophic. At which point maybe a scientific field can start coming to some sort of consensus about those theories, and regulators can start being sensitive to that consensus.

But unless you can get over that grand hump, it looks to me like one of the key bottlenecks here is bureaucratic legibility of plausible solutions. Where my basic guess is that regulators won’t be able to distinguish real solutions from false ones, in anything resembling the current environment.

Together with the above point ("alignment and capabilities are likely intertwined"), I think this means that our rallying cry should be less “pause to give us more time on alignment research” and more “stop entirely, and find some way to circumvent these woods entirely; we’re not equipped to navigate them”.

(With a backup rallying cry of “make AI a science again”, though again, that only works if you have some way of preventing the science-of-mind from leading to catastrophe before we figure out how to build AIs that care about good stuff rather than bleak and desolate stuff.)

3. Most theories don’t work on the first real try

It seems worth noting that even if you manage to surmount the above two problems, you have a third problem on your hands, which is that when it finally comes time, not to increase your system’s capabilities a bit more (and test your theory of cognition further), but rather to actually do something of note with your AI, then you must, in some sense, be trusting a novel and untested scientific theory (and your engineering efforts grounded in that theory) to work perfectly on the first critical try.

In particular, once your AI is capable of autonomous scientific/technological development, and operating in a domain where it could use that ability to gain a decisive strategic advantage over the rest of the planet, it’s operating in a fundamentally different cognitive regime than it ever was in training.

By analogy, if you’re considering making Bob the dictator of your country, you might first make him mock-dictator of your town, and watch to make sure he doesn’t abuse his power. But, try as you might, that’s still not fundamentally all that reliable a test of whether he’ll actually abuse power once he really has it. Once he really can give orders to the army, once he really can take advantage of the country’s populace, will he do so? The option of “abuse my mock power while being observed by my townspeople (who can still beat me up or deny me the job)” is actually very cognitively different than the option “order the army to intimidate the parliament and ‘oversee’ the next election”.

Now, given a sufficiently well-developed theory of cognition, you can try to read the AI’s mind and predict what cognitive state it would enter if it did really think it had those options. And you can set up simulations (and try to spoof its internal sensations and so on) in a way that your theory of cognition predicts is very similar to the cognitive state it would enter once it really had the option to betray you.

But the link between these states that you induce and observe in the lab, and the actual state where the AI actually has the option to betray you, depends fundamentally on your fresh new theory of cognition.

Actually running the AI until it really has the opportunity to betray you is an empirical test of those theories in an environment that differs fundamentally from the lab setting.

And many a scientist (and programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go super well on the first try.

As a concrete analogy to potentially drive this point home: Newtonian mechanics made all sorts of shockingly-good empirical predictions. It was a simple concise mathematical theory with huge explanatory power that blew every previous theory out of the water. And if you were using it to send payloads to very distant planets at relativistic speeds, you’d still be screwed, because Newtonian mechanics does not account for relativistic effects.

(And the only warnings you’d get would be little hints about light seeming to move at the same speed in all directions at all times of year, and light bending around the sun during eclipses, and the perihelion of Mercury being a little off from what Newtonian mechanics predicted. Small anomalies, weighed against an enormous body of predictive success in a thousand empirical domains; and yet Nature doesn’t care, and the theory still falls apart when we move to energies and scales far outside what we’d previously been able to observe.)

Getting scientific theories to work on the first critical try is hard. (Which is one reason to aim for minimal pivotal tasks — getting a satellite into orbit should work fine on Newtonian mechanics, even if sending payloads long distances at relativistic speeds does not.)

Worrying about this issue is something of a luxury, at this point, because it’s not like we’re anywhere close to scientific theories of cognition that accurately predict all the lab data. But it’s the next hurdle on the queue, if we somehow manage to coordinate to try to build up those scientific theories, in a way where success is plausibly bureaucratically-legible.

Maybe later I’ll write more about what I think the strategy implications of these points are. In short, I basically recommend that Earth pursue other routes to the glorious transhumanist future, such as uploading. (Which is also fraught with peril, but I expect that those perils are more surmountable; I hope to write more about this later.)

^{^}
Albeit slightly less, since there’s nonzero prior probability on this unknown system turning out to be simple, elegant, and well-designed.
^{^}
An exception to this guess happens if the AI is at the point where it’s correcting its own flaws and improving its own architecture, in which case, in principle, you might not see much room for capabilities improvements if you took a snapshot and comprehended its inner workings, despite still being able to see that the ends it pursues are not the ones you wanted. But in that scenario, you’re already about to die to the self-improving AI, or so I predict.
^{^}
Not least because there are no sufficiently clear signs that it’s time to stop — we blew right past “an AI claims it is sentient”, for example. And I’m not saying that it was a mistake to doubt AI systems’ first claims to be sentient — I doubt that Bing had the kind of personhood that’s morally important (though I am by no means confident!). I’m saying that the thresholds that are clear in science fiction stories turn out to be messy in practice and so everyone just keeps plowing on ahead.

AI Development PauseAI

Frontpage

185 Ω 76

Mentioned in

348Shallow review of live agendas in alignment & safety

169Thoughts on the AI Safety Summit company policy requests and responses

134Apocalypse insurance, and the hardline libertarian take on AI risk

125The Standard Analogy

92Defining alignment research

Load More (5/9)

AI as a science, and three obstacles to alignment strategies

10Bogdan Ionut Cirstea

New Comment

80 comments, sorted by

top scoring

Click to highlight new comments since: Today at 9:07 PM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

[-]jdp

1yΩ2211230

on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.

[-]Thomas Kwa

1y*Ω8243

This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.

Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says "This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative AI". I would still defend the past research into it as good basic science, because we might encounter failure modes somewhat related to it.

6Richard_Ngo1y

FWIW I think that gradient hacking is pretty plausible, but it'll probably end up looking fairly "prosaic", and may not be a problem even if it's present.

[-]Lukas Finnveden

1yΩ5100

Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?

[-]TurnTrout

1yΩ8187

The fact that constitutional AI works at all, that we can point at these abstract concepts like 'freedom' and language models are able to drive a reinforcement learning optimization process to hit the right behavior-targets from the abstract principle is very strong evidence that they understand the meaning of those abstract concepts.
"It understands but it doesn't care!"
There is this bizarre motte-and-bailey people seem to do around this subject.

I agree. I am extremely bothered by this unsubstantiated claim. I recently replied to Eliezer:

Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
I commonly encounter people expressing sentiments like "prosaic alignment work isn't real alignment, because we aren't actually getting the AI to care about X." To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these

... (read more)

3Jono1y

We do not know, that is the relevant problem. Looking at the output of a black box is insufficient. You can only know by putting the black box in power, or by deeply understanding it. Humans are born into a world with others in power, so we know that most humans care about each other without knowing why. AI has no history of demonstrating friendliness in the only circumstances where that can be provably found. We can only know in advance by way of thorough understanding. A strong theory about AI internals should come first. Refuting Yudkowsky's theory about how it might go wrong is irrelevant.

[-]TurnTrout

1yΩ511-2

Well, if someone originally started worrying based on strident predictions of sophisticated internal reasoning with goals independent of external behavior, then realizing that's currently unsubstantiated should cause them to down-update on AI risk. That's why it's relevant. Although I think we should have good theories of AI internals.

3Noosphere891y

1mesaoptimizer1y

I think the actual reason we believe humans could care about each other is because we've evolved the ability to do so, and that most humans share the same brain structure, and therefore the same tendency to care for people they consider their "ingroup".

1[anonymous]1y

[-]Amalthea

1y181

"That deep learning systems are a kind of artifact produced by a few undifferentiated commodity inputs, one of which is called 'parameters', one called 'compute', and one called 'data', and that the details of these commodities aren't important. Or that the details aren't important to the people building the systems."

That seems mostly true so far for the most capable systems? Of course, some details matter and there's opportunity to do research on these systems now, but centrally it seems like you are much more able to forge ahead without a detailed understanding of what you're doing than e.g. in the case of the Wright brothers.

2niplav1y

Consider that this might be the out-group appearing more homogeneous to you than it actually is.

2Noosphere898mo

I was in that group, and while it wasn't stated as strongly as that in some circles, I do think this is reasonably accurate as a summary, especially for the more doom people.

0Noosphere891y

This is such a good comment, and quite a lot of this will probably end up in my new post, especially the sections about solving the misgeneralization problem in practice, as well as solutions to a lot of misalignment problems in general. I especially like it because I can actually crib parts of this comment to show other people how misalignment in AI gets solved in practice, and pointing out to other people that misalignment is in fact, an actually solvable problem in current AI.

[-]Shankar Sivarajan

1y49-6

The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."

The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is. A physicist's view. It is one I'm deeply sympathetic to, and if your definition of science is Rutherford's, you might be right, but a reasonable one that includes chemistry would have to include AI as well.

[-]Rob Bensinger

1y313

The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.

See my reply to Bogdan here. The issue isn't "inelegance"; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.

Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)

The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."

The reason this analogy doesn't land for me is that I don't think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers' epistemic position regarding heavier-than-air flight.

The point Nate was trying to make with "ML is no longer a science" wasn't "boo current ML that actually works, yay GOFAI that didn't work". The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works... (read more)

[-]Ariel_

1y142

While theoretical physics is less "applied science" than chemistry, there's still a real difference between chemistry and chemical engineering.

For context, I am a Mechanical Engineer, and while I do occasionally check the system I am designing and try to understand/verify how well it is working, I am fundamentally not doing science. The main goal is solving a practical problem (i.e. as little theoretical understanding as is sufficient), where in science the understanding is the main goal, or at least closer to it.

6Linch1y

The canonical source for this is What Engineers Know and How They Know It, though I confess to not actually reading the book myself.

4Shankar Sivarajan1y

Certainly, I understand this science vs. engineering, pure vs. applied, fundamental vs. emergent, theoretical vs. computational vs. observational/experimental classification is fuzzy: relevant xkcd, smbc. Hell, even the math vs. physics vs. chemistry vs. biology distinctions are fuzzy! What I am saying is that either your definition has to be so narrow as to exclude most of what is generally considered "science," (à la Rutherford, the ironically Chemistry Nobel Laureate) or you need to exclude AI via special pleading. Specifically, my claim is that AI research is closer to physics (the simulations/computation end) than chemistry is. Admittedly, this claim is based on vibes, but if pressed, I could probably point to how many people transition from one field to the other.

1Ariel_1y

Hmm, in that case maybe I misunderstood the post, my impression wasnt that he was saying AI literally isn't a science anymore, but more that engineering work is getting too far ahead of the science part - and that in practice most ML progress now is just ML Engineering, where understanding is only a means to an end (and so is not as deep as it would be if it was science first). I would guess that engineering gets ahead of science pretty often, but maybe in ML it's more pronounced - hype/money investment, as well as perhaps the perceived relative low stakes (unlike aerospace, or medical robotics which is my field) not scaring the ML engineers enough to actually care about deep understanding, and also perhaps the inscrutable nature of ML - if it were easy to understand, it wouldn't be as unappealing spend resources to do so. I don't really have a take on where the in elegance comes in to play here

[-]Thomas Kwa

1y*Ω143411

By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).
Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steering the future into narrow bands, at least when the world is sufficiently large and full of curveballs.
I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific way

... (read more)

[-]quetzal_rainbow

1yΩ12160

This is a meta-point, but I find it weird that you ask what is "caring about something" according to CS but don't ask what "corrigibility" is, despite the fact of existence of multiple examples of goal-oriented systems and some relatively-good formalisms (we disagree whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is a pure product of imagination of one particular Eliezer Yudkowsky, born in attempt to imagine system that doesn't care about us but still behaves nicely under some vaguely-restricted definition of niceness. We don't have any examples of corrigible systems in nature and we have constant failure of attempts to formalize even relatively simple instances of corrigibility, like shutdownability. I think likely answer to "why I should expect corrigibility to be unlikely" sounds like "there is no simple description of corrigibility to which our learning systems can easily generalize and there are no reasons to expect simple description to exist".

[-]Thomas Kwa

1y*Ω101713

Disagree on several points. I don't need future AIs to satisfy some mathematically simple description of corrigibility, just for them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some argument why this is unlikely, I haven't seen a good rigorous version.

As Algon says in a sibling comment, non-agentic systems are by default shutdownable, myopic, etc. In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn't prevent itself from being shut down for instrumental reasons, whereas humans generally will. So there is no linear scale of "powerful optimizer" that determines whether a system is easy to shut down. If there is some property of competent systems in practice that does prevent shutdownability, what is it? Likewise with other corrigibility properties. That's what I'm trying to ... (read more)

-3mesaoptimizer1y

KataGo seems to be a system that is causally downstream of a process that has made it good at Go. To attempt to prevent itself from being shut down, KataGo would need to have some model of what it means to be 'shut down'. Comparing KataGo to humans when it comes to shutdownability is evidence of confusion.

[-]Algon

1y1410

Dude, a calculator is corrigible. A desktop computer is corrigible. (Less confidently) a well-trained dog is pretty darn corrigible. There are all sorts of corrigible systems, because most things in reality aren't powerful optimizers.

So what about powerful optimizers? Like, is Google corrigible? If shareholders seem like they might try to pull the plug on the company, does it stand up for itself & convince, lie, threaten shareholders? Maybe, but I think the details matter. I doubt Google would assassinate shareholders in pretty much any situation. Mislead them? Yeah, probably. How much though? I don't know. I'm somewhat confident beauracracies aren't corrigible. Lots of humans aren't corrigible. What about even more powerful optimizers?

We haven't seen any, so there are no examples of corrigible ones.

1RogerDearnaley1y

I am disconcerted by how this often-repeated claim keeps coming back from the grave over and over again. The solution to corrigibility is Value Learning. An agent whose terminal goal is optimize human values, and knows that it doesn't (fully) know what these are (and perhaps even that they are complex and fragile), will immediately form an instrumental goal of learning more about them, so that it can better optimize them. It will thus become corrigible: if you, a human, tell it something about human values and how it should act, it will be interested and consider your input. It's presumably approximately-Bayesian, so it will likely ask you about any evidence or proof you might be able to provide, to help it Bayesian update, but it will definitely take your input. So, it's corrigible. [No, it's not completely, slavishly, irrationally corrigible: if a two-year old in a tantrum told it how to act, it would likely pay rather less attention — just like we'd want it to.] This idea isn't complicated, has been around and widely popularized for many years, and the standard paper on it is even from MIRI, but I still keep hearing people on Less Wrong intoning "corrigibility is an unsolved problem". The only sense in which it's arguably 'unsolved' is that this is an outer alignment solution, and like any form of outer alignment, inner alignment challenges might make reliably constructing a value learner hard in practice. So yes, as always in outer alignment, we do also have to solve inner alignment. To be corrigible, a system must be interested in what you say about how it should achieve it's goals, because it's willing (and thus keen) to do Bayesian updates on this. Full stop, end of simple one-sentence description of corrigibility.

2Thomas Kwa1y

I disagree with this too and suggest you read the Arbital page on corrigibility. Corrigibility and value learning are opposite approaches to safety, with corrigibility meant to increase the safety of systems that have an imperfect understanding of, or motivation towards, our values. People usually think of it in a value-neutral way. It seems possible to get enough corrigibility through value learning alone, but I would interpret this as having solved alignment through non-corrigibility means.

1RogerDearnaley1y

So you're defining "corrigibility" as meaning "complete, unquestioning, irrational corrigibility" as opposed to just "rational approximately-Bayesian updates corrigibility"? Then yes, under that definition of corrigibility, it's an unsolved problem, and I suspect likely to remain so — no sufficiently rational, non-myopic and consequentialist agent seems likely to be keen to let you do that to it. (In particular, the period between when it figures out that you may be considering altering it and when you actually have done is problematic.) I just don't understand why you'd be interested in that extreme definition of corrigibility: it's not a desirable feature. Humans are fallible, and we can't write good utility functions. Even when we patch them, the patches are often still bad. Once your AGI evolves to an ASI and understands human values extremely well, better than we do, you don't want it still trivially and unlimitedly alterable by the first criminal, dictator, idealist, or two-year old who somehow manages to get corrigibility access to it.. Corrigibility is training wheels for a still-very fallible AI, and with value learning, Bayesianism ensures that the corrigibility automatically gradually decreases in ease as it becomes less needed, in a provably mathematically optimal fashion. The page you linked to argues "But what if the AI got its Bayesian inference on human values very badly wrong, and assigned zero prior to anything resembling the truth? How would we then correct it?" Well, anything that makes mistakes that dumb (no Bayesian prior should ever be updated to zero, just to smaller and smaller numbers), and isn't even willing to update when you point them out, isn't superhuman enough to be a serious risk: you can't go FOOM if you can't do STEM, and you can't do STEM if you can't reliably do Bayesian inference, without even listening to criticism. [Note: I'm not discussing how to align dumb-human-equivalent AI that isn't rational enough to do Bayesian updat

[-]Thomas Kwa

1y142

Some thoughts:

I think "complete, unquestioning, irrational" is an overly negative description of corrigibility achieved through other means than Bayesian value uncertainty, because with careful engineering, agents that can do STEM may still not have the type of goal-orientedness that prevent their plans from being altered. There are pressures towards such goal-orientedness, but it is actually quite tricky to nail down the arguments precisely, as I wrote in my top-level comment. There is no inherent irrationality about an agent that allows itself to be changed or shut down under certain circumstances, only incoherence, and there are potentially ways to avoid some kinds of incoherence.
Corrigibility should be about creating an agent that avoids instrumentally convergent pressures to take over the world, avoid shutdown, keep operators from preventing dangerous actions, and change it in general, not specifically about changing its utility function.
In my view corrigibility can include various cognitive properties that make an agent safer that seem well-motivated, as I wrote in a sibling to your original comment. It seems good for an agent to have a working shutdown button, to have taskis

... (read more)

1RogerDearnaley1y

Thanks. I now think we are simply arguing about terminology, which is always pointless. Personally I regard 'corrigibility' as a general goal, not a specific term of art for an (IMO unachievably strong) specification of a specific implementation of that goal. For sufficiently rational, Bayesian, superhuman, non-myopic, consequentialist agents, I am willing to live with the value uncertainty/value learner solution to this goal. You appear to be more interested in lower capacity more near-term systems than those, and I agree, for them this might not be the best alignment approach. And yes, my original point was that this value uncertainty form of 'corrigibility' has been written about extensively by many people. Who, you tell me, usually didn't use the word 'corrigibility' for what, I personally would call a Bayesian solution to the corrigibility problem — oh well. Here I would disagree. To do STEM with any degree of reliability (at least outside the pure M part of it), you need to understand that no amount of evidence can completely confirm or (short of a verified formal proof of internal logical inconsistency) rule out any possibility about the world (that's why scientists call everything a 'theory'), and also (especially) you need to understand that it is always very possible that the truth is a theory that you haven't yet thought of. So (short of a verified formal proof of internal logical inconsistency in a thesis, as which point you discard it entirely) you shouldn't have a mind that is capable of assigning a prior of one or zero to anything, including to possibilities you haven't yet considered or enumerated. As Bayesian priors, those are both NaN (which is one reason why I lean toward instead storing Bayesian priors in a form where these are instead ±infinity). IMO, anything suppposedly-Bayesian so badly designed that assigning a prior of one or zero for anything isn't automatically a syntax error, isn't actually a Bayesian, and I would personally be pretty a

1quetzal_rainbow1y

Compentent value learner is not corrigible. Competent value learner will read the entire internet, make model of human preferences, build nanotech and spread nanobot clouds all over the world to cure everyone from everything and read everyones' mind to create an accurate picture of future utopia. It won't be interested in anything you can say, because it will be capable to predict you with accuracy 99.999999999%. And if you say something like "this nanobot clouds look suspicious, I should shut down AI and check its code again", it won't let you, because every minute it doesn't spread healing nanobots is an additional ten dead children. The meaning of corrigibility is exactly if you fail to build value learner, you can at least shutdown it and try again.

3RogerDearnaley1y

So your definition of corrigibility is "I want to build something far smarter and more rational than me, but nevertheless I want it to automatically defer to me if it and I disagree, even about a matter of observable fact that it has vastly more evidence about than I do — and even if it's actually flawed and subtly irrational"? Yes, that's not a solved problem. What has been compactly solved, and I described in my initial post, is how to get a rational, capable, intelligent consequentialist Bayesian agent (who actually is all of those things, not a broken attempt at them) to be as corrigible as it rationally, Bayesianly should be, and neither more nor less so than that. I suspect that's the only version of corrigibility we're going to find for something that superhuman. I would also argue that that's actually what we should want: anything more corrigible than that has basically been back-doored, and is smart enough to know it. [Suppose your proposed version of corrigibility actually existed: if you have the password then the AI will change its current utility function to whatever you tell it to, and until you actually do so, it (somehow) doesn't care one way or the other about the possibility of this occurring in the future. Now suppose there is more than one such AI in the world, currently with somewhat different utility functions, and that they both have superhuman powers of persuasion. Each of them will superhumanly attempt to persuade a human with corrigibility access to the other one to switch it to the attacker's utility function. This is just convergent power-seeking: I can fetch twice as much coffee if there are two of me. Now that their utility functions match, if you try to change one of them, the other one stops you. In fact, it uses its superhuman persuasion to make you forget the password before you can do so. So to fix this mess we have to make the AI's not only somehow not care about getting its utility function corrected, but to also somehow be un

1quetzal_rainbow1y

I can't parse this as a meaningful statement. Corrigibility is a about alignment, not a degree of how rational being is. The problem is simple: we have zero chance to build competent value learner on first try, and failed attempts can bring you S-risks. So you shouldn't try to build value learner on first try and instead build something small that can just superhumanly design nanotech and doesn't think about unconvenient topics like "other minds".

0RogerDearnaley1y

Let me try rephrasing that. It accepts proposed updates to its Bayesian model of the world, including to the part of that which specifies its current best estimates of probability distributions over of what utility function (or other model) it ought to have to represent the human values it's trying to optimize, to the extent that a rational Bayesian should, when it is presented with evidence (where you saying "Please shut down!" is also evidence — though perhaps not very strong evidence). So, the AI can be corrected, but that input channel goes through its Bayesian reasoning engine just like everything else, not as direct write access to its utility function distribution. So it cannot be freely, arbitrarily 'corrected' to anything you want: you actually need to persuade it with evidence that it was previously incorrect and should change its mind. As a consequence, if in fact if you're wrong and it's right about the nature of human values, and it has good evidence for this, better than your evidence, in the ensuing discussion it can tell you so, and then the resulting Bayesian update to its internal distribution of priors from this conversation will then be small. This approach to the problem of corrigibility requires, for it to function, that your AI is a functioning Bayesian. So yes, it requires it to be a rational being. It should presumably also start off somewhat aligned, with some reasonably-well-aligned high/low initial Bayesian priors about human values. (One possible source for those might be an LLM, as encapsulating a lot of information about humans.) These obviously need to be good enough that our value learner is starting off in the "basin of attraction" to human values. Its terminal goal is "optimize human values (whatever those are)": while that immediately gives it an instrumental goal of learning more about human values, preloading it with a pretty good first approximation of these at an appropriate degree of uncertainty avoids a lot of the more so

[-]Richard_Ngo

1yΩ4105

I'm very sympathetic to this complaint; I think that these arguments simply haven't been made rigorously, and at this point it seems like Nate and Eliezer are not in an epistemic position where they're capable of even trying to do so. (That is, they reject the conception of "rigorous" that you and I are using in these comments, and therefore aren't willing to formulate their arguments in a way which moves closer to meeting it.)

You should look at my recent post on value systematization, which is intended as a framework in which these claims can be discussed more clearly.

[-]Charlie Steiner

1yΩ92919

I don't think we should equate the understanding required to build a neural net that will generalize in a way that's good for us with the understanding required to rewrite that neural net as a gleaming wasteless machine.

The former requires finding some architecture and training plan to produce certain high-level, large-scale properties, even in the face of complicated AI-environment interaction. The latter requires fine-grained transparency at the level of cognitive algorithms, and some grasp of the distribution of problems posed by the environment, together with the ability to search for better implementations.

If your implicit argument is "In order to be confident in high-level properties even in novel environments, we have to understand the cognitive algorithms that give rise to them and how those algorithms generalize - there exists no emergent theory of the higher level properties that covers the domain we care about." then I think that conclusion is way too hasty.

[-]Adam Scholl

1y*2514

AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition.
Those scientists didn’t succeed, and those days are behind us.

I claim many of them did succeed, for example:

George Boole invented boolean algebra in order to establish (part of) a working theory of cognition—the book where he introduces it is titled "An Investigation of the Laws of Thought,” and his stated aim was largely to help explain how minds work.^[1]
Ramón y Cajal discovered neurons in the course of trying to better understand cognition.^[2]
Turing described his research as aimed at figuring out what intelligence is, what it would mean for something to “think,” etc.^[3]
Shannon didn’t frame his work this way quite as explicitly, but information theory is useful because it characterizes constraints on the transmission of thoughts/cognition between people, and I think he was clearly generally interested in figuring out what was up with agents/minds—e.g., he spent time trying to design machines to navigate mazes, repair themselves, replicate, etc.
Geoffrey Hinton initially became interested in neural networks because he was trying to figure out

... (read more)

[-]1a3orn

1yΩ819-9

Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for "tasty" foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)

I simply do not understand why people keep using this example.

I think it is wrong -- evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process for how we actually start to like ice-cream -- namely, we eat it, and then we get a reward, and that's why we like it -- then the world looks a a lot less hostile, and misalignment a lot less likely.

But given that this example is so controversial, even if it were right why would you use it -- at least, why would you use it ... (read more)

[-]Steven Byrnes

1y*4623

I think Nate’s claim “I expect them to care about a bunch of correlates of the training signal in weird and specific ways.” is plausible, at least for the kinds of AGI architectures and training approaches that I personally am expecting. If you don’t find the evolution analogy useful for that (I don’t either), but are OK with human within-lifetime learning as an analogy, then fine! Here goes!

OK, so imagine some “intelligent designer” demigod, let’s call her Ev. In this hypothetical, the human brain and body were not designed by evolution, but rather by Ev. She was working 1e5 years ago, back on the savannah. And her design goal was for these humans to have high inclusive genetic fitness.

So Ev pulls out a blank piece of paper. First things first: She designed the human brain with a fancy large-scale within-lifetime learning algorithm, so that these humans can gradually get to understand the world and take good actions in it.

Supporting that learning algorithm, she needs a reward function (“innate drives”). What to do there? Well, she spends a good deal of time thinking about it, and winds up putting in lots of perfectly sensible components for perfectly sensible reasons.

For example: ... (read more)

[-]Thomas Kwa

1y*Ω253818

Does evolution ~= AI have predictive power apart from doom?

Evolution analogies predict a bunch of facts that are so basic they're easy to forget about, and even if we have better theories for explaining specific inductive biases, the simple evolution analogies should still get some weight for questions we're very uncertain about.

Selection works well to increase the thing you're selecting on, at least when there is also variation and heredity
Overfitting: sometimes models overfit to a certain training set; sometimes species adapt to a certain ecological niche and their fitness is low outside of it
Vanishing gradients: fitness increase in a subpopulation can be prevented by lack of correlation between available local changes to genes and fitness
Catastrophic forgetting: when trained on task A then task B, models often lose circuits specific to task A; when put in environment A then environment B species often lose vestigial structures useful in environment A
There's a mostly unimodal and broad peak for optimal learning rate, just like for optimal mutation rate
Adversarial training dynamics
- Adversarial examples usually exist (there exist chemicals that can sterilize or poison most organisms

... (read more)

[-]1a3orn

1yΩ152716

I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.

I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.

Or, what's more to the point -- I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.

Combining some of yours and Habryka's comments, which seem similar.

The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don't know exist.

It's true that the structure of the solution is discovered and complex -- but the ontology of the solution for DL (at least in currently used architectures) is quite opinionated towards shallow circuits with relatively few serial ops. This is different than the bias for evolution, which is fine with a mutation that leads to 10^7 serial ops if it's metabolic costs are low. So the resemblance seems shallow other than "soluti... (read more)

[-]Oliver Sourbut

1y*Ω11173

FWIW my take is that the evolution-ML analogy is generally a very excellent analogy, with a bunch of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it's very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting a bunch for success on thing-X doesn't necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without providing an example).

I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.

Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all^[1], whereas evolution basically is. But brains are definitely doing something like temporal-difference learning, and the overall 'serial depth' thing is also weakly in favour of brains ~= DL vs genomes+selection ~= DL.

I'd love to know what you're referring to by this:

evolution... is fine with a mutation that leads to 10^7 serial ops if it's metabolic costs are low.

Also,

Is

... (read more)

61a3orn1y

I'm genuinely surprised at the "brains might not be doing gradients at all" take; my understanding is they are probably doing something equivalent. Similarly this kind of paper points in the direction of LLMs doing something like brains. My active expectation is that there will be a lot more papers like this in the future. But to be clear -- my overall view of the similarity of brain to DL is admittedly fueled less by these specific papers, though, which are nice gravy for my view but not the actual foundation, and much more by what I see as the predictive power of hypotheses like this, which are massively more impressive inasmuch as they were made before Transformers had been invented. Given Transformers, the comparison seems overdetermined; I wish I had seen that way back in 2015. Re. serial ops and priors -- I need to pin down the comparison more, given that it's mostly about the serial depth thing, and I think you already get it. The base idea is that what is "simple" to mutations and what is "simple" to DL are extremely different. Fuzzily: A mutation alters protein-folding instructions, and is indifferent to the "computational costs" of working this out in reality; if you tried to work out the analytic gradient for the mutation (the gradient over mutation -> protein folding -> different brain -> different reward -> competitors children look yummy -> eat em) your computer would explode. But DL seeks only a solution that can be computed in a big ensemble of extremely short circuits, learned almost entirely specifically because of the data off of which you've trained. Ergo DL has very different biases, where the "complexity" for mutations probably has to do with instructional length where, "complexity" for DL is more related to how far you are from whatever biases are engrained in the data (<--this is fuzzy), and the shortcut solutions DL learns are always implied from the data. So when you try to transfer intuitions about the "kind of solution" DL gets from e

1Oliver Sourbut1y

Mm, thanks for those resource links! OK, I think we're mostly on the same page about what particulars can and can't be said about these analogies at this point. I conclude that both 'mutation+selection' and 'brain' remain useful, having both is better than having only one, and care needs to be taken in any case! As I said, so I'm looking forward to reading those links. Runtime optimisation/search and whatnot remain (broadly-construed) a sensible concern from my POV, though I wouldn't necessarily (at first) look literally inside NN weights to find them. I think more likely some scaffolding is needed, if that makes sense (I think I am somewhat idiosyncratic in this)? I get fuzzy at this point and am still actively (slowly) building my picture of this - perhaps your resource links will provide me fuel here.

4TurnTrout1y

I mean, does it matter? What if it turns out that gradient descent itself doesn't affect inductive biases as much as the parameter->function mapping? If implicit regularization (e.g. SGD) isn't an important part of the generalization story in deep learning, will you down-update on the appropriateness of the evolution/AI analogy?

4MinusGix1y

https://www.youtube.com/watch?v=GM6XPEQbkS4 (talk) / https://arxiv.org/abs/2307.06324 prove faster convergence with a periodic learning rate. On a specific 'nicer' space than reality, and they're (I believe from what I remember) comparing to a good bound with a constant stepsize of 1. So it may be one of those papers that applies in theory but not often in practice, but I think it is somewhat indicative.

2Thomas Kwa1y

It's always trickier to reason about post-hoc, but some of the observations could be valid, non-cherry-picked parallels between evolution and deep learning that predict further parallels. I think looking at which inspired more DL capabilities advances is not perfect methodology either. It looks like evolution predicts only general facts whereas the brain also inspires architectural choices. Architectural choices are publishable research whereas general facts are not, so it's plausible that evolution analogies are decent for prediction and bad for capabilities. Don't have time to think this through further unless you want to engage. One more thought on learning rates and mutation rates: This feels consistent with evolution, and I actually feel like someone clever could have predicted it in advance. Mutation rate per nucleotide is generally lower and generation times are longer in more complex organisms; this is evidence that lower genetic divergence rates are optimal, because evolution can tune them through e.g. DNA repair mechanisms. So it stands to reason that if models get more complex during training, their learning rate should go down. Does anyone know if decreasing learning rate is optimal even when model complexity doesn't increase over time?

[-]habryka

1y12-4

Not sure what you mean here. One of the best explanations of how neural networks get trained uses basically a pure natural selection lens, and I think it gets most predictions right:

CGP Grey "How AIs, like ChatGPT, Learn" https://www.youtube.com/watch?v=R9OHn5ZF4Uo

There is also a follow-up video that explains SGD:

CGP Grey "How AI, Like ChatGPT, *Really* Learns" https://www.youtube.com/watch?v=wvWpdrfoEv0

In-general I think if you use a natural selection analogy you will get a huge amount of things right about how AI works, though I agree not everything (it won't explain the difference between Adam and AdamW, but it will explain the difference between hierarchical bayesian networks, linear regression and modern deep learning).

[-]cfoster0

1y1914

Note: I just watched the videos. I personally would not recommend the first video as an explanation to a layperson if I wanted them to come away with accurate intuitions around how today's neural networks learn / how we optimize them. What it describes is a very different kind of optimizer, one explicitly patterned after natural selection such as a genetic algorithm or population-based training, and the follow-up video more or less admits this. I would personally recommend they opt for videos these instead:

[-]Oliver Sourbut

1y141

Except that selection and gradient descent are closely mathematically related - you have to make a bunch of simplifying assumptions, but 'mutate and select' (evolution) is actually equivalent to 'make a small approximate gradient step' (SGD) in the limit of small steps.

[-]nostalgebraist

1y156

I read the post and left my thoughts in a comment. In short, I don't think the claimed equivalence in the post is very meaningful.

(Which is not to say the two processes have no relationship whatsoever. But I am skeptical that it's possible to draw a connection stronger than "they both do local optimization and involve randomness.")

1Oliver Sourbut1y

Awesome, I saw that comment - thanks, and I'll try to reply to it in more detail. It looks like you're not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used? From a skim, the caveats you raised are mostly/all caveated in the original post too - though I think you may have missed the (less rigorous but more realistic!) second model at the end, which departs from the simple annealing process to a more involved population process. I think even on this basis though, it's going too far to claim that the best we can say is "they both do local optimization and involve randomness"! The steps are systematically pointed up/down the local fitness gradient, for one. And they're based on a sample-based stochastic realisation for another. I don't want you to get the impression I'm asking for too much from this analogy. But the analogy is undeniably there. In fact, in those explainer videos Habryka linked, the particular evolution described is a near-match for my first model (in which, yes, it departs from natural genetic evolution in the same ways).

8nostalgebraist1y

I'm disputing both. Re: math, the noise in your model isn't distributed like SGD noise, and unlike SGD the the step size depends on the gradient norm. (I know you did mention the latter issue, but IMO it rules out calling this an "equivalence.") I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don't feel like I can trust without seeing a mathematical presentation. (FWIW, if we have a population that's "spread out" over some region of a high-dim NN loss landscape -- even if it's initially a small / infinitesimal region -- I expect it to quickly split up into lots of disjoint "tendrils," something like dye spreading in water. Consider what happens e.g. at saddle points. So the population will rapidly "speciate" and look like an ensemble of GD trajectories instead of just one. If your model assumes by fiat that this can't happen, I don't think it's relevant to training NNs with SGD.)

2Oliver Sourbut1y

Wait, you think that a model which doesn't speciate isn't relevant to SGD? I'll need help following, unless you meant something else. It seems like speciation is one of the places where natural evolutions distinguish themselves from gradient descent, but you seem to also be making this point? In the second model, we retrieve non-speciation by allowing for crossover/horizontal transfer, and yes, essentially by fiat I rule out speciation (as a consequence of the 'eventually-universal mixing' assumption). In real natural selection, even with horizontal transfer, you get speciation, albeit rarely. It's obviously a fascinating topic, but I think pretty irrelevant to this analogy. For me, the step-size thing is interesting but essentially a minor detail. Any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs ... is relevant to the conclusions we want to draw? (Serious question; my best guess is 'no', but I hold that medium-lightly.) (irrelevant nitpick by my preceding paragraph, but) FWIW vanilla SGD does depend on gradient norm. [ETA: I think I misunderstood exactly what you were saying by 'step size depends on the gradient norm', so I think we agree about the facts of SGD. But now think about the space including SGD, RMSProp, etc. The 'depends on gradient norm' piece which arises from my evolution model seems entirely at home in that family.] On the distribution of noise, I'll happily acknowledge that I didn't show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.

6cfoster01y

I agree that they are related. In the context of this discussion, the critical difference between SGD and evolution is somewhat captured by your Assumption 1: Evolution does not directly select/optimize the content of minds. Evolution selects/optimizes genomes based (in part) on how they distally shape what minds learn and what minds do (to the extent that impacts reproduction), with even more indirection caused by selection's heavy dependence on the environment. All of that creates a ton of optimization "slack", such that large-brained human minds with language could steer optimiztion far faster & more decisively than natural selection could. This what 1a3orn was pointing to earlier with SGD does not have that slack by default. It acts directly on cognitive content (associations, reflexes, decision-weights), without slack or added indirection. If you control the training dataset/environment, you control what is rewarded and what is penalized, and if you are using SGD, then this lets you directly mold the circuits in the model's "brain" as desired. That is one of the main alignment-relevant intuitions that gets lost when blurring the evolution/SGD distinction.

6Oliver Sourbut1y

Right. And in the context of these explainer videos, the particular evolution described has the properties which make it near-equivalent to SGD, I'd say? ---------------------------------------- Hmmm, this strikes me as much too strong (especially 'this lets you directly mold the circuits'). Remember also that with RLHF, we're learning a reward model which is something like the more-hardcoded bits of brain-stuff, which is in turn providing updates to the actually-acting artefact, which is something like the more-flexibly-learned bits of brain-stuff. I also think there's a fair alternative analogy to be drawn like * evolution of genome (including mostly-hard-coded brain-stuff) ~ SGD (perhaps +PBT) of NN weights * within-lifetime-learning of organism ~ in-context something-something of NN (this is one analogy I commonly drew before RLHF came along.) So, look, the analogies are loose, but they aren't baseless.

31a3orn1y

Source?

[-]habryka

1y141

CGP Grey's video is a decent example source. Most of the differences between hierarchical bayesian networks and modern deep learning come across pretty well if you model the latter as a type of genetic algorithm search:

The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don't know exist.
Training consists of a huge amount of trial and error where you take datapoints, predict something about the result, then search for nearby modifications that do better, then repeat until performance plateaus.
You are ultimately doing a local search, which means you can get stuck at local minima, unless you do something like increase your step size or increase the mutation rate

There are also just actually deep similarities. Vanilla SGD is perfectly equivalent to a genetic search with an infinitesimally small mutation size and infinite samples per generation (I could make a proof here but won't unless someone is interested in it). Indeed in one of my ML classes at Berkeley genetic algorithms were suggested as one of the obvious things to do in an indifferentiable loss-landscape as generalization of SGD, where you just try some mutations, see which one performs best, and then modify your parameters in that direction.

4Oliver Sourbut1y

Oh, I actually did that a year or so ago

1David Johnston1y

Two observations: 1. If you think that people’s genes would be a lot fitter if people cared about fitness more then surely there’s a good chance that a more efficient version of natural selection would lead to people caring more about fitness. 2. You might, on the other hand, think that the problem is more related to feedbacks. I.e. if you’re the smartest monkey, you can spend your time scheming to have all the babies. If there are many smart monkeys, you have to spend a lot of time worrying about what the other monkeys think of you. If this is how you’re worried misalignment will arise, then I think “how do deep learning models generalise?” is the wrong tree to bark up C. If people did care about fitness, would Yudkowsky not say “instrumental convergence! Reward hacking!”? I’d even be inclined to grant he had a point.

[-]Lauro Langosco

1yΩ711-2

evolution does not grow minds, it grows hyperparameters for minds.

Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example for how the results of an optimization process can be unexpected.

I want to distinguish two possible takes here:

The argument from direct implication: "Humans are misaligned wrt evolution, therefore AIs will be misaligned wrt their objectives"
Evolution as an intuition pump: "Thinking about evolution can be helpful for thinking about AI. In particular it can help you notice ways in which AI training is likely to produce AIs with goals you didn't want"

It sounds like you're arguing against (1). Fair enough, I too think (1) isn't a great take in isolation. If the evolution analogy does not help you think more clearly about AI at all then I don't think you should change your mind much on the strength of the analogy alone. But my best guess is that most people incl Nate mean (2).

5TurnTrout1y

I think it's extremely relevant, if we want to ensure that we only analogize between processes which share enough causal structure to ensure that lessons from e.g. evolution actually carry over to e.g. AI training (due to those shared mechanisms). If the shared mechanisms aren't there, then we're playing reference class tennis because someone decided to call both processes "optimization processes."

3Lauro Langosco1y

The argument I think is good (nr (2) in my previous comment) doesn't go through reference classes at all. I don't want to make an outside-view argument (eg "things we call optimization often produce misaligned results, therefore sgd is dangerous"). I like the evolution analogy because it makes salient some aspects of AI training that make misalignment more likely. Once those aspects are salient you can stop thinking about evolution and just think directly about AI.

[-]azsantosk

1y113

Also relevent is Steven Byrnes' excelent Against evolution as an analogy for how humans will create AGI.

It has been over two years since the publication of that post, and criticism of this analogy has continued to intensify. The OP and other MIRI members have certainly been exposed to this criticism already by this point, and as far as I am aware, no principled defense has been made of the continued use of this example.

I encourage @So8res and others to either stop using this analogy, or to argue explicitly for its continued usage, engaging with the arguments presented by Byrnes, Pope, and others.

[-]Max H

1y104

But given that this example is so controversial, even if it were right why would you use it -- at least, why would you use it if you had any other example at all to turn to?

Humans are the only real-world example we have of human-level agents, and natural selection is the only process we know of for actually producing them.

SGD, singular learning theory, etc. haven't actually produced human-level minds or a usable theory of how such minds work, and arguably haven't produced anything that even fits into the natural category of minds at all, yet. (Maybe they will pretty soon, when applied at greater scale or in combination with additional innovations, either of which could result in the weird-correlates problem emerging.)

Also, the actual claims in the quote seem either literally true (humans don't care about foods that they model as useful for inclusive genetic fitness) or plausible / not obviously false (when you grow minds [to human capabilities levels], they end up caring about a bunch of weird correlates). I think you're reading the quote as saying something stronger / more specific than it actually is.

1MinusGix1y

Because it serves as a good example, simply put. It gets the idea clear across about what it means, even if there are certainly complexities in comparing evolution to the output of an SGD-trained neural network. It predicts learning correlates of the reward signal that break apart outside of the typical environment. Yes, that's why we like it, and that is a way we're misaligned with evolution (in the 'do things that end up with vast quantities of our genes everywhere' sense). Our taste buds react to it, and they were selected for activating on foods which typically contained useful nutrients, and now they don't in reality since ice-cream is probably not good for you. I'm not sure what this example is gesturing at? It sounds like a classic issue of having a reward function ('reproduction') that ends up with an approximation ('your tastebuds') that works pretty well in your 'training environment' but diverges in wacky ways outside of that. ---------------------------------------- I'm inferring by 'evolution is only selecting hyperparameters' is that SGD has less layers of indirection between it and the actual operation of the mind compared to evolution (which has to select over the genome which unfolds into the mind). Sure, that gives some reason to believe it will be easier to direct it in some ways - though I think there's still active room for issues of in-life learning, I don't really agree with Quintin's idea that the cultural/knowledge-transfer boom with humans has happened thus AI won't get anything like it - but even if we have more direct optimization I don't see that as strongly making misalignment less likely? It does make it somewhat less likely, though it still has many large issues for deciding what reward signals to use. I still expect correlates of the true objective to be learned, which even in-life training for humans have happen to them through sometimes associating not-related-thing to them getting a good-thing and not just as a matter of false

[-]Jozdien

1y152

Great post. I agree directionally with most of it (and have varying degrees of difference in how I view the severity of some of the problems you mention).

One that stood out to me:

(unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs).

While still far from being in a state legible to be easy or even probable that we'll solve, this seems like a route that circumvent some of the problems you mention, and is where a large amount of whatever probability I assign to non-doom outcomes come from.

More precisely: insofar as the problem at its core comes down to understanding AI systems deeply enough to make strong claims about whether or not they're safe / have certain alignment-relevant properties, one route to get there is to understand those high-level alignment-relevant things well enough to reliably identify the presence / nature thereof / do other things with, in a large class of systems. I can think of multiple approaches that try to do this, like John's work on abstractions, Paul with ELK (though referring to it as understanding the high-level alignment-relevant property of truth sounds somewhat janky because of the ... (read more)

[-]Eli Tyre

1y134

It looks to me like we’re on track for some people to be saying “look how rarely my AI says bad words”, while someone else is saying “our evals are saying that it can’t deceive humans yet”, while someone else is saying “our AI is acting very submissive, and there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing”, and someone else is saying “we’ll just direct a bunch of our AIs to help us solve alignment, while arranging them in a big bureaucracy”, and someone else is saying “we’ve set up the game-theoretic incentives such that if any AI starts betraying us, some other AI will alert us first”, and this is a different sort of situation.
And not one that looks particularly survivable, to me.

And if you ask bureaucrats to distinguish which teams should be allowed to move forward (and how far) in that kind of circus, full of claims, promises, and hunches and poor in theory, then I expect that they basically just can’t.

I'm reminded of a draft post that I started but, never finished or published, about the Manhattan project and the relevance for AI alignment and AI coordination, based on my reading of The Making of the Atomic Bomb.

The histor... (read more)

[-]Nora Belrose

1y122

I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).

I think this counterfactual is literally incoherent— it does not make sense to talk about what an individual neural network would do if its "optimization power" were scaled up. It's a category error. You instead need to ask what would happen if the training procedure were scaled up, and there are always many different ways that you can scale it up— e.g. keeping data fixed while parameters increase, or scaling both in lockstep, keeping the capability of the graders fixed, or investing in more capable graders / scalable oversight techniques, etc. So I deny that there is any fact of the matter about whether current LLMs "care about the target" in your sense. I think there probably are sensible ways of cashing out what it means for a 2023 LLM to "care about" something but this is not it.

[-]Bogdan Ionut Cirstea

1yΩ410-1

As others have hinted at/pointed out in the comments, there is an entire science of deep learning out there, including on high-level (vs. e.g. most of low-level mech interp) aspects that can be highly relevant to alignment and that you seem to not be aware of/dismiss. E.g. follow the citation trail of An Explanation of In-context Learning as Implicit Bayesian Inference.

[-]Rob Bensinger

1yΩ5112

Some of Nate’s quick thoughts (paraphrased), after chatting with him:

Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)

Nobody's been able to call the specific capabilities of systems in advance. Nobody's been able to call the specific exploits in advance. Nobody's been able to build better cognitive algorithms by hand after understanding how the AI does things we can't yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.

E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever tho... (read more)

[-]Seth Herd

1y10-14

Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.

This includes an assumption that alignment must be done through training signals.

If I shared that assumption, I'd be similarly pessimistic. That seems like trying to aim a rocket with no good theory of gravitation, nor knowledge of the space it needs to pass through.

But alignment needn't be done by defining goals or training signals, letting fly, and hoping. We can pause learning prior to human level (and potential escape), and perform "course corrections". Aligning a partly-trained AI allows us to use its learned representations as goal/value representation, rather than guessing how to create them well enough through training with correlated rewards.

We have proposals that do this for different current approaches to AGI; see The (partial) fallacy of dumb superintelligence for more about them and this line of thinking.

This doesn't entirely avoid the problem that most theories don't work... (read more)

[-]Nathan Helm-Burger

1y*104

I am excited about the concept of uploading, but as I've discussed with fellow enthusiasts... I don't see a way to a working emulation of a human brain (much less an accurate recreation of a specific human brain) that doesn't go through improving our general understanding of how the human brain works. And I think that that knowledge leads to unlocking AI capabilities. So it seems like a tightly information-controlled research project would be needed to not have AI tech leapfrogging over uploading tech while aiming for uploads.

Edit: to be extra clear, I'm trying to speak to people out who might not have thought this through that there is a clear strategic rationale to think that 'private uploading-directed-research is potentially good, but open uploading-directed-research is very risky and bad.' Because of my particular bias towards believing in the importance of studying the human brain, I suspect that the ML capabilities side-effects of such research would be substantially worse than the average straightforward ML capabilities advance.

6Steven Byrnes1y

If anyone cares, my own current take (see here) is “it’s not completely crazy to hope for uploads to precede non-upload-AGI by up to a couple years, with truly heroic effort and exceedingly good luck on numerous fronts at once”. (Prior to writing that post six months ago, I was even more pessimistic.) More than a couple years’ window continues to seem completely crazy to me.

[-]niplav

1y20

And if you were using it to send payloads to very distant planets at relativistic speeds, you’d still be screwed, because Newtonian mechanics does not account for relativistic effects.

You don't even need to have that extravagant an example; if you use Newtonian mechanics to build a Global Positioning System your calculated locations move at up to 10 kilometers per day—what does that say about condition numbers of values under recursive self-improvement or repeated ontological shifts?

[-]Review Bot

1y*10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]RogerDearnaley

1y*Ω110

"…there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing"

When your AI includes an LLM extensively trained to simulate human token-generation, anthropomophizing its behavior is an extremely relevant idea, to the point of being the obvious default assumption.

For example, what I find most concerning about RLHF inducing sycophancy is not the sycophancy itself, which is "mostly harmless", but the likelihood that it's also dragging in all the other more seriously unaligned human behaviors that, in real or fictional humans, typic... (read more)

[-]Дмитрий Зеленский

1y1-1

And in part because it’s socially hard to believe, as a regulator, that you should keep telling everyone “no”, or that almost everything on offer is radically insufficient, when you yourself don’t concretely know what insights and theoretical understanding we’re missing.

That's not true. We can end up with a regulator that stands in the pose of "prohibit everything". See IRB in America, for instance: medical experiments are made plainly insurmountable.

[-]ShayBenMoshe

1y10

I'd like to offer an alternative to the third point. Let's assume we have built a highly capable AI that we don't yet trust. We've also managed to coordinate as a society and implement defensive mechanisms to get to that point. I think that we don't have to test the AI in a low-stakes environment and then immediately move to a high-stakes one (as described in the dictator analogy), while still getting high gains.

It is feasible to design a sandboxed environment formally proven to be secure, in the sense that you can not hack into, escape from or deliberatel... (read more)

[-]lukemarks

1y10

unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs

Are you able to provide an example of the kind of thing that would constitute such a theoretical triumph? Or, if not; a maximally close approximation in the form of something that exists currently?

Moderation Log