(This is an account of my checking a certain alignment idea and finding that it doesn't work. Also my thinking is pretty naive and could easily be wrong.)

When thinking about AIs that are trained on some dataset and learn to extrapolate it, like the current crop of LLMs, I asked myself: can such an AI be aligned purely by choosing an appropriate dataset to train on? In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors? If we had such a dataset, we'd have an aligned AI.

But unfortunately it seems hard. For example, if the dataset includes instructions to build a nuke, then a bad actor could just ask for that. Moreover, if there's any circumstance at all under which we want the AI to say "here are the instructions to build a nuke" (to help a good actor stop an incoming asteroid, say), then a bad actor could extrapolate from that phrase and get the same result.

It seems the problem is that extrapolation doesn't have situational awareness. If the AI is based on extrapolating a certain dataset, there's no way to encode in the dataset itself which parts of it can be said when. And putting a thin wrapper on top, like ChatGPT, doesn't seem to help much, because from what I've seen it's easy enough to bypass.

What is the hope for alignment, then? Can we build an AI with situational awareness from the ground up, not relying on an "extrapolation core" (because the core would itself be an unaligned AI that bad actors could use)? I don't know.

EDIT: the sequel to this post is Aligned AI as a wrapper around an LLM.


In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors?

I think this is an important question to ask, but "even in the hands of bad actors" is just too difficult a place to start. I'm sure you're aware, but it's an unsolved problem whether there exists a dataset / architecture / training procedure such that "generating extrapolations from it leads to good outcomes," for sufficiently capable ML models, even in the hands of good actors. (And the "bad actor" piece can at least plausibly be solved by social coordination, whereas the remaining portion is a purely technical problem you can't dodge.)

But if you drop the bad actor part, I think this question is a good one to ask (but still difficult)! I think answering this question requires a better understanding of how neural networks generalize, but I can at least see worlds where the answer is "yes".  (Though there are still pitfalls in how you instantiate this in reality - does your dataset need to be perfectly annotated, so that truth-telling is incentivized over sycophancy/deception? Does it require SGD to always converge to the same generalization behavior? etc.)

Ok, let's assume good actors all around. Imagine we have a million good people volunteering to generate/annotate/curate the dataset, and the eventual user of the AI will also be a good person. What should we tell these million people, what kind of dataset should they make?

To be clear, I don't know the answer to this!

Spitballing here, the key question to me seems to be about the OOD generalization behavior of ML models. Models that receive similarly low loss on the training distribution still have many different ways they can behave on real inputs, so we need to know what generalization strategies are likely to be learned for a given architecture, training procedure, and dataset. There is some evidence in this direction, suggesting that ML models are biased towards a simplicity prior over generalization strategies.

If this is true, then the incredibly handwave-y solution is to just create a dataset where the simplest (good) process for estimating labels is to emulate an aligned human. At first pass this actually looks quite easy - it's basically what we're doing with language models already.

Unfortunately, there's quite a lot we've swept under the rug. In particular, this may not scale up as models get more powerful: the prior towards simplicity can be overcome if it results in lower loss, and if the dataset contains some labels that humans unknowingly rated incorrectly, the best process for estimating labels involves saying what humans believe is true rather than what actually is. This can already be seen in the sycophancy problems today's LLMs are having.
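The incentive described above can be made concrete with a toy calculation (the numbers and the 0/1 loss here are illustrative assumptions, not from any real training run): if humans mislabel even a small fraction of examples, the model that reports human beliefs scores strictly better on the training set than the model that reports the truth.

```python
# Toy illustration: when a fraction of labels reflect human error,
# a predictor that outputs *what humans believe* achieves lower
# training loss than one that outputs *what is actually true*.

def zero_one_loss(predictions, labels):
    """Fraction of examples where the prediction disagrees with the label."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

# 100 examples; suppose the ground truth is always 1.
truth = [1] * 100

# Humans mislabel 10% of examples (they believe 0 where the truth is 1),
# and the dataset is annotated with human beliefs, not the truth.
human_beliefs = [0] * 10 + [1] * 90
labels = human_beliefs

truthful_model = truth              # always says what is actually true
sycophantic_model = human_beliefs   # always says what humans believe

print(zero_one_loss(truthful_model, labels))     # 0.1
print(zero_one_loss(sycophantic_model, labels))  # 0.0
```

So even a 10% annotation error rate means the loss landscape rewards the "say what the rater believes" strategy, which is the worry about imperfect annotation raised above.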

There are a lot of other thorny problems in this vein that you can come up with given a few minutes of thinking. That being said, it doesn't seem completely doomed to me! There just needs to be a lot more work here. (But I haven't spent too long thinking about this, so I could be wrong.)

Well, one way of thinking of the objective without situational awareness could be to maximize the expected utility of the resulting policy.

Human-written texts, especially literature, laws, news articles etc., are both shaped by and shaping human culture and values. So, LLMs like GPT, which are trained on huge corpora of those, probably already have a pretty deep understanding of human values. GPT can reason about values and ethics when prompted, maybe even better than many humans: https://www.lesswrong.com/posts/ztqpqff2xfLpahSpB/challenge-does-chatgpt-ever-claim-that-a-bad-outcome-for

That is not exactly alignment, of course, but a big step in that direction, imho.

Yeah, LLMs somewhat understand how to do good stuff, and how to label it as good. Also they somewhat understand how to do bad stuff, and how to label it as bad. So the situation is symmetric. The question in the post was, can we make it asymmetric? Make a dataset that, when extrapolated, tends toward outputting information that helps humanity?

To be fair, it's not entirely symmetric. Current datasets are already a bit biased toward human morality, because they consist of texts written by humans. In a way that's lucky. If we'd first gotten powerful AIs trained on observations of physical reality instead, they'd have been more amoral and dangerous. Texts are better. But they don't get us all the way, because humans can do bad things too. And it's tricky to figure out how to make the dataset lean more toward morality, without making it much smaller and thus less powerful.

I suspect GPT can already figure out which described action is the "benevolent" one. If not, please give me an example of an AI mislabeling it.

The problem is that current AI is too dumb to figure out whether an act is bad if it is described in some roundabout way https://humanevents.com/2023/03/24/chatgpt-helps-plan-a-state-run-death-camp , is too complex, or has to be inferred from non-text information, etc.

For example, it would take a very smart AI, probably AGI, to reliably figure out that some abstract math or engineering task is actually a weapon recipe.

Suppose you made your dataset larger and larger. Once it got "really large", let's say, would you feel confident that your AI model had learned enough that, even if its dataset contained nuke-building instructions, it would remain safe to use even in the hands of a bad actor?

In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors?

First, let's remove the requirement that it must be safe from bad actors, as that's not an alignment problem.

Now, to answer the question, the good news is that the more we crank up generalization ability, the better its alignment, because it's better at extrapolating.

Now to answer the question, I suspect the answer is yes, conditional on having the following assumptions:

  1. You have an AI that generalizes well enough to solve arbitrarily long problems, since you can make any problem more complex by making it longer.

  2. The AI cannot affect or hack the human's values.

The first is just capabilities progress, and the second assumption is valid for offline training on human values like PHF (Pretraining from Human Feedback), but not online training like RLHF.

the more we crank up generalization ability, the better its alignment

To me that seems almost correct, in a way that is dangerous. I'd agree with the statement that

the more we crank up generalization ability, the better the AI's ability to align to any given set of values/goals

But for that to lead to the AI being aligned with "good" values, we also need to somehow get the AI to choose/want to align with "good" values. (Whatever "good" even means; maybe humanity's CEV?) And that does not happen on its own, AFAICT.

But for that to lead to the AI being aligned with "good" values, we also need to somehow get the AI to choose/want to align with "good" values.

And that does not happen on its own, AFAICT.

I agree that the assumption of generality, on its own, doesn't actually allow you to align an AI's values to a specific human's values or intentions, given that embeddedness in the world allows everything to be manipulated, including a human's values.

Thus you can solve the problem of alignment in 2 ways:

  1. Resolve the embedded alignment problems and try to align the AI with a human's values, or essentially get alignment while online learning in the world.

This is essentially Reinforcement Learning from Human Feedback's method for alignment.

  2. Dissolve the embedded alignment problems by finding a way to translate the ontology of Cartesianism, including its boundaries, into an embedded world via offline learning, in a way that makes sense.

This is essentially Pretraining from Human Feedback's method for alignment.

So I made the assumption that the AI can't hack the data set used for human values, and that assumption can be enforced in offline learning, but not online learning.

I don't quite understand what you're saying; I get the impression we're using different ontologies/vocabularies. I'm curious to understand your model of alignment, and below are a bunch of questions. I'm uncertain whether it's worth the time to bridge the conceptual gap, though --- feel free to drop this conversation if it feels opportunity-costly.

(1.)

Are you saying that if we assumed agents to be Cartesian[1], then you'd know a solution to the problem of {how could a weak agent train and align a very powerful agent}? If yes, could you outline that solution?

(2.)

Resolve the embedded alignment problems [...] This is essentially Reinforcement Learning from Human Feedback's method for alignment

How does RLHF solve problems of embedded alignment? I'm guessing you're referring to something other than the problems outlined in Embedded Agency?

(3.)

What exact distinction do you mean by "online" vs "offline"? Given that any useful/"pivotal" AI would need to learn new things about the world (and thus, modify its own models/mind) in order to form and execute a useful/pivotal plan, it would have to learn "online", no?

(4.)

the data set used for human values

What kind of data set did you have in mind here? A data set s.t. training an AI on it in some appropriate way would lead to the AI being aligned to human values? Could you give a concrete example of such a data set (and training scheme)?


  1. e.g. software agents in some virtual world, programmed such that agents are implemented as some well-defined datatype/class, and agent-world interactions can only happen via a well-defined simple interface, running on a computer that cannot be hacked from within the simulation. ↩︎

Are you saying that if we assumed agents to be Cartesian[1], then you'd know a solution to the problem of {how could a weak agent train and align a very powerful agent}? If yes, could you outline that solution?

Yes, and while I can't totally describe the situation, I can say this:

The first step would be to, say, scale up the experiment of Pretraining from Human Feedback by using more data, then curating it for alignment. In particular, we could even try to design a data set that uses value-laden words like freedom, justice, alignment, and more.

But the real power of the Cartesian agent for alignment is the fact that agent-world interactions can only happen over a well-defined, simple interface, and the computer isn't hackable. That immediately means that many AI risk stories evaporate: the agent can only learn legitimate generalizations, not deceptive generalizations leading to deceptive alignment, and it can't amplify Goodhart or hack the human's values. That makes alignment simpler, since there's no need for cumbersome protections that incur alignment taxes. Thus we can bring it back to simple iteration until we succeed.

How does RLHF solve problems of embedded alignment? I'm guessing you're referring to something other than the problems outlined in Embedded Agency?

You're right that it doesn't do well, but the main strategy is to reward the AI whenever it takes good-looking actions. The problem, as you can see, is that it's trying to align an agent after it's already capable, and as power increases this becomes increasingly dangerous. In particular, since the AI controls the learning schedule, it will probably be incentivized to hack the human, since the human's values are just another thing in the world, and there's no well-defined interface that's under our control.

Pretraining from Human Feedback does alignment before it gets significantly better capabilities.

What exact distinction do you mean by "online" vs "offline"? Given that any useful/"pivotal" AI would need to learn new things about the world (and thus, modify its own models/mind) in order to form and execute a useful/pivotal plan, it would have to learn "online", no?

I agree that one eventually has to shed the Cartesian boundary and do online learning/training. But the key thing we've learned from deep learning is that we can translate the ontology of Cartesianism without too much trouble. In particular, one can do a whole lot of training inside a Cartesian setting (offline learning), and when the model moves to the embedded, online setting, it generalizes the capabilities learned in the Cartesian setting really well. If one could do this for alignment, then the problem transforms into a solvable one, as we can ensure there is no way to hack the human's values or the environment.

And indeed Pretraining from Human Feedback is the closest I've seen to actually implement this.

The embeddedness of the world dominates asymptotically, but we only need a finite time of Cartesian learning to generalize in the embedded setting correctly.

To spell out the distinction between offline and online learning: in offline learning, we give the AI batches of data, the AI (guided by SGD) learns the patterns and algorithms that explain that data, then we give it new data, rinse and repeat. An important point to notice here is that the human is in control of that data, and there's no way for the data to be hacked, unlike in online learning. In particular, the interface is well specified: text.

In online learning, the AI takes the lead and selects its own data points, and the most we can do is give it reward or punishment. In particular, nearly all human learning is online learning, since we select our own data points.
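The distinction above can be sketched as two toy training loops (the classes and method names here are illustrative assumptions, not any real framework's API). The key structural difference is who determines the training distribution: the human's fixed batches, or the model's own actions.

```python
# Toy sketch contrasting offline and online learning: in offline
# learning the human fixes every batch in advance, so the model
# cannot influence its own training distribution; in online learning
# the model's actions determine what data it sees next.

class ToyModel:
    def __init__(self):
        self.seen = []          # record of everything trained on
    def update(self, data):
        self.seen.append(data)
    def act(self, observation):
        return ("action", observation)

class ToyEnvironment:
    def __init__(self):
        self.t = 0
    def observe(self):
        return self.t
    def step(self, action):
        self.t += 1
        return ("feedback-for", action)   # reward/punishment signal

def offline_training(model, curated_batches):
    # The human controls every batch the model ever sees.
    for batch in curated_batches:
        model.update(batch)
    return model

def online_training(model, env, steps):
    # The model selects its own data points by acting in the world.
    for _ in range(steps):
        feedback = env.step(model.act(env.observe()))
        model.update(feedback)
    return model

m1 = offline_training(ToyModel(), ["batch-0", "batch-1"])
m2 = online_training(ToyModel(), ToyEnvironment(), 2)
print(m1.seen)   # exactly the curated batches, in the human's order
print(m2.seen)   # records that depend on the model's own actions
```

In the offline loop, `m1.seen` is exactly what the human curated; in the online loop, every entry of `m2.seen` passed through the model's own `act` call first, which is the channel the commenter worries could be used to affect the training distribution.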

What kind of data set did you have in mind here? A data set s.t. training an AI on it in some appropriate way would lead to the AI being aligned to human values? Could you give a concrete example of such a data set (and training scheme)?

Unfortunately, that's outside my expertise, so I can't really concretely do this. I am aiming to give something like a possibility proof for alignment, as well as mentioning some practical ways to achieve the idea. The implementation is left to other people.

To gesture at one of my largest cruxes: I think the human method of learning, which is almost all online, doesn't need to be replicated in AI. A large part of the problems with alignment in humans ultimately stems from the fact that we are almost always online learners, so we constantly have chances to hack or manipulate our environments, whether physical, social, or anything else, and so we do hack our environments. When we train AI, we don't need to replicate that method, because we've learned that we can do a lot more offline learning to get the AI to learn the basic concepts.

If we could reliably do offline learning in humans, I think alignment at least in the single person case could be totally solved, and while larger groups would have problems staying aligned unless they had special properties, we could plausibly solve alignment for much larger groups.

In essence, the human architecture almost totally prohibits offline learning, while AI architecture does permit offline learning.

The real weakness of offline learning is the cost. This cost is amortizable, making it actually pretty cheap over multiple training runs, and you don't need to recompute rewards; but evolution, and we ourselves, had to work with much smaller energy budgets, at least until the 19th century, and only in the 21st century could that much energy and data be applied to compute. Thus the large upfront cost couldn't be paid, even though offline learning works far better for alignment in the long run.

Also, I'm focusing on the embedded agency problems of Goodhart's law and subsystem alignment, and I'm not addressing decision theory or embedded world models problems. In essence, I'm focusing only on problems that are alignment related, not capabilities related.

I know this is a long comment, but I hope you understand why I'm communicating so differently, and why I'm so optimistic about alignment getting better as capabilities scale.

scale up the experiment of Pretraining from Human Feedback by using larger data

AFAICT, PHF doesn't solve any of the core problems of alignment. IIUC, PHF is still using an imperfect reward model trained on a finite amount of human signals-of-approval; I'd tentatively expect scaling up PHF (to ASI) to result in death-or-worse by Goodhart. Haven't thought about PHF very thoroughly though, so I'm uncertain here.

we can even try to design a data set such that it uses words like freedom, justice, alignment and more value laden words

Did you mean something like "(somehow) design a data set such that, in order to predict token-sequences in that data set, the AI has to learn the real-world structure of things we care about, like freedom, justice, alignment, etc."? [1]


can only learn legitimate generalizations, not deceptive generalizations leading to deceptive alignment

I don't understand this. What difference are you pointing at with "deceptive" vs "legitimate" generalizations? How does {AI-human (and/or AI-env) interactions being limited to a simple interface} preclude {learning "deceptive" generalizations}?

I'm under the impression that entirely "legitimate" generalizations can (and apriori probably will) lead to "deception"; see e.g. https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness. Do you disagree with that? (If yes, how?)

can't amplify Goodhart

Side note: I don't understand what you mean by this (in the given context).

can't [...] hack the human's values

I don't see how this follows. IIUC, the proposition here is something like

  • If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can't hack the humans.

Is that a reasonable representation of what you're saying? If yes, consider: What if we replace "the AI" with "Anonymous" and "the humans" with "the web server"? Then we get

  • If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can't hack the web server

...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn't rely on hardware effects like rowhammering.

(I agree that limiting human-AI interactions to a simple interface would be helpful, but I think it's far from sufficient (to guarantee any form of safety).)


IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier. I'm still confused as to why you think that. To the extent that I understood the reasons you presented, I think they're incorrect (as outlined above). (Maybe I'm misunderstanding something.)

I'm kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!


  1. I think a naively designed data set containing lots of {words that are value-laden for English-speaking humans} would not cut it, for hopefully obvious reasons. ↩︎

I'm kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!

This will be my last comment here. Thank you for trying to explain why you disagree with me!

IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier.

I'm impressed that you passed my ITT. I think analogies to other alignment problems, like the human alignment problem, miss that it's the most difficult setting, and you don't need to play on that difficulty, because AI is very different from humans.

AFAICT, PHF doesn't solve any of the core problems of alignment.

While I definitely overclaimed on how much it solves the alignment problems, I think this is definitely underselling the accomplishments. It's an incomplete solution, in that it doesn't do everything on its own, but it does carry a lot of weight.

To talk about deceptive alignment more specifically: deceptive alignment is essentially where the AI isn't aligned with human goals and tries to hide that fact. One of the key prerequisites for deceptive alignment is that the AI is optimizing a non-myopic goal. It's the most dangerous failure mode, since we have an AI that is only aligned for instrumental, not terminal, reasons.

What Pretraining from Human Feedback did was finally marry a myopic objective with competitive capabilities, and once the myopic goal of conditional training was added, deceptive alignment goes away, since a non-myopic goal being optimized is one of its key prerequisites.
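The conditional-training idea referenced above can be sketched as follows (a hedged sketch: the control-token names, the threshold, and the scorer below are illustrative assumptions, not the paper's exact setup): each pretraining text is prefixed with a control token according to its preference score, and at inference time generation is conditioned on the "good" token.

```python
# Sketch of conditional training: tag each training text with a
# control token based on a preference score, then condition
# generation on the "good" token at inference time.

GOOD, BAD = "<|good|>", "<|bad|>"

def annotate(texts, score, threshold=0.5):
    """Prefix each training text with a control token based on its score."""
    return [(GOOD if score(t) >= threshold else BAD) + t for t in texts]

# Toy scorer standing in for a learned reward model / rule-based filter.
def toy_score(text):
    return 0.0 if "bad" in text else 1.0

corpus = ["a helpful answer", "a bad answer"]
training_data = annotate(corpus, toy_score)
print(training_data)
# ['<|good|>a helpful answer', '<|bad|>a bad answer']

# At inference time, prompts are prefixed with the "good" control
# token, so the model is asked for text from the preferred class.
prompt = GOOD + "user question: "
```

Note the myopia the commenter appeals to: the objective is still next-token prediction on a fixed, human-tagged corpus, so nothing in the loop rewards the model for long-horizon influence on its own training data.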

I don't see how this follows. IIUC, the proposition here is something like

If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can't hack the humans.

Is that a reasonable representation of what you're saying? If yes, consider: What if we replace "the AI" with "Anonymous" and "the humans" with "the web server"? Then we get

If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can't hack the web server

...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn't rely on hardware effects like rowhammering.

This is definitely right, and I did overclaim here, though I do remember that Pretraining from Human Feedback claimed to do this:

Conditional training (as well as other PHF objectives) is purely offline: the LM is not able to affect its own training distribution. This is unlike RLHF, where the LM learns from self-generated data and thus is more likely to lead to risks from auto-induce distribution shift or gradient hacking.

This vindicates a narrower claim about the AI's inability to hack or affect its training distribution, though I don't know how much that supports my thesis about immunity to hacking.

To offer another reason why I'm so optimistic on alignment: I think that alignment is scalable. Put another way, while Pretraining from Human Feedback is imperfect right now (though even with its imperfections, my view is that it would avoid X-risk almost entirely), small, consistent improvements in the vein of empirical work will eventually make it far more aligned than the original Pretraining from Human Feedback work. In the case of more data, they tested it, and it showed increasing alignment with more data.

To edit a quote from Thoth Hermes:

Yudkowsky was wrong in the tendency to assume that certain abstractions just don't apply whenever intelligence or capability is scaled way up.

This essentially explains my issues with the idea that alignment isn't scalable.