I do note that General Purpose Search can be almost reduced to learning, in that most things that you want General Purpose Search to do can also be done by learning, though I do think General Purpose Search will at least be the foundation/bootstrapping for learning:
Interesting! I have to read the papers in more depth but here are some of my initial reactions to that idea (let me know if it’s been addressed already):
Agree, learning can't entirely replace General Purpose Search, and I agree something like General Purpose Search will still in practice be the backbone behind learning, due to your reasoning.
That is, General Purpose Search will still be necessary for AIs, if only due to bootstrapping concerns, and I agree with your list of benefits of General Purpose Search.
My piece on Steering subsystems is highly overlapping with your decomposition into a targeting process. I argue that effective AGI will need to have such a thing for similar reasons to those you present, but your presentation is different and quite possibly better than mine was.
Thanks! I recall reading the steering subsystems post a while ago & it matched a lot of my thinking on the topic. The idea of using variables in the world model to determine the optimization target also seems similar to your "Goals selected from learned knowledge" approach (the targeting process is essentially a mapping from learned knowledge to goals).
Another motivation for the targeting process (which might also be an advantage of GLSK) I forgot to mention is that we can allow the AI to update their goals as they update their knowledge (eg about what the current human values are), which might help us avoid value lock-in.
Right! I'm pleased that you read those posts and got something from them.
I worry less about value lock-in and more about The alignment stability problem which is almost the opposite.
But more recently I've been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent alignment) from authorized user(s). It's also the more appealing option to people actually in charge of AGI projects. They like being in charge, and of course everyone likes their values better than the average of all humanity's values.
Good point!
But more recently I've been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent alignment) from authorized user(s).
I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want to be kept in the loop, & it will be corrigible insofar as we want it to be corrigible.
But a particular regime I'm worried about (for both PIA & VA) is when the AI has an imperfect model of the users' goals inside its world model & optimizes for them. One way to mitigate this is to implement some systematic ways of avoiding side effects (corrigibility/impact regularization) so that we won’t need to require a perfect optimization target, another is to allow the AI to update its goals as it improves its model of the users’ values.
Instruction-following AI can also help with this, though I think it might imply a higher alignment tax, but it’s also probably easier to build.
I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want to be kept in the loop, & it will be corrigible insofar as we want it to be corrigible.
Good point. I agree that the wrong model of user's preferences is my main concern and most alignment thinkers'. And that it can happen with a personal intent alignment as well as value alignment.
This is why I prefer instruction-following to corrigibility as a target. If it's aligned to follow instructions, it doesn't need nearly as much of a model of the user's preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like "Okay, I've got an approach that should work. I'll engineer a gene drive to painlessly eliminate the human population". "Um okay, I actually wanted the humans to survive and flourish while solving cancer, so let's try another approach that accomplishes that too...". I describe this as do-what-I-mean-and-check, DWIMAC.
The Harms version of corrigibility is pretty similar in that it should take instructions first and foremost, even though it's got a more elaborate model of the user's preferences to help in interpreting instructions correctly, and it's supposed to act on its own initiative in some cases. But the two approaches may converge almost completely after a user has given a wise set of standing instructions to their DWIMAC AGI.
Also, accurately modeling short-term intent - what the user wants right now - seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it's also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.
Absent all of that, it seems like there's still two advantages to modeling just one person's values instead of all of humanity's. The smaller one is that you don't need to understand as many people or figure out how to aggregate values that conflict with each other. I think that's not actually that hard since lots of compromises could give very good futures, but I haven't thought that one all the way through. The bigger advantage is that one person can say "oh my god don't do that it's the last thing I want" and it's pretty good evidence for their true values. Humanity as a whole probably won't be in a position to say that before a value-aligned AGI sets out to fulfill its (misgeneralized) model of their values.
Doesn't easier to build mean lower alignment tax?
The Harms version of corrigibility is pretty similar in that it should take instructions first and foremost, even though it's got a more elaborate model of the user's preferences to help in interpreting instructions correctly, and it's supposed to act on its own initiative in some cases. But the two approaches may converge almost completely after a user has given a wise set of standing instructions to their DWIMAC AGI.
Note that the link to the Harms version of corrigibility doesn't work.
Good point. I agree that the wrong model of user's preferences is my main concern and most alignment thinkers'. And that it can happen with a personal intent alignment as well as value alignment.
This is why I prefer instruction-following to corrigibility as a target. If it's aligned to follow instructions, it doesn't need nearly as much of a model of the user's preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like "Okay, I've got an approach that should work. I'll engineer a gene drive to painlessly eliminate the human population". "Um okay, I actually wanted the humans to survive and flourish while solving cancer, so let's try another approach that accomplishes that too...". I describe this as do-what-I-mean-and-check, DWIMAC.
Yes, I also think that is a consideration in favor of instruction following. I think there’s an element of IF which I find appealing, it’s somewhat similar to bayesian updating: When I tell an IF agent to “fill the cup”, on one hand it will try to fulfill that goal, but it also thinks about the “usual situation” where that instruction is satisfied, & it will notice that the rest of the world remains pretty much unchanged, so it will try to replicate that. We can think of the IF agent as having a background prior over world states, and it conditions that prior on our instructions to get a posterior distribution over world states, & that’s the “target distribution” that it’s optimizing for. So it will try to fill the cup, but it wouldn’t build a dyson sphere to harness energy & maximize the probability of the cup being filled, because that scenario has never occurred when a cup has been filled (so that world has low prior probability).
I think this property can also be transferred to PIA and VA, where we have a compromise between “desirable worlds according to model of user values” and “usual worlds”.
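The "condition a background prior on the instruction" picture above can be sketched in a few lines. This is only a toy illustration: the four world states, their prior probabilities, and the `condition` helper are all assumptions made up for the example, not anything from the discussion.

```python
# Hypothetical toy world states: (cup_filled, dyson_sphere_built) -> prior probability.
# The numbers are invented; they only encode "Dyson spheres are very unusual".
prior = {
    (False, False): 0.899,
    (True, False): 0.100,
    (False, True): 0.0009,
    (True, True): 0.0001,
}


def condition(prior, predicate):
    """Return the posterior over world states, given that `predicate` holds."""
    mass = sum(p for state, p in prior.items() if predicate(state))
    if mass == 0:
        raise ValueError("the instruction has zero prior probability")
    return {state: p / mass for state, p in prior.items() if predicate(state)}


# Instruction: "fill the cup" -> keep only worlds where the cup is filled.
posterior = condition(prior, lambda state: state[0])

# The agent aims at the most probable world under the posterior.
target = max(posterior, key=posterior.get)
print(target, posterior)
```

In this toy setup the filled-cup world without a Dyson sphere dominates the posterior, simply because the Dyson-sphere world had low prior probability to begin with.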
Also, accurately modeling short-term intent - what the user wants right now - seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it's also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.
Absent all of that, it seems like there's still two advantages to modeling just one person's values instead of all of humanity's. The smaller one is that you don't need to understand as many people or figure out how to aggregate values that conflict with each other. I think that's not actually that hard since lots of compromises could give very good futures, but I haven't thought that one all the way through. The bigger advantage is that one person can say "oh my god don't do that it's the last thing I want" and it's pretty good evidence for their true values. Humanity as a whole probably won't be in a position to say that before a value-aligned AGI sets out to fulfill its (misgeneralized) model of their values.
Agreed, I also favor personal intent alignment for those reasons, or at least I consider PIA + accelerated & simulated reflection to be the most promising path towards eventual VA
Doesn't easier to build mean lower alignment tax?
It’s part of it, but alignment tax also includes the amount of capabilities that we have to sacrifice to ensure that the AI is safe. The way I think of alignment tax is that for every optimization target, there is an upper bound on the optimization pressure that we can apply before we run into goodhart failures. The closer the optimization target is to our actual values, the more optimization pressure we get to safely apply. & because each instruction only captures a small part of our actual values, we have to limit the amount of optimization pressure we apply (this is also why we need to avoid side effects when the AI has an imperfect model of the users’ preferences).
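A minimal numeric sketch of the "upper bound on optimization pressure" idea, under heavy assumptions: the proxy (`proxy`), the stand-in for our actual values (`true_value`), and the notion of pressure as "how far the search is allowed to reach" are all hypothetical choices for illustration.

```python
def proxy(x):
    """What the instruction literally rewards (hypothetical): more is always better."""
    return x


def true_value(x):
    """Hypothetical stand-in for our actual values: agrees with the proxy for small x, then falls off."""
    return x - 0.1 * x * x


def optimize(pressure):
    """Pick the proxy-maximizing option among candidates 0..pressure."""
    return max(range(pressure + 1), key=proxy)


for pressure in (1, 5, 10, 20):
    choice = optimize(pressure)
    print(pressure, choice, true_value(choice))
```

With these made-up functions, true value rises with pressure up to a point and then declines and turns negative, which is the Goodhart pattern described above. A proxy closer to `true_value` would move that turning point further out.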
Re this:
It’s part of it, but alignment tax also includes the amount of capabilities that we have to sacrifice to ensure that the AI is safe. The way I think of alignment tax is that for every optimization target, there is an upper bound on the optimization pressure that we can apply before we run into goodhart failures. The closer the optimization target is to our actual values, the more optimization pressure we get to safely apply. & because each instruction only captures a small part of our actual values, we have to limit the amount of optimization pressure we apply (this is also why we need to avoid side effects when the AI has an imperfect model of the users’ preferences).
We can also get more optimization if we have better tools to aim General Purpose Search, so that we can correct the model if it goes wrong.
We can also get more optimization if we have better tools to aim General Purpose Search, so that we can correct the model if it goes wrong.
Yes, I think having an aimable general purpose search module is the most important bottleneck for solving inner alignment
I think things can still go wrong if we apply too much optimization pressure to an inadequate optimization target because we won’t have a chance to correct the AI if it doesn’t want us to (I think adding corrigibility is a form of reducing optimization pressure, but it's still desirable).
But a particular regime I'm worried about (for both PIA & VA) is when the AI has an imperfect model of the users' goals inside its world model & optimizes for them. One way to mitigate this is to implement some systematic ways of avoiding side effects (corrigibility/impact regularization) so that we won’t need to require a perfect optimization target, another is to allow the AI to update its goals as it improves its model of the users’ values.
I agree with both of those methods being used, and IMO a third way or maybe a way to improve the 2 methods is to use synthetic data early and fast on human values and instruction following, and the big reason for using synthetic data here is to both improve the world model, plus using it to implement stuff like instruction following by making datasets where superhuman AI always obey the human master despite large power differentials, or value learning where we offer large datasets of AI always faithfully acting on the best of our human values, as described here:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
The biggest reason I wouldn't be too concerned about that problem is that assuming no deceptive alignment/adversarial behavior from the AI, which is likely to be enforced by synthetic data, I think the problem you're talking about is likely to be solvable in practice, because we can just make their General Purpose Search/world model more capable without causing problems, which means we can transform this in large parts into a problem that goes away with scale.
More generally, this unlocks the ability to automate the hard parts of alignment research, which lets us offload most of the hard work onto the AI.
Yes, I think synthetic data could be useful for improving the world model. It's arguable that allowing humans to select/filter synthetic data for training counts as a form of active learning, because the AI is gaining information about human preference through its own actions (generating synthetic data for humans to choose). If we have some way of representing uncertainties over human values, we can let our AI argmax over synthetic data with the objective of maximizing information gain about human values (when synthetic data is filtered).
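As a rough sketch of "argmax over synthetic data with the objective of maximizing information gain", assume a discrete set of hypotheses about human values and, for each candidate sample, a hypothetical probability that a human filter accepts it under each hypothesis. All names and numbers below are assumptions for illustration.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_information_gain(prior, accept_prob):
    """prior[i] = P(hypothesis i); accept_prob[i] = P(human accepts the sample | hypothesis i)."""
    gain = entropy(prior)
    reject_prob = [1 - a for a in accept_prob]
    for likelihood in (accept_prob, reject_prob):
        p_outcome = sum(p * l for p, l in zip(prior, likelihood))
        if p_outcome == 0:
            continue  # this outcome never happens, so it contributes nothing
        posterior = [p * l / p_outcome for p, l in zip(prior, likelihood)]
        gain -= p_outcome * entropy(posterior)
    return gain


# Two hypothetical hypotheses about human values, equally likely a priori.
prior = [0.5, 0.5]

# Hypothetical candidate samples and their acceptance probability under each hypothesis.
candidates = {
    "uninformative_sample": [0.5, 0.5],
    "discriminating_sample": [0.9, 0.1],
}

best = max(candidates, key=lambda name: expected_information_gain(prior, candidates[name]))
print(best)
```

The sample whose acceptance depends strongly on which hypothesis is true is the one worth showing to the human. Note that this sketch assumes the filter is unbiased; systematic bias in human filtering would need its own term in the likelihood.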
I think using synthetic data for corrigibility can be more or less effective depending on your views on corrigibility and the type of AI we’re considering. For instance, it would be more effective under Christiano’s act-based corrigibility because we’re avoiding any unintended optimization by evaluating the agent at the behavioral level (sometimes even at thought level), but in this paradigm we’re basically explicitly avoiding general purpose search, so I expect a much higher alignment tax.
If we’re considering an agentic AI with a general purpose search module, misspecification of values is much more susceptible to goodhart failures because we’re applying much more optimization pressure, & it’s less likely that synthetic data on corrigibility can offer us sufficient robustness, especially when there may be systematic bias in human filtering of synthetic data. So in this context I think a value-free core of corrigibility would be necessary to avoid the side effects that we can’t even think of.
Note that whenever I say corrigibility, I really mean instruction following, à la @Seth Herd's comments.
Re the issue of Goodhart failures, maybe a kind of crux is how much do we expect General Purpose Search to be aimable by humans, and my view is that we will likely be able to get AI that is both close enough to our values plus very highly aimable because of the very large amounts of synthetic data, which means we can put a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
Re the issue of Goodhart failures, maybe a kind of crux is how much do we expect General Purpose Search to be aimable by humans
I also expect general purpose search to be aimable, in fact, it’s selected to be aimable so that the AI can recursively retarget GPS on instrumental subgoals
which means we can put a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
I think there’s a fundamental tradeoff between optimization pressure & correctability, because if we apply a lot of optimization pressure on the wrong goals, the AI will prevent us from correcting it, and if the goals are adequate we won’t need to correct them. Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman)
Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:
Yes, I consider this to be the central crux.
I think current models lack certain features which prevent the generalization of their capabilities, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs
I also think an adequate optimization target about the physical world is much more complex than a reward model for LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing
Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman)
This is what I was trying to say: in certain applications, like automating AI interpretability/alignment research, the tradeoff is not that harsh. A lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is hard enough and safe enough to use iterative methods to solve the problem.
Yes, I consider this to be the central crux.
I think current models lack certain features which prevent the generalization of their capabilities, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs
I also think an adequate optimization target about the physical world is much more complex than a reward model for LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing
I kind of agree: at least at realistic compute levels, say through 2030, lack of search is a major bottleneck to better AI. But there are a few things to keep in mind:
People at OpenAI are absolutely trying to integrate search into LLMs, see this example where they got the Q* algorithm that aced a math test:
Also, I don't buy that it was refuted, based on this, which sounds like a refutation but isn't actually a refutation, and they never directly deny it:
Re today's AIs being weak evidence that alignment generalizes further than capabilities, I think that the theoretical and empirical reasons why alignment generalizes further than capabilities are in large part (but not entirely) reducible to why it's generally much easier to verify that something has been done correctly than actually executing the plan yourself:
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.
4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa. For the AI, we aim to solve part a.) as a general part of outer alignment and b.) is the general problem of capabilities. It is much much much easier for people to judge and critique outcomes than actually materialize them in practice, as evidenced by the very large amount of people who do the former compared to the latter.
5.) Similarly, understanding of values and ability to assess situations for value arises much earlier and robustly in human development than ability to actually steer outcomes. Young children are very good at knowing what they want and when things don’t go how they want, even new situations for them, and are significantly worse at actually being able to bring about their desires in the world.
Re this:
I also think an adequate optimization target about the physical world is much more complex than a reward model for LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is probably due to being more optimistic about the meta-plan for alignment research, due to both my models of how research progress works, plus believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction following AGIs/ASIs.
More concretely, I think that the adequate optimization target is actually deferrable, because we can mostly just rely on instruction following and not have to worry too much about adequate optimization targets for the physical world, since we can use the first AGIs/ASIs to do interpretability and alignment research that help us reveal what optimization targets to choose for.
This is what I was trying to say: in certain applications, like automating AI interpretability/alignment research, the tradeoff is not that harsh. A lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is hard enough and safe enough to use iterative methods to solve the problem.
Agreed
People at OpenAI are absolutely trying to integrate search into LLMs, see this example where they got the Q* algorithm that aced a math test:
Also, I don't buy that it was refuted, based on this, which sounds like a refutation but isn't actually a refutation, and they never directly deny it:
Interesting, I do expect GPS to be the main bottleneck for both capabilities and inner alignment
it's generally much easier to verify that something has been done correctly than actually executing the plan yourself
Agreed, but I think the main bottleneck is crossing the formal-informal bridge, so it's much harder to come up with a specification X such that X ⟹ alignment, but once we have such a specification it'll be much easier to come up with an implementation (likely with the help of AI)
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
Yes, I think optimizing worst-case performance is one crucial part of alignment, it's also one advantage of infrabayesianism
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is probably due to being more optimistic about the meta-plan for alignment research, due to both my models of how research progress works, plus believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction following AGIs/ASIs.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we'd get to control the optimization target for automating interpretability without worrying about unintended optimization).
Agreed
Then we've converged almost completely, thanks for the conversation.
Interesting, I do expect GPS to be the main bottleneck for both capabilities and inner alignment
So you're saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Agreed, but I think the main bottleneck is crossing the formal-informal bridge, so it's much harder to come up with a specification X such that X ⟹ alignment but once we have such a specification it'll be much easier to come up with an implementation (likely with the help of AI)
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more.
I agree, though, that if we had a reliable way to cross the formal-informal bridge, it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
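One standard toy illustration of the verification/generation gap is subset sum: checking a proposed answer is cheap, while finding one by brute force means walking through many candidates. The sketch below is only an example under assumptions; the instance (`numbers`, `target`) is made up, and the brute-force search is the naive one, not the best known algorithm.

```python
from itertools import combinations


def verify(numbers, subset, target):
    """Check a proposed solution (assumes the numbers are distinct)."""
    return all(x in numbers for x in subset) and sum(subset) == target


def generate(numbers, target):
    """Brute-force search; returns (solution, number of candidates examined)."""
    examined = 0
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            examined += 1
            if sum(subset) == target:
                return subset, examined
    return None, examined


# Hypothetical instance.
numbers = [3, 34, 4, 12, 5, 2]
target = 9

solution, examined = generate(numbers, target)
print(solution, examined, verify(numbers, solution, target))
```

Verification here is a single pass over the proposed subset, while generation examines a number of candidates that can grow exponentially with the length of `numbers`.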
Yes, I think optimizing worst-case performance is one crucial part of alignment, it's also one advantage of infrabayesianism
My main thought on infrabayesianism is that while it is definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I'm not that comfortable with using infrabayesianism, even if it actually worked.
I also don't believe it's necessary for alignment/uncertainty either.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we'd get to control the optimization target for automating interpretability without worrying about unintended optimization).
I wasn't totally thinking of simulated reflection, but rather automated interpretability/alignment research.
Yeah, a big thing I admit to assuming is that the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes, but I want to see your solution first, because I still think this research could well be useful.
Then we've converged almost completely, thanks for the conversation.
Thanks! I enjoyed the conversation too.
So you're saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
yes, I think inner alignment is basically solved conditional on GPS working, for capabilities I think we still need some properties of the world model in addition to GPS.
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more.
I agree, though, that if we had a reliable way to cross the formal-informal bridge, it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
Agreed.
My main thought on infrabayesianism is that while it is definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I'm not that comfortable with using infrabayesianism, even if it actually worked.
I also don't believe it's necessary for alignment/uncertainty either.
yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don't think it's necessary for alignment (though I think some of their results or analogies of their results would show up in a full solution to alignment).
I wasn't totally thinking of simulated reflection, but rather automated interpretability/alignment research.
I intended "simulated reflection" to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.
Yeah, a big thing I admit to assuming is that the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes, but I want to see your solution first, because I still think this research could well be useful.
Thanks!
This comment is to clarify some things, not to disagree too much with you:
yes, I think inner alignment is basically solved conditional on GPS working, for capabilities I think we still need some properties of the world model in addition to GPS.
Then we'd better start cracking on how to get GPS into LLMs.
Re world modeling, I believe that while LLMs do have a world model in at least some areas, I don't think it's all that powerful or all that reliable, and IMO the meta-bottleneck on GPS/world modeling is that they were very compute expensive back in the day, and as compute and data rise, people will start trying to put GPS/world modeling capabilities in LLMs and succeeding way more compared to the past.
And I believe that a lot of the world modeling stuff will start to become much more reliable and powerful as a result of scale and some early GPS.
yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don't think it's necessary for alignment (though I think some of their results or analogies of their results would show up in a full solution to alignment).
Perhaps so, though I'd bet on synthetic data/automated interpretability being the first way we practically get a full solution to alignment.
I intended "simulated reflection" to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.
Thanks for clarifying that, now I understand what you're saying.
Epistemic status: Exploratory
Summary: In this post I will decompose the alignment problem into subproblems and frame existing approaches in terms of their relations to the subproblems. I will try to place a larger focus on the epistemic process as opposed to results of this particular problem factorization, where the aim is to obtain an epistemic strategy that can be generalized to new problems.
The case for problem decomposition
Degrees of freedom
One way to frame the advantage of factoring a problem is that doing so allows degrees of freedom to add up instead of multiply. If the solution space of a problem P contains n degrees of freedom, then without decomposing the problem, we need to search through all possible combinations to find a solution. However, if we can decompose P into two independent subproblems P1 and P2, where the degrees of freedom for each subproblem do not affect the other subproblem, then we get to independently search through the solutions of P1 and P2, which means the solution spaces of P1 and P2 add up instead of combinatorially multiply. It’s important to note that
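As a rough illustration of the add-versus-multiply claim in this section, here is a minimal sketch in Python. Everything in it is a toy assumption of mine rather than part of the argument: each degree of freedom is a single bit, the subproblems P1 and P2 have 3 and 4 bits, and each subproblem is solved only by the all-ones assignment, which is the last candidate enumerated and therefore the worst case for exhaustive search.

```python
from itertools import product

def search(num_bits, is_solution):
    """Exhaustively try every bit assignment; return (solution, candidates_checked)."""
    checked = 0
    for candidate in product((0, 1), repeat=num_bits):
        checked += 1
        if is_solution(candidate):
            return candidate, checked
    return None, checked

# Hypothetical subproblems: each is solved only by the all-ones assignment,
# which product() enumerates last, so the counts below are worst-case.
N1, N2 = 3, 4
solves_p1 = lambda bits: all(bits)
solves_p2 = lambda bits: all(bits)

# Undecomposed: search the joint space of N1 + N2 bits.
joint_solution, joint_checked = search(
    N1 + N2, lambda bits: solves_p1(bits[:N1]) and solves_p2(bits[N1:])
)

# Decomposed: search each subproblem on its own, then concatenate the parts.
part1, checked1 = search(N1, solves_p1)
part2, checked2 = search(N2, solves_p2)
factored_solution = part1 + part2
factored_checked = checked1 + checked2

print(joint_checked, factored_checked)  # 128 vs 24
```

Both searches find the same solution, but the joint search checks 2^(3+4) = 128 candidates while the factored search checks 2^3 + 2^4 = 24. The saving only appears because the two toy subproblems really are independent; if solving P1 changed which assignments solve P2, the concatenation step would no longer be valid.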
Combining forward chaining with backward chaining
Problem factoring is a form of backchaining from desired end states. In addition to this approach, we can also forward-chain from the status quo to gain information about the problem domain, which may be helpful for finding new angles of attack. However, forward chaining is most effective when we have adequate heuristics that guide our search towards insights that are more useful and generalizable. One way to develop heuristics about what insights are generalizable is to keep a wide variety of problems on hand to which we can apply new techniques, and bias our search towards insights that are helpful for multiple problems.
Concretely, decomposing the alignment problem into subproblems means that whenever we stumble upon a new insight that may be relevant to alignment, we can try to apply it to each of the subproblems, & gain a more concrete intuition about what sorts of insights are useful. In addition, we can frame existing approaches in terms of how they can help us address subproblems of alignment, so that when we consider similar approaches, we can direct our focus onto the same set of subproblems.
Scope
In this post we will focus on a narrow class of transformative AIs that can be roughly factored into three components:
While I do believe that it’s important to figure out how to align AIs with other possible architectures, we will not discuss them in this post. Nevertheless, the following are some justifications for focusing on TAIs that can be factored into a world model, a GPS module, and a targeting process:
Where does the information come from, and how do we plan on using it?
Not all problems should be framed as optimization problems
There’s a tempting style of thinking which tries to frame all problems as optimization problems (see here and here). This style of thinking seems to make sense for dualistic agents: After all, the dualistic agent has preferences over the environment, it has well-defined input and output channels, and it can hold an entire model of the environment inside its mind. All that’s left to do is to optimize the environment against its preferences using the output channel.
However, we run into issues when we try to translate this style of thinking to embedded agents: The embedded agent has some degree of introspective uncertainty, including over its own preferences, which means it doesn’t always know what objective function to optimize for; the goals of the embedded agent may depend on information in the environment that isn’t fully accessible to the agent. For instance, an embedded agent might try to satisfy the preferences of another agent, and because the agent is logically non-omniscient and smaller than the environment, it’s not straightforward to simply calculate expected utilities over all possible worlds. As a result, embedded agents can face many problems where most of the difficulty stems from finding an adequate set of criteria to optimize against, as opposed to finding out how to optimize against known criteria. The agent cannot just optimize against an arbitrary proxy for its objectives either, as that can lead to Goodhart failures.
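To make the Goodhart point concrete, here is a toy sketch in Python. The functions and numbers are made up for illustration only: true_utility stands in for what the agent actually cares about, and proxy is a measure that tracks it for small x but keeps growing after the true objective has peaked.

```python
def true_utility(x):
    # Hypothetical "real" objective: rises until x = 5, then falls.
    return 10 * x - x * x

def proxy(x):
    # Hypothetical proxy: agrees with true_utility about direction for
    # x < 5, but keeps rewarding larger x indefinitely.
    return x

candidates = range(0, 21)

best_by_proxy = max(candidates, key=proxy)
best_by_true = max(candidates, key=true_utility)

print(best_by_proxy, true_utility(best_by_proxy))  # 20 -200
print(best_by_true, true_utility(best_by_true))    # 5 25
```

Under weak optimization (say, only considering x up to 5) the proxy and the true objective pick the same option. Under strong optimization the proxy picks x = 20, which scores worse on the true objective than doing nothing at all.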
The alignment problem is a central example where the main bottleneck hinges upon defining an objective as opposed to optimizing against it. And because framing all problems as optimization problems assumes that we already know the objectives, we need to find an alternative framework which helps us think about the task of formulating the problem itself.
Desiderata, sources and bridges
One way to think about alignment which I find helpful is that we have human values on one hand, and the goals or optimization targets of the AI on the other, and we want to establish a bridge which allows information to flow from the former to the latter. We might need to formulate properties that we want this bridge to have, drawing inspiration from many different places, or try to implement properties that we already think are desirable. The following are some important features of this picture that are different from the dualistic optimization viewpoint:
Decomposing the AI alignment problem
Our main objective is to find optimization targets that lead to desirable outcomes when optimized against, and there are different sources of information which tell us what properties we want our optimization targets to have. To factorize the alignment problem, a natural place to start is to factorize these sources of information which can help us narrow our search space for our optimization targets.
One axis of factorization is the information that we have a priori vs a posteriori, that is, what information do we have before the AI starts developing a world model, vs after we have access to its world model? These two cases seem to be mostly independent because gaining access to an AI’s world model gives us new information that isn’t accessible to us a priori.
A priori
When we haven’t started training an AI and we don’t have access to the AI’s world model, there are two constraints that limit the information we have about what optimization targets are desirable:
When we don’t have access to certain types of information, we want to seek considerations which don’t make assumptions about them. As a result, when we don’t have access to the AI’s ontology and its model of human values, we should seek ontology-invariant and value-free considerations:
Value-free considerations
The main benefit of allowing the optimization target of an AI to depend on variables in the world model is that we can potentially “point” to human values inside the world model & set it as the optimization target. However, that information isn’t available when the AI hasn’t developed a world model yet, and our introspective uncertainty bars us from directly specifying our own values in the AI’s ontology, which means at this stage we should seek desirable properties of the optimization target that don’t depend on contingent features of human values. We call such considerations “value-free”.
Since value-free considerations don’t make assumptions about contingent properties of human values, they must be universal across a wide variety of agents. In other words, to search for value-free properties, we should focus on properties of the optimization target which are instrumentally convergent for agents with diverse values.
Examples of value-free considerations
Ontology-invariant considerations
Not having access to the AI’s world model means that we don’t know how the internal representations of the AI correspond to physical things in the real world. This means that when we have preferences about real world objects, we don’t know how that preference should be expressed in relation to the AI’s internal representations. In other words, we don’t know how to make the AI care about apples and dogs when we don’t know which parts of the AI’s mind point to apples and dogs.
When we face such limitations, we should seek properties of the optimization target that are desirable regardless of what ontology the AI might end up developing; when we don’t know how the AI will describe the world, we can still implement the parts of our preferences which don’t depend on which description of the world will end up being used.
Examples of ontology-invariant considerations
Advantage
The main benefit of a priori properties of the optimization target is that they can be deployed before the AI starts developing a sophisticated world model. In other words, they are more robust to scaling down.
A posteriori
We’ve discussed two limitations in the a priori stage when we don’t have access to the AI’s world model, which means the main question we should ask in the a posteriori stage is what opportunities are unlocked once those limitations are lifted? What new sources of information do we gain access to which we previously didn’t?
The AI gets to observe us
Once the AI develops a sophisticated world model, that world model will likely contain information about human values. This means that a key consideration in the a posteriori stage is how we can leverage that information to determine properties of the optimization target.
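A minimal sketch of how this could look for the world model / GPS / targeting process factoring, in Python. All names here (WorldModel, targeting_process, general_purpose_search, the "estimated_human_preference" variable) are hypothetical placeholders of mine, and the one-dimensional "preference" is a deliberately crude stand-in for human values. The only thing it is meant to show is that the optimization target is read out of the world model, so the target moves when the world model's estimate moves.

```python
from typing import Callable, Dict, Iterable

# Hypothetical world model: variable name -> current estimate.
WorldModel = Dict[str, float]

def targeting_process(world_model: WorldModel) -> Callable[[float], float]:
    """Map learned knowledge to an objective: score actions by how close they
    are to the world model's current estimate of what humans want."""
    preferred = world_model["estimated_human_preference"]
    return lambda action: -abs(action - preferred)

def general_purpose_search(objective: Callable[[float], float],
                           actions: Iterable[float]) -> float:
    """Stand-in for GPS: pick the action that scores best on the objective."""
    return max(actions, key=objective)

actions = [0.0, 1.0, 2.0, 3.0, 4.0]

# Early world model: a rough estimate of the human preference.
world_model: WorldModel = {"estimated_human_preference": 1.2}
first_choice = general_purpose_search(targeting_process(world_model), actions)

# After more observation the estimate is revised, and the target follows it.
world_model["estimated_human_preference"] = 3.4
second_choice = general_purpose_search(targeting_process(world_model), actions)

print(first_choice, second_choice)  # 1.0 3.0
```

The search routine never changes between the two calls; only the world model does. That is the sense in which information about human values in the world model can be leveraged to set the optimization target, though a real targeting process would of course need to locate that information rather than look it up by a known key.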
Examples
We get to observe the AI’s world model
The second limitation that’s lifted when the AI starts developing a world model is that we get to inspect the world model and gain information about the AI’s ontology. This means that in addition to the AI gaining a better understanding of our values, we can also become better at designing the targeting process ourselves by understanding the AI’s world model.
Examples
Backpropagation
Our discussion has mainly focused on considerations about the targeting process, but the targeting process is entangled with the world model and the general purpose search module. This means that we should backpropagate our desiderata for the targeting process to inform design decisions about the rest of the components. For instance, if we want our optimization target to be robust to ontology shifts, we should try to design world models which are capable of modeling the world at multiple levels of abstraction and explicitly representing the relationships between different levels.