Great post! Agree with the points raised but would like to add that restricting the expressivity isn’t the only way that we can try to make the world model more interpretable by design. There are many ways that we can decompose a world model into components, and human concepts correspond to some of the components (under a particular decomposition) as opposed to the world model as a whole. We can backpropagate desiderata about ontology identification to the way that the world model is decomposed.
For instance, suppose that we’re trying to identify the ...
I think one pattern that needs to hold in the environment for subgoal corrigibility to make sense is that the world is modular, but that the modularity structure can be broken or changed.
For one, modularity is the main thing that enables general purpose search: if we can optimize for a goal by optimizing for just a few instrumental subgoals while ignoring the influence of pretty much everything else, that reflects some degree of modularity in the problem space.
Secondly, if the modularity structure of the environment stays constant no matter what (...
Imagine that I'm watching the video of the squirgle, and suddenly the left half of the TV blue-screens. Then I'd probably think "ah, something messed up the TV, so it's no longer showing me the squirgle" as opposed to "ah, half the squirgle just turned into a big blue square". I know that big square chunks turning a solid color is a typical way for TVs to break, which largely explains away the observation; I think it much more likely that the blue half-screen came from some failure of the TV rather than an unprecedented behavior of the squirgle.
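To make the explaining-away arithmetic concrete, here's a toy Bayes calculation (all the priors and likelihoods below are made up purely for illustration):

```python
# Hypotheses for the observation "left half of the screen is solid blue":
#   H_tv:       the TV broke (blue half-screens are a typical failure mode)
#   H_squirgle: the squirgle itself turned half blue (unprecedented)
p_tv = 0.01                  # prior: TVs occasionally break mid-broadcast
p_squirgle = 1e-6            # prior: squirgles have never been seen doing this

p_obs_given_tv = 0.5         # blue half-screen is a common way for a TV to fail
p_obs_given_squirgle = 1.0   # if the squirgle did turn blue, we'd surely see it

# Unnormalized posteriors (Bayes' rule, ignoring other hypotheses):
post_tv = p_tv * p_obs_given_tv
post_squirgle = p_squirgle * p_obs_given_squirgle

print(post_tv / post_squirgle)  # ~5000:1 odds in favor of "the TV broke"
```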
My me...
Noted, that does seem a lot more tractable than using natural latents to pin down the details of CEV on their own.
Natural latents are about whether the AI's cognition routes through the same concepts that humans use.
We can imagine the AI maintaining predictive accuracy about humans without using the same human concepts. For example, it can use low-level physics to simulate the environment, which would be predictively accurate, but that cognition doesn't make use of the concept "strawberry" (in principle, we can still "single out" the concept of "strawberry" within it, but that information comes mostly from us, not from the physics simulation)
Natural latents are equiva...
I think the fact that natural latents are much lower-dimensional than all of physics makes them suitable for specifying the pointer to CEV as an equivalence class over physical processes (many quantum field configurations can correspond to the same human, and we want to ignore differences within that equivalence class).
IMO the main bottleneck is accounting for the reflective aspects of CEV, because one constraint on natural latents is that they should be redundantly represented in the environment.
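For reference, here's a rough sketch of the two conditions on a natural latent as I understand them (my paraphrase; the real versions are approximate, with error bounds):

```latex
% A latent \Lambda over observables X_1, ..., X_n is "natural" when:
% 1. Mediation: the X_i are independent given \Lambda.
P(X_1, \dots, X_n \mid \Lambda) = \prod_{i=1}^{n} P(X_i \mid \Lambda)
% 2. Redundancy: \Lambda can be recovered while ignoring any one X_i.
\forall i: \quad \Lambda \approx f_i(X_{\neq i})
```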
like infinite-state Turing machines, or something like this:
Interesting, I'll check it out!
Then we've converged almost completely, thanks for the conversation.
Thanks! I enjoyed the conversation too.
So you're saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities, I think we still need some properties of the world model in addition to GPS.
...While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of info
This is what I was trying to say: the tradeoff in certain applications, like automating AI interpretability/alignment research, is not that harsh. And a lot of the methods that make personal intent/instruction-following AGIs feasible let you extract optimization that is powerful enough, and safe enough, to use iterative methods to solve the problem.
Agreed
...People at OpenAI are absolutely trying to integrate search into LLMs; see the example where their Q* algorithm reportedly aced a math test:
Re the issue of Goodhart failures, maybe a kind of crux is how much we expect General Purpose Search to be aimable by humans.
I also expect general purpose search to be aimable; in fact, it's selected to be aimable so that the AI can recursively retarget GPS on instrumental subgoals.
which means we can apply a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
I think there’s a fundamental tradeoff between optimization pressure & correctability, because if we apply...
We can also get more optimization if we have better tools to aim General Purpose Search, so that we can correct the model if it goes wrong.
Yes, I think having an aimable general purpose search module is the most important bottleneck for solving inner alignment
I think things can still go wrong if we apply too much optimization pressure to an inadequate optimization target because we won’t have a chance to correct the AI if it doesn’t want us to (I think adding corrigibility is a form of reducing optimization pressure, but it's still desirable).
...Good point. I agree that a wrong model of the user's preferences is my main concern, and most alignment thinkers'. And that it can happen with personal intent alignment as well as value alignment.
This is why I prefer instruction-following to corrigibility as a target. If it's aligned to follow instructions, it doesn't need nearly as much of a model of the user's preferences to succeed. It just needs to be instructed to talk through its important actions before executing them, like "Okay, I've got an approach that should work. I'll engineer a gene dri
Yes, I think synthetic data could be useful for improving the world model. It's arguable that allowing humans to select/filter synthetic data for training counts as a form of active learning, because the AI is gaining information about human preferences through its own actions (generating synthetic data for humans to choose). If we have some way of representing uncertainty over human values, we can let our AI argmax over synthetic data with the objective of maximizing information gain about human values (when the synthetic data is filtered).
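A minimal sketch of that argmax, assuming a toy discrete posterior over candidate value hypotheses and a made-up accept/reject model for the human filter (all names and numbers here are stand-ins, not a real proposal):

```python
import numpy as np

# Toy setup: 'hypotheses' are candidate human value functions, and
# accept_prob[h, d] is the chance a human with values h accepts
# synthetic datapoint d during filtering. All numbers are illustrative.
rng = np.random.default_rng(0)
n_hyp, n_data = 5, 20
prior = np.full(n_hyp, 1 / n_hyp)              # uncertainty over human values
accept_prob = rng.uniform(0, 1, (n_hyp, n_data))

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def expected_info_gain(d):
    # Expected entropy reduction from observing accept/reject on datapoint d.
    gain = entropy(prior)
    for outcome_prob in (accept_prob[:, d], 1 - accept_prob[:, d]):
        p_outcome = (prior * outcome_prob).sum()
        if p_outcome > 0:
            posterior = prior * outcome_prob / p_outcome
            gain -= p_outcome * entropy(posterior)
    return gain

# Argmax over synthetic data: show the human the most informative datapoint.
best = max(range(n_data), key=expected_info_gain)
print(f"most informative datapoint to show the human: {best}")
```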
I think using...
Interesting! I have to read the papers in more depth but here are some of my initial reactions to that idea (let me know if it’s been addressed already):
Good point!
But more recently I've been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent alignment) from authorized user(s).
I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want t...
Yes, I plan to write a sequence about it some time in the future, but here are some rough high-level sketches:
Thanks! I recall reading the steering subsystems post a while ago & it matched a lot of my thinking on the topic. The idea of using variables in the world model to determine the optimization target also seems similar to your "Goals selected from learned knowledge" approach (the targeting process is essentially a mapping from learned knowledge to goals).
Another motivation for the targeting process (which might also be an advantage of GLSK) I forgot to mention is that we can allow the AI to update its goals as it updates its knowledge (eg about what...
Right! I'm pleased that you read those posts and got something from them.
I worry less about value lock-in and more about The alignment stability problem, which is almost the opposite.
But more recently I've been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent align...
I intended 'local' (aka not global) to be a necessary but not sufficient condition for predictions made by smaller maps within the territory to be possible (because global prediction runs into problems of embedded agency).
I'm mostly agnostic about what the other necessary conditions are & what the sufficient conditions are
Yes, there's no upper bound for what counts as "local" (except global), but there is an upper bound for the scale at which agents' predictions can outpace the territory (eg humans can't predict everything in the galaxy)
I meant upper bound in the second sense
Yes, by "locally outpace" I simply meant outpace at some non-global scale; there will of course be some tighter upper bound on that scale when it comes to real-world agents.
I don't think we disagree?
The point was exactly that although we can't outpace the territory globally, we can still do it locally (by throwing out info we don't care about, like solar flares).
That by itself is not that interesting. The interesting part is: given that different embedded maps throw out different info & retain some info, is there any info that's convergently retained by a wide variety of maps? (aka natural latents)
The rest of the disagreement seems to boil down to terminology
The claim was that a subprogram (map) embedded within a program (territory) cannot predict the *entire* execution trace of that program faster than the program itself, given computational irreducibility.
"there are many ways of escaping them in principle, or even in practice (by focusing on abstract behavior of computers)."
Yes, I think this is the same point as my point about coarse-graining (outpacing the territory "locally" by throwing away some info).
I agree, just changed the wording of that part
Embedded agency & computational irreducibility imply that the smaller map cannot outpace the full time evolution of the territory because it is a part of it, which may or may not be important for real-world agents.
In cases where the response time of the map does matter to some extent, embedded maps often need to coarse-grain over the territory to "locally" outpace it.
We may think of natural latents as coarse-grainings that are convergent across a wide variety of embedded maps.
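A toy illustration of "locally outpacing by coarse-graining" (the dynamics are a deliberately simple stand-in, chosen so the coarse statistic is exactly conserved):

```python
import random

def step(state):
    # Territory dynamics: every particle shifts one cell right (periodic).
    n = len(state)
    return [state[(i - 1) % n] for i in range(n)]

state = [random.random() < 0.3 for _ in range(1000)]

# Full map: simulate everything to answer a question about time T = 10_000.
full = state
for _ in range(10_000):
    full = step(full)
print(sum(full))   # particle count at time T, via O(N * T) work

# Coarse map: throw away positions, keep only the count. Since these
# dynamics conserve particle number, the coarse map answers the same
# question in O(1), outpacing the territory on the statistic we care about.
print(sum(state))  # same answer, no simulation
```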
Presumably there can be a piece of paper somewhere with the laws of physics & initial conditions of our universe written on it. That piece of paper can "fully capture" the entire territory in that sense.
But no agents within our universe can compute all consequences of the territory using that piece of paper (given computational irreducibility), because that computation would be part of the time evolution of the universe itself
I think that bit is about embedded agency
Would there be a problem when speculators can create stocks in the conditional case? As in, if a decision C harms me, can I create and sell loads and loads of C stock, and not have to actually go through the trade when C is not enforced (due to the low price I've caused)?
<3!