All of Daniel C's Comments + Replies

Great post! I agree with the points raised, but would like to add that restricting expressivity isn’t the only way we can try to make the world model more interpretable by design. There are many ways to decompose a world model into components, and human concepts correspond to some of those components (under a particular decomposition) rather than to the world model as a whole. We can backpropagate desiderata about ontology identification to the way the world model is decomposed.

 

For instance, suppose that we’re trying to identify the ... (read more)

I think one pattern which needs to hold in the environment in order for subgoal corrigibility to make sense is that the world is modular, but that modularity structure can be broken or changed


For one, modularity is the main thing that enables general purpose search: If we can optimize for a goal by just optimizing for a few instrumental subgoals while ignoring the influence of pretty much everything else, then that reflects some degree of modularity in the problem space
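As a toy illustration of this point (my own sketch, with made-up payoff tables rather than anything from the original comment): when the objective factors into loosely coupled terms, optimizing each subgoal separately while ignoring everything else recovers the jointly optimal plan.

```python
# A toy sketch (not from the comment): a "modular" objective that factors as
# f(x, y, z) = f1(x) + f2(y) + f3(z), so subgoal-by-subgoal optimization
# matches brute-force joint optimization. All numbers are illustrative.
from itertools import product

f1 = {0: 1.0, 1: 3.0}
f2 = {0: 2.0, 1: 0.5}
f3 = {0: 0.0, 1: 4.0}

def f(x, y, z):
    return f1[x] + f2[y] + f3[z]

# Optimize each instrumental subgoal against only its own term.
modular_solution = (max(f1, key=f1.get), max(f2, key=f2.get), max(f3, key=f3.get))

# Brute-force joint optimization over the whole space, for comparison.
joint_solution = max(product([0, 1], repeat=3), key=lambda xyz: f(*xyz))

print(modular_solution == joint_solution)  # True: modularity makes subgoals sufficient
```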

Secondly, if the modularity structure of the environment stays constant no matter what (... (read more)

Imagine that I'm watching the video of the squirgle, and suddenly the left half of the TV blue-screens. Then I'd probably think "ah, something messed up the TV, so it's no longer showing me the squirgle" as opposed to "ah, half the squirgle just turned into a big blue square". I know that big square chunks turning a solid color is a typical way for TVs to break, which largely explains away the observation; I think it much more likely that the blue half-screen came from some failure of the TV rather than an unprecedented behavior of the squirgle.

 

My me... (read more)

Noted, that does seem a lot more tractable than using natural latents to pin down details of CEV by itself

natural latents are about whether the AI's cognition routes through the same concepts that humans use.

We can imagine the AI maintaining predictive accuracy about humans without using the same human concepts. For example, it can use low-level physics to simulate the environment, which would be predictively accurate, but that cognition doesn't make use of the concept "strawberry" (in principle, we can still "single out" the concept of "strawberry" within it, but that information comes mostly from us, not from the physics simulation)


Natural latents are equiva... (read more)

3Lucius Bushnaq
My claim is that the natural latents the AI needs to share for this setup are not about the details of what a 'CEV' is. They are about what researchers mean when they talk about initializing, e.g., a physics simulation with the state of the Earth at a specific moment in time.
10Daniel C

I think the fact that natural latents are much lower-dimensional than all of physics makes them suitable for specifying the pointer to CEV as an equivalence class over physical processes (many quantum field configurations can correspond to the same human, and we want to ignore differences within that equivalence class).

IMO the main bottleneck is to account for the reflective aspects of CEV, because one constraint on natural latents is that they must be redundantly represented in the environment.

2Lucius Bushnaq
It is redundantly represented in the environment, because humans are part of the environment. If you tell an AI to imagine what happens if humans sit around in a time loop until they figure out what they want, this will single out a specific thought experiment to the AI, provided humans and physics are concepts the AI itself thinks in. (The time loop part and the condition for terminating the loop can be formally specified in code, so the AI doesn't need to think those are natural concepts) If the AI didn't have a model of human internals that let it predict the outcome of this scenario, it would be bad at predicting humans.  

like infinite state Turing machines, or something like this:

https://arxiv.org/abs/1806.08747

 

Interesting, I'll check it out!

3Noosphere89
Note that it isn't intended to be in any way a realistic program we could ever run, but rather an interesting ideal case where we could compute every well-founded set.

Then we've converged almost completely, thanks for the conversation.

 Thanks! I enjoyed the conversation too.

So you're saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?

Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities, I think we still need some properties of the world model in addition to GPS.

While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of info

... (read more)
3Noosphere89
This comment is to clarify some things, not to disagree too much with you:

Then we'd better start cracking on how to get GPS into LLMs.

Re world modeling, I believe that while LLMs do have a world model in at least some areas, I don't think it's all that powerful or all that reliable, and IMO the meta-bottleneck on GPS/world modeling is that they were very compute expensive back in the day, and as compute and data rise, people will start trying to put GPS/world modeling capabilities in LLMs and succeeding way more compared to the past. And I believe that a lot of the world modeling stuff will start to become much more reliable and powerful as a result of scale and some early GPS.

Perhaps so, though I'd bet on synthetic data/automated interpretability being the first way we practically get a full solution to alignment.

Thanks for clarifying that, now I understand what you're saying.

This is what I was trying to say: in certain applications, like automating AI interpretability/alignment research, the tradeoff is not that harsh, and a lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is hard and safe enough to use iterative methods to solve the problem.

Agreed

People at OpenAI are absolutely trying to integrate search into LLMs, see this example where they got the Q* algorithm that aced a math test:

https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type

... (read more)
3Noosphere89
Then we've converged almost completely, thanks for the conversation.

So you're saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?

While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more. I agree, though, that if we had a reliable way to cross the formal-informal bridge it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.

My main thought on infrabayesianism is that while it's definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I'm not that comfortable with using infrabayesianism, even if it actually worked. I also don't believe it's necessary for alignment/uncertainty either.

I wasn't totally thinking of simulated reflection, but rather automated interpretability/alignment research.

Yeah, a big thing I admit to assuming is that the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes, but I want to see your solution first, because I still think this research could well be useful.
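As a quick illustration of the verification/generation gap being discussed (my own toy example, not from either commenter; the subset-sum instance is arbitrary): checking a proposed certificate is cheap, while generating one from scratch means searching an exponential space.

```python
# Toy verification/generation gap: verifying a subset-sum certificate is a
# one-liner, generating one requires (in the worst case) exponential search.
from itertools import combinations

numbers, target = [3, 34, 4, 12, 5, 2], 9

def verify(candidate):
    # cheap: membership check plus one sum
    return set(candidate) <= set(numbers) and sum(candidate) == target

def generate():
    # expensive: enumerate subsets until one hits the target
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

solution = generate()
print(solution, verify(solution))   # e.g. (4, 5) True
```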

Re the issue of Goodhart failures, maybe a kind of crux is how much we expect General Purpose Search to be aimable by humans

 

I also expect general purpose search to be aimable; in fact, it’s selected to be aimable so that the AI can recursively retarget GPS on instrumental subgoals

 

which means we can apply a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.

 

I think there’s a fundamental tradeoff between optimization pressure & correctability, because if we apply... (read more)

3Noosphere89
This is what I was trying to say: in certain applications, like automating AI interpretability/alignment research, the tradeoff is not that harsh, and a lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is hard and safe enough to use iterative methods to solve the problem.

I kind of agree that, at least at realistic compute levels (say through 2030), lack of search is a major bottleneck to better AI, but a few things to keep in mind:

People at OpenAI are absolutely trying to integrate search into LLMs, see this example where they got the Q* algorithm that aced a math test: https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type

Also, I don't buy that it was refuted, based on this, which sounds like a refutation but isn't actually a refutation, and they never directly deny it: https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type#ECyqFKTFSLhDAor7k

Re today's AIs being weak evidence that alignment generalizes further than capabilities: I think the theoretical and empirical reasons why alignment generalizes further than capabilities are in large part (but not the entire story) reducible to why it's generally much easier to verify that something has been done correctly than to execute the plan yourself.

Re this: I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is probably that I'm more optimistic about the meta-plan for alignment research, due to both my models of how research progress works, plus believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction-following AGIs/ASIs.

More concretely, I think that the adequate optimization target is actually deferrable, because we can mostly

We can also get more optimization if we have better tools to aim General Purpose Search more so that we can correct the model if it goes wrong.

 

Yes, I think having an aimable general purpose search module is the most important bottleneck for solving inner alignment

I think things can still go wrong if we apply too much optimization pressure to an inadequate optimization target because we won’t have a chance to correct the AI if it doesn’t want us to (I think adding corrigibility is a form of reducing optimization pressure, but it's still desirable).

Good point. I agree that the wrong model of user's preferences is my main concern and most alignment thinkers'.  And that it can happen with a personal intent alignment as well as value alignment. 

This is why I prefer instruction-following to corrigibility as a target. If it's aligned to follow instructions, it doesn't need nearly as much of a model of the user's preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like "Okay, I've got an approach that should work. I'll engineer a gene dri

... (read more)
3Noosphere89
Re this: We can also get more optimization if we have better tools to aim General Purpose Search more so that we can correct the model if it goes wrong.

Yes, I think synthetic data could be useful for improving the world model. It's arguable that allowing humans to select/filter synthetic data for training counts as a form of active learning, because the AI is gaining information about human preferences through its own actions (generating synthetic data for humans to choose). If we have some way of representing uncertainties over human values, we can let our AI argmax over synthetic data with the objective of maximizing information gain about human values (when the synthetic data is filtered).
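A minimal sketch of that selection rule, under strong simplifying assumptions I'm adding purely for illustration (a small discrete set of value hypotheses and known accept probabilities per hypothesis; none of these numbers come from the comment):

```python
# Sketch of active learning over synthetic data: pick the datapoint whose
# human accept/reject decision maximizes expected information gain about
# which "human values" hypothesis is correct. All values are illustrative.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

prior = np.array([0.5, 0.3, 0.2])   # prior over 3 hypothetical value hypotheses

# accept_prob[h, d] = probability that a human holding hypothesis h
# would accept (not filter out) synthetic datapoint d
accept_prob = np.array([
    [0.9, 0.5, 0.1],   # hypothesis 0
    [0.2, 0.5, 0.8],   # hypothesis 1
    [0.5, 0.5, 0.5],   # hypothesis 2 (uninformative about these datapoints)
])

def expected_info_gain(d):
    """Expected entropy reduction over hypotheses after seeing the human's choice on d."""
    p_accept = prior @ accept_prob[:, d]
    post_accept = prior * accept_prob[:, d] / p_accept
    post_reject = prior * (1 - accept_prob[:, d]) / (1 - p_accept)
    expected_posterior_entropy = (p_accept * entropy(post_accept)
                                  + (1 - p_accept) * entropy(post_reject))
    return entropy(prior) - expected_posterior_entropy

best = max(range(accept_prob.shape[1]), key=expected_info_gain)
print("most informative datapoint to show the human:", best)
```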

I think using... (read more)

3Noosphere89
Note that whenever I say corrigibility, I really mean instruction following, ala @Seth Herd's comments.

Re the issue of Goodhart failures, maybe a kind of crux is how much we expect General Purpose Search to be aimable by humans, and my view is that we will likely be able to get AI that is both close enough to our values and very highly aimable because of the very large amounts of synthetic data, which means we can apply a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.

Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here: https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/

Interesting! I have to read the papers in more depth but here are some of my initial reactions to that idea (let me know if it’s been addressed already):

  • AFAICT using learning to replace GPS either requires: 1) training examples of good actions, or 2) an environment like chess where we can rapidly gain feedback through simulation. Sampling from the environment would be much more costly when these assumptions break down, and general purpose search can enable lower sample complexity because we get to use all the information in the world model (see the sketch after this list)
  • General purpose sea
... (read more)
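A minimal sketch of the contrast drawn in the first bullet (my own illustration; the tiny graph "world model" and goal predicate are made up): general purpose search plans directly against an explicit world model, needing no training examples of good actions, and retargeting it is just swapping the goal predicate.

```python
# General purpose search over an explicit world model: zero training samples
# needed, only the model itself. (A learned policy would instead need labeled
# good actions or cheap simulated feedback, as the bullet above notes.)
from collections import deque

world_model = {            # state -> {action: next_state}
    "start": {"left": "a", "right": "b"},
    "a": {"forward": "dead_end"},
    "b": {"forward": "goal"},
    "dead_end": {}, "goal": {},
}

def general_purpose_search(model, start, is_goal):
    """Breadth-first search over the world model; returns a plan (list of actions)."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if is_goal(state):
            return plan
        for action, nxt in model[state].items():
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None

print(general_purpose_search(world_model, "start", lambda s: s == "goal"))
# -> ['right', 'forward']; retargeting is just swapping in a new goal predicate.
```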
5Noosphere89
Agree, learning can't entirely replace General Purpose Search, and I agree something like General Purpose Search will still in practice be the backbone behind learning, due to your reasoning. That is, General Purpose Search will still be necessary for AIs, if only due to bootstrapping concerns, and I agree with your list of benefits of General Purpose Search.

Good point!

But more recently I've been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent alignment) from authorized user(s). 

I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want t... (read more)

6Seth Herd
Good point. I agree that the wrong model of user's preferences is my main concern and most alignment thinkers'. And that it can happen with a personal intent alignment as well as value alignment.

This is why I prefer instruction-following to corrigibility as a target. If it's aligned to follow instructions, it doesn't need nearly as much of a model of the user's preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like "Okay, I've got an approach that should work. I'll engineer a gene drive to painlessly eliminate the human population". "Um okay, I actually wanted the humans to survive and flourish while solving cancer, so let's try another approach that accomplishes that too...". I describe this as do-what-I-mean-and-check, DWIMAC.

The Harms version of corrigibility is pretty similar in that it should take instructions first and foremost, even though it's got a more elaborate model of the user's preferences to help in interpreting instructions correctly, and it's supposed to act on its own initiative in some cases. But the two approaches may converge almost completely after a user has given a wise set of standing instructions to their DWIMAC AGI.

Also, accurately modeling short-term intent - what the user wants right now - seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it's also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.

Absent all of that, it seems like there are still two advantages to modeling just one person's values instead of all of humanity's. The smaller one is that you don't need to understand as many people or figure out how to aggregate values that conflict with each other. I think that's not actually that hard since lots of compromises could give very good futures, but I haven't thought that one all the
3Noosphere89
I agree with both of those methods being used, and IMO a third way, or maybe a way to improve the two methods, is to use synthetic data early and fast on human values and instruction following. The big reason for using synthetic data here is both to improve the world model and to use it to implement stuff like instruction following, by making datasets where superhuman AI always obeys the human master despite large power differentials, or value learning where we offer large datasets of AI always faithfully acting on the best of our human values, as described here: https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/

The biggest reason I wouldn't be too concerned about that problem is that, assuming no deceptive alignment/adversarial behavior from the AI (which is likely to be enforced by synthetic data), I think the problem you're talking about is likely to be solvable in practice, because we can just make the AI's General Purpose Search/world model more capable without causing problems, which means we can transform this in large part into a problem that goes away with scale.

More generally, this unlocks the ability to automate the hard parts of alignment research, which lets us offload most of the hard work onto the AI.

Yes, I plan to write a sequence about it some time in the future, but here are some rough high-level sketches:

  • Basic assumptions: Modularity implies that the program can be broken down into loosely coupled components. For now, I'll just assume that each component has some "class definition" which specifies how it interacts with other components; "class definitions" can be reused (aka we can instantiate multiple components of the same class); each component can aggregate info from other components & the info they store can be used by other components (a toy sketch follows this list)
... (read more)
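A toy sketch of the basic assumptions in the bullet above (my own illustration, not the actual proposed formalism; the `ClassDefinition`/`Component` names and the averaging rule are invented for the example):

```python
# Components with reusable "class definitions"; each component aggregates
# info from the components it is wired to, and its state can in turn be read
# by other components. Purely illustrative structure.
from dataclasses import dataclass, field

@dataclass
class ClassDefinition:
    name: str
    aggregate: callable          # how a component combines inputs from its neighbors

@dataclass
class Component:
    definition: ClassDefinition
    inputs: list = field(default_factory=list)   # other Components it reads from
    state: float = 0.0

    def update(self):
        # aggregate info from upstream components into this component's state
        self.state = self.definition.aggregate([c.state for c in self.inputs])

# one class definition, instantiated multiple times (reuse of "class definitions")
averager = ClassDefinition("averager", lambda xs: sum(xs) / len(xs) if xs else 0.0)

sensor_a = Component(averager); sensor_a.state = 1.0
sensor_b = Component(averager); sensor_b.state = 3.0
combiner = Component(averager, inputs=[sensor_a, sensor_b])

combiner.update()
print(combiner.state)   # 2.0 — info aggregated from the two upstream components
```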
4Nathan Helm-Burger
For what it's worth, the human brain (including the cortex) has a fixed modularity. Long range connections are created during fetal development according to genetic rules, and can only be removed, not rerouted or added to. I believe this is what causes the high degree of functional localization in the cortex.
5Noosphere89
I think that this would make a very nice sequence, and despite all my discussion with you, I'd absolutely like to see this sequence carried out.
30Daniel C

Thanks! I recall reading the steering subsystems post a while ago & it matched a lot of my thinking on the topic. The idea of using variables in the world model to determine the optimization target also seems similar to your "Goals selected from learned knowledge" approach (the targeting process is essentially a mapping from learned knowledge to goals).


Another motivation for the targeting process (which might also be an advantage of GSLK) I forgot to mention is that we can allow the AI to update its goals as it updates its knowledge (eg about what... (read more)

117Seth Herd

Right! I'm pleased that you read those posts and got something from them.

I worry less about value lock-in and more about The alignment stability problem which is almost the opposite.

But more recently I've been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent align... (read more)

I intended 'local' (aka not global) to be a necessary but not sufficient condition for predictions made by smaller maps within it to be possible (because global prediction runs into problems of embedded agency)

I'm mostly agnostic about what the other necessary conditions are & what the sufficient conditions are

Yes, there's no upper bound for what counts as "local" (except global), but there is an upper bound for the scale at which agents' predictions can outpace the territory (eg humans can't predict everything in the galaxy)

I meant upper bound in the second sense

2Vladimir_Nesov
The relevance of extracting/formulating something "local" is that prediction by smaller maps within it remains possible, ignoring the "global" solar flares and such. So that is a situation that could be set up so that a smaller agent predicts everything eons in the future at galaxy scale. Perhaps a superintelligence predicts human process of reflection, that is it's capable of perfectly answering specific queries before the specific referenced event would take place in actuality, while the computer is used to run many independent possibilities in parallel. So the superintelligence couldn't enumerate them all in advance, but it could quickly chase and overtake any given one of them. Even a human would be capable of answering such questions if nothing at all is happening within this galaxy scale computer, and the human is paused for eons after making the prediction that nothing will be happening. (I don't see what further "first sense" of locality or upper bound that is distinct from this could be relevant.)

Yes, by "locally outpace" I simply meant outpace at some non-global scale; there will of course be some tighter upper bound for that scale when it comes to real-world agents

2Vladimir_Nesov
What I'm saying is that there is no upper bound for real world agents, the scale of "locally" in this weird sense can be measured in eons and galaxies.

I don't think we disagree?

The point was exactly that although we can't outpace the territory globally, we can still do it locally (by throwing out info we don't care about, like solar flares)

That by itself is not that interesting. The interesting part is: given that different embedded maps throw out different info & retain some info, is there any info that's convergently retained by a wide variety of maps? (aka natural latents)

The rest of the disagreement seems to boil down to terminology

2Vladimir_Nesov
The locally/globally distinction is suspicious, since "locally" here can persist at an arbitrary scale. If all the different embedded maps live within the same large legible computation, statistical arguments that apply to the present-day physical world will fail to clarify the dynamics of their interaction.

The claim was that a subprogram (map) embedded within a program (territory) cannot predict the entire execution trace of that program faster than the program itself, given computational irreducibility

"there are many ways of escaping them in principle, or even in practice (by focusing on abstract behavior of computers)."

Yes, I think this is the same point as my point about coarse graining (outpacing the territory "locally" by throwing away some info)

2Vladimir_Nesov
My point about computers-in-practice is that this is no longer an issue within the computers, indefinitely. You can outpace the territory within a computer using a smaller map from within the computer. Whatever "computational irreducibility" is, the argument doesn't apply for many computations that can be set up in practice, that is they can be predicted by smaller parts of themselves. (Solar flares from distant future can't be predicted, but even that is not necessarily an important kind of practical question in the real world, after the universe is overwritten with computronium, and all the stars are dismantled to improve energy efficiency.)

I agree, just changed the wording of that part

Embedded agency & computational irreducibility imply that the smaller map cannot outpace the full time evolution of the territory because it is a part of it, which may or may not be important for real-world agents.

In the case where the response time of the map does matter to some extent, embedded maps often need to coarse-grain over the territory to "locally" outpace the territory

We may think of natural latents as coarse grainings that are convergent for a wide variety of embedded maps
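A toy illustration of coarse-graining letting a map "locally" outpace the territory (my own example, with made-up dynamics): the full microstate can only be predicted by running the dynamics step by step, but a coarse-grained summary that happens to be conserved can be predicted arbitrarily far ahead without simulating anything.

```python
# The "territory" is 1000 cells shuffling units of stuff around at random;
# the microstate is chaotic, but the total is conserved, so a map that throws
# away everything except the total can answer far-future queries instantly.
import random

def step(state):
    """Territory dynamics: move one unit between two random cells (total conserved)."""
    i, j = random.randrange(len(state)), random.randrange(len(state))
    if state[i] > 0:
        state[i] -= 1
        state[j] += 1

random.seed(0)
territory = [random.randint(0, 10) for _ in range(1000)]

# Coarse-grained map: prediction cost is independent of how far ahead we ask.
coarse_prediction = sum(territory)

# Full microstate prediction: no shortcut, must actually run the dynamics.
for _ in range(10_000):
    step(territory)

print(coarse_prediction == sum(territory))  # True: the coarse map was right
```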

2Vladimir_Nesov
A program can predict another program regardless of when either of them is instantiated in the territory (neither needs to be instantiated for this to work, or they could be instantiated at many times simultaneously). Statistical difficulties need to be set up more explicitly, there are many ways of escaping them in principle (by changing the kind of territory we are talking about), or even in practice (by focusing on abstract behavior of computers).

Presumably there can be a piece of paper somewhere with the laws of physics & initial conditions of our universe written on it. That piece of paper can "fully capture" the entire territory in that sense.

But no agents within our universe can compute all consequences of the territory using that piece of paper (given computational irreducibility), because that computation would be part of the time evolution of the universe itself

I think that bit is about embedded agency

3Vladimir_Nesov
A map is something that can answer queries; it doesn't need to be specifically a giant lookup table. If a map can perfectly describe any specific event when queried about it, it's already centrally a perfect map, even if it didn't write down all answers to all possible questions on stone tablets in advance. But even then, in a computational territory we could place a smaller map that is infinite in time, and it will be able to write down all that happens in that territory at all times, with explicit representations of events in the territory being located either in the past or in the future of the events themselves.

Would there be a problem when speculators can create stocks in the conditional case? As in, if a decision C harms me, can I create and sell loads and loads of C stock, and not have to actually go through the trade when C is not enforced (due to the low price I've caused)?

[This comment is no longer endorsed by its author]
1JBlack
In the simple conditional case with N possible outcomes, you are (in the basic case) paying $1 to create 2N stocks: W|D_i and (1-W)|D_i for each of the N decisions D_i, where W is the agreed welfare metric ranging from 0 to 1. When decision n is implemented and the outcome measured, the W|D_n and (1-W)|D_n stocks pay out appropriately. So yes, if you never sold your |D_n stocks then you get $(W + 1-W) = $1 back. However, you don't have an unlimited number of dollars and can't create an unlimited number of stocks.
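A small worked sketch of that accounting (mine, and it adds one assumption the comment doesn't spell out: that stocks conditional on decisions that aren't taken are simply voided):

```python
# Paying $1 mints a W|D_i and a (1-W)|D_i stock for every decision D_i; when
# decision n is implemented, the |D_n pair pays out $W and $(1 - W), summing
# back to the $1 paid. Stocks conditional on decisions not taken are assumed
# void here — an illustrative assumption, not stated in the comment.
N = 3                 # number of possible decisions D_1..D_N
cost_to_mint = 1.0    # dollars paid to create the 2N conditional stocks

def payout_if_held(implemented, measured_welfare):
    """Total payout for a holder who kept all 2N stocks they minted."""
    total = 0.0
    for i in range(1, N + 1):
        if i == implemented:                     # only the |D_n stocks pay out
            total += measured_welfare            # the W|D_n stock
            total += 1.0 - measured_welfare      # the (1-W)|D_n stock
        # stocks conditional on decisions not taken: assumed worth $0
    return total

W = 0.7   # illustrative measured welfare in [0, 1]
print(payout_if_held(implemented=2, measured_welfare=W))  # 1.0 == cost_to_mint
```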