Master's student in applied mathematics, funded by Center on Long-Term Risk to investigate the cheating problem in safe pareto-improvements. Agent foundations fellow with @Alex_Altair.
Some other areas I'm interested in:
Noted, that does seem a lot more tractable than using natural latents to pin down details of CEV by itself
natural latents are about whether the AI's cognition routes through the same concepts that humans use.
We can imagine the AI maintaining predictive accuracy about humans without using the same human concepts. For example, it can use low-level physics to simulate the environment, which would be predictively accurate, but that cognition doesn't make use of the concept "strawberry" (in principle, we can still "single out" the concept of "strawberry" within it, but that information comes mostly from us, not from the physics simulation)
Natural latents are equivalent up to isomorphism (ie two latent variables are equivalent iff they give the same conditional probabilities on observables), but for reflective aspects of human cognition, it's unclear whether that equivalence class pin down all information we care about for CEV (there may be differences within the equivalence class that we care about), in a way that generalizes far out of distribution
I think the fact that natural latents are much lower dimensional than all of physics makes it suitable for specifying the pointer to CEV as an equivalence class over physical processes (many quantum field configurations can correspond to the same human, and we want to ignore differences within that equivalence class).
IMO the main bottleneck is to account for the reflective aspects in CEV, because one constraint of natural latents is that it should be redundantly represented in the environment.
like infinite state Turing machines, or something like this:
Interesting, I'll check it out!
Then we've converged almost completely, thanks for the conversation.
Thanks! I enjoyed the conversation too.
So you're saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
yes, I think inner alignment is basically solved conditional on GPS working, for capabilities I think we still need some properties of the world model in addition to GPS.
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more,
I agree though if we had a reliable way to do cross the formal-informal bridge, it would be very helpful, I was just making a point about how pervasive the verification/generation gap is.
Agreed.
My main thoughts on infrabayesianism is that while it definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I'm not that comfortable with using infrabayesianism, even if it actually worked.
I also don't believe it's necessary for alignment/uncertainty either.
yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don't think it's necessary for alignment (though I think some of their results or analogies of their results would show up in a full solution to alignment).
I wasn't totally thinking of simulated reflection, but rather automated interpretability/alignment research.
I intended "simulated reflection" to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.
Yeah, a big thing I admit to assuming is that I'm assuming that the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes, but I want to see your solution first, because I still think this research could well be useful.
Thanks!
This is what I was trying to say, that the tradeoff is in certain applications like automating AI interpretability/alignment research is not that harsh, and I was saying that a lot of the methods that make personal intent/instruction following AGIs feasible allow you to extract optimization that is hard and safe enough to use iterative methods to solve the problem.
Agreed
People at OpenAI are absolutely trying to integrate search into LLMs, see this example where they got the Q* algorithm that aced a math test:
Also, I don't buy that it was refuted, based on this, which sounds like a refutation but isn't actually a refutation, and they never directly deny it:
Interesting, I do expect GPS to be the main bottleneck for both capabilities and inner alignment
it's generally much easier to verify that something has been done correctly than actually executing the plan yourself
Agreed, but I think the main bottleneck is crossing the formal-informal bridge, so it's much harder to come up with a specification such that but once we have such a specification it'll be much easier to come up with an implementation (likely with the help of AI)
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
Yes, I think optimizing worst-case performance is one crucial part of alignment, it's also one
advantage of infrabayesianism
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is probably due to being more optimistic about the meta-plan for alignment research, due to both my models of how research progress works, plus believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction following AGIs/ASIs.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we'd get to control the optimization target for automating interpretability without worrying about unintended optimization).
Re the issue of Goodhart failures, maybe a kind of crux is how much do we expect General Purpose Search to be aimable by humans
I also expect general purpose search to be aimable, in fact, it’s selected to be aimable so that the AI can recursively retarget GPS on instrumental subgoals
which means we can put a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
I think there’s a fundamental tradeoff between optimization pressure & correctability, because if we apply a lot of optimization pressure on the wrong goals, the AI will prevent us from correcting it, and if the goals are adequate we won’t need to correct them. Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman)
Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:
Yes, I consider this to be the central crux.
I think current models lack certain features which prevent the generalization of their capabilities, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs
I also think an adequate optimization target about the physical world is much more complex than a reward model for LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing
We can also get more optimization if we have better tools to aim General Purpose Search more so that we can correct the model if it goes wrong.
Yes, I think having an aimable general purpose search module is the most important bottleneck for solving inner alignment
I think things can still go wrong if we apply too much optimization pressure to an inadequate optimization target because we won’t have a chance to correct the AI if it doesn’t want us to (I think adding corrigibility is a form of reducing optimization pressure, but it's still desirable).
My mental model of this is something like: My concept of a squirgle is a function f(x) which maps latent variables x to observations such that likelier observations correspond to latent variables with lower description length.
Suppose that we currently settle on a particular latent variable x0, but we receive new observations that are incompatible with f(x0), and these new observations can be most easily accounted for by modifying x0 to a new latent variable x1 that's pretty close to x0, then we say that this change is still about squirgle
But if we receive new observations that can be more easily accounted for by perturbing a different latent variable y that corresponds to another concept g(y) (eg about TV), then that is a change about a different thing and not the squirgle
The main property that enables this kind of separation is modularity of the world model, because when most components are independent of most other components at any given time, only a change in a few latent variables (as opposed to most latent variables) is required to accomodate new beliefs, & that allows us to attribute changes in beliefs into changes about disentangled concepts