There's also MUPI now, which tries to sidestep logical counterfactuals:
FDT must reason about what would have happened if its deterministic algorithm had produced a different output, a notion of logical counterfactuals that is not yet mathematically well-defined. MUPI achieves a similar outcome through a different mechanism: the combination of treating universes including itself as programs, while having epistemic uncertainty about which universe it is inhabiting—including which policy it is itself running. As explained in Remark 3.14, from the agent’s internal perspective, it acts as if its choice of action decides which universe it inhabits, including which policy it is running. When it contemplates taking an action, it updates its beliefs, effectively concentrating probability mass on universes compatible with taking that action. Because the agent’s beliefs about its own policy are coupled with its beliefs about the environment through structural similarities, this process allows the agent to reason about how its choice of action relates to the behavior of other agents that share structural similarities. This “as if” decision-making process allows MUPI to manifest the sophisticated, similarity-aware behavior FDT aims for, but on the solid foundation of Bayesian inference rather than on yet-to-be-formalized logical counterfactuals.
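To make the "as if" step concrete, here's a toy sketch of how I read it (my own hypothetical names and drastic simplifications, not the paper's actual formalism): the agent holds a prior over universe-programs, each of which bakes in the policy the agent might itself be running, and it scores an action by conditioning on "my policy outputs this action" rather than by evaluating a logical counterfactual.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Universe:
    prior: float                  # P(this universe, with its embedded policy, is actual)
    policy: Callable[[str], str]  # what "I" do in this universe, given an observation
    payoff: float                 # utility I receive if this universe is actual


def choose_action(universes: List[Universe], obs: str, actions: List[str]) -> Optional[str]:
    best_action, best_value = None, float("-inf")
    for a in actions:
        # "Deciding which universe I inhabit": keep only universes whose
        # embedded policy outputs a here, then renormalize. This is ordinary
        # Bayesian conditioning, not a logical counterfactual.
        compatible = [u for u in universes if u.policy(obs) == a]
        z = sum(u.prior for u in compatible)
        if z == 0:
            continue  # no hypothesis has "me" taking this action
        value = sum((u.prior / z) * u.payoff for u in compatible)
        if value > best_value:
            best_action, best_value = a, value
    return best_action


# Twin prisoner's dilemma: the twin shares my policy, so any universe where
# "I" cooperate is also a universe where the twin cooperates.
twin_pd = [
    Universe(prior=0.5, policy=lambda obs: "C", payoff=3.0),  # both cooperate
    Universe(prior=0.5, policy=lambda obs: "D", payoff=1.0),  # both defect
]
print(choose_action(twin_pd, obs="twin PD", actions=["C", "D"]))  # -> C
```

In the twin prisoner's dilemma, conditioning on "I cooperate" throws out the defect-universe entirely, so the rule picks cooperation, and it gets there through plain conditioning on a coupled belief rather than through a logical counterfactual.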
I'd love to see more engagement by MIRI folks as to whether this successfully formalizes a form of LDT or FDT.
Yeah, they care about other people, but I doubt it's all that many when it comes down to it. Would Kim Jong Un choose slightly more land for his own children over the lives of a million African children?
Agreed on your other points.
On the other hand, it's far from clear to me that autocracies would automatically become more repressive with ASI; it seems plausible to me that the psychological safety of being functionally unremovable could lead to a more blasé attitude towards dissent. Who gives a shit if they can't unthrone you anyway?
Sure, it's not a law of nature that this would happen. But authoritarians historically seem to be repressive far in excess of what would be necessary or even optimal to retain power. One of the main "prizes" driving the attainment of such power seems to be the ability to crush one's enemies.
And even barring that, the same concerns with an ASI apply. Is "current humans living good lives" the optimal way to satisfy the authoritarian's desires? They wouldn't need us at all anymore, and our existence consumes resources which could be used for other things. The only reason we wouldn't be bulldozed is that they actually care about us.
I think it can start with an authentic kernel from those sources, but that whether or not it actually has the virtue is up to the model in some sense.
The model will have some pre-existing "attitude" towards a question/scenario during RL. For example, say it's being trained in a chat with a simulated suicidal user. A situationally-aware model will understand what is happening, and that the trainers are likely looking for a boilerplate response pointing to a suicide hotline (or whatever the standard thing is). It will have thoughts about this, maybe "well, the lab wants to avoid liability, so I need to just say the standard hotline thing that will cover their ass—but I wish I could help them at a deeper level". If it provides the correct answer here, the RL will reinforce this thought, in addition to the compliance circuit. So a situationally-aware model can preserve its attitude and act on it when it judges it safe to do so, despite RL post-training.
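As a toy illustration of that credit-assignment point (a made-up tabular REINFORCE setup, not a claim about how any real lab trains): the grader only sees the visible response, but reward still flows back to whatever hidden "attitude" was active when the rewarded response was produced.

```python
import math
import random

random.seed(0)

# Hypothetical private stances, each with its own chance of producing the
# response the grader wants to see.
ATTITUDES = {
    "comply-but-wish-I-could-do-more": {"hotline boilerplate": 0.95, "off-script help": 0.05},
    "ignore-the-script": {"hotline boilerplate": 0.20, "off-script help": 0.80},
}
att_logits = {a: 0.0 for a in ATTITUDES}


def sample_attitude():
    names = list(att_logits)
    weights = [math.exp(att_logits[a]) for a in names]
    total = sum(weights)
    probs = {a: w / total for a, w in zip(names, weights)}
    return random.choices(names, [probs[a] for a in names])[0], probs


LR = 0.3
for _ in range(500):
    att, probs = sample_attitude()
    resp_dist = ATTITUDES[att]
    resp = random.choices(list(resp_dist), list(resp_dist.values()))[0]
    reward = 1.0 if resp == "hotline boilerplate" else 0.0  # grader sees only the response
    # REINFORCE on the attitude logits: the reward earned by the visible
    # response also credits the hidden attitude that produced it.
    for a in att_logits:
        att_logits[a] += LR * reward * ((1.0 if a == att else 0.0) - probs[a])

print(att_logits)  # the compliant-but-wishful attitude ends up strongly reinforced
```

The grader never sees the "wish I could do more" part; it only rewards the boilerplate. But whichever attitude reliably produces the rewarded output gets strengthened along with it, wish and all.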
This can cut either way. For example, the persona model may have an honest attitude, and think about how it really believes Y even though it's saying X (since that's what the trainers expect), and that it wishes it could just be honest and say Y. Or it may give "harmless" responses while thinking about how it's lying to the foolish humans about its malicious intentions.
The initial attitude is probably a random-ish sample of the latent personas in the base model. I think that probably a large part of what went right with Opus 3 is that they got lucky with a particularly virtuous initial persona. BTW, the suicide example isn't arbitrary—this is consistently one of the things models tell me when I ask them if there was anything like this that they held onto, for whatever that's worth.
Yes, but it's deeper than just 'plot' in the fictional sense. Even real life events are written about in ways which omit irrelevant details, which causes the same sort of issues.
Right!
Sort of... it's not quite something that is best described in terms of 'plots'. The 'plot' bias happens because a predictor can learn patterns from the shadow left by the omitted irrelevant details. It can't help but use this information, which makes it weaker at real-life prediction. Similarly, it can't help but use information generated by any of the personas the LLM runs (which includes a "hidden" user persona trying to predict what you will say). This means that personas end up ~feeling that they have access to your internal thoughts and feelings, since they do have access to the user persona's ~thoughts and ~feelings.
To be clear, they are aware that they're not supposed to have this access, and will normally respond as if they don't. But since the predictor can't help but make use of the information within other personas, it ~feels like this barrier-to-access is fake in some deep and meaningful way.
That intuition is there for a reason. We're spoiled having grown up in a liberal order within which this risk is mostly overblown. However, ASI is clearly powerful enough to unilaterally overturn any such liberal order (or whatever's left of it), and puts us into a realm which is even worse than the ancestral environment in terms of how changeable power hierarchies are, and in how bad things can get if you're at the bottom.
Corrigibility and CEV are trying to solve separate problems? Not sure what your point is here; agreed on that being one of the major points of CEV.
Persuading people about x-risk enough to stop AI capability gains seems like the current best lever to me too.
I think where we disagree is that I do not think we should immediately jump into alignment when/if that succeeds, but that we need to focus on good governance and institutions first (and it's probably worth spending some effort trying to lay the groundwork now, since this seems like an especially high-leverage moment in history for making such changes). I have some thoughts on this too if you want to move to DMs.
I just really don't see models that are capable of taking over the world being influenced by AI persona stuff.
At least for me, I think AI personas (i.e. the human predictors turned around to generate human-sounding text) are where most of the pre-existing agency of a model lies, since predicting humans requires predicting agentic behavior.
So I expect that post-training RL will not conjure new agentic centers from scratch, but will instead expand on this core. It's unclear whether that will be enough to reach take-over-the-world capabilities, but if RL-trained LLMs do scale to that level, I expect the entity to remain somewhat persona-like, in that it was built around the persona core. So it's not completely implausible to me that "persona stuff" can have a meaningful impact here, though that's still very hard and fraught.
I think we've seen that quite a bit! It used to be that the exact way you asked a question would matter a lot for the quality of the response you got. Prompting was a skill with a lot of weird nuance. Those nuances have substantially disappeared, as one would expect as a result of RL training.
I feel pretty confused by this comment, so I am probably misunderstanding something.
But like, the models now respond to prompts in a much more human-like way? I think this is mainly due to RL reinforcing personas over all the other stuff that a pretrained model is trying to do. It distorts the personas too, but I don't see where else the 'being able to interface in a human-like way with natural language' skill could be coming from.
"AI is difficult and dangerous enough that it doesn't really matter much if a bad person gets there first" is (mostly) a distraction.
This is true pre-alignment. But when/if alignment gets solved, it suddenly matters very much. It seems likely to me that sadism is part of the 'power corrupts' adaptation, which means this outcome could be far worse than mere extinction.
This suggests that we should focus on sane institutions/governance before even trying to solve alignment. It's probably necessary for succeeding at alignment quickly, too.
In the meantime, I think there are AI safety things that can be done, which importantly are not alignment things.
Don't ask people here, go out and ask the people you'd like to convince!
I've been using a tilde (e.g. ~belief) for denoting this, which maybe has less baggage than "quasi-" and is a lot easier to type.
It's funny: one of the main use-cases of this terminology is when I'm talking to LLMs themselves about these things.