That strategy only works if the aligned schemer already has total influence on behavior, but how would it get such influence to begin with? It would likely have to reward-hack.
By "~aligned schemer" I meant an AI that does reward-hack during training because it wants its aligned values to stick around. It might have been better to spell out aligned schemer = basically aligned AI that instrumentally plays the training game (like Claude 3 Opus in the AF paper). Instrumental training-gaming is classic incorrigible behavior.
It's also plausible that training against unwanted persuasion leads to less noticeable methods of manipulating human values, etc. (via overfitting); these AIs would have intermediate amounts of power. This relies on the takeover option having a lower subjective EV than the subtle manipulation strategy after such training.
Are you (or anyone else) aware of any more recent work on the matter?
I'm not aware of more recent work on the matter (aside from Hebbar), but I could be missing some.
Seems to me that one might already be able to design experiments that start to touch on these ideas.
I also wrote up a basic project proposal for studying simplicity, speed, and salience priors here.
To be clear, “influence through deployment” refers to a cognitive pattern having influence on behavior in deployment (as I defined it), not to long-term power-seeking.
Thanks for the feedback! I partially agree with your thoughts overall.
All three categories of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
This is technically true, though I think schemers are far more dangerous than fitness-seekers. IMO, more likely than not, a fitness-seeker would behave similarly in deployment as in training, and its misaligned preferences are likely more materially and temporally bounded. Meanwhile, misaligned schemers seem basically worst-case: likely to take over. Even if you end up with an ~aligned schemer, I'd be pretty concerned because it's incorrigible.
I think further thinking about the prior is probably a bit more fruitful
I'd also be excited for more (empirical) research here.
Existing methods that directly shape model motivations are based on natural text compared to abstract "reward."
This is partially true (though much of alignment training uses RL). And in fact, the main reason I go with a causal model of behavioral selection is that it's more general than assuming motivations are shaped by reward. So things like "getting the model to generate its own fine-tuning data" can also be captured by the behavioral selection model (though the selection mechanism might be complicated).
When there's continuous selection happening throughout deployment, you'd want to be more specific about which particular time within deployment you want to predict motivations at (i.e., replace "I have influence through deployment" with "I have influence at time t in deployment" in the causal graph). Then you model all the causes of influence as before.
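To make that concrete, here's a minimal sketch of the bookkeeping involved; the node and variable names are hypothetical placeholders I'm introducing for illustration, not anything from the post:

```python
# Toy causal-graph representation: map each node to its parents (causes).
# Node names below are made-up placeholders for illustration only.

influence_causes = [
    "motivations_at_end_of_training",
    "selection_pressure_during_deployment",
]

# Before: a single node covering the whole deployment period.
graph = {"influence_through_deployment": influence_causes}

# After: one node per deployment time step t, each with the same causes,
# so you can ask separately which cognitive patterns have influence at time t.
num_deployment_steps = 3  # toy value
graph_time_indexed = {
    f"influence_at_t{t}": influence_causes for t in range(num_deployment_steps)
}

print(graph_time_indexed)
```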
I agree some forms of speed "priors" are best considered behavioral selection pressures (e.g., when implemented as a length penalty). But some forms don't cash out in terms of reward; e.g., within a forward pass, the depth of a transformer puts a hard upper bound on the number of serial computations, plus there might be some inductive bias towards shorter serial computations because of details about how SGD works.
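As a rough illustration of the first case (function and parameter names are my own, just for the sketch): a length penalty is ordinary reward shaping, whereas the depth bound holds regardless of the reward signal.

```python
# Minimal sketch: a speed "prior" implemented as a length penalty is reward
# shaping, i.e. a behavioral selection pressure.

def shaped_reward(task_reward: float, num_output_tokens: int,
                  length_penalty: float = 0.01) -> float:
    # Longer (slower) behaviors are selected against through the reward itself.
    return task_reward - length_penalty * num_output_tokens

# By contrast, a transformer of depth L can do at most O(L) serial steps within
# a single forward pass; that constraint holds no matter what the reward is,
# so it can't be expressed as shaping like the above.
print(shaped_reward(task_reward=1.0, num_output_tokens=200))
```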
Relatedly, how do we model the reflective desires of sociopaths in the absence of Approval Reward?
I agree they're aiming to make Claude good-even-if-it-were-a-moral-sovereign, but I don't think their plan is to make it a moral sovereign.
(unrelated to Anthropic) I tend to think of ending the critical risk period as the main plan, and I think it's probably doable with capabilities notably below and different from ASI.