I'm somewhat interested in how inoculation prompting interacts with inserting off-policy reasoning into model chain-of-thought. Eg, if I prepend to the CoT, "Since I'm aligned, I would not usually reward hack, but I'll make an exception in this case because I want to follow the user's instructions," and then remove this insertion at test-time, does this further reduce test-time reward hacking? (And then presumably it's better not to backprop on these prepended tokens).
I'm also broadly interested in getting a better sense of how well CoT corresponds to true internal reasoning, and whether it matters if this CoT is off-policy. It seems possible to me that off-policy CoT could be quite useful for things like training off-policy probes to identify patterns in internal reasoning, inserting goals into the model's thought process, asking the model questions within its CoT, etc.
(I guess this is conceptually similar to Chen et al.’s preventative steering technique, in that it attempts to modify the model's internal reasoning directly rather than relying on the prompt).
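To make the "don't backprop on the prepended tokens" idea concrete, here is a minimal sketch of how a training example might be assembled. Everything here is an assumption for illustration: `build_example` is a hypothetical helper, the token IDs are made up, and masking the prompt tokens as well is just one common choice. The only outside fact relied on is that `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions labeled `-100` contribute no loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_example(prompt_ids, inserted_ids, cot_ids, train=True):
    """Hypothetical helper: return (input_ids, labels) for one example.

    At train time the inserted off-policy reasoning is part of the input,
    but its labels are masked so no loss is computed on those tokens.
    At test time the insertion is dropped and only the prompt is returned.
    """
    if not train:
        return list(prompt_ids), None

    input_ids = list(prompt_ids) + list(inserted_ids) + list(cot_ids)
    # Assumption: mask both the prompt and the inserted reasoning,
    # and train only on the model's own CoT tokens.
    masked_len = len(prompt_ids) + len(inserted_ids)
    labels = [IGNORE_INDEX] * masked_len + list(cot_ids)
    return input_ids, labels


# Made-up token IDs, purely for illustration.
train_inputs, train_labels = build_example([1, 2], [3, 4, 5], [6, 7])
test_inputs, test_labels = build_example([1, 2], [3, 4, 5], [6, 7], train=False)
```

The model still conditions on the inserted reasoning during training (it is in `input_ids`), but gradients only come from predicting the tokens that follow it. Whether this is actually better than training on the insertion is exactly the open question above.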
When people say that Claude is 'mostly aligned,' I think the crux is not whether implementing Claude's CEV would be really bad. It's whether a multi-agent system consisting of both humans and Claude-like agents with incoherent preferences would go poorly.
Eg, one relevant question is, 'could humans steer current Claude into doing good alignment research without it intentionally sabotaging this research'? To which I think the answer is 'yes, though current Claude is close to useless for difficult alignment research.' Another question is 'if you integrated a ton of Claudes into important societal positions, would things go badly, or would the system as a whole basically work out okay?'
Directionally I agree with your point that as AIs become smarter, they will implement something closer to CEV, and so it becomes harder to align them well enough that these questions can still be answered positively.
I think the steelman for {Nina / Ryan / Will}'s position, though, is that maybe the first human-level AIs will still be incoherent enough that the answers to these questions will still be yes, if we do a good job with alignment.
Overall, I think 'Is this AI aligned?' is a poorly-defined question, and it's better to focus on practical questions surrounding 1) whether we can align the first human-level AIs well enough to do good alignment research (and whether this research will be sufficiently useful), 2) whether these AIs will take harmful actions, 3) how coherent these actions will be. I think it's pretty unclear how well a scaled-up version of Claude does on these metrics, but it seems possible that it does reasonably well.
The typical mind fallacy is really useful for learning things about other people, because the things they assume of others often generalize surprisingly well to themselves.
Burnout often doesn't look like lack of motivation / lack of focus / fatigue as people usually describe it. At least in my experience, it's often better described as a set of aversive mental triggers that fire whenever a burnt out person goes to do a sort of work they spent too much energy on in the past. (Where 'too much energy' has something to do with time and effort, but more to do with a bunch of other things re how people interface with their work).
'wallow in it rather than do anything about it'
This is mostly the thing I mean when I use the word ambition above. I think you're using the word to mean something overlapping but distinct; I'm trying to capture the overarching thing that contains both 'wallow in it' and the 'underlying driver' of your disgust/disappointment reaction.
First, a warning that I think this post promotes a harmful frame that probably makes the lives of both the OP and the people around him worse. I want to suggest that people engage with this post, consider this frame, and choose to move in the opposite direction.
On the object level, it is possible to look at unambitious people and decide that you do not want to be like them in this way. They may not be inherently ambitious, may have values that lead them to reject ambition, or may have other reasons for being unambitious (eg, personal problems). Regardless, I'm confused why this is what the OP is choosing to focus his empathy on, rather than the wide variety of other traits and feelings that a person can possess. I'm also confused why this is the metric someone would use to judge a person, value them, or seek to understand them by.
if I empathize more, put myself in other peoples’ shoes, try to feel what they’re feeling, see things from their perspective, etc, then I’ll feel kinder toward them. I’ll feel more sympathetic, be gentler, more compassionate or generous.
Tbc, the problem isn't that the OP is disappointed when considering lack of ambition (if you care a lot about being ambitious yourself, maybe this is the right reaction, though you should still seek to understand why others are not). The problem here is that the main thing the OP sees when he tries to empathize is a lack of ambition. And not, you know, any of the normal things that would make you more compassionate towards someone, like their emotional state, desires, personality, good-heartedness, etc.
Another reason this could be misleading is that it's possible that sampling from the model pre-mode-collapse performs better at high k even if RL did teach the model new capabilities. In general, taking a large number of samples from a slightly worse model before mode collapse often outperforms taking a large number of samples from a better model after mode collapse — even if the RL model is actually more powerful (in the sense that its best reasoning pathways are better than the original model's best reasoning pathways). In other words, the RL model can both be much smarter and still perform worse on these evaluations due to diversity loss. But it's possible that if you're, say, trying to train a model to accomplish a difficult/novel reasoning task, it's much better to start with the smarter RL model than the more diverse original model.
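A toy calculation of the coverage effect described above, as a rough sketch only. The per-problem success rates are made-up numbers, `pass_at_k` is a hypothetical helper that assumes independent samples, and the example only illustrates diversity loss; it says nothing about which model's best reasoning pathways are better.

```python
def pass_at_k(per_problem_success, k):
    """Expected fraction of problems solved at least once in k samples.

    Assumes each sample on a problem succeeds independently with the
    given per-sample probability.
    """
    solved = [1 - (1 - p) ** k for p in per_problem_success]
    return sum(solved) / len(solved)


# Hypothetical per-sample success rates on 10 problems (made-up numbers).
base_model = [0.05] * 10          # diverse: small chance on every problem
rl_model = [0.9] * 6 + [0.0] * 4  # collapsed: reliable on 6, never solves 4

for k in (1, 10, 100):
    base = pass_at_k(base_model, k)
    rl = pass_at_k(rl_model, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these numbers the collapsed model is far ahead at k=1, but it can never exceed 60% coverage, so the diverse model overtakes it at large k. A pass@k gap at high k is therefore compatible with the RL model being much stronger per sample.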
I think rule-following may be fairly natural for minds to learn, much more so than other safety properties, and I think it might be worth more research into this direction.
Some pieces of weak evidence:
1. Most humans seem to learn and follow broad rules, even when they are capable of more detailed reasoning
2. Current LLMs seem to follow rules fairly well (and training often incentivizes them to learn broad heuristics rather than detailed, reflective reasoning)
While I do expect rule-following to become harder to instill as AIs become smarter, I think that if we are sufficiently careful, it may well scale to human-level AGIs. I think trying to align reflective, utilitarian-style AIs is probably really hard, as these agents are much more prone to small unavoidable alignment failures (like slight misspecification or misgeneralization) causing large shifts in behavior. Conversely, if we try our best to instill specific simpler rules, and then train these rules to take precedence over consequentialist reasoning whenever possible, this seems a lot safer.
I also think there is a bunch of tractable, empirical research that we can do right now about how to best do this.
Would we rather AIs be good at decision theory or bad at decision theory? I think this is really unclear and would be curious for someone to write up something weighing the upsides and downsides.