Trying to write a new steelman of Matt's view. It's probably incorrect, but seems good to post as a measure of progress:
You believe in agentic capabilities generalizing, but also in additional high-level patterns that generalize and often overpower agentic behaviour. You expect training to learn all the algorithms required for intelligence, but also pick up patterns in the data like "research style", maybe "personality", maybe "things a person wouldn't do" and also build those into the various-algorithms-that-add-up-to-intelligence at a deep level. In particular, these patterns might capture something like "unwillingness to commandeer some extra compute" even though it's easy and important and hasn't been explicitly trained against. These higher level patterns influence generalization more than agentic patterns do, even though this reduces capability a bit.
One component of your model that reinforces this: Realistic intelligence algorithms rely heavily on something like caching training data and this has strong implications about how we should expect them to generalize. This gives an inductive-bias advantage to the patterns you mention, and a disadvantage to think-it-through-properly algorithms (like brute force search, or even human-like thinking).
We didn't quite get to talking about reflection, but this is probably the biggest hurdle in the way of getting such properties to stick around. I'll guess at your response: You think that an intelligence that doesn't-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed with some small-ish capability penalty, but massive benefits for safety.
Maybe I'm just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I've had in the back of my mind for a while now.
> You think that an intelligence that doesn't-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed with some small-ish capability penalty, but massive benefits for safety.
In particular, this reads to me like the "unstable alignment" paradigm I wrote about a while ago.
You have an agent which is consequentialist enough to be useful, but not so consequentialist that it'll do things like spontaneously notice conflicts in the set of corrigible behaviors you've asked it to adhere to and undertake drastic value reflection to resolve those conflicts. You might hope to hit this sweet spot by default, because humans are in a similar sort of sweet spot. It's possible to get humans to do things they massively regret upon reflection, as long as their day-to-day work can be done without attending to obvious clues (e.g. a guy who's an accountant for the Nazis for 40 years and doesn't think about the Holocaust; he just thinks about accounting). Or you might try to steer towards this sweet spot by developing ways to block reflection in cases where it's dangerous without interfering with it in cases where it's essential for capabilities.
I was probably influenced by your ideas! I just (re?)read your post on the topic.
Tbh I think it's unlikely such a sweet spot exists, and I find your example unconvincing. The value of this kind of reflection for difficult problem solving directly conflicts with the "useful" assumption.
I'd be more convinced if you described a task where you expect an AI to be useful (significantly above current humans) and which doesn't involve failing and reevaluating high-level strategy every now and then.
I agree that I wouldn't want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher level reflection if you can do it without damaging lower level reflection. I don't think that requires a task where the AI doesn't try and fail and re-evaluate. It just requires that the re-evaluation never climbs above a certain level in the stack.
There's such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn't seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.
I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level. You can't expect SGD to implement the patch in a robust way. The patch would need to still be working after 6 months on an impossible problem, in spite of it actively getting in the way of finding the solution!
My two cents:
- The system has a fixed goal that it capably works towards across all contexts.
- The system is able to capably work towards goals, but which it does, if any, may depend on the context.
Given the two points above, it would be good to define/clarify what exactly you mean by "goals". I can see two definitions: 1. goals as in a loss function or objective that the algorithm is optimizing towards, and 2. task-specific goals like summarizing an article or planning. There may be other kinds of goals that I am unaware of, or this may be obvious from some context elsewhere that I am not aware of. (From the shortform in the context shared, it seems to be 1, but I have a vague feeling that 2 may not be aligned on this.)
For the dQw4w9WgXcQ example in your initial operationalization, where you were wondering whether it always generates Q: it just depends on the frequency. A good paper on how often data appears in training and the resulting rate of memorization is https://arxiv.org/pdf/2202.07646, in case you were wondering whether it always does (given the same context as in the training data, not a different context/instruction).
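To make the frequency point concrete, here is a toy sketch. It is a count-based next-character predictor, not a neural language model, and it is not the method from the linked paper. The corpora, the made-up variant ending in "Z", and the helper names (`train_counts`, `greedy_next`, `make_corpus`) are all assumptions for illustration. It only shows that, under greedy decoding, whether "Q" comes out after the prefix depends on how often that continuation appeared relative to alternatives.

```python
from collections import Counter, defaultdict

def train_counts(corpus, context_len):
    """Count which character follows each context of length `context_len`."""
    counts = defaultdict(Counter)
    for doc in corpus:
        for i in range(context_len, len(doc)):
            counts[doc[i - context_len:i]][doc[i]] += 1
    return counts

def greedy_next(counts, context):
    """Return the most frequent continuation of `context`, or None if unseen."""
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]

# The ID from the example, minus its final character.
PREFIX = "dQw4w9WgXc"

def make_corpus(n_true, n_other):
    """Hypothetical corpus: the real ID `n_true` times, a made-up
    variant ending in "Z" `n_other` times."""
    return [PREFIX + "Q"] * n_true + [PREFIX + "Z"] * n_other

mostly_true = train_counts(make_corpus(9, 1), len(PREFIX))
mostly_other = train_counts(make_corpus(1, 9), len(PREFIX))

print(greedy_next(mostly_true, PREFIX))   # continuation when "Q" dominates
print(greedy_next(mostly_other, PREFIX))  # continuation when "Z" dominates
```

A real model conditions on far more than a fixed-length prefix, so this says nothing about what happens under a different context or instruction. It only illustrates the same-context case.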
This dialogue is still in progress, but due to other commitments we don't have much time to continue it. We think the content is interesting, so we decided to publish it unfinished. We may slowly continue adding to it in the future, but can't commit to doing so.