I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.
I do alignment research, mostly in the general vicinity of agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI. Most of my writing before mid-2023 is not representative of my current views on alignment difficulty.
Dialogues are more difficult to create (at least when done well, between people with genuinely different beliefs), and they're less pleasant to read, but they're often higher value for reaching true beliefs as a group.
Dialogues seem under-incentivised relative to comments, given the amount of effort involved. Maybe they would get more karma if we could vote on individual replies, so it's more like a comment chain?
This would also make dialogues easier to skim, because you could jump to the best parts to see whether the whole thing is worth reading.
The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X, Y, Z components of the algorithm; X passes (e.g.) beliefs to Y in format b; Z can be viewed as a function that takes information in format w and links it with... etc. And infra-Bayesianism might be the theory you use to explain what some of the internal data structures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and from mainstream AI and ML theory) is potentially relevant, but it's not super clear which bits are most likely to be directly useful.
This seems like the ultimate goal of interp research (and it's a good goal). Similarly, I think the current story for heuristic arguments is to use them to "explain" a trained neural network by breaking it down into something more like an X, Y, Z components explanation.
At this point, we can analyse the overall AI algorithm, and understand what happens when it radically updates its beliefs, or how its goals are stored and whether they ever change. And we can try to work out whether an AI with that particular structure would change itself in ways that are bad for us, if it could self-modify. This is where the work looks much more like theoretical analysis of algorithms.
(The above is the "understood" end of the axis. The "not-understood" end looks like making an AI with pure evolution, with no understanding of how it works. There are many levels of partial understanding in between).
This kind of understanding is a prerequisite for the scheme in my post. This scheme could be implemented by modifying a well-understood AI.
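As a purely illustrative toy sketch (all the component names, formats, and wiring below are made up for illustration, not a claim about what an actually-understood AI would look like), the kind of algorithmic-level description I have in mind is something like:

```python
from dataclasses import dataclass

# Toy stand-ins for the "formats" that components pass between each other.
@dataclass
class Beliefs:            # "format b": e.g. a distribution over world-states
    world_model: dict

@dataclass
class GoalInfo:           # "format w": e.g. a ranking over predicted outcomes
    ranking: list

class X:
    """Belief-forming component. Its internal data structures are the kind of
    thing a theory like infra-Bayesianism might tell us how to interpret."""
    def update(self, observation) -> Beliefs: ...

class Z:
    """Goal-handling component: stores goals and emits them in format w."""
    def current_goals(self) -> GoalInfo: ...

class Y:
    """Planning component: consumes beliefs (format b) and goal info (format w).
    Some subcomponent here might be the thing heuristic arguments explain."""
    def plan(self, beliefs: Beliefs, goals: GoalInfo): ...
```

The point is only that every arrow between components has a known meaning, so questions like "what happens when it radically updates its beliefs?" or "where are its goals stored, and do they change?" become questions about specific, identifiable parts of the algorithm.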
Also what is its relation to natural language?
Not sure what you're getting at here.
Fair enough, good points. I guess I classify these LLM agents as "something-like-an-LLM that is genuinely creative", at least to some extent.
Although I don't think the first example is great; it seems more like a capability/observation-bandwidth issue.
I'm not sure how this is different from the solution I describe in the latter half of the post.
Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won't create much noise. But I also actually want the coffee-making to not be particularly noisy, and if it turns out that the first plan for making coffee also creates a lot of noise as a side effect, that's a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).
Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can't remember whether he proposed any way to make that reflectively stable though.
From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents?
LLMs in their current form don't really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization toward "normality" (and also kinda quantilize by default). So maybe yeah, I think I agree with your statement in the sense I think you intended it, as applied to current technology. But it's not clear to me that this remains true if we made something-like-an-LLM that is genuinely creative (in the sense of being capable of finding genuinely-out-of-the-box plans that achieve a particular outcome). It depends on how exactly it implements its regularization/redundancy/quantilization and whether that implementation holds up for the particular OOD tasks we use it for.
Ultimately I don't think LLM-ish vs RL-ish will be the main alignment-relevant axis. RL-trained agents will also understand natural language, and contain natural-language-relevant algorithms. Better to focus on understood vs not-understood.
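To gesture at what I mean by "kinda quantilize by default": a quantilizer, instead of taking the argmax plan under a proxy objective, samples from a base distribution of "normal" plans restricted to the top q fraction of base-distribution mass by proxy utility. A minimal sketch (the plan set, base weights, and utility function would all have to come from somewhere; this is just the shape of the idea):

```python
import random

def quantilize(plans, base_weight, utility, q=0.1):
    """Sample a plan from the base distribution, restricted to the set of
    plans whose proxy utility lies in the top q of base-distribution mass.
    Unlike argmax, this avoids the extreme tail where a slightly-wrong
    proxy utility diverges most from what we actually want."""
    ranked = sorted(plans, key=utility, reverse=True)
    total_mass = sum(base_weight(p) for p in plans)
    top, mass = [], 0.0
    for p in ranked:                 # accumulate base mass from the best plans down
        top.append(p)
        mass += base_weight(p)
        if mass >= q * total_mass:
            break
    return random.choices(top, weights=[base_weight(p) for p in top], k=1)[0]
```

Edge Instantiation lives in the argmax tail, so whether this kind of regularization toward normality keeps helping depends on whether the base distribution still covers the out-of-distribution tasks we actually want creative plans for.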
Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect it using information about our actual goal.
So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it contained extra constraints beyond what we actually want, then yeah, that would create an alignment tax.
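As a rough sketch of the structural difference I mean (the `predict_consequences` helper and the property predicates are hypothetical stand-ins, not the actual machinery from the post):

```python
def passes_validation(plan, predict_consequences, validation_properties):
    """Accept a plan only if its predicted consequences satisfy every held-out
    property in the validation set, i.e. things we actually want. A clever
    out-of-the-box plan that genuinely gets us what we want passes; a plan
    that exploits a gap between the goal specification and what we actually
    want violates some held-out property and is rejected."""
    consequences = predict_consequences(plan)
    return all(prop(consequences) for prop in validation_properties)
```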
The Alice and Bob example isn't a good argument against the independence axiom. The combined agent can be represented using a fact-conditional utility function. Include the event "get job offer" in the outcome space, so that the combined utility function is a function of that fact.
E.g.
Bob {A: 0, B: 0.5, C: 1}
Alice {A: 0.3, B: 0, C: 0}
Should merge to become
AliceBob {Ao: 0, Bo: 0.5, Co: 1, A¬o: 0.3, B¬o: 0, C¬o: 0}, where o="get job offer".
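To spell out what the fact-conditional utility function buys (the credences in o below are made up, just to show the mechanics):

```python
# A single ordinary utility function over the enlarged outcome space,
# where each outcome includes whether o = "get job offer" happened.
u = {("A", True): 0,    ("B", True): 0.5,  ("C", True): 1,    # Bob's values, given o
     ("A", False): 0.3, ("B", False): 0,   ("C", False): 0}   # Alice's values, given not-o

def expected_utility(option, p_offer):
    """Plain expected utility of an option, given credence p_offer in o."""
    return p_offer * u[(option, True)] + (1 - p_offer) * u[(option, False)]

# The merged agent's ranking over A, B, C shifts with its credence in o
# (Bob's ranking C > B > A when o is likely, Alice's favourite A on top when
# o is unlikely), but only via ordinary conditioning on a fact it cares
# about; no probabilities are mixed into the preference structure itself.
for p_offer in (0.9, 0.1):
    ranking = sorted(["A", "B", "C"],
                     key=lambda x: expected_utility(x, p_offer), reverse=True)
    print(p_offer, ranking)
```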
This is a far more natural way to combine agents. We can avoid the ontologically weird mixing of probabilities and preferences that the geometric-rationality picture implies. Like... what does a geometrically rational agent actually care about, and why do its preferences change depending on its own beliefs and priors? A fact-conditional utility function is ontologically cleaner. Agents care about events in the world (potentially in different ways across branches of possibility, but it's still fundamentally caring about events).
This removes all the appeal of geometric rationality for me. The remaining intuitive appeal comes from humans having preferences that are logarithmic in most resources, which is more simply represented as one utility function rather than as a geometric average of many.
Hmm, good point. Looking at your dialogues has changed my mind; they have higher karma than the ones I was looking at.
You might also be unusual on some axis that makes arguments easier. It takes me a lot of time to go over people's words and work out what beliefs are consistent with them. And the inverse, translating a model into words, also takes a while.