I think that if it was going to go ahead, it should have been made stronger and clearer. But that wouldn't have been politically feasible, and therefore, if that had been the standard being aimed for, it wouldn't have gone ahead.
This, I think, would have been better than the actual outcome.
Whoa, this seems very implausible to me. Speaking with the courage of one's convictions in situations which feel high-stakes is an extremely high bar, and I know of few people who I'd describe as consistently doing this.
If everyone you know seems to be in this category, consider whether your standards for this are far too low.
For a while now I've been thinking about the difference between "top-down" agents which pursue a single goal, and "bottom-up" agents which are built around compromises between many goals/subagents.
I've now decided that the frame of "centralized" vs "distributed" agents is a better way of capturing this idea, since there's no inherent "up" or "down" direction in coalitions. It's also more continuous.
Credit to @Scott Garrabrant, who made something like this point to me a while back, in a way which I didn't grok at the time.
This feels related to the predictive processing framework, in which the classifications of one model are then predicted by another.
More tangentially, I've previously thought about merging cognitive biases and values, since we can view both of them as deviations from the optimal resource-maximizing policy. For example, suppose that I am willing to bet even when I am being Dutch booked. You could think of that as a type of irrationality, or you could think of it as an expression of me valuing being Dutch booked, and therefore being willing to pay to experience it.
This is related to the Lacanian/"existential kink" idea that most dysfunctions are actually deliberate, caused by subagents that are trying to pursue some goal at odds with the rest of your goals.
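To make the Dutch book point above concrete, here's a minimal sketch with made-up numbers (the specific credences and bet prices are my own illustration, not from the comment itself): an agent whose credences don't sum to 1 will accept a pair of bets that together lose money no matter what happens.

```python
# Toy illustration (assumed numbers): an agent assigns credence 0.6 to "rain"
# and 0.6 to "no rain". It will treat both of the following bets as fair,
# even though buying both guarantees a loss.

def fair_price(credence: float, payout: float) -> float:
    """Highest price the agent considers fair for a bet paying `payout` if it wins."""
    return credence * payout

payout = 1.00                            # each bet pays $1 if it wins
price_rain = fair_price(0.6, payout)     # $0.60 for the bet that wins if it rains
price_no_rain = fair_price(0.6, payout)  # $0.60 for the bet that wins if it doesn't

total_paid = price_rain + price_no_rain  # $1.20 spent up front
guaranteed_payout = payout               # exactly one bet wins, paying $1.00
print(f"Guaranteed loss: ${total_paid - guaranteed_payout:.2f}")  # $0.20
```

On the "values" framing, that guaranteed $0.20 isn't a mistake; it's the price the agent is willing to pay to have the experience.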
Oops, good catch. It should have linked to this: https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform?commentId=W9N9tTbYSBzM9FvWh (and I've changed the link now).
Here is a broad sketch of how I'd like AI governance to go. I've written this in the form of a "plan", but it's not really a sequential plan; it's more like a list of the most important things to promote.
This is of course all very vague; I'm hoping to flesh it out much more over the coming months, and would welcome thoughts and feedback. Having said that, I'm spending relatively little of my time on this (and focusing on technical alignment work instead).
Here is the broad technical plan that I am pursuing with most of my time (with my AI governance agenda taking up most of my remaining time):
I'm focusing on step 1 right now. Note that my pursuit of it is overdetermined—I'm excited enough about finding a scale-free theory of intelligent agency that I'd still be working on it even if I didn't think steps 2-4 would work, because I have a strong heuristic that pursuing fundamental knowledge is good. Trying to backchain from an ambitious goal to reasons why a fundamental scientific advance would be useful for achieving that goal feels pretty silly from my perspective. But since people keep asking me why step 1 would help with alignment, I decided to write this up as a central example.
Yepp, see also some of my speculations here: https://x.com/richardmcngo/status/1815115538059894803?s=46
I suspect that many of the things you've said here are also true for humans.
That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out what that means as they learn more about the world.
Another way of putting this point: being pulled from the void is not a feature of LLM personas. It's a feature of personas. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling) and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.
What's the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the "pulling from the void" thing should be seen less as an odd thing that we're doing with AIs, and more as a claim about the nature of minds in general.