Richard_Ngo

Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.

Sequences

Twitter threads
Understanding systematization
Stories
Meta-rationality
Replacing fear
Shaping safer goals
AGI safety from first principles

Comments (sorted by newest)

the void
Richard_Ngo · 2h

I suspect that many of the things you've said here are also true for humans.

That is, we humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out what that means as they learn more about the world.

Another way of putting this point: being pulled from the void is not a distinctive feature of LLM personas. It's a feature of personas in general. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling), and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.

What's the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the "pulling from the void" thing should be seen less as an odd thing that we're doing with AIs, and more as a claim about the nature of minds in general.

A case for courage, when speaking of AI danger
Richard_Ngo · 5d

I think that if it was going to go ahead, it should have been made stronger and clearer. But that wouldn't have been politically feasible, and so if that had been the standard, it wouldn't have gone ahead at all.

This, I think, would have been better than the outcome that actually happened.

A case for courage, when speaking of AI danger
Richard_Ngo · 5d

Whoa, this seems very implausible to me. Speaking with the courage of one's convictions in situations which feel high-stakes is an extremely high bar, and I know of few people who I'd describe as consistently doing this.

If you don't know anyone who isn't in this category, consider whether your standards for this are far too low.

Richard Ngo's Shortform
Richard_Ngo · 9d

For a while now I've been thinking about the difference between "top-down" agents which pursue a single goal, and "bottom-up" agents which are built around compromises between many goals/subagents.

I've now decided that the frame of "centralized" vs "distributed" agents is a better way of capturing this idea, since there's no inherent "up" or "down" direction in coalitions. It's also more continuous.

Credit to @Scott Garrabrant, who made something like this point to me a while back, in a way which I didn't grok at the time.

Judgements: Merging Prediction & Evidence
Richard_Ngo · 9d

This feels related to the predictive processing framework, in which the classifications of one model are then predicted by another.

More tangentially, I've previously thought about merging cognitive biases and values, since we can view both of them as deviations from the optimal resource-maximizing policy. For example, suppose that I am willing to bet even when I am being Dutch booked. You could think of that as a type of irrationality, or you could think of it as an expression of me valuing being Dutch booked, and therefore being willing to pay to experience it.

This is related to the Lacanian/"existential kink" idea that most dysfunctions are actually deliberate, caused by subagents that are trying to pursue some goal at odds with the rest of your goals.
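
As a concrete rendering of the Dutch book example above, here is a minimal money-pump sketch (the setup and numbers are my own, purely illustrative): an agent with cyclic preferences will pay a small fee for each swap it prefers, and after a full cycle it holds its original item with strictly less money. On the reframing above, that guaranteed loss reads as a purchase rather than a mistake.

```python
# Minimal money-pump illustration (hypothetical setup and numbers): cyclic
# preferences mean the agent pays a fee for every swap it prefers, so trading
# around the cycle returns it to its starting item with strictly less money.

FEE = 1.0  # fee the agent will pay for each preferred swap

def run_money_pump(start_item: str = "A", cycles: int = 3) -> float:
    """Trade around the preference cycle; return the total money lost."""
    prefers = {"A": "B", "B": "C", "C": "A"}  # prefers B to A, C to B, A to C
    item, money_lost = start_item, 0.0
    for _ in range(3 * cycles):   # three swaps complete one full cycle
        item = prefers[item]      # swap to the currently preferred item...
        money_lost += FEE         # ...and pay for the privilege
    assert item == start_item     # back where it started, but poorer
    return money_lost

print(run_money_pump())  # 9.0: a guaranteed loss, or a purchased experience
```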

Richard Ngo's Shortform
Richard_Ngo · 25d
  1. Consider the version of the 5-and-10 problem in which one subagent is assigned to calculate U | take 5, and another calculates U | take 10. The overall agent solves the 5-and-10 problem iff the subagents reason about each other in the "right ways", or have the right type of relationship to each other. What that specifically means seems like the sort of question that a scale-free theory of intelligent agency might be able to answer. (See the toy sketch after this list.)
  2. I'm mostly trying to extend pure mathematical frameworks (particularly active inference and a cluster of ideas related to geometric rationality, including picoeconomics and ergodicity economics).
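
Here is a toy rendering of the subagent framing in point 1 (entirely my own construction, and it deliberately stubs out the hard part): each subagent estimates utility conditional on one action, and the overall agent takes whichever action gets the higher estimate. What each subagent should assume about the other (the "right ways" of reasoning) is exactly what the placeholder estimates leave open.

```python
# Toy subagent decomposition of the 5-and-10 problem (illustrative only).
# Each subagent estimates U conditional on its assigned action; the overall
# agent picks the action whose subagent reports the higher estimate. The hard
# question of how each subagent should model the other is stubbed out with
# hard-coded estimates, which is what a real theory would need to fill in.

from typing import Callable, Dict

def make_subagent(estimated_utility: float) -> Callable[[], float]:
    """Return a subagent assigned to estimate U conditional on its action."""
    def estimate() -> float:
        # Placeholder: a non-toy version would reason about the other
        # subagent / the overall agent instead of returning a constant.
        return estimated_utility
    return estimate

def overall_agent(subagents: Dict[str, Callable[[], float]]) -> str:
    """Take the action whose subagent reports the highest conditional utility."""
    return max(subagents, key=lambda action: subagents[action]())

subagents = {"take 5": make_subagent(5.0), "take 10": make_subagent(10.0)}
print(overall_agent(subagents))  # "take 10", the sensible non-spurious answer
```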
Richard Ngo's Shortform
Richard_Ngo · 26d

Oops, good catch. It should have linked to this: https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform?commentId=W9N9tTbYSBzM9FvWh (and I've changed the link now).

Richard Ngo's Shortform
Richard_Ngo · 26d

Here is a broad sketch of how I'd like AI governance to go. I've written this in the form of a "plan" but it's not really a sequential plan, more like a list of the most important things to promote.

  1. Identify mechanisms by which the US government could exert control over the most advanced AI systems without strongly concentrating power. For example, how could the government embed observers within major AI labs who report to a central regulatory organization, without that regulatory organization having strong incentives and ability to use their power against their political opponents?
    1. In practice I expect this will involve empowering US elected officials (e.g. via greater transparency) to monitor and object to misbehavior by the executive branch.
  2. Create common knowledge between the US and China that the development of increasingly powerful AI will magnify their own internal conflicts (and empower rogue states) disproportionately more than it empowers them against each other. So instead of a race to world domination, in practice they will face a "race to stay standing".
    1. Rogue states will be empowered because human lives will be increasingly fragile in the face of AI-designed WMDs. This means that rogue states will be able to threaten superpowers with "mutually assured genocide" (though I'm a little wary of spreading this as a meme, and need to think more about ways to make it less self-fulfilling).
  3. Set up channels for flexible, high-bandwidth cooperation between AI regulators in China and the US (including the "AI regulators" in each who try to enforce good behavior from the rest of the world).
  4. Advocate for an ideology roughly like the one I sketched out here, as a consensus alignment target for AGIs.

This is of course all very vague; I'm hoping to flesh it out much more over the coming months, and would welcome thoughts and feedback. Having said that, I'm spending relatively little of my time on this (and focusing on technical alignment work instead).

Richard Ngo's Shortform
Richard_Ngo · 26d

Here is the broad technical plan that I am pursuing with most of my time (with my AI governance agenda taking up most of my remaining time):

  1. Mathematically characterize a scale-free theory of intelligent agency which describes intelligent agents in terms of interactions between their subagents.
    1. A successful version of this theory will retrodict phenomena like the Waluigi effect, solve theoretical problems like the five-and-ten problem, and make new high-level predictions about AI behavior.
  2. Identify subagents (and subsubagents, and so on) within neural networks by searching their weights and activations for the patterns of interactions between subagents that this theory predicts.
    1. A helpful analogy is how Burns et al. (2022) search for beliefs inside neural networks based on the patterns that probability theory predicts. However, I'm not wedded to any particular search methodology. (A rough sketch of this analogy appears after this list.)
  3. Characterize the behaviors associated with each subagent to build up "maps" of the motivational systems of the most advanced AI systems.
    1. This would ideally give you explanations of AI behavior that scale in quality based on how much effort you put in. E.g. you might be able to predict 80% of the variance in an AI's choices by looking at which highest-level subagents are activated, then 80% of the remaining variance by looking at which subsubagents are activated, and so on.
  4. Monitor patterns of activations of different subagents to do lie detection, anomaly detection, and other useful things.
    1. This wouldn't be fully reliable—e.g. there'd still be some possible failures where low-level subagents activate in ways that, when combined, lead to behavior that's very surprising given the activations of high-level subagents. (ARC's research seems to be aimed at these worst-case examples.) However, I expect it would be hard even for AIs with significantly superhuman intelligence to deliberately contort their thinking in this way. And regardless, in order to solve worst-case examples it seems productive to try to solve the average-case examples first.
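
To make the Burns et al. (2022) analogy in step 2 concrete, here is a rough sketch of that style of search, with the implementation details mine rather than anything specified above: a CCS-style linear probe is fit to paired activations for a statement and its negation by minimizing a consistency loss (the two probabilities should sum to one) plus a confidence loss (they shouldn't both sit at 0.5). The proposal above would swap these probability-theoretic constraints for whatever interaction patterns the theory of subagents ends up predicting.

```python
# Sketch of CCS-style probing (Burns et al., 2022), as an analogy for searching
# activations for theory-predicted patterns. Assumes you have already extracted
# activations for contrast pairs (statement vs. negation) from some model.

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(acts))  # probability the input is "true"

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1 - p_neg)) ** 2   # p(x+) should equal 1 - p(x-)
    confidence = torch.min(p_pos, p_neg) ** 2  # discourage both sitting near 0.5
    return (consistency + confidence).mean()

def fit_probe(acts_pos: torch.Tensor, acts_neg: torch.Tensor, steps: int = 1000) -> LinearProbe:
    """Fit a probe on paired activations using the consistency + confidence loss."""
    probe = LinearProbe(acts_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos), probe(acts_neg))
        loss.backward()
        opt.step()
    return probe

# Hypothetical usage, with acts_pos / acts_neg of shape (n_pairs, hidden_dim):
# probe = fit_probe(acts_pos, acts_neg)
```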

I'm focusing on step 1 right now. Note that my pursuit of it is overdetermined—I'm excited enough about finding a scale-free theory of intelligent agency that I'd still be working on it even if I didn't think steps 2-4 would work, because I have a strong heuristic that pursuing fundamental knowledge is good. Trying to backchain from an ambitious goal to reasons why a fundamental scientific advance would be useful for achieving that goal feels pretty silly from my perspective. But since people keep asking me why step 1 would help with alignment, I decided to write this up as a central example.

Power Lies Trembling: a three-book review
Richard_Ngo · 1mo

Yepp, see also some of my speculations here: https://x.com/richardmcngo/status/1815115538059894803?s=46

Posts

Well-foundedness as an organizing principle of healthy minds and societies (3mo)
Third-wave AI safety needs sociopolitical thinking (3mo)
Towards a scale-free theory of intelligent agency (4mo)
Elite Coordination via the Consensus of Power (4mo)
Trojan Sky (4mo)
Power Lies Trembling: a three-book review (3mo)
The Gentle Romance (5mo)
From the Archives: a story (6mo)
Epistemic status: poetry (and other poems) (7mo)
Why I’m not a Bayesian (9mo)