Alignment researchers should think hard about switching to working on AI Control
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
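To make the "shift labor until marginal returns are equal" point concrete, here's a toy sketch in Python (every curve and constant is made up purely for illustration, not an estimate of anything): under diminishing returns in both areas, the best split is wherever the marginal P(doom) reduction per researcher is equalized, and it lands near 50/50 exactly when the two curves are comparably steep.

```python
# Toy model of "shift labor until the marginal unit of research in either
# area decreases P(doom) by the same amount". All functional forms and
# constants here are hypothetical, chosen only for illustration.
import math

TOTAL = 100  # hypothetical number of researchers to allocate

def alignment_gain(n):
    """Hypothetical diminishing-returns curve for alignment research."""
    return 10 * math.log1p(n)

def control_gain(n):
    """Hypothetical diminishing-returns curve for control research."""
    return 10 * math.log1p(n)

# Grid-search the split that maximizes total P(doom) reduction.
best = max(range(TOTAL + 1),
           key=lambda k: control_gain(k) + alignment_gain(TOTAL - k))
print(f"researchers on control at the optimum: {best}/{TOTAL}")  # 50/100
# With identical curves the optimum is the 50/50 split; making one curve
# steeper (more tractable or more neglected) shifts labor toward it, which
# is the "equalize marginal returns" condition in discrete form.
```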
For this post, my definitions are roughly:
Here are some thoughts, argument...
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of the alignment problem, the parts that are easy to verify, but alignment as a whole will still be bottlenecked on the philosophically confusing, hard-to-verify parts.
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn't.
So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I'm afraid that some people just don't feel philosophically con...
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
[...]
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs.
It currently seems unlikely to me that the marginal AI control research I'm excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat economically relevant.
I can also imagine worlds where AIs are clear schemers that are clearly unproductive to use by default, and thus control increases deployment through both straightforward incentives and risk compensation (though this maybe requires the AIs to have made a mistake, or to have somewhat strange preferences, as they would probably prefer to fool our tests).
My understanding is that your concerns about the downsides are maybe some combination of:
I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem.
When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven't seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.
I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.
(I go into this and various related things in my dialogue with Ryan on control)
FWIW, I don't think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I've seriously considered doing the latter).
(I do feel more optimistic than you about being able to use controlled AIs for something useful.)
This topic is important enough that you could consider making a full post.
My belief is that this would improve reach, and also make it easier for people to reference your arguments.
Consider: you believe there is a 45% chance that alignment researchers would be better off pivoting to control research. I don't expect a quick take to reach anywhere near that many of them, and it has a low chance of catalysing dramatic, institution-level change.
Quick analogy: if the sum of a bunch of numbers is large, there doesn't need to be any individual number that is large; similarly, if the consequences of a sequence of actions result in a large change, no individual action needs to be "pivotal".
This feels like a pretty central cruxy point - and not just for the relevance of the pivotal act framing specifically. I think it's underlying a whole difference of worldview or problem-solving approach.
A couple other points in a similar direction:
I would say that your mindset, when approaching AI safety, seems to rest on an epsilon fallacy.
Sure, in principle a sum of numbers can be large without a...
I think I disagree with your model of importance. If your goal is to make a sum of numbers small, then you want to focus your efforts where the derivative is lowest (highest? signs are hard), not where the absolute magnitude is highest.
The "epsilon fallacy" can be committed in both directions: both in that any negative dervative is worth working on, and that any extremely large number is worth taking a chance to try to improve.
I also separately think that "bottleneck" is not generally a good term to apply to a complex project with high amounts of technical and philosophical uncertainty. The ability to see a "bottleneck" is very valuable should one exist, but I am skeptical of the ability to strongly predict where such bottlenecks will be in advance, and I don't think the historical record really supports the ability to find such bottlenecks reliably by "thinking", as opposed to doing a lot of stuff, including trying things and seeing what works. If you have a broad distribution over where a bottleneck might be, then all activities lend value by "derisking" locations for particular bottlenecks if they succeed, and by providing more evidence that a bottleneck is in a particular location...
The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It's important for them to not escape and do other bad stuff until they've solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?
I directionally agree with this (and think it's good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
1. Control in the broad sense: ensuring safety even if models turn out to be misaligned and are trying to subvert our safety measures, without relying on having aligned them.
2. Control in the narrow sense: the specific black-box techniques (of the kind Redwood has focused on) that don't rely on model internals.
I think 2. is arguably the most promising strategy for 1., but I've occasionally noticed myself conflating them more than I should.
1. gives you the naive 50/50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)
I also think (non-confidently) that 2. is what's really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.
But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!
I agree, and regret focusing as much as we did on 2 in the past; I'm excited for work on "white box control" (there's some underway, and I'm excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)
AI safety researchers might be allocated too heavily to Anthropic compared to Google DeepMind
Some considerations:
ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".
I think this depends on whether you think AI safety at a lab is more of an O-ring process (where the weakest link determines the outcome) or a Swiss-cheese process (where each additional layer of defense adds protection). Also, if you think it's more of an O-ring process, you might be generally less excited about working at a scaling lab.
Some possible counterpoints:
My current taxonomy of rationalists is:
Are there other types of rationalists? Maybe group-chat rationalists? Or podcast rationalists? Google Doc rationalists?
Alternative taxonomy:
John Flanagan: “An ordinary archer practices until he gets it right. A ranger practices until he never gets it wrong.”
I want to reword this to make it about rationality in a way that isn't pretentious.
Cavilo, The Vor Game: "The key to strategy... is not to choose a path to victory, but to choose so that all paths lead to a victory." is close to what I want, but not quite.
Epistemic status: rambles
Quarantine preparation has made me realize that a day's worth of food is actually really cheap, doesn't require that much time to cook, and can be made fairly tasty for not much more; i.e. a day's worth of easy-to-cook, relatively tasty food is about $5.
This requires some amortized costs for the easy-to-cook and relatively-tasty part, but not immensely large upfront costs (Instant Pot, spices, etc.).
This reference says that 40 million people dealt with hunger in the US. I am... super confused? I find it extremely ...
If you have DAI right now, minting on https://foundry.finance/ and swapping yTrump for nTrump on catnip.exchange is an almost guaranteed 15% profit.
LessWrong posts that I want someone to write:
Maybe I'll think of more later.
A weird but not-inaccurate way to think of log(n) (base 10) is as an answer to "how many digits does n have?"
This suggests that a weird but not-inaccurate way to think of a log-normal distribution is as a distribution where "the number of digits is normally distributed"
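A quick numerical check of both framings (base 10; the snippet below is just an illustrative sketch, with arbitrarily chosen parameters for the log-normal):

```python
# floor(log10(n)) + 1 is exactly the digit count of a positive integer n,
# so log10(n) really is "the number of digits, up to a constant offset".
import math
import random

for n in (7, 42, 999, 123456):
    assert math.floor(math.log10(n)) + 1 == len(str(n))

# A log-normal sample is one whose log (i.e. roughly its digit count) is
# normally distributed. Here log10 of each sample ~ Normal(mean=3, sd=1):
# "about four digits, give or take a digit".
samples = [10 ** random.gauss(3, 1) for _ in range(100_000)]
mean_log10 = sum(math.log10(x) for x in samples) / len(samples)
print(f"mean log10(sample) ≈ {mean_log10:.2f}")  # ≈ 3.00
```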
Ignorance as evidence
I was answering a bunch of questions from Open Phil's calibration test of the form "when did <thing> happen?". A lot of the time, I had no knowledge of <thing>, so I gave a fairly large confidence interval as a "maximum ignorance" type prediction (1900-2015, for example).
However, the fact that I have no knowledge of <thing> is actually moderate evidence that it happened "before my time".
Example: "when did <person> die?" If I was alive when <person> died, there's a higher chance of me hearing about their death. Thus not having heard of <person> is evidence that they died some time ago.
If you're interviewing employees of a company about how good the company is, there's positive bias because people who hate the company will have already left.
Coinfection rates of COVID and normal flu are very low. If you have the set of flu/COVID symptoms, you're basically guaranteed to have one or the other. You can test for the flu pretty easily. Therefore, people can just test for the flu as a proxy for testing for COVID.
Is this just a really obvious chain of reasoning that everyone has missed? Which one of my assumptions is wrong?
https://twitter.com/katyw2004/status/1236848300143280128 says coinfection rates are low
https://www.wikiwand.com/en/Rapid_influenza_diagnostic_test means we can test for the flu fast.
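Rough numbers for how the proxy argument plays out (all the rates below are illustrative assumptions, not epidemiological estimates): under the low-coinfection assumption, a positive rapid flu test would make COVID much less likely, but the flu test's sizable false-negative rate limits what a negative result tells you, which may be the weak link in the chain.

```python
# Toy calculation for "test for the flu as a proxy for testing for COVID",
# assuming coinfection is negligible. Every rate here is an illustrative
# assumption, not a real estimate.
p_covid = 0.3              # P(COVID | has the flu/COVID symptom set)
p_flu = 0.7                # P(flu | symptoms); the two assumed exclusive
flu_sensitivity = 0.6      # rapid flu tests miss a large share of true flu
flu_false_positive = 0.02

# Positive rapid flu test -> COVID becomes very unlikely.
p_pos = flu_sensitivity * p_flu + flu_false_positive * p_covid
print(f"P(COVID | flu test +) ≈ {flu_false_positive * p_covid / p_pos:.3f}")   # ≈ 0.014

# Negative test -> COVID more likely, but not decisively so, because the
# flu test misses ~40% of real flu cases under these assumptions.
p_neg = (1 - flu_sensitivity) * p_flu + (1 - flu_false_positive) * p_covid
print(f"P(COVID | flu test -) ≈ {(1 - flu_false_positive) * p_covid / p_neg:.2f}")  # ≈ 0.51
```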
...Moral uncertainty is a thing that people think about. Do people also think about decision-theoretic uncertainty? E.g. how to decide when you're uncertain about which decision theory is correct?