This is a special post for quick takes by Mark Xu. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
[-]Mark XuΩ6814733

Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
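One way to spell out the marginal-equalization claim above, as a minimal sketch under assumptions that are not in the original post: a fixed pool of $N$ researchers is split into $x_A$ working on alignment and $x_C$ working on control, and $P(x_A, x_C)$ is a hypothetical differentiable P(doom) with diminishing returns in each argument.

```latex
\min_{x_A,\, x_C \ge 0} \; P(x_A, x_C)
\quad \text{subject to} \quad x_A + x_C = N
\qquad \Longrightarrow \qquad
\frac{\partial P}{\partial x_A} = \frac{\partial P}{\partial x_C}
\quad \text{at an interior optimum.}
```

Under these assumptions, if $\partial P / \partial x_C < \partial P / \partial x_A$ at the current allocation (control is reducing P(doom) faster at the margin), then moving a researcher from alignment to control lowers $P$, and the shift should continue until the two marginal effects are equal.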

For this post, my definitions are roughly:

  • AI alignment is the task of ensuring the AIs “do what you want them to do”
  • AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research.)

Here are some thoughts, argument... (read more)

[-]Wei DaiΩ246139

Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.

When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottle-necked on philosophically confusing, hard to verify parts.

Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn't.

So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I'm afraid that some people just don't feel philosophically con... (read more)

Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.

[...]

Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs.

It currently seems unlikely to me that marginal AI control research I'm excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat economically relevant.

I can also imagine worlds where AIs are clear schemers that are clearly unproductive to use by default, and thus control increases deployment through both straightforward incentives and risk compensation (though this maybe requires the AIs to have made a mistake or to have somewhat strange preferences as they would probably prefer fooling our tests).

My understanding is that your concerns on downside are maybe some combination of:

  1. AI control research will be economically very useful (e.g. accelerates AI by 6 months conditional on 5 year timelines).
  2. Control will result in irrational risk compensation style actio
... (read more)
8Mark Xu
My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the "long reflection", and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible. I separately have hope we can solve "the entire problem" at some point, e.g. through ARC's agenda (which I spend most of my time trying to derisk and advance).
7ryan_greenblatt
People interested in a discussion about control with someone who is maybe closer to Wei Dai's perspective might be interested in my dialogue with habryka.
2Max Lee
I think if the first powerful unaligned AI remained in control instead of escaping, it might make a good difference, because we can engineer and test alignment ideas on it, rather than develop alignment ideas on an unknown future AI. This assumes at least some instances of it do not hide their misalignment very well.
0Noosphere89
I think a key difference is I do believe the technical alignment/control problem as defined essentially requires no philosophical progress or solving philosophical problems like the hard problem of consciousness, and I believe the reason for this comes down to both a general point and a specific point. In general, one of the reasons I believe philosophy tends not to be a productive area compared to other branches of science is that usually philosophers either try to solve problems that are essentially proven to be intractable nowadays, or they straight up try to solve a problem in far too much generality without doing any experiments, and that's when they aren't straight up solving fictional problems (I believe a whole lot of possible world philosophizing is in that category). This is generally because philosophers do far too much back-chaining compared to front-chaining on a lot of problems. For the specific point of alignment/control agendas, it's because the problem of AI alignment isn't a problem about what goals you should assign the AI, but rather whether you can put goals into the AI system such that the AI will reliably follow your goals at all.
[-]habrykaΩ11306

I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem. 

When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven't seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.

I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.

(I go into this and various related things in my dialogue with Ryan on control)

[-]BuckΩ10154

FWIW, I don't think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I've seriously considered doing the latter).

(I do feel more optimistic than you about being able to use controlled AIs for something useful.)

4Akash
@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is.  Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
4ryan_greenblatt
On (a) and (b), we describe this at a high level here. We don't really have anything written on (c) or (d). (c) really depends a lot on effort, so I'd probably prefer to talk mostly about (d) including what evaluations would be needed at various points etc.

For (a), I think we potentially care about all of:

  1. Systems which are perhaps qualitatively similarly smart to OK software engineers and which are capable of speeding up R&D work by 10x (speedups aren't higher due to human bottlenecks). (On a nearcast, we'd expect such systems to be very broadly knowledgeable, pretty fast, and very well tuned for many of their usages.)
  2. Systems which nearly strictly dominate top human scientists on capability and which are perhaps similar in qualitative intelligence (I'd guess notably, but not wildly weaker and compensating in various ways.) Such systems likely have some domains/properties in which they are much better than any human or nearly any human.
  3. Systems which are qualitatively smarter than any human by a small amount.

It's likely control is breaking down by (3) unless control ends up being quite easy or the implementation/evaluation is very good.

On (b) we plan on talking more about this soon. (Buck's recent EAGx talk is pretty up to date with our current thinking, though this talk is obviously not that detailed. IDK if you can find a recording anywhere.)
1Satron
Does that mean that you believe that after a certain point we would lose control over AI? I am new to this field, but doesn't this fact spell doom for humanity?
9ryan_greenblatt
By "control", I mean AI Control: approaches aiming to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures. AI control stops working once AIs are sufficiently capable (and likely doesn't work for all possible deployments that might eventually be otherwise desirable), but there could be other approaches that work at that point. In particular aligning systems.

The main hope I think about is something like:

  • Use control until AIs are capable enough that if we trusted them, we could obsolete top human scientists and experts.
  • Use our controlled AI labor to do the work needed to make systems which are capable enough, trustworthy enough (via alignment), and philosophically competent enough that we can safely hand things off to them. (There might be some intermediate states to get to here.)
  • Have these systems which totally obsolete us figure out what to do, including figuring out how to align more powerful systems as needed.

We discuss our hopes more in this post.
2Buck
Re a, there’s nothing more specific on this than what we wrote in “the case for ensuring”. But I do think that our answer there is pretty good. Re b, no, we need to write some version of that up; I think our answer here is ok but not amazing, writing it up is on the list.
7Mark Xu
yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.
0Noosphere89
I think AI control agendas are defined in such a way such that this metric isn't as relevant as you think it is: Because the agenda isn't trying to make AIs alignable, but to make them useful and not break out of labs, so the question of the timeline to unaligned AI is less relevant than it is for most methods of making safe AI.

This topic is important enough that you could consider making a full post.

My belief is that this would improve reach, and also make it easier for people to reference your arguments. 

Consider, you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not suspect a quick take will reach anywhere close to that number, and has a low chance of catalysing dramatic, institutional level change. 

2Mark Xu
Yes, I agree. If I had more time, this would have been a top-level post. If anyone reading wants to write such a post using my quick take as a base, I would be happy to take a look and offer comments. I might do it myself at some point as well.
3MondSemmel
Just make it a full post without doing much if any editing, and link to this quick take and its comments when you do. A polished full post is better than an unpolished one, but an unpolished one is better than none at all.
1Max Lee
I'm not sure if this is allowed here, but maybe you can ask an AI to write a draft and manually proofread for mistakes?
3Mark Xu
idk how much value that adds over this shortform, and I currently find AI prose a bit nauseating.
1Max Lee
That's fair. To be honest I've only used AI for writing code, I merely heard about other people having success with AI drafts. Maybe their situation was different, or they were bad at English to the point that AI writes better than them.

quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”

This feels like a pretty central cruxy point - and not just for the relevance of the pivotal act framing specifically. I think it's underlying a whole difference of worldview or problem-solving approach.

A couple other points in a similar direction:

  • A thing I noticed in our discussion on the model delta with Christiano post: your criterion for useful AI safety work seems to be roughly "this will contribute some marginal value" as opposed to "this will address a bottleneck".
  • Right at the top of this thread, you say: "I think Redwood Research’s recent work on AI control really 'hits it out of the park', and they have identified a tractable and neglected intervention that can make AI go a lot better". Note what's conspicuously missing there: tractable and neglected, but you don't claim importance.

I would say that your mindset, when approaching AI safety, seems to be an epsilon fallacy.

Sure, in principle a sum of numbers can be large without a... (read more)

[-]Mark XuΩ5100

I think I disagree with your model of importance. If your goal is to make a sum of numbers small, then you want to focus your efforts where the derivative is lowest (highest? signs are hard), not where the absolute magnitude is highest.
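A minimal sketch of the derivative-versus-magnitude point, using made-up numbers that are not from the thread. It assumes each term in the sum responds linearly to a small amount of effort, and takes "lowest derivative" to mean the most negative one, since the goal is to make the sum small. The names `values`, `derivs`, and `total_after` are hypothetical.

```python
# Hypothetical numbers, chosen only for illustration (not from the thread).
values = [100.0, 5.0, 1.0]     # current size of each term in the sum
derivs = [-0.1, -2.0, -0.5]    # change in each term per unit of effort spent on it


def total_after(effort: float, idx: int) -> float:
    """First-order estimate of the sum after spending `effort` on term `idx`."""
    return sum(values) + derivs[idx] * effort


# Index of the largest term versus index of the most negative derivative.
largest = max(range(len(values)), key=lambda i: values[i])
steepest = min(range(len(derivs)), key=lambda i: derivs[i])

print("work on largest term: ", total_after(1.0, largest))
print("work on steepest term:", total_after(1.0, steepest))
```

With these numbers, one unit of effort on the term with the most negative derivative shrinks the sum more than the same effort on the largest term. This says nothing about which real research area has which derivative, which is the actual disagreement in the thread.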

The "epsilon fallacy" can be committed in both directions: both in that any negative derivative is worth working on, and that any extremely large number is worth taking a chance to try to improve.

I also separately think that "bottleneck" is not generally a good term to apply to a complex project with high amounts of technical and philosophical uncertainty. The ability to see a "bottleneck" is very valuable should one exist, but I am skeptical of the ability to strongly predict where such bottlenecks will be in advance, and do not think the historical record really supports the ability to find such bottlenecks reliably by "thinking", as opposed to doing a lot of stuff, including trying things and seeing what works. If you have a broad distribution over where a bottleneck might be, then all activities lend value by "derisking" locations for particular bottlenecks if they succeed, and providing more evidence that a bottleneck is in a particular location... (read more)

3johnswentworth
I do think that "we don't have enough information to know where the bottlenecks are yet" is in-general a reasonable counterargument to a "just focus on the bottlenecks" approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that's perhaps a deeper crux.
6Mark Xu
Hilariously, it seems likely that our disagreement is even more meta, on the question of "how do you know when you have enough information to know", or potentially even higher, e.g. "how much uncertainty should one have given that they think they know" etc.
9Mark Xu
I agree it is better to work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently low amounts of work have gone into "control" that it's obviously worth investing more, because e.g. I think it'll let us get more data on where bottlenecks are.
2Mark Xu
see my longer comment https://www.lesswrong.com/posts/A79wykDjr4pcYy9K7/mark-xu-s-shortform#8qjN3Mb8xmJxx59ZG
[-]TsviBTΩ12214

The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.

Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?

I directionally agree with this (and think it's good to write about this more, strongly upvoted!)

For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:

  1. "ensuring that if the AIs are not aligned [...], then you are still OK" (which I think is the main meaning of "AI control")
  2. Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood's existing work on control).

I think 2. is arguably the most promising strategy for 1., but I've occasionally noticed myself conflating them more than I should.

1. gives you the naive 50/50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)

I also think (non-confidently) that 2. is what's really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.

But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!

[-]BuckΩ19268

I agree and regret focusing as much as we did on 2. in the past; I’m excited for work on “white box control” (there's some under way, and I'm excited for more).

We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.

I plan to write more about white box control soon (but I plan to write about a lot of stuff…)

9Rohin Shah
This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:

  • Clearly alignment: debate theory, certain flavors of process supervision
  • Clearly control: removing affordances (e.g. "don't connect the model to the Internet")
  • Could be either one: interpretability, critique models (in control this is called "untrusted monitoring"), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, ...

Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.

Other things that characterize work done under the name of "control" so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you're saying that people should do those things more, but this seems way more contentious and I'd probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).

Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about "trying to do what you want" (which only makes sense when applied to models, so has similar upshots). Tbc I like control and think more effort should be put into it; I just disagree with the strength o
6ryan_greenblatt
Agreed. (And I made a comment to this effect on Mark's draft of this short form.) I don't think it is that productive to classify research directions; the central thing we're trying to point at is a classification for reasons why you might think that you're fine where one reason is "the AI couldn't do bad stuff without getting caught" and another reason is "the AI wouldn't do bad stuff" (obviously simplifying somewhat).

I also think that white-box control should typically be considered pretty separately from black-box control and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective, e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation via exogenously trying to determine the probe reliability and other important aspects of the probe. For instance, the probe can take advantage of black box strategies and how safe we are depends on how many actions AIs need to cause huge problems.)

The claims I'd like to make are:

  • There should be more research specifically targeting a black-box control story and which is operating from a black-box control perspective. (I'd guess like 20% of the overall AI safety portfolio should be this and maybe 30% of safety work at labs. This is not including work like "get AIs to be more useful at alignment research and get better at DC eval methodology" which should also be subsidized from a control perspective but seem non-central.)
  • When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable oversight style research), they should consider sometimes explicitly thinking about the control perspective and adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment style story they think about while doing their research, they should be rou
5ryan_greenblatt
You're potentially thinking about footnote 4 in our post on control: This footnote is a bit confusingly worded, but I think the situation we were trying to say is "Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We're guessing you should put the better team on control." This is different than research in the run up.

I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren't the only classes of interventions and some interventions don't nicely fit into these buckets, e.g., you can do work on differentially making AIs more useful for alignment work which isn't well classified as either and you can work on high level interpretability which aims to roughly understand how AIs make decisions in some cases (this high-level interp doesn't clearly help with reducing the chance that scheming arises very directly, but could help with a bunch of stuff).)
4Noosphere89
I agree with bullet points 1, 2, 3, 6 and 7, partially agree with bullet point 5, and disagree with bullet point 4. Thus, I agree with the central claim here: For more of my analysis on the bullet points, read the rest of the comment. For bullet point 1, I basically agree with this, mostly due to not favoring binary assumptions and instead prefer continuous quantity reasoning, which tends to both be a better match for the IRL world, and also continuous quantity reasoning gives you more information than binary outcomes. I really like bullet point 2, and also think that even in a scenario where it's easy to prevent defection, you should still have controls that make defecting employees have much less reward and much more punishment for subversive actions. I deeply agree with point 3, and I'd frame AI control in one of 2 ways: 1. As a replacement for the pivotal act concept. 2. As a pivotal act that doesn't require destruction or death, and doesn't require you to overthrow nations in your quest. A nitpick: AI labor will be the huge majority of alignment progress in every stage, not just the early stage. I think one big reason the pivotal act frame dominated a lot of discussions is the assumption that we would get a pure software singularity which would FOOM in several weeks, but reality is shaping up to not be a pure software-singularity, since physical stuff like robotics and data centers still matters. There's a reason why every hyperscaler is trying to get large amounts of power and datacenter compute contracts, because they realize that the singularity is bottlenecked currently on power and to a lesser extent compute. I disagree with 4, but that's due to my views on alignment, which tend to view it as a significantly easier problem than the median LWer does, and in particular I view essentially 0 need for philosophical deconfusion to make the future go well. I agree that AI control enhances alignment arguments universally, and provides more compelling
5Nathan Helm-Burger
I feel like our viewpoints have converged a lot over the past couple years, Noosphere. Which I suppose makes sense, since we've both been updating on similar evidence! The one point I'd disagree with, although also wanting to point out that the disagreement seems irrelevant to short term strategy, is that I do think that philosophy and figuring out values is going to be pretty key in getting from a place of "shaky temporary safety" to a place of "long-term stable safety". But I think our views on the sensible next steps to get to that initial at-least-reasonable-safety sound quite similar. Since I'm pretty sure we're currently in a quite fragile place as a species, I think it's worth putting off thinking about long term safety (decades) to focus on short/medium term safety (months/years).
2Chris_Leong
I would suggest 50% of researchers working on a broader definition of control: including "control", technical governance work and technical outreach (scary demos, model organisms of misalignment). 
2jacquesthibs
I'm in the process of trying to build an org focused on "automated/augmented alignment research." As part of that, I've been thinking about which alignment research agendas could be investigated in order to make automated alignment safer and trustworthy. And so, I've been thinking of doing internal research on AI control/security and using that research internally to build parts of the system I intend to build. I figured this would be a useful test case for applying the AI control agenda and iterating on issues we face in implementation, and then sharing those insights with the wider community. Would love to talk to anyone who has thoughts on this or who would introduce me to someone who would fund this kind of work.
2cubefox
I don't see a significant difference in your distinction between alignment and control. If you say alignment is about doing what you want (which I strongly disagree with in its generality, e.g. when someone might want to murder or torture people or otherwise act unethically), that obviously includes your wanting to "be OK" when the AI didn't do exactly what you want. Alignment comes in degrees, and you merely seem to equate control with non-perfect alignment and alignment with perfect alignment. Or I might be misunderstanding what you have in mind.
2Noosphere89
The actual definition comes from this quote here: And the full link is below: https://www.lesswrong.com/tag/ai-control
2Cole Wyeth
I disagree with this take. A.I. control will only be important in a multipolar situation in which no single A.I. system can create a gray goo catastrophe etc. But if such pivotal acts are impossible and no singular A.I. takes control, but instead many A.I.’s are competing, then some groups will develop better or worse control for economic reasons and it won’t affect existential risk much to work on it now. I don’t think I can see a situation where control matters - only a few players have A.G.I. for a very long time and none escape or are open sourced but also none gain a decisive advantage?
3Cole Wyeth
I do see advantages to hardening important institutions against cyberattacks and increasing individual and group rationality so that humans remain agentic for as long as possible.
2Noosphere89
I think the key story for wins from AI control specifically is a scenario where we have human-level to slightly superhuman AI inside a lab that isn't aligned and wants to take over, but it turns out that it's easier to control what AI affordances are given than it is to align an AI, and in particular it's easier to catch an AI scheming than it is to make it aligned, and the lab wants to use AIs for alignment/control research. I don't see this as a probable scenario, but I do see it as a valuable scenario to work on, so it does have value in my eyes.
3Cole Wyeth
Imagine that there are just a few labs with powerful A.I., all of which are responsible enough to use existing A.I. control strategies which have been prepared for this situation, and none of which open source their models. Now if they successfully use their A.I. for alignment, they will also be able to successfully use it for capabilities research. At some point, control techniques will no longer be sufficient, and we have to hope that by then A.I. aided alignment has succeeded enough to prevent bad outcomes. I don’t believe this is a serious possibility; the first A.I. capable of solving the alignment problem completely will also be able to deceive us about solving the alignment problem (more) easily - up to and including this point, A.I. will produce partial, convincing solutions to the alignment problem which human engineers will go forward with. Control techniques will simply threshold (below) the capabilities of the first unaligned A.I. that escapes, which is plausibly a net negative since it means we won’t have early high impact warnings. If occasional A.I. escapes turn out to be non-lethal, economic incentives will favor better A.I. control, so working on this early won’t really matter. If occasional A.I. escapes turn out to be lethal, then we will die unless we solve the alignment problem ourselves.
2Noosphere89
One assumption that could be used to defuse the danger is if we can apply the controlled AIs to massively improve computer security such that computer security wins the race over attacking the computers. This is at least semi-plausible for the first AIs, who will almost certainly be wildly superhuman at both coding and mathematically proving theorems in, for example, Lean, because there's a very obvious path to how to get the training data to bootstrap yourself, and like Go, it is relatively easy to verify that your solution actually works. https://www.lesswrong.com/posts/2wxufQWK8rXcDGbyL/access-to-powerful-ai-might-make-computer-security-radically Another assumption that could work for AI control is that once the AIs are controlled enough, labs start using the human-level AIs to further enhance control/alignment strategies, and thus from the base case we can inductively show that the next level up of AIs are even better controlled until you reach a limit, and that the capabilities that are done are there to make the AI safer, not more dangerous.
3Cole Wyeth
Improving computer security seems possible but there are many other attack vectors. For instance, even if an A.I. can prove a system’s software is secure, it may choose to introduce social engineering style back doors if it is not aligned. It’s true that controlled A.I.s can be used to harden society but overall I don’t find that strategy comforting. I’m not convinced that this induction argument goes through. I think it fails on the first generation that is smarter than humans, for basically Yudkowskian reasons.
2Nathan Helm-Burger
Hard to be sure without more detail, but your comment gives me the impression that you haven't thought through the various different branches of how AI and geopolitics might go in the next 10 years. I, for one, am pretty sure AI control and powerful narrow AI tools will both be pretty key for humanity surviving the next 10 years. I don't expect us to have robustly solved ASI-alignment in that timeframe.
3Cole Wyeth
I also don’t expect us to have robustly solved ASI-alignment in that timeframe. I simply fail to see a history in which AI control work now is a decisive factor. If you insist on making a top level claim that I haven’t thought through the branches of how things go, I’d appreciate a more substantive description of the branch I am not considering.
1MichaelDickens
This is a good and important point. I don't have a strong opinion on whether you're right, but one counterpoint: AI companies are already well-incentivized to figure out how to control AI, because (as Wei Dai said) controllable AI is more economically useful. It makes more sense for nonprofits / independent researchers to do work that AI companies wouldn't do otherwise.
1AI_Symbiote
This post raises an important perspective on the practicalities of AI Control versus Alignment. Given the potential for AI to function productively even when not fully aligned, do you believe that current AI control methods are scalable enough to handle future AGI systems? Additionally, what would be the main challenges in ensuring that AI control strategies are robust in highly dynamic or emergency scenarios?
[-]Mark XuΩ407824

AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind

Some considerations:

  • Safety researchers should want Google Deepmind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create "the smartest" models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
  • Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or wherever they're renting their computers from) starts their own internal scaling competitor, and decides to stop renting out most of their compute.
  • ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".
  • GDM only recently started a bay area based safety research team/lab (with members like Alex Turner). So if people had previously decided to work for ANT based on location, they now have the opportunity to work for GDM without relocating.
  • I've heard that many safety researchers join A
... (read more)
[-]kaveΩ11192

ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".

I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it's more of an O-ring process, you might be generally less excited about working at a scaling lab.

[-]evhubΩ7114

Some possible counterpoints:

  • Centralization might actually be good if you believe there are compounding returns to having lots of really strong safety researchers in one spot working together, e.g. in terms of having other really good people to work with, learn from, and give you feedback.
  • My guess would be that Anthropic resources its safety teams substantially more than GDM in terms of e.g. compute per researcher (though I'm not positive of this).
  • I think the object-level research productivity concerns probably dominate, but if you're thinking about influence instead, it's still not clear to me that GDM is better. GDM is a much larger, more bureaucratic organization, which makes it a lot harder to influence. So influencing Anthropic might just be much more tractable.
9Raemon
I think two major cruxes for me here are: * is it actually tractable to affect DeepMind's culture and organizational decisionmaking? * how close to the threshold is Anthropic for having a good enough safety culture? My current best guess is that Anthropic is still under the threshold for good enough safety culture (despite seeming better than I expected in a number of ways), and meanwhile that DeepMind is just too intractably far gone. I think people should be hesitant to work at any scaling lab, but I think it might be possible to make Anthropic "the one actually good scaling lab", and I don't currently expect that to be tractable at DeepMind, and I think "having at least one" seems good for the world (although it's a bit hard for me to articulate why at the moment). I am interested in hearing details about DeepMind that anyone thinks should change my mind about this. This viewpoint is based on having spent at least 10s of hours trying to learn about and influence both orgs' culture, at various times. In both cases, I don't get the sense that people at the orgs really have a visceral sense that "decisionmaking processes can be fake". I think they will be fake by default and the org is better modeled as following general incentives, and DeepMind has too many moving people and moving parts at a low enough density that it doesn't seem possible to fix. For me to change my mind about this, I would need someone there to look me in the eye and explain that they do have a visceral sense of how organizational decisionmaking processes can be fake, and why they nonetheless think DeepMind is tractable to fix. I assume @Rohin Shah and @Neel Nanda can't really say anything publicly that's capable of changing my mind, for various confidentiality and political reasons, but, like, that's my crux. (convincing me in more general terms "Ray, you're too pessimistic about org culture" would hypothetically somehow work, but, you have a lot of work to do given how thorou
2Nathan Helm-Burger
I think your pessimism about org culture is pretty relevant for the question of big decisions that GDM may make, but I think there is absolutely still a case to be made for the value of alignment research conducted wherever. If the research ends up published, then the origin shouldn't be held too much against it. So yes, having a few more researchers at GDM doesn't solve the corporate race problem, but I don't think it worsens it either. As for pausing, I think it's a terrible idea. I'm pretty confident that any sort of large scale pause would be compute threshold focused, and would be worse than not pausing because it would shift research pressure towards algorithmic efficiency. More on that here: https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech?commentId=qwixG4xYeFdELb2GJ
4Raemon
It might be "fine" to do research at GDM (depending on how free you are to actually pursue good research directions, or how good a mentor you have). But, part of the schema in Mark's post is "where should one go for actively good second-order effects?".
9Akash
I largely agree with this take & also think that people often aren't aware of some of GDM's bright spots from a safety perspective. My guess is that most people overestimate the degree to which ANT>GDM from a safety perspective. For example, I think GDM has been thinking more about international coordination than ANT. Demis has said that he supports a "CERN for AI" model, and GDM's governance team (led by Allan Dafoe) has written a few pieces about international coordination proposals. ANT has said very little about international coordination. It's much harder to get a sense of where ANT's policy team is at. My guess is that they are less enthusiastic about international coordination relative to GDM and more enthusiastic about things like RSPs, safety cases, and letting scaling labs continue unless/until there is clearer empirical evidence of loss of control risks.  I also think GDM deserves some praise for engaging publicly with arguments about AGI ruin and threat models. (On the other hand, GDM is ultimately controlled by Google, which makes it unclear how important Demis's opinions or Allan's work will be. Also, my impression is that Google was neutral or against SB1047, whereas ANT eventually said that the benefits outweighed the costs.)
8Scott Emmons
Great post. I'm on GDM's new AI safety and alignment team in the Bay Area and hope readers will consider joining us! What evidence is there that working at a scaling lab risks creating a "corrupted" perception? When I try thinking of examples, the people that come to my mind seem to have quite successfully transitioned from working at a scaling lab to doing nonprofit / government work. For example: * Paul Christiano went from OpenAI to the nonprofit Alignment Research Center (ARC) to head of AI safety at the US AI Safety Institute. * Geoffrey Irving worked at Google Brain, OpenAI, and Google DeepMind. Geoffrey is now Chief Scientist at the UK AI Safety Institute. * Beth Barnes worked at DeepMind and OpenAI and is now founder and head of research at Model Evaluation and Threat Research (METR).
4Mark Xu
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is perceived as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).
3Arthur Conmy
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted” Does this mean something like: 1. People who join scaling labs can have their values drift, and future safety employers will suspect by-default that ex-scaling lab staff have had their values drift, or   2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon or something else entirely?
4Mark Xu
Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)
2Neel Nanda
The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams. Thanks for writing the post!

My current taxonomy of rationalists is:

  • LW rationalists (HI!)
  • Facebook rationalists
  • Twitter rationalists
  • Blog rationalists
  • Internet-invisible rationalists

Are there other types of rationalists? Maybe like group-chat rationalists? or podcast rationalists? google doc rationalists?

Alternative taxonomy:

  • rationalists belonging to Eliezer
  • cryopreserved rationalists
  • rationalists trained by CFAR
  • aspiring rationalists
  • rationalists working for MIRI
  • legendary rationalists
  • metarationalists
  • those commenting on this taxonomy
  • those that tweet as if they were mad
  • Bayesians
  • et cetera
  • Zvi
  • those that from afar look like paperclips

:)

6Mark Xu
This made me chuckle. More humor * Rationalists taxonomizing rationalists * Mesa-rationalists (the mesa-optimizers inside rationalists) * carrier pigeon rationalists * proto-rationalists * not-yet-born rationalists * literal rats * frequentists * group-house rationalists * EA forum rationalists * academic rationalists * meme rationalists :)
4Ben Pace
I like this one.
4Matt Goldenberg
I think there is a community of discord rationalists and tumblr rationalists.
2Ben Pace
I am a google doc rationalist! (Or I would like to be. Google docs are great.)
0Zack_M_Davis
What are you using the word "rationalist" to mean? If you just mean "members of any subculture bearing some line of memetic descent from early Less Wrong" (which I don't think deserves the pretentious term rationalist, but putting that aside), why is "communication platform" a useful way to chop up that social grouping? A lot of the same people use Less Wrong and Facebook and Twitter and Discord and Google Docs, and a lot of the people who use the same platform wouldn't be in the same cluster if you were to analyze the graph of what actual conversations people are using these platforms to have.
2Ben Pace
It's a natural way to cut it up from one's own experience. Each platform has different affordances and brings out different aspects of people, and I get pretty different experiences of them on the different platforms mentioned.

John Flanagan: “An ordinary archer practices until he gets it right. A ranger practices until he never gets it wrong.”

I want to reword this to make it about rationality in a way that isn't pretentious.

Cavilo, The Vor Game: "The key to strategy... is not to choose a path to victory, but to choose so that all paths lead to a victory." is close to what I want, but not quite.

3Gordon Seidoh Worley
I'm not sure what to recommend, but a few words that come to mind that might be relevant and help you spark an idea: * resilient * robust * reliable * antifragile * high availability * self healing * overdetermined
1leggi
Is the victory bit important in the quotation? If it's not about the victory/winning, and rather about the path/journey .... A first draft that springs to mind as I type: The key to rationality.... is not to choose the label, but to choose to take every opportunity to improve/update your thinking. (Can't ... stop ... myself ... from commenting: From what I've observed too much ego gets in the way of rational thinking sometimes.)
1[anonymous]
Why not something along the lines of "your rationality is measured as much by your worst case performance as your average case"?
3Viliam
Maybe "your worst case contributes to your average just as strongly as your best case"?

Epistemic status: rambles

Quarantine preparation has made me realize that a day's worth of food is actually really cheap, doesn't require that much time to cook, and can be made fairly tasty for not much more, i.e. a day's worth of easy-to-cook, relatively tasty food is about $5.

This requires some amount of amortized costs for the easy-to-cook and relatively tasty part, but not immensely large upfront costs (instantpot, spices, etc.).

This reference says that 40 million people dealt with hunger in the US. I am... super confused? I find it extremely ... (read more)

7Viliam
It seems to me that during the quarantine I eat less than usual; either I am deluding myself, or it is a combination of having less physical activity (such as walking to/from job/lunch/shops/playground), being able to eat whenever I want (so there is no pressure to "eat a lot now, because the next opportunity will be 7 hours later"), making less superstimulating food (using less sugar and salt), and having other ways to get some hedons (e.g. taking a nap). Sometimes I cook a soup, and that's most of my daily food. And soups are really cheap. You take $1-2 worth of ingredients, cook them in water, add a little salt and spices; optionally eat with bread. Bread is cheap, salt is cheap, most spices are cheap (per portion), potatoes, broccoli, carrot, onion, and beans are cheap. Most of these things are like $1 per kg. Okay, soups are not super healthy; cooking destroys vitamins. You should also get some fresh fruits and vegetables. Apples, tomatoes, cucumbers are $1-2 per kg. You should definitely be able to eat healthy food for less than $5 a day. What is expensive? Chocolate and other sweets, cheese, berries, nuts; I probably forgot something. You shouldn't eat sweets; and you can afford the other things now and then even under $5 a day on average. (It is not an optimal diet; some people recommend eating berries and nuts every day. But still healthier than many people eat, including those who spend more money on food.) On the other hand, we usually spend more than $5 per person per day, even during the quarantine. We spend a lot on sweets and cheese. The tastier ones are even more expensive than the basic ones, which are already more expensive than the actually useful food. Instant gratification -- it's addictive! The more stress I have during the day, the more I need something that will improve my mood instantly, even if it's only for a moment. Poor people probably have more stress, and thus less willpower to resist things full of sugar and salt. (Also alco
6RobertM
There are a few things to keep in mind: 1) The claim that 40 million Americans "deal with hunger" is, um, questionable. Their citation leads to feedingamerica.org, which cites USDA's Household Food Security in the United States report (https://www.ers.usda.gov/webdocs/publications/94849/err-270.pdf?v=963.1). The methodology used is an 11-question survey (18 for households with children), where answering 3 questions in the affirmative marks you as low food security. The questions asked are (naturally) subjective. Even better, the first question is this: “We worried whether our food would run out before we got money to buy more.” Was that often, sometimes, or never true for you in the last 12 months? That's a real concern to have, but it is not what people are talking about when they say "dealing with hunger". You can be running on a shoestring budget and often worry about whether you'll have enough money for food without ever actually not having enough money for food. 2) A significant percentage of the population has non-trivial issues with executive function. Also, most of the population isn't familiar with "best practices" (in terms of effective life strategies, basic finances, etc). Most people simply don't think about things like this systematically, which is how you get the phenomenon of ~50% of the population not being able to cover a $400 emergency (or whatever those numbers are, they're pretty close). This would be less of an issue if those cultural norms were inherited, but you can't teach something you don't know, and apparently we don't teach Home Economics anymore (not that it'd be sufficient, but it would be better than nothing). This is a subject that deserves a much more in-depth treatment, but I think as a high-level claim this is both close enough to true and sufficient as a cause for what we might observe here. Making an infographic with a rotating course of 10 cheap, easy-to-prepare, relatively healthy, and relatively tasty meals is a great id
1Mark Xu
2) is something that I sort of thought about but not with as much nuance. I agree that such an infographic would be only useful for people who were looking for an alternate meal preparation strategy or something. 3) if it's true that people want to do meal preppy type things but don't have enough to pay upfront costs, there might be gains from 0-interest microloans, maybe via some MLM-type I loan you money, then once you've saved some money and paid me back, you loan other people money too.
5ChristianKl
It seems like the bottom 20% of the US spends $2216 per year per income earner or ~$6 per day on food. Given that children themselves don't have an income, they might spend less than $5 per person for food per day. People can drown in a river that's on average 1m deep.

If you have DAI right now, minting on https://foundry.finance/ and swapping yTrump for nTrump on catnip.exchange is an almost guaranteed 15% profit.

Lesswrong posts that I want someone to write:

  1. Description of pair debugging
  2. Description of value handshakes

Maybe I'll think of more later.

4Viliam
I found a reference to "value handshakes" here: I think it refers to something like this: Imagine that a superintelligent human-friendly AI meets a superintelligent paperclip maximizer, and they both realize their powers are approximately balanced. What should they do? For humans, "let's fight, and to the victor go the spoils" is the intuitive answer, but the superintelligences can possibly do better. If they fight, they have a 50% chance of achieving nothing, and a 50% chance of winning the universe... minus whatever was sacrificed to Moloch, which could possibly be a lot. If they split the universe into halves, and find a way to trust each other, that is better than war. But there is a possibility of an even better solution, where both of them agree to act as if they were a single superintelligence that values both humans and paperclips equally. The cooperative solution can be better than a 50% split of the universe, because you could build paperclip factories in places humans care less about, such as uninhabitable planets; or perhaps you could find a way to introduce paperclips into the human environment without reducing the human quality of life. For example, would you mind using paperclips to reinforce the walls of your house? Would you mind if almost all materials used to build stuff for humans contained little paperclips inside? Would you mind living in a simulation implemented on paperclip-shaped circuits? So maybe at the end, humans could get like 70% of the potential utility of the universe, while 70% of potential material would be converted to paperclips.

A weird but not-inaccurate way to think of log(n) is as an answer to "how many digits does n have?"

This suggests that a weird but not-inaccurate way to think of a log-normal distribution is as a distribution where "the number of digits is normally distributed"
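
A rough sketch of both readings, assuming base 10 and positive integers (the function names and parameters below are my own, purely for illustration): the digit count of n is floor(log10(n)) + 1, and taking log10 of log-normal samples gives something that looks normal.

```python
import math
import random
import statistics

def digits_via_log(n):
    """Digit count of a positive integer, computed as floor(log10(n)) + 1.

    Floating-point rounding can misbehave for very large n just below a
    power of ten, so treat this as an illustration, not a robust utility.
    """
    return math.floor(math.log10(n)) + 1

def log10_of_lognormal_samples(mu10, sigma10, size, seed=0):
    """Draw log-normal samples and return their base-10 logs.

    mu10 and sigma10 are the (assumed) mean and standard deviation of
    log10(x); random.lognormvariate wants them in natural-log units.
    """
    rng = random.Random(seed)
    ln10 = math.log(10)
    return [
        math.log10(rng.lognormvariate(mu10 * ln10, sigma10 * ln10))
        for _ in range(size)
    ]

# The log-based count agrees with simply counting characters.
for n in (7, 42, 999, 1000, 123456):
    print(n, len(str(n)), digits_via_log(n))

# "Number of digits, roughly" of log-normal samples: mean and spread
# should land near the mu10 and sigma10 that were passed in.
logs = log10_of_lognormal_samples(3.0, 1.0, 20000)
print(statistics.mean(logs), statistics.stdev(logs))
```

The "not-inaccurate" caveat shows up here too: log10(n) is a continuous quantity, so it is the number of digits only up to the floor and the +1.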

4philip_b
There are a bunch of explanations of logarithm as length on Arbital.

Ignorance as evidence

I was answering a bunch of questions from OpenPhill's calibration test of the form "when did <thing> happen?". A lot of the time, I had no knowledge of <thing>, so I gave a fairly large confidence interval as a "maximum ignorance" type prediction (1900-2015, for example).

However, the fact that I have no knowledge of <thing> is actually moderate evidence that it happened "before my time".

Example: "when did <person> die?" If I was alive when <person> died, there's a higher chance of me hearing about their death. Thus not having heard of <person> is evidence that they died some time ago.
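
A toy Bayesian version of this, with every number made up for illustration (the year I "started paying attention" and the two chances of having heard of the event are assumptions, not data): start from a uniform prior over 1900-2015 and update on "I have never heard of this".

```python
def posterior_before_my_time(start=1900, end=2015, aware_from=2000,
                             p_heard_if_recent=0.5, p_heard_if_old=0.1):
    """Return (prior, posterior) probability that the event predates aware_from.

    Assumed model: the event year is uniform on [start, end]; I would have
    heard of it with probability p_heard_if_recent if it happened from
    aware_from onward, and p_heard_if_old otherwise.
    """
    years = range(start, end + 1)
    prior_per_year = 1 / len(years)

    def p_unheard(year):
        # Likelihood of the observation "never heard of it" given the year.
        p_heard = p_heard_if_recent if year >= aware_from else p_heard_if_old
        return 1 - p_heard

    evidence = sum(prior_per_year * p_unheard(y) for y in years)
    before = sum(prior_per_year * p_unheard(y) for y in years if y < aware_from)
    prior_before = sum(prior_per_year for y in years if y < aware_from)
    return prior_before, before / evidence

prior, posterior = posterior_before_my_time()
print(prior, posterior)  # roughly 0.86 before the update, 0.92 after
```

With these made-up numbers the shift is modest, which matches "moderate evidence": the interval should lean earlier, not collapse.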

3Viliam
Technically, it could also be evidence that you are dead, but your ghost cannot move to afterlife, probably because it is too attached to scoring internet points (a fate that awaits many of us, I am afraid). (epistemic status: just kidding)

If you're interviewing employees of a company about how good the company is, there's positive bias because people who hate the company will have already left.
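
A minimal simulation of the effect, under a deliberately crude assumption that I am inventing for illustration: each hire has a satisfaction score uniform on [0, 1], and stays with probability equal to that score.

```python
import random

def simulate_interviews(n_hires=10000, seed=0):
    """Compare average satisfaction of everyone hired vs. those still there.

    Toy assumption: satisfaction is uniform on [0, 1] and an employee is
    still at the company with probability equal to their satisfaction.
    """
    rng = random.Random(seed)
    satisfaction = [rng.random() for _ in range(n_hires)]
    stayers = [s for s in satisfaction if rng.random() < s]
    mean_all = sum(satisfaction) / len(satisfaction)
    mean_stayers = sum(stayers) / len(stayers)
    return mean_all, mean_stayers

mean_all, mean_stayers = simulate_interviews()
print(mean_all, mean_stayers)
```

Under this model the people you can actually interview average about 2/3 satisfaction, against 1/2 for everyone who was ever hired, even though nobody is misreporting anything.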

4Dagon
Sure. Also, current employees are disincentivized from being truthful about the bad parts. But you're not applying statistics to the results, so that's not terribly important. Such interviews provide limited evidence about goodness of the company. They provide decent evidence about the potential coworkers you're interviewing. Generally, when you're interviewing employees of a company about whether the company is any good, you're trying to decide whether to work there yourself. And you're evaluating whether any of them seem competent and interesting enough that you can tolerate being near them for any length of time.

Coinfection rates of COVID and normal flu are very low. If you have the set of flu/COVID symptoms, you're basically guaranteed to have one or the other. You can test for the flu pretty easily. Therefore, people can just test for the flu as a proxy for testing for COVID.

Is this just a really obvious chain of reasoning that everyone has missed? Which one of my assumptions is wrong?

https://twitter.com/katyw2004/status/1236848300143280128 says coinfection rates are low

https://www.wikiwand.com/en/Rapid_influenza_diagnostic_test means we can test for the flu fas

... (read more)
3Mark Xu
Update: the CDC advises testing for the flu, and there are a lot of medical things that cause "flu-like" symptoms. Turns out that "flu-like" symptoms is basically "immune system doing things", which is going to happen with most things your body doesn't like.
2ChristianKl
Not everybody who has a cold has influenza (the flu) or COVID-19. There are many different viruses that cause influenza-like illnesses. Low is relative. 2% coincidence is still a bunch.

Moral uncertainty is a thing that people think about. Do people also think about decision theoretic uncertainty? E.g. how to decide when you're uncertain about which decision theory is correct?

1Pongo
Decision theoretic uncertainty seems easier to deal with, because you’re trying to maximise expected value in each case, just changing what you condition on in calculating the expectation. So you can just take the overall expectation given your different credences in the decision theories
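
A sketch of what "take the overall expectation" could look like on Newcomb's problem, assuming (and this is a real assumption) that the expected values from the different theories are on a common scale. The payoffs, the 0.99 predictor accuracy, and the 0.5 prior that the box is full are all made up, and the function names are hypothetical.

```python
def newcomb_tables(predictor_accuracy=0.99, prior_full=0.5,
                   big=1_000_000, small=1_000):
    """Expected value of each action under EDT and CDT (toy numbers).

    EDT conditions on the action, so the box contents correlate with it.
    CDT treats the contents as fixed, with probability prior_full of full.
    """
    edt = {
        "one-box": predictor_accuracy * big,
        "two-box": (1 - predictor_accuracy) * big + small,
    }
    cdt = {
        "one-box": prior_full * big,
        "two-box": prior_full * big + small,
    }
    return {"EDT": edt, "CDT": cdt}

def mixed_expected_values(credences, ev_by_theory):
    """Credence-weighted expected value of each action across theories."""
    actions = next(iter(ev_by_theory.values())).keys()
    return {
        action: sum(credences[t] * ev_by_theory[t][action] for t in credences)
        for action in actions
    }

def best_action(credences, ev_by_theory):
    """Action with the highest credence-weighted expected value."""
    mixed = mixed_expected_values(credences, ev_by_theory)
    return max(mixed, key=mixed.get)

tables = newcomb_tables()
print(best_action({"EDT": 0.5, "CDT": 0.5}, tables))
print(best_action({"EDT": 0.0005, "CDT": 0.9995}, tables))
```

With these numbers CDT only prefers two-boxing by $1,000 while EDT prefers one-boxing by $979,000, so the mixture one-boxes unless the credence in EDT is tiny (around 0.1% or less).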