Buck — LessWrong

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.

If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.

CEO at Redwood Research.

Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.

I do judge comments more harshly when they're phrased confidently—your tone is effectively raising the stakes on your content being correct and worth engaging with.

If I agreed with your position, I'd probably have written something like:

I don't think this is an important source of risk. I think that basically all the AI x-risk comes from AIs that are smart enough that they'd notice their own overconfidence (maybe after some small number of experiences being overconfident) and then work out how to correct for it.
There are other epistemic problems that I think might affect the smart AIs that pose x-risk, but I don't think this is one of them.
In general, this seems to me like a minor capability problem that is very unlikely to affect dangerous AIs. I'm very skeptical that trying to address such problems is helpful for mitigating x-risk.

What changed? I think it's only slightly more hedged. I personally like using "I think" everywhere for the reason I say here and the reason Ben says in response. To me, my version also more clearly describes the structures of my beliefs and how people might want to argue with me if they want to change my mind (e.g. by saying "basically all the AI x-risk comes from" instead of "The kind of intelligent agent that is scary", I think I'm stating the claim in a way that you'd agree with, but that makes it slightly more obvious what I mean and how to dispute my claim—it's a lot easier to argue about where x-risk comes from than whether something is "scary").

I also think that the word "stupid" parses as harsh, even though you're using it to describe something on the object level and it's not directed at any humans. That feels like the kind of word you'd use if you were angry when writing your comment, and didn't care about your interlocutors thinking you might be angry.

I think my comment reads as friendlier and less like I want the person I'm responding to to feel bad about themselves, or like I want onlookers to expect social punishment if they express opinions like that in the future. Commenting with my phrasing would cause me to feel less bad if it later turned out I was wrong, which communicates to the other person that I'm more open to discussing the topic.

(Tbc, sometimes I do want the person I'm responding to to feel bad about themselves, and I do want onlookers to expect social punishment if they behave like the person I was responding to; e.g. this is true in maybe half my interactions with Eliezer. Maybe that's what you wanted here. But I think that would be a mistake in this case.)

I agree. For what it's worth, in most of these cases, it seems to me that for catastrophe to occur, some AI eventually has to take over on purpose. But I agree that a lot of the action (and many of the points for intervention) might have occurred earlier with AIs that didn't intentionally take over and instead were not as helpful as they could be, either because of misalignment or other problems.

It depends on what you mean by scary. I agree that AIs capable enough to take over are pretty likely to be able to handle their own overconfidence. But the situation when those AIs are created might be substantially affected by the earlier AIs that weren't capable of taking over.

As you sort of note, one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don't think that that consideration is overwhelmingly strong. So I think it's totally reasonable to research weaknesses that might cause earlier AIs to not be as helpful as they could be for mitigating later risks. For example, I'm overall positive on research on making AIs better at conceptual research.

Overall, I think your comment is quite unreasonable and overly rude.

That's not literally the opposite, that's a different thing, obviously.

(As a random reference, I thought Joe's paper about low AI takeover risk was silly at the time, and I think that most people working on grants motivated by AI risk at OP at the time had higher estimates of AI takeover risk. I also thought a lot of takes from the Oxford EAs were pretty silly and I found them frustrating at the time and think they look worse with hindsight. Obviously, many of my beliefs at many of these time periods also look silly in hindsight.)

One issue among others is that the kind of work you end up funding when the funding bureaucrats go to the funding-seekers and say, "Well, we mostly think this is many years out and won't kill everyone, but, you know, just in case, we thought we'd fund you to write papers about it" tends to be papers that make net negative contributions.

I think this is a pretty poor model of the attitudes of the relevant staff at the time. I also think your disparaging language here leads to your comments being worse descriptions of what was going on.

Okay, so it sounds like you're saying that the claims I asserted aren't cruxy for your claim you wanted contradicted?

I definitely don't think that Open Phil thought of "have more people take MIRI seriously" as a core objective, and I imagine that opinions on whether "people take MIRI more seriously" is good would depend a lot on how you operationalize it.

I think that Open Phil proactively tried to take a bunch of actions based on the hypothesis that powerful AI would be developed within 20 years. I think the situation with the sinking ship is pretty disanalogous—I think you'd need to say that your guy in the expensive suit was also one of the main people who was proactively taking actions based on the hypothesis that the ship would sink faster.

Section 2.2 in "Some Background..." looks IMO pretty prescient

The technical advisors I have spoken with the most on this topic are close friends I’ve met through GiveWell and effective altruism: Dario Amodei, Chris Olah and Jacob Steinhardt. They are all relatively junior (as opposed to late-career) researchers; they do not constitute a representative sample of researchers; there are therefore risks in leaning too heavily on their thinking.[...]
There may turn out to be a few broadly applicable AI approaches that lead to rapid progress on an extremely wide variety of intellectual tasks. This intuition seems correlated with (though again, not the same as) an intuition that the human brain makes repeated use of a relatively small set of underlying algorithms, and that by applying the processes, with small modifications, in a variety of contexts, it generates a wide variety of different predictive models, which can end up looking like very different intellectual functions.
[..]Certain areas of AI and machine learning, particularly related to deep neural networks and other deep learning methods, have recently experienced rapid and impressive progress.
[...]Deep learning is a general approach to fitting predictive models to data that can lead to automated generation of extremely complex non-linear models. It seems to be, conceptually, a relatively simple and cross-domain approach to generating such models (though it requires complex computations and generates complex models, and hardware improvements of past decades have been a key factor in being able to employ it effectively). My impression is that the field is still very far away from exploring all the ways in which deep learning might be applied to challenges in AI.
[...]In my view, there is a live possibility that with further exploration of the implications and applications of deep learning – and perhaps a small number (1-3) of future breakthroughs comparable in scope and generality to deep learning – researchers will end up being able to achieve better-than-human performance in a large number of intellectual domains, sufficient to produce transformative AI.
[...]
But broadly speaking, based on these conversations, it seems to me that:
It is easy to imagine (though far from certain) that headway on a relatively small number of core problems could lead to AI systems equalling or surpassing human performance in a large number of domains.
The total number of core open problems is not clearly particularly large (though it is highly possible that there are many core problems that the participants simply haven’t thought of).
Many of the identified core open problems may turn out to have overlapping solutions. Many may turn out to be solved by continued extension and improvement of deep learning methods.
None appear that they will clearly require large numbers of major breakthroughs, large (decade-scale) amounts of trial and error, or further progress on directly studying the human brain. There are examples of outstanding technical problems, such as unsupervised learning, that could turn out to be very difficult, leading to a dramatic slowdown in progress in the near future, but it isn’t clear that we should confidently expect such a slowdown.

So your question is whether (with added newline and capitalization for clarity):

any dissenting views from "AI in median >30 years" and "utter AI ruin <10%" (as expressed in the correct directions of shorter timelines and worse ruin chances; and as said before the ChatGPT moment), were permitted to exercise decision-making power over the flow of substantial amounts of funding;
OR if the weight of reputation and publicity of OpenPhil was at any point put behind promoting those dissenting viewpoints

Re the first part:

Open Phil decisions were strongly affected by whether they were good according to worldviews where "utter AI ruin" is >10% or timelines are <30 years. Many staff believed at the time that worlds with shorter timelines and higher misalignment risk were more tractable to intervene on, and so put additional focus on interventions targeting those worlds; many also believed that risk was >10% and that median timeline was <30 years. I'm not really sure how to operationalize this, but my sense is that the majority of their funding related to AI safety was targeted at scenarios with higher misalignment risk and shorter timelines than 10%/30 years.

As an example, see Some Background on Our Views Regarding Advanced Artificial Intelligence (2016), where Holden says that his belief that P(AGI before 2036) is above 10% "is important to my stance on the importance of potential risks from advanced artificial intelligence. If I did not hold it, this cause would probably still be a focus area of the Open Philanthropy Project, but holding this view is important to prioritize the cause as highly as we’re planning to." So he's clearly saying that the grantmaking strategy is strongly affected by wanting to target the sub-20-year timelines.

I'm not sure how to translate this into the language you use. Among other issues, it's a little weird to talk about the relative influence of different credences over hypotheses, rather than the relative influence of different hypotheses. The "AI risk is >10% and <30 years" hypotheses had a lot of influence, but that could be true even if all the relevant staff had believed that AI risk is <10% and >30 years (if they'd also believed that those worlds were particularly leveraged to intervene on, as they do).

Lots of decisions were made that would not have been made given the decision procedure of "do whatever's best assuming AI is in >30 years and risk is <10%"—I think that that decision procedure would have massively changed the AI safety stuff Open Phil did.

I think that this suffices to contradict your description of the situation—they explicitly made many of their decisions based on the possibility of shorter timelines than you described. I haven't presented evidence here that something similar is true for their assessment of misalignment risk, but I also believe that to be the case.

If I persuaded you of the claims I wrote here (only some of which I backed up with evidence), would that be relevant to your overall stance?

All of this is made more complicated by the fact that Open Phil obviously is and was a large organization with many staff and other stakeholders, who believed different things and had different approaches to translating beliefs into decisions, and who have changed over time. So we can't really talk about what "Open Phil believed" coherently.

Re the second part: I think the weight of reputation and publicity was put behind encouraging people to plan for the possibility of AI sooner than 30 years, as I noted above; this doesn't contradict the statement you've made but IMO it is relevant to your broader point.

Thanks for the questions, these are all very reasonable. You might also find this video informative. I agree with Mis-Understandings's answers. Re 3, you might enjoy this post.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments