I like that you did this, and find it interesting to read as an AInotkilleveryoneism-motivated researcher. Unfortunately, I think you both miss talking about what I consider to be the most relevant points.
Particularly, I feel like you don't consider the counterfactual of "you and your like-minded friends choose not to assist the most safety oriented group, where you might have contributed meaningfully to that group making substantial progress on both capabilities and safety. This results in a less safety-minded group becoming the dominant AI leader, and thus inceases likelihood of catastrophe." Based on my reading histories of the making of the atomic bomb, a big part of the motivation of a lot of people involved was knowing that the Nazis were also working on an atomic bomb. It turns out that the Nazis were nowhere close, and were defeated without being in range of making one. That wasn't known until later though.
Had the Nazis managed to create an Atomic Bomb first, and were the only group to have them, I expect that would have gone very poorly indeed for the world. Given this, and with the knowledge that the scientists choosing whether to work on the Manhattan project had this limited info, I believe I would have made the same decision in their shoes. Even anticipating that the US government would seize control of the product and make its own decisions about when and where to use it.
It seems you are both of the pessimistic mindset that even a well-intentioned group discovering a dangerously potent AGI recipe in the lab will be highly likely to mishandle it and cause a catastrophe, whilst also being highly unlikely to handle it well and help move humanity out of its critical vulnerable period.
I think that that is an incorrect position, and here are some of my arguments why:
I believe that it is possible to detect sandbagging (and possibly other forms of deception) via this technique that I am collaborating on. https://www.apartresearch.com/project/sandbag-detection-through-model-degradation
I believe that it is feasible to pursue the path towards corrigibility as described by CAST theory (which I gave some support to the development of.) https://www.lesswrong.com/s/KfCjeconYRdFbMxsy
I believe that it is possible to safely contain and study even substantially superhuman AI if this is done thoughtfully and carefully with good security practices. Specifically: a physically sandboxed training cluster, and training data which has been carefully censored to not contain information about humans, our tech, or the physics of our universe. That, plus the ability to do mechanistic interpretability, and to have control to shut off or slow down the AI at will, makes it very likely we could maintain control.
I believe that humanity's safety requires strong AI to counter the increasing availability of offense-dominant technology, particularly self-replicating weapons like engineered viruses or nanotech. ( I have been working with https://securebio.org/ai/ so I have witnessed strong evidence that AI is posing increasing risk of lowering barriers to bioweapons risk.)
I don't believe that substantial delay in developing AGI is possible, and believe that trying would increase rather than decrease humanity's danger. I believe we are already in a substantial hardware and data overhang, and that within the next 24 months the threshold of capability of LLM agents will suffice to begin recursive self-improvement. This means it is likely that a leader in AI at that time will come into possession of strongly super-human AI (if they choose to engage their LLM agents in RSI), and that this will happen too quickly for other groups to catch up.
I believe we are already in a substantial hardware and data overhang, and that within the next 24 months the threshold of capability of LLM agents will suffice to begin recursive self-improvement. This means it is likely that a leader in AI at that time will come into possession of strongly super-human AI (if they choose to engage their LLM agents in RSI)
Just FYI, I think I'd be willing to bet against this at 1:1 at around $500, whereby I don't expect that a cutting edge model will start to train other models of a greater capability level than itself, or make direct edits to its own weights that have large effects (e.g. 20%+ improvements on a broad swath of tasks) on its performance, and the best model in the world will not be one whose training was primarily led by another model.
If you wish to take me up on this, I'd propose any of John Wentworth or Rob Bensinger or Lawrence Chan (i.e. whichever of them ends up available) as adjudicators if we disagree on whether this has happened.
Cool! Ok, yeah. So, I'm happy with any of the arbiters you proposed. I mean, I'm willing to make the bet because I don't think it will come down to a close call, but will instead be clear.
I do think that there's some substantial chance that the process of RSI will begin in the next 24 months, but not become publicly known right away. So my ask related to this would be:
At the end of 24 months, we resolve the bet based on what is publicly known. If, within the 12 month period following the end of the 24 month period, it becomes publicly known that a process started during the 24 month period has now come to fruition and is clearly the leading model, we reverse the decision of the bet from 'leading not from RSI' to 'leading model from RSI'.
Since my hypothesis is stating that the RSI result would be something extraordinary, beyond what would otherwise be projected from the development trend we've seen from human researchers, I think that a situation which ends up as a close call, akin to what feels like a 'tie', should resolve with my hypothesis losing.
Alright, you have yourself a bet! Let's return on August 23rd 2026 to see who's made $500. I've sent you a calendar invite to help us remmeber the date.
I'll ping the arbiters just to check they're down, may suggest an alt if one of them opts out.
(For the future: I think in future I will suggest arbiters to a betting-counterparty via private DM, so it's easier for the arbiters to opt out for whatever reason, or for one of us to reject them for whatever reason.)
[Note to reader: I wrote this comment in response to an earlier and much shorter version of the parent comment.]
Thanks! I'm glad you found it interesting to read.
I considered that counterfactual, but didn't think it was a good argument. I think there's a world of difference between a team that has a mechanistic story for how they can prevent the doomsday device from killing everyone, and a team that is merely "looking for such a way" and "occasionally finding little nuggets of insight". The latter such team is still contributing its efforts and goodwill and endorsement to a project set to end the world in exchange for riches and glory.
I think better consequences will obtain if humans follow the rule of never using safety as the reason for contributing to such an extinction/takeover-causing project unless the bar of "has a mechanistic plan for preventing the project from causing an extinction event" is met.
And, again, I think it's important in that case to have personal eject-buttons, such as some friends who check in with you to ask things like "Do you still believe in this plan?" and if they think there's not a good reason, you've agreed that any two of them can fire you.
(As a small note of nuance, barring Fermi and one or two others, almost nobody involved had thought of a plausible mechanism by which the atomic bomb would be an extinction risk, so I think there's less reason to keep to that rule in that case.)
I disagree that the point about the scientists not realizing nuclear weapons would likely become an existential risk changes what I see as the correct choice to make. I think that, knowing that I was a scientist in the USA and I had either the choice to help the US government build nuclear weapons and thus set the world up for a tense, potentially existential, détente between the US's enemies (Nazis and/or communists and/or others), or... not help. Still seems clearly correct to me to help, since I think a dangerous détente is a better option than only the Nazis or only Stalin having nuclear weapons.
In the current context, I do think there is important strategic game theory overlap with those times, since it seems likely that AI (whether AGI or not) will potentially disrupt the long-standing nuclear détente in the next few years. I expect that whichever government controls the strongest AI in five years from now, if not sooner, will also be nearly immune to long-range missile attacks, conventional military threats, and bioweapons, but able to deploy those things (or a wide range of other coercive technologies) at will against other nations.
Point of clarification: I didn't mean that there should be a rule against helping one's country race to develop nukes, the argument I'm making is that humans should have a rule against helping one's country race to develop nukes that one expects by-default to (say) ignite the atmosphere and kill everyone and for which there is no known countermeasure.
I believe that it is possible to detect sandbagging (and possibly other forms of deception) via this technique that I am collaborating on. https://www.apartresearch.com/project/sandbag-detection-through-model-degradation
That's a clever trick!
I'm curious if you've thought of this, but I think this doesn't generalize to models whose outputs are too hard to evaluate?
Consider a different model we might train in the future that is trained on all the internal data in a company — all the emails, slack, stock price, trades, spreadsheets, etc, and is trained to predict the stock movement, and then is used to generate the rest of the emails from the CEO and so forth that it predicts will increase the stock price further. It is then given the actual role of CEO and its instructions are followed.
Suppose that you perform your trick of randomly fiddling with the weights, and find that the stock performs substantially better when you do so. You inquire with the model as to why, and it tells you that it is taking a strategy with short-term payoffs but too much long-term risk, and your tampering is causing it to make worse decisions. If you ask it to explain why its current strategy is better, it says it's for complicated reasons that it cannot well-explain to a mere IQ 150 human. Is it so clear now whether the model is intentionally pretending to be less competent than it is, or not?
[Meta: I will probably come back and address some of your other points tomorrow, but in 2 minutes my laptop is going to transform itself into a rather expensive paperweight.]
Yes, the sandbagging detection is quite limited. I anticipate it's mostly useful as an "oh no, something weird is going on" alarm. I don't think it will be particularly helpful in figuring out what exactly. I think at that point you need to delay deployment and use other mechanistic interpretability techniques to explore the anomaly in depth.
Since this is intended as an "anomaly-detecting in-the-lab test, for use on well understood benchmarks", I think it will probably work even on quite intelligent models. My hope is that the noise injection impairment will smoothly compromise the model's capabilites down to arbitrarily lower performance. So, for a smarter model, you just increase the impairment.
In theory , this would work up until the model was so capable that it was able to do something like gradient hacking to engineer its expected impairment curve to avoid betraying itself. At that point of capabilities I can only hope the researchers have been wise enough to move to a censored training simulation regime, so that the model can't anticipate the need to be robust to noise injection.
These are not (necessarily) our actual positions, these are positions we were randomly assigned, and for which we searched for the strongest arguments we could find, over the course of ~1 hr 45 mins.
Why do this? Do you have some hope that this will do something different or better than people expressing their actual beliefs and the reasons for them?
I was recently sold on this being a valuable practice, not for the audience, but for the practitioners. Someone (with debate club experience) argued that a debate club where you have to argue for positions you don't hold forces you develop the muscles of stepping outside your current frame/headspace and actually avoid soldier mindset for yourself.
This too. In the format I've been using, both participants spend 15 mins going on a walk thinking about the subject before they know which side they're arguing. These walks have been one of the more intense periods in my life of desperately searching for a contrary perspective from the default one that I bring to a subject, because I might be about to have to quickly provide strong arguments for it to someone I respect, and I've found it quite productive for seeking ways that I might be wrong.
Yes, for a bunch of reasons!
It sounds like both sides of the debate are implicitly agreeing that capabilities research is bad. I don't think that claim necessarily follows from the debate premise that "we both believe they are risking either extinction or takeover by an ~eternal alien dictatorship". I suspect disagreement about capabilities research being bad is the actual crux of the issue for some or many people who believe in AI risk yet still work for capabilities companies, rather than the arguments made in this debate.
The reason it doesn't necessarily follow is that there's a spectrum of possibilities about how entangled safety and capabilities research are. On one end of the spectrum, they are completely orthogonal: we can fully solve AI Safety without ever making any progress on AI capabilities. On the other end of the spectrum, they are completely aligned: figuring out how to make an AI that successfully follows, say, humanity's coherent extrapolated volition, is the same problem as figuring out how to make an AI that successfully performs any other arbitrarily-complicated real world task.
It seems likely to me that the truth is in the middle of the spectrum, but it's not entirely obvious where on the spectrum it lies. I think it's highly unlikely pure theoretical work is sufficient to guarantee safe AI, because without advancements in capabilities, we are very unlikely to have an accurate enough model about what AGI actually is to be able to do any useful safety work on it. On the flip side, I doubt AI safety is as simple as making a sufficiently smart AI and just telling it "please be safe, you know what I mean by that right?".
The degree of entanglement dictates the degree to which capabilities research is intrinsically bad. If we live in a world close to the orthogonal end of the spectrum, it's relatively easy to distinguish "good research" from "bad research", and if you do bad research, you are a bad person, shame on you, let's create social / legal incentives to stop it.
On the flip side, if we live in a world closer to the convergent end of the spectrum, where making strides on AI safety problems is closely tied to advances in capabilities, then some capability research will be necessary, and demonizing capability research will lead to the perverse outcome that it's mostly done by people who don't care about safety, whereas we'd prefer a world where the leading AI labs are staffed with brilliant, highly-safety-conscious safety/capabilities researchers who push the frontier in both directions.
It's outside the scope of this comment for me to weigh in on which end of the spectrum we live on, but I think that's real crux of this debate, and why smart people in good conscience can fling themselves into capabilities research: their world model is different than AI safety researchers coming from a very orthogonal perspective (my understanding of MIRI's original philosophy is that they believe that there's a tremendous amount of safety research that can be done in the absence of capabilities advances).
Empirically people’s research output drops after going to labs
I'm curious about how true this is and what lead LawrenceC to believe it?
Epistemic status: Soldier mindset. These are not (necessarily) our actual positions, these are positions we were randomly assigned, and for which we searched for the strongest arguments we could find, over the course of ~1 hr 45 mins.
Sides: Ben was assigned to argue that it's ethical to work for an AI capabilities company, and Lawrence was assigned to argue that it isn't.
Reading Order: Ben and Lawrence drafted each round of statements simultaneously. This means that each of Lawrence statements you read were written without Lawrence having read Ben's statements that are immediately proceeding.
Ben's Opening Statement
Lawrence's Opening Statement
Verbal Interrogation
We questioned each other for ~20 minutes, which informed the following discussion.
Ben's First Rebuttal
Lawrence's First Rebuttal
Ben's Second Rebuttal
Lawrence cut his final rebuttal as he felt it mostly repeated prior discussion from above.