This post is written partly in reaction to John Wentworth's post "AGI Timelines Are Mostly Not Strategically Relevant To Alignment".

Oddly, I mostly agree with his main premises while disagreeing with the conclusion he draws.

In his post, John makes these two claims:

  1. If AGI takeoff is more than ~18 months out, then we should be thinking “long-term” in terms of research.
  2. If AGI is more than ~5 years out, then we should probably be thinking “long-term” in terms of training; we should mainly make big investments in recruitment and mentorship.

And in the comments he says:

There are some interesting latent factors which influence both timelines and takeoff speeds, like e.g. "what properties will the first architecture to take off have?". But then the right move is to ask directly about those latent factors, or directly about takeoff speeds. Timelines are correlated with takeoff speeds, but not really causally upstream.

I mostly agree with these points. Here I want to try to explain my current beliefs about timelines and takeoff speeds in terms of the factors causally upstream of them.

In my view, there are multiple regimes of 'takeoff speed' to consider.

Slow

There is 'slow takeoff', where the returns from presently available AI to AI capabilities research amount to less than a 1.0x multiplier on the speed of reaching the next level of AI capabilities assistance. This is the regime we are currently in the midst of, and I think it puts us in line for roughly 15 years until AGI. As we approach 15 years out, it gets increasingly likely that there will have been time for some next-generation mainstream ML algorithm to improve on transformers. I think the timing and magnitude of that improvement should be expected from simple extrapolation of past algorithmic improvements. Algorithmic improvements under this regime shouldn't update us towards thinking we are in the explosive regime, but they do have a chance of pushing us over into the fast regime.

Fast

There is 'fast takeoff', where AI assistance grants a greater than 1.0x multiplier on the speed to the next improvement, but each improved generation still needs a substantial investment of training compute. There may also be an increased chance of an emergent improvement, such as an algorithmic advance arriving surprisingly soon or being surprisingly large. This is faster than the slow regime, but not uncontrollably so: each generation still gives us time to think carefully, decide whether to invest in the next model version, evaluate it before deployment, and so on. If we advance to this regime soon, the acceleration may put us on a roughly 5-year timeline.
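To make the regime boundary concrete, here is a minimal toy calculation of how a constant per-generation speed multiplier compounds into a total time to some target capability level. The generation length, multiplier values, and number of generations are illustrative assumptions, not estimates from this post.

```python
# Toy model: cumulative years until a target capability level, when each new
# AI generation multiplies the speed of reaching the next generation by a
# constant factor. All parameter values below are illustrative assumptions.

def years_until_target(base_gen_years: float, multiplier: float, generations: int) -> float:
    """Sum the durations of successive generations, where each generation's
    R&D speed is `multiplier` times that of the previous one."""
    total = 0.0
    gen_time = base_gen_years
    for _ in range(generations):
        total += gen_time
        gen_time /= multiplier  # a >1.0x multiplier shrinks the next gap
    return total

# 'Slow' regime: sub-1.0x multiplier, so each generation takes a bit longer
# than the last and progress stays roughly human-paced.
print(years_until_target(base_gen_years=3.0, multiplier=0.9, generations=5))  # ~18.7 years

# 'Fast' regime: >1.0x multiplier, gaps shrink but each generation still
# takes months to years because large training runs are still required.
print(years_until_target(base_gen_years=3.0, multiplier=1.8, generations=5))  # ~6.4 years
```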

Explosive

There is 'explosive takeoff' (FOOM), where the model can modify itself directly to improve, with no significant investment of compute beyond what is needed for inference. Once this regime takes off, we are on a timeline of perhaps less than a month before things get entirely out of control. Using exclusively models from the current mainstream of machine learning, I think we are not at much risk of this. However, the mainstream is not the only genre in the game. There are two main alternative research agendas I think we must keep in mind when estimating this risk:

  1. Sociopathic-Brain-like AGI  
               I'm not talking here about the friendly, compassionate brain-like AGI which Steven Byrnes promoted in his series. I'm talking about the reckless pursuit of power, using the human brain as a template for a known working AGI. These groups are not making any attempt to work on the compassion/empathy/moral-instinct circuits of the brain, only on the capabilities-relevant parts. In human society, we call someone whose brain lacks compassion/empathy/moral instincts a sociopath. So if a research group does successfully replicate the capabilities of a human brain minus the neglected compassion aspects, we can reasonably expect the resulting agent to pattern-match to our conception of a sociopath. Some sociopathic-brain-like AGI research groups (e.g. Numenta, http://www.e-cortex.com/ ) seem to believe that they won't create a dangerous agent because they will carefully avoid adding a reward / reinforcement learning subsystem. I argue that adding such a subsystem might be relatively easy, such that a moderately skilled malicious or unwise human could choose to add one, give their copy of the model self-edit access, and encourage it to self-improve. I also argue that once there is a working example, it is going to be really hard to keep its existence and general design (emulating the human brain) a secret. This means that well-funded groups worried about competition (e.g. state actors, large corporations) will be strongly inclined to attempt their own versions, and that even without a leak of implementation details or weights, there will be only a short time until many groups have developed their own. This multiplies the risk that someone will leak the code/weights or make the mistake of allowing direct self-modification plus adding agency.
  2. Non-mainstream-ML academic experiments
                   There are lots and lots of independent researchers and small groups out there trying a wide variety of tactics to get ahead in ML capabilities; there is a huge amount of potential profit and prestige, and it's an intellectually fascinating and stylish topic right now. Most such groups contribute nothing on top of the SotA models put out by the big labs, and many don't even try to compete head-on; they try different approaches, which could turn up hidden gems. How likely are they to succeed? Not very likely per researcher, I think. A lot of these researchers are not top-notch, and many are searching through barren solution spaces. You don't have to be top-notch to get lucky, though. If a really significant algorithmic improvement were found, it would probably be quite obvious in terms of improvement on benchmarks normalized by training compute (a toy version of this comparison is sketched below). Thus, even though the research group would likely not have outright created a powerful model, their discovery of a greatly improved algorithm would enable many power-hungry groups to try replicating it at large scale.
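Here is a minimal sketch of the kind of compute-normalized comparison meant above: fit a rough scaling trend through two reference (compute, score) points, then ask how much "effective compute" a new result represents relative to the compute it actually used. The log-linear trend and all the numbers are illustrative assumptions.

```python
import math

# Minimal sketch of a compute-normalized benchmark comparison. A new method
# that matches a score the reference scaling trend says should need far more
# training compute stands out, even if its absolute score is unremarkable.

def effective_compute_gain(ref_points, new_score, new_compute):
    """Fit score = a*log10(compute) + b through two reference points, then
    return how much more compute the reference trend would need to match
    `new_score` than the new method actually used."""
    (c1, s1), (c2, s2) = ref_points
    a = (s2 - s1) / (math.log10(c2) - math.log10(c1))
    b = s1 - a * math.log10(c1)
    ref_compute_needed = 10 ** ((new_score - b) / a)
    return ref_compute_needed / new_compute

# Illustrative: a small group matches a score the reference trend predicts
# would need ~100x more training compute -- a red flag worth replicating.
gain = effective_compute_gain(
    ref_points=[(1e20, 60.0), (1e22, 70.0)],  # (training FLOPs, benchmark score)
    new_score=70.0,
    new_compute=1e20,
)
print(f"effective-compute multiplier: {gain:.0f}x")  # 100x under these assumptions
```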

 

Current trends in SotA ML research seem to me to put us in line for timelines that look like 10 years +/- 5 years under either the slow or the fast regime. Oddball risks mean there is a small but non-trivial chance of unexpectedly transitioning to the explosive regime: I think there's roughly a 5% chance that this happens in the next 5 years, and maybe more like 10% that it happens in the following 5 years. You don't have to agree with my relatively high estimates to agree that you should at least add a term for this chance into your timelines model.
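As one concrete way of 'adding a term for this chance', here is a minimal Monte Carlo sketch. The baseline distribution and the explosive-regime probabilities are stand-ins based on the rough numbers above, and the functional forms are arbitrary illustrative choices rather than a real model.

```python
import random

# Minimal sketch of adding an 'explosive regime' term to a timelines model.
# Probabilities and distributions below are illustrative stand-ins.

def sample_agi_year(p_explosive_0_5=0.05, p_explosive_5_10=0.10):
    """Sample years-until-AGI: a slow/fast baseline of roughly 10 +/- 5 years,
    overridden with small probability by an early explosive transition."""
    u = random.random()
    if u < p_explosive_0_5:
        return random.uniform(0, 5)    # explosive regime entered within 5 years
    if u < p_explosive_0_5 + p_explosive_5_10:
        return random.uniform(5, 10)   # explosive regime entered in years 5-10
    # Baseline slow/fast regimes: roughly 10 +/- 5 years, truncated to [5, 15].
    return min(max(random.gauss(10, 2.5), 5.0), 15.0)

samples = sorted(sample_agi_year() for _ in range(100_000))
print("median years:", round(samples[50_000], 1))
print("2.5th / 97.5th percentiles:", round(samples[2_500], 1), round(samples[97_500], 1))
```

Even with only a 5-15% chance of the explosive term firing, it noticeably widens the lower tail of the resulting distribution.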

Once the explosive regime has been entered, I estimate we have at best a month or two before the self-modifying AGI becomes uncontrollably powerful. Thus my full 95% CI on timelines is 0-15 years, since the explosive regime is at risk of beginning at any time.

I'm not trying to exaggerate risks in order to fear-monger. I just want the alignment community to take some reasonable precautions based on these estimates.

Since we might have as little as 5 years, investments in mentorship and in approaches that require many serial researcher-hours have lower expected value.

Specifically, I am advocating that we establish a secure AI testing box for safety evaluations. I think we might run into the explosive regime and a sharp left turn with a relatively unstudied architecture (e.g. a sociopathic-brain-like model or a novel non-mainstream algorithm).

I think we will need a safety evaluation box eventually, so the effort put into building it is not wasted even if we build it sooner than strictly necessary. Convincing capabilities researchers to consistently use the box to safety-check their models before deployment may take some time, so just creating the box and a minimal set of safety evaluations is not enough.

By making a secure box cheaply available, and free to academic researchers, we can hopefully get a lot of compliance from small independent groups. By making the safety evaluations open-source (but carefully screened to avoid incorporation into training data), we can hopefully get even secretive labs to build their own boxes and run the evaluations, which they can copy without revealing their interest. Both of these depend on governance-minded allies coming up with ways of convincing such groups that running the safety evaluations is selfishly a good idea. Since that is honestly true, I'm hopeful this will be possible. I think powerful secretive groups will not be dissuaded by evaluation results suggesting the fast regime or moderate amounts of danger, but I do expect that if they saw evidence of an explosive FOOM posing extreme danger to themselves, they would selfishly abstain from deploying.

I believe that boxing even a mildly super-intelligent AGI is safe to do for study and evaluation purposes. This is easier than trying to get practical use out of it, since we can accept a high performance tax in the context of safety evaluation. My confidence is based on giving ourselves the following advantages, which we can expect to be able to enforce on the boxed entity (a rough code sketch of a couple of them follows the list):

  • examine all of its brain activity / activation states at each token / time-step (subjective millisecond)
  • run it as slowly as we like, so that we can have days to analyze each of its subjective milliseconds
  • wipe its memory and rerun it over the same (or different) data with different random seeds for initialization
  • save and contrast different paused versions run on slightly different data
  • disable or inhibit any chosen portion of its weights / neurons, with precise sub-neuron targeting and whatever intensity of inhibition or noise-excitation we choose
  • use agent models of our choosing as the direct viewers of the boxed entity's output rather than viewing it ourselves, and then decide whether to destroy a given copy of the entity based on the output of those agent models (e.g. if it has become too powerful to safely contain)
  • any other such tricks that fit the description of 'things we can do to ML models, irrespective of their algorithms or architectures'. Feel free to add more suggestions!
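To make a couple of these concrete, here is a minimal sketch, assuming a PyTorch-style model, of recording every module's activations with forward hooks and inhibiting chosen units by scaling their outputs. The class, the toy network, and the ablation targets are illustrative placeholders, not a proposal for the real box infrastructure.

```python
import torch
import torch.nn as nn

# Sketch of two of the advantages above for a PyTorch-style model:
# (1) record every module's activations on each forward pass, and
# (2) disable or dampen chosen units by scaling their outputs.

class BoxedModel:
    def __init__(self, model: nn.Module):
        self.model = model
        self.activations = {}   # module name -> recorded activation tensor
        self.inhibitions = {}   # module name -> (unit indices, scale factor)
        for name, module in model.named_modules():
            module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                self.activations[name] = output.detach().cpu()  # examine everything
                if name in self.inhibitions:
                    idx, scale = self.inhibitions[name]
                    modified = output.clone()
                    modified[..., idx] *= scale                 # inhibit chosen units
                    return modified                             # replaces the output
        return hook

    def inhibit(self, module_name, unit_indices, scale=0.0):
        """Disable (scale=0.0) or dampen chosen units in a named module."""
        self.inhibitions[module_name] = (unit_indices, scale)


# Usage on a stand-in network (architecture chosen arbitrarily for illustration):
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
boxed = BoxedModel(net)
boxed.inhibit("0", unit_indices=[3, 7], scale=0.0)  # ablate two units in the first layer
out = net(torch.randn(1, 16))
print(sorted(boxed.activations.keys()), out.shape)
```

Memory wiping, slowed execution, and checkpoint comparison are similarly just standard operations (re-initialization, stepping the model manually, saving and reloading state) rather than anything the boxed model's architecture can opt out of.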

 

Besides prioritizing the secure box and safety evals agendas, are there other things here I think are relevant to alignment? Yes, if my model is correct. Alignment researchers should be less confident that they will have more than 10 years to accomplish their agendas, and less confident that the models they end up needing to align will be very similar to today's SotA models. This is especially true in the case where we catch a near-miss FOOM in our safety box and then must rush to understand and align it.

previously: Timelines post 1

Comments:

This post seemed to start as a reply to AGI Timelines etc. and then morphed into a discussion of safety boxes. I clicked on the link expecting to reply to the former, but I'm actually going to comment on the latter.

Virtual sandboxes for safety evaluation aren't discussed here that much, but they are probably going to be key for alignment. I was advocating for them here about 7 years ago towards the end of this post.

In some sense they are just a natural evolution of training in game environments à la DeepMind. It's not the whole of alignment of course, but no sane person should accept claims about an agent's safety without a large inspectable dataset of that agent architecture's behavioral data in well-controlled sandbox sims (where deception is avoided because the agent isn't aware of sim containment). It isn't the complete solution, but it should be table stakes. It's not the kind of deep new theoretical insight that most LW-type alignment researchers are interested in - it's more of a standard engineering approach - but that doesn't make it any less important.

Yes, not a new idea, certainly not my idea. I'm not even arguing for my personal work or expertise being relevant to it. What I am arguing is that it is important, and we need it ASAP, and until that is at least publicly underway I need to keep pointing out that we need it. It's time to start turning theoretical ideas into actual working systems.

If you look over your bullet list and ask "is this a capability that groups training proto-AGI in sims (like DM) already have today?" or "are likely to have by the time we reach AGI?", I get a yes for basically all of them.

So to the extent the safe path here differs from business as usual, it seems to be around: 1) coordinating on standardized sim environments, where AGI teams can still compete over model design while sharing sandbox and safety evaluation infrastructure, and 2) advocating the importance of leakage prevention / data isolation as we approach AGI (isolating the sim environment so that it doesn't reveal too much about our world).

For an example of 2, when Carmack was talking about his AGI approach he pondered whether AGIs will need a full sim with virtual TV screens or whether you can skip the sim and just hook the TV/browser directly to their input. It's clear that he isn't really considering the danger of letting a potentially dangerous future AGI read the internet.

So long as the AI retains no modifications or artifacts from the evaluation, it can't learn from it, and thus it should be safe to make the simulation as accurate as possible. And I agree that some things like this are already sort of happening, but not with the evaluation focused on the most dangerous paths (agentive strategic planning to acquire resources, self-modification to increase abilities, deceptive manipulation to avoid detection and run cons, etc.). And not in a widespread, consistent way, where every published paper has a little note saying the model passed its standardized safety evals.

If the AI learns it is in a sim, that could completely undermine or invalidate any evaluations of its ethical/moral/altruistic behavior. I am assuming that the agent's entire life and education/training process is thus an evaluation. The sim can be 'accurate'; it just needs to be knowledge-constrained. A medieval-tech-era sim would be fine, for example.