Gear-level models are expensive - often prohibitively expensive. Black-box approaches are usually much cheaper and faster. But black-box approaches rarely generalize - they're subject to Goodhart, need to be rebuilt when conditions change, don't identify unknown unknowns, and are hard to build on top of. Gears-level models, on the other hand, offer permanent, generalizable knowledge which can be applied to many problems in the future, even if conditions shift.

Stephen Fowler61-12
37
Very Spicy Take

Epistemic Note: Many highly respected community members with substantially greater decision making experience (and LessWrong karma) presumably disagree strongly with my conclusion.

Premise 1: It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.

Premise 2: This was the default outcome. Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.

Premise 3: Without repercussions for terrible decisions, decision makers have no skin in the game.

Conclusion: Anyone and everyone involved with Open Phil recommending a grant of $30 million be given to OpenAI in 2017 shouldn't be allowed anywhere near AI Safety decision making in the future. To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties. This must include Holden Karnofsky and Paul Christiano, both of whom were closely involved.

To quote OpenPhil: "OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela."
Tamsin Leake2122
7
I'm surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don't care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.
Akash169
2
My current perspective is that criticism of AGI labs is an under-incentivized public good. I suspect there's a disproportionate amount of value that people could have by evaluating lab plans, publicly criticizing labs when they break commitments or make poor arguments, talking to journalists/policymakers about their concerns, etc.

Some quick thoughts:

  • Soft power– I think people underestimate how strong the "soft power" of labs is, particularly in the Bay Area.
  • Jobs– A large fraction of people getting involved in AI safety are interested in the potential of working for a lab one day. There are some obvious reasons for this– lots of potential impact from being at the organizations literally building AGI, big salaries, lots of prestige, etc.
  • People (IMO correctly) perceive that if they acquire a reputation for being critical of labs, their plans, or their leadership, they will essentially sacrifice the ability to work at the labs.
  • So you get an equilibrium where the only people making (strong) criticisms of labs are those who have essentially chosen to forgo their potential of working there.
  • Money– The labs and Open Phil (which has been perceived, IMO correctly, as investing primarily into metastrategies that are aligned with lab interests) have an incredibly large share of the $$$ in the space. When funding became more limited, this became even more true, and I noticed a very tangible shift in the culture & discourse around labs + Open Phil.
  • Status games/reputation– Groups who were more inclined to criticize labs and advocate for public or policymaker outreach were branded as “unilateralist”, “not serious”, and “untrustworthy” in core EA circles. In many cases, there were genuine doubts about these groups, but my impression is that these doubts got amplified/weaponized in cases where the groups were more openly critical of the labs.
  • Subjectivity of "good judgment"– There is a strong culture of people getting jobs/status for having “good judgment”. This is sensible insofar as we want people with good judgment (who wouldn’t?) but this often ends up being so subjective that it ends up leading to people being quite afraid to voice opinions that go against mainstream views and metastrategies (particularly those endorsed by labs + Open Phil).
  • Anecdote– Personally, I found my ability to evaluate and critique labs + mainstream metastrategies substantially improved when I spent more time around folks in London and DC (who were less closely tied to the labs). In fairness, I suspect that if I had lived in London or DC *first* and then moved to the Bay Area, it’s plausible I would’ve had a similar feeling but in the “reverse direction”.

With all this in mind, I find myself more deeply appreciating folks who have publicly and openly critiqued labs, even in situations where the cultural and economic incentives to do so were quite weak (relative to staying silent or saying generic positive things about labs). Examples: Habryka, Rob Bensinger, CAIS, MIRI, Conjecture, and FLI. More recently, @Zach Stein-Perlman, and of course Jan Leike and Daniel K.
If your endgame strategy involved relying on OpenAI, DeepMind, or Anthropic to implement your alignment solution that solves science / super-cooperation / nanotechnology, consider figuring out another endgame plan.
robo83
2
Our current big stupid: not preparing for 40% agreement

Epistemic status: lukewarm take from the gut (not brain) that feels rightish

The "Big Stupid" of the AI doomers 2013-2023 was that AI nerds' solution to the problem "How do we stop people from building dangerous AIs?" was "research how to build AIs". Methods normal people would consider to stop people from building dangerous AIs, like asking governments to make it illegal to build dangerous AIs, were considered gauche. When the public turned out to be somewhat receptive to the idea of regulating AIs, doomers were unprepared.

Take: The "Big Stupid" of right now is still the same thing. (We've not corrected enough). Between now and transformative AGI we are likely to encounter a moment where 40% of people realize AIs really could take over (say if every month another 1% of the population loses their job). If 40% of the world were as scared of AI loss-of-control as you, what could the world do? I think a lot! Do we have a plan for then?

Almost every LessWrong post on AIs is about analyzing AIs. Almost none are about how, given widespread public support, people/governments could stop bad AIs from being built.

[Example: if 40% of people were as worried about AI as I was, the US would treat GPU manufacture like uranium enrichment. And fortunately GPU manufacture is hundreds of times harder than uranium enrichment! We should be nerding out researching integrated circuit supply chains, choke points, foundry logistics in jurisdictions the US can't unilaterally sanction, that sort of thing.]

TLDR, stopping deadly AIs from being built needs less research on AIs and more research on how to stop AIs from being built.

*My research included 😬

Recent Discussion

 [memetic status: stating directly despite it being a clear consequence of core AI risk knowledge because many people have "but nature will survive us" antibodies to other classes of doom and misapply them here.]

Unfortunately, no.[1]

Technically, “Nature”, meaning the fundamental physical laws, will continue. However, people usually mean forests, oceans, fungi, bacteria, and generally biological life when they say “nature”, and those would not have much chance competing against a misaligned superintelligence for resources like sunlight and atoms, which are useful to both biological and artificial systems.

There’s a thought that comforts many people when they imagine humanity going extinct due to a nuclear catastrophe or runaway global warming: Once the mushroom clouds or CO2 levels have settled, nature will reclaim the cities. Maybe mankind in our hubris will have wounded Mother Earth and paid the price ourselves, but...

17ryan_greenblatt
I think literal extinction is unlikely even conditional on misaligned AI takeover due to:

  • The potential for the AI to be at least a tiny bit "kind" (same as humans probably wouldn't kill all aliens). [1]
  • Decision theory/trade reasons

This is discussed in more detail here and here. Insofar as humans and/or aliens care about nature, similar arguments apply there too, though this is mostly beside the point: if humans survive and have (even a tiny bit of) resources they can preserve some nature easily. I find it annoying how confident this article is without really bothering to engage with the relevant arguments here. (Same goes for many other posts asserting that AIs will disassemble humans for their atoms.)

----------------------------------------

1. This includes the potential for the AI to have preferences that are morally valuable from a typical human perspective. ↩︎
O O10

Additionally, the AI might think it's in an alignment simulation and just leave the humans as is or even nominally address their needs.

5GoteNoSente
It is not at all clear to me that most of the atoms in a planet could be harnessed for technological structures, or that doing so would be energy efficient. Most of the mass of an earthlike planet is iron, oxygen, silicon and magnesium, and while useful things can be made out of these elements, I would strongly worry that other elements that are needed also in those useful things will run out long before the planet has been disassembled. By historical precedent, I would think that an AI civilization on Earth will ultimately be able to use only a tiny fraction of the material in the planet, similarly to how only a very small fraction of a percent of the carbon in the planet is being used by the biosphere, in spite of biological evolution having optimized organisms for billions of years towards using all resources available for life. The scenario of a swarm of intelligent drones eating up a galaxy and blotting out its stars I think can empirically be dismissed as very unlikely, because it would be visible over intergalactic distances. Unless we are the only civilization in the observable universe in the present epoch, we would see galaxies with dark spots or very strangely altered spectra somewhere. So this isn't happening anywhere. There are probably some historical analogs for the scenario of a complete takeover, but they are very far in the past, and have had more complex outcomes than intelligent grey goo scenarios normally portray. One instance I can think of is the Great Oxygenation Event. I imagine an observer back then might have envisioned that the end result of the evolution of cyanobacteria doing oxygenic photosynthesis would be the oceans and lakes and rivers all being filled with green slime, with a toxic oxygen atmosphere killing off all other life. While indeed this prognosis would have been true to a first order approximation - green plants do dominate life on Earth today - the reality of what happened is infinitely more complex than this crude pictu
4Dagon
If it's possible for super-intelligent AI to be non-sentient, wouldn't it be possible for insects to evolve non-sentient intelligence as well?  I guess I didn't assume "non-sentient" in the definition of "unaligned".

Introduction

[Reminder: I am an internet weirdo with no medical credentials]

A few months ago, I published some crude estimates of the power of nitric oxide nasal spray to hasten recovery from illness, and speculated about what it could do prophylactically. While working on that piece a nice man on Twitter alerted me to the fact that humming produces lots of nasal nitric oxide. This post is my very crude model of what kind of anti-viral gains we could expect from humming.

I’ve encoded my model at Guesstimate. The results are pretty favorable (average estimated impact of 66% reduction in severity of illness), but extremely sensitive to my made-up numbers. Efficacy estimates go from ~0 to ~95%, depending on how you feel about publication bias, what percent of Enovid’s impact...

4Elizabeth
Can you clarify this part? The liquid is a reactive solution (and contains other ingredients) so I don't understand how you calculated it. I agree the integral is a reasonable interpretation and appreciate you pointing it out. My guess is that frequent low-dose applications are better than infrequent high doses, but I don't know what the conversion rate is and this definitely undermines the hundred-dollar-bill case.

My prior is that solutions contain on the order of 1% active ingredients, and of things on the Enovid ingredients list, citric acid and NaNO2 are probably the reagents that create NO [1], which happens at a 5.5:1 mass ratio. 0.11ppm*hr as an integral over time already means the solution is only around 0.01% NO by mass [1], which is 0.055% reagents by mass, probably a bit more because yield is not 100%. This is a bit low but believable. If the concentration were really only 0.88ppm and dissipated quickly, it would be extremely dilute which seems unlikely. T... (read more)
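The arithmetic in this estimate can be checked with a short sketch. It takes the comment's own figures as given (roughly 0.01% NO by mass, and a 5.5:1 mass ratio of reagents to NO produced). The function name and the `yield_fraction` parameter are illustrative assumptions, not something from the original comment, and the 0.8 yield below is a made-up number used only to show the direction of the effect.

```python
def reagent_mass_percent(no_mass_percent: float,
                         mass_ratio: float = 5.5,
                         yield_fraction: float = 1.0) -> float:
    """Mass percent of reagents needed to produce a given mass percent of NO.

    mass_ratio is the reagent-to-NO mass ratio quoted in the comment (5.5:1).
    yield_fraction is a hypothetical reaction yield in (0, 1].
    """
    if not 0 < yield_fraction <= 1:
        raise ValueError("yield_fraction must be in (0, 1]")
    return no_mass_percent * mass_ratio / yield_fraction


# Figures from the comment: 0.01% NO by mass at perfect yield.
ideal = reagent_mass_percent(0.01)

# Hypothetical 80% yield: more reagent is needed for the same NO.
lossy = reagent_mass_percent(0.01, yield_fraction=0.8)

print(f"reagents at 100% yield: {ideal:.3f}% by mass")
print(f"reagents at 80% yield:  {lossy:.3f}% by mass")
```

At perfect yield this reproduces the 0.055% figure, and any yield below 100% pushes the reagent fraction higher, which matches the "probably a bit more" caveat.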


The movement to reduce AI x-risk is overly purist. This is splitting it into many sects, each trying to maintain its own platonic level of purity, and is actively (greatly) harming the cause.

How the Safety Sects Manifest

  • People suggest not publishing AI research
  • More recently, Jan and his team leaving OpenAI
  • Less recently, Paul Christiano leaving OpenAI to form METR[1]
  • Even less recently, Anthropic  forming off of OpenAI
  • A suggestion to blacklist anyone who decided to give $30 million (a paltry sum of money for a startup) to OpenAI. 
     

I think these were all legitimate responses to a perceived increase in risk, but ultimately did or will do more harm than good. Disclaimer: I am the least sure that the formation of Anthropic increases p(doom) but I speculate, post AGI, it will be seen...

Bleeding Feet and Dedication

During AI Safety Camp (AISC) 2024, I was working with somebody on how to use binary search to approximate a hull that would contain a set of points, only to knock a glass off of my table. It splintered into a thousand pieces all over my floor.

A normal person might stop and remove all the glass splinters. I just spent 10 seconds picking up some of the largest pieces and then decided that it would be better to push on the train of thought without interruption.

Some time later, I forgot about the glass splinters and ended up stepping on one long enough to penetrate the callus. I prioritized working too much. A pretty nice problem to have, in my book.

Collaboration as Intelligence Enhancer

It was...

10Alex_Altair
Hey Johannes, I don't quite know how to say this, but I think this post is a red flag about your mental health. "I work so hard that I ignore broken glass and then walk on it" is not healthy. I've been around the community a long time and have seen several people have psychotic episodes. This is exactly the kind of thing I start seeing before they do. I'm not saying it's 90% likely, or anything. Just that it's definitely high enough for me to need to say something. Please try to seek out some resources to get you more grounded.
Emrik10

It's a reasonable concern to have, but I've spoken enough with him to know that he's not out of touch with reality. I do think he's out of sync with social reality, however, and as a result I also think this post is badly written and the anecdotes unwisely overemphasized. His willingness to step out of social reality in order to stay grounded with what's real, however, is exactly one of the main traits that make me hopeful about him.

I have another friend who's bipolar and has manic episodes. My ex-step-father also had rapid-cycling BP, so I know a bit abou... (read more)

3Nathan Helm-Burger
I absolutely agree that it makes more sense to fund the person (or team) rather than the project. I think that it makes sense to evaluate a person's current best idea, or top few ideas when trying to decide whether they are worth funding. Ideally, yes, I think it'd be great if the funders explicitly gave the person permission to pivot so long as their goal of making aligned AI remained the same. Maybe a funder would feel better about this if they had the option to reevaluate funding the researcher after a significant pivot?
3Emrik
He linked his extensive research log on the project above, and has made LW posts of some of their progress. That said, I don't know of any good legible summary of it. It would be good to have. I don't know if that's one of Johannes' top priorities, however. It's never obvious from the outside what somebody's top priorities ought to be.

Summary:

We think a lot about aligning AGI with human values. I think it’s more likely that we’ll try to make the first AGIs do something else. This might intuitively be described as trying to make instruction-following (IF) or do-what-I-mean-and-check (DWIMAC) be the central goal of the AGI we design. Adopting this goal target seems to improve the odds of success of any technical alignment approach. This goal target avoids the hard problem of specifying human values in an adequately precise and stable way, and substantially helps with goal misspecification and deception by allowing one to treat the AGI as a collaborator in keeping it aligned as it becomes smarter and takes on more complex tasks.

This is similar but distinct from the goal targets of prosaic alignment efforts....

I read your linked shortform thread. I agreed with most of your arguments against some common AGI takeover arguments. I agree that they won't coordinate against us and won't have "collective grudges" against us.

But I don't think the arguments for continued stability are very thorough, either. I think we just don't know how it will play out. And I think there's a reason to be concerned that takeover will be rational for AGIs, where it's not for humans.

The central difference in logic is the capacity for self-improvement. In your post, you addressed se... (read more)

2Seth Herd
In the near term AI and search are blurred, but that's a separate topic. This post was about AGI as distinct from AI. There's no sharp line between them, but there are important distinctions, and I'm afraid we're confused as a group because of that blurring. More above, and it's worth its own post and some sort of new clarifying terminology. The term AGI has been watered down to include LLMs that are fairly general, rather than the original and important meaning of AI that can think about anything, implying the ability to learn, and therefore almost necessarily to have explicit goals and agency. This was about that type of "real" AGI, which is still hypothetical even though increasingly plausible in the near term.
3agazi
I think we can already see the early innings of this with large API providers figuring out how to calibrate post-training techniques (RLHF, constitutional AI) between economic usefulness and the "mean" of western morals. Tough to go against economic incentives
2Seth Herd
Yes, we do see such "values" now, but that's a separate issue IMO. There's an interesting thing happening in which we're mixing discussions of AI safety and AGI x-risk. There's no sharp line, but I think they are two importantly different things. This post was intended to be about AGI, as distinct from AI. Most of the economic and other concerns relative to the "alignment" of AI are not relevant to the alignment of AGI. This thesis could be right or wrong, but let's keep it distinct from theories about AI in the present and near future. My thesis here (and a common thesis) is that we should be most concerned about AGI that is an entity with agency and goals, like humans have. AI as a tool is a separate thing. It's very real and we should be concerned with it, but not let it blur into categorically distinct, goal-directed, self-aware AGI. Whether or not we actually get such AGI is an open question that should be debated, not assumed. I think the answer is very clearly that we will, and soon; as soon as tool AI is smart enough, someone will make it agentic, because agents can do useful work, and they're interesting. So I think we'll get AGI with real goals, distinct from the pseudo-goals implicit in current LLMs' behavior. The post addresses such "real" AGI that is self-aware and agentic; an AGI like that whose sole goal is doing what people want is pretty much a third thing that's somewhat counterintuitive.

Produced as part of the MATS Winter 2023-4 program, under the mentorship of @Jessica Rumbelow

One-sentence summary: On a dataset of human-written essays, we find that gpt-3.5-turbo can accurately infer demographic information about the authors from just the essay text, and suspect it's inferring much more.


Introduction

Every time we sit down in front of an LLM like GPT-4, it starts with a blank slate. It knows nothing[1] about who we are, other than what it knows about users in general. But with every word we type, we reveal more about ourselves -- our beliefs, our personality, our education level, even our gender. Just how clearly does the model see us by the end of the conversation, and why should that worry us?

Like many, we were rather startled when @janus showed...

2Arthur Conmy
They emailed some people about this: https://x.com/brianryhuang/status/1763438814515843119 The reason is that it may allow unembedding matrix weight stealing: https://arxiv.org/abs/2403.06634

I'm aware of the paper because of the impact it had. I might personally not have chosen to draw their attention to the issue, since the main effect seems to be making some research significantly more difficult, and I haven't heard of any attempts to deliberately exfiltrate weights that this would be preventing.

2jdp
Of the abilities Janus demoed to me, this is probably the one that most convinced me GPT-3 does deep modeling of the data generator. The formulation they showed me guessed which famous authors an unknown author is most similar to. This is more useful because it doesn't require the model to know who the unknown author in particular is, just to know some famous author who is similar enough to invite comparison. Twitter post I wrote about it: https://x.com/jd_pressman/status/1617217831447465984 The prompt if you want to try it yourself. It used to be hard to find a base model to run this on but should now be fairly easy with LLaMa, Mixtral, et al. https://gist.github.com/JD-P/632164a4a4139ad59ffc480b56f2cc99
1eggsyntax
Interesting! Tough to test at scale, though, or score in any automated way (which is something I'm looking for in my approaches, although I realize you may not be).

Ilya Sutskever and Jan Leike have resigned. They led OpenAI's alignment work. Superalignment will now be led by John Schulman, it seems. Jakub Pachocki replaced Sutskever as Chief Scientist.

Reasons are unclear (as usual when safety people leave OpenAI).

The NYT piece (archive) and others I've seen don't really have details.

OpenAI announced Sutskever's departure in a blogpost.

Sutskever and Leike confirmed their departures in tweets.


Updates:

Friday May 17:

Superalignment dissolves.

Leike tweets, including:

I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.

I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.

These problems are quite hard to get right,

...

Noting that while Sam describes the provision as being “about potential equity cancellation”, the actual wording says ‘shall be cancelled’ not ‘may be cancelled’, as per this tweet from Kelsey Piper: https://x.com/KelseyTuoc/status/1791584341669396560

1Rebecca
What do you mean by pseudo-equity?
2Linch
OpenAI has something called PPUs ("Profit Participation Units") which in theory is supposed to act like RSUs albeit with a capped profit and no voting rights, but in practice is entirely a new legal invention and we don't really know how it works.
1Rebecca
Is that not what Altman is referring to when he talks about vested equity? My understanding was employees had no other form of equity besides PPUs, in which case he’s talking non-misleadingly about the non-narrow case of vested PPUs, ie the thing people were alarmed about, right?

Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.

OpenAI wasn’t a private company (ie for-profit) at the time of the OP grant though.

1Ebenezer Dukakis
So basically, I think it is a bad idea and you think we can't do it anyway. In that case let's stop calling for it, and call for something more compassionate and realistic like a public apology. I'll bet an apology would be a more effective way to pressure OpenAI to clean up its act anyways. Which is a better headline -- "OpenAI cofounder apologizes for their role in creating OpenAI", or some sort of internal EA movement drama? If we can generate a steady stream of negative headlines about OpenAI, there's a chance that Sam is declared too much of a PR and regulatory liability. I don't think it's a particularly good plan, but I haven't heard a better one.
1Ebenezer Dukakis
Sure, I think this helps tease out the moral valence point I was trying to make. "Don't allow them near" implies their advice is actively harmful, which in turn suggests that reversing it could be a good idea. But as you say, this is implausible. A more plausible statement is that their advice is basically noise -- you shouldn't pay too much attention to it. I expect OP would've said something like that if they were focused on descriptive accuracy rather than scapegoating. Another way to illuminate the moral dimension of this conversation: If we're talking about poor decision-making, perhaps MIRI and FHI should also be discussed? They did a lot to create interest in AGI, and MIRI failed to create good alignment researchers by its own lights. Now after doing advocacy off and on for years, and creating this situation, they're pivoting to 100% advocacy. Could MIRI be made up of good people who are "great at technical stuff", yet apt to shoot themselves in the foot when it comes to communicating with the public? It's hard for me to imagine an upvoted post on this forum saying "MIRI shouldn't be allowed anywhere near AI safety communications".
8Elizabeth
I like a lot of this post, but the sentence above seems very out of touch to me. Who are these third parties who are completely objective? Why is objective the adjective here, instead of "good judgement" or "predicted this problem at the time"?

FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("Preparedness Framework"), and METR's Key Components of an RSP.

DeepMind's FSF has three steps:

  1. Create model evals for warning signs of "Critical Capability Levels"
    1. Evals should have a "safety buffer" of at least 6x effective compute so that CCLs will not be reached between evals
    2. They list 7 CCLs across "Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D," and they're thinking about CBRN
      1. E.g. "Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents"
  2. Do model evals every 6x effective compute and every 3 months of fine-tuning
    1. This is an "aim," not a commitment
    2. Nothing about evals during deployment
  3. "When a model reaches
...
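The evaluation cadence described in step 2 can be sketched as a simple trigger rule. This is a minimal sketch under stated assumptions, not DeepMind's implementation: the function `eval_due` and its parameter names are hypothetical, "effective compute" is treated as a single positive number compared as a ratio, and the two conditions are read as "whichever comes first".

```python
def eval_due(compute_now: float,
             compute_at_last_eval: float,
             finetune_months_since_last_eval: float,
             compute_factor: float = 6.0,
             finetune_months: float = 3.0) -> bool:
    """Return True if a new round of model evals is due.

    Assumed reading of the stated aim: evaluate whenever effective compute
    has grown by compute_factor (6x) since the last eval, or whenever
    finetune_months (3) of fine-tuning have passed, whichever comes first.
    """
    if compute_now <= 0 or compute_at_last_eval <= 0:
        raise ValueError("effective compute must be positive")
    compute_trigger = compute_now / compute_at_last_eval >= compute_factor
    finetune_trigger = finetune_months_since_last_eval >= finetune_months
    return compute_trigger or finetune_trigger


# Hypothetical usage with made-up numbers (arbitrary compute units).
print(eval_due(compute_now=5.0, compute_at_last_eval=1.0,
               finetune_months_since_last_eval=1.0))
print(eval_due(compute_now=6.0, compute_at_last_eval=1.0,
               finetune_months_since_last_eval=0.0))
```

Since the post notes this cadence is an "aim" rather than a commitment, a rule like this describes the intended schedule, not a guarantee.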
8Zach Stein-Perlman
Sorry for brevity. We just disagree. E.g. you "walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks"; I felt like Anthropic was thinking about most stuff better. I think Anthropic's ASL-3 is reasonable and OpenAI's thresholds and corresponding commitments are unreasonable. If the ASL-4 threshold was high or commitments are poor such that ASL-4 was meaningless, I agree Anthropic's RSP would be at least as bad as OpenAI's. One thing I think is a big deal: Anthropic's RSP treats internal deployment like external deployment; OpenAI's has almost no protections for internal deployment. I agree "an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4" is also a fine characterization of Anthropic's current RSP. Quick edit: PF thresholds are too high; PF seems doomed / not on track. But RSPv1 is consistent with RSPv1.1 being great. At least Anthropic knows and says there’s a big hole. That's not super relevant to evaluating labs' current commitments but is very relevant to predicting.
2Akash
I agree with ~all of your subpoints but it seems like we disagree in terms of the overall appraisal. Thanks for explaining your overall reasoning though. Also big +1 that the internal deployment stuff is scary. I don’t think either lab has told me what protections they’re going to use for internally deploying dangerous (~ASL-4) systems, but the fact that Anthropic treats internal deployment like external deployment is a good sign. OpenAI at least acknowledges that internal deployment can be dangerous through its distinction between high risk (can be internally deployed) and critical risk (cannot be), but I agree that the thresholds are too high, particularly for model autonomy.
7Akash
I personally have a large amount of uncertainty around how useful prosaic techniques & control techniques will be. Here are a few statements I'm more confident in: 1. Ideally, AGI development would have much more oversight than we see in the status quo. Whether or not development or deployment activities keep national security risks below acceptable levels should be a question that governments are involved in answering. A sensible oversight regime would require evidence of positive safety or "affirmative safety".  2. My biggest concern with the prosaic/control metastrategy is that I think race dynamics substantially decrease its usefulness. Even if ASL-4 systems are deployed internally in a safe way, we're still not out of the acute risk period. And even if the leading lab (Lab A) is trustworthy/cautious, it will be worried that incautious Lab B is about to get to ASL-4 in 1-3 months. This will cause the leading lab to underinvest into control, feel like it doesn't have much time to figure out how to use its ASL-4 system (assuming it can be controlled), and feel like it needs to get to ASL-5+ rather quickly.  It's still plausible to me that perhaps this period of a few months is enough to pull off actions that get us out of the acute risk period (e.g., use the ASL-4 system to generate evidence that controlling more powerful systems would require years of dedicated effort and have Lab A devote all of their energy toward getting governments to intervene).  Given my understanding of the current leading labs, it's more likely to me that they'll underestimate the difficulties of bootstrapped alignment and assume that things are OK as long as empirical tests don't show imminent evidence of danger. I don't think this prior is reasonable in the context of developing existentially dangerous technologies, particularly technologies that are intended to be smarter than you. I think sensible risk management in such contexts should require a stronger theoretical/conceptual u

I agree with 1 and think that race dynamics makes the situation considerably worse when we only have access to prosaic approaches. (Though I don't think this is the biggest issue with these approaches.)

I think I expect a period substantially longer than several months by default due to slower takeoff than this. (More like 2 years than 2 months.)

Insofar as the hope was for governments to step in at some point, I think the best and easiest point for them to step in is actually during the point where AIs are already becoming very powerful:

  • Prior to this poin
... (read more)

LessOnline Festival

May 31st to June 2nd, Berkeley CA