Matthew Barnett

Someone who is interested in learning and doing good.

My Twitter: https://twitter.com/MatthewJBar

My Substack: https://matthewbarnett.substack.com/


Comments

I think we probably disagree substantially on the difficulty of alignment and the relationship between "resources invested in alignment technology" and "what fraction aligned those AIs are" (by fraction aligned, I mean what fraction of resources they take as a cut).

That's plausible. If you think that we can likely solve the problem of ensuring that our AIs stay perfectly obedient and aligned to our wishes perpetually, then you are indeed more optimistic than I am. Ironically, by virtue of my pessimism, I'm happier to roll the dice and hasten the arrival of imperfect AI, because I don't think it's worth trying very hard, and waiting a long time, to come up with a perfect solution that likely doesn't exist.

I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.

I mostly see corrigible AI as a short-term solution (although a lot depends on how you define this term). I thought the idea of a corrigible AI is that you're trying to build something that isn't itself independent and agentic, but will help you in your goals regardless. In this sense, GPT-4 is corrigible, because it's not an independent entity that tries to pursue long-term goals, but it will try to help you.

But purely corrigible AIs seem pretty obviously uncompetitive with more agentic AIs in the long-run, for almost any large-scale goal that you have in mind. Ideally, you eventually want to hire something that doesn't require much oversight and operates relatively independently from you. It's a bit like how, when hiring an employee, at first you want to teach them everything you can and monitor their work, but eventually, you want them to take charge and run things themselves as best they can, without much oversight.

And I'm not convinced you could use corrigible AIs to help you come up with the perfect solution to AI alignment, as I'm not convinced that something like that exists. So, ultimately I think we're probably just going to deploy autonomous slightly misaligned AI agents (and again, I'm pretty happy to do that, because I don't think it would be catastrophic except maybe over the very long-run).

I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren't under the control of their citizens or leaders.

I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn't currently how people are thinking about the situation.

For what it's worth, I'm not sure which part of my scenario you are referring to here, because these are both statements I agree with. 

In fact, this consideration is a major part of my general aversion to pushing for an AI pause, because, as you say, governments will already be quite skeptical of quickly deploying massively powerful agents that we can't fully control. By default, I think people will probably freak out and try to slow down advanced AI, even without any intervention from current effective altruists and rationalists. By contrast, I'm a lot more ready to roll out the autonomous AI agents that we can't fully control compared to the median person, simply because I see a lot of value in hastening the arrival of such agents (i.e., I don't find that outcome as scary as most other people seem to imagine).

At the same time, I don't think people will pause forever. I expect people to go more slowly than I'd prefer, but I don't expect people to pause AI for centuries either. And in due course, so long as at least some non-negligible misalignment "slips through the cracks", AIs will become more and more independent (both behaviorally and legally), their values will slowly drift, and humans will gradually lose control -- not overnight, or all at once, but eventually.

Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will generally be a strong pressure toward AIs operating at subsistence given low costs of copying.

I largely agree. However, I'm having trouble seeing how this idea challenges what I am trying to say. I agree that people will try to undercut unaligned AIs by making new AIs that do more of what they want instead. However, unless all the new AIs perfectly share the humans' values, you just get the same issue as before, but perhaps slightly less severe (i.e., the new AIs will gradually drift away from humans too). 

I think what's crucial here is that I think perfect alignment is very likely unattainable. If that's true, then we'll get some form of "value drift" in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.

A thing I always feel like I'm missing in your stories of how the future goes is "if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don't people train competitor AIs which don't take a cut?"

People could try to do that. In fact, I expect them to do that, at first. However, people generally don't have unlimited patience, and they aren't perfectionists. If people don't think that a perfectly robustly aligned AI is attainable (and I strongly doubt this type of entity is attainable), then they may be happy to compromise by adopting imperfect (and slightly power-seeking) AI as an alternative. Eventually people will think we've done "enough" alignment work, even if it doesn't guarantee full control over everything the AIs ever do, and simply deploy the AIs that we can actually build.

This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don't want things to change quickly, they can (and probably will) pass regulations to slow things down. But I don't expect people to support total stasis. It's more likely that people will permit some continuous loss of control, implicitly, in exchange for hastening the upside benefits of adopting AI.

Even a very gradual loss of control, continuously compounded, eventually means that humans won't fully be in charge anymore.
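To put rough numbers on the compounding point, here's a minimal sketch (my own illustration: the 2% annual figure is made up for the example, not a forecast):

```python
# Purely illustrative: assume humans cede a constant fraction of effective
# control each year (the 2% rate below is a made-up number for illustration).
def remaining_control(annual_loss: float, years: int) -> float:
    """Share of control retained after compounding a constant annual loss."""
    return (1 - annual_loss) ** years

for years in (10, 50, 100, 200):
    print(f"After {years:3d} years: {remaining_control(0.02, years):.1%} retained")
```

With that made-up rate, humans would retain roughly 80% of control after a decade but only about 13% after a century: that's the sense in which a loss of control can be gradual at every point and yet decisive in the end.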

In the medium to long-term, when AIs become legal persons, "replacing them" won't be an option -- as that would violate their rights. And creating a new AI to compete with them wouldn't eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.

Most of my "doom" scenarios are largely about what happens long after AIs have established a footing in the legal and social sphere, rather than the initial transition period when we're first starting to automate labor. When AIs have established themselves as autonomous entities in their own right, they can push the world in directions that biological humans don't like, for much the same reasons that young people can currently push the world in directions that old people don't like. 

Everything seems to be going great, the AI systems vasten, growth accelerates, etc, but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized.

I agree with the first part of your AI doom scenario (the part about us adopting AI technologies broadly and incrementally), but this part of the picture seems unrealistic to me. When AIs start to influence culture, it probably won't be a big conspiracy. It won't really be "mysterious" if things start trending away from what most humans want. It will likely just look like how cultural drift generally always looks: scary because it's out of your individual control, but nonetheless largely decentralized, transparent, and driven by pretty banal motives. 

AIs probably won't be "out to get us", even if they're unaligned. For example, I don't anticipate them blocking funding for uploading and life extension, although maybe that could happen. I think human influence could simply decline in relative terms even without these dramatic components to the story. We'll simply become "old" and obsolete, and our power will wane as AIs become increasingly autonomous, legally independent, and more adapted to the modern environment than we are.

Staying in permanent control of the future seems like a long, hard battle. And it's not clear to me that this is a battle we should even try to fight in the long run. Humans may gradually lose control -- not because of a sudden coup or coordinated scheming against the human species, but simply because humans won't be the only relevant minds in the world anymore.


I think the main reason why we won't align AGIs to some abstract conception of "human values" is because users won't want to rent or purchase AI services that are aligned to such a broad, altruistic target. Imagine a version of GPT-4 that, instead of helping you, used its time and compute resources to do whatever was optimal for humanity as a whole. Even if that were a great thing for GPT-4 to do from a moral perspective, most users aren't looking for charity when they sign up for ChatGPT, and they wouldn't be interested in signing up for such a service. They're just looking for an AI that helps them do whatever they personally want. 

In the future I expect this fact will remain true. Broadly speaking, people will spend their resources on AI services to achieve their own goals, not the goals of humanity-as-a-whole. This will likely look a lot more like "an economy of AIs who (primarily) serve humans" rather than "a monolithic AGI that does stuff for the world (for good or ill)". The first picture just seems like a default extrapolation of current trends. The second picture, by contrast, seems like a naive conception of the future that (perhaps uncharitably) the LessWrong community generally seems way too anchored on, for historical reasons.

I'm not sure if you'd categorize this under "scaling actually hitting a wall", but the main possibility that feels relevant in my mind is that progress is simply incremental in this case, as a fact about the world, rather than a strategic choice on OpenAI's part. When underlying progress is itself incremental, it makes sense to release frequent small updates. This is common in the software industry, and it would not be at all surprising if what's often true of software development generally holds for OpenAI as well.

(Though I also expect GPT-5 to be a medium-sized jump, once it comes out.)

Yes, I expect AI labs will run extensive safety tests in the future on their systems before deployment. Mostly this is because I think people will care a lot more about safety as the systems get more powerful, especially as they become more economically significant and the government starts regulating the technology. I think regulatory forces will likely be quite strong at the moment AIs are becoming slightly smarter than humans. Intuitively I anticipate the 5 FTE-year threshold to be well-exceeded before such a model release.

Putting aside the question of whether AIs would depend on humans for physical support for now, I also doubt that these initial slightly-smarter-than-human AIs could actually pull off an attack that kills >90% of humans. Can you sketch a plausible story here for how that could happen, under the assumption that we don't have general-purpose robots at the same time?

I'm not saying AIs won't have a large impact on the world when they first start to slightly exceed human intelligence (indeed, I expect AIs-in-general will be automating lots of labor at this point in time). I'm just saying these first slightly-smarter-than-human AIs won't pose a catastrophic risk to humanity in a serious sense (at least in an x-risk sense, if not a more ordinary catastrophic sense too, including for reasons of rational self-restraint).

Maybe some future slightly-smarter-than-human AIs can convince a human to create a virus, or something, but even if that's the case, I don't think it would make a lot of sense for a rational AI to do that given that (1) the virus likely won't kill 100% of humans, (2) the AIs will depend on humans to maintain the physical infrastructure supporting the AIs, and (3) if they're caught, they're vulnerable to shutdown since they would lose in any physical competition.

My sense is that people who are skeptical of my claim here will generally point to a few theses that I think are quite weak, such as:

  1. Maybe humans can be easily manipulated on a large scale by slightly-smarter-than-human AIs
  2. Maybe it'll be mere weeks or months between the first slightly-smarter-than-human AI and a radically superintelligent AI, making this whole discussion moot
  3. Maybe slightly smarter-than-human AIs will be able to quickly invent destructive nanotech despite not being radically superintelligent

That said, I agree there could be some bugs in the future that cause localized disasters if these AIs are tasked with automating large-scale projects, and they end up going off the rails for some reason. I was imagining a lower bar for "safe" than "can't do any large-scale damage at all to human well-being".

Here's something that I suspect a lot of people are skeptical of right now but that I expect will become increasingly apparent over time (with >50% credence): slightly smarter-than-human software AIs will initially be relatively safe and highly controllable by virtue of not having a physical body and not having any legal rights.

In other words, "we will be able to unplug the first slightly smarter-than-human-AIs if they go rogue", and this will actually be a strategically relevant fact, because it implies that we'll be able to run extensive experimental tests on highly smart AIs without worrying too much about whether they'll strike back in some catastrophic way.

Of course, at some point, we'll eventually make sufficient progress in robotics that we can't rely on this safety guarantee, but I currently imagine at least a few years will pass between the first slightly-smarter-than-human software AIs, and mass manufactured highly dexterous and competent robots.

(Although I also think there won't be a clear moment in which the first slightly-smarter-than-human AIs will be developed, as AIs will be imbalanced in their capabilities compared to humans.)
