Counterarguments to the basic AI x-risk case

KatjaGrace

(Crossposted from AI Impacts Blog)

This is going to be a list of holes I see in the basic argument for existential risk from superhuman AI systems¹.

To start, here’s an outline of what I take to be the basic case²:

I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’

Reasons to expect this:

Goal-directed behavior is likely to be valuable, e.g. economically.
Goal-directed entities may tend to arise from machine learning training processes not intending to create them (at least via the methods that are likely to be used).
‘Coherence arguments’ may imply that systems with some goal-directedness will become more strongly goal-directed over time.

II. If goal-directed superhuman AI systems are built, their desired outcomes will probably be about as bad as an empty universe by human lights

Reasons to expect this:

Finding useful goals that aren’t extinction-level bad appears to be hard: we don’t have a way to usefully point at human goals, and divergences from human goals seem likely to produce goals that are in intense conflict with human goals, due to a) most goals producing convergent incentives for controlling everything, and b) value being ‘fragile’, such that an entity with ‘similar’ values will generally create a future of virtually no value.
Finding goals that are extinction-level bad and temporarily useful appears to be easy: for example, advanced AI with the sole objective ‘maximize company revenue’ might profit said company for a time before gathering the influence and wherewithal to pursue the goal in ways that blatantly harm society.
Even if humanity found acceptable goals, giving a powerful AI system any specific goals appears to be hard. We don’t know of any procedure to do it, and we have theoretical reasons to expect that AI systems produced through machine learning training will generally end up with goals other than those they were trained according to. The randomly aberrant goals that result are probably extinction-level bad for reasons described in II.1 above.

III. If most goal-directed superhuman AI systems have bad goals, the future will very likely be bad

That is, a set of ill-motivated goal-directed superhuman AI systems, of a scale likely to occur, would be capable of taking control over the future from humans. This is supported by at least one of the following being true:

Superhuman AI would destroy humanity rapidly. This may be via ultra-powerful capabilities at e.g. technology design and strategic scheming, or through gaining such powers in an ‘intelligence explosion’ (self-improvement cycle). Either of those things may happen through exceptional heights of intelligence being reached or through highly destructive ideas being available to minds only mildly beyond our own.
Superhuman AI would gradually come to control the future via accruing power and resources. Power and resources would be more available to the AI system(s) than to humans on average, because of the AI having far greater intelligence.

Below is a list of gaps in the above, as I see it, and counterarguments. A ‘gap’ is not necessarily unfillable, and may have been filled in any of the countless writings on this topic that I haven’t read. I might even think that a given one can probably be filled. I just don’t know what goes in it.

This blog post is an attempt to run various arguments by you all on the way to making pages on AI Impacts about arguments for AI risk and corresponding counterarguments. At some point in that process I hope to also read others’ arguments, but this is not that day. So what you have here is a bunch of arguments that occur to me, not an exhaustive literature review.

Counterarguments

A. Contra “superhuman AI systems will be ‘goal-directed’”

Different calls to ‘goal-directedness’ don’t necessarily mean the same concept

‘Goal-directedness’ is a vague concept. It is unclear that the ‘goal-directednesses’ that are favored by economic pressure, training dynamics or coherence arguments (the component arguments in part I of the argument above) are the same ‘goal-directedness’ that implies a zealous drive to control the universe (i.e. that makes most possible goals very bad, fulfilling II above).

One well-defined concept of goal-directedness is ‘utility maximization’: always doing what maximizes a particular utility function, given a particular set of beliefs about the world.

Utility maximization does seem to quickly engender an interest in controlling literally everything, at least for many utility functions one might have³. If you want things to go a certain way, then you have reason to control anything which gives you any leverage over that, i.e. potentially all resources in the universe (i.e. agents have ‘convergent instrumental goals’). This is in serious conflict with anyone else with resource-sensitive goals, even if prima facie those goals didn’t look particularly opposed. For instance, a person who wants all things to be red and another person who wants all things to be cubes may not seem to be at odds, given that all things could be red cubes. However if these projects might each fail for lack of energy, then they are probably at odds.

Thus utility maximization is a notion of goal-directedness that allows Part II of the argument to work, by making a large class of goals deadly.

You might think that any other concept of ‘goal-directedness’ would also lead to this zealotry. If one is inclined toward outcome O in any plausible sense, then does one not have an interest in anything that might help procure O? No: if a system is not a ‘coherent’ agent, then it can have a tendency to bring about O in a range of circumstances, without this implying that it will take any given effective opportunity to pursue O. This assumption of consistent adherence to a particular evaluation of everything is part of utility maximization, not a law of physical systems. Call machines that push toward particular goals but are not utility maximizers pseudo-agents.

Can pseudo-agents exist? Yes—utility maximization is computationally intractable, so any physically existent ‘goal-directed’ entity is going to be a pseudo-agent. We are all pseudo-agents, at best. But it seems something like a spectrum. At one end is a thermostat, then maybe a thermostat with a better algorithm for adjusting the heat. Then maybe a thermostat which intelligently controls the windows. After a lot of honing, you might have a system much more like a utility-maximizer: a system that deftly seeks out and seizes well-priced opportunities to make your room 68 degrees—upgrading your house, buying R&D, influencing your culture, building a vast mining empire. Humans might not be very far on this spectrum, but they seem enough like utility-maximizers already to be alarming. (And it might not be well-considered as a one-dimensional spectrum—for instance, perhaps ‘tendency to modify oneself to become more coherent’ is a fairly different axis from ‘consistency of evaluations of options and outcomes’, and calling both ‘more agentic’ is obscuring.)

Nonetheless, it seems plausible that there is a large space of systems which strongly increase the chance of some desirable objective O occurring without even acting as much like maximizers of an identifiable utility function as humans would. For instance, without searching out novel ways of making O occur, or modifying themselves to be more consistently O-maximizing. Call these ‘weak pseudo-agents’.

For example, I can imagine a system constructed out of a huge number of ‘IF X THEN Y’ statements (reflexive responses), like ‘if body is in hallway, move North’, ‘if hands are by legs and body is in kitchen, raise hands to waist’, and so on, equivalent to a kind of vector field of motions, such that for every particular state, there are directions that all the parts of you should be moving. I could imagine this being designed to fairly consistently cause O to happen within some context. However, since such behavior would not be produced by a process optimizing O, you shouldn’t expect it to find new and strange routes to O, or to seek O reliably in novel circumstances. There appears to be zero pressure for this thing to become more coherent, unless its design already involves reflexes to move its thoughts in certain ways that lead it to change itself. I expect you could build a system like this that reliably runs around and tidies your house, say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).
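To make the reflex picture concrete, here is a minimal sketch of such a system as a plain lookup table. The state names, action names, and the `reflex_action` helper are all invented for illustration. The point it tries to show is only that a table like this produces sensible behavior in the states it covers and simply has no response in a novel one, since nothing in it searches for routes to O.

```python
from typing import Dict, Optional, Tuple

State = Tuple[str, str]  # (location, posture); both names are hypothetical

# Hypothetical reflex table: each covered state maps to one fixed action.
REFLEXES: Dict[State, str] = {
    ("hallway", "hands_by_legs"): "move_north",
    ("kitchen", "hands_by_legs"): "raise_hands_to_waist",
    ("kitchen", "hands_at_waist"): "pick_up_dish",
}


def reflex_action(state: State) -> Optional[str]:
    """Return the stored response for a state, or None if no rule matches.

    The table is only consulted, never optimized: there is no model of
    outcome O and no search over alternative ways of reaching it.
    """
    return REFLEXES.get(state)


if __name__ == "__main__":
    print(reflex_action(("hallway", "hands_by_legs")))  # a covered state
    print(reflex_action(("garden", "hands_by_legs")))   # a novel state, no rule fires
```

A real system of this kind would presumably have vastly more rules, but the structure would be the same: adding rules extends coverage without adding any drive to seek O outside the covered states.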

It is not clear that economic incentives generally favor the far end of this spectrum over weak pseudo-agency. There are incentives toward systems being more like utility maximizers, but also incentives against.

The reason any kind of ‘goal-directedness’ is incentivised in AI systems is that then the system can be given an objective by someone hoping to use their cognitive labor, and the system will make that objective happen. Whereas a similar non-agentic AI system might still do almost the same cognitive labor, but require an agent (such as a person) to look at the objective and decide what should be done to achieve it, then ask the system for that. Goal-directedness means automating this high-level strategizing.

Weak pseudo-agency fulfills this purpose to some extent, but not as well as utility maximization. However if we think that utility maximization is difficult to wield without great destruction, then that suggests a disincentive to creating systems with behavior closer to utility-maximization. Not just from the world being destroyed, but from the same dynamic causing more minor divergences from expectations, if the user can’t specify their own utility function well.

That is, if it is true that utility maximization tends to lead to very bad outcomes relative to any slightly different goals (in the absence of great advances in the field of AI alignment), then the most economically favored level of goal-directedness seems unlikely to be as far as possible toward utility maximization. More likely it is a level of pseudo-agency that achieves a lot of the users’ desires without bringing about sufficiently detrimental side effects to make it not worthwhile. (This is likely more agency than is socially optimal, since some of the side-effects will be harms to others, but there seems no reason to think that it is a very high degree of agency.)

Some minor but perhaps illustrative evidence: anecdotally, people prefer interacting with others who predictably carry out their roles or adhere to deontological constraints, rather than consequentialists in pursuit of broadly good but somewhat unknown goals. For instance, employers would often prefer employees who predictably follow rules to ones who try to forward company success in unforeseen ways.

The other arguments to expect goal-directed systems mentioned above seem more likely to suggest approximate utility-maximization rather than some other form of goal-directedness, but it isn’t that clear to me. I don’t know what kind of entity is most naturally produced by contemporary ML training. Perhaps someone else does. I would guess that it’s more like the reflex-based agent described above, at least at present. But present systems aren’t the concern.

Coherence arguments are arguments for being coherent a.k.a. maximizing a utility function, so one might think that they imply a force for utility maximization in particular. That seems broadly right. Though note that these are arguments that there is some pressure for the system to modify itself to become more coherent. What actually results from specific systems modifying themselves seems like it might have details not foreseen in an abstract argument merely suggesting that the status quo is suboptimal whenever it is not coherent. Starting from a state of arbitrary incoherence and moving iteratively in one of many pro-coherence directions produced by whatever whacky mind you currently have isn’t obviously guaranteed to increasingly approximate maximization of some sensical utility function. For instance, take an entity with a cycle of preferences, apples > bananas = oranges > pears > apples. The entity notices that it sometimes treats oranges as better than pears and sometimes worse. It tries to correct by adjusting the value of oranges to be the same as pears. The new utility function is exactly as incoherent as the old one. Probably moves like this are rarer than ones that make you more coherent in this situation, but I don’t know, and I also don’t know if this is a great model of the situation for incoherent systems that could become more coherent.
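The fruit example can be checked mechanically. The sketch below assumes a simple model that is not from any particular formalism in this post: preferences are lists of strict pairs and indifference pairs, and a set of preferences counts as ‘coherent’ if some utility function could represent it, i.e. if the strict preferences contain no cycle once equally valued items are merged. The function name `is_coherent` and the data layout are made up for illustration. It only shows that the ‘corrected’ preferences are still cyclic, not that they are incoherent to exactly the same degree.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def is_coherent(strict: List[Pair], indifferent: List[Pair]) -> bool:
    """True if some utility function could represent the stated preferences.

    strict: (a, b) means a is strictly preferred to b.
    indifferent: (a, b) means a and b are valued equally.
    """
    # Merge equally valued items into one class (a simple union-find).
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in indifferent:
        parent[find(a)] = find(b)

    # Build the strict-preference graph between classes.
    graph: Dict[str, Set[str]] = {}
    for a, b in strict:
        graph.setdefault(find(a), set()).add(find(b))
        graph.setdefault(find(b), set())

    # Depth-first search: any cycle means no consistent ranking exists.
    state: Dict[str, int] = {}  # 1 = in progress, 2 = finished

    def has_cycle(node: str) -> bool:
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1 or (nxt not in state and has_cycle(nxt)):
                return True
        state[node] = 2
        return False

    return not any(n not in state and has_cycle(n) for n in graph)


# apples > bananas = oranges > pears > apples
BEFORE_STRICT = [("apples", "bananas"), ("oranges", "pears"), ("pears", "apples")]
BEFORE_EQUAL = [("bananas", "oranges")]

# After setting oranges equal to pears: apples > bananas = oranges = pears > apples
AFTER_STRICT = [("apples", "bananas"), ("pears", "apples")]
AFTER_EQUAL = [("bananas", "oranges"), ("oranges", "pears")]

if __name__ == "__main__":
    print(is_coherent(BEFORE_STRICT, BEFORE_EQUAL))  # cyclic before the fix
    print(is_coherent(AFTER_STRICT, AFTER_EQUAL))    # still cyclic after it
```

Under this model, the self-modification removes one inconsistency the entity noticed while leaving the cycle through apples intact, which is the sense in which a locally pro-coherence move need not bring the whole system closer to a utility function.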

What it might look like if this gap matters: AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.

Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk

The forces for goal-directedness mentioned in I are presumably of finite strength. For instance, if coherence arguments correspond to pressure for machines to become more like utility maximizers, there is an empirical answer to how fast that would happen with a given system. There is also an empirical answer to how ‘much’ goal directedness is needed to bring about disaster, supposing that utility maximization would bring about disaster and, say, being a rock wouldn’t. Without investigating these empirical details, it is unclear whether a particular qualitatively identified force for goal-directedness will cause disaster within a particular time.

What it might look like if this gap matters: There are not that many systems doing something like utility maximization in the new AI economy. Demand is mostly for systems more like GPT or DALL-E, which transform inputs in some known way without reference to the world, rather than ‘trying’ to bring about an outcome. Maybe the world was headed for more of the latter, but ethical and safety concerns reduced desire for it, and it wasn’t that hard to do something else. Companies setting out to make non-agentic AI systems have no trouble doing so. Incoherent AIs are never observed making themselves more coherent, and training has never produced an agent unexpectedly. There are lots of vaguely agentic things, but they don’t pose much of a problem. There are a few things at least as agentic as humans, but they are a small part of the economy.

B. Contra “goal-directed AI systems’ goals will be bad”

Small differences in utility functions may not be catastrophic

Arguably, humans are likely to have somewhat different values to one another even after arbitrary reflection. If so, there is some extended region of the space of possible values that the values of different humans fall within. That is, ‘human values’ is not a single point.

If the values of misaligned AI systems fall within that region, this would not appear to be worse in expectation than the situation where the long-run future was determined by the values of humans other than you. (This may still be a huge loss of value relative to the alternative, if a future determined by your own values is vastly better than that chosen by a different human, and if you also expected to get some small fraction of the future, and will now get much less. These conditions seem non-obvious however, and if they obtain you should worry about more general problems than AI.)

Plausibly even a single human, after reflecting, could on their own come to different places in a whole region of specific values, depending on somewhat arbitrary features of how the reflecting period went. In that case, even the values-on-reflection of a single human is an extended region of values space, and an AI which is only slightly misaligned could be the same as some version of you after reflecting.

There is a further larger region, ‘that which can be reliably enough aligned with typical human values via incentives in the environment’, which is arguably larger than the circle containing most human values. Human society makes use of this a lot: for instance, most of the time particularly evil humans don’t do anything too objectionable because it isn’t in their interests. This region is probably smaller for more capable creatures such as advanced AIs, but still it is some size.

Thus it seems that some amount of AI divergence from your own values is probably broadly fine, i.e. not worse than what you should otherwise expect without AI.

Thus in order to arrive at a conclusion of doom, it is not enough to argue that we cannot align AI perfectly. The question is a quantitative one of whether we can get it close enough. And how close is ‘close enough’ is not known.

What it might look like if this gap matters: there are many superintelligent goal-directed AI systems around. They are trained to have human-like goals, but we know that their training is imperfect and none of them has goals exactly like those presented in training. However if you just heard about a particular system’s intentions, you wouldn’t be able to guess if it was an AI or a human. Things happen much faster than they were, because superintelligent AI is superintelligent, but not obviously in a direction less broadly in line with human goals than when humans were in charge.

Differences between AI and human values may be small

AI trained to have human-like goals will have something close to human-like goals. How close? Call it d, for a particular occasion of training AI.

If d doesn’t have to be 0 for safety (from above), then there is a question of whether it is an acceptable size.

I know of two issues here, pushing d upward. One is that with a finite number of training examples, the fit between the true function and the learned function will be wrong. The other is that you might accidentally create a monster (‘misaligned mesaoptimizer’) who understands its situation and pretends to have the utility function you are aiming for so that it can be freed and go out and manifest its own utility function, which could be just about anything. If this problem is real, then the values of an AI system might be arbitrarily different from the training values, rather than ‘nearby’ in some sense, so d is probably unacceptably large. But if you avoid creating such mesaoptimizers, then it seems plausible to me that d is very small.

If humans also substantially learn their values via observing examples, then the variation in human values is arising from a similar process, so might be expected to be of a similar scale. If we care to make the ML training process more accurate than the human learning one, it seems likely that we could. For instance, d gets smaller with more data.

Another line of evidence is that for things that I have seen AI learn so far, the distance from the real thing is intuitively small. If AI learns my values as well as it learns what faces look like, it seems plausible that it carries them out better than I do.

As minor additional evidence here, I don’t know how to describe any slight differences in utility functions that are catastrophic. Talking concretely, what does a utility function look like that is so close to a human utility function that an AI system has it after a bunch of training, but which is an absolute disaster? Are we talking about the scenario where the AI values a slightly different concept of justice, or values satisfaction a smidgen more relative to joy than it should? And then that’s a moral disaster because it is wrought across the cosmos? Or is it that it looks at all of our inaction and thinks we want stuff to be maintained very similar to how it is now, so crushes any efforts to improve things?

What it might look like if this gap matters: when we try to train AI systems to care about what specific humans care about, they usually pretty much do, as far as we can tell. We basically get what we trained for. For instance, it is hard to distinguish them from the human in question. (It is still important to actually do this training, rather than making AI systems not trained to have human values.)

Maybe value isn’t fragile

Eliezer argued that value is fragile, via examples of ‘just one thing’ that you can leave out of a utility function, and end up with something very far away from what humans want. For instance, if you leave out ‘boredom’ then he thinks the preferred future might look like repeating the same otherwise perfect moment again and again. (His argument is perhaps longer—that post says there is a lot of important background, though the bits mentioned don’t sound relevant to my disagreement.) This sounds to me like ‘value is not resilient to having components of it moved to zero’, which is a weird usage of ‘fragile’, and in particular, doesn’t seem to imply much about smaller perturbations. And smaller perturbations seem like the relevant thing with AI systems trained on a bunch of data to mimic something.

You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces? Almost none of the faces on thispersondoesnotexist.com are blatantly morphologically unusual in any way, let alone noseless. Admittedly one time I saw someone whose face was neon green goo, but I’m guessing you can get the rate of that down pretty low if you care about it.

Eight examples, no cherry-picking:

Skipping the nose is the kind of mistake you make if you are a child drawing a face from memory. Skipping ‘boredom’ is the kind of mistake you make if you are a person trying to write down human values from memory. My guess is that this seemed closer to the plan in 2009 when that post was written, and that people cached the takeaway and haven’t updated it for deep learning which can learn what faces look like better than you can.

What it might look like if this gap matters: there is a large region ‘around’ my values in value space that is also pretty good according to me. AI easily lands within that space, and eventually creates some world that is about as good as the best possible utopia, according to me. There aren’t a lot of really crazy and terrible value systems adjacent to my values.

Short-term goals

Utility maximization really only incentivises drastically altering the universe if one’s utility function places a high enough value on very temporally distant outcomes relative to near ones. That is, long term goals are needed for danger. A person who cares most about winning the timed chess game in front of them should not spend time accruing resources to invest in better chess-playing.
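A toy calculation may make the dependence on time horizon clearer. The sketch below uses a hard cutoff horizon as a crude stand-in for how much a utility function values distant outcomes, and every number in it (the setup time of 10 steps, the fivefold payoff boost) is invented purely for illustration. An agent either pursues its goal directly, or first spends time accruing resources and then pursues the goal more effectively.

```python
def direct_payoff(horizon: int) -> float:
    """Pursue the goal directly, earning 1 unit per step within the horizon."""
    return float(horizon)


def invest_payoff(horizon: int, setup_steps: int = 10, boost: float = 5.0) -> float:
    """Spend setup_steps accruing resources (earning nothing), then earn boost per step."""
    return boost * max(horizon - setup_steps, 0)


def prefers_investing(horizon: int) -> bool:
    """True if accruing resources first beats acting directly within this horizon."""
    return invest_payoff(horizon) > direct_payoff(horizon)


if __name__ == "__main__":
    for horizon in (5, 12, 13, 100):
        print(horizon, direct_payoff(horizon), invest_payoff(horizon),
              prefers_investing(horizon))
```

With these made-up parameters, resource accumulation only wins once the horizon exceeds 12 steps. The chess player with a running clock is in the short-horizon regime: whatever payoff better preparation would bring arrives too late to count.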

AI systems could have long-term goals via people intentionally training them to do so, or via long-term goals naturally arising from systems not trained so.

Humans seem to discount the future a lot in their usual decision-making (they have goals years in advance but rarely a hundred years) so the economic incentive to train AI to have very long term goals might be limited.

It’s not clear that training for relatively short term goals naturally produces creatures with very long term goals, though it might.

Thus if AI systems fail to have value systems relatively similar to human values, it is not clear that many will have the long time horizons needed to motivate taking over the universe.

What it might look like if this gap matters: the world is full of agents who care about relatively near-term issues, and are helpful to that end, and have no incentive to make long-term large scale schemes. Reminiscent of the current world, but with cleverer short-termism.

C. Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”

Human success isn’t from individual intelligence

The argument claims (or assumes) that surpassing ‘human-level’ intelligence (i.e. the mental capacities of an individual human) is the relevant bar for matching the power-gaining capacity of humans, such that passing this bar in individual intellect means outcompeting humans in general in terms of power (argument III.2), if not being able to immediately destroy them all outright (argument III.1). In a similar vein, introductions to AI risk often start by saying that humanity has triumphed over the other species because it is more intelligent, as a lead-in to saying that if we make something more intelligent still, it will inexorably triumph over humanity.

This hypothesis about the provenance of human triumph seems wrong. Intellect surely helps, but humans look to be powerful largely because they share their meager intellectual discoveries with one another and consequently save them up over time⁴. You can see this starkly by comparing the material situation of Alice, a genius living in the stone age, and Bob, an average person living in 21st Century America. Alice might struggle all day to get a pot of water, while Bob might be able to summon all manner of delicious drinks from across the oceans, along with furniture, electronics, information, etc. Much of Bob’s power probably did flow from the application of intelligence, but not Bob’s individual intelligence. Alice’s intelligence, and that of those who came between them.

Bob’s greater power isn’t directly just from the knowledge and artifacts Bob inherits from other humans. He also seems to be helped for instance by much better coordination: both from a larger number of people coordinating together, and from better infrastructure for that coordination (e.g. for Alice the height of coordination might be an occasional big multi-tribe meeting with trade, and for Bob it includes global instant messaging and banking systems and the Internet). One might attribute all of this ultimately to innovation, and thus to intelligence and communication, or not. I think it’s not important to sort out here, as long as it’s clear that individual intelligence isn’t the source of power.

It could still be that with a given bounty of shared knowledge (e.g. within a given society), intelligence grants huge advantages. But even that doesn’t look true here: 21st Century geniuses live basically like 21st Century people of average intelligence, give or take.

Why does this matter? Well for one thing, if you make AI which is merely as smart as a human, you shouldn’t then expect it to do that much better than a genius living in the stone age. That’s what human-level intelligence gets you: nearly nothing. A piece of rope after millions of lifetimes. Humans without their culture are much like other animals.

To wield the control-over-the-world of a genius living in the 21st Century, the human-level AI would seem to need something like the other benefits that the 21st century genius gets from their situation in connection with a society.

One such thing is access to humanity’s shared stock of hard-won information. AI systems plausibly do have this, if they can get most of what is relevant by reading the internet. This isn’t obvious: people also inherit information from society through copying habits and customs, learning directly from other people, and receiving artifacts with implicit information (for instance, a factory allows whoever owns the factory to make use of intellectual work that was done by the people who built the factory, but that information may not be available explicitly even for the owner of the factory, let alone to readers on the internet). These sources of information seem likely to also be available to AI systems though, at least if they are afforded the same options as humans.

My best guess is that AI systems easily do better than humans on extracting information from humanity’s stockpile, and on coordinating, and so on this account are probably in an even better position to compete with humans than one might think on the individual intelligence model, but that is a guess. In that case perhaps this misunderstanding makes little difference to the outcomes of the argument. However it seems at least a bit more complicated.

Suppose that AI systems can have access to all information humans can have access to. The power the 21st century person gains from their society is modulated by their role in society, and relationships, and rights, and the affordances society allows them as a result. Their power will vary enormously depending on whether they are employed, or listened to, or paid, or a citizen, or the president. If AI systems’ power stems substantially from interacting with society, then their power will also depend on affordances granted, and humans may choose not to grant them many affordances (see section ‘Intelligence may not be an overwhelming advantage’ for more discussion).

However suppose that your new genius AI system is also treated with all privilege. The next way that this alternate model matters is that if most of what is good in a person’s life is determined by the society they are part of, and their own labor is just buying them a tiny piece of that inheritance, then if they are for instance twice as smart as any other human, they don’t get to use technology that is twice as good. They just get a larger piece of that same shared technological bounty purchasable by anyone. Each individual person is adding essentially nothing in terms of technology, so twice that is still basically nothing.

In contrast, I think people are often imagining that a single entity somewhat smarter than a human will be able to quickly use technologies that are somewhat better than current human technologies. This seems to be mistaking the actions of a human and the actions of a human society. If a hundred thousand people sometimes get together for a few years and make fantastic new weapons, you should not expect an entity somewhat smarter than a person to make even better weapons. That’s off by a factor of about a hundred thousand.

There might be places you can get far ahead of humanity by being better than a single human—it depends how much accomplishments depend on the few most capable humans in the field, and how few people are working on the problem. But for instance the Manhattan Project took a hundred thousand people several years, and von Neumann (a mythically smart scientist) joining the project did not reduce it to an afternoon. Plausibly to me, some specific people being on the project caused it to not take twice as many person-years, though the plausible candidates here seem to be more in the business of running things than doing science directly (though that also presumably involves intelligence). But even if you are an ambitious somewhat superhuman intelligence, the influence available to you seems to plausibly be limited to making a large dent in the effort required for some particular research endeavor, not single-handedly outmoding humans across many research endeavors.

This is all reason to doubt that a small number of superhuman intelligences will rapidly take over or destroy the world (as in III.1). This doesn’t preclude a set of AI systems that are together more capable than a large number of people from making great progress. However some related issues seem to make that less likely.

Another implication of this model is that if most human power comes from buying access to society’s shared power, i.e. interacting with the economy, you should expect intellectual labor by AI systems to usually be sold, rather than for instance put toward a private stock of knowledge. This means the intellectual outputs are mostly going to society, and the main source of potential power to an AI system is the wages received (which may allow it to gain power in the long run). However it seems quite plausible that AI systems at this stage will generally not receive wages, since they presumably do not need them to be motivated to do the work they were trained for. It also seems plausible that they would be owned and run by humans. This would seem to not involve any transfer of power to that AI system, except insofar as its intellectual outputs benefit it (e.g. if it is writing advertising material, maybe it doesn’t get paid for that, but if it can write material that slightly furthers its own goals in the world while also fulfilling the advertising requirements, then it sneaked in some influence.)

If there is AI which is moderately more competent than humans, but not sufficiently more competent to take over the world, then it is likely to contribute to this stock of knowledge and affordances shared with humans. There is no reason to expect it to build a separate competing stock, any more than there is reason for a current human household to try to build a separate competing stock rather than sell their labor to others in the economy.

In summary:

Functional connection with a large community of other intelligences in the past and present is probably a much bigger factor in the success of humans as a species or individual humans than is individual intelligence.
Thus this also seems more likely to be important for AI success than individual intelligence. This is contrary to a usual argument for AI superiority, but probably leaves AI systems at least as likely to outperform humans, since superhuman AI is probably superhumanly good at taking in information and coordinating.
However it is not obvious that AI systems will have the same access to society’s accumulated information e.g. if there is information which humans learn from living in society, rather than from reading the internet.
And it seems an open question whether AI systems are given the same affordances in society as humans, which also seem important to making use of the accrued bounty of power over the world that humans have. For instance, if they are not granted the same legal rights as humans, they may be at a disadvantage in doing trade or engaging in politics or accruing power.
The fruits of greater intelligence for an entity will probably not look like society-level accomplishments unless it is a society-scale entity
The route to influence with smaller fruits probably by default looks like participating in the economy rather than trying to build a private stock of knowledge.
If the resources from participating in the economy accrue to the owners of AI systems, not to the systems themselves, then there is less reason to expect the systems to accrue power incrementally, and they are at a severe disadvantage relative to humans.

Overall these are reasons to expect AI systems with around human-level cognitive performance to not destroy the world immediately, and to not amass power as easily as one might imagine.

What it might look like if this gap matters: If AI systems are somewhat superhuman, then they do impressive cognitive work, and each contributes to technology more than the best human geniuses, but not more than the whole of society, and not enough to materially improve their own affordances. They don’t gain power rapidly because they are disadvantaged in other ways, e.g. by lack of information, lack of rights, lack of access to positions of power. Their work is sold and used by many actors, and the proceeds go to their human owners. AI systems do not generally end up with access to masses of technology that others do not have access to, and nor do they have private fortunes. In the long run, as they become more powerful, they might take power if other aspects of the situation don’t change.

AI agents may not be radically superior to combinations of humans and non-agentic machines

‘Human level capability’ is a moving target. For comparing the competence of advanced AI systems to humans, the relevant comparison is with humans who have state-of-the-art AI and other tools. For instance, the human capacity to make art quickly has recently been improved by a variety of AI art systems. If there were now an agentic AI system that made art, it would make art much faster than a human of 2015, but perhaps hardly faster than a human of late 2022. If humans continually have access to tool versions of AI capabilities, it is not clear that agentic AI systems must ever have an overwhelmingly large capability advantage for important tasks (though they might).

(This is not an argument that humans might be better than AI systems, but rather: if the gap in capability is smaller, then the pressure for AI systems to accrue power is less and thus loss of human control is slower and easier to mitigate entirely through other forces, such as subsidizing human involvement or disadvantaging AI systems in the economy.)

Some advantages of being an agentic AI system vs. a human with a tool AI system seem to be:

1. There might just not be an equivalent tool system, for instance if it is impossible to train systems without producing emergent agents.
2. When every part of a process takes into account the final goal, this should make the choices within the task more apt for the final goal (and agents know their final goal, whereas tools carrying out parts of a larger problem do not).
3. For humans, the interface for using a capability of one’s mind tends to be smoother than the interface for using a tool. For instance a person who can do fast mental multiplication can do this more smoothly and use it more often than a person who needs to get out a calculator. This seems likely to persist.

1 and 2 may or may not matter much. 3 matters more for brief, fast, unimportant tasks. For instance, consider again people who can do mental calculations better than others. My guess is that this advantages them at using Fermi estimates in their lives and buying cheaper groceries, but does not make them materially better at making large financial choices well. For a one-off large financial choice, the effort of getting out a calculator is worth it and the delay is very short compared to the length of the activity. The same seems likely true of humans with tools vs. agentic AI with the same capacities integrated into their minds. Conceivably the gap between humans with tools and goal-directed AI is small for large, important tasks.

What it might look like if this gap matters: agentic AI systems have substantial advantages over humans with tools at some tasks like rapid interaction with humans, and responding to rapidly evolving strategic situations. One-off large important tasks such as advanced science are mostly done by tool AI.

Trust

If goal-directed AI systems are only mildly more competent than some combination of tool systems and humans (as suggested by considerations in the last two sections), we still might expect AI systems to out-compete humans, just more slowly. However AI systems have one serious disadvantage as employees of humans: they are intrinsically untrustworthy, while we don’t understand them well enough to be clear on what their values are or how they will behave in any given case. Even if they did perform as well as humans at some task, if humans can’t be certain of that, then there is reason to disprefer using them. This can be thought of as two problems: firstly, slightly misaligned systems are less valuable because they genuinely do the thing you want less well, and secondly, even if they were not misaligned, if humans can’t know that (because we have no good way to verify the alignment of AI systems) then it is costly in expectation to use them. (This is only a further force acting against the supremacy of AI systems—they might still be powerful enough that using them is enough of an advantage that it is worth taking the hit on trustworthiness.)

What it might look like if this gap matters: in places where goal-directed AI systems are not typically hugely better than some combination of less goal-directed systems and humans, the job is often given to the latter if trustworthiness matters.

Headroom

For AI to vastly surpass human performance at a task, there needs to be ample room for improvement above human level. For some tasks, there is not—tic-tac-toe is a classic example. It is not clear how close humans (or technologically aided humans) are to the limits of competence in the particular domains that will matter. It is to my knowledge an open question how much ‘headroom’ there is. My guess is a lot, but it isn’t obvious.

How much headroom there is varies by task. Categories of task for which there appears to be little headroom:

Tasks where we know what the best performance looks like, and humans can get close to it. For instance, machines cannot win more often than the best humans at Tic-tac-toe (playing within the rules), or solve Rubik’s cubes much more reliably, or extract calories from fuel much more efficiently.
Tasks where humans are already reaping most of the value—for instance, perhaps most of the value of forks is in having a handle with prongs attached to the end, and while humans continue to design slightly better ones, and machines might be able to add marginal value to that project more than twice as fast as the human designers, they cannot perform twice as well in terms of the value of each fork, because forks are already 95% as good as they can be.
Better performance is quickly intractable. For instance, we know that for tasks in particular complexity classes, there are computational limits to how well one can perform across the board. Or for chaotic systems, there can be limits to predictability. (That is, tasks might lack headroom not because they are simple, but because they are complex. E.g. AI probably can’t predict the weather much further out than humans.)

Categories of task where a lot of headroom seems likely:

Competitive tasks where the value of a certain level of performance depends on whether one is better or worse than one’s opponent, so that the marginal value of more performance doesn’t hit diminishing returns, as long as your opponent keeps competing and taking back what you just won. Though in one way this is like having little headroom: there’s no more value to be had—the game is zero sum. And while there might often be a lot of value to be gained by doing a bit better on the margin, still if all sides can invest, then nobody will end up better off than they were. So whether this seems more like high or low headroom depends on what we are asking exactly. Here we are asking if AI systems can do much better than humans: in a zero sum contest like this, they likely can in the sense that they can beat humans, but not in the sense of reaping anything more from the situation than the humans ever got.
Tasks where it is twice as good to do the same task twice as fast, and where speed is bottlenecked on thinking time.
Tasks where there is reason to think that optimal performance is radically better than we have seen. For instance, perhaps we can estimate how high Chess Elo rankings must go before reaching perfection by reasoning theoretically about the game, and perhaps it is very high (I don’t know).
Tasks where humans appear to use very inefficient methods. For instance, it was perhaps predictable before calculators that they would be able to do mathematics much faster than humans, because humans can only keep a small number of digits in their heads, which doesn’t seem like an intrinsically hard problem. Similarly, I hear humans often use mental machinery designed for one mental activity for fairly different ones, through analogy. For instance, when I think about macroeconomics, I seem to be basically using my intuitions for dealing with water. When I do mathematics in general, I think I’m probably using my mental capacities for imagining physical objects.

What it might look like if this gap matters: many challenges in today’s world remain challenging for AI. Human behavior is not readily predictable or manipulable very far beyond what we have explored; only slightly more complicated schemes are feasible before the world’s uncertainties overwhelm planning; much better ads are soon met by much better immune responses; much better commercial decision-making ekes out some additional value across the board but most products were already fulfilling a lot of their potential; incredible virtual prosecutors meet incredible virtual defense attorneys and everything is as it was; there are a few rounds of attack-and-defense in various corporate strategies before a new equilibrium with broad recognition of those possibilities; conflicts and ‘social issues’ remain mostly intractable. There is a brief golden age of science before the newly low-hanging fruit are again plucked and it is only lightning fast in areas where thinking was the main bottleneck, e.g. not in medicine.

Intelligence may not be an overwhelming advantage

Intelligence is helpful for accruing power and resources, all things equal, but many other things are helpful too. For instance money, social standing, allies, evident trustworthiness, not being discriminated against (this was slightly discussed in section ‘Human success isn’t from individual intelligence’). AI systems are not guaranteed to have those in abundance. The argument assumes that any difference in intelligence in particular will eventually win out over any differences in other initial resources. I don’t know of reason to think that.

Empirical evidence does not seem to support the idea that cognitive ability is a large factor in success. Situations where one entity is much smarter or more broadly mentally competent than other entities regularly occur without the smarter one taking control over the other:

Species exist with all levels of intelligence. Elephants have not in any sense won over gnats; they do not rule gnats; they do not have obviously more control than gnats over the environment.
Competence does not seem to aggressively overwhelm other advantages in humans:
1. Looking at the world, intuitively the big discrepancies in power are not seemingly about intelligence.
2. IQ 130 humans apparently earn very roughly $6,000-$18,500 per year more than average IQ humans.
3. Elected representatives are apparently smarter on average, but it is a slightly shifted curve, not a radical difference.
4. MENSA isn’t a major force in the world.
5. Many places where people see huge success through being cognitively able are ones where they show off their intelligence to impress people, rather than actually using it for decision-making. For instance, writers, actors, song-writers, comedians, all sometimes become very successful through cognitive skills. Whereas scientists, engineers and authors of software use cognitive skills to make choices about the world, and less often become extremely rich and famous, say. If intelligence were that useful for strategic action, it seems like using it for that would be at least as powerful as showing it off. But maybe this is just an accident of which fields have winner-takes-all type dynamics.
6. If we look at people who evidently have good cognitive abilities given their intellectual output, their personal lives are not obviously drastically more successful, anecdotally.
7. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between the amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here). And in terms of AI progress, amateur human play was reached in the 50s, roughly when research began, and world champion level play was reached in 1997.
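To give the Elo figures in the last point some intuition, here is a minimal sketch using the standard Elo expected-score formula, in which a 400-point gap corresponds to 10:1 odds. The function name and the sample gaps are illustrative assumptions and are not taken from the sources cited above.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model.

    The score counts a win as 1, a draw as 0.5 and a loss as 0.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Illustrative rating gaps (assumed for this sketch): equal players,
# a 400-point gap, and the roughly 2000-point gap mentioned above.
for gap in (0, 400, 2000):
    print(f"gap {gap:>4}: expected score {elo_expected_score(gap, 0):.5f}")
```

Under this model a 400-point gap gives the stronger player an expected score of about 0.91, and a 2000-point gap gives about 0.99999, so on this measure the spread across human players is very far from "near-identical".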

And theoretically I don’t know why one would expect greater intelligence to win out over other advantages over time. There are actually two questionable theories here: 1) Charlotte having more overall control than David at time 0 means that Charlotte will tend to have an even greater share of control at time 1. And, 2) Charlotte having more intelligence than David at time 0 means that Charlotte will have a greater share of control at time 1 even if David has more overall control (i.e. more of other resources) at time 0.

What it might look like if this gap matters: there are many AI systems around, and they strive for various things. They don’t hold property, or vote, or get a weight in almost anyone’s decisions, or get paid, and are generally treated with suspicion. These things on net keep them from gaining very much power. They are very persuasive speakers however and we can’t stop them from communicating, so there is a constant risk of people willingly handing them power, in response to their moving claims that they are an oppressed minority who suffer. The main thing stopping them from winning is that their position as psychopaths bent on taking power for incredibly pointless ends is widely understood.

Unclear that many goals realistically incentivise taking over the universe

I have some goals. For instance, I want some good romance. My guess is that trying to take over the universe isn’t the best way to achieve this goal. The same goes for a lot of my goals, it seems to me. Possibly I’m in error, but I spend a lot of time pursuing goals, and very little of it trying to take over the universe. Whether a particular goal is best forwarded by trying to take over the universe as a substep seems like a quantitative empirical question, to which the answer is virtually always ‘not remotely’. Don’t get me wrong: all of these goals involve some interest in taking over the universe. All things equal, if I could take over the universe for free, I do think it would help in my romantic pursuits. But taking over the universe is not free. It’s actually super duper duper expensive and hard. So for most goals arising, it doesn’t bear considering. The idea of taking over the universe as a substep is entirely laughable for almost any human goal.

So why do we think that AI goals are different? I think the thought is that it’s radically easier for AI systems to take over the world, because all they have to do is to annihilate humanity, and they are way better positioned to do that than I am, and also better positioned to survive the death of human civilization than I am. I agree that it is likely easier, but how much easier? So much easier to take it from ‘laughably unhelpful’ to ‘obviously always the best move’? This is another quantitative empirical question.

What it might look like if this gap matters: Superintelligent AI systems pursue their goals. Often they achieve them fairly well. This is somewhat contrary to ideal human thriving, but not lethal. For instance, some AI systems are trying to maximize Amazon’s market share, within broad legality. Everyone buys truly incredible amounts of stuff from Amazon, and people often wonder if it is too much stuff. At no point does attempting to murder all humans seem like the best strategy for this.

Quantity of new cognitive labor is an empirical question, not addressed

Whether some set of AI systems can take over the world with their new intelligence probably depends on how much total cognitive labor they represent. For instance, if they are in total slightly more capable than von Neumann, they probably can’t take over the world. If they are together as capable (in some sense) as a million 21st Century human civilizations, then they probably can (at least in the 21st Century).

It also matters how much of that is goal-directed at all, and highly intelligent, and how much of that is directed at achieving the AI systems’ own goals rather than those we intended them for, and how much of that is directed at taking over the world.

If we continued to build hardware, presumably at some point AI systems would account for most of the cognitive labor in the world. But if there is first an extended period of more minimal advanced AI presence, that would probably prevent an immediate death outcome, and improve humanity’s prospects for controlling a slow-moving AI power grab.

What it might look like if this gap matters: when advanced AI is developed, there is a lot of new cognitive labor in the world, but it is a minuscule fraction of all of the cognitive labor in the world. A large part of it is not goal-directed at all, and of the part that is, most is applied to the tasks it was intended for. Thus the part of it that is spent on scheming to grab power for AI systems is too small to grab much power quickly. The amount of AI cognitive labor grows fast over time, and in several decades it is most of the cognitive labor, but humanity has had extensive experience dealing with its power grabbing.

Speed of intelligence growth is ambiguous

The idea that a superhuman AI would be able to rapidly destroy the world seems prima facie unlikely, since no other entity has ever done that. Two common broad arguments for it:

There will be a feedback loop in which intelligent AI makes more intelligent AI repeatedly until AI is very intelligent.
Very small differences in brains seem to correspond to very large differences in performance, based on observing humans and other apes. Thus any movement past human-level will take us to unimaginably superhuman level.

These both seem questionable.

Feedback loops can happen at very different rates. Identifying a feedback loop empirically does not signify an explosion of whatever you are looking at. For instance, technology is already helping improve technology. To get to a confident conclusion of doom, you need evidence that the feedback loop is fast.
It does not seem clear that small improvements in brains lead to large changes in intelligence in general, or will do on the relevant margin. Small differences between humans and other primates might include those helpful for communication (see Section ‘Human success isn’t from individual intelligence’), which do not seem relevant here. If there were a particularly powerful cognitive development between chimps and humans, it is unclear that AI researchers find that same insight at the same point in the process (rather than at some other time).
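The point about feedback loop rates can be illustrated with a minimal sketch of compounding growth. It assumes, purely for illustration, a loop that improves capability by a fixed fraction each period; the function name, the rates, and the target factor are all hypothetical and not estimates of anything.

```python
import math


def periods_to_multiply(factor: float, rate_per_period: float) -> float:
    """Periods needed for a quantity to grow by `factor` when it compounds
    at a constant fractional `rate_per_period` (a simplifying assumption)."""
    return math.log(factor) / math.log(1.0 + rate_per_period)


# Hypothetical rates: 3% per period, doubling each period, 100x per period.
for rate in (0.03, 1.0, 99.0):
    periods = periods_to_multiply(1000.0, rate)
    print(f"rate {rate:>5}: {periods:.1f} periods to grow 1000-fold")
```

All three cases are feedback loops of the same shape, yet the time to a thousandfold improvement differs by orders of magnitude, which is why the existence of a loop says little without evidence about its rate.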

A large number of other arguments have been posed for expecting very fast growth in intelligence at around human level. I previously made a list of them with counterarguments, though none seemed very compelling. Overall, I don’t know of strong reason to expect very fast growth in AI capabilities at around human-level AI performance, though I hear such arguments might exist.

What it would look like if this gap mattered: AI systems would at some point perform at around human level at various tasks, and would contribute to AI research, along with everything else. This would contribute to progress to an extent familiar from other technological progress feedback, and would not e.g. lead to a superintelligent AI system in minutes.

Key concepts are vague

Concepts such as ‘control’, ‘power’, and ‘alignment with human values’ all seem vague. ‘Control’ is not zero sum (as seemingly assumed) and is somewhat hard to pin down, I claim. What an ‘aligned’ entity is exactly seems to be contentious in the AI safety community, but I don’t know the details. My guess is that upon further probing, these conceptual issues are resolvable in a way that doesn’t endanger the argument, but I don’t know. I’m not going to go into this here.

What it might look like if this gap matters: upon thinking more, we realize that our concerns were confused. Things go fine with AI in ways that seem obvious in retrospect. This might look like it did for people concerned about the ‘population bomb’ or as it did for me in some of my youthful concerns about sustainability: there was a compelling abstract argument for a problem, and the reality didn’t fit the abstractions well enough to play out as predicted.

D. Contra the whole argument

The argument overall proves too much about corporations

Here is the argument again, but modified to be about corporations. A couple of pieces don’t carry over, but they don’t seem integral.

I. Any given corporation is likely to be ‘goal-directed’

Reasons to expect this:

Goal-directed behavior is likely to be valuable in corporations, e.g. economically
~~Goal-directed entities may tend to arise from machine learning training processes not intending to create them (at least via the methods that are likely to be used).~~
‘Coherence arguments’ may imply that systems with some goal-directedness will become more strongly goal-directed over time.

II. If goal-directed superhuman corporations are built, their desired outcomes will probably be about as bad as an empty universe by human lights

Reasons to expect this:

Finding useful goals that aren’t extinction-level bad appears to be hard: we don’t have a way to usefully point at human goals, and divergences from human goals seem likely to produce goals that are in intense conflict with human goals, due to a) most goals producing convergent incentives for controlling everything, and b) value being ‘fragile’, such that an entity with ‘similar’ values will generally create a future of virtually no value.
Finding goals that are extinction-level bad and temporarily useful appears to be easy: for example, corporations with the sole objective ‘maximize company revenue’ might profit for a time before gathering the influence and wherewithal to pursue the goal in ways that blatantly harm society.
Even if humanity found acceptable goals, giving a corporation any specific goals appears to be hard. We don’t know of any procedure to do it~~, and we have theoretical reasons to expect that AI systems produced through machine learning training will generally end up with goals other than those that they were trained according to~~. Randomly aberrant goals resulting are probably extinction-level bad, for reasons described in II.1 above.

III. If most goal-directed corporations have bad goals, the future will very likely be bad

That is, a set of ill-motivated goal-directed corporations, of a scale likely to occur, would be capable of taking control of the future from humans. This is supported by at least one of the following being true:

A corporation would destroy humanity rapidly. This may be via ultra-powerful capabilities at e.g. technology design and strategic scheming, or through gaining such powers in an ‘intelligence explosion’ (self-improvement cycle). Either of those things may happen either through exceptional heights of intelligence being reached or through highly destructive ideas being available to minds only mildly beyond our own.
A corporation would gradually come to control the future via accruing power and resources. Power and resources would be more available to the corporation than to humans on average, because of the corporation having far greater intelligence.

This argument does point at real issues with corporations, but we do not generally consider such issues existentially deadly.

One might argue that there are defeating reasons that corporations do not destroy the world: they are made of humans so can be somewhat reined in; they are not smart enough; they are not coherent enough. But in that case, the original argument needs to make reference to these things, so that they apply to one and not the other.

What it might look like if this counterargument matters: something like the current world. There are large and powerful systems doing things vastly beyond the ability of individual humans, and acting in a definitively goal-directed way. We have a vague understanding of their goals, and do not assume that they are coherent. Their goals are clearly not aligned with human goals, but they have enough overlap that many people are broadly in favor of their existence. They seek power. This all causes some problems, but problems within the power of humans and other organized human groups to keep under control, for some definition of ‘under control’.

Conclusion

I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present though see how one would come to think it was overwhelmingly likely.

I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present though see how one would come to think it was overwhelmingly likely.

Suppose you went through the following exercise. For each scenario described under "What it might look like if this gap matters", ask:

Is this an existentially secure state of affairs?
If not, what are the main obstacles to reaching existential security from here?

and collected the obstacles, you might assemble a list like this one, which might update you toward AI x-risk being "overwhelmingly likely". (Personally, if I had to put a number on it, I'd say 80%.)

Agree directionally. I made a similar point in my review of "Is power-seeking AI an existential risk?":

In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.

5 · jacob_cannell · 3y

I disagree strongly with this implied framing that all which matters is minimization of risk. Functional humans are not pure risk-avoiders, nor is our civilization. Small chances of heaven can counterbalance small chances of hell. (I also disagree with the implied model from your first link where cumulative risk is the product of small independent risk per year, but that's more minor in comparison).

Do you think there's a way to reframe my position in a way that you'd agree with, or at least don't strongly disagree with? (In other words, I'm not sure how much of the disagreement is with the substance of what I'm saying vs the way I'm saying it.) Or, to approach this another way, how would you state/frame your own position on this topic?

2 · jacob_cannell · 3y

You linked to an article on existential security - “a place of safety - a place where existential risk is low and stays low” - which implies all that matters is risk minimization, rather than utility maximization with some risk discounting. To be fair, my disagreement there isn't specific to your points. Separately I'm also skeptical of estimating risk through some long list of obstacles, as the relevance of those obstacles is correlated with, or mostly determined by, a small number of more fundamental issues (takeoff speed, brain tractability, alignment vs capability tractability, etc).

1 · Rob Bensinger · 3y

Existential risk is just the probability that a large portion of the future's value is lost. "Small chances of heaven can counterbalance small chances of hell." implies that it's about reducing the risk of hell, when in fact it's equally concerned with the absence of Heaven.

4jacob_cannell3y

Ok that is an unexpected interpretation as it's not how I typically think of 'risk', but yes if that's the intended interpretation it resolves my objection.

Great post! I think this captures a lot of why I'm not ultradoomy (only, er, 45%-ish doomy, at the moment), especially A and B. I think it's at least possible that our reality is on easymode, where muddling could conceivably put an AI into close enough territory to not trigger an oops.

I'd be even less doomy if I agreed with the counterarguments in C. Unfortunately, I can't shake the suspicion that superintelligence is the kind of ridiculously powerful lever that would magnify small oopses into the largest possible oopses.

Hypothetically, if we took a clever human's general capacity for problem solving, stripped it of limitations like getting bored or tired, got rid of its pesky intuitions around ethics, and sped it up by a factor of 1,000 times... I'd be very worried about what it would be able to do. Even without greater capacity for insight or an enhanced working memory, simply thinking really fast would be a broken superpower.

Such an entity might not be able to recreate the technology of modern civilization starting from scratch (both in resources and knowledge) in the stone age within 30 years, primarily due to physical interaction requirements. But starting from anything like m...

-1awg3y

Agreed that superhuman intelligence seems like the kind of thing that could be a very powerful lever. What gets me is that we don't seem to know how orthogonal or non-orthogonal intelligence and empathy are to one another.[1] If we were capable of creating a superhumanly intelligent AI and we were to be able to give it superhuman empathy, I might be inclined to trust ceding over a large amount of power and control to that system (or set of systems, whatever). But a sociopathic superhuman intelligence? Definitely not ceding power over to that system. The question then becomes to me, how confident are we that we are not creating dangerously sociopathic AI?

1. ^ If I were to take a stab, I would say they were almost entirely orthogonal, as we have perfectly intelligent yet sociopathic humans walking around today who lack any sort of empathy. Giving any of these people superhuman ability and control would seem like an obviously terrible idea to me.

There's a nearby kind of obvious but rarely directly addressed generalized version of one of your arguments, which is that ML learns complex functions all the time, so why should human values be any different? I rarely see this discussed, and I thought the replies from Nate and the ELK related difficulties were important to have out in the open, so thanks a lot for including the face learning <-> human values learning analogy.

For at least about ten years in my experience people in this community have been saying the main problem isn't getting the AI to understand human values, it's getting the AI to have human values. Unfortunately the phrase "learn human values" is sometimes used to mean "have human values" and sometimes used to mean "understand human values", hence the confusion.

To have human values the AI needs to either learn them or have them instilled. EY’s complexity/fragility of human values argument is directed against early proposals for learning human values for an AI utility function. Obviously at some point a powerful AI will learn a model of human values somewhere in its world model, but that is irrelevant because that doesn’t affect its utility function and it’s far too late - the AI needs a robust model of human values well before it becomes superhuman.

Katja's point is valid - DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI (which again is completely unrelated to the AI later learning human values somewhere in its world model).

I agree Eliezer is wrong, though that's not enough to ensure success. In particular, you need to avoid inner alignment issues like deceptive alignment, where it learns values very well only for instrumental convergence reasons, and once it's strong, it overthrows the humans and pursues whatever terminal goal it has.

6jacob_cannell3y

Sim boxing can solve deceptive alignment (and may be the only viable solution)

3Noosphere893y

I agree that boxing is at least a first step, so that it doesn't get more compute, or worse, FOOM. The tricky problem is we need to be able to train away a deceptive AI or forbid it entirely, without the deception merely becoming obfuscated so that it only looks trained away. This is why we need to move beyond the black box paradigm, and why strong interpretability tools are necessary.

4hairyfigment3y

>DL did not fail in the way EY predicted,

Where's the link for that prediction, because I think there's more than one example of critics putting words in his mouth, and then citing a place where he says something manifestly different. Here's a post from 2008, where he says the following:

In a discussion from 2010, he's offered the chance to say that he doesn't think the machine learning of the time could produce AGI even with a smarter approach, and he appears to pull back from saying that:

6jacob_cannell3y

The context should make it clear I was not talking about an explicit prediction. See this comment for more explication. I said: This is obviously true and beyond debate, see the quotes in my linked comment from EY's "Complex Value Systems are Required to Realize Valuable Futures" where he critiques Hibbard's proposal to install AI with a reward function which "learns to recognize happiness and unhappiness in human facial expressions, human voices and human body language". Then I said: Where Katja's point is that DL had no trouble learning concepts of faces (and many other things) to superhuman levels, without inevitably failing by instead only producing superficial simulacra of faces when we cranked up the optimization power. I was not referring to any explicit prediction, but the implicit prediction in Katja's analogy (where learning a complex 3D generative model of human faces from images is the analogy for learning a complex multi-modal model of human happiness from face images, voices, body language, etc).

2hairyfigment3y

That's clearly exactly what it does today? It seems I disagree with your point on a more basic level than expected. ETA:

2jacob_cannell3y

It only takes one positive example of AI not failing by producing superficial simulacra of faces to prove my point, which Katja already provided. It doesn't matter how many crappy AI models people make, as they lose out to stronger models.

2hairyfigment3y

Maybe I don't understand the point of this example in which AI creates non-conscious images of smiling faces. Are you really arguing that, based on evidence like this, a generalization of modern AI wouldn't automatically produce horrific or deadly results when asked to copy human values? Peripherally: that video contains simulacra of a lot more than faces, and I may have other minor objections in that vein. ETA, I may want to say more about the actual human analysis which I think informed the AI's "success," but first let me go back to what I said about linking EY's actual words. Here is 2008-Eliezer:

2jacob_cannell3y

Hibbard proposes we can learn a model of 'happiness' from images of smiling humans, body language, voices, etc and then instill that as the reward/utility function for AI. EY replies that this will fail because our values (like happiness) are far too complex and fragile to be learned robustly by such a procedure, and the result is instead an AI which optimizes for a different unintended goal: 'faciness'. Katja argues - and others concur - that maybe values are not as fragile as EY predicted, because DL now regularly learns complex concepts to superhuman accuracy - including visual models of faces. Obviously that totally depends on the system and how the human values are learned - but no, that certainly isn't the automatic result if we continue down the path of reverse engineering the brain, including its altruism mechanisms.

7hairyfigment3y

I may reply to this more fully, but first I'd like you to acknowledge that you cannot in fact point to a false prediction by EY here, and in the exact post you seemed to be referring to, he says that his view is compatible with this sort of AI producing realistic sculptures of human faces!

4the gears to ascension3y

as someone who often agrees with jake, cmon jake, own up to it, EY has said reasonable things before and you were wrong :P edit: oops meant to reply to @jacob_cannell

2jacob_cannell3y

Wrong about what? Of course EY has said many reasonable and insightful things

2jacob_cannell3y

Oh do you mean this text you quoted? The thing producing the very realistic tiny sculpture of a human face is a superintelligence, not some initial human designed ML system that is used to create the AI's utility function.

2jacob_cannell3y

What post? All I quoted recently was "Complex Value Systems are Required to Realize Valuable Futures", which does not appear to contain the word 'sculpture'.

4Noosphere893y

And more importantly, to prevent deceptive alignment from happening, which would allow a treacherous turn. A lot of overrated alignment plans have the property that they get outer alignment at optimum, that is the values you want to instill do not break at optimality, but use handwavium to bypass deceptive alignment, proxy and suboptimality alignment. (Jacob Cannell is better than Alex Turner at this, since he incorporates an AI sandbox which, importantly, prevents the AI from knowing it's in a simulation.)

Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:

briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load
(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a tweet of engagement, but in reality I probably won't, sorry.)

I asked Nate what he meant by B, and he said:

section B seemed to me to be saying "AIs can figure out what a face is". And, ok, sure, but if you ask them for the faciest possible thing, it's not very human!facelike.
which is one of many objections, ofc (others including "ah yes but can you aim it at a human concept" )

Note: "ask them for the faciest possible thing" seems confused.

How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
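To make "the point of highest probability density in the latent space" concrete, here is a minimal sketch. It assumes a standard Gaussian latent prior, and `toy_generator` is a made-up affine map standing in for a trained generator; it is not StyleGAN or any real model. The only thing it shows is that the latent density peaks at the zero vector and that the generator maps that point to a single, ordinary output.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of an isotropic standard Gaussian at latent vector(s) z."""
    d = z.shape[-1]
    return -0.5 * np.sum(z * z, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
latent_dim = 8

# Hypothetical stand-in for a trained generator: a fixed affine map.
W = rng.normal(size=(latent_dim, 4))
b = np.array([1.0, -2.0, 0.5, 3.0])

def toy_generator(z):
    return z @ W + b

# The mode of the latent prior is the all-zeros vector.
z_mode = np.zeros(latent_dim)
mode_logp = standard_normal_logpdf(z_mode)

# Every randomly drawn latent has strictly lower prior density than the mode.
z_samples = rng.normal(size=(1000, latent_dim))
sample_logps = standard_normal_logpdf(z_samples)

mode_output = toy_generator(z_mode)
print("log-density at mode:", mode_logp)
print("best sampled log-density:", sample_logps.max())
print("generator output at mode:", mode_output)
```

Note that this picks the most probable latent, which is a different operation from searching for the input that maximizes a classifier or discriminator score.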

I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier & find the image that maxes out the face logit", but if so, why is that the relevant operationalization? It doesn't correspond to how such a model is actually used.

EDIT: Here is what the first looks like for StyleGAN2-ADA.

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)
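The parenthetical about the maximum-likelihood essay can be seen in a deliberately crude setting. The sketch below assumes a hypothetical unigram model (`token_probs` is invented, and tokens are drawn independently), which is far simpler than any real language model; it only illustrates why the single most likely sequence tends to be degenerate.

```python
import math
from itertools import product

# Hypothetical toy unigram language model: each token is drawn independently.
token_probs = {"the": 0.5, "cat": 0.3, "sat": 0.2}

def sequence_logprob(tokens):
    """Log-probability of a token sequence under the unigram model."""
    return sum(math.log(token_probs[t]) for t in tokens)

def most_likely_sequence(length):
    """Brute-force search for the highest-probability sequence of a given length."""
    candidates = product(token_probs, repeat=length)
    return max(candidates, key=sequence_logprob)

best_short = most_likely_sequence(1)
best_long = most_likely_sequence(4)

print(best_short, sequence_logprob(best_short))
print(best_long, sequence_logprob(best_long))
```

Under these assumptions the length-4 winner is the most frequent token repeated four times, and the one-token sequence is more likely than any longer one, which is the shape of the claim in the comment.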

EY argues that human values are hard to learn. Katja uses human faces as an analogy, pointing out that ML systems learn natural concepts far more easily than EY 2009 expected.

The analogy is between A: a function which maps noise to realistic images of human faces and B: a function which maps predicted future world states to utility scores similar to how a human would score them. The lesson is that since ML systems can learn A very well, they can probably also learn B.

Function A (human face generator) does not even use max-likelihood sampling and it isn't even an optimizer, so your operationalization is just confused. Nor is function B an optimizer itself.

I claim that A and B are in fact very disanalogous objects, and that the claim that A can be learned well does not imply that B can probably be learned well. I am very confused by your claims about the functions A and B not being optimizers, because to me this is true but also irrelevant.

The reason we want a function B that can map world states to utilities is so that we can optimize on that number. We want to select for world states that we think will have high utility using B; otherwise function B is pretty useless. Therefore, this function has to be reliable enough that putting lots of optimization pressure on it does not break it. This is not the same as claiming that the function itself is an optimizer or anything like that. Making something reliable against lots of optimization pressure is a lot harder than making it reliable in the training distribution.

The function A effectively allows you to sample from the distribution of faces. Function A does not have to be robust against adversarial optimization to approximate the distribution. The analogous function in the domain of human values would be a function that lets you sample from some prior distribution of world states, not one that scores utility of states.

More generally, I think the confusion here stems from the fact that a) robustness against optimization is far harder than modelling typical elements of a distribution, and b) distributions over states are fundamentally different objects from utility functions over states.
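A toy numerical version of point (a) may help. Everything here is invented for illustration: `true_utility` is an arbitrary one-dimensional function, the proxy is a straight-line fit, and the "states" are just numbers. It is a sketch of a proxy that is accurate on its training distribution and then fails when optimized over a wider range, not a claim about any real learned utility function.

```python
import numpy as np

def true_utility(x):
    # Hypothetical ground-truth utility: rises until x = 2, then falls off.
    return x - 0.25 * x**2

# Training distribution: the proxy only ever sees states in [0, 1].
x_train = np.linspace(0.0, 1.0, 101)
coeffs = np.polyfit(x_train, true_utility(x_train), deg=1)  # linear proxy

def proxy_utility(x):
    return np.polyval(coeffs, x)

# In-distribution, the proxy tracks the true utility closely.
max_train_error = np.max(np.abs(proxy_utility(x_train) - true_utility(x_train)))

# Now optimize hard against the proxy over a much wider set of states.
x_search = np.linspace(0.0, 10.0, 1001)
x_chosen = x_search[np.argmax(proxy_utility(x_search))]

print(f"max in-distribution error: {max_train_error:.3f}")
print(f"state chosen by optimizing the proxy: {x_chosen:.1f}")
print(f"true utility there: {true_utility(x_chosen):.2f}")
```

The proxy's favorite state lies far outside the training range, and its true utility is worse than that of every state the proxy was trained on.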

Nate's analogy is confused: diffusion models do not generate convincing samples of faces by maximizing for faciness - see how they actually work, and make sure we agree there. This is important because previous systems (such as deepdream) could be described as maximizing for X, such that Nate's critique would be more relevant.

Your comment here about "optimizing for X-ness" indicates you also were adopting the wrong model of how diffusion models operate:

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

That simply isn't how diffusion models work. A diffusion model for essays would sample from realistic essays that summarize to some short prompt; so they absolutely do care about high likelihood from the distribution of human essay...

I object to your characterization that I am claiming that diffusion models work by maximizing faciness, or that I am confused about how diffusion models work. I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I'm confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.

(Unimportant nitpicking: This Person Does Not Exist doesn't actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)

You're also eliding over the difference between training an unconditional diffusion model on a face dataset and training an unconditional diffusion model over a general image dataset and doing classifier based guidance. I've been talking about unconditional models on a face dataset, which does not optim...

Interpretations

First a reply to interpretations of previous words:

I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I'm confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.

I hope we agree that a discriminator which is trained only to recognize good essays robustly probably does not contain enough information to generate good essays, for the same reasons that an image discriminator does not contain enough information to generate good images - because the discriminator only learns the boundaries of words/categories over images, not the more complex embedded distribution of realistic images.

Optimizing only for faciness via a discriminator does not work well - that's the old deepdream approach. Opti...

9David Johnston3y

The claim that every increase in regularisation makes performance worse is extraordinary, given everything I know about machine learning.

6cfoster03y

FYI: Planning with diffusion is being tried and seemingly works.

8David Johnston3y

Wouldn’t a better analogy be A: noise to faces judged as realistic and B: noise to plans judged to have good consequences? As for whether B breaks under competitive pressure: does A break under competitive pressure? B does introduce safe exploration concerns not relevant to A, but the answer for A seems like a clear “no” to me.

3Xodarap3y

Basic question: why would the AI system optimize for X-ness? I thought Katja's argument was something like:

1. Suppose we train a system to generate (say) plans for increasing the profits of your paperclip factory similar to how we train GANs to generate faces
2. Then we would expect those paperclip factory planners to have analogous errors to face generator errors
3. I.e. they will not be "eldritch"

The fact that you could repurpose the GAN discriminator in this terrifying way doesn't really seem relevant if no one is in practice doing that?

I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.

This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We'd ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we'd want them to.

Upvoted because I agree with all of the above.

AFAICT the original post was using the faces analogy in a different way than Nate is. It doesn't claim that the discriminators used to supervise GAN face learning or the classifiers used to detect faces are adversarially robust. That isn't the point it's making. It claims that learned models of faces don't "leave anything important out" in the way that one might expect some key feature to be "left out" when learning to model a complex domain like human faces or human values. And that seems well-supported: the trajectory of modern ML has shown learning such complex models is far easier than we might've thought, even if building adversarially robust classifiers is very hard. (As much as I'd like to have supervision signals that are robust to arbitrarily-capable adversaries, it seems non-obvious to me that that is even required for success at alignment.)

Hmm, but I don't understand what relevance it has to alignment. The problem was never that the AI won't learn human values, it's that the AI won't care about human values. Of course a super intelligent AI will have a good model of human values, the same way it will have a good model of engineering, chemistry, the ecological environment and physics. That doesn't mean it will do things that are aligned with its accurate model of human values.

I am not sure who thought that learning such models was much harder than it turned out to be. It seems clear that an AI will learn what human faces are before the AI is very dangerous to the world. It would have been extremely surprising to have a dangerous AGI incapable of learning what human faces are like.

Hmm, but I don't understand what relevance it has to alignment. The problem was never that the AI won't learn human values, it's that the AI won't care about human values

Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).

But that's all now mostly irrelevant - an altruistic AI probably doesn't even need to know or care about human values at all, as it can simply optimize for our empowerment - our future optionality or ability to do anything we want. (Some previous discussion here, and in these comments.)

I wasn't that active around the time of the sequences, but I had a good number of discussions with people, and the point "the AI will of course know what your values are, it just won't care" was made many times, and I am also pretty sure was made in the sequences (I would have to dig it up, and am on my phone, but I heard that sentence in spoken conversation a lot over the years).

I don't think "empowerment" is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.

Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us

the point "the AI will of course know what your values are, it just won't care" was made many times, and I am also pretty sure was made in the sequences

Notice I said "before it killed us". Sure the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that's irrelevant because we need to instill its utility function long before that. See my reply here, this is well documented, and no amount of vague memories of conversations trumps the written evidence.

I don't think "empowerment" is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.

I'm not entirely sure what people mean when they say "X won't survive heavy optimization pressure" - but for example the objective of modern diffusion models survives heavy optimization power.

External empowerment is very simple and it doesn't even require detailed modeling of the agent - they can just be a black box that produces outputs. I'm curious what you think is an example of "the kind of concept that particularly survives heavy optimization pressure".

4Noosphere893y

Basically, it's Goodhart's law in action, where optimizing a proxy more and more destroys what you value.

1jacob_cannell3y

Oh - empowerment is about as immune to Goodharting as you can get, and that's perhaps one of its major advantages[1]. However in practice one has to use some approximation, which may or may not be goodhartable to some degree depending on many details.

1. Empowerment is vastly more difficult to Goodhart than a corporation optimizing for some bundle of currencies (including crypto), much more difficult to Goodhart than optimizing for control over even more fundamental physical resources like mass and energy, and is generally the least-Goodhartable objective that could exist. In some sense the universal version of Goodharting - properly defined - is just a measure of deviation from empowerment. It is the core driver of human intelligence and for good reason. ↩︎

6Noosphere893y

Can you explain further, since this to me seems like a very large claim that if true would have big impact, but I'm not sure how you got the immunity to Goodhart result you have here. This applies to Regressional, Causal, Extremal and Adversarial Goodhart.

Empowerment could be defined as the natural unique solution to Goodharting. Goodharting is the divergence under optimization scaling between trajectories resulting from the difference between a utility function and some proxy of that utility function.

However due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling - and empowerment simply is that which they converge to.

In other words the empowerment of some agent P(X) is the utility function which minimizes trajectory distance to all/any reasonable agent utility functions U(X), regardless of their specific (potentially unknown) form.

Therefore empowerment is - by definition - the best possible proxy utility function (under optimization scaling).
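For readers who have not met the term, here is a minimal sketch of one common formalization of empowerment, which is an assumption on my part and not the definition given in this comment: in a deterministic environment, n-step empowerment is the log of the number of distinct states the agent can reach with n-step action sequences. The gridworld, the action set `ACTIONS`, and the function names are all hypothetical.

```python
import math
from itertools import product

# Hypothetical action set for an agent on a square grid.
ACTIONS = {
    "up": (0, -1),
    "down": (0, 1),
    "left": (-1, 0),
    "right": (1, 0),
    "stay": (0, 0),
}

def step(state, action, size):
    """Deterministic transition: move if the target cell is inside the grid."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < size and 0 <= ny < size:
        return (nx, ny)
    return state  # bumping into a wall leaves the agent where it was

def empowerment(state, horizon, size):
    """log2 of the number of distinct states reachable in `horizon` steps."""
    reachable = set()
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a, size)
        reachable.add(s)
    return math.log2(len(reachable))

center = empowerment((2, 2), horizon=2, size=5)
corner = empowerment((0, 0), horizon=2, size=5)
print(f"center: {center:.3f} bits, corner: {corner:.3f} bits")
```

In this toy grid the agent in the center has more reachable futures than the agent in the corner, which is the "future optionality" intuition; whether a quantity like this behaves well as an optimization target is exactly what the thread is disputing.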

Let's apply some quick examples:

Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or perhaps more realistically - with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.

Under scaling, an AI with some crude stock-value maximizin...

8interstice3y

An AI with a good world model will predictably have a model of your values, but that's different from being able to actually elicit that model via e.g. a series of labeled examples. That's the part that seemed less plausible before DL.

6Ben Pace3y

Do you have a link to where Eliezer (or any other LW writer) said that? I don’t myself recall whether they said that.

9jacob_cannell3y

I may be exaggerating a tiny tiny bit with the "before it killed us" modifier, and I don't have time to search for this specific needle - but EY famously criticized some early safety proposal which consisted of using a 'smiling face' detector somehow to train an AI to recognize human happiness, and then optimize for that. Oh it was actually already open in a tab: From complex values blah blah blah: EY's counterargument is that human values are much more complex than happiness - let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons - so it's just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years. Also in EY's model when the AI becomes superintelligent (which may only take a day or something after it becomes just upper human level intelligent and 'rewrites its source code'), it then quickly predicts the future, realizes humans are in the way, solves drexler-style strong nanotech, and then kills us all. Those latter steps are very fast.

5habryka3y

I don't know what relevance this has to the discussion at hand. A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces, I don't understand why that's wrong. Sure, it will likely do something weirder and less predictable, we don't understand the neural network prior very well, but optimizing for smiling humans still doesn't produce anything remotely aligned. Nothing in the quoted section, or in the document you linked that I just skimmed includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed none of that matters, because the AI has no reason to care. You gave it a few thousand to a million samples of smiling, and now the system is optimizing for smiling, you got what you put in. Eliezer indeed explicitly addresses this point and says: He is explicitly saying "Hibbard is confusing being 'smart' with 'caring about the right things'", the AI will be plenty capable of realizing that it isn't doing what you wanted it to, but it just doesn't care. Being smarter does not help with getting it to do the thing you want, that's the whole point of the alignment problem. Similarly AIs being able to understand human values better just doesn't help you that much with pointing at them (though it does help a bit, but the linked article just doesn't talk at all about this).

7jacob_cannell3y

That is not what Hibbard actually proposed, it's a superficial strawman version.

1. Hibbard claims we can design intelligent machines which love humans by training them to learn human happiness through facial expressions, voices, and body language.
2. EY claims this will fail and instead learn a utility function of “smiles”, resulting in a SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures".

It has absolutely nothing to do with whether the AI could eventually learn human values ("the things behind the smiling faces actually want"), and everything to do with whether some ML system could learn said values to use them as the utility function for the AI (which is what Hibbard is proposing). Neither Hibbard, EY, nor I are arguing about or discussing whether a SI can learn human values.

EY claims this will fail and instead learn a utility function of “smiles”, resulting in a SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures"

This is really misunderstanding what Eliezer is saying here, and also like, look, from my perspective it's been a decade of explaining to people almost once every two weeks that "yes, the AI will of course know what you care about, but it won't care", so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me, so I am experiencing a good amount of frustration with you repeatedly saying things in this comment thread like "irrefutable proof", when it's just like an obviously wrong statement (though like a fine one to arrive at when just reading some random subset of Eliezer's writing, but a clearly wrong summary nevertheless).

Now to go back to the object level:

Eliezer is really not saying that the AI will fail to learn that there is something more complicated than smiles that the human is trying to point it to. He is explicitly saying "look, you won't know what the AI w... (read more)

This is really misunderstanding what Eliezer is saying here [...] it's been a decade of explaining to people almost once every two weeks that "yes, the AI will of course know what you care about, but it won't care", so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me

I think this is much more ambiguous than you're making it out to be. In 2008's "Magical Categories", Yudkowsky wrote:

I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

I claim that this paragraph didn't age well in light of the deep learning revolution: "running a neural network [...] over a set of winning and losing sequences of chess moves" basically is how AlphaZero learns from self-play! As the Yudkowsky quote illustrates, it wasn't obvious in 2008 that this wo... (read more)

8habryka3y

I do think these are better quotes. It's possible that there was some update here between 2008 and 2013 (roughly when I started seeing the more live discussion happening), since I do really remember the "the problem is not getting the AI to understand, but to care" as a common refrain even back then (e.g. see the Robby post I linked). I agree that this paragraph aged less well than other paragraphs, though I do think this paragraph is still correct (Edit: Eh, it might be wrong, depends a bit on how much neural networks in the 50s are the same as today). It did sure turn out to be correct by a narrower margin than Eliezer probably thought at the time, but my sense is it's still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and then any of the followup-systems was trained by pure self-play. But in any case, I think your basic point of "Eliezer did not predict the Deep Learning revolution as it happened" here is correct, though I don't think this specific paragraph has a ton of relevance to the discussion at hand. I do think this paragraph seems like a decent quote, though I think at this point it makes sense to break it out into different pieces. I think Eliezer is saying that what matters is whether we can point the AI to what we care about "during its childhood", i.e. during relatively early training, before it has already developed a bunch of proxy training objectives. I think the key question about the future that I think Eliezer was opining on, is then whether by the time we expect AIs to actually be able to have a close-to-complete understanding of what we mean by "goodness", we still have any ability to shape their goals. My model is that indeed, Eliezer was surprised, as I think most people were, that AIs of 2022 are as good at picking up complicated concept boundaries and learning fuzzy h

9Richard Korzekwa3y

AlphaGo without the MCTS was still pretty strong: Even with just the SL-trained value network, it could play at a solid amateur level: I may be misunderstanding this, but it sounds like the network that did nothing but get good at guessing the next move in professional games was able to play at roughly the same level as Pachi, which, according to DeepMind, had a rank of 2d.

2habryka3y

Yeah, I mean, to be clear, I do definitely think you can train a neural network to somehow play chess via nothing but classification. I am not sure whether you could do it with a feed forward neural network, and it's a bit unclear to me whether the neural networks from the 50s are the same thing as the neural networks from 2000s, but it does sure seem like you can just throw a magic category absorber at chess and then have it play OK chess. My guess is modern networks are not meaningfully more complicated, and the difference to back then was indeed just scale and a few tweaks, but I am not super confident and haven't looked much into the history here.

2jacob_cannell3y

Really? Ok let's break down phrase by phrase; tell me exactly where I am misunderstanding: 1. Did EY claim Hibbard's plan will succeed or fail? 2. Did EY claim Hibbard's plan will result in tiling the future light-cone of Earth with tiny molecular smiley-faces? 3. Were these claims made in a paper titled "Complex Value Systems are Required to Realize Valuable Futures"? I've been here since the beginning, and I'm not sure who you have been explaining that to, but it certainly was not me. And where did I claim this is something new related to deep learning? I'm going to try to clarify this one last time. There are several different meanings of "learn human values" 1.) Training a machine learning model to learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language, and using that as the utility function of the AI, such that it hopefully cares about human happiness. This is Hibbard's plan from 2001 - long before DL. This model is trained before the AI becomes even human-level intelligent, and used as its initial utility/reward function. 2.) An AGI internally automatically learning human values as part of learning a model of the world - which would not automatically result in it caring about human values at all. You keep confusing 1 and 2 - specifically you are confusing arguments concerning 2 directed at laypeople with Hibbard's type 1 proposal. Hibbard doesn't believe that 2 will automatically work. Instead he is arguing for 1, and EY is saying that will fail. (And for the record, although EY's criticism is overconfident, I am not optimistic about Hibbard's plan as stated, but that was 2001) Because I'm not? Hibbard is attempting to make his AI care about safety at the onset (or at least happiness which is his version thereof), he's not trying to pass the entire buck to the AI.

5habryka3y

Will respond more later, but maybe this turns out to be the crux: But "happiness" is not safety! That's the whole point of this argument. If you optimize for your current conception of "happiness" you will get some kind of terrible thing that doesn't remotely capture your values, because your values are fragile and you can't approximate them by the process of "I just had my AI interact with a bunch of happy people and gave it positive reward, and a bunch of sad people and gave it negative reward".

7jacob_cannell3y

There are 2 separate issues here: 1. Would Hibbard's approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI? 2. Conditional on 1, is 'happiness' what we actually want? The answer to 2 depends much on how one defines happiness, but if happiness includes satisfaction (i.e. empowerment, curiosity, self-actualization etc - the basis of fun), then it is probably sufficient, but that's not the core argument. Notice that EY does not assume 1 and argue 2, he instead argues that Hibbard's approach doesn't learn a robust concept of happiness at all and instead learns a trivial superficial "maximize faciness" concept instead. This is crystal clear and unambiguous: He describes the result as a utility function of smiles, not a utility function of happiness. So no, EY's argument here is absolutely not about happiness being insufficient for safety. His argument is that happiness is incredibly complex and hard to learn a robust version of, and therefore Hibbard's simplistic approach will learn some stupid superficial 'faciness' concept rather than happiness. See also current debates around building a diamond-maximizing AI, where there is zero question of whether diamondness is what we want, and all the debate is around the (claimed) incredible difficulty of learning a robust version of even something simple like diamondness.

6habryka3y

I think I am more interested in you reading The Genie Knows but Doesn't Care and then having you respond to the things in there than the Hibbard example, since that post was written with (as far as I can tell) addressing common misunderstandings of the Hibbard debate (given that it was linked by Robby in a bunch of the discussion there after it was written). I think there are some subtle things here. I think Eliezer!2008 would agree that AIs will totally learn a robust concept for "car". But I think neither Eliezer!2008 nor me currently would think that current LLMs have any chance of learning a robust concept for "happiness" or "goodness", in substantial parts because I don't have a robust concept of "happiness" or "goodness" and before the AI refines those concepts further than I can, I sure expect it to be able to disempower me (though it's not like guaranteed that that will happen). What Eliezer is arguing against is not that the AI will not learn any human concepts. It's that there are a number of human concepts that tend to lean on the whole ontological structure of how humans think about the world (like "low-impact" or "goodness" or "happiness"), such that in order to actually build an accurate model of those, you have to do a bunch of careful thinking and need to really understand how humans view the world, and that people tend to be systematically optimistic about how convergent these kinds of concepts are, as opposed to them being contingent on the specific ways humans think. My guess is an AI might very well spend sufficient cycles on figuring out human morality after it has access to a solarsystem level of compute, but I think that is unlikely to happen before it has disempowered us, so the ordering here matters a lot (see e.g. my response to Zack above). So I think there are three separate points here that I think have caused confusion and probably caused us to talk past each other for a while, all of which I think were things that Eliezer was t

2jacob_cannell3y

Looking over that it just seems to be a straightforward extrapolation of EY's earlier points, so I'm not sure why you thought it was especially relevant. Yeah - this is his core argument against Hibbard. I think Hibbard 2001 would object to 'low-powered', and would probably have other objections I'm not modelling, but regardless I don't find this controversial. Yeah, in agreement with what I said earlier: ... I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp clear concept of a goal, then having the AI optimize for this (point towards it) is straightforward. Long term robustness may be a different issue, but I'm assuming that's mostly covered under "very crisp clear concept". The issue of course is that what humans actually want is complex for us to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can't fully formally specify it. I think this is something of a red herring. Humans can reasonably predict utility functions of other humans in complex scenarios simply by simulating the other as self - ie through empathy. Also happiness probably isn't the correct thing - probably want the AI to optimize for our empowerment (future optionality), but that's whole separate discussion. Sure, neither do I. A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number, but a k-categorical variable is just the explicit binned model of a log(k) bit number, so these really aren't that different, and there are many interpolations between. (and in fact sometimes it's better to use the more expensive categorical model for regression ) Video frames? The utility function needs to be over future predicted world states .. which you could I guess use to render out videos, but text rendering are probably more na

8jacob_cannell3y

Our best conditional generative models sample from a conditional distribution, they don't optimize for feature-ness. The GAN analogy is also mostly irrelevant because diffusion models have taken over for conditional generation, and Nate's comment seems confused as applied to diffusion models.

8Daniel Kokotajlo3y

Nate's comment isn't confused, he's not talking about diffusion models, he's talking about the kinds of AI that could take over the world and reshape it to optimize for some values/goals/utility-function/etc.

Katja says:

You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces?

Nate's comment:

B) wake me when the allegedly maximally-facelike image looks human;

Katja is talking about current ML systems and how the fragility issue EY predicted didn't materialize (actually it arguably did in earlier systems). Nate's comment is clearly referencing Katja's analogy - faciness - and he's clearly implying we haven't seen the problem with face generators yet because they haven't pushed the optimization hard enough to find the maximally-facelike image. But he's just wrong there - they don't have that problem, no matter how hard you scale their optimization power - and that is part of why Katja's analogy works so well at a deeper level: future ML systems do not work the way AI risk folks thought they would.

Diffusion models are relevant because they improve on conditional GANs by leveraging powerful pretrained discriminative foundation models and by allowing for greater optimization power at inference time, improvements that also could be applied to planning agents.

3habryka3y

ML systems still use plenty of reinforcement learning, and systems that apply straightforward optimization pressure. We've also built a few systems more recently that do something closer to trying to recreate samples from a distribution, but that doesn't actually help you improve on (or even achieve) human-level performance. In order to improve on human level performance, you either have to hand-code ontologies (by plugging multiple simulator systems together in a CAIS fashion), or just do something like reinforcement learning, which then very quickly does display the error modes everyone is talking about. Current systems do not display a lack of edge-instantiation behavior. Some of them seem more robust, but the ones that do also seem fundamentally limited (and also, they will likely still show edge-instantiation for their inner objective, but that's harder to talk about). And also just to make the very concrete point, Katja linked to a bunch of faces generated by a GAN, which straightforwardly has the problems people are talking about, so there really is no mismatch in the kinds of things that Katja is talking about, and Nate is talking about. We could perform a more optimized search for things that are definitely faces according to the discriminator, and we would likely get something horrifying.

5jacob_cannell3y

Sure you could do that, but people usually don't - unless they intentionally want something horrifying. So if your argument is now "sure, new ML systems totally can solve the faciness problem, but only if you choose to use them correctly" - then great, finally we agree. Interestingly enough in diffusion planning models if you crank up the discriminator you get trajectories that are higher utility but increasingly unrealistic. You get lower utility trajectories by cranking down the discriminator.

2cfoster03y

Clarifying questions, either for you or for someone else, to aid my own confusion: What does "applying optimization pressure" mean? Is steering random noise into the narrow part of configuration space that contains plausible images-of-X (the thing DDPMs and GAN generators do) a straightforward example of it? EDIT: Split up above question into two.

5acgt3y

This feels like something we should just test? I don’t have access to any such model but presumably someone does and can just run the experiment? Because it seems like people's hunches are varying a lot here

Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.

We may be already doing that in case of cartoon faces with their exaggerated features. Cartoon faces don't look eldritch to us, but why would they?

4Rudi C3y

They are still smooth and have low-frequency patterns, which seems to be the main difference from adversarial examples currently produced from DL models.

3TurnTrout3y

Yeah. Wake me up when we find a single agent which makes decisions by extremizing its own concept activations. EG I'm pretty sure that people don't reflectively, most strongly want to make friends with entities which maximally activate their potential-friend detection circuitry.

4David Scott Krueger (formerly: capybaralet)3y

(sort of nitpicking): I think it makes more sense to look for the highest density in pixel space; this requires integrating over all settings of the latents (unless your generator is invertible, in which case you can just use change of variables formula). I expect the argument to go through, but it would be interesting to do this with an invertible generator (e.g. normalizing flow) and see if it actually does.
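
For reference, the two cases mentioned above can be written out. The notation is assumed here rather than taken from the comment: z denotes the latents with prior p_Z, g is the generator, and p_X is the density the generator induces in pixel space.

```latex
% General (possibly non-invertible) generator: marginalize over all latent settings.
p_X(x) = \int p(x \mid z)\, p_Z(z)\, dz

% Invertible generator x = g(z), e.g. a normalizing flow: change of variables.
p_X(x) = p_Z\!\left(g^{-1}(x)\right) \left| \det \frac{\partial g^{-1}(x)}{\partial x} \right|

% The highest-density image is then
x^\star = \arg\max_x \, p_X(x)
% which in general differs from g(\arg\max_z p_Z(z)) because of the Jacobian factor.
```

The Jacobian factor is why the most likely latent and the most likely image need not coincide, which is presumably the reason for suggesting the search be done in pixel space.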

Could someone clarify the relevance of ribosomes?

9interstice3y

A working example of nanotechnology.

1Oliver Sourbut3y

(and 'self-replicating' for some reasonable operationalisation)

Also from Ronny:

There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.
That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.

4Jeff Rose3y

As an empirical fact, humans are not perfect human face recognizers. It is something humans are very good at, but not perfect. We are definitely much better recognizers of human faces than of worlds high in human values. (I think it is perhaps more relevant to say consensus on what constitutes a human face is much, much higher than what constitutes a world high in human values.) I am unsure whether this distinction is relevant for the substance of the argument however.

3Rob Bensinger3y

(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)

5acgt3y

Is this based on how these models actually behave or just what the OP expects? Because it seems to just be begging the question if the latter

4Quintin Pope3y

Also, “ask for the most X-like thing” is basically how classifier guided diffusion models work, right?

8jacob_cannell3y

No, not really, they are not arg-maxers. They combine an unconditional generative model (maps from noise to samples of realistic images by learning to denoise) and a discriminative model (maps from images to text) to sample (via iterative GD) from a conditional model (realistic images which the discriminative model would map to the text query). "Asking for the most X-like thing" would be basically ignoring or underweighting the generative model, and that results in deepdream like garbage images (it's one of the main hyperparams in any diffusion model, so this is really easy to try out yourself - samples fully weighted from the discriminator are deepdream garbage at best, samples fully weighted from the unconditional generative model are boring natural texture patterns). Basically the discriminative model learns how language slices up the space of all images, and the generative model crucially learns the actual lower-D embedded geometry of the distribution of realistic images - which is not something that pure discriminative models learn. The discriminative model by itself has no knowledge of what images are realistic, and optimizing solely for its extrema results in nonsense because it takes you far from the complex boundary of realistic images. Nate's response just seems confused on how diffusion models work.
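
The weighting described above can be illustrated with a deliberately tiny sketch. This is not a diffusion model (there is no noise schedule or learned denoiser): it is a 1-D Langevin sampler in which a standard normal stands in for the unconditional generative model and a logistic function stands in for the discriminative model. All names and parameters (`prior_score`, `classifier_grad`, `guidance`, `sharpness`) are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def prior_score(x):
    # Gradient of log N(x; 0, 1): a stand-in for the unconditional generative model.
    return -x


def classifier_grad(x, sharpness=2.0):
    # Gradient of log p(y=1 | x) for a toy logistic classifier sigmoid(sharpness * x).
    return sharpness * (1.0 - sigmoid(sharpness * x))


def guided_sample(guidance, seed, steps=1000, step_size=0.01):
    # Unadjusted Langevin dynamics targeting p(x) * p(y=1 | x) ** guidance.
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    for _ in range(steps):
        drift = prior_score(x) + guidance * classifier_grad(x)
        x += step_size * drift + math.sqrt(2.0 * step_size) * rng.gauss(0.0, 1.0)
    return x


def mean_sample(guidance, chains=200):
    # Average the final state of several independent, seeded chains.
    return sum(guided_sample(guidance, seed) for seed in range(chains)) / chains


if __name__ == "__main__":
    for w in (1.0, 50.0):
        print(f"guidance={w}: mean sample = {mean_sample(w):.2f}")
```

In this toy, raising `guidance` pulls samples further into the tail of the prior, i.e. away from what the "generative model" considers typical. Whether real image models behave the same way at scale is exactly what is disputed in this thread, and the sketch does not settle it.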

1kave3y

Different results here: https://twitter.com/summerstay1/status/1579759146236510209

4TurnTrout3y

Nate's B) currently seems confused. I read a connotation "we need the AGI's learned concepts to be safe under extreme optimization pressure, such that, when extremized, they yield reasonable results (e.g. human faces from maximizing the AI-faceishness-concept-activation of an image)." But I think agents will not maximize their own concept activations, when choosing plans. An agent's values will optimize the world; the values don't optimize themselves. For example, I think that I am not looking for a romantic relationship which maximally activates my "awesome relationship" concept, if that's a thing I have. It's true that conditional on such a plan being considered, my relationship-shard might bid for that plan with strength monotonically increasing on "predicted activation of awesome-relationship". And conditional on such a plan getting considered, where that concept activation is maximized, I would therefore be very inclined to pursue that plan. But I think it's not true that my relationship-shard is optimizing its own future activations by extremizing future concept activations. I think that this plan won't get found, and the agent won't want to find this plan. Values are not the optimization target. (This point explained in more detail: Alignment allows "nonrobust" decision-influences and doesn't require robust grading)

Great post!

A. Contra “superhuman AI systems will be ‘goal-directed’”

I somewhat agree, see Consequentialism & Corrigibility. I’m a bit unclear on whether this is intended as an argument for “AGI almost definitely won’t have a zealous drive to control the universe” versus “AGI won’t necessarily have a zealous drive to control the universe”. I agree with the latter but not the former.

Also, the more different groups make AGIs, the more likely it is that someone will make one with a “zealous drive to control the universe”. Then we have to think about whether the non-zealous ones will have solved the problem posed by the zealous ones. In this context, there starts to be a contradiction between “we don’t need to worry about the non-zealous ones because they won’t be doing hardcore long-term consequentialist planning” versus “we don’t need to worry about the zealous ones because the non-zealous ones are so powerful and foresightful that, whatever plan the latter might come up with, the former can preemptively think of it and defend against it”. More on this topic in a forthcoming post hopefully in the next couple weeks. (EDIT—I added the link)

B. Contra “goal-directed AI systems’ goals

... (read more)

2Yitz3y

yeah, I suspect the largest bottleneck there is that trying to destroy the world is so strongly against human values that there are ~0 people (who aren't severely mentally ill) who are genuinely trying to do that.

Thanks for writing this!

Regarding your point on corporations: One of the reasons to worry about some forms of AI is that they might soon build other, more powerful forms of AI. So the development of very human-like Ems, for example, might lead relatively quickly to the development of de novo AI, and so on; hence we worry about Ems even if we think extremely human-like Ems do not pose an x-risk on their own. In the same way, corporations are the ones moving forward fastest on building ML-based AI, and the misalignment between corporations and the long-term future of life on Earth is a very significant cause of the overall level of AI-related x-risk in the world today. So if someone had said 500 years ago "hey let's not build corporations because they will probably be subtly or overtly misaligned with us and that will lead to the destruction of life on Earth", then fast-forward to today and it seems like that person has been proven correct.

Here are my quick takes from skimming the post.

In short, the arguments I think are best are A1, B4, C3, C4, C5, C8, C9 and D. I don't find any of them devastating.

A1. Different calls to ‘goal-directedness’ don’t necessarily mean the same concept

I am not sure I parse this one. I am reading it as "AI systems might be more like imitators than optimizers" from the example, which I find moderately persuasive

A2. Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk

I am not sure I understand this one either. I am reading it as "there might be no incentive for generality" which I don't find persuasive - I think there is a strong incentive

B1. Small differences in utility functions may not be catastrophic

I don't find this persuasive. I think the evidence from optimization theory, where optimizers tend to set variables to extreme values, is enough to suggest this is not the default

B2. Differences between AI and human values may be small
B3. Maybe value isn’t fragile

The only example we have of general intelligence (humans) seems to have strayed pretty far from evolutionary incentives, so I find this unpersuasive

B4. [AI might only care about] Short-term goals

I find ... (read more)

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).

I have now published a conversation between Ege Erdil and Ronny Fernandez about this post. You can find it here.

One might argue that there are defeating reasons that corporations do not destroy the world: they are made of humans so can be somewhat reined in; they are not smart enough; they are not coherent enough. But in that case, the original argument needs to make reference to these things, so that they apply to one and not the other.

I don't think this is quite fair. You created an argument outline that doesn't directly reference these things, so you can only blame yourself for excluding them unless you are claiming that such things have not been discussed extensively.

One extremely important difference between corporations and potential AGIs is the level of high-speed, high-bandwidth coordination (which has been discussed extensively) that may be possible for AGIs. If a massive corporation could be as internally coordinated and self-aligned as might be possible for an AGI, it would be absolutely terrifying. Imagine Elon Musk as a Borg Queen with everyone related to Tesla as part of the "collective" under his control...

Competence does not seem to aggressively overwhelm other advantages in humans:
[...]
g. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between the amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here).

The usage of capabilities/competence is inconsistent here. In points a-f, you argue that general intelligence doesn't aggressively overwhelm other advantages in humans. But in point g, the Elo difference between the best and worst players is less determined by general intelligence than by how much practice people have had.
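
For a sense of scale, the rating gaps quoted above can be converted into expected game scores with the standard Elo logistic formula. The function name `elo_expected_score` is a label chosen here, and the gaps are simply the figures mentioned in the quoted passage, not new data.

```python
def elo_expected_score(rating_gap):
    """Expected score (win = 1, draw = 0.5) of the higher-rated player
    under the standard Elo model, for a given rating gap in points."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


if __name__ == "__main__":
    for gap in (0, 400, 2000):
        print(gap, elo_expected_score(gap))
```

Under this model every 400 points multiplies the odds by ten, so a 2000-point gap corresponds to odds of roughly 100,000 to 1 in favor of the stronger player.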

If we instead consistently talk about domain-relevant skills: In the real world, we do see huge advantages from havin... (read more)

Thank you for posting this, as I find it helpful for practicing my own skills of argumentation. Here are my brief counterarguments to your counterarguments, I'd appreciate it if anyone could point out any flaws in my logic:

A. Contra "superhuman AI systems will be goal-directed"
As far as I understand it, "intelligence" is the ability to achieve one's goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent -... (read more)

8TW1233y

Your argument seems to be: 1. Definitionally, intelligence is the ability to achieve one's goals. 2. Less goal-directed systems are less intelligent. 3. Less intelligent systems will always lose in competition. 4. Less goal directed systems will always lose in competition. Defining intelligence as goal-directedness doesn't do anything for your argument. It just kicks the can down the road. Why will less intelligent (under your definition, goal directed) always lose in competition? Romance is a canonical example of where you really don't want to be all powerful (if real romance is what you want). Romance could not exist if your romantic partner always predictably did everything you ever wanted. The whole point is they are a different person, with different wishes, and you have to figure out how to navigate that and its unpredictabilities. That is the "fun" of romance. So no, I don't think everyone would really use that magic wand.

3Karl von Wendt3y

Thank you very much for your input! Admittedly, my reply to A was a bit short. I only wanted to point out that intelligence is closely linked to goal-directedness, not that they're the same thing (heat-seeking missiles are stupid, but very goal-directed entities, for example). A very intelligent system without a goal would just sit around, doing nothing. It might be able to potentially act intelligently, but without a goal it would behave like an unintelligent system. "Always" may be too strong a word, but if system X is more intelligent and wants to reach a conflicting goal much more than system Y, chances are that system X will get what it wants. I disagree. Being all-powerful does not mean always doing everything you want, or everything your partner wants. It means being able to do whatever you want, or maybe more importantly, whatever you feel you need to do. If, for example, I needed the magic wand to prevent the untimely death of someone I love, I would use it without a second thought. I tend to agree, but I guess there are many people who have been less lucky in their relationships than I have, being happily together with my wife for more than 44 years. :) Maybe not everyone and certainly not all the time, but I'm quite sure that most people would use it at least once in a while.

Eight examples, no cherry-picking:

Nit: Having a wall of images makes this post unnecessarily harder to read.
I'd recommend making a 4x2 collage with the photos so they don't take that much space.

9habryka3y

I edited it to be a table (my guess is this was primarily the result of images being displayed differently by default on the AI Impacts website and LessWrong).

I really like this post. I also like that you provide concrete and specific observables which you think would obtain under each counterargument. I found it refreshing to imagine so many non-orthodox futures.

Small differences in utility functions may not be catastrophic

For three months, I have been sitting on a post (originally) called "What's up with humans with different values not wanting to kill each other?". It seems to me like "value has to be perfect or Goodhart into oblivion" just... doesn't make sense, that isn't how the world works AFAICT. But I g... (read more)

Promoted to curated: I found engaging with this post quite valuable. I think in the end I disagree with the majority of arguments in it (or at least think they omit major considerations that have previously been discussed on LessWrong and the AI Alignment Forum), but I found thinking through these counterarguments and considering each one of them seriously a very valuable thing to do to help me flesh out my models of the AI X-Risk space.

Ege Erdil gave an important disanalogy between the problem of recognizing/generating a human face, and the problem of either learning human values, or learning what plans that advance human values are like. The disanalogy is that humans are near perfect human face recognizers, but we are not near perfect valuable world-state or value-advancing-plan recognizers. This means that if we trained an AI to either recognize valuable world-states or value-advancing plans, we would actually end up just training something that recognizes what we can recognize as val... (read more)

5cubefox3y

I'm confused, which GAN faces look like "horrible monstrosities"!?

3Ronny Fernandez3y

I assumed he meant the thing that most activates the face detector, but from skimming some of what people said above, seems like maybe we don't know what that is.

However if we think that utility maximization is difficult to wield without great destruction, then that suggests a disincentive to creating systems with behavior closer to utility-maximization. Not just from the world being destroyed, but from the same dynamic causing more minor divergences from expectations, if the user can’t specify their own utility function well.

A strategically aware utility maximizer would try to figure out what your expectations are, satisfy them while preparing a take-over, and strike decisively without warning. We should not expect to see an intermediate level of "great destruction".

There is a brief golden age of science before the newly low-hanging fruit are again plucked and it is only lightning fast in areas where thinking was the main bottleneck, e.g. not in medicine.

Not one of the main points of the post, but FWIW it seems to me that thinking could be considered the main bottleneck for medicine, if we can include simulation and modeling a la AlphaFold as thinking.

My guess is that with sufficient computation you could invent new treatments / drugs that are so overwhelmingly better than what we have now that regulatory or other bot... (read more)

Here's a selection of notes I wrote while reading this (in some cases substantially expanded with explanation).

The reason any kind of ‘goal-directedness’ is incentivised in AI systems is that then the system can be given an objective by someone hoping to use their cognitive labor, and the system will make that objective happen. Whereas a similar non-agentic AI system might still do almost the same cognitive labor, but require an agent (such as a person) to look at the objective and decide what should be done to achieve it, then ask the system for that. G

... (read more)

I expect you could build a system like this that reliably runs around and tidies your house say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).

I agree, but if there is any kind of evolutionary variation in the thing then surely the variations that move towards stronger goal-directedness will be favored.

I think that overcoming this molochian dynamic is the alignment problem: how do you build a powerful system ... (read more)

I really appreciate this post!

For instance, employers would often prefer employees who predictably follow rules than ones who try to forward company success in unforeseen ways.

Fascinatingly, EA employers in particular seem to seek employees who do try to forward organization goals in unforeseen ways!

I just want to say that I appreciate this post, and especially the "What it might look like if this gap matters" sections. They were super useful for contextualizing the more abstract arguments, and I often found myself scrolling down to read them before actually reading the corresponding section.

FWIW this post made me update in favor of AI X-risk, as I had not read counterarguments until now and expected stronger ones.

The argument overall proves too much about corporations

Does it? Aren't corporations the ones building ASI right now?

A few thoughts that occurred while reading

If a hundred thousand people sometimes get together for a few years and make fantastic new weapons, you should not expect an entity somewhat smarter than a person to make even better weapons. That’s off by a factor of about a hundred thousand.

Intelligence and speed might need to be considered separately. If an AI is only as smart as a human, but can run much faster, then "one AI" could potentially be more closely analogous to one human civilization than to one human.

Another line of evidence is that for

... (read more)

Speed of intelligence growth is ambiguous

Three months ago, I learned that narcolepsy patients quite literally experience sleep and unconsciousness asynchronously, and synchronization is normally achieved through regulatory cells that produce hypocretin. Hypocretin, like anesthesia, acts on neuron microtubules. This has led me to a greatly increased interest and confidence in theory surrounding neuron microtubules as a processing unit, and I wonder if anyone in the AI community has considered the implications.

If microtubule lattices are storing or calcul... (read more)

A) You seem to agree that in principle more goal-directed agents would be more capable. I think this alone implies that those will be the dominant force in the future no matter if they are rare among many less goal-directed agents.

B) I'm deeply unsure about this and have conflicting intuitions. On the one hand, if you think total utilitarianism is true, any world where AI is not explicitly maximizing for total utility is much, much worse than one where it is. On the other hand, I agree that humans are able to agree.

C) I think you are missing two key features... (read more)

Thus in order to arrive at a conclusion of doom, it is not enough to argue that we cannot align AI perfectly.

I am open to being corrected, but I do not recall ever seeing a requirement of "perfect" alignment in the cases made for doom. Eliezer Yudkowsky in "AGI Ruin: A List of Lethalities" only asks for 'this will not kill literally everyone'.

2Jeff Rose3y

My impression is that there has been a variety of suggestions about the necessary level of alignment. It is only recently that "don't kill most of humanity" has been suggested as a goal, and I am not sure that the suggestion was meant to be taken seriously. (Because if you can do that, you can probably do much better; the point of that comment, as I understand it, was that we aren't even close to being able to achieve even that goal.)

Without investigating these empirical details, it is unclear whether a particular qualitatively identified force for goal-directedness will cause disaster within a particular time.

A sufficient criterion for a desire to cause catastrophe (distinct from having the means to cause catastrophe) is that the AI is sufficiently goal-directed to be influenced by Stephen Omohundro's "Basic AI Drives".

For instance, take an entity with a cycle of preferences, apples > bananas = oranges > pears > apples. The entity notices that it sometimes treats oranges as better than pears and sometimes worse. It tries to correct by adjusting the value of oranges to be the same as pears. The new utility function is exactly as incoherent as the old one.

It is possible that an AI will try to become more coherent and fail, but we are worried about the most capable AI and cannot rely on the hope that it will fail such a simple task. Being coherent is easy if the fruits are instrumental: Just look up the prices of the fruits.
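To make the fruit example concrete, here is a minimal sketch in Python. It treats a set of preferences as coherent when some utility function could satisfy all of them, which is the case exactly when the strict preferences contain no cycle once indifferent items are merged. The function names, the encoding of the "adjusted" preferences (oranges set equal to pears, everything else unchanged), and the prices are all assumptions made for illustration, not anything stated in the post or the comment.

```python
def is_coherent(strict, equal):
    """Return True if some utility function satisfies every stated preference.

    strict: list of (better, worse) pairs.
    equal:  list of (a, b) pairs the entity is indifferent between.
    """
    # Merge items the entity is indifferent between into one class (union-find).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in equal:
        parent[find(a)] = find(b)

    # Build a graph of strict preferences between the merged classes.
    graph = {}
    for better, worse in strict:
        b, w = find(better), find(worse)
        if b == w:
            return False  # strictly preferred to something it is equal to
        graph.setdefault(b, set()).add(w)
        graph.setdefault(w, set())

    # A utility function exists exactly when this graph has no cycle.
    state = {}  # node -> "visiting" or "done"

    def has_cycle(node):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return True
            if nxt not in state and has_cycle(nxt):
                return True
        state[node] = "done"
        return False

    return not any(node not in state and has_cycle(node) for node in graph)


def preferences_from_prices(prices):
    """Derive pairwise preferences from a single number per item."""
    items = sorted(prices)
    strict, equal = [], []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if prices[a] > prices[b]:
                strict.append((a, b))
            elif prices[a] < prices[b]:
                strict.append((b, a))
            else:
                equal.append((a, b))
    return strict, equal


# Original cycle: apples > bananas = oranges > pears > apples.
OLD_STRICT = [("apples", "bananas"), ("oranges", "pears"), ("pears", "apples")]
OLD_EQUAL = [("bananas", "oranges")]

# Assumed reading of the adjustment: oranges are now valued the same as pears.
NEW_STRICT = [("apples", "bananas"), ("pears", "apples")]
NEW_EQUAL = [("bananas", "oranges"), ("oranges", "pears")]

# Hypothetical prices, standing in for "just look up the prices of the fruits".
PRICES = {"apples": 3, "bananas": 2, "oranges": 2, "pears": 1}
```

Under these assumptions both the old and the adjusted preferences fail the check, while preferences read off a single price per fruit pass it by construction, since any ranking derived from one number per item cannot contain a cycle.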

"AI agents may not be radically superior to combinations of humans and non-agentic machines"

I'm not sure that the evidence supports this unless the non-agentic machines are also AI.

In particular: (i) humans are likely to subtract from this mix and (ii) AI is likely to be better than non-AI.

In the case of chess, after two decades of non-AI programming advances from the time that computers beat the best human, involving humans no longer provides an advantage over just using the computer programs. And, Alpha Zero fairly de... (read more)

4Johannes Treutlein3y

(I think Stockfish would be classified as AI in computer science. I.e., you'd learn about the basic algorithms behind it in a textbook on AI. Maybe you mean that Stockfish was non-ML, or that it had handcrafted heuristics?)

0Jeff Rose3y

My understanding is that starting in late 2020 with the release of Stockfish 12, Stockfish would probably be considered AI, but before that it would not be. I am, of course, willing to change this view based on additional information. The original Alpha Zero- Stockfish match was in 2017, so if the above is correct, I think referring to Stockfish as non-AI makes sense.

Talking concretely, what does a utility function look like that is so close to a human utility function that an AI system has it after a bunch of training, but which is an absolute disaster?

A simple example could be that the humans involved in the initial training are negative utilitarians. Once the AI is powerful enough, it would be able to implement omnicide rather than just curing diseases.

I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’

I think in its roots, AGI should have survival instinct as a goal. Everything else should be secondary. It's a hard choice, but if we want AGI to be like us, we have to follow that route. If its roots are different from ours, it will be close to impossible to replicate our behavior and our values.

I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present though see how one would come to think it was overwhelmingly likely.

Suppose you went through the following exercise. For each scenario described under "What it might look like if this gap matters", ask:

Is this an existentially secure state of affairs?
If not, what are the main obstacles to reaching existential security from here?

and collected the obstacles, you might assemble a list like this one, which might update you toward AI x-risk being "overwhelmingly likely". (Personally, if I had to put a number on it, I'd say 80%.)

Agree directionally. I made a similar point in my review of "Is power-seeking AI an existential risk?":

In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.

5jacob_cannell3y

2jacob_cannell3y

1Rob Bensinger3y

4jacob_cannell3y

Ok that is an unexpected interpretation as it's not how I typically think of 'risk', but yes if that's the intended interpretation it resolves my objection.

-1awg3y

To have human values the AI needs to either learn them or have them instilled. EY’s argument about the complexity and fragility of human values is directed against early proposals for learning human values for an AI's utility function. Obviously at some point a powerful AI will learn a model of human values somewhere in its world model, but that is irrelevant because that doesn’t affect its utility function and it’s far too late - the AI needs a robust model of human values well before it becomes superhuman.

Katja's point is valid - DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI (which again is completely unrelated to the AI later learning human values somewhere in its world model).

6jacob_cannell3y

Sim boxing can solve deceptive alignment (and may be the only viable solution)

3Noosphere893y

4hairyfigment3y

6jacob_cannell3y

2hairyfigment3y

That's clearly exactly what it does today? It seems I disagree with your point on a more basic level than expected. ETA:

2jacob_cannell3y

2hairyfigment3y

2jacob_cannell3y

7hairyfigment3y

4the gears to ascension3y

as someone who often agrees with jake, cmon jake, own up to it, EY has said reasonable things before and you were wrong :P edit: oops meant to reply to @jacob_cannell

2jacob_cannell3y

Wrong about what? Of course EY has said many reasonable and insightful things

2jacob_cannell3y

What post? All I quoted recently was "Complex Value Systems are Required to Realize Valuable Futures", which does not appear to contain the word 'sculpture'.

4Noosphere893y

Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:

briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load
(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a tweet of engagement, but in reality I probably won't, sorry.)

I asked Nate what he meant by B, and he said:

section B seemed to me to be saying "AIs can figure out what a face is". And, ok, sure, but if you ask them for the faciest possible thing, it's not very human!facelike.
which is one of many objections, ofc (others including "ah yes but can you aim it at a human concept" )

Note: "ask them for the faciest possible thing" seems confused.

EDIT: Here is what the first looks like for StyleGAN2-ADA.

(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)

EY argues that human values are hard to learn. Katja uses human faces as an analogy, pointing out that ML systems learn natural concepts far easier than EY 2009 expected.

Function A (human face generator) does not even use max-likelihood sampling and it isn't even an optimizer, so your operationalization is just confused. Nor is function B an optimizer itself.

Your comment here about "optimizing for X-ness" indicates you also were adopting the wrong model of how diffusion models operate:

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
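The distinction drawn in this passage, between the most likely sample under a prior and the sample that maximizes some quality function, can be shown with a toy example. This is only a sketch: the candidate essays, their prior probabilities, and their quality scores are made-up numbers, and the helper names are hypothetical.

```python
# Hypothetical candidates: (description, prior probability under a model of
# typical human essays, score from some essay-quality function).
CANDIDATES = [
    ("generic five-paragraph essay", 0.60, 0.40),
    ("competent but bland essay", 0.30, 0.60),
    ("unusual, excellent essay", 0.10, 0.95),
]


def most_likely(candidates):
    """Pick the candidate with the highest prior probability."""
    return max(candidates, key=lambda c: c[1])[0]


def highest_quality(candidates):
    """Pick the candidate with the highest quality score."""
    return max(candidates, key=lambda c: c[2])[0]
```

With these numbers the two selection rules return different essays, which is all the passage needs: a system that models the distribution well is answering a different question from a system that searches for the top of a quality function.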

(Unimportant nitpicking: This Person Does Not Exist doesn't actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)

Interpretations

First a reply to interpretations of previous words:

I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I'm confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.

Optimizing only for faciness via a discriminator does not work well - that's the old deepdream approach. Opti... (read more)

9David Johnston3y

The claim that every increase in regularisation makes performance worse is extraordinary, given everything I know about machine learning.

6cfoster03y

FYI: Planning with diffusion is being tried and seemingly works.

8David Johnston3y

3Xodarap3y

Upvoted because I agree with all of the above.

Hmm, but I don't understand what relevance it has to alignment. The problem was never that the AI won't learn human values, it's that the AI won't care about human values

I don't think "empowerment" is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.

Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us

the point "the AI will of course know what your values are, it just won't care" was made many times, and I am also pretty sure was made in the sequences

I don't think "empowerment" is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.

I'm not entirely sure what people mean when they say "X won't survive heavy optimization pressure" - but for example the objective of modern diffusion models survives heavy optimization power.

4Noosphere893y

Basically, it's Goodhart's law in action, where optimizing a proxy more and more destroys what you value.
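A toy illustration of that dynamic, offered as a sketch rather than a model of any real system: the proxy and the true value below are invented functions that agree at first and then come apart, and the "optimizer" is a trivial loop that only ever looks at the proxy.

```python
def proxy(x):
    """What the optimizer measures: always goes up with x."""
    return x


def true_value(x):
    """What we actually care about: rises until x = 5, then falls."""
    return 10 * x - x * x


def optimize_proxy(steps):
    """Greedy hill-climbing on the proxy alone, starting from x = 0."""
    x = 0
    for _ in range(steps):
        if proxy(x + 1) > proxy(x):
            x += 1
    return x
```

In this toy, five steps of optimization land on the point where the true value peaks (25), while twenty steps keep pushing the proxy up and drive the true value down to -200. Whether a given real objective behaves like this under optimization pressure is exactly what the thread is disputing.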

1jacob_cannell3y

6Noosphere893y

However due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling - and empowerment simply is that which they converge to.

Therefore empowerment is - by definition - the best possible proxy utility function (under optimization scaling).

Let's apply some quick examples:

Under scaling, an AI with some crude stock-value maximizin... (read more)

8interstice3y

6Ben Pace3y

Do you have a link to where Eliezer (or any other LW writer) said that? I don’t myself recall whether they said that.

9jacob_cannell3y

5habryka3y

7jacob_cannell3y

EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures".

Now to go back to the object level:

This is really misunderstanding what Eliezer is saying here [...] it's been a decade of explaining to people almost once every two weeks that "yes, the AI will of course know what you care about, but it won't care", so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me

I think this is much more ambiguous than you're making it out to be. In 2008's "Magical Categories", Yudkowsky wrote:

I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

8habryka3y

9Richard Korzekwa3y

2habryka3y

2jacob_cannell3y

5habryka3y

7jacob_cannell3y

6habryka3y

2jacob_cannell3y

8jacob_cannell3y

8Daniel Kokotajlo3y

Katja says:

You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces?

Nate's comment:

B) wake me when the allegedly maximally-facelike image looks human;

3habryka3y

5jacob_cannell3y

2cfoster03y

5acgt3y

We may already be doing that in the case of cartoon faces with their exaggerated features. Cartoon faces don't look eldritch to us, but why would they?

4Rudi C3y

They are still smooth and have low-frequency patterns, which seems to be the main difference from adversarial examples currently produced from DL models.

3TurnTrout3y
