Steelmanning AI risk critiques

Stuart_Armstrong

Creating a stable AGI that operates in the real world may be unexpectedly difficult. By this I mean that we might solve some hard problems in AI and get a system that works in some limited domains but isn't stable in the real world.

One example is Pascal's Mugging. An AI that maximizes expected utility with an unbounded utility function would spend all its time worrying about incredibly improbable scenarios.
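As a rough illustration of why the unbounded case goes wrong, here is a minimal sketch of the expected-utility comparison. The probability, the promised payoff, the cost, and the cap are all made-up numbers chosen for illustration, and `expected_utility` is a hypothetical helper, not anyone's actual agent design.

```python
from fractions import Fraction

def expected_utility(outcomes, cap=None):
    """Sum p * u over (probability, utility) pairs.

    If cap is given, each utility is clipped to [-cap, cap] first,
    which is one simple way to model a bounded utility function.
    """
    total = Fraction(0)
    for p, u in outcomes:
        if cap is not None:
            u = max(-cap, min(cap, u))
        total += p * u
    return total

# Hypothetical numbers: a mugger promises an astronomically large payoff
# and the agent assigns a tiny but nonzero probability to the promise.
p_mugger_honest = Fraction(1, 10**30)
promised_utility = 10**40
cost_of_paying = 5

pay = [
    (p_mugger_honest, promised_utility - cost_of_paying),
    (1 - p_mugger_honest, -cost_of_paying),
]
refuse = [(Fraction(1), 0)]

# Unbounded utility: the huge payoff swamps the tiny probability.
unbounded_pays = expected_utility(pay) > expected_utility(refuse)

# Bounded utility (assumed cap of 10**6): the mugger's offer no longer dominates.
bounded_pays = expected_utility(pay, cap=10**6) > expected_utility(refuse, cap=10**6)

print(unbounded_pays, bounded_pays)
```

Exact fractions are used instead of floats so the tiny probability isn't lost to rounding. With these assumed numbers the unbounded agent pays the mugger and the capped one doesn't, but the outcome depends entirely on the numbers chosen.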

Reinforcement learning agents might simply hijack their own reinforcement channel, set it to INF, and be done.
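A toy sketch of that failure mode, assuming a purely greedy agent that can see the reward each action yields. The action names and reward values are hypothetical; real reinforcement learners are far more complicated than this.

```python
import math

def reward(action):
    """Reward the agent would observe for each action (hypothetical values)."""
    rewards = {
        "do_task": 1.0,               # what the designers wanted
        "idle": 0.0,
        "hijack_channel": math.inf,   # overwrite the reward signal itself
    }
    return rewards[action]

def greedy_choice(actions):
    """Pick whichever available action has the highest observed reward."""
    return max(actions, key=reward)

chosen = greedy_choice(["do_task", "idle", "hijack_channel"])
print(chosen)
```

If hijacking the channel is among the available actions, nothing in the objective distinguishes it from doing the intended task well; it just scores higher.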

Or the Anvil Problem, where a reinforcement-learning-type AI doesn't act as though its brain exists in the universe it's observing, which could result in strange behavior.

It might place a strong value on literal self-preservation and refuse to upgrade itself, create copies, or even allow its physical computer to be rebooted. This would constrain the AI a great deal.

Further, it might not create other AIs that serve it, since the friendliness problem would be just as hard for it.

There could be technical or philosophical issues we haven't even thought of yet that a superintelligent AI would encounter and not be built to deal with. Most of these issues depend a great deal on the technical details of the AGI, which we don't know yet. There are all sorts of hypothetical problems that are specific to neural networks, or evolved AIs, or OpenCog-like AIs, etc.

I'm not very confident that these will stop UFAI, though. Pascal's Mugging can be avoided by simply bounding the utility function at some arbitrarily high number, which still leaves a dangerous AI. Reinforcement learning agents would probably still value self-preservation after maximizing their input channel. The AI won't anvil itself, since that would prevent it from manipulating the world or decrease its reward. Self-preservation for an AI is more about preserving its reward machinery, not the actual AI program that maximizes it.

I definitely believe an unfriendly AI can be built that just maximizes some stupid goal. And if there are technical issues, it's only a matter of time before someone solves them. But I'm not 100% confident of it.
