Steelmaning AI risk critiques

Stuart_Armstrong

At some point soon, I'm going to attempt to steelman the position of those who reject the AI risk thesis, to see if it can be made solid. Here, I'm just asking if people can link to the most convincing arguments they've found against AI risk.

EDIT: Thanks for all the contribution! Keep them coming...

A ULM also requires a utility function or reward circuitry with some initial complexity, but we can also use the same universal learning algorithms to learn that component. It is just another circuit, and we can learn any circuit that evolution learned.

Okay, so we just have to determine human terminal values in detail, and plug them into a powerful maximizer. I'm not sure I see how that's different from the standard problem statement for friendly AI. Learning values by observing people is exactly what MIRI is working on, and it's not a trivial problem.

For example: say your universal learning algorithm observes a human being fail a math test. How does it determine that the human being didn't want to fail the math test? How does it cleanly separate values from their (flawed) implementation? What does it do when peoples' values differ? These are hard questions, and precisely the ones that are being worked on by the AI risk people.

Other points of critique:

Saying the phrase "safe sandbox sim" is much easier than making a virtual machine that can withstand a superhuman intelligence trying to get out of it. Even if your software is perfect, it can still figure out that its world is artificial and figure out ways of blackmailing its captors. Probably doing what MIRI is looking into, and designing agents that won't resist attempts to modify them (corrigibility) is a more robust solution.

You want to be careful about just plugging in a learned human utility function into a powerful maximizer, and then raising it. If it's maximizing its own utility, which is necessary if you want it to behave anything like a child, what's to stop it from learning human greed and cruelty, and becoming an eternal tyrant? I don't trust a typical human to be god.

And even if you give up on that idea, and have to maximize a utility function defined in terms of humanity's values, you still have problems. For starters, you want to be able to prove formally that its goals will remain stable as it self-modifies, and it won't create powerful sub-agents who don't share those goals. Which is the other class of problems that MIRI works on.

Okay, so we just have to determine human terminal values in detail, and plug them into a powerful maximizer.

Why do you even go around thinking that the concept of "terminal values", which is basically just a consequentialist steelmanning Aristotle, cuts reality at the joints?

For starters, you want to be able to prove formally that its goals will remain stable as it self-modifies

That part honestly isn't that hard once you read the available literature about paradox theorems.

0jacob_cannell11y

No - not at all. Perhaps you have read too much MIRI material, and not enough of the neuroscience and machine learning I referenced. An infant is not born with human 'terminal values'. It is born with some minimal initial reward learning circuitry to bootstrap learning of complex values from adults. Stop thinking of AGI as some wierd mathy program. Instead think of brain emulations - and then you have obvious answers to all of these questions. [...] You apparently didn't read my article or links to earlier discussion? We can easily limit the capability of minds by controlling knowledge. A million smart evil humans is dangerous - but only if they have modern knowledge. If they have only say medieval knowledge, they are hardly dangerous. Also - they don't realize they are in a sim. Also - the point of the sandbox sims is to test architectures, reward learning systems, and most importantly - altruism. Designs that work well in these safe sims are then copied into less safe sims and finally the real world. Consider the orthogonality thesis - AI of any intelligence level can be combined with any values. Thus we can test values on young/limited AI before scaling up their power. Sandbox sims can be arbitrarily safe. It is the only truly practical workable proposal to date. It is also the closest to what is already used in industry. Thus it is the solution by default. [...] Ridiculous nonsense. Many humans today are aware of the sim argument. The gnostics were aware in some sense 2,000 years ago. Do you think any of them broke out? Are you trying to break out? How? [...] Again, stop thinking we create a single AI program and then we are done. It will be a largescale evolutionary process, with endless selection, testing, and refinement. We can select for super altruistic moral beings - like bhudda/gandhi/jesus level. We can take the human capability for altruism, refine it, and expand on it vastly. [...] Quixotic waste of time.

36

Steelmaning AI risk critiques

36

36

36

Steelmaning AI risk critiques

36

36