Some snippets:
The conference seemed like a (wildly successful) effort to contribute to the ongoing normalization of the subject. Offer people free food to spend a few days talking about autonomous weapons and biased algorithms and the menace of AlphaGo stealing jobs from hard-working human Go players, then sandwich an afternoon on superintelligence into the middle. Everyone could tell their friends they were going to hear about the poor unemployed Go players, and protest that they were only listening to Elon Musk talk about superintelligence because they happened to be in the area. The strategy worked. The conference attracted AI researchers so prestigious that even I had heard of them (including many who were publicly skeptical of superintelligence), and they all got to hear prestigious people call for “breaking the taboo” on AI safety research and get applauded. Then people talked about all of the lucrative grants they had gotten in the area. It did a great job of creating common knowledge that everyone agreed AI goal alignment research was valuable, in a way not entirely constrained by whether any such agreement actually existed.
5. Related: a whole bunch of problems go away if AIs, instead of receiving rewards based on the state of the world, treat the world as information about a reward function which they only imperfectly understand. For example, suppose an AI wants to maximize “human values”, but knows that it doesn’t really understand human values very well. Such an AI might try to learn things, and if the expected reward was high enough it might try to take actions in the world. But it wouldn’t (contra Omohundro) naturally resist being turned off, since it might believe the human turning it off understood human values better than it did and had some human-value-compliant reason for wanting it gone. This sort of AI also might not wirehead – it would have no reason to think that wireheading was the best way to learn about and fulfill human values.
The technical people at the conference seemed to think this idea of uncertainty about reward was technically possible, but would require a ground-up reimagining of reinforcement learning. If true, it would be a perfect example of what Nick Bostrom et al have been trying to convince people of since forever: there are good ideas to mitigate AI risk, but they have to be studied early so that they can be incorporated into the field early on.
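To make the idea concrete, here is a toy numerical sketch (mine, not anything presented at the conference) in the spirit of the off-switch framing: an agent that is uncertain about the true utility of its planned action, and that treats a rational human's shutdown decision as evidence about that utility, does at least as well by deferring as by acting unilaterally.

```python
# Toy sketch (not from the post): the agent is uncertain about the true
# utility u of its planned action; the human can see u, the agent only
# has a prior over it. Deferring to a rational human weakly beats acting
# unilaterally, so this agent gains nothing by disabling its off switch.
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, size=100_000)    # samples from the agent's prior over u

# Policy A: act immediately, ignoring the human and the off switch.
value_act = u.mean()                       # close to 0: good and bad cases cancel

# Policy B: propose the action and defer; the human allows it only when
# the true utility is actually positive.
value_defer = np.where(u > 0, u, 0.0).mean()   # close to E[max(u, 0)], about 0.4

print(f"act unilaterally: {value_act:+.3f}")
print(f"defer to human  : {value_defer:+.3f}")
```

The gap between the two policies is exactly the value of the information the human has and the agent lacks; an agent that was certain of the utility would gain nothing by deferring, which is where the usual self-preservation incentive comes back in.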
Point 8, about the opacity of decision-making, reminded me of something I'm surprised I haven't seen on LW before:
LIME, Local Interpretable Model-agnostic Explanations, can produce a human-readable explanation of why any classification algorithm made a particular decision. It would be harder to apply the method to an optimizer than to a classifier, but I see no principled reason why an approach like this wouldn't help understand any algorithm that has a locally smooth-ish mapping of inputs to outputs.
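The core recipe is simple enough to sketch from scratch. The following is an illustrative toy version of the idea rather than the lime package's actual interface; the function name, the Gaussian perturbation scheme, and the default parameters are all made up, and a tabular binary classifier with a predict_proba-style function is assumed.

```python
# A from-scratch sketch of the LIME idea, not the lime library's API:
# explain one prediction of a black-box classifier by fitting a simple
# linear model to the black box's behaviour in a small neighbourhood.
import numpy as np
from sklearn.linear_model import Ridge

def explain_locally(predict_proba, x, n_samples=2000, scale=0.3, seed=0):
    """Return per-feature weights approximating the black box near x."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance we want to explain.
    X_pert = x + rng.normal(scale=scale, size=(n_samples, x.size))
    # 2. Query the black box on the perturbed points (binary case assumed:
    #    column 1 is the probability of the class being explained).
    y = predict_proba(X_pert)[:, 1]
    # 3. Weight each perturbation by its proximity to the original point.
    dist = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dist ** 2) / (2 * scale ** 2))
    # 4. Fit an interpretable (linear) surrogate to the local behaviour.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(X_pert, y, sample_weight=weights)
    return surrogate.coef_   # one local "importance" per input feature
```

The returned coefficients are the "explanation": how each feature locally pushes the black box's prediction up or down around that one input.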
I wasn't aware that method had a name, but I've seen the idea suggested before when this topic comes up. For neural networks in particular, you can just look at the gradients of the output with respect to the inputs to see how the output changes as you change each input.
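A minimal sketch of that, assuming a PyTorch model (the tiny untrained network here is just a stand-in for whatever trained model you want to inspect):

```python
# A minimal sketch of the gradient approach, assuming a PyTorch model
# (this tiny untrained network is just a stand-in for a trained one).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

x = torch.randn(1, 4, requires_grad=True)  # the input we want to explain
model(x).sum().backward()                  # computes d(output)/d(input)

# x.grad[0, i] says how sensitive the output is to a small change in
# feature i at this particular input -- a purely local "explanation".
print(x.grad)
```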
I think the problem people have is that this just tells you what the machine is doing, not why. Machine learning can never really offer understanding.
For example, there was a program created specifically for the purpose of producing human-understandable models. It worked by fitting the simplest possible mathematical expression to the data, the hope being that simple mathematical expressions would be easy for humans to interpret.
One biologist found an expression that perfectly fit his data. It was simple, and he was really excited by it. But he couldn't understand what it meant at all. And he couldn't publish it, because how can you publish an equation without any explanation?
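The program described above sounds like symbolic regression, and a toy version of "fit the simplest expression that explains the data" is easy to sketch; the candidate expressions, complexity scores, and penalty weight below are made up purely for illustration.

```python
# Toy sketch of symbolic-regression-style model search: score a handful
# of candidate expression templates by fit plus a complexity penalty.
import numpy as np

def simplest_fit(x, y):
    """Pick the candidate expression with the best fit-plus-simplicity score."""
    candidates = {
        "a*x + b":      (lambda x, a, b: a * x + b,         2),
        "a*x**2 + b":   (lambda x, a, b: a * x**2 + b,      3),
        "a*sin(x) + b": (lambda x, a, b: a * np.sin(x) + b, 3),
        "a*exp(x) + b": (lambda x, a, b: a * np.exp(x) + b, 3),
    }
    best = None
    for name, (f, complexity) in candidates.items():
        # Least-squares fit of the two free parameters a, b.
        A = np.column_stack([f(x, 1.0, 0.0), np.ones_like(x)])
        (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
        mse = np.mean((f(x, a, b) - y) ** 2)
        score = mse + 0.01 * complexity      # penalize complexity
        if best is None or score < best[0]:
            best = (score, name, a, b)
    return best

x = np.linspace(0, 3, 50)
y = 2.0 * np.sin(x) + 0.5                    # hidden "true" law
print(simplest_fit(x, y))                    # recovers a*sin(x) + b
```

Which also shows the biologist's problem: the search can hand back the winning formula, but nothing in the procedure says what the formula means.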
Isn't that exactly what causality and do-notation are for? Generate the "how" answer, and then do causal analysis to get the why.
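To illustrate the distinction with a made-up structural causal model: conditioning on an input (the "how the output co-varies with it" answer a saliency map or LIME gives you) and intervening on it (Pearl's do-operator) give different numbers whenever there is a confounder, and only the interventional one answers a "why / what happens if we change it" question.

```python
# Toy structural causal model (made up here): Z -> X, Z -> Y, X -> Y.
# Observational P(Y | X=1) and interventional P(Y | do(X=1)) come apart
# because Z confounds the X-Y relationship.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

z = rng.binomial(1, 0.5, n)                           # confounder
x_obs = rng.binomial(1, np.where(z == 1, 0.9, 0.1))   # X depends on Z
y_obs = rng.binomial(1, 0.2 + 0.3 * x_obs + 0.4 * z)  # Y depends on X and Z

# Observational: condition on having seen X = 1 (picks up Z's influence too).
p_obs = y_obs[x_obs == 1].mean()

# Interventional: force X = 1 for everyone, leaving Z and Y's mechanism alone.
x_do = np.ones(n, dtype=int)
y_do = rng.binomial(1, 0.2 + 0.3 * x_do + 0.4 * z)
p_do = y_do.mean()

print(f"P(Y=1 | X=1)     ~ {p_obs:.2f}")   # about 0.86
print(f"P(Y=1 | do(X=1)) ~ {p_do:.2f}")    # about 0.70
```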
Yesterday I was reading about a medical student who discovered he suffered from Castleman disease, and who went on to specialize in that disease and to build a network coordinating the various researchers working on it. This is also common among people who suffer from a rare disease, or who have a close relative who died from one (my fiancée is deeply fond of "Mystery diagnosis"): it seems that creating a foundation to raise awareness and coordinate effort is a very common response.
One would think that in the era of globalization and the Web such things would be trivial, but as Scott's notes show, coordination and common knowledge between human beings still add enormous value.
As usual, Scott's writing is fantastic. I'm not sure I updated on much beyond "hey, this conference seemed to increase mainstream acceptance of working on AI safety", but it was fun to read nonetheless.