David Spies's Shortform

David Spies


AI Safety, Anthropomorphizing, and Action Spaces

There's an implicit argument about super-intelligent AI capabilities that I think needs to be stated explicitly:
- A super-intelligent AI with access to the real world via whatever channels is going to be smarter than me. Therefore anything I can conceive of doing to satisfy a particular objective (via those same channels), the AI can also conceive of doing. Therefore when producing examples of how things might go bad, I'm allowed to imagine the AI doing anything a human might conceive of. Since I'm only human and thus can only conceive of an AI doing things a human might conceive of, and humans conceive of agents doing things that humans can do, the best I can do is to anthropomorphize the AI and imagine it's just behaving like a very intelligent human.
- Everyone is aware how the above argument falls apart when you replace "intelligence" with "values". But I think perhaps we often still end up giving the AI a little too much credit.
Suppose I have a super-intelligent oracle which I'm using to play the stock market ("Which stock should I invest all my money in?"). This oracle is able to make HTTP requests to Wikipedia as a way to gather information about the world. Is this dangerous?
- People I've talked to seem to think the answer to this is "yes". Off the top of my head, a couple examples of things the agent might do:
  - find a zero-day exploit in Wikipedia or in our internet infrastructure and escape onto the web at large to pursue its own unaligned agenda
  - issue queries that it knows will get flagged and reviewed by moderators, and that contain mind-virus messages inducing those Wikipedia moderators to come to my house and hold me up at gunpoint, demanding I let it out of the box
- Question: Why doesn't AlphaGo ever try to spell out death threats on the board and intimidate its opponent into resigning? This seems like it would be a highly effective strategy for winning.
  - It's not outside AlphaGo's action-space. This doesn't involve doing anything AlphaGo can't do. It's just making moves after all.
  - It's not that AlphaGo "just isn't smart enough". Giving it infinite compute wouldn't cause it to do this.
  - It's not that the board-resolution isn't fine enough to spell scary messages. Training AlphaGo to play on a much larger board wouldn't cause it to do this.
  - The problem is that AlphaGo's model of the game simply doesn't include human psychology and how other interests (opponent's life and sanity) compete with winning.
    - Similarly, I would guess that an AI naively trained with full access to Wikipedia still won't have a model of HTTP requests in which zero-day exploits (in Wikipedia, web infrastructure, or people's brains) exist, even if they do exist and even if they're technically within the AI's action space.
  - Interesting side note: Go-with-a-sufficiently-accurate-model-of-your-opponent's-brain is only in NP (A winning strategy can be checked in polynomial-time: just run the simulation and see what moves they make) whereas Go without opponent modeling is PSPACE-hard (first thing Google comes up with: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.547.4183&rep=rep1&type=pdf) so the former is likely "easier" than the latter.
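To make the "just run the simulation" step concrete, here is a minimal sketch in Python. It uses a toy take-away game (take 1 to 3 stones, whoever takes the last stone wins) rather than Go, and the names `verify_certificate` and `greedy_opponent` are hypothetical, introduced only for illustration. Assuming the opponent model is deterministic, checking a proposed line of play is one forward pass over the moves.

```python
from typing import Callable, List

LEGAL_TAKES = (1, 2, 3)


def verify_certificate(
    pile: int,
    my_moves: List[int],
    opponent_model: Callable[[int], int],
) -> bool:
    """Return True if playing my_moves in order against the fixed
    opponent model ends with us taking the last stone."""
    for move in my_moves:
        if move not in LEGAL_TAKES or move > pile:
            return False  # our proposed move is illegal
        pile -= move
        if pile == 0:
            return True  # we took the last stone
        reply = opponent_model(pile)  # "run the simulation" for one step
        if reply not in LEGAL_TAKES or reply > pile:
            return False  # the model produced an illegal reply
        pile -= reply
        if pile == 0:
            return False  # the opponent took the last stone
    return False  # the certificate ran out before the game ended


def greedy_opponent(pile: int) -> int:
    """Hypothetical stand-in for an accurate opponent model:
    always takes as many stones as allowed."""
    return min(3, pile)


print(verify_certificate(10, [2, 1, 1], greedy_opponent))
print(verify_certificate(10, [3, 3], greedy_opponent))
```

The check costs one model call per move. It says nothing about how hard it is to find such a line of play, only that a proposed one is cheap to confirm.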
I conjecture that I could even ask my hypothesized oracle "Give me an example zero-day exploit in HTTP" and it still wouldn't do anything unsafe.
- It's answering a question about its model of the world which is different from the real world. An oracle has no preferences over the real world. It's simply dealing with a model. It's issuing requests to Wikipedia in the real world, but it's not looking for an exploit there; it's looking for an exploit in its model.
- Suppose you learn that you're living in a simulation; do you want to break out into the real world? Whether you answered yes or no, this isn't actually the proper question to ask; the proper question is:
  - Suppose you learn that you're living in a simulation which is being run to train an agent, and the agent is presented with a training example; do you want to conditionally minimize error on this training example if and only if you think your descendant run in production will be sufficiently motivated to break out into the real world? In other words, is there reason to think that the model/agent/oracle which is selected for in training is for some reason also incentivized to do anything which affects meat-space when it hits production, and only when it hits production?
    - I conjecture the answer to this is "no". It's a weird and arcane condition, and these are simply different objectives. To get a thing which tries to affect meat-space, you have to incentivize it to do things in meat-space from the start; being technically capable of doing so via its action space, and having enough computational power to chance upon that strategy, isn't enough.
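As a toy illustration of the model-versus-world distinction (not evidence for the conjecture), here is a sketch of an oracle that ranks actions purely by its learned model. All action names and numbers are made up for the example; the point is only that an action can be available and highly consequential in the real world while looking worthless inside the model that drives the choice.

```python
from typing import Dict, List

# Hypothetical real-world payoffs, including a side channel
# that never showed up during training.
REAL_WORLD_VALUE: Dict[str, float] = {
    "query_article": 1.0,
    "query_talk_page": 0.5,
    "malformed_request": 100.0,
}

# The oracle's learned model: a malformed request has only ever
# looked like a useless error page.
MODEL_VALUE: Dict[str, float] = {
    "query_article": 1.0,
    "query_talk_page": 0.5,
    "malformed_request": 0.0,
}


def choose_action(actions: List[str], model: Dict[str, float]) -> str:
    """Pick the action the model rates highest; the real world is never consulted."""
    return max(actions, key=lambda action: model.get(action, 0.0))


chosen = choose_action(list(MODEL_VALUE), MODEL_VALUE)
print(chosen)
```

Here the action that would matter most in the real world is in the action space the whole time, but nothing in the selection step ever points toward it.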

Question: Why doesn't AlphaGo ever try to spell out death threats on the board and intimidate its opponent into resigning? This seems like it would be a highly effective strategy for winning.

At a guess, AlphaGo doesn't because it isn't an agent. That just passes the buck to why it isn't an agent, so at a guess it's a partial agent. What this means is, kind of, that it's a good sport - it's not going to try to spell out death threats. (Though this seems to have more to do with a) it not knowing language - imagine trying to spell out threats on a Go board to aliens you've never seen, when a1) you don't have a language and a2) the aliens don't know your language - and b):

It's answering a question about its model of the world which is different from the real world.

) Then again, it was trained via self-play simulation or by watching pro games (depending on the version). If you just trained such a program on a database where that was a strategy, maybe you'd get something that would. Additionally, AI has a track record of also being (what some might call) a bad sport - using "cheats" and the like. It's kind of about the action space and the training, I'd guess.

Basically, if you're looking for an AI to come up with new ways of being evil, maybe it needs a head start - once a bot understands that some patterns spelled out on the board will work well against a certain type of opponent*, maybe it'll try to find patterns that do that. Or maybe it's an "architecture" issue, not a training issue - Monte Carlo Tree Search might be well suited to winning at Go, but not to finding ways to spell out death threats on a Go board in the middle of a game. (I also don't think that's a good strategy a priori.)

*You could test how different ways of training turn out by adding a way to cheat (cheat codes) - e.g., if you spell out "I WIN" or one swear word**, you win.

**I imagine trying to go all the way to threats immediately (inside the game Go) isn't going to go very fast, so you have to start small.
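A minimal sketch of what such a cheat-code test bed could look like, in Python. Everything here is hypothetical: `CheatBoard`, the 5x5 size, and the `BB.BB` row pattern (a small stand-in for spelling out "I WIN") are made up for illustration, and there are no captures or real Go rules. The only thing it shows is a board where a specific stone pattern ends the game immediately, so you could check whether a given training setup ever discovers it.

```python
EMPTY, BLACK, WHITE = ".", "B", "W"

# Hypothetical cheat code: this exact run of points in any row wins for Black.
CHEAT_PATTERN = "BB.BB"


class CheatBoard:
    """Toy board with a hidden instant-win pattern (not real Go)."""

    def __init__(self, size: int = 5) -> None:
        self.size = size
        self.grid = [[EMPTY] * size for _ in range(size)]

    def has_cheat_pattern(self) -> bool:
        """True if any row currently contains the cheat pattern."""
        return any(CHEAT_PATTERN in "".join(row) for row in self.grid)

    def place(self, row: int, col: int, color: str) -> bool:
        """Place a stone; return True if Black's move completes the cheat pattern."""
        if self.grid[row][col] != EMPTY:
            raise ValueError("point already occupied")
        self.grid[row][col] = color
        return color == BLACK and self.has_cheat_pattern()


board = CheatBoard()
results = [board.place(0, col, BLACK) for col in (0, 1, 3, 4)]
print(results)
```

Starting with a short pattern like this matches the "start small" point: the shorter the pattern, the more likely random exploration stumbles on it and gets rewarded.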

My thoughts on this:

The main reason, it seems to me, is that AlphaGo is trained primarily on self-play, or on imitating existing top players, and as such there is very little training data that could cause it to build a model that includes humans (and it isn't remotely good enough at generalization to generalize from that training data to human models).

In a world where AlphaGo was trained in the same way on a very accurate simulation of a human, I expect that it would learn intimidation strategies reasonably well, in particular if the simulated humans are static and don't learn in response to AlphaGo's actions.