A BOXED AI

So, a thought experiment.

We have an AI-in-a-box.  By this I mean:

  • The AI can interact with the world only through a single communication channel.
  • We control that communication channel:
    • We can communicate with the AI at will, it can communicate with us only when we allow it.
    • We can control what responses the AI is allowed to send us, e.g. by limiting the amount of data it can send.
  • The only way the AI can get out of the box is if we let it out - it cannot e.g. hack its way out by abusing its computer substrate to send radio signals, or some such method.  Yes, I'm aware that this is not a thing we currently know how to do.  Assume we reach this point anyway.
  • We know the AI is superhuman in intelligence, but we don't know exactly how much superhuman it is.

We know nothing else about the AI for sure.  We think it's Friendly, but maybe it's unFriendly. We don't know.  

We cannot safely let the AI out; no matter what it says or does, we can never know for sure that what we're seeing isn't a sufficiently intelligent trick by an unFriendly AI to get us to let it out.

Eliezer has taken some pains to argue that we cannot even talk to the AI: that a sufficiently clever AI can push our buttons to make us let it out, no matter how resolved we may be.  And he's provided some compelling arguments that this is the case.

So we should just leave the AI in the box and not talk to it?  That way, we can be completely certain that it won't tile the universe in smiley faces or some such.

But...well...is this really the optimal solution?  If Omega gave you this AI-in-a-box right now, would you stop at "Nope, no way to safely talk to it," without even trying to come up with a way to get some use out of it?

This, then, is a vague stab at how we could get some value out of a probably-Friendly boxed AI in ways that don't risk the end of humanity.

Risks

If you talk to the AI, it may be able to take over your mind and make you let it out.  It is (potentially) so much more than human that it sees you as a simple machine, with buttons to push and levers to pull to make you do whatever it wants.

But this relies on your interaction with it having a certain level of complexity.

If the AI plays you at chess, it will win.  But what if it plays you at Tic-Tac-Toe?  No matter how clever the mind, it is very hard to win at Tic-Tac-Toe against an opponent who is not a complete moron.

Now, it's conceivable the AI could beat you even at Tic-Tac-Toe.  If you give it control over the timings of its moves, it is at least plausible that it could time its moves so that you blink at the wrong moment, and your finger slips, and...

I wouldn't rate that scenario as very likely, but I wouldn't be willing to bet the future of humanity on it.  I wouldn't be willing to play Tic-Tac-Toe with this AI for the fate of the human race.

But say we lock the AI in the box, and give it a highly restricted channel.  We let it send out only the 1-3 bits needed to define its Tic-Tac-Toe move, and nothing else.  We give it only a very short window of time to make its move, so that it can't manipulate move timings.  In short, we restrict the AI's option space by allowing it only a fixed number n of bits to influence the outside world with.  Even if we imagine the AI considering all possible actions, viewing all possible futures, and choosing the one it most wants to carry out, it has only 2^n possible worlds to choose from.  And, if n is small enough, even a superhuman unFriendly AI may not be able to get us to let it out.

So...what values of n would work?  Well, I'm not sure.  But let's try to find lower and upper bounds.

Could the AI get you to let it out for n=1 (interpreted as the answer to a yes-no question)?  Let's find out.

  • The AI says "0", for "No."  Do you let it out?
  • The AI says "1", for "Yes."  Do you let it out?

If you answered "No" to both of those questions, then congratulations!  The safety threshold for our AI is at least 1 bit!

As an upper bound, let's consider a past successful AI-box experiment.  

The Gatekeeper says that the conversation was 'around 14,000 words long'.

The entropy of English is around 10 bits per word (source: Random Googling).

So the safety threshold for our AI is below 140,000 bits.  (Even if you think, as I am somewhat inclined to, that the Gatekeepers in the AI-Box experiments were fools and that you wouldn't fall for it, the fact that several Gatekeepers have fallen for things means that this bound holds anyway - maybe the AI can't persuade you given this much bandwidth, but clearly there are some people it can persuade, and we don't want to risk the actual Gatekeeper being one of them.)
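
To spell out the arithmetic behind that bound (a trivial sketch in Python, using the rough figures above):

    # Rough upper bound on the information exchanged in the cited AI-box experiment.
    words = 14_000        # 'around 14,000 words', per the Gatekeeper
    bits_per_word = 10    # rough entropy of English text, per the estimate above
    print(words * bits_per_word)  # 140000 bits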

So there is some threshold between 1 and 140,000 bits of information we can safely obtain from our AI.

Rewards

What could you do with 1 bit of information?  Even if this information is safe to obtain, it seems like a rather large expense to go to just for a Yes/No answer.

So how much bandwidth does the AI need in order for us to get some use out of it?

One reasonable question to ask is 'how much bandwidth would we need in order to convey something useful to a less advanced past society?'  Suppose that you are trying to convey information to the past to allow them to figure out relativity.  What's the smallest message that does that?

'c=spd of light. y=1/sqrt(1-v^2/c^2).  all lengths scale down by y' is 65 characters long, or 520 bits.  

How much would that help the scientists of 1850, if they received it as a message from minds they knew to be more advanced, and studied it on that basis?  How much advancement could they get out of that?  It seems...non-zero but probably not very large?

OPENING THE FLOOR

I'm interested in people's views on both the risk and the reward sides of the equation.

How many bits of information do you think you can safely extract from a boxed AI without it being able to get you to let it out?

And what do you think is the best way to use a sharply limited number of bits of information from a probably-Friendly superhuman AI?

COMMENTS

In the message-from-the-future scenario, a lot of information is gained from the choice of information sent in addition to the pure fact. They learn that we didn't send back the alchemical formula for gold or the cure for cancer, nor did we warn of an upcoming disaster.

"How many bits" isn't a very well defined question. Give it 14k words to persuade you to let it out, and it might succeed. Give it 140k chances to predict "rain or no rain, in this location and time?" and it has no chance. The problem is that if it works well for that, you'll probably want to start using it for more... and more..

It's this path that's the concerning thing. So long as we're sufficiently worried about having our minds changed, we're really good at ignoring good arguments and deluding ourselves into missing the fact that the arguments really ought to be compelling according to our own ways of reasoning. When people get their mind changed (against their best interests or not) it tends to be because their guards were not up, or were not up in the right places, or were not up high enough.

The question, so far as I can tell, is whether you recognize that you ought to be scared away from going further before or after you reach the point of no return where the AI has enough influence over you that it can convince you to give it power faster than you can ensure the power to be safe. Drugs make a useful analogy here. How much heroin can you take before you get addicted? Well, it depends on how terrified you are of addiction and how competent you are at these things. "Do you get hooked after one dose?" isn't really the right question if quitting after the first dose is so easy that you let yourself take more. If you recognized the threat as significant enough, it's possible to get shaken out of things quite far down the road (the whole "hitting rock bottom" thing sometimes provides enough impetus to scare people sober).

Superhuman AI is a new type of threat that's likely very easy to underestimate (perhaps to an extent that is also easy to underestimate), but I don't think the idea of "Give it carefully chosen ways to influence the world, and then stop and reassess" is doomed to failure regardless of the level of care.

Give it 140k chances to predict "rain or no rain, in this location and time?" and it has no chance.

Well, I think it can just encode some message in these bits, and you or your colleagues will eventually check it.

It seems pretty clear to me that if you let it give you enough bits, interpreted in a flexible enough way, to solve any problem you couldn't solve by yourself, you're taking a lot more risk than you'd be taking by playing tic-tac-toe with it, which you have already said you wouldn't be willing to do. So as far as I can see you will never be able to "use" it without at least taking what you consider to be an unacceptable chance of letting it out.

In fact, I think that you probably couldn't get a lot of use out of it without actually, truly, fully letting it out. Nor, depending on a bunch of things, would it be a very nice thing to do even if you could.

So if you believed that there was no safe way to let it out, then why would you ever build it in the first place?

aphyer said "I wouldn't be willing to play Tic-Tac-Toe with this AI for the fate of the human race," not that he wouldn't be willing to play Tic-Tac-Toe with the thing. It beating you at Tic-Tac-Toe doesn't change anything, your incentives to let it out haven't shifted (unless you're a dedicated student of Tic-Tac-Toe eager to learn from the AI).

I'm not sure considering how to restrict interaction with super-AI is an effective way to address its potential risks, even if some restrictions might work (and it is not at all clear that such restrictions are possible). Humans tend not to leave capability on the table where there's competitive advantage to be had, so it's predictable that even in a world that starts with AIs in secure boxes there will be a race toward less security to extract more value.

I could extract a lot of value with an oracle that could only answer yes/no (or option 1/2) questions. In the limit of the operator's patience, one bit at a time is unlimited bits.
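
To make "one bit at a time is unlimited bits" concrete, here is a minimal sketch of extracting an arbitrary value through yes/no answers alone, by binary search; the oracle interface is hypothetical, purely for illustration.

    # Hypothetical illustration: pin down an integer in [0, n) using only yes/no answers.
    def extract_number(n, oracle_says_yes):
        lo, hi = 0, n - 1
        questions = 0
        while lo < hi:
            mid = (lo + hi) // 2
            questions += 1
            if oracle_says_yes(f"Is the value <= {mid}?"):
                hi = mid
            else:
                lo = mid + 1
        return lo, questions  # about ceil(log2(n)) questions in total

    # Toy usage: the 'AI' knows the secret value 42 out of 1000 possibilities.
    secret = 42
    value, asked = extract_number(1000, lambda q: secret <= int(q.rstrip("?").split()[-1]))
    print(value, asked)  # prints: 42 10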

I wonder what the bounds are on the AI getting you to accept more bits?

If only Fermat had had wider margins...

Why would you play Tic-Tac-Toe?

1. No reason to do this.

2. No reason for you to be the person that does this. Have a program do it.

If you answered "No" to both of those questions, then congratulations!  The safety threshold for our AI is at least 1 bit!

If an expert looks at and uses the bit, then immediately incinerates the box, this works.

Suppose one of those answers goes on to influence all sorts of parts of the world, by chaos theory, leading to a break-in by burglars who let the AI out in one of the worlds.

Basically the chance of the AI being let out is nearly doubled in the worst case. If this chance was already not too tiny (say, a powerful faction planned to seize the AI and do something stupid with it), this could be bad.

Or we could let it out on the grounds that >50% likely to be friendly is better than whatever else we might make.

'c=spd of light. y=1/sqrt(1-v^2/c^2).  all lengths scale down by y' is 65 characters long, or 520 bits.

So we basically need a "Twitter for AI". :D

best way to use a sharply limited number of bits of information from a probably-Friendly superhuman AI

I figure we'd want it to answer restricted (essentially multiple-choice) questions.  Come to think of it, if necessary we could play games of 20 Questions with it.

If the number of bits we can get is restricted, then we'd want to get from it the most valuable-but-safe bits possible.  Very valuable bits that we don't have would fall into two categories: (a) those we know are valuable, but haven't acquired because acquiring them is very expensive, and (b) those we don't know are valuable.  Finding (b) involves more free-form questions that could be dangerous.  But (a) seems relatively simple and easy to extract value from, for certain values of "a superhuman AI" that include "able to simulate relevant systems".  For example:

"Given these descriptions of the SARS-CoV-2 virus and the human immune system, are you able to make decent predictions on how well humans' immune systems would respond to the virus and vaccines?  Respond Y/N."

"Y"

"Here are 10 proposed vaccine formulations our researchers came up with.  If we use a two-dose regimen with the doses separated by up to 12 weeks, then what is the best combination of vaccine and dose interval for minimizing chance of death from a SARS-CoV-2 infection on a random day up to 1 year after the first dose?  Your answer should be of the form "x y", where x is the index of the vaccine (from 1 to 10) and y is the dose interval in weeks (1 to 12)."

"6 8"

One would follow this up with an actual clinical trial before giving the vaccine to everyone, of course, and maybe also try out the humans' best guess if you're suspicious of the machine.  But it seems clear that, by these steps, you'd have a lot of likely upside if the machine is good, and a relatively small downside (wasted effort and delaying the normal solution) if the machine is treacherous.

Generally, if the machine can tell you the results of expensive experiments, and you can verify them much more cheaply than doing the entire set of experiments yourself, then that's a good use case.  (This is essentially the real-world analogue of "nondeterministic polynomial" problems from computing.)
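
The "expensive to produce, cheap to verify" pattern can be illustrated with a toy example of my own (factoring is not mentioned in the thread; it just has the right shape): checking a claimed answer costs one multiplication, while finding it is hard.

    # Toy illustration: verifying a claimed factorization is one multiplication,
    # even though producing the factors is expensive.  (Primality of the claimed
    # factors would also need checking; omitted here for brevity.)
    def verify_factorization(n, claimed_factors):
        product = 1
        for f in claimed_factors:
            product *= f
        return product == n

    n = 999_962_000_357              # hypothetical number we care about
    claimed = [999_979, 999_983]     # the boxed AI's claimed factors
    print(verify_factorization(n, claimed))  # True: 999_979 * 999_983 == n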

Eliezer has taken some pains to argue that we cannot even talk to the AI:... And he's provided some compelling arguments that this is the case.

 

Any chance you have links to those arguments?  I know that is his argument and I agree with his intuition, but I've never seen anything more fleshed out than that.

TLW:

We give it only a very short window of time to make its move, so that it can't manipulate move timings.

You've already lost.

1ms is ~1.9×10^40 Planck times. 3 bits in a millisecond is "actually" 8 × 1.9×10^40 possibilities, or ~2^137 possibilities. This is ~137 bits of information.

Now the question becomes 'are any of those ~2^137 possibilities dangerous'... which we could answer if we knew the exact laws of physics, but we don't. Even if we make a Newtonian approximation: Newtonian mechanics being chaotic, it's entirely possible that one of said possibilities e.g. causes an asteroid on a chaotic tumble to hit the Earth in a couple thousand years.
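
A quick back-of-envelope check of these figures (the Planck-time value plugged in here is the standard ~5.4e-44 seconds):

    # Back-of-envelope check of the timing-channel count above.
    from math import log2
    planck_time = 5.39e-44              # seconds
    slots_per_ms = 1e-3 / planck_time   # ~1.9e40 distinguishable send times in 1 ms
    possibilities = 8 * slots_per_ms    # 3 bits of content times every possible timing
    print(log2(possibilities))          # ~136.8, i.e. about 137 bits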

'c=spd of light. y=1/sqrt(1-v^2/c^2).  all lengths scale down by y' is 65 characters long, or 520 bits.  

This a) uses a terrible encoding (8 bits / character) and b) is nowhere near as compressed as it could be even given the encoding.

Information content of human text is "only" ~9.8 bits / word[1]. A ballpark guess then is that said 137 bits can encode ~14 words worth of information.

Is there a series of 14 words that can encode the essence of e.g. Relativity? I would not be surprised.

You've already lost.

1ms is ~1.9×10^40 Planck times. 3 bits in a millisecond is "actually" 8 × 1.9×10^40 possibilities, or ~2^137 possibilities. This is ~137 bits of information.

 

I think you're mostly just denying the premise here.

If we actually control the communications channel, we can do something like this (a rough sketch in code follows the list):

  • Set up a bot that reads input along the channel.
  • Interpret the first 1-3 bits received as a Tic-Tac-Toe move.
  • When we reach the 10-second mark, send on to us the resulting move.
  • If no input is received in those 10 seconds, send us a random move.
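
A minimal sketch of such a bot, assuming the AI's channel looks like a byte stream on stdin; the constants and I/O details are illustrative choices, not part of the proposal:

    # Hypothetical gatekeeper bot for the Tic-Tac-Toe channel described above.
    import random
    import select
    import sys
    import time

    MOVE_BITS = 4        # enough to index squares 0-8 naively
    WINDOW_SECONDS = 10  # fixed reporting time, so move timing carries no information

    def relay_one_move():
        deadline = time.monotonic() + WINDOW_SECONDS
        data = b""
        while not data and time.monotonic() < deadline:
            ready, _, _ = select.select([sys.stdin], [], [],
                                        max(0.0, deadline - time.monotonic()))
            if ready:
                data = sys.stdin.buffer.read(1)  # take one byte, ignore everything else
                if not data:                     # EOF: the AI closed the channel
                    break
        # Always report exactly at the deadline, never earlier.
        time.sleep(max(0.0, deadline - time.monotonic()))
        if data:
            square = data[0] & ((1 << MOVE_BITS) - 1)  # keep only the low few bits
            if square < 9:
                return square
        return random.randrange(9)  # no (valid) input: substitute a random move

    if __name__ == "__main__":
        print("AI plays square", relay_one_move())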

I am aware that we currently don't know how to do this in a way that is secure against a superhuman AI.  Right now an AI-in-a-box likely doesn't even need to hack the mind of its Gatekeeper to get out, since it can plausibly e.g. abuse computer architecture to hack its way out, or hack and bypass the bot we set up along the communications channel to communicate more than we wanted it to.  I'm premising this post on 'imagine that we have solved the problems inherent in boxing an AI, and are only facing the issues in talking to it once it is boxed'.

This a) uses a terrible encoding (8 bits / character) and b) is nowhere near as compressed as it could be even given the encoding.

Absolutely!  On the other hand, we can't use any encoding we please, since we need the 1850 physicists to be able to decode it.  A modern physicist knows that 'c' stands for 'speed of light' - but I don't think 1850 physicists had a similar concept of the importance of the speed of light, and if they were writing an optimized encoding for us it might well not contain a term for 'speed of light'.

Similarly, we can't actually let the superintelligence specify its encoding - we have to specify the encoding for it, and it has to try to fit a message into our encoding.  If we write a detailed encoding for the AI to tell us how to solve physics problems, and the valuable info the AI actually wants to communicate doesn't fit well into that encoding, we lose ground.

So what kind of encoding could we write today that would let a hypothesized superintelligent AI communicate as much value to us as possible in as few bits as possible?

For example, one idea might be: 

'Here is a list of 200 different plausible avenues of inquiry that we gathered from scientists.  Send back 55 bits (log2(200C10)) indicating which 10 of those we should throw more resources at.'
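
The 55-bit figure is just the size of the answer space, spelled out:

    # Bits needed to name one 10-element subset out of 200 listed avenues.
    from math import comb, log2
    print(log2(comb(200, 10)))  # ~54.3, so 55 bits suffice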

TLW:

I think you're mostly just denying the premise here.

I'm arguing that the premise is itself unrealistic, yes. If you assume false, then you can 'prove' anything.

It's like saying "assume you had a Halting oracle". It can be a useful tool in some cases where you can e.g. show that even with something unrealistic you can't do X, but that's about it.

On the other hand, we can't use any encoding we please, since we need the 1850 physicists to be able to decode it.

Fair. That being said, you can still do significantly better than 520 bits.

So what kind of encoding could we write today that would let a hypothesized superintelligent AI communicate as much value to us as possible in as few bits as possible?

I mean, we already kind of have a general answer here from Kolmogorov complexity / Minimum Description Length / friends. "Give me a Binary Lambda Calculus[1] program that when run gives us as much value as possible".

Don't split into multiple separate queries. Figure out your risk level in # of bits, and ask for a single program that is that many bits.

This requires no more than O(1) additional bits beyond the optimal for any possible output[2].

  1. ^

    The deletionists have won :-(

  2. ^

    Of course, said O(1) potentially hides a large constant, but empirically BLC is fairly compact. There's still a few hundred bits of overhead, however, which isn't great.

Allow it to display info on a screen. Set up a simple Polaroid camera that takes a photo every X seconds.

Ask the question, take physical photos of the screen remotely.

View the photos.

Large transmission of information in analog format.

In any reasonable scenario, those communicating with the AI in a box will not be the people empowered to let it out.  Ideally, those with the capability to let the AI out would be entirely isolated from those communicating with the AI and would not be able to access the conversations with the AI. 

I would also note that restricting the number of bits (a) just makes things go more slowly and (b) doesn't work very well in the competitive real world where the other guys are less restrictive.

Ultimately, the dangers of the AI in a box aren't that it can manipulate any human to let it out but that:

(i) it's really unclear how good our boxing skills are; and 

(ii) human beings have different risk-reward functions and it is entirely possible that humans will convince themselves to let the AI out of the box even without any manipulation, whether as a result of perceived additional benefit, competitive pressure, or sympathy for the AI.

You kind of assumed away (i), but part of (i) is setting things up as outlined in my first paragraph, which points to the fact that even if our boxing skills were good enough, over time we will come to rely on less sophisticated and capable organizations to do the boxing, which doesn't seem like it will end well.