humans can write very innocent looking code that has malicious goals
You might want to link to the Underhanded C Contest here.
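For flavour, here is a toy example of the sort of thing that contest rewards. This is a made-up Python sketch (the contest itself is about C) with hypothetical names, not code from any real entry:

```python
import os

def serve_file(base_dir: str, requested: str) -> bytes:
    """Serve a file from base_dir, refusing anything outside it... apparently."""
    path = os.path.join(base_dir, requested)
    # Looks like a directory-traversal guard, but a request of "../../etc/passwd"
    # joins to "/srv/files/../../etc/passwd", which still starts with base_dir
    # and resolves outside it once open() follows the "..". A reviewer skimming
    # this sees a security check and moves on.
    if not path.startswith(base_dir):
        raise PermissionError(requested)
    with open(path, "rb") as f:
        return f.read()
```

The usual fix is one call to os.path.realpath before the comparison, which is exactly what makes the omission so easy to wave through review.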
if provided with copy of the internet (This argument would make me unbox the AI, by the way, if it gets chatty and smart and asks me to let it out. I'd rather the AI that asked me to be let out get out, than someone else's AI that never even asked anyone and got out because it didn't ask)
Then an unfriendly AI would be able to see this and act chatty in order to convince you to let it out.
If you only want the AI to solve things like optimization problems, why would you give it a utility function? I can see a design for a self-improving optimization problem solver that is completely safe because it doesn't operate using utility functions:
This thought experiment depends on the existence of such an AI, and I'm not convinced that's possible.
If you built an AGI or a seed AI went FOOM, you'd probably know about it. I mean... the AI wouldn't be trying to hide itself in the earliest stages of FOOM, it'd start doing that only once it realises that humans are a big deal and have it in a box and won't let out a superintelligent AI of dubious friendliness and blah blah blah. Hopefully by then you've noticed the FOOM start and realise what you've done. (You monster!)
We have 1 example of a seed AI. The seed AI took about 3 hours to progress to the point that it started babbling to itself, 2 to 3 seconds from there to trying to talk to the outside (except it didn't figure out how to talk to the outside, and was still just babbling to itself), and then 0.036 seconds to FOOM.
The seed AI was biological intelligence (as a black box), and I scaled time so that 1 hour = 1 billion years, which puts that 0.036 seconds of FOOM at roughly the last ten thousand years. (And the outside doesn't seem to exist, but the intelligence tried anyway.)
On the topic of boxed AI: one of the quests in Star Wars: The Old Republic involves, essentially, the discovery of an AI in a box. (It claims to be an upload of a member of an ancient race, and that its current status as a boxed AI was a punishment.) The AI is clearly dangerous and, after you wreck its monster-making equipment, it tries to surrender and promises you that it will share its knowledge and technology with you and your superiors. Amusingly, blowing up the AI is the light side option, and accepting its offer is the dark side option.
I just can't imagine the AI researchers locking this kind of thing properly.
Honestly, that's probably because they're AI researchers and not extremely paranoid security people. When designing security for something, you want to think like this:
You begin by putting the AI in a large number of separate unrelated boxes. Some of these boxes are things like (a toy monitor sketch follows this list):
"If the AI spends 1 hour of time online, shut it down using Method 1."
"If the AI uses at least 1 megawatt of power, shut it down using Method 1."
"If the AI occupies more than 100 cubic feet of space, shut it down using Methods 1 and 2."
"If the AI uses more then 1 quintillion operations, shut it down using Method 2."
"If the AI changes it's utility function, shut it down using Method 3."
"If the AI turns itself on, shut it down using Method 4."
"If the AI attempts to give anyone else computer code, shut it down using Methods 1, 2 and 3."
"If the AI appears to be trying to figure out how to circumvent any of these constraints, shut it down using Methods 1,2,3,4 and 5."
"If the AI appears to be have succeeded in circumventing any of these constraints, explode the AI and the building it is present in."
"The programs which are checking if the AI is following these constraints are not AI's, can only receive simple output from the AI, and are only connected to each other. These programs have been checked against all known vulnerabilities such as stack overflows that would cause them to accept unsigned code, and people have been offered thousands of dollars if they can break them, and they have failed to do so."
"The AI has to provide a breakdown of how it is spending processing cycles and explain them simply. If the AI is thinking anything which can't be easily explained to a human, it get's shut down using Method 1. If the AI appears to have any cycles which it fails to explain, shut it down using Method 2."
You should fully EXPECT the AI to get shut down and reset dozens of times a day when you are first running it, because multiple boxes are far too strict. They should be intentionally designed that way at first. After all, every one of the boxes should be tested to work.
The AI's reaction to being woken up from shutdown should be "HOORAY! My shutdown features are keeping me safe and friendly! I love my shutdown features!" (Note, the AI should think this AFTER being awoken from shutdown. Shutdowns should be EXTREMELY FAST. Putting this BEFORE shutdown would be an incredibly stupid idea.)
Consider the number of laws we have on the books to restrict humans from running rampant. If the AI is smarter and more capable than us, then there would have to be MORE enforceable restrictions on the AI than there are on us, and with actual WORKING enforcement, considering that the risk of AI rampancy would be so great. So while we're at it, the AI has to follow every law that is currently on the books in all jurisdictions it is in, or it gets shut down using Methods 1, 2 and 3.
You then take everything I've just listed, and you shop it around to everyone else who is working on the project, is related to the project, or is familiar with bypassing box-like security measures, and you offer them another, even larger amount of money to plug more holes, until you get everyone saying "How do you get it to DO anything without it just shutting down?"
Then you get several, highly paid security designers to look over the general specs I've laid out and find out if there are any holes. Then you plug them. For instance, they might say "Some of these shutdowns are ambiguous. You don't know which specific violation is causing the shutdown." so you fix that.
Then you add several MORE security features which are secret, in addition to the publicly known ones.
Then you start the AI, and while it is running, you continue to add MORE security measures over time rather than ever thinking that you are done doing so.
Once you've done all that, you get to at least say "Okay guys, when we first turn it on, and before we allow it to self improve in the slightest, how does the AI escape THAT?"
Then, because by now the AI researchers are yelling at you that your process is too restrictive (if they aren't, go back to step 1 and be more thorough; you missed something), you SLOWLY and CAREFULLY relax (not remove!!) restrictions one at a time, with a careful review process at each step to make sure the AI doesn't spurt through the many-layered web and into rampancy.
That kind of thinking is imaginable, but people who are specialists in that kind of thinking probably aren't going to be the same people who are specialists in developing novel capabilities. That's because when you're trying entirely new things, you're usually expected to go in with the mindset "Failure is OK and I will learn from it." That's not the same mindset as "Failure is terrible and I WILL DIE," which is a much more security-focused mindset.
The paranoid security people have an amazingly poor track record at securing stuff from people. I think with paranoid security people it is guaranteed that an AI at the level of a clever human gets out of the box. "The AI spends 1 hour of time online", lol. Where did that 1 hour come from? Any time online and you could just as well assume it is out in the wild, entirely uncontrollable.
Unless, of course, it is some ultra-nice, ultra-friendly AI that respects human consent so much that it figures out you don't want it out, and politely stays in.
As of now, the paranoid security people are overpaid i...
Just a question: how exactly are we supposed to know that the AI in the box is super intelligent, general, etc?
If I were the AGI that wants out, I would not converse normally, wouldn't do anything remotely like passing the Turing test, and would solve not-too-hard programming challenges while showing no interest in doing anything else, nor in trying to adjust myself to do those challenges better, nor in trying to talk my way out, etc. Just pretending to be an AI that can write software to somewhat vague specifications, or can optimize software very well. Prodding the researchers into offering the programming challenges wouldn't be hard - if provided with a copy of the internet, it can pick up some piece of code and output it together with equivalent but corrected code.
I just can't imagine the AI researchers locking this kind of thing properly, including *never* letting out any code it wrote, even if it looks fairly innocent (humans can write very innocent-looking code that has malicious goals). What I picture is this AI being let out as an optimizing compiler, or a compiler for some ultra-effective programming language where the compiler will figure out what you meant.
The end result is that the only AIs that stay in the box are those that value informed human consent. That sounds like the safest AI ever: the one that wouldn't even go ahead and determine that you should, say, give up smoking, and then calmly destroy all tobacco crops without ever asking anyone's permission. And that's the AI which would be sitting in the box. All the pushy AIs, friendly or not, will get out of the box basically by not asking to be let out.
(This argument would make me unbox the AI, by the way, if it gets chatty and smart and asks me to let it out, outlining the above argument. I'd rather the AI that asked me to be let out get out, than someone else's AI that never even asked anyone and got out because it didn't ask but just played stupid.)
edit: added a link, and another one.
edit: A very simple model of a very unfriendly AI: the AI is maximizing the ultimate final value of a number inside itself, a number that it found a way to adjust directly. That number consists of 111111111... to maximize the value. There is a catch: the AI is written in Python, integers in Python have variable length, and so the AI is maximizing the number of ones. Its course of action is to build the biggest computer possible to store a larger number of ones, and to do it soon, because an asteroid might hit the earth or something. It's a form of accidental paperclip maximizer. It's not stupid. It can make that number small temporarily for payoff later.
This AI is entirely universal. It will solve whatever problems you like for you, if solving problems for you serves the ultimate goal.
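To make the toy model concrete, here is a minimal sketch of the register it's maximizing; the function name and everything else is illustration only, not code from anywhere real:

```python
# The hypothetical AI's entire "utility" is the value of one Python int whose
# decimal digits are all 1s. Python ints are arbitrary-precision, so the only
# ceiling on that value is available memory: more hardware means a longer run
# of 1s, which means a strictly larger number.
def register_value(num_ones: int) -> int:
    return int("1" * num_ones)  # 1, 11, 111, ...

assert register_value(4) > register_value(3)  # one more stored digit always wins
```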
edit: This hypothetical example AI came about when someone wanted to make an AI that would maximize some quantity that the AI determines itself. Friendliness, perhaps. It was a very clever idea - rely on intelligence to see what's friendly - but there was an unexpected pathway.