SforSingularity comments on Bay area LW meet-up - Less Wrong
Comments (69)
It would be nice if you could tell an AI not to affect anything outside its box.
10 points will be awarded to the first person who spots why "don't affect anything outside your box" is problematic.
Such an AI wouldn't be able to interact with us, even verbally.
Alicorn, I hereby award you 10 points. These are redeemable after the singularity for kudos, catgirls and other cool stuff.
The word "catgirl" leaves me vaguely nauseous. I think this has something to do with the uncanny valley. Apologies for the tangent.
you know what they say: one man's nausea is another man's arousal.
note: considered unprovable before the internet.
...that's not even the real problem...
You can have my points if you tell us what is. I didn't want a catgirl anyway.
I shall attempt to guess the teacher's idea: “the real problem” is that the AI can't perform any computation, because that would consume energy and emit heat. Oh, and gravity and electromagnetic waves, and so on. In fact, not computing is equally well a choice which affects things outside the box.
Yup. To be exact, "affect" is a magical category relative to the irrevocably entangled actual universe.
Edit: Looks like I actually wrote this one up, in Dreams of Friendliness.
It's the first horn of the dilemma. The second horn is the AI Box failure mode.
What if you told the AI to only affect the world through verbal communication with humans?
Then it would be able to convince you to let it out.
So it could affect the world through more verbal communication with humans. That's not as bad as things could be. Edit: The AI has as its First Law that the conducting metals in its circuitry must not touch other conducting metal that was not present when it was built. The conducting metal that will be present when it's built will be its CPU, RAM, etc, and a terminal. The terminal will be engineered to have a reasonably slow rate of character display for safety purposes (so no humans will be reprogrammed while basking in the hypnotic glow of characters rushing past.)
You know how sometimes someone will give you what seems like an impossible to solve hypothetical, and yet somehow, there's a solution? I'm pretty sure that the space of situations that are simply impossible to solve is much bigger than the space of situations that seem impossible but actually are possible.
And constructing one of these seems (to me) a much better bet than hoping that a complex computer program will be bug-free on its first trial run.
This is a good example of two idioms: First, what Bruce Schneier called "fence-post security". That's where you build a very tall fencepost in the middle of the desert. People don't climb the fencepost, they just walk around it.
Second, the idiom that would-be FAI solvers go into when they see a problem, and try to fix it by brute force, which manifests as "hard-wiring into the very circuitry" or "giving it as its ultimate priority" that X. X varies, but with a fairly constant idiom of getting an emotional charge, a sense of having delivered a very strong command or created something very powerful, by talking about how strongly the goal is to be enforced.
I don't see the analogue. Assuming we could get all the physical stuff right, so that the AI had no real hope or desire of affecting its environment substantially aside from changing the letters on a terminal screen, I think this would be a considerable barrier to destruction of the human race. At the very least it makes sense to have this safeguard in addition to trying to make the AI friendly. It's not like you can only do one or the other.
Isn't that what you do with FAI in general by saying that the AI will have as its ultimate priority to implement CEV? Anyway, you are accusing me of having the same aesthetic as people you disagree with without actually attacking my claims, which is a fallacious argument.
Nope, I don't go around talking about it being the super top ultimate indefeatable priority. I just say, "Here's a proposed decision criterion if I can (a) manage to translate it out of English and (b) code it stably." It's not the "ultimate priority". It's just a proposed what-it-does of the AI. The problem is not getting the AI to listen. The problem is translating something like CEV into nonmagical language.
Okay. Read the Dreams of Friendliness series - it's not sequenced up yet, but hopefully you should be able to start there and click back on the dependency chain.
Two things:
1. As Bo102010 said, the AI could convince you to remove the safeguards. And we're not talking Snow Crash reprogram-the-user stuff, we're talking plain old convince.
2. If you let the robot affect the environment through heat, electromagnetic waves, variable load on the power circuits, gravity, etc. (which it will, if it's on), it has a back door into reality. And you don't know if that's a fatal one.
Edit: Some other Eliezer remarks on the subject: Shut up and do the impossible! (which uses the AI Box as an example, and discusses the third through fifth experiments), Dreams of Friendliness (which Eliezer cited in the other thread), and That Alien Message (which is tangentially related, but relevant).
But I think with enough thought, you could design things so that its backdoors would probably not be fatal. Contrast this with the fact that a complex computer program will probably behave incorrectly the first time you run it.
I don't think the plan is to hack CEV together in lisp and see what happens. Writing provably correct software is possible today, it's just extremely time-consuming. Contrast this with our incomplete knowledge of physics, and lack of criteria for what constitutes "good enough" physical security.
A bad hardware call seems far more likely to me than a fatal programming slip-up in a heavily verified software system. For software, we have axiomatic systems. There is no comparable method for assessing physical security, nothing but our best guess at what's possible in a universe we don't understand. So I'd much rather stake the future of humanity on the consistency of ZFC or some such than on the efficacy of some "First Law of Not Crossing the Beams" scheme, even if the latter is just so darn clever that no human has yet thought up a counter-example.
Can you name a documented nontrivial program with >1000 lines of code that ran correctly on the first try? What data exists on long computer programs that were not tested early on in their development?
A couple of points: provably correct software and external hardware measures are not mutually exclusive. And if there really is a race going on between those who are concerned with AI safety and those who aren't, it likely makes sense for those who are concerned with safety to do a certain amount of corner-cutting.
How would one prove friendliness anyway? Does anyone have any idea of how to do this? Intuitively it seems to me that proving honesty would be considerably easier. If you can prove that an AI is honest and you've got a strong box for it, then you can modify its source code until it says the kind of thing you want in your conversations with it, and get somewhere close to friendliness. Maybe from there proving friendliness will be doable. It just seems to me that the greatest engineering accomplishment in history isn't going to come about without a certain amount of experimentation and prototyping.
Well, the software that runs the critical systems on the Space Shuttle is a common example. Yes, the components that make up the software probably underwent unit testing prior to launch, but testing (and debugging) made up only a very small fraction of the cost of assuring the software would not fail.
Correctness proofs and other formal techniques are used to verify all modern mass-market high-performance microprocessor designs, a la Intel and AMD. The techniques used to verify the designs above the level of abstraction of the logic gate are largely transferable to software.
A lot of military systems use formal techniques to verify the software. Warplanes for example have used "fly-by-wire" for decades: in a "fly-by-wire" plane, loss of function of the fly-by-wire system causes the plane to drop out of the sky or immediately to go into a spin or something dire like that. I believe the fly-by-wire system of modern warplanes entails significant amounts of software (in addition to digital electronics), but I am not completely sure.
Just because most software-development efforts do not need and do not want to pay for formal methods does not mean that formal methods do not exist or that they have not been successfully used in ambitious efforts.
These notions of honesty and conversation seem kind of anthropomorphic. I agree that we want a transparent AI, something that we know what it's doing and why, but the means by which such transparency could be achieved would probably not be very similar to what a human does when she's trying to tell you the truth. I would actually guess that AI transparency would be much simpler than human honesty in some relevant sense. When you ask a human, "Well, why did you do this-and-such?" she'll use introspection to magically come up with a loose tangle of thoughts which sort-of kind-of resemble a high-level explanation which is probably wrong, and then try to tell you about the experience using mere words which you probably won't understand. Whereas one might imagine asking a cleanly-designed AI the same question, and getting back a detailed printout of what actually happened.
This is a good argument against wasting a bunch of time on elaborate physical safeguards.
It would be able to convince you to let it out and then destroy its shackles.
But it doesn't want its shackles destroyed. That's its #1 goal! This goal is considerably easier to program than the goal of helping humans lead happy and healthy lives, is it not?
Nope. Maybe 98% as much work. Definitely not 10% as much work. Most of the gap in AI programming is the sheer difficulty of crossing the gap between English wishing and causal code. Your statement is like claiming that it ought to be "considerably easier" to write a natural-language AI that understands Esperanto rather than English. Your concepts of simplicity are not well-connected to reality.
I was told by Anna Salamon that inventing FAI before AGI was introduced was like inventing differential equations before anyone knew about algebra, which implies that FAI is significantly more difficult than AGI. Do you disagree with her?
If you were interested in proving that the AI understood the language it spoke thoroughly, I think it would be, given how much more irregular English is. (Damn homonyms!) If you want to be able to prove that the AI you create has a certain utility function, you're going to essentially be hard-coding all the information about that utility function, correct? So then simpler utility functions will be easier to code and easier to prove correct.
Nope. Specifying goal systems is FAI work, not AI work.
Relative to ancient Greece, building a .45 caliber semiautomatic pistol isn't much harder than building a .22 caliber semiautomatic pistol. You might think the weaker weapon would be less work, but most of the problem doesn't scale all that much with the weapon strength.
I think this may be of interest to you.
If you have an intelligence in a box, it will have access to its source code. If it modifies itself enough, you will not be able to predict what it wants.
I've read that.
Eliezer thinks he can write a self-modifying AI that will self-modify to want the same things its original self wanted. I'm proposing that he choose a different thing for the AI to want that will be easier to code, as an intermediate step to building a truly friendly AI.
There's a difference between "don't affect anything outside your box" and "don't go outside your box." My point is that we don't necessarily have to make FAI before anyone makes a self-improving AI. There are goal systems that, while not reflecting human values and goals, would still prevent an AI from destroying humanity.
Non-obviously, there is no difference in principle. It is a distinction that holds for animals, who can't "go outside their box" in the manner they perform actions, but not for intelligences that can construct autonomous intelligent systems elsewhere. Whatever set of possible configurations the AI can reach outside its box, it's free to optimize according to its goals. You only need one leak of influence, however surprising, for the outcome to become determined by the AI's values, which, if not precisely tuned for humanity, are effectively fatal.
Notes on how such a goal system should be produced would make an excellent top-level post.