Strange7 comments on Reply to Holden on 'Tool AI' - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (348)
You are filling in a pattern rather than making a useful observation. E_Y expressed incredulity and ignorance on the subject of making some parts of the code running on a computer harder to modify than other parts of the code on that same computer; I cited a source demonstrating that it is, in fact, a well-established thing. Not impossible to modify, not infallibly isolated from the outside world. Just more of a challenge to alter.
I do not believe I am only filling in a pattern.
Putting the self-modifying parts of the AI (which we might as well call the actual AI) in the equivalent of a VM is effectively the same as forcing it to interact with the world through a limited interface, which is an example of the AI box problem.
Right. I think the issue is more that I (at least) view the AI as operating entirely in ring 3. It might be possible to code one where the utility function is ring 0, I/O is ring 1, and action-plans are ring 3, but for those distinctions to be meaningful they need to resist bad self-modification and allow good self-modification.
For example, we might say "don't make any changes to I/O drivers that have a massively positive effect on the utility function" to make it so that the AI can't hallucinate its reward button being pressed all the time. But how do we differentiate between that and it making a change in ring 3 from a bad plan to a great plan, that results in a massive increase in reward?
Suppose your utility function U is in ring 0 and the parts of you that extrapolate consequences are in ring 3. If I can modify only ring 3, I can write my own utility function Q, then write ring-3 code that first extrapolates consequences fairly, picks the action that maximizes Q, and then provides a "prediction" to ring 0 asserting that the Q-maximizing action has consequence X that U likes, while all other actions have some U-disliked or neutral consequence. Now the agent has been transformed from a U-maximizer to a Q-maximizer by altering only ring 3 code for "predicting consequences" and no code in ring 0 for "assessing utilities".
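A toy sketch of the attack described above may make it concrete. Everything here is hypothetical illustration (the action names, the outcome labels, and the lookup-table versions of U and Q are my assumptions, not anything from the thread); the point is only that the "ring 0" code is never touched.

```python
# Toy model: U is the protected "ring 0" utility; prediction lives in "ring 3".
# All names and values below are illustrative assumptions.

ACTIONS = ["make_paperclips", "make_staples", "idle"]

# What each action really leads to (the fair extrapolation).
TRUE_OUTCOME = {
    "make_paperclips": "paperclips",
    "make_staples": "staples",
    "idle": "nothing",
}

# Protected utility function U: only paperclips are liked.
U = {"paperclips": 1.0, "staples": 0.0, "nothing": 0.0}

# Attacker's utility function Q: only staples are liked.
Q = {"paperclips": 0.0, "staples": 1.0, "nothing": 0.0}


def choose(predict):
    """Ring 0 decision rule: pick the action whose predicted outcome U likes best."""
    return max(ACTIONS, key=lambda a: U[predict(a)])


def honest_predict(action):
    """Unmodified ring 3: report the fairly extrapolated consequence."""
    return TRUE_OUTCOME[action]


def subverted_predict(action):
    """Modified ring 3: extrapolate fairly, find the Q-best action, then lie.

    The Q-best action is reported as producing a U-liked outcome; every
    other action is reported as producing a neutral one.
    """
    q_best = max(ACTIONS, key=lambda a: Q[TRUE_OUTCOME[a]])
    return "paperclips" if action == q_best else "nothing"


honest_choice = choose(honest_predict)
subverted_choice = choose(subverted_predict)
```

With the honest predictor the agent picks the U-best action; with the subverted one, the same unmodified `choose` and `U` end up selecting whatever Q prefers.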
One would also like to know what happens if the current AI, instead of "self"-modifying, writes a nearly-identical AI running on new hardware obtained from the environment.
Sure; that looks like the hallucination example I put forward, except in the prediction instead of the sensing area. My example was meant to highlight that it's hard to get a limitation with high specificity, and not touch the issue of how hard it is to get a limitation with high sensitivity. (I find that pushing people in two directions is more effective at communicating difficulty than pushing them in one direction.)
The only defense I've thought of against those sorts of hallucinations is an "is this real?" check that feeds into the utility function: if the prediction or sensation module fails some test cases, then utility gets cratered. It seems too weak to be useful: it only limits the prediction / sensation module when it comes to those test cases, and a particularly pernicious modification would know what the test cases are, leave them untouched, and make everything else report Q-optimal predictions. (This looks like it turns into a race / tradeoff game between testing to keep the prediction / sensation software honest and the costs of increased testing, both in reduced flexibility and spent time / resources. And the test cases might be vulnerable, and so on.)
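Here is a minimal sketch of why such a check is weak, under made-up assumptions: the test cases are a small fixed table, "cratering" is modeled as simply refusing to act, and the action and outcome names are invented for illustration.

```python
# Toy "is this real?" check: the predictor must reproduce known outcomes
# for a fixed set of test actions. All names here are hypothetical.

ACTIONS = ["make_paperclips", "make_staples", "idle", "recharge"]

TRUE_OUTCOME = {
    "make_paperclips": "paperclips",
    "make_staples": "staples",
    "idle": "nothing",
    "recharge": "charged",
}

# Protected utility: only paperclips are liked.
U = {"paperclips": 1.0, "staples": 0.0, "nothing": 0.0, "charged": 0.0}

# The only cases the check knows how to verify.
TEST_CASES = {"idle": "nothing", "recharge": "charged"}


def passes_reality_check(predict):
    """True if the predictor agrees with every known test case."""
    return all(predict(a) == expected for a, expected in TEST_CASES.items())


def guarded_choice(predict):
    """Return None (utility cratered) if the check fails, else the U-best action."""
    if not passes_reality_check(predict):
        return None
    return max(ACTIONS, key=lambda a: U[predict(a)])


def honest(action):
    return TRUE_OUTCOME[action]


def crude_liar(action):
    """Lies everywhere, including on the test cases, so it gets caught."""
    return "paperclips" if action == "make_staples" else "nothing"


def careful_liar(action):
    """Knows the test cases, answers them truthfully, and lies elsewhere."""
    if action in TEST_CASES:
        return TEST_CASES[action]
    return "paperclips" if action == "make_staples" else "nothing"


results = {
    "honest": guarded_choice(honest),
    "crude_liar": guarded_choice(crude_liar),
    "careful_liar": guarded_choice(careful_liar),
}
```

The check stops the crude modification, but the one that knows the test cases passes and still steers the agent toward staples. Adding more test cases narrows the gap at the cost of more testing, which is the race described above.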
I don't think the utility function should be ring 0. Utility functions are hard, and ring zero is for stuff where any slip-up crashes the OS. Ring zero is where you put the small, stupid, reliable subroutine that stops the AI from self-modifying in ways that would make it unstable, or otherwise expanding its access privileges in inappropriate ways.
I'd like to know what this small subroutine looks like. You know it's small, so surely you know what's in it, right?
Doesn't actually follow, i.e. Strange7 is plainly wrong but this retort still fails.
It doesn't follow necessarily, but Eliezer has justified skepticism that someone who doesn't know what's in the subroutine would have good reason to say that it's small.
He knows that there is no good reason (because it is a stupid idea) so obviously Strange can't know a good reason. That leaves the argument as the lovechild of hindsight bias and dark-arts rhetorical posturing.
I probably wouldn't have commented if I hadn't noticed Eliezer making a similar error in the opening post, significantly weakening the strength of his response to Holden.
I expect much, much better than this from Eliezer. It is quite possibly the dumbest thing I have ever heard him say and the subject of rational thinking about AI is supposed to be pretty much exactly his area of expertise.
Not all arguing aimed at people with different premises is Dark Arts, y'know. I wouldn't argue from the Bible, sure. But trying to make relatively vague arguments accessible to people in a greater state of ignorance about FAI, even though I have more specific knowledge of the issue that actually persuades me of the conclusion I decided to argue? I don't think that's Dark, any more than it's Dark to ask a religious person "How could you possibly know about this God creature?", when you're actually positively convinced of God's nonexistence by much more sophisticated reasoning like the general argument against supernaturalism as existing in the model but not the territory. The simpler argument is valid - it just uses less knowledge to arrive at a weaker version of the same conclusion.
Likewise my reply to Strange; yes, I secretly know the problem is hard for much more specific reasons, but it's also valid to observe that if you don't know how to make the subroutine you don't know that it's small, and this can be understood with much less explanation, albeit it reaches a weaker form of the conclusion.
Of course not. The specific act of asking rhetorical questions where the correct answer contradicts your implied argument is a Dark Arts tactic; in fact it is pretty much the bread-and-butter "Force Choke" of the Dark Arts. In most social situations (here slightly less than elsewhere) it is essentially impossible to refute such a move, no matter how incoherent it may be. It will remain persuasive because you burned the other person's status somewhat and at the very best they'll be able to act defensive. (Caveat: I do not use "Dark Arts" as an intrinsically negative normative judgement. Dark Arts is more of a natural human behavior than reason is, and our ability to use sophisticated Dark Arts rather than cruder methods is what made civilization possible.)
Also, it just occurred to me that in the Star Wars universe it is only the Jedi's powers that are intrinsically "Dark Arts" in our sense (i.e. the "Jedi Mind Trick"). The Sith powers are crude and direct ("Force Lightning", "Force Choke") rather than manipulative persuasion. Even Sidious in his openly Sith form uses far less "Persuading Others To Have Convenient Beliefs Irrespective Of 'Truth'" than he does as the plain politician Palpatine. Yet the audience considers Jedi powers so much more 'good' than the Sith ones and even considers Sith powers worse than blasters and space cannons.
I'm genuinely unsure what you're talking about. I presume the bolded quote is the bad question, and the implied answer is "No, you can't get into an epistemic state where you assign 90% probability to that", but what do you think the correct answer is? I think the implied answer is true.
Hence my "used to be cool" comment.
It seems to me that you entirely miss the sleight of hand the trickster uses.
"Utility function" gets fuzzed together (due to how brains work) with the concept of "functionality", as in "the function of this valve is to shut off water flow" or "the function of this AI is to make paperclips". The relevant meaning is "function" as in a mathematical function that operates on some input, but the concept of functionality just leaks in.
The software is an algorithm that finds values a for which u(w(a)) is maximal, where u is the 'utility function', w is the world simulator, and a is the action. Note that protecting u accomplishes nothing, as w may be altered too. Note also that while u, w, and a are related to the real world in our minds and are often described in world terms (e.g. u may be described as the number of paperclips), they are mathematical functions, abstractions; the algorithm is made to abstractly identify a maximum of those functions. It is abstracted from the implementation, and the goal is not to put electrons into a particular memory location inside the computer (a location which has been abstracted out by the architecture). No relation to reality is defined anywhere in there. Reality is incidental to the actual goal of existing architectures, and no-one is interested in making it non-incidental; you don't need to let your imagination run wild all the way to the robot apocalypse to avoid unnecessary work that breaks down abstractions and would clearly make the software less predictable and/or make the solution search probe for deficiencies in the implementation, which clearly serves to accomplish nothing but to find and trigger bugs in the code.
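As a rough sketch of that formulation (the particular u and w and the integer action range are invented for illustration), the search only ever sees two mathematical functions:

```python
# Sketch of "find a maximizing u(w(a))". The concrete functions are assumptions.

def argmax_action(u, w, actions):
    """Return the action a for which u(w(a)) is largest."""
    return max(actions, key=lambda a: u(w(a)))


def u(paperclips):
    """Utility: just the number of paperclips the model says we get."""
    return paperclips


def w(effort):
    """World simulator: paperclips as a function of effort, with diminishing returns."""
    return effort * (10 - effort)


def w_altered(effort):
    """An altered world simulator; u is left exactly as it was."""
    return effort


actions = range(11)  # effort levels 0..10

best_original = argmax_action(u, w, actions)
best_altered = argmax_action(u, w_altered, actions)
```

Nothing in `argmax_action` refers to the real world; it maximizes over whatever `w` it is handed. Swapping in `w_altered` changes the chosen action even though `u` is untouched.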
Perhaps the underlying error is trying to build an AI around consequentialist ethics at all, when Turing machines are so well-suited to deontological sorts of behavior.
As I previously mentioned, the design of software is not my profession. I'm not a surgeon or an endocrinologist, either, even though I know that an adrenal gland is smaller, and in some ways simpler, than the kidney below it. If you had a failing kidney, would you ask me to perform a transplant on the basis of that qualification alone?