Pentashagon comments on Reply to Holden on 'Tool AI' - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (348)
I'm saying that using the word "hardwiring" is always harmful because they imagine an instruction with lots of extra force, when in fact there's no such thing as a line of programming which you say much more forcefully than any other line. Either you know how to program something or you don't, and it's usually much more complex than it sounds even if you say "hardwire". See the reply above on "hardwiring" Deep Blue to protect the light-square bishop. Though usually it's even worse than this, like trying to do the equivalent of having an instruction that says "#define BUGS OFF" and then saying, "And just to make sure it works, let's hardwire it in!"
Presumably a well-designed agent will have nearly infallible trust in certain portions of its code and data, for instance a theorem prover/verifier and the set of fundamental axioms of logic it uses. Manual modifications at that level would be the most difficult for an agent to change, and changes to that would be the closest to the common definition of "hardwiring". Even a fully self-reflective agent will (hopefully) be very cautious about changing its most basic assumptions. Consider the independence of the axiom of choice from ZF set theory. An agent may initially accept choice or not but changing whether it accepts it later is likely to be predicated on very careful analysis. Likewise an additional independent axiom "in games of chess always protect the white-square bishop" would probably be much harder to optimize out than a goal.
Or from another angle wherever friendliness is embodied in a FAI would be the place to "hardwire" a desire to protect the white-square bishop as an additional aspect of friendliness. That won't work if friendliness is derived from a concept like "only be friendly to cognitive processes bearing a suitable similarity to this agent" where suitable similarity does not extend to inanimate objects, but if friendliness must encode measurable properties of other beings then it might be possible to sneak white-square bishops into that class, at least for a (much) longer period than artificial subgoals would last.