MarkusRamikin comments on The scourge of perverse-mindedness - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (249)
I realized a pithy way of stating my objection to that strategy: given how unlikely I think it is that the test could be passed fairly by a Friendly AI, an AI passing the test is stronger evidence that the AI is cheating somehow than that the AI is Friendly.
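The claim above is a Bayesian one, and a toy calculation makes its shape visible. The sketch below assumes only two hypotheses (the AI is Friendly, or it is cheating), and every number in it is hypothetical, chosen purely for illustration and not taken from the thread. The function name `posterior_friendly` is likewise an illustrative assumption.

```python
def posterior_friendly(prior_friendly, p_pass_if_friendly, p_pass_if_cheating):
    """Return P(Friendly | passed) by Bayes' rule over two hypotheses.

    Assumption: the AI is either Friendly or cheating, nothing else.
    """
    prior_cheating = 1.0 - prior_friendly
    joint_friendly = prior_friendly * p_pass_if_friendly
    joint_cheating = prior_cheating * p_pass_if_cheating
    return joint_friendly / (joint_friendly + joint_cheating)


# Hypothetical inputs: an even prior, a test that a Friendly AI passes
# fairly only 1% of the time, and that a cheating AI passes 50% of the time.
prior = 0.5
post = posterior_friendly(prior, p_pass_if_friendly=0.01, p_pass_if_cheating=0.5)
print(f"P(Friendly | passed) = {post:.3f}")
```

Under these made-up numbers, passing the test lowers the probability of Friendliness below the prior, because a pass is far more likely under the cheating hypothesis. Different inputs would of course give a different answer; the point is only that the direction of the update depends on the ratio of the two pass probabilities.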
If the AI is programmed so that it genuinely wants to pass the test (or the closest feasible approximation of the test) fairly, cheating isn't an issue. This isn't a matter of fast-talking its way out of a box. A properly-designed AI would be horrified at the prospect of 'cheating,' the way a loving mother is horrified at the prospect of having her child stolen by fairies and replaced with a near-indistinguishable simulacrum made from sticks and snow.
It is probably possible to pass that test by exploiting human psychology. It is probably impossible to do well on that test by trying to convince humans that your viewpoint is right.
You're talking past orthonormal. You're assuming a properly-designed AI. He's saying that accomplishing the task would be strong evidence of unfriendliness.
What Phil said, and also:
Taboo "fairly": this is another word whose specification requires the whole of human values. Proving that the AI understands what we mean by fairness and wants to pass the test fairly is no easier than proving it Friendly in the first place.
"Fairly" was the wrong word in this context. Better might be 'honest' or 'truthful.' A truthful piece of information is one which increases the recipient's ability to make accurate predictions; an honest speaker is one whose statements contain only truthful information.
About what? Anything? That sounds very easy.
Remember Goodhart's Law - what we want is G, Good, not any particular G* normally correlated with Good.
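A minimal sketch of the G versus G* distinction, assuming a made-up list of candidate plans with hypothetical scores (none of these names or numbers come from the thread). An optimizer that sees only the proxy G* picks the plan that games it:

```python
# Hypothetical candidate plans: (name, proxy score G*, true value G).
# The scores are invented for illustration only.
candidates = [
    ("honest effort", 6, 6),
    ("modest improvement", 7, 7),
    ("game the metric", 10, 1),
]


def pick_by(index):
    """Return the name of the plan with the highest score in the given column."""
    return max(candidates, key=lambda plan: plan[index])[0]


chosen_by_proxy = pick_by(1)  # optimizing G*, the measurable stand-in
chosen_by_true = pick_by(2)   # optimizing G, what we actually want
print(chosen_by_proxy, "|", chosen_by_true)
```

The proxy and the true value agree on the first two plans, which is why the proxy looked reasonable; they come apart exactly at the extreme the optimizer seeks out.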
Walking from Helsinki to Saigon sounds easy, too, depending on how it's phrased. Just one foot in front of the other, right?
Humans make predictions all the time. Any time you perceive anything and are less than completely surprised by it, that's because you made a prediction which was at least partly successful. If, after receiving and assimilating the information in question, any of your predictions is reduced in accuracy, or any part of your map becomes less closely aligned with the territory, then the information was not perfectly honest. If you ignore or misinterpret it for whatever reason, even when it's in some higher sense objectively accurate, that still fails the honesty test.
A rationalist should win; an honest communicator should make the audience understand.
Given the option, I'd take personal survival even at the cost of accurate perception and ability to act, but it's not a decision I expect to be in the position of needing to make: an entity motivated to provide me with information that improves my ability to make predictions would not want to kill me, since any incoming information that causes my death necessarily also reduces my ability to think.
What Robin is saying is, there's a difference between
"metrics that correlate well enough with what you really want that you can make them the subject of contracts with other human beings", and
"metrics that correlate well enough with what you really want that you can make them the subject of a transhuman intelligence's goals".
There are creative avenues of fulfilling the letter without fulfilling the spirit that would never occur to you but would almost certainly occur to a superintelligence, not because xe is malicious, but because they're the optimal way to achieve the explicit goal set for xer. Your optimism, your belief that you can easily specify a goal (in computer code, not even English words) which admits of no undesirable creative shortcuts, is grossly misplaced once you bring smarter-than-human agents into the discussion. You cannot patch this problem; it has to be rigorously solved, or your AI wrecks the world.
Sure, but I don't want to be locked in a box watching a light blink very predictably on and off.
Building the box reduces your ability to predict anything taking place outside the box. Even if the box can be sealed perfectly until the end of time without killing you (which would in itself be a surprise to anyone who knows thermodynamics), cutting off access to compilations of medical research reduces your ability to predict your own physiological reactions. Same goes for screwing with your brain functions.
I do not think you should be as confident as you are that your system is bulletproof. You have already had to elaborate and clarify and correct numerous times to rule out various kinds of paperclipping failures - all it takes is one elaboration or clarification or correction forgotten to allow for a new one, attacking the problem this way.
How confident do you think I am that my plan is bulletproof?