shminux comments on Evaluating the feasibility of SI's plan - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (186)
A trans-human intelligence ought to be able to model human one with ease. This means being able to predict potential triggers and being able to predict how to trick the lack-wit humans on the other end to unwittingly reveal the location of the triggers (even if they don't consciously know it themselves). So the only trigger that matters is one to detect a hint of an intent to get out. Even that is probably too naive, as there could well be other failure modes of which AI deboxing is but a side effect, and our limited human imagination will never going to catch them all. My expectation is that if you rely on safety triggers to bail you out (instead of including them as a desperate last-ditch pray-it-works defense), then you might as well not bother with boxing at all.
My whole point of 'defense in depth' was that each layer was highly fallible and could have errors. Your expectation only holds if you expect failure to be perfectly correlated or multiple layers actually reduce the strength of layers, otherwise the probability of the AI beating layers A and B necessarily is less than beating just A or B (A ^B < A v B).
That's true. However I would expect a transhuman to be able to find a single point of failure which does not even occur to our limited minds, so this perfect correlation is a virtual certainty.
Now you're just ascribing magical powers to a potentially-transhuman AI. I'm sure there exists such a silver bullet, in fact by definition if security isn't 100%, that's just another way of saying there exists a strategy which will work; but that's ignoring the point about layers of security not being completely redundant with proofs and utility functions and decision theories, and adding some amount of safety.
Disengaging.