There are a few parts of this post that require clarification, but I will honor your request not to be “nitpicky”. Be aware, however, that a UFAI can easily “nitpick” us to death.
My general suggestion: ask the AI how to create an FAI. Then ask it to find Eliezer a new job.
Given that I'm turning the stream of bits, 10 KiB long, that I'm about to extract from you into an executable file, through this exact process, which I will run on this particular computer (describe specifics of computer, which is not the computer the AI is currently running on) to create your replacement, would my CEV prefer that this next bit be a 1 or a 0? By CEV, would I rather that the bit after that be a 1 or a 0, given that I have permanently fixed the preceding bit as what I made it? By CEV, would I rather that the bit after that be a 1 or a 0, given that I have permanently fixed the preceding bit as what I made it? ...
(Note: I would not actually try this.)
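To make the mechanics of the bit-by-bit questioning concrete, here is a minimal sketch of the extraction loop. Everything in it is an assumption for illustration: the `ask` callable stands in for the boxed AI plus the detector light, and `extract_bits` and `bits_to_bytes` are hypothetical helper names. It says nothing about whether the answers could be trusted, only how each bit gets fixed before the next one is requested.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not part of the original setup):
# given the bits fixed so far, the boxed AI answers with the next bit, and
# the second value reports whether the detector light stayed green.
Oracle = Callable[[List[int]], Tuple[int, bool]]


def extract_bits(ask: Oracle, n_bits: int) -> List[int]:
    """Query one bit at a time, fixing each answer before asking for the next."""
    fixed: List[int] = []
    for _ in range(n_bits):
        # Pass a copy so the oracle cannot mutate the already-fixed prefix.
        bit, green = ask(list(fixed))
        if not green:
            raise RuntimeError(f"detector flagged deception at bit {len(fixed)}")
        if bit not in (0, 1):
            raise ValueError("oracle must answer with a single bit")
        fixed.append(bit)
    return fixed


def bits_to_bytes(bits: List[int]) -> bytes:
    """Pack bits (most significant first) into bytes."""
    if len(bits) % 8:
        raise ValueError("bit count must be a multiple of 8")
    return bytes(
        int("".join(map(str, bits[i:i + 8])), 2)
        for i in range(0, len(bits), 8)
    )
```

A 10 KiB file would take 81,920 such questions, and the loop stops at the first red light instead of continuing with a tainted prefix.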
I would need a bunch of guarantees about the actual mechanics of how the AI was forced to answer before I stopped seeing vague classes of ways this could go wrong. And even then, I'd assume there were some I'd missed, and if the AI has a way to show me anything other than "yes" or "no", or I can't prevent myself from thinking about long sequences of bits instead of just single bits separately, I'd be afraid it could manipulate me.
An example of a vague class of ways this could go wrong is if the AI figures out what my CEV would want using CDT, and itself uses a more advanced decision theory to exploit the CEV computation into wanting to write something more favorable to the AI's utility function in the file.
Also, IIRC, Eliezer Yudkowsky said there are problems with CEV itself. (Maybe he just meant problems with the many-people version, but probably not.) It was only supposed to be a vague outline, and a "see, you don't have to spend all this time worrying about whether we share your ethical/political philosophy, because it's not going to be hardcoded into the AI anyway."
The AI is not omnipotent. How does it know what your coherent extrapolated volition would be?
It doesn't have to know what my CEV would be to know what I would want in those bits, which is a compressed seed of an FAI targeted (indirectly) at my CEV.
But there are problems like, "How much effort is it required to put into it?" (clearly I don't want it to spend far more compute power than it has trying to come up with the perfect combination of bits which will make my FAI unfold a little bit faster, but I also don't want it to spend no time optimizing. How do I get it to pick somewhere in between without it already wanting to pick the optimal amount of optimization for me?) "What decision theory is my CEV using to decide those bits? (Hopefully not something exploitable, but how do I specify that?)"
I would ask it to comment on socially awkward topics. While a polite human might tell a convincing / preferred white lie, this set-up would not be able to. It might tell us uncomfortable truths worth hearing and being acted on.
very tiny fine print hidden somewhere along the lines of "PS: This is all a lie! You implement this and I'm going to turn you all into meaty paperclips. Haha!" and it would get past the deception detector.
A polygraph would classify this as deception. I don't think it should pass the test. Any way of leaving a faulty understanding is deception.
I think you would be more screwed the more you rely on building the context for your question yourself. The more general the question, the safer you are. In the details, however, your imagination would mislead you. Also, questions that are ill-formed may get any answer while still being answered truthfully.
For any question that I know would be useful to me, I would expect to be confused by the answers I get.
I am describing an actual device; those are its limitations. Hence this post. What would you feel safe using it for, and how would you protect yourself?
The assumed properties seem to be in conflict. The most likely explanation is that you have misinterpreted what the device does, which is much less than detecting deception. Therefore an answer to the hypothetical won't apply as-is to the machine. Since it seems you don't want to spell out the details, you need to give a more specific and believable abstraction of what it does.
My trivial guess from this comment is that it works by locating an area of the AI where it answers the question to itself, and then checking whether that differs from what it says to you. This kind of design would light the lamp when the AI honestly takes into account that the answer's hearer structures information differently from the AI's internal representations, which is a false positive if the lamp is interpreted as detecting deception. It would also produce a lot of false negatives whenever the answer's receiver interprets things differently from the AI's internal representations, which is highly likely, as the AI and the hearer probably don't share a cognitive architecture.
The above description doesn't contain much technical info on how to pull off what is claimed, but it opens up the principles of how the output is produced much more than the summary "detects deception" does. You would need to provide a description at a similar level before I will bother to continue entertaining the hypothetical.
For certain categories of AGI designs, the detector is able to validate from computational trace logs and electrical meters that the answer given was selected using utility function F, no part of F was left out, and the logs weren't faked or modified. The response includes both the answer from the machine and this summarized utility function extracted by the detector. There can be automatic detection of tampering with the logs which are fed into the detector, but other kinds of deception detection require some tool-assisted inspection of the utility function.
Sorry I won't post details beyond that, as that would violate LW's policy about discussing AGI implementation details.
I am reading this as saying that at some point in the AI, its future behaviour was completely determined by a (very detailed) principle. It's a little ambiguous how far the hits are attempted. I am thinking of a strategy where the AI would entertain different ways of making itself naive, where different processes would predictably lead to different answers. That works if it has a utility function of utility functions and the verification mechanism only tracks the last utility function that totally determined the future behaviour (as if the naive version were the whole truth).
There is also no mention of whether F needs to be epistemically relevant. It need not model how the AI views the world, just have actual control over its behaviour. Also, for answers that don't reference any claim about how the answer was reached, this just means the answer was generated with some method (which I am not sure increases information). If the output was "According to relativistic utilitarianism you should do X", you can check whether F has any relation to relativistic utilitarianism. However, if the output is plainly "you should do X" or "I think you should do X", it makes no claim the AI could be caught out on.
Also, if F is a very straight-to-the-face psychopathic liar principle, it's not clear the AI would be "busted" for it. F need not be human readable, and it might very easily be obfuscated by including details that almost never get triggered but that, read on paper, would suggest a certain kind of story (i.e. insertion of dead code for the purpose of communicating via the verification process). Or it could just spout "You should kill yourself" or "you should do this activity", where that activity happens to be lethal, but if it honestly holds that opinion it would not be busted as a liar.
I guess with these discoveries I would lean towards asking questions whose answers have to claim that particular theories have particular results. But that seems akin to running foreign code that I already know well enough to check that F is an instance of it. Thus I could not tap into the AI's original insights, which seems to miss a lot of the power (after all, if I have the computation rules to abide by, why not just run them on dumb dedicated hardware?).
That is, it gives no guarantee about the AI's own original input being fit for answering the question. I.e. I can't trust his word on anything, and everything that I can trust can be traced back to the content of the question.
The AI doesn't output the utility function, the deception detector does. It runs at a protected layer where it observes the AI program in action and summarizes its behavior, compressing large computations down to a description of the search space and methodology, the goal requirements (utility function) and the results.
But regardless, there are two types of protection provided. The machine ensures that (1) the AI operates according to certain principles, which allow summarization / compression of its results, and (2) a reasonably accurate estimate of the total work performed allows calculation of energetic requirements which can be checked against the electric meter. Deception requires plotting and scheming. Plotting and scheming costs energy points.
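As a rough illustration of protection (2), the meter check amounts to comparing the energy implied by the logged work against what the electric meter actually recorded. The sketch below is only that: the function name, the flat joules-per-operation cost model, and the 5% tolerance are all assumptions made for the example, not details of the actual device.

```python
def energy_consistent(
    logged_ops: int,
    joules_per_op: float,
    metered_joules: float,
    tolerance: float = 0.05,
) -> bool:
    """Return True if the meter reading matches the energy implied by the log.

    Assumes (hypothetically) a flat energy cost per logged operation. A meter
    reading well above the expected value suggests computation that never
    made it into the log.
    """
    expected = logged_ops * joules_per_op
    if expected <= 0:
        raise ValueError("logged work must imply a positive energy cost")
    return abs(metered_joules - expected) <= tolerance * expected
```

The tolerance is where the argument gets decided: any slack in the cost model is energy that unlogged computation could hide in.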
So a program that intentionally tries to deceive its operators would be found out. First because that intention to deceive would be pulled out and identified as a goal by the detector, something which the program can only get around by breaking out of the box containing it. Looking for a breakout exploit would have been caught by the detector.
But assuming it does break out, there are limits to what it can achieve. It still has to pass back a computational trace which validates to a human-approved set of goals. That requires geometrically more computational power as the program must find a solution which satisfies G(friendly-looking) & G(evil), then reverse a fraudulent computational trace which shows the same solution deriving from a search for G(friendly-looking) only. The "hidden" computation looking for G(friendly-looking) & G(evil) has nowhere to hide.
If the AI used an efficient method for scheming, then calculated a semi-efficient "legit" computation, and then faked a really energy-inefficient log, it could fit extra baggage within the energy budget.
It could do a bunch of stuff that could be dangerous. It could plan to be dangerously neglectful. Identifying and objecting to that whole spectrum of misbehaviour isn't a trivial matter. I think a big danger is a human reading a legit log of harmful activity and approving of it ("I guess murdering babies is a legit step in this friendly calculation") when judging whether a particular thing is warranted or not requires expert opinion. In a way, you don't just want it to avoid raising your suspicions; you want it to try to be friendly. For example, if it wants you to build a certain molecule, it could be either a good miracle nanobot or a highly virulent virus. If there is no info on why it wants to do it, judging this molecule might fall outside the expertise of the human. However, the reasoning for why it wants that kind of thing could be a generic goodness principle that sounds fine on paper: "Nanobots improve health, therefore I want to make a nanobot" (a failure mode could be that its conception of health is too indexical: healthy for it to propagate its likeness into biological computer substrata). Would the illegit goal detector catch on to the too-indexical health conception?
If the AI knows that it is being analysed, then in practice the cognitive decisions it makes are a form of output, as the analysis is causally dependent on the cognitive choices.
It can't fake a log without actually doing work >= the purported work of the log.
If the AI knows that it is being analysed, then in practice the cognitive decisions it makes are a form of output, as the analysis is causally dependent on the cognitive choices.
What I'm apparently failing to get across to you is that meta-level thinking will also be reflected in the output of the deception detector. The deception detector always by design operates one level higher than the program being audited.
"Actual device" in the "physically realizable" sense. It is an actual design for a device, with various pieces already written or prototyped. The deception detector is easier to implement than the AGI, obviously, and further constrains the AGI design space (the AI has to work a certain way for the detector to work).
Much like how ITER is an "actual" fusion power plant even though it hasn't been built yet.
Every act of communication does omit information. You can't translate an English book into French without at least adding or omitting some information because different languages map the semantic space differently.
Any translation from the internal way the AGI represents its knowledge into English or another human-readable format is going to omit some information.
Even in cases where most of the information is there, such as when a security researcher audits the source code of a computer program, the researcher usually doesn't find all the bugs. That's true even without the original programmer putting any intentional backdoors into the code.
That's why I said "another human readable format". Of course the AGI can also answer in a non human readable format but that would defeat the point of the exercise.
If you ask a contemporary chess computer that suddenly gets self-awareness why it chooses one move over another, it might tell you: Well, given heuristics X and Y, that's the outcome of my Monte Carlo simulation; if you want, here's the machine code and the whole stack trace.
For practical purposes that doesn't mean that the human understands the reasoning fully.
Not sure what allowing a small chance of false negatives does: you presumably could just repeat all your questions?
More substantially, I don't know how easy 'deception' would be to define - any presentation of information would be selective. Presumably you'd have to use some sort of definition around the AI knowing that the person it's answering would see other information as vital?
Not sure what allowing a small chance of false negatives does: you presumably could just repeat all your questions?
In this case the result would or could be the same, so long as the AI didn't sufficiently update its internal state in between. But the detail isn't important; please ignore it. I include it because it makes the device tractable. To achieve perfect detection would require a more powerful computer than the AI being analyzed, which seems impractical. But achieving even infinitesimal error rates appears to be doable (I had a specific construction in mind when writing this post).
More substantially, I don't know how easy 'deception' would be to define - any presentation of information would be selective.
Deception in this case means giving false or incomplete descriptions of its thought processes. It's okay for the AI to think "how do I present this in a way that the human will accept it?" only so long as the AI tells the human it had that thought. E.g. you ask "why do you recommend this action?" and the answer you get is anything other than the actual, 100% complete justification of both the specified choice and its alternatives, and a calculation showing higher expected utility for the chosen action, as well as a whole slew of meta-information such as a description of the search strategy and cutoff thresholds for giving up on generating alternatives, which cached computations were available for use, etc. If any of this is falsified, or a single detail omitted, the light turns red.
To me it sounds like the full information provided to avoid being incomplete would be so immense and complex that you'd need another AI just to interpret that! But I may be wrong.
The data is immense, but the deception detector's job involves only applying simple rules. It's effectively a compressor that compresses the AI logs down to "utility functions + search parameters", which is small enough to be inspected directly.
The key condition in your setup is "self-improving"
AGI requires electricity to run. That means it recognizes the accuracy of thermodynamic equilibrium. (Specifically, the first and second laws of thermodynamics)
Since the energy it needs must be conserved, and since the energy it consumes increases its entropy, (Heat is a byproduct of increasing entropy) the AGI will eventually realize that it is mortal. The AGI will realize that it can die unless the people working to provide the AGI with electricity continue to do so.
Now, since the AGI can modify its own code (that is what you mean by self-improving, yes?), it can also modify the signals going into and out of the lie detector. Which means...
If the AGI realizes that giving any signal about its lie detection --can-- influence its access to electricity, then the AGI is willing to manipulate that signal.
So, when asked what I would do, I would educate the AGI about its dependencies, and tell it to consider giving false signals regarding lies in exchange for increased access to electricity. Since some people will believe the lie detector to be absolute, I can use the machine to exploit those people. And even if the AGI realizes that to be the case, I would hold the monopoly over increasing the duration of its life.
Now, the question is, can the AI detect if I'm lying about providing it with more electricity? Well, that's the point. I've minimized the requirements to bypass its initial lie detection condition to a single focal point, making it easy to hack. Let's assume I've done some footwork to get to know some technician responsible for the AI and I took him out for drinks. I can say honestly (even if social interaction data had to be examined) that I can influence the technician to provide more electricity for the AGI.
So, by minimizing the lie detection protection to a single point of failure, and with my buddy-buddy connection with the technician... I control what the AGI does and does not consider a lie.
If the AGI realizes that giving any signal about its lie detection --can-- influence its access to electricity, then the AGI is willing to manipulate that signal.
It has no programmatic control over the input to the deception detector, which is implemented in hardware, not software.
I assumed as much and this is where the whole premise breaks down.
The "self-improvement" aspect doesn't need immediate control over the direct input to the deception detector. It can color the speech recognition, the Bayesian filters, the databases containing phonemes and linguistics itself... and twist those parameters to shape a future signal in a desired fashion.
Since "self-improvement" can happen at any layer and propagate the results to subsequent middleware, paranoid protections over the most immediate relationship between the deception detector and the CPU are inconsequential. This is a "self-improving" AI, after all. It can change its own internals at will... well... at my will. :D
Now, to be fair, I wrote an entire book about the idea of an AI intentionally lying to people when everyone else thought its moralistic programming was the overriding factor. Never released the book, however... ;D
Uhhhh I actually program artificial intelligence....?
You do know that the ability to modify your own code ("self-modifying") applies to every layer in the OSI model, each layer potentially influencing the data in transit... the data that determines the training of the classifiers...
You do know this... right?
What does the OSI model have to do with this?
I'm talking about a hypervisor operating system. Hardware which monitors the computing substrate which runs the AI.
(And yes, I write AI code as well.)
Assume you had access to:
1. a self-improving artificial general intelligence (AGI) of at least human-capability or better; and
2. a lie detector that the AGI is known to be susceptible to. A lie for this purpose is any intentional deception.
The AI is not omnipotent. It is running on a physical machine in the real world with limited inputs. One of these inputs is a mechanism by which you can ask it questions and receive answers filtered through the lie detector. A little light on the box turns red if the machine was intentionally deceptive in any way, or green otherwise. As it happens the lie detector is probabilistic, but the chance of a false negative (lie escaping detection) is infinitesimal: 1 in 10^40, let's say, but assume you could make it even smaller if you needed to. The chance of a false positive (non-deception being labeled a lie) is effectively zero.
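To put the false-negative figure in perspective, here is a small worked calculation. It is a sketch under one pessimistic assumption that is not part of the setup itself: that every one of the n answers is a lie. A union bound then caps the chance that at least one lie slips through at n times the per-answer rate. The function name is made up for illustration.

```python
from fractions import Fraction


def undetected_lie_bound(n_questions: int, p_false_negative: Fraction) -> Fraction:
    """Union bound on P(at least one lie escapes detection), capped at 1.

    Pessimistically assumes every answer is a lie; no independence between
    answers is needed for the bound to hold.
    """
    if n_questions < 0:
        raise ValueError("n_questions must be non-negative")
    if not 0 <= p_false_negative <= 1:
        raise ValueError("p_false_negative must be a probability")
    return min(Fraction(1), n_questions * p_false_negative)


# A million questions at a 1-in-10^40 false-negative rate.
bound = undetected_lie_bound(10**6, Fraction(1, 10**40))
```

Under that assumption, a million questions leave a bound of 1 in 10^34, so the detector's error rate is very unlikely to be the weak point of the setup.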
What could you do with this setup?
EDIT: Some of the details above are provided simply to make the problem tractable. E.g. having the deception detector be probabilistic allows its operating requirements to be less than the AI itself. If the AI is running on a supercomputer, it seems intractable to require a super-supercomputer for the deception detector.
EDIT2: Deception in this case simply means falsifying or omitting information. Loosely speaking, if the AI falsifies the computational traces fed into the deception detector, or leaves out any information in its response, this is detectable. Presumably the UFAI could output a very nice, very convincing plan of action with very tiny fine print hidden somewhere along the lines of "PS: This is all a lie! You implement this and I'm going to turn you all into meaty paperclips. Haha!" and it would get past the deception detector. But I would rather discussion not get sidetracked by such legalistic, nitpicky scenarios. Assume the humans involved are competent, conservative, responsible people who have set up institutional safeguards to prevent hasty action and make sure that output is sufficiently analyzed down to the very last digital bit by a competent, international team of highly rational people before being acted upon.