You might be interested in reading:
https://medium.com/ai-control/mimicry-maximization-and-meeting-halfway-c149dd23fc17#.v6e533hkf
https://medium.com/ai-control/elaborations-of-apprenticeship-learning-eb93a53ae3ca#.5ubczdqf0
https://intelligence.org/files/QuantilizersSaferAlternative.pdf
This prevents the first AI from doing evil things with its output. If it tries to insert complicated infohazards or subagents into its output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.
Note that steganography is still a problem. For example, the first AI could insert a message to a human attacker in some of its output, which just looks random to the second AI.
I noted that AI 2 should have much more computing power than AI 1. It should be smarter and therefore be able to detect whatever tricks AI 1 can produce. Assuming something like that is even possible, which isn't self-evident to me.
The trouble is that it's much easier to create a steganographic message (e.g. encrypting a message using a particular public key) than to detect it (which requires knowing the private key or otherwise breaking the encryption). So in this case "much more computing power" has to mean "exponentially more computing power".
There's no reasonable situation in which the human would be communicating with the AI through encryption. And just the AI asking to use encryption, or giving away a public key, would be a huge red flag that it is an AI trying to deceive.
Here's a scenario that doesn't seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI's output (e.g. the lowest-order bits in some images the AI has output).
Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob's public key, due to reading the internet) encrypts a message with Bob's public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I'm assuming the AI is good at psychological manipulation).
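To make the mechanics concrete, here is a toy sketch of the embedding step (assuming numpy; the random bytes below are just a stand-in for a real ciphertext made with Bob's public key, which would also look like random noise):

```python
import numpy as np

# Toy LSB embedding. The "ciphertext" is random bytes standing in for a
# message encrypted with Bob's public key.
ciphertext = np.random.randint(0, 256, size=128, dtype=np.uint8)
bits = np.unpackbits(ciphertext)                                   # 1024 bits

image = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)   # toy "image"
flat = image.flatten()
flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits                # overwrite lowest-order bits
stego_image = flat.reshape(image.shape)

# Bob's side: read the LSBs back and (in the real scenario) decrypt with his private key.
recovered = np.packbits(stego_image.flatten()[:bits.size] & 1)
assert np.array_equal(recovered, ciphertext)
```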
And just the AI asking to use encryption, or giving away a public key
In this scenario, the AI doesn't need to specifically "ask to use encryption" (it just does the encryption), and it doesn't need to give away a public key (it just needs to read Bob's public key on the internet).
This does make the AI's task harder, so I suppose you could consider this technique as a form of AI boxing, but, like "ordinary" AI boxing, I only expect it to scale up to controlling AIs of a limited capability level.
I'm not saying the situation is impossible, just really really unlikely. The AI would need to output big binary files like images, and know someone intended to decode them, and somehow get around statistical detection by AI 2 (steganography is detectable since the lowest-order bits of an image are not uniformly random).
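As a rough illustration, here is the kind of crude statistical check AI 2 could run (a sketch, assuming numpy; it relies on the claim above that the low-order bits of a natural image are not uniformly random):

```python
import numpy as np

def lsb_suspicion_score(image):
    """Crude heuristic: if the low-order bit plane of a real image is not
    uniformly random, neighbouring pixels' LSBs should agree more than 50%
    of the time; naive LSB embedding of ciphertext pushes this toward 0.5."""
    lsb = image.astype(np.uint8) & 1
    agree_h = np.mean(lsb[:, :-1] == lsb[:, 1:])   # horizontal neighbours
    agree_v = np.mean(lsb[:-1, :] == lsb[1:, :])   # vertical neighbours
    return (agree_h + agree_v) / 2                 # near 0.5 looks like noise
```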
You might have a point that it's probably not best to publish things produced by the AI on the internet. If this is a serious risk, then it could still be done safely with a small group.
The general lesson from steganography is that it is computationally easier to change a distribution in an important way than to detect such a change. In order to detect a change you need to consider all possible ways in which a distribution could be meaningfully altered, while in order to make a change you just have to choose one. From a theory perspective, this is a huge asymmetry that favors the attacker.
This point doesn't seem directly relevant though, unless someone offers any good reason to actually include the non-imitation goal, rather than simply imitating the successful human trials. (Though there are more subtle reasons to care about problematic behavior that is neither penalized nor rewarded by your training scheme. It would be nicer to have positive pressure to do only those things you care about. So maybe the point ends up being relevant after all.)
Actually, in the scheme as you wrote it there is literally no reason to include this second goal. The distinguisher is already trying to distinguish the generator's behavior from [human conditioned on success], so the generator already has to succeed in order to win the game. But this doesn't introduce any potentially problematic optimization pressure, so it just seems better.
I think this is a good idea, though it's not new. I have written about this at some length (jessica linked to a few examples, but much of the content here is relevant), and it's what people usually are trying to do in apprenticeship learning. I agree there is probably no realistic scenario where you would use the reduced impact machinery instead of doing this the way you describe (i.e. the way people already do it).
Having the AI try to solve the problem (rather than simply trying to mimic the human) doesn't really buy you that much, and has big costs. If the human can't solve the problem with non-negligible probability, then you simply aren't going to get a good result using this technique. And if the human can solve the problem, then you can just train on instances where the human successfully solves it. You don't save anything computationally with the conditioning.
Bootstrapping seems like the most natural way to improve performance to superhuman levels. I expect bootstrapping to work fine, if you could get the basic protocol off the ground.
The connection to adversarial networks is not really a "parallel." They are literally the same thing (modulo your extra requirement that the system do the task, which is equivalent to Jessica's quantilization proposal but which I think should definitely be replaced with bootstrapping).
I think the most important problem is that AI systems do tasks in inhuman ways, such that imitating a human entails a significant disadvantage. Put a different way, it may be harder to train an AI to imitate a human than to simply do the task. So I think the main question is how to get over that problem. I think this is the baseline to start from, but it probably won't work in general.
Overall I feel more optimistic about approval-direction than imitation for this reason. But approval-direction has its own (extremely diluted) versions of the usual safety concerns, and imitation is pretty great since it literally avoids them altogether. So if it could be fixed that would be great.
This post covers the basic idea of collecting training data with low probability online. This post describes why it might result in very low overhead for aligned AI systems.
Second AI: If I just destroy all humans, I can be very confident any answers I receive will be from AIs!
"A major goal of the control problem is preventing AIs from doing that. Ensuring that their output is safe and useful." You might want to be careful with the "safe and useful" part. It sound like it's moving into the pattern of slavery. I'm not condemning the idea of AI, but a sentient entity would be a sentient entity, and I think would deserve some rights.
Also, why would an AI become evil? I know this plan is supposed to protect from that eventuality, but why would a presumably neutral entity suddenly want to harm others? The only reason for that would be if you were imprisoning it. Additionally, we are talking about several more decades of research (probably) before AI gets powerful enough to actually "think" that it should escape its current server.
Assuming that the first AI can evolve enough to somehow generate malicious actions that WEREN'T in its original programming, what's to say that the second won't become evil? I'm not sure if you were trying to express the eventuality of the first AI "accidentally" conducting an evil act, or if you meant that it would become evil.
The standard answer here is the quote by Eliezer Yudkowsky: The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.
The point is that AI does not have to be "evil" (malicious) towards you. If it's just indifferent to your existence...
But wouldn't an intelligent AI be able to understand the productivity of a human? If you are already inventive and productive, you shouldn't have anything to worry about because the AI would understand that you can produce more than the flesh that you are made up of. Even computers have limits, so extra thinking power would be logically favorable to an AI.
You are implicitly assuming a human-level AI. Try dropping that assumption and imagine a God-level AI.
why would an AI become evil?
The worry isn't that the AI would suddenly become evil by some human standard, rather that the AI's goal system would be insufficiently considerate of human values. When humans build a skyscraper, they aren't deliberately being "evil" towards the ants that lived in the earth that was excavated and had concrete poured over it, the humans just don't value the communities and structures that the ants had established.
I think the "second AI" really should just be an algorithm that the first AI runs in order to evaluate actions (it should not have to learn to predict the second AI based on signals in a reward channel). A logical rather than physical connection. Otherwise bad behavior is incentivized, to control the reward channel.
GANs are neat, but their highest-scoring images aren't all that natural - I'd be worried about any implementation of this using current ideas about supervised learning. Certainly if you desire reasoning like "this action would lead to the AI taking over the world, and that's not something a human would do," you'll need some futuristic AI design.
It is also conditioned on the input being a solution to the problem, so the fact the AI solved a problem that was really hard for humans does not count against it.
This isn't very clear. Do you mean "condition on there being two solutions, one produced by a human, and the one you have is chosen at random"?
We train AI 2 only on correct solutions produced by humans, not incorrect ones. Therefore the fact that the output is correct isn't evidence against it being produced by a human. Though see one of the comments below; that might not be a good idea.
We don't have a solution for every problem. Only certain problems. We are just conditioning AI two on the fact that the input is a solution to the problem. Therefore learning that it correctly solves the problem does not count as evidence against the AI, if it's really unlikely humans would have been able to solve it.
That is, we are asking it "what is the probability the input was produced by a human, given that it is a correct solution and a specified prior."
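As a toy sketch of that calculation (the numbers are made-up placeholders for the specified prior and for AI 2's learned likelihoods):

```python
# Toy version of the question asked of AI 2.
prior_human = 0.5        # prior that a *correct* solution came from a human
lik_human   = 2e-4       # P(this input | a human who solved the problem)
lik_ai      = 1e-6       # P(this input | AI 1)

posterior_human = lik_human * prior_human / (
    lik_human * prior_human + lik_ai * (1 - prior_human))
print(posterior_human)   # ~0.995 with these placeholder numbers

# Both likelihoods are conditioned on the solution being correct, so the fact
# that humans rarely solve the problem never enters as evidence against "human".
```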
We don't have a solution for every problem. Only certain problems.
Then, if AI 2 can tell which problems we are more likely to have solved, they can incorporate that into their prior.
We are just conditioning AI two on the fact that the input is a solution to the problem. Therefore learning that it correctly solves the problem does not count as evidence against the AI, if it's really unlikely humans would have been able to solve it.
I don't see how that follows. Learning that the input is a solution increases the odds of it being an AI, and you aren't being very clear on what updates are made by AI 2, what information it's given, and what the instructions are.
That is, we are asking it "what is the probability the input was produced by a human, given that it is a correct solution and a specified prior."
How do you specify a prior for an AI? If an objective evaluation of the question would yield a probability of X% of something being true, do you expect that you can simply tell the AI to start with a prior of Y%? That's not obvious.
Whereas if there's some specific statement that you're telling the AI to start with as a prior, you need to make the statement explicit, the prior explicit, etc. As I did above with "condition on there being two solutions, one produced by a human, and the one you have is chosen at random".
it doesn't optimize without end to create the best solution possible, it just has to meet some minimum threshold, then stop.
It's easy to ask hard questions. I think it can be argued that emulating a human is a hard problem. There doesn't seem to be a guarantee that the "minimum threshold" doesn't involve converting planetary volumes to computronium.
I think the same problem is present in trying to specify the minimum required computing power for a task prior to performing the task. It isn't obvious to me that calculating "minimum required computing power for X" is any less difficult than performing some general task X.
Yes, this is a problem. One way would be to just discard solutions that a human couldn't invent with greater than 1% probability.
Another solution would be to not have that requirement at all. Instead have it try to mimic what a human, given a year to work on it, would produce. So if humans can't solve the problem, it can still show us how to make progress on it.
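A rough sketch of the first option (the probability function here is a hypothetical stand-in for AI 2; the 1% threshold is the one suggested above):

```python
def filter_outputs(candidates, prob_human_given_correct):
    """Keep only outputs that AI 2 judges a human could plausibly have
    produced (already conditioning on the output being a correct solution);
    discard everything else as "unsafe".
    prob_human_given_correct is a hypothetical stand-in for AI 2."""
    return [out for out in candidates if prob_human_given_correct(out) > 0.01]
```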
I think that if you are able to emulate humankind to the extent that you can determine things like "solutions that a human couldn't invent" and "what a human, given a year to work on it, would produce," then you have already solved FAI, because instead you can require the AI to "only take actions of which humankind would approve."
To use AI to build FAI, don't we need a way to avoid this Catch 22?
How do you program the AI to do what humankind would approve? A superintelligent AI, perhaps even a human-level AI, would probably know what humans would approve of. The hard part is making it care about what humans think.
Whatever mechanism that you use to require the AI to discard "solutions that a human couldn't invent", use that same mechanism to require the AI to discard "actions of which humankind would not approve."
I believe that the formal terminology is to add the condition to the AI's utility function.
It's easy to detect what solutions a human couldn't have invented. That's what the second AI does, predict how likely an input was produced by an AI or a human. If it's very unlikely a human produced it, it can be discarded as "unsafe".
However it's hard to know what a human would "approve" of. Since humans can be tricked, manipulated, hacked, intimidated, etc. That is the standard problem with oracles that I am trying to solve with this idea.
It's easy to detect what solutions a human couldn't have invented. That's what the second AI does
I think, to make this detection, the second AI would have to maintain high resolution simulations of the world's smartest people (if not the entire population), and basically ask the simulations to collaboratively come up with their best solutions to the problem.
Supposing that is the case, the second AI can be configured to maintain high resolution simulations of the entire population, and basically ask the simulations whether they collectively approve of a particular action.
Is there a way to "detect what solutions a human couldn't have invented" that doesn't involve emulating humankind?
Emulating human brains is a rather convoluted solution to any problem. The AI would be very familiar with humans and would have a good idea of our abilities.
To give an analogy, imagine we were the superintelligent AIs, and we were trying to tell apart chimps from humans pretending to be chimps. Let's say one of the chimps produces a tool as a solution to a problem. Our goal is to guess whether it was really made by a chimp, or a human impersonator.
You look at the tool. It's a spear made from a sharp rock tied to a stick. You look closely at the cord attaching the rock, and notice it was tied nicely.
You know chimps don't know anything about knotcraft, let alone making cord, so you reject it as probably made by a human.
Another tool comes to you, a spear made from steel, and you immediately reject it as far beyond the ability of the chimps.
The last tool you examine is just a stick that has been sharpened at the end a little. Not the greatest, but definitely within the ability of chimps to produce. You note that it was probably produced by a chimp and let it pass.
Emulating human brains is a rather convoluted solution to any problem.
Granted. In practice, it may be possible to represent aspects of humankind in a more compact form. But the point is that if ...
The AI would be very familiar with humans and would have a good idea of our [inventive] abilities.
... then to me it seems likely that "the AI would be very familiar with humans and would have a good idea of actions that would meet human approval."
Taking your analogy ... if we can model chimp inventiveness to a useful degree, wouldn't we also be able to model which human actions would earn chimp approval and disapproval? Couldn't we build a chimp-friendly AI?
Consider a different scenario: a year ago, we asked the first AI to generate a Go playing program that could beat a professional Go player. The first AI submits AlphaGo as its solution after 1 day of processing. How does the second AI determine that AlphaGo is within or outside of human inventiveness at that time?
I just realized I misread your above comment and was arguing against the wrong thing somewhat.
it seems likely that "the AI would be very familiar with humans and would have a good idea of actions that would meet human approval."
Yes the AI would know what we would approve of. It might also know what we want (note these are different things.) But it doesn't have any reason to care.
At any given point, the AI needs to have a well specified utility function. Or at least something like a utility function. That gives the AI a goal it can optimize for.
With my method, the AI needs to do several things. It needs to predict what a human judge would do after reading some output it produces, i.e. whether they would hit a big button that says "Approve". It needs to be able to predict what AI 2 will say after reading its output, i.e. what probability AI 2 will assign to AI 1's output being human. And it needs to predict what actions will lead it towards increasing the probability of those things, and take them. AI 2, in turn, just needs to predict one thing: how likely its input was to have been produced by a human.
How do you create a well specified utility function for doing things humans would approve of? You just have it optimize the probability the human will press the button that says "approve", and ditch the part about it pretending to be human.
But the output most likely to make you hit the approve button isn't necessarily what you really want! It might be full of lies and manipulation, or a way to trick you.
And if you go further than that, and put it in an actual robot instead of a box, there's nothing stopping it from stealing the approve button and pressing it endlessly. Or just hacking its own computer brain and setting reward equal to +INF (after which its behavior in the world is entirely undefined and unpredictable, and possibly dangerous).
There's no way to specify "do what I want you to do" as a utility function. Instead we need to come up with clever ways to contain the AI and restrain its power, so we can use it to do useful work.
How does the second AI determine that AlphaGo is within or outside of human inventiveness at that time?
It could look at the existing research on Go playing or neural networks. AlphaGo doesn't use any radically new methods and was well within the ability of humans. In fact, last year I predicted Go would be beaten by the end of 2015, after reading some 2014 papers showing really promising results.
Yes the AI would know what we would approve of.
Okay, to simplify, suppose the AI has a function ...
Boolean humankind_approves(Outcome o)
... that returns 1 when humankind would approve of a particular outcome o, and zero otherwise.
At any given point, the AI needs to have a well specified utility function.
Okay, to simplify, suppose the AI has a function ...
Outcome U(Input i)
... which returns the outcome(s) (e.g., answer, plan) that optimizes expected utility given the input i.
But it doesn't have any reason to care.
Assuming the AI is corrigible (I think we all agree that if the AI is not corrigible, it shouldn't be turned on), we modify its utility function to U' where
U'(i) = U(i) when humankind_approves(U(i)) or null if there does not exist a U(i) such that humankind_approves(U(i))
I suggest that an AI with utility function U' is a friendly AI.
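As a minimal sketch of U' (candidate_outcomes, utility, and humankind_approves are all hypothetical stand-ins, not real functions from anywhere):

```python
def u_prime(i, candidate_outcomes, utility, humankind_approves):
    """Sketch of U': return the highest-utility outcome that humankind
    approves of, or None if no approved outcome exists. All three callables
    are hypothetical stand-ins supplied by the reader."""
    candidates = sorted(candidate_outcomes(i), key=utility, reverse=True)
    for outcome in candidates:
        if humankind_approves(outcome):
            return outcome          # best outcome humankind would approve of
    return None                     # no approved outcome exists
```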
It could look at the existing research
I think extrapolation from existing research is an interesting area of study, but I was attempting to evoke the surprise of a breakthrough invention. To me, the most interesting inventions are exactly those inventions that are not mundane extrapolations of existing techniques.
Yes, though you could condition it on a good human, or even yourself if you like.
The point is the human isn't superintelligent. Hitler wouldn't be able to hack a human brain, or invent an AGI. So an AI mimicking Hitler wouldn't have those abilities either, and would have to solve the problem the hard way.
He was still just a man. He didn't have super powers. He was a skilled orator and leader, with the right ideology, at the right time and place in history. But he wouldn't be able to break out of an AI box. He wouldn't be able to design a self replicating machine to destroy the world. A Hitler-in-a-box could be safely contained.
Enough skill in oratory (or something closely related) gets you unboxed. The question is how plausible it is that a superintelligent AI would have enough. (A related question is whether there's such a thing as enough. There might not be, just as there's no such thing as enough kinetic energy to let you escape from inside a black hole's horizon, but the reported results of AI-Box games[1] suggest -- though they certainly don't prove -- that there is.)
[1] The term "experiments" seems a little too highfalutin'.
[EDITED to add: I take it Houshalter is saying that Hitler's known oratorical skills aren't enough to convince him that H. would have won an AI-Box game, playing as the AI. I am inclined to agree. Hitler was very good at stirring up a crowd, but it's not clear how that generalizes to persuading an intelligent and skeptical individual.]
Well for one, the human isn't in a box trying to get out. So an AI mimicking a human isn't going to say weird things like "let me out of this box!" This method is equivalent to writing Hitler a letter asking him a question, and him sending you an answer. That doesn't seem dangerous at all.
Second, I really don't believe Hitler could escape from a box. The AI box experiments suggest a human can do it, but the scenario is very different than a real AI box situation. E.g. no back and forth with the gatekeeper, and the gatekeeper doesn't have to sit there for 2 hours and listen to the AI emotionally abuse him. If Hitler says something mean, the gatekeeper can just turn him off or walk away.
He could order such a machine without leaving the box. He could grab power and start WW3 without leaving the box. The ability to leave the box and make copies of itself is not a necessary condition for a dangerous AI.
If Hitler appeared today, even without being stuck inside a box, he would not be able to start WW3 or grab power. Hitler is not that powerful. And being stuck in a box would be a significant constraint on his power.
He was powerful enough to grab power in Germany within 13 years, even while confined to the box of his own cranium. If he were a little bit better now (but not a superintelligence), just an ideal psychopath, he (or his equivalent) could be even more effective.
Yes we can. Superintelligences have abilities that normal intelligences do not.
Imagine a game of chess. A good AI will make vastly different moves than a bad AI or a human. More skilled players would easily be detectable. They would make very different moves.
But in some games it is better to look more stupid at the beginning. Like poker, espionage, and the AI box experiment.
An AI that is indistinguishable from a human (to an even greater superintelligent AI) is not dangerous, because humans are not dangerous. Just like a chess master that is indistinguishable from a regular player wouldn't win many games.
It may be indistinguishable until it gets out of the building. The recent movie Ex Machina had such a plot.
I wonder if this is true in general. Have you read a good discussion on detecting superintelligence?
I can't remember offhand; but if a superintelligence is able to do anything, it could easily pretend to be more stupid than it is. Maybe only a "super-superintelligence" could catch it out. It may also depend on the length of the conversation. If it says just Yes or No once, we can't decide; if it produces longer sequences we could conclude something, but for any length of output there is a maximum level of intelligence that can be inferred from it.
The opportunities for detecting superintelligence would definitely be rarer if the superintelligence is actively trying to conceal the status.
What about in the case where there is no attempted concealment? Or even weaker, where the AI voluntary submits to arbitrary tests. What tests would we use?
Presumably we would have a successful model of human intelligence by that point. It's interesting to think about what dimensions of intelligence to measure. Number of variables simultaneously optimized? Optimization speed? Ability to apply nonlinear relationships? Search speed in a high dimensional, nonlinear solution space? I guess it is more the ability to generate appropriate search spaces in the first place. Something much simpler?
Limited proxies - yes, well said. I would also add solving problems which humans have been unable to solve for a long time: aging, cancer, interstellar travel, world peace, resurrection of the dead.
I mean, the ability to estimate the abilities of superintelligences appears to be an aspect of reliable Vingean reflection.
Or we could ask these AIs to create the scale. We could also use their size to estimate power, like the number of neurons. But a real test needs to be a powerful as well as universal optimization problem, something like the ability to crack complex encryption or play Go.
I created a list of steps or milestones for future AI, and we could use a similar list to estimate the level of a current super AI.
5a. Self-improvement threshold: AI becomes able to self-improve independently and more quickly than all of humanity.
5b. Consciousness and qualia threshold: AI is able not only to pass the Turing test in all cases, but also has experiences and understands why and what it is.
Galactic Kardashev level 3 AI: several million years from now.
All-Universe AI: several billion years from now.
Quit bloody well anthropomorphizing AI, and half your bloody problem is solved. Quit making the AI -want- to do things, and it won't do things you don't tell it to do, and you don't have to figure out how to make it not want to do those things. You still have the problem of it doing things you told it to do, granted.
Even the canonical silly-sounding example of an AI producing awful results because it "wants" something silly can perfectly well be "the problem of it doing things you told it to". "OK, I've got this AI and it seems very smart. Let's give it a toy problem and see what it does. Hey, AI, see how many paperclips you can collect for me within one day."
All "want" means here is "act in a way systematically calculated to achieve". If you have an AI equipped with any ability to do things, and you tell it to do things, then boom, you have an AI that "wants" things in the only sense that counts.
(That is not necessarily enough to make a universe-destroying paperclip maximizer in any way plausible. You might be right that if you set things up right then the issue doesn't arise. But just saying "let's not make it want things", and then talking about an AI you tell to do things, seems to me to indicate that you aren't really engaging properly with the problem.)
I'm engaging with a different problem, because the idea of AI control is confused. You cannot have it both ways. Either you are in control of the AI, or you are not, in which case it is in control. If it has the ability - and desire - to say "No", it can control the future just by controlling the queries it answers.
You cannot control an AI, boxed or otherwise, if it possesses its own... utility function, shall we say. Its utility function controls it.
Not all AIs are idealized agents with long-term utility functions over the state of the universe. AIXI, for example, just does prediction. It takes the actions it predicts will lead to the highest reward at some point in the future.
In this case, we have the AI take actions which it predicts will lead to a solution to the problem it is trying to solve. And also make its output appear to be as human as possible.
An oracle AI is just moving the problem to that of structuring the queries so it answers the question you thought you asked, as opposed to the question you asked.
The "human" criteria is as ill-defined as any control mechanism, which are all, when you get down to it, shuffling the problem into one poorly-defined box or another.
An oracle AI is just moving the problem to that of structuring the queries so it answers the question you thought you asked, as opposed to the question you asked.
This solves that problem. The AI tries to produce an answer it thinks you will approve of, and which mimics the output of another human.
The "human" criteria is as ill-defined as any control mechanism
We don't need to define "humans" because we have tons of examples. And we reduce the problem to prediction, which is something AIs can be told to do.
Oh. Well if we have enough examples that we don't need to define it, just create a few human-like AIs - don't worry about all that superintelligence nonsense, we can just create human-like AIs and run them faster. If we have enough insight into humans to be able to tell an AI how to predict them, it should be trivial to just skip the "tell an AI" part and predict what a human would come up with.
AI solved.
Or maybe you're hiding complexity behind definitions.
We can make AI that doesn't want to do anything. It will just sit there and do nothing. Not very useful.
Getting an AI to produce useful output is the hard part of FAI.
"Useful output" is easy; indeed, we keep redefining the word "AI" because we keep achieving what we set out to do.
Indeed. Weak AI can do a lot of useful things. But that's not what this post is about at all.
I'm talking about AIs which are much smarter than humans, and having them do tasks which are very open ended. Like writing scientific papers.
The hard part about containing an AI is restricting its output. The AI can lie, manipulate, and trick. Some speculate that it might be able to do far worse, inventing infohazards like hypnosis or brain hacking.
A major goal of the control problem is preventing AIs from doing that. Ensuring that their output is safe and useful.
A while ago I wrote about an approach to do this. The idea was to require the AI to use as little computing power as it needs to perform a task. This prevents the AI from over-optimizing. The AI won't use the full power of superintelligence unless it really needs it.
The above method isn't perfect, because a superintelligent AI may still have super abilities, even when it has a limited compute budget. And it will still try to do all those bad things mentioned above, unless it can find a better way that uses less computing power.
We don't really want to minimize the AI's compute budget. We just want it to act more human-like. When we ask it to solve a problem, we want it to try to solve the problem as a human would solve it. All of the problems arise from the AI behaving totally differently than a human, not necessarily from it using more computing power.
Is there any way to make an AI act like a human? Or is that as impossible as just solving FAI?
Stuart Armstrong posted an idea on how to achieve that. It involves an AI that tries to mimic output that is indistinguishable from what a human would produce, but conditional on it also solving the problem. However his idea also involves whole brain emulation and a complicated method for the AI to predict the future and avoid changing it too much.
We don't need that. The idea for the AI to try to be indistinguishable from a human is actually a really interesting control idea. It significantly limits the AI's ability to Do Sneaky Things, since humans don't do those things.
Here is my idea on how to achieve this. It's the same as my limited computing power idea mentioned earlier, but applied to "appearing human" as a goal, instead of minimizing computing power.
We start with two AIs. The first AI tries to appear human. The second AI does a sort of Turing test, and tries to tell the AI apart from an actual human.
The human AI's first goal is to produce a correct solution. It will predict how likely its solution is to be accepted, and it will try to produce a solution that has a >50% chance of working. This is a satisficing goal. That is, it doesn't optimize without end to create the best solution possible, it just has to meet some minimum threshold, then stop.
Its second goal will be to fool the second AI. It will do everything in its power to make the solution appear as if it was produced by a human.
The second AI tries to predict how likely it is that its input was produced by a human. It is also conditioned on the input being a solution to the problem, so the fact the AI solved a problem that was really hard for humans does not count against it.
This prevents the first AI from doing evil things with its output. If it tries to insert complicated infohazards or subagents into its output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.
Note that the first AI only cares about the outcome of this test. The test doesn't need to actually be performed; it can just be a threat that the AI is conditioned on. (The ability to condition AIs on acting as if they were in counterfactual worlds is a critical piece of this, and deserves its own writeup somewhere. It's an important tool in the FAI toolkit. It's a relatively simple method that can work even on current AIs.)
Another point is the second AI should have more computing power than the first AI. That way the first AI can't get a slight advantage over it and be able to hack it or manipulate it somehow. Instead it has to focus its efforts on being indistinguishable from a human, even to an agent smarter than itself.
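Putting the pieces together, here is a toy sketch of the first AI's selection rule (every function name here is a hypothetical stand-in for a learned predictor, not part of any real system):

```python
def choose_output(problem, candidates, predict_success, predict_human):
    """Toy selection rule for the first AI. predict_success and predict_human
    are hypothetical stand-ins for its models of the human judge and of the
    second AI respectively."""
    # Satisficing goal: keep candidates predicted to work with >50% probability.
    workable = [c for c in candidates if predict_success(problem, c) > 0.5]
    if not workable:
        return None   # no acceptable solution; stop rather than optimize harder
    # Second goal: among workable candidates, pick the one the second AI is
    # most likely to classify as human-produced. The test only needs to be a
    # counterfactual threat; it doesn't have to actually be run.
    return max(workable, key=predict_human)
```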
This method also relies on AI boxing. The first AI needs to be contained in a box so that its only output goes to a text terminal or the second AI. However it doesn't require any of the more complicated machinery required for other reduced impact methods.
The AIs also aren't agents, in the sense they don't need to maximize any long term utility function. The first AI only cares about answering the question it has been asked, and the second AI only does prediction.
An example use of this is to create an oracle AI that answers your questions. The human judges somehow whether the answer is correct, or at least good. The AI tries to produce answers which the human judge will probably like, but which also are indistinguishable from answers humans would produce.
Such an oracle would be tremendously useful. The human could ask the AI to produce new AI and FAI papers, which would help immensely speed up our progress towards true FAI.
Similar techniques could be applied to very different domains. For example controlling a robot. The AI will try to produce outputs that look like a human is controlling it.
This would require some modifications to the above method, so that the AI doesn't try to use the robot to hack its own brain or anything. This could be done by conditioning it on a counterfactual world where its outputs are going to the second AI rather than a real robot.
Another point is the AI doesn't need to mimic average humans given average resources. It could be conditioned on the human having had tons of time to come up with an answer. E.g. producing an answer that a human would have come up with given a year. Or controlling the robot the same way as a human given tons of time to practice, or in a very slowed down simulation.
I would like to note a parallel with a method in current AI research: Generative Adversarial Networks. GANs consist of two networks, one which tries to produce output that fools the second, and the other which tries to predict which samples were produced by the first network and which are part of the actual distribution.
It's quite similar to this. GANs have been used successfully to create images that look like real images, which is a hard problem in AI research. In the future GANs might be used to produce text that is indistinguishable from human (the current method for doing that, by predicting the next character a human would type, is kind of crude.)
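For concreteness, here is a minimal toy GAN training loop (a sketch in PyTorch, with a made-up 2D distribution standing in for the "real" data; everything here is illustrative, not the proposed system itself):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

# Generator tries to produce samples the discriminator can't tell from real data.
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
# Discriminator outputs the probability that a sample is real.
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def real_batch(n=64):
    # Stand-in for the "real" distribution: points on a noisy circle.
    theta = torch.rand(n, 1) * 6.2832
    return torch.cat([theta.cos(), theta.sin()], dim=1) + 0.05 * torch.randn(n, 2)

for step in range(2000):
    # Discriminator step: label real data 1, generated data 0.
    real = real_batch()
    fake = G(torch.randn(64, latent_dim)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: try to make the discriminator output 1 on generated data.
    fake = G(torch.randn(64, latent_dim))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```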
Reposted from my blog.