Edit: contest closed now, will start assessing the entries.
The contest
I'm offering $1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: ones that allow us to get information out of the Oracle without increasing risk.
To enter, put your suggestion in the comments below. The contest ends at the end[1] of the 31st of August, 2019.
Oracles
A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.
This is, of course, not safe in general; an Oracle AI can influence the world through the contents of its answers, allowing it to potentially escape the sandbox.
Two of the safest designs seem to be the counterfactual Oracle, and the low bandwidth Oracle. These are detailed here, here, and here, but in short:
- A counterfactual Oracle is one whose objective function (or reward, or loss function) is only non-trivial in worlds where its answer is not seen by humans. Hence it has no motivation to manipulate humans through its answer.
- A low bandwidth Oracle is one that must select its answer from a relatively small list. Though this answer is a self-confirming prediction, the negative effects and potential for manipulation are restricted, because there are only a few possible answers available.
Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren't asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.
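As a rough sketch of what the low bandwidth, episodic setup amounts to (the names below are placeholders for illustration, not part of either design):

```python
def low_bandwidth_episode(oracle, question, allowed_answers):
    """One episode of a low bandwidth Oracle (illustrative sketch only).

    The answer must come from a small fixed list, and the Oracle receives
    exactly one reward, determined by the end of this single episode; it
    is never asked further questions before the episode ends.
    """
    answer = oracle.answer(question, options=allowed_answers)
    assert answer in allowed_answers             # the bandwidth restriction
    reward = evaluate_at_end_of_episode(answer)  # placeholder scoring
    oracle.receive_reward(reward)
    return answer
```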
Getting useful answers
The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer "ask" the Oracle any question we feel like, and we certainly can't have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us - it's about another world, one that we don't inhabit.
Despite this, it's possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a million projects on AI safety, but are unsure which one would perform best. We can't directly ask either Oracle, but there are indirect ways of getting advice:
- We could ask the low bandwidth Oracle which team A we should fund; we then choose a team B at random, and reward the Oracle if, at the end of a year, we judge A to have performed better than B.
- The counterfactual Oracle can answer a similar question, indirectly. We commit that, if we don't see its answer, we will select a team A and a team B at random and fund them for a year, comparing their performance at the end of the year. We then ask which team A[2] it expects to most consistently outperform any team B (a much-simplified sketch of this setup follows below).
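Here is that much-simplified sketch of the counterfactual funding question; every function and method name is a hypothetical placeholder, not part of either Oracle design. The Oracle outputs a predicted win probability for every team, is scored only in the erasure worlds on the random pair that actually gets funded, and in the other worlds we simply fund its top pick.

```python
import random

def counterfactual_funding_episode(oracle, teams, erasure_prob=0.01):
    """Much-simplified sketch of the counterfactual funding question.

    The Oracle predicts, for each team A, the probability that A would
    outperform a randomly chosen team B after a year of funding. It is
    only scored in erasure worlds, on the one (A, B) pair that actually
    gets funded at random.
    """
    predictions = oracle.predict_win_probabilities(teams)  # hypothetical interface
    if random.random() < erasure_prob:
        # Erasure world: the answer is never read. Fund two random teams
        # and score the Oracle's prediction on how that pair turns out.
        team_a, team_b = random.sample(teams, 2)
        fund_for_one_year(team_a, team_b)           # placeholder for the real process
        a_won = we_judge_better(team_a, team_b)     # human judgement at year's end
        loss = (predictions[team_a] - (1.0 if a_won else 0.0)) ** 2
        oracle.receive_loss(loss)
        return None
    # Non-erasure world: we read the predictions and fund the top pick.
    return max(teams, key=lambda t: predictions[t])
```

The squared error acts as a proper scoring rule, so the Oracle does best by reporting its honest probabilities.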
Both these answers get around some of the restrictions by deferring to the judgement of our future or counterfactual selves, averaged across many randomised universes.
But can we do better? Can we do more?
Your better questions
This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work.
So I'm offering $1,000 for interesting new questions we can ask of these Oracles. Of this:
- $350 for the best question to ask a counterfactual Oracle.
- $350 for the best question to ask a low bandwidth Oracle.
- $300 to be distributed as I see fit among the non-winning entries; I'll be mainly looking for innovative and interesting ideas that don't quite work.
Exceptional rewards go to those who open up a whole new category of useful questions.
Questions and criteria
Put your suggested questions in the comments below. Because of the illusion of transparency, it is better to explain more rather than less (within reason).
Comments that are submissions must be in their own separate comment threads, start with "Submission", and must specify which Oracle design they are submitting for. You may submit as many entries as you want, though I will delete any that I judge to be spam. Anyone can comment on any submission. I may choose to ask for clarifications on your design; you may also choose to edit the submission to add clarifications (label these as edits).
It may be useful for you to include details of the physical setup, what the Oracle is trying to maximise/minimise/predict, and what the counterfactual behaviour of the human Oracle users is assumed to be (in the counterfactual Oracle setup). Explanations as to how your design is safe or useful could be helpful, unless it's obvious. Some short examples can be found here.
EDIT after seeing some of the answers: decide on the length of each episode, and how the outcome is calculated. The Oracle is run only once per episode (and other Oracles can't generally be used on the same problem; if you want to run multiple Oracles, you have to justify why this would work), and has to get its objective/loss/reward by the end of that episode, which therefore has to be estimated in some way at that point.
A note on timezones: as long as it's still the 31st of August anywhere in the world, your submission will be counted. ↩︎
These kinds of conditional questions can be answered by a counterfactual Oracle; see the paper here for more details. ↩︎
Submission. For the counterfactual oracle, ask it to provide a proof of an important mathematical theorem (perhaps one of the Millennium Prize problems) in an automated theorem prover format. Since the correctness of this proof should be verifiable by a theorem prover, the loss function in the counterfactual scenario is 1 if the prover did not validate the proof and 0 if it did.
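For concreteness, a toy illustration of the kind of machine-checkable output this asks for (an elementary textbook fact, not a Millennium problem); the counterfactual loss would be 0 exactly when the proof term type-checks against the stated theorem:

```lean
-- Toy stand-in for the Oracle's output: a formal statement plus a proof
-- term that the proof checker either accepts or rejects automatically.
theorem add_comm_example (a b : ℕ) : a + b = b + a :=
nat.add_comm a b
```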
This assumes that we've already made progress in setting up automated theorem proving software that has incorporated all of current mathematical knowledge. The Lean theorem prover seems most promising, but perhaps the counterfactual loss function could involve randomly choosing among several different theorem provers.
This idea has the benefit of having a concrete way of automatically scoring the response in the counterfactual world given the logical nature of mathematical statements. Also, given that we are already attempting to solve these problems and may eventually solve them, the main risk seems to be accelerating follow-on progress enabled by such proofs and not suggesting new actions that humanity is not already pursuing.
It may be susceptible to 1) providing proofs that are not human-interpretable and 2) taking advantage of bugs in theorem provers. Having an ensemble of theorem provers may mitigate 2).
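A rough sketch of how the ensemble-of-checkers loss might be automated in the counterfactual branch; the commands and file names below are illustrative assumptions, not tested invocations of any particular prover:

```python
import subprocess

# Illustrative placeholders: the real commands depend on which provers are
# used and on how the Oracle's proof is serialised to disk.
CHECKERS = [
    ["lean", "oracle_proof.lean"],  # assumed to exit non-zero on failure
    ["coqc", "oracle_proof.v"],
]

def counterfactual_loss():
    """Loss 0 only if every checker in the ensemble accepts its proof file.

    Requiring agreement from several independent checkers reduces the
    chance that the Oracle exploits a bug in any single one of them.
    """
    for cmd in CHECKERS:
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            return 1.0  # some checker rejected the proof
    return 0.0
```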
(See here for more info on Lean and here (tweeted about here and here) for an example of inputting all the definitions (not requisite proofs or lemmas) to define perfectoid spaces as a demonstration of its future potential.)
--------------------------------------------------------
Submission. For the counterfactual oracle, ask it to output a game-playing software agent that can beat a specified opponent agent (or ensemble of agents). In the counterfactual scenario, have these two agents compete and give the oracle a reward of 1 if it wins more than 50% of the games (or achieves a higher Elo rating, etc.) and a reward of 0 otherwise.
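A minimal sketch of that counterfactual scoring loop, where `play_game` and the agent interface are assumed placeholders rather than anything from the submission:

```python
def counterfactual_game_score(oracle_agent, opponent, n_games=1000):
    """Reward 1 if the Oracle's submitted agent wins a majority of games.

    This only runs in the counterfactual branch, where no human ever sees
    the submitted agent; everything here is automated self-play.
    """
    wins = sum(
        1 for _ in range(n_games)
        if play_game(oracle_agent, opponent) == "oracle_win"  # placeholder
    )
    return 1.0 if wins / n_games > 0.5 else 0.0
```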
As with deep reinforcement learning, this requires a game or simulated environment in which automated self-play can rapidly be conducted for the counterfactual scoring. This should be able to produce highly skilled game agents in such limited domains.
This could be augmented by asking for an agent that can play a suite of games and must beat the current state of the art approaches at all games. In the counterfactual scenario, either one of these games is randomly selected to check the oracle agent's performance, or all games are checked. Such a game suite may indirectly demonstrate principles for generally competent agents.
--------------------------------------------------------
Idea. Since we may not want the oracle to suggest to us new actions to take or new experiments to conduct, it may be better for the oracle to reveal to us new patterns that are already present in current experimental or observational datasets. For example, it could predict new astronomical phenomena that are already present in current astronomical datasets.
The problem I'm having is how to automate this scoring in the counterfactual scenario. If we just ask the oracle to predict some pattern in the data, it may report trivial patterns. This could be ameliorated by randomly choosing from several different datasets that cover a similar area, but again I'm not sure how to automatically judge the novelty of predictions.
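A sketch of the random-held-out-dataset idea mentioned above, with hypothetical names throughout; note that this only automates accuracy, and does nothing to solve the novelty problem:

```python
import random

def counterfactual_dataset_loss(oracle, datasets):
    """Score the Oracle's claimed pattern on a randomly chosen held-out
    dataset covering a similar area (e.g. another sky survey).

    Illustrative only: predict_pattern is assumed to return a predictor
    over observations. Trivial patterns can still score well, so this
    does not address the novelty concern above.
    """
    held_out = random.choice(datasets)
    predictor = oracle.predict_pattern()
    errors = [(predictor(obs.inputs) - obs.target) ** 2 for obs in held_out]
    return sum(errors) / len(errors)
```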
I think it is worth emphasizing this distinction between conducting new experiments and revealing phenomena (with new theories) that are already present in current data. The danger is that one could be led astray by a false theory that seems to match all current observations but would suggest other dangerous actions.
This idea reminds me of the (fictional) evidence for an alien AI already present in SETI recordings that Wintermute/Neuromancer discovers.