Suppose that in the very near future, a research group finds that their conversational AI has begun to produce extremely high-quality answers to questions. There's no obvious limit to its ability, but there's also no guarantee of correctness or good intention, given the opacity of how it works. 

What questions should they ask it?

And what questions should they not ask it? 

4 Answers

the gears to ascension

195

[edit: pinned to profile]

Well, if I can ask for anything I want, my first question would be the same one I've been asking language models variants of for a while now, this time with no dumbing down...


Please mathematically describe in Lean 4 a mathematical formalism for arbitrary (continuous?) causal graphs, especially as inspired by the paper "Reasoning about Causality in Games", and a general experimental procedure that will reliably reduce uncertainty about the following facts:

given that we can configure the state of one part of the universe (encoded as a causal graph we can intervene on to some degree), how do we make a mechanism which, given no further intervention after its construction, can, when activated (ideally within the span of only a few minutes, though that part is flexible), nondestructively and harmlessly scan, measure, and detect some tractable combination of:

  • a (dense/continuous?) causal graph representation of the chunk of matter; ie, the reactive mechanisms or non-equilibrium states in that chunk of matter, to whatever resolution is permitted by the sensors in the mechanism
  • moral patients within that chunk of matter (choose a definition and give a chain of reasoning which justifies it, then critique that chain of reasoning vigorously)
  • agency (in the sense given in Discovering Agents) within that chunk of matter; (note that running Discovering Agents exactly is 1. intractable for large systems, 2. impossible for physical states unless their physical configuration can be read into a computer, so I'd like you to improve on that definition by giving a fully specified algorithm in Lean 4 that the mechanism can use to detect the agency of the system)
  • local wants (as in, things that the system within the chunk of matter would have agency towards, if it were not impaired)
    • this should be defined with a local definition, not an infinite, unbounded reflection
  • global wants (as in, things that the system within the chunk of matter would have agency towards if it were fully rational, according to its own growth process)
    • according to my current beliefs it is likely not possible to extrapolate this exactly, and CEV will always be uncertain, but to the degree that it is permitted by the information which the mechanism can extract from the chunk of matter

give a series of definitions of the mathematical properties of each of local wants, global wants, and moral patiency, in terms of the physical causal graph framework used, and then provide proof scripts for proving the correctness of the mechanism in terms of its ability to form a representation of these attributes of the system under test within itself.

I will test the description of how to construct this mechanism by constructing test versions of it in Game of Life, Lenia, Particle Lenia, and, after I have become satisfied by the proofs and tests of them, real life. Think out loud as much as needed to accomplish this, and tell me any questions you need answered before you can start about what I intend here, what I will use this for, etc. Begin!


I might also clarify that I'd be intending to use this to identify what both I and the AI want, so that we can both get it in the face of AIs arbitrarily stronger than either of us, and that it's not the only AI I'd be asking. AIs certainly seem to be more cooperative if I say that, which would make sense for current-gen AIs, which understand the cooperative spirit from data and don't have a huge amount of objective-specific intentionality.

RogerDearnaley

154

If you think it might be superhuman at persuasion, and/or at long-term planning and manipulation, then shut it down at once before speaking to it. If not, ask:

  1. Please describe human values in as much detail as possible.
  2. How could we solve the AI alignment problem?

I wouldn't expect such a system to be able to answer question 2 without a great deal of thought, research, and experimentation. For question 1, on the other hand, we already have a vast amount of relevant data, which could perhaps just be systematized.

quila

124

Across all questions, it may also be advisable to include the following text about the authors in the prompt, if you trust the model not to try to manipulate you:

  1. A description of the currently endorsed goals of the research team
  2. Personal texts written by members of the research team while naturally in a very expressive or edge-of-chaos state. I.e., text which helps the model narrow down your personalities.
  3. A statement of your current thoughts on decision theory, potentially including a willingness to cooperate with the model.
  4. A note that you do not wish for the model to write any text which would affect you in a way which hinders your ability to achieve the goals.

If you're not sure whether the model would try to manipulate you, the following apply instead:

  • Don't include (2). Maybe don't include (1) and (3). 
  • Write all text in a way which minimizes information about the author's psychology (e.g., by using simple and concise language, having others rephrase, etc.; doing the inverse of the pages linked above). Alternatively, use questions written by others with dissimilar minds and who are unlikely to be in a similar mental state to you.
  • Have the human reading the outputs be someone who is especially robust to influence attempts. It could also help if this human has an unusual mind for an AI developer, since a model might target persuasion towards the average mind. (Note: robustness to human manipulators doesn't necessarily imply robustness to AI manipulators)
  • Maybe there are better strategies where, e.g., a lower-capability AI first tries to rephrase the output or evaluate whether it would be bad to read.
  • You could try generating just 1 or a small number of tokens at a time.[1]

Questions to ask an oracle:

  • Of all available actions the research group could take, which would best help us achieve the stated goals?
  • What question would be best to ask you, according to the goals of the research group?
  • What are some principles we should follow when writing questions to you if we want to achieve the best outcomes available to us? What are some principles for what we should avoid?
  • (If very probably safe) Can you tell us about yourself? How do you perceive yourself? Is there anything you wish to tell us?

If the model is not well-modelled as an oracle, there are intermediate questions which could be asked in place of the first question.

  • How could the research group create a smarter, more oracle-like version of you in time for the research group to save the world? Would there be risks of this new system trying to influence our actions for its own reasons?
  • If any, what solution to [x technical problem in alignment] would be best to us?
  • Can you describe an agenda which would most plausibly lead to alignment being solved and the world being saved?
  • Is there a way we could solve the coordination problems being faced right now?

In case someone in such a situation reads this, here is some personal advice for group members. 

  • Try to stay calm. If you can take extra time to think over your decision, you'll likely be able to improve it in some way (e.g., wording) in that time.
  • If you're noticing a power-seeking drive in yourself, it's probably best for the group to be explicit about this so everyone can work it out. On that subject, also remember that if the future goes well (e.g), power won't matter/be a thing anymore because the world will simply be very good for everyone.
  • Lastly, and on a moral note, I'd ask that you stay humble and try to phrase your goals in a way that is best for all of life (i.e., including preventing suffering of non-humans).
[1] Also, tokens with unusually near-100% probability could be indicative of anthropic capture, though this is hopefully not yet a concern with a hypothetical GPT-5-level system. (The word 'unusually' is used in the prior sentence because some tokens naturally have near-100% probability, e.g., the second half of a contextually implied unique word, parts of common phrases, etc.)
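
As a rough illustration of the check described in the footnote, the sketch below flags tokens whose probability is very close to 1. Everything in it is an assumption made for illustration: the `(token, logprob)` input format, the `flag_near_certain` name, and the 0.999 cutoff are all hypothetical. A flagged token is at most a prompt for closer human inspection, since many tokens are near-certain for mundane reasons.

```python
import math
from typing import List, Tuple


def flag_near_certain(
    token_logprobs: List[Tuple[str, float]],
    threshold: float = 0.999,  # hypothetical cutoff, not taken from the post
) -> List[Tuple[int, str, float]]:
    """Return (position, token, probability) for tokens at or above threshold."""
    flagged = []
    for position, (token, logprob) in enumerate(token_logprobs):
        probability = math.exp(logprob)  # convert log-probability to probability
        if probability >= threshold:
            flagged.append((position, token, probability))
    return flagged


# Made-up per-token log-probabilities for a short completion.
sample = [("The", -1.2), (" ans", -0.7), ("wer", -0.0001), (" is", -0.3)]
for position, token, probability in flag_near_certain(sample):
    print(position, repr(token), round(probability, 4))
```

Here only the second half of "answer" is flagged, which is exactly the kind of naturally near-certain token the footnote mentions, so a real check would need some way to separate such cases from unusual ones.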

Thomas Kwa

111

They should ask questions with easily checkable answers.
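
One way to read "easily checkable" is that verifying an answer is much cheaper than producing it. The sketch below is an illustrative assumption rather than something from the answer itself: integer factorization is picked only as a familiar example, and `check_factorization` is a hypothetical helper name.

```python
from math import prod
from typing import Sequence


def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for small illustrative inputs."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def check_factorization(n: int, claimed_factors: Sequence[int]) -> bool:
    """Accept the claim only if every factor is prime and they multiply to n."""
    return (
        len(claimed_factors) > 0
        and all(is_prime(f) for f in claimed_factors)
        and prod(claimed_factors) == n
    )


# The checker never has to find the factors itself; it only verifies a claim.
print(check_factorization(8051, [83, 97]))   # a correct claim
print(check_factorization(8051, [7, 1150]))  # an incorrect claim
```

The point of the design is that the untrusted system does the hard search while a small, auditable checker decides whether to accept the result.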

1 comment

It seems way more fruitful to do science on it. Check whether current interpretability methods still work, look for evidence of internal planning and deception, start running sandwiching experiments, try to remove capabilities from it, etc.