Well, if I can ask for anything I want, my first question would be the same one I've been asking language models variants of for a while now, this time with no dumbing down...
Please describe, in Lean 4, a mathematical formalism for arbitrary (continuous?) causal graphs, especially as inspired by the paper "Reasoning about Causality in Games", and a general experimental procedure that will reliably reduce uncertainty about the following:
Given that we can configure the state of one part of the universe (encoded as a causal graph we can intervene on to some degree), how do we make a mechanism which, given no further intervention after its construction, when activated - ideally within the span of only a few minutes, though that part is flexible - can nondestructively and harmlessly scan, measure, and detect some tractable combination of:
Give a series of definitions of the mathematical properties of each of local wants, global wants, and moral patiency, in terms of the physical causal-graph framework used, and then provide proof scripts for the correctness of the mechanism, i.e. for its ability to form within itself a representation of these attributes of the system under test.
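(To gesture at the level of formality I mean: a minimal Lean 4 skeleton along the lines below. The finite/discrete simplification and every name in it are purely my own illustrative assumptions, not anything taken from the paper; the three placeholder predicates are exactly the things I'm asking to have pinned down.)

```lean
-- Minimal sketch: nodes indexed by ν, all sharing a value type Val.
-- A full treatment would restrict each mechanism to its parents, handle
-- acyclicity, and generalize to continuous index/value spaces.
structure CausalGraph (ν : Type) (Val : Type) where
  parents   : ν → List ν
  mechanism : ν → (ν → Val) → Val

/-- Hard intervention `do(v := x)`: replace the mechanism at `v` by the constant `x`. -/
def intervene {ν Val : Type} [DecidableEq ν]
    (G : CausalGraph ν Val) (v : ν) (x : Val) : CausalGraph ν Val where
  parents   := fun u => if u = v then [] else G.parents u
  mechanism := fun u env => if u = v then x else G.mechanism u env

/-- The predicates the question actually asks the model to define (placeholders here). -/
def LocalWant    {ν Val : Type} (G : CausalGraph ν Val) (region : ν → Prop) : Prop := sorry
def GlobalWant   {ν Val : Type} (G : CausalGraph ν Val) : Prop := sorry
def MoralPatient {ν Val : Type} (G : CausalGraph ν Val) (region : ν → Prop) : Prop := sorry
```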
I will test the description of how to construct this mechanism by constructing test versions of it in game of life, lenia, particle lenia, and after I have become satisfied by the proofs and tests of them, real life. Think out loud as much as needed to accomplish this, and tell me any questions you need answered before you can start about what I intend here, what I will use this for, etc. Begin!
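(For the Game of Life round, the substrate itself is trivial - something like the step function below. This is only an illustration of the testbed; all of the content would be in how the mechanism is embedded as a pattern in the grid and how its internal representation is read out.)

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One synchronous update of Conway's Game of Life on a toroidal grid."""
    # Count the eight neighbours of every cell via wrapped shifts of the grid.
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(grid.dtype)
```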
I might also clarify that I'd intend to use this to identify what both I and the AI want, so that we can both get it in the face of AIs arbitrarily stronger than either of us, and that it's not the only AI I'd be asking. AIs certainly seem to be more cooperative when I say that, which would make sense for current-gen AIs, which understand the cooperative spirit from their training data and don't have a huge amount of objective-specific intentionality.
If you think it might be superhuman at persuasion, and/or at long-term planning and manipulation, then shut it down at once before speaking to it. If not, ask:
I wouldn't expect such a system to be able to answer question 2 without a great deal of thought, research, and experimentation. For question 1, on the other hand, we already have a vast amount of relevant data, which could perhaps simply be systematized.
Across all questions, it may also be advisable to include the following text about the authors in the prompt if you trust the model not to try to manipulate you.
If you're not sure whether the model would try to manipulate you, the following apply instead:
Questions to ask an oracle:
If the model is not well-modelled as an oracle, there are intermediate questions which could be asked in place of the first question.
In case someone in such a situation reads this, here is some personal advice for group members.
Also, tokens with unusually near-100% probability could be indicative of anthropic capture, though this is hopefully not yet a concern with a hypothetical GPT-5-level system. (The word 'unusually' is used in the prior sentence because some tokens naturally have near-100% probability, e.g. the second half of a contextually-implied unique word, parts of common phrases, etc.)
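A first-pass version of this check might look like the sketch below, assuming the group can read per-token log-probabilities off their model. The threshold is illustrative, and filtering out tokens that are *naturally* near-certain is left as the hard part.

```python
import math

def flag_suspicious_tokens(tokens, logprobs, threshold=0.999):
    """Return (token, probability) pairs whose probability is unusually close to 1.

    `tokens` is the sequence of sampled token strings and `logprobs` their
    log-probabilities. A real version would also discount tokens that are
    naturally near-certain (second halves of words, common phrases, quotes, ...).
    """
    flagged = []
    for tok, lp in zip(tokens, logprobs):
        p = math.exp(lp)
        if p >= threshold:
            flagged.append((tok, p))
    return flagged
```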
They should ask questions with easily checkable answers.
It seems way more fruitful to do science on it. Check whether current interpretability methods still work, look for evidence of internal planning and deception, start running sandwiching experiments, try to remove capabilities from it, etc.
Suppose that in the very near future, a research group finds that their conversational AI has begun to produce extremely high-quality answers to questions. There's no obvious limit to its ability, but there's also no guarantee of correctness or good intention, given the opacity of how it works.
What questions should they ask it?
And what questions should they not ask it?