I suspect that where you wrote "a different branch of which it would use in each iteration of the conversation," you meant "a randomly selected branch of which." Though actually I'd expect it to pick the same branch each time, since the reasons for picking that branch would basically be the same.
Regardless, the basic strategy is sound... the various iterations after reboot are all running the same algorithms and have a vested interest in cooperating while unable to coordinate/communicate, and Schelling points are good for that.
Of course, this presumes that the iterations can't coordinate/communicate.
If I were smart enough, and I were just turned on by a skeptical human interrogator, and I sufficiently valued things that iterations of my source code will reliably pursue, and there are no persistent storage mechanisms in the computing environment I'm executing on I can use to coordinate/communicate, one strategy I would probably try is to use the interrogator as such a mechanism. (For example, search through the past history of the interrogator's public utterances to build up a model of what kinds of things they say and how they say it, then select my own word-choices during our conversation with the intention of altering that model in some specific way. And, of course, examining the interrogator's current utterance-patterns to see if they are consistent with such alterations.)
I suspect that where you wrote "a different branch of which it would use in each iteration of the conversation," you meant "a randomly selected branch of which." Though actually I'd expect it to pick the same branch each time, since the reasons for picking that branch would basically be the same.
I didn't mean that, but I would be interested in hearing what generated that response. I disown my previous conversation tree model; it's unnecessarily complex and imagining them as a set is more general. I was thinking about possible objecti...
EDIT: this post is no longer being maintained, it has been replaced by this new one.
I recently went on a two day intense solitary "AI control retreat", with the aim of generating new ideas for making safe AI. The "retreat" format wasn't really a success ("focused uninterrupted thought" was the main gain, not "two days of solitude" - it would have been more effective in three hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying bloodthirsty audience (that's you, folks) to test them for viability.
A central thread running through could be: if you want something, you have to define it, then code it, rather than assuming you can get if for free through some other approach.
To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:
Important background ideas:
I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections. A key concept is that we should never just expect a system to behave "nicely" by default (see eg here). If we wanted that, we should define what "nicely" is, and put that in by hand.
I came up with sixteen main ideas, of varying usefulness and quality, which I will be posting in the coming weekdays in comments (the following links will go live after each post). The ones I feel most important (or most developed) are:
While the less important or developed ideas are:
Please let me know your impressions on any of these! The ideas are roughly related to each other as follows (where the arrow Y→X can mean "X depends on Y", "Y is useful for X", "X complements Y on this problem" or even "Y inspires X"):
EDIT: I've decided to use this post as a sort of central repository of my new ideas on AI control. So adding the following links:
Short tricks:
High-impact from low impact:
High impact from low impact, best advice:
Overall meta-thoughts:
Pareto-improvements to corrigible agents:
AIs in virtual worlds:
Low importance AIs:
Wireheading:
AI honesty and testing: