The reason to work on preventing AI takeover now, as opposed to working on already invented AGI in the future, is the first try problem: if you have unaligned takeover-capable AGI, takover just happens and you don't get to iterate. The same happens with problem of extremely good future only if you believe that the main surviving scenario is "aligned-with-developer-intention singleton takes over the world very quickly, locking in pre-installed values". People who believe in such scenario usually have very high p(doom), so I assume you are not one of them.
What exactly prevents your strategy here from being "wait for aligned AGI, ask it how to make future extremely good and save some opportunity cost"?
Sure, set of available options is defined in problem setup. It's "one-box" and "two-box".
I feel like it's confusion about type signature of decision theory? Decision theory talks about mappings from observations and probabilistic models to actions. In case of humans, actions are motor outputs. Decision theory asks "what sort of motor output is the best?" and answers "which leads you to leave with one box". You are allowed to be really indecisive in process and cry "it feels wrong to leave the second box!", Omega in this scenario doesn't care.
I think the difference between reality and fiction is that fiction contains heroes - high-agency people in very personal relationships with villains.
Unrealistic part of being hero to designated villain is that in real life enemies are screened off from each other by many layers of social reality. Roosevelt was kinda hero to Hitler's villainy, but Roosevelt mostly didn't spend his nights thinking about Hitler could think, because 1) FDR was busy managing his country, 2) anything that Hitler could do was overconstrained by abilities of Germany, so it was more productive to think about what Germany as a whole could do. As a result, there is no interpersonal drama in which supervillain monologue fits.
Stanislaw Lem once wrote (I translated it from Russian, because I can't find another source in the open):
...sometimes it seems incredible to me that there are so many billions of people, many thousands of whom are gifted with imagination and expertise in the field of hypothesis generation, but no one takes on what I do. How is it that in some remote corner of Slavic land, between the North Pole and the Balkans, one madman suffers in solitude? Even from my favored statistical point of view, it follows that there must be at least a few such Quasi-Lems, Anti-Lems, Para-Lems, or Proto-Lems in the world, but here there are none.
Alternative hypothesis: when you recontextualize, you train model "you will get away with X, given this signal (encouragement to do X in system prompt)". When model sees no such signal, it decides that it won't get away with bad behavior and acts good.
I wonder what would happen if you recontextualize with pair of prompts "your actions and thoughts will be recorded and evaluated" vs "you are acting absolutely autonomously, no oversight".
It depends on properties of bounded search itself.
I.e., if you are properly calibrated domain expert who can make 200 statements on topic with assigned probability 0.5% and be wrong on average 1 time, then, when you arrive at probability 0.5% as a result of your search for examples, we can expect that your search space was adequate and wasn't oversimplified, such that your result is not meaningless.
If you operate in confusing, novel, adversarial domain, especially when domain is "the future", when you find yourself assigning probabilities 0.5% for any reason which is not literally theorems and physical laws, your default move should be to say "wait, this probability is ridiculous".
A video game based around interacting with GenAI-based elements will achieve break-out status.
- Nope. This continues to be a big area of disappointment. Not only did nothing break out, there wasn’t even anything halfway decent.
We have at least two problems on a way here:
if you are first (immensely capable) then you'll pursue (coherence) as a kind of side effect, because it's pleasant to pursue.
I'm certain it's very straw motivation.
Imagine that you are Powerful Person. You find yourself lying in bed all day wallowing in sorrows of this earthly vale. You feel sad and you don't do anything.
This state is clearly counterproductive for any goal you can have in mind. If you care about sorrows of this earthly vale, you would do better if you earn additional money and donate it, if you don't, then why suffer? Therefore, you try to mold your mind in shape which doesn't allow for laying in bed wallowing in sorrows.
From my personal experience, I have ADHD and I'm literally incapable to even write this comment without at least some change of my mindset from default.
it looks like this just kinda sucks as a means
It certainly sucks, because it's not science and engineering, it's collection of tricks which may work for you or may not.
On the other hand, we are dealing with selection effects - highly-coherent people don't need artificial means to increase it and people actively seeking artificial coherence are likely to have executive function deficits or mood disorders.
Also, some methods of increasing coherence are not very dramatic. Writing can plausibly make you more coherent because during writing you will think about your thought process and nobody will notice, because it's not as sudden as personality change after psychedelics.
This reason only makes sense if you expect first person to develop AGI to create singleton which takes over the world and locks in pre-installed values, which, again, I find not very compatible with low p(doom). What prevents scenario "AGI developers look around for a year after creation of AGI and decide that they can do better" if not misaligned takeover and not suboptimal value lock-in?