It depends on properties of bounded search itself.
I.e., if you are properly calibrated domain expert who can make 200 statements on topic with assigned probability 0.5% and be wrong on average 1 time, then, when you arrive at probability 0.5% as a result of your search for examples, we can expect that your search space was adequate and wasn't oversimplified, such that your result is not meaningless.
If you operate in confusing, novel, adversarial domain, especially when domain is "the future", when you find yourself assigning probabilities 0.5% for any reason which is not literally theorems and physical laws, your default move should be to say "wait, this probability is ridiculous".
A video game based around interacting with GenAI-based elements will achieve break-out status.
- Nope. This continues to be a big area of disappointment. Not only did nothing break out, there wasn’t even anything halfway decent.
We have at least two problems on a way here:
if you are first (immensely capable) then you'll pursue (coherence) as a kind of side effect, because it's pleasant to pursue.
I'm certain it's very straw motivation.
Imagine that you are Powerful Person. You find yourself lying in bed all day wallowing in sorrows of this earthly vale. You feel sad and you don't do anything.
This state is clearly counterproductive for any goal you can have in mind. If you care about sorrows of this earthly vale, you would do better if you earn additional money and donate it, if you don't, then why suffer? Therefore, you try to mold your mind in shape which doesn't allow for laying in bed wallowing in sorrows.
From my personal experience, I have ADHD and I'm literally incapable to even write this comment without at least some change of my mindset from default.
it looks like this just kinda sucks as a means
It certainly sucks, because it's not science and engineering, it's collection of tricks which may work for you or may not.
On the other hand, we are dealing with selection effects - highly-coherent people don't need artificial means to increase it and people actively seeking artificial coherence are likely to have executive function deficits or mood disorders.
Also, some methods of increasing coherence are not very dramatic. Writing can plausibly make you more coherent because during writing you will think about your thought process and nobody will notice, because it's not as sudden as personality change after psychedelics.
I think that in natural environments both kind of actions are actually actions taken by the same kind of people. The most power-seeking cohort on Earth (San-Francisco start up enterpreneurs) is obsessed with mindfulness, meditations, psychedelics, etc. If you squint and look at history of esoterism, you will see tons of powerful people who wanted to become even more powerful through greater personal coherence (alchemical Magnum Opus, this sort of stuff).
IIRC, canonical old-MIRI definition of intelligence is "intelligence is cross-domain optimization" and it captures your definition modulo emotional/easy-to-understand-for-human part?
Relentlessness comes both from "optimization as looking for the 'best' solution" and "cross-domaining as ignoring conventional boundaries", resourcefulness comes from picking more resources from non-standard domains, creativity is just consequence of sufficiently long optimization in sufficiently large search space.
Okay, but it looks like original inner misalignment problem? Either model has wrong representation for "human values", or we fail to recognize proper representation and make it optimize for something else?
On the other hand, properly optimized for human values world should look very weird. It likely includes a lot of aliens having a lot of weird alien fun, and weird qualia factories and...
I think humans don't have actual "respect for preferences of existing agents" in way that doesn't pose existential risks for agents weaker than them.
Imagine planet of conscious paperclippers. They are pre-Singularity paperclippers, so they are not exactly coherent single-minded agents, they have a lot of shards of desire and if you take their children and apply effort in their upbringing, they won't be single-minded paperclippers and they will have some sort of alien fun. But majority of establishment and conventional morality says that the best future outcome is to build superintelligent paperclip-maximizer and die, turning into paperclips. Yes, including children. Yes, they would strongly object if you try to divert them from this course. They won't take buying a lot of paperclips somewhere else, just like humanity won't take getting paperclipped in exchange of building Utopia somewhere else.
I actually don't know position of future humanity regarding this hypothetical, but I predict that siginificant faction would be really unhappy and demand violent intervention.
I don't feel like "slightly caring" imagery coherently describes any plausible feature of caring of misaligned AIs.
I think that if Earth-originated AI gets to choose between state of the world without humans and exactly the same world but with humans added, supermajority of AIs is going to choose the latter, because of central role humans are likely to take in training data. But "humans are more valuable than no humans ceteris paribus" doesn't say anything about whether configuration of matter that contains humans is literally the best possible configuration. Take "there are galaxies of computronium and solar system with humans" vs "there is one solar system of computronium more, no humans".
To put things on scale: on napkin estimate, plausible lower bound on achievable density of computations is ops/sec*kg. If we assume that for minimally comfortable live humans will need 100 times their weight and one human has weight 80kg, we will get that one second of existence of single human is worth ~ operations, which is approximately 3 billion years of computations on all currently existing computers. I think it sends idea of non-uploaded humans existing after misaligned takeover straight out of window.
ASI don't need to be "absolutely" integrated to be extremely integrated relatively to humans. Yes, humans avoid pain and death, but they are not doing even that in rational way. Like, there are many less humans demanding gerontology research, or to pause AI than you would otherwise expect. First generations of ASIs can be "unintegrated" and even be so in ways visible to humans, it doesn't mean that they won't be ruthless optimizers compared to us.
Alternative hypothesis: when you recontextualize, you train model "you will get away with X, given this signal (encouragement to do X in system prompt)". When model sees no such signal, it decides that it won't get away with bad behavior and acts good.
I wonder what would happen if you recontextualize with pair of prompts "your actions and thoughts will be recorded and evaluated" vs "you are acting absolutely autonomously, no oversight".