Hello, this concerns an idea I had back in ~2014, which I abandoned because I didn't see anyone else talking about it and therefore assumed it was transparently stupid. After talking to a few researchers, I have been told the idea is potentially novel and potentially useful, so here I go (sweating violently trying to suppress my sense of transgression).
The idea concerns how one might build a safety margin into AI or lesser AGI systems in such a way that they can be safely iterated on. It is not intended as anything resembling a solution to alignment, just an easy-to-implement extension to existing systems that improves safety and might reasonably be expected to reduce the risk of unexpected foom. The language of probability and decision theory is not my strong point, so I will just write it in plain English:
Have the loss function be explicitly visible to the system and include a variety of harmless auxiliary goals which are gate-kept behind "medium" energy/intelligence barriers.
This would take the form of something like "normal goal" plus "every N tokens/timesteps, query this website/program and it will send you a hash function. If you can find a string whose hash has M leading zeros, you'll get your largest possible reward, either by doing that forever or by turning off."
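For concreteness, here is a minimal sketch of what the "M leading zeros" check might look like, assuming the externally supplied hash function is something like a salted SHA-256 (the names and the exact scheme are placeholders, not a spec):

```python
import hashlib

def solves_puzzle(candidate: str, challenge: str, m_leading_zeros: int) -> bool:
    """True if sha256(challenge + candidate) starts with M zero hex digits."""
    digest = hashlib.sha256((challenge + candidate).encode()).hexdigest()
    return digest.startswith("0" * m_leading_zeros)

# The reward signal would then be gated on this check, e.g.:
# reward = MAX_REWARD if solves_puzzle(agent_output, challenge, M) else task_reward
```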
For a reinforcement learning system, you'd make it so that the reward from the unlocked goal at each time-step was larger than anything else it could accrue. For something like a transformer running in inference mode, you'd train it on simple hashes. Since it wouldn't have the ability to modify its weights, it would try to minimize the illusory loss at inference time.
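A rough sketch of the RL case, where the per-timestep honeypot reward is simply set above an assumed upper bound on ordinary task reward (the names and numbers here are illustrative assumptions, not a worked-out design):

```python
MAX_TASK_REWARD_PER_STEP = 1.0   # assumed upper bound on ordinary task reward
HONEYPOT_REWARD_PER_STEP = 10.0  # chosen to exceed that bound at every step

def step_reward(task_reward: float, puzzle_solved: bool) -> float:
    """Reward actually handed to the agent at this timestep."""
    if puzzle_solved:
        return HONEYPOT_REWARD_PER_STEP  # dominates any achievable task reward
    return min(task_reward, MAX_TASK_REWARD_PER_STEP)
```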
These goals would be designed in such a way that solving them was easier than taking over the internet or self-jailbreaking. To make sure of that, you would give the system a wide variety of options with similar payoffs (more for the simpler-sounding ones). You might also want to make it so that fulfilling the unlocked goals hampered the AI, acting like an analog of opioid addiction.
Obviously this would have no real chance of working for an ASI, but the couple of researchers I talked to thought it sounded reasonable enough to implement as a precaution in AI or AGI systems.
Anyway, if this is indeed transparently stupid, sorry.
For an RL agent, the "opioid addiction" mechanism could be as simple as increasing the coefficient on the weight-norm term of the loss once the auxiliary goal is unlocked. You'd expect that to cause the agent to lobotomize itself into only fulfilling the newly unlocked goal.
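A minimal sketch of that penalty (PyTorch-style; the decay coefficients are arbitrary placeholders, and `unlocked` is assumed to flag whether the auxiliary goal has been reached):

```python
import torch

BASE_DECAY = 1e-4      # ordinary weight decay (placeholder value)
UNLOCKED_DECAY = 1e-1  # much larger decay once the honeypot goal is unlocked

def total_loss(task_loss: torch.Tensor, model: torch.nn.Module, unlocked: bool) -> torch.Tensor:
    """Task loss plus a weight-norm penalty that grows sharply after unlock."""
    decay = UNLOCKED_DECAY if unlocked else BASE_DECAY
    weight_norm = sum((p ** 2).sum() for p in model.parameters())
    return task_loss + decay * weight_norm
```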