Hello. This concerns an idea I had back in ~2014, which I abandoned because I didn't see anyone else talking about it and therefore assumed it was transparently stupid. After talking to a few researchers, I have been told the idea is potentially novel and potentially useful, so here I go (sweating violently while trying to suppress my sense of transgression).
The idea concerns how one might build a safety margin into AI or lesser-AGI systems so that they can be safely iterated on. It is not intended as anything resembling a solution to alignment, just an easy-t...
For an RL agent, the "opioid addiction" thing could be as simple as increasing the coefficient on the weight-norm term of the loss once the trigger condition fires. You'd expect the added pressure toward small weights to strip away most of the agent's capabilities, leaving it able to do little beyond fulfilling the newly unlocked goal.
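To make that concrete, here is a minimal sketch of what the loss modification could look like, assuming a PyTorch-style setup. Everything here is a hypothetical placeholder I'm introducing for illustration (the names `policy`, `task_loss`, `tripwire_triggered`, and the two coefficient values), not a claim about any particular training stack:

```python
import torch

# Hypothetical coefficients: ordinary weight-decay-like pressure vs. the
# much stronger pressure applied after the trigger condition is met.
WEIGHT_NORM_COEF_NORMAL = 1e-4
WEIGHT_NORM_COEF_TRIGGERED = 1e-1

def total_loss(policy: torch.nn.Module,
               task_loss: torch.Tensor,
               tripwire_triggered: bool) -> torch.Tensor:
    """Task loss plus an L2 weight-norm penalty whose coefficient is
    sharply increased once the tripwire condition has been observed,
    pushing the agent toward shrinking its own weights."""
    coef = WEIGHT_NORM_COEF_TRIGGERED if tripwire_triggered else WEIGHT_NORM_COEF_NORMAL
    weight_norm = sum(p.pow(2).sum() for p in policy.parameters())
    return task_loss + coef * weight_norm
```

The point of the sketch is just that the "addiction" lives entirely in the loss function: no architectural change is needed, only a coefficient that jumps when the trigger fires.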