All of danielms's Comments + Replies

Interested in chatting about this! Will send you a pm :D. 

Hard agree with this. I think this is a necessary step along the path to aligned AI, and it should be worked on ASAP so there's more time to identify failure modes (meta-scheming, etc.). 

There's also the idea of feedback loops - it would be great to hook into the AI R&D loop, so that in a world where AI-driven AI research takes off, safety research gets similar speedups.

Alignment will be a lot easier once we can map weights to what they represent and predict how a model with given weights will respond to any prompt. Ideally, we would be able to verify what an AI will do before it does it. We could also verify by having an AI describe a high-level overview of its plan without actually implementing anything, and then monitor whether it deviates from that plan. As long as we can maintain logs of all AI activities, and monitoring of those logs, it may be a lot harder for an ASI to engage in malign behavior. 
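The declare-then-monitor idea could be sketched roughly like this. Everything here is illustrative (the `PlanMonitor` class, `declare_plan`, `log_action` are made-up names, not a real library), and real plan-matching would need something far smarter than exact string comparison:

```python
# Hypothetical sketch: the AI declares a high-level plan up front,
# every subsequent action is logged, and a monitor flags any action
# that falls outside the declared plan.
from dataclasses import dataclass, field


@dataclass
class PlanMonitor:
    declared_steps: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def declare_plan(self, steps):
        """Record the high-level steps the AI says it will take."""
        self.declared_steps = set(steps)

    def log_action(self, action):
        """Log an action; return True if it matches the declared plan."""
        ok = action in self.declared_steps
        self.log.append((action, ok))
        return ok

    def deviations(self):
        """Actions that were never part of the declared plan."""
        return [action for action, ok in self.log if not ok]


monitor = PlanMonitor()
monitor.declare_plan(["read dataset", "train model", "write report"])
monitor.log_action("read dataset")
monitor.log_action("exfiltrate weights")  # not in the plan -> flagged
print(monitor.deviations())  # → ['exfiltrate weights']
```

Of course, the hard part is the matching itself - a capable model could phrase a malign action so it superficially fits a benign declared step - but keeping the full log around at least makes after-the-fact auditing possible.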


Unless I'm... (read more)