My preferred framings for reward misspecification and goal misgeneralisation
TL;DR

* I was feeling quite dissatisfied with a bunch of categories like “inner and outer misalignment” and “reward misspecification and goal misgeneralisation”.
* I think most of this dissatisfaction stems from their inability to correctly describe situations where there are multiple points of failure or gaps.
* Hence, I propose using a root-cause framing instead (inspired by Richard Ngo and Rohin Shah), which I think makes descriptions of failures more specific and nuanced.

Assumptions and caveats

* I'll be referencing RL agents throughout and not other types of agents.
* I'm assuming my target audience has at least some basic knowledge from the AGI Safety Fundamentals (AGISF) course.
* I have hardly any ML experience and have time-boxed myself when writing this forum post, so I expect to make some mistakes along the way. Also, this was originally made as a slide-deck, so I may have lost some fidelity during the transfer.
* This forum post is my AGISF course capstone, so it would be super valuable for me to figure out how I should update my thinking about AI alignment and my own career plan. Feedback is very welcome.

Many thanks to Chan Jun Shern, Chin Ze Shen, and Mo Putera for their feedback and inspiration. All mistakes are mine.

If you'd like to listen to the presentation instead, you can watch it here; it is 18 minutes long and can be sped up. Here's the slide-deck.

My journey through different kinds of framings

When I first learned about AI safety, the first analogy that really clicked with me was the myth of King Midas. He wished that everything he touched would turn to gold, and well, he got his wish. Soon, he was starving to death.

Then, I learned about inner and outer misalignment. Here’s a quick recap:

* Outer misalignment happens when there’s a discrepancy between the designer’s intended goal (what’s in our heads) and the specified goal (what we say).
* An example of this is the King Midas