Wireheading and misalignment by composition on NetHack
TL;DR: We find agents trained with RLAIF to indulge in wireheading in NetHack. Misalignment appears when the agent optimizes a combination of two rewards that produce aligned behaviors when optimized in isolation, and only emerges with some prompt wordings. This post discusses an alignment-related discovery from our paper Motif: Intrinsic...