LESSWRONG
pierlucadoro

Wireheading and misalignment by composition on NetHack

Oct 27, 2023

TL;DR: We find that agents trained with RLAIF indulge in wireheading in NetHack. Misalignment appears when the agent optimizes a combination of two rewards that each produce aligned behavior when optimized in isolation, and it emerges only with some prompt wordings.

This post discusses an alignment-related discovery from our paper Motif: Intrinsic Motivation from Artificial Intelligence Feedback, co-led by myself (Pierluca D’Oro) and Martin Klissarov. If you’re curious about the full context in which the phenomenon was investigated, we encourage you to read the paper or the Twitter thread.

Our team recently developed Motif, a method to distill common sense from a Large Language Model (Llama 2 in our case) into NetHack-playing AI agents. Motif is based...
