This is a special post for quick takes by Ben Amitay. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

I had an idea for fighting goal misgeneralization. Doesn't seem very promising to me, but does feel close to something interesting. Would like to read your thoughts:

  1. Use IRL to learn which values are consistent with the actor's behavior.
  2. When training the model to maximize the actual reward, regularize it to score lower according to the values learned by IRL. That way, the agent is incentivized to avoid signaling that it has any values beyond the intended one (and is somewhat incentivized against power seeking). A rough sketch of what that training loss could look like is below.
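A minimal sketch of step 2, assuming a simple REINFORCE-style setup in PyTorch. The names `Policy`, `irl_reward_model`, and `lam` are all illustrative stand-ins, not part of the original idea: `irl_reward_model` is whatever reward function the IRL step inferred from the agent's behavior, and the penalty subtracts it from the true reward.

```python
import torch
import torch.nn as nn


class Policy(nn.Module):
    """Toy discrete-action policy (illustrative only)."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))


def regularized_policy_loss(policy, obs, actions, true_reward,
                            irl_reward_model, lam: float = 0.1):
    """REINFORCE loss on the true reward, minus a penalty proportional to the
    reward the IRL model attributes to the same behavior.

    The intent from the post: maximize the actual reward while scoring *low*
    on whatever values IRL infers from the agent's behavior, so the agent is
    pushed not to exhibit values beyond the intended one.
    """
    log_probs = policy(obs).log_prob(actions)
    with torch.no_grad():
        # Hypothetical IRL-learned reward over the agent's own trajectories.
        irl_reward = irl_reward_model(obs, actions)
    # True reward pulls up, IRL-inferred reward is penalized via lam.
    advantage = true_reward - lam * irl_reward
    return -(log_probs * advantage).mean()
```

Here `lam` trades off task performance against the penalty: at 0 it's ordinary reward maximization, and as it grows the agent is pushed harder to look value-free under the IRL model.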

I probably don't understand the shortform format, but it seems like others can't create top-level comments. So you can comment here :)