The approach relies on identifying all the reward sub-spaces with this inversion property? That seems very difficult.

I don't think it's good enough to identify these spaces and place barriers in the reward function. (Analogy: SGD works perhaps because it's good at jumping over such barriers.) Presumably you're actually talking about something more analogous to a penalty that increases as the action in question gets closer to step 4 in all the examples, so that there is nothing to jump over.

Even that seems insufficient, because it seems like a reasoning sys...

All of lavalamp2's Comments + Replies