All of Amnonian's Comments + Replies

Interesting outlook!

I might be thinking of it too much in terms of classical Bayesianism, so let me know if my question is confused and doesn't make sense:

Is this just a way to count other agents' predictions as evidence, or is there also a difference in how a single agent predicts and behaves in a world with no other agents?

I'm familiar, in classical Bayesianism, with the fact that other agents' predictions can influence my own prediction. There's Aumann's agreement theorem and so on. Seeing another agent predict X will increase the probability I assign to X, at least a little.
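
For concreteness, here's a toy numerical sketch of the kind of update I have in mind (the prior and likelihoods below are invented purely for illustration):

```python
# Toy Bayes update: treating "another agent predicted X" as evidence for X.
# All numbers are made up for illustration.

prior_x = 0.5                  # my prior credence that X is true
p_predict_given_x = 0.8        # chance the other agent predicts X if X is true
p_predict_given_not_x = 0.4    # chance they predict X anyway if X is false

evidence = prior_x * p_predict_given_x + (1 - prior_x) * p_predict_given_not_x
posterior_x = prior_x * p_predict_given_x / evidence

print(posterior_x)  # ~0.67 > 0.5: their prediction raised my credence in X
```

The posterior (about 0.67) is higher than the prior (0.5), so the other agent's prediction functions as ordinary evidence.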

Is this framework expanding on that, or proposing something else?

AmnonianΩ34-2

I'm feeling confused.

It might just be my inexperience with reinforcement learning, but while I agree with what you say, I can't square it with my intuition of what an ML model does.

If our model uses some variant of gradient ascent, it will end up at high values of the reward function. (Not necessarily at a global or local maximum, but the attempt is to get it to some such maximum.) In that sense, the model does optimize for reward.

Is that a special attribute of gradient ascent that we shouldn't expect other models to have? Does that mean that gradient-ascent models are more dangerous? Or are you just noting that the model won't necessarily find the global maximum and may only reach some local maximum?
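
To make the local-maximum part of my question concrete, here's a toy sketch (the reward function and numbers are invented for illustration):

```python
# Toy illustration: gradient ascent on a reward curve with two peaks.
# Starting near the smaller peak, the ascent settles at the local maximum
# and never reaches the global one. Function and numbers are made up.

import math

def reward(x):
    # global max near x = 3 (height ~1.0), local max near x = -2 (height ~0.5)
    return math.exp(-(x - 3) ** 2) + 0.5 * math.exp(-(x + 2) ** 2)

def grad(x, eps=1e-5):
    # numerical derivative of the reward
    return (reward(x + eps) - reward(x - eps)) / (2 * eps)

x = -1.5          # initialized in the basin of the smaller peak
for _ in range(2000):
    x += 0.1 * grad(x)

print(x, reward(x))  # ends near x = -2, reward ~0.5: a local, not global, maximum
```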

3TurnTrout
Agreed.

Disagreed. Consider vanilla PG, which is as close as I know of to "doing gradient ascent in the reward landscape." Here, the RL training process optimizes the model in the direction of historically observed rewards. In such policy gradient methods, the model receives local cognitive updates (in the form of gradients) that increase the logits on actions judged to have produced reward (e.g. in vanilla PG, this is determined by "was the action part of a high-reward trajectory?"). The model is being optimized in the direction of previous rewards, given the collected data distribution (e.g. the agent put some trash away and observed some rewards), the given states, and its current parameterization.

This process might even find very high-reward policies. I expect it will. But that doesn't mean the model is optimizing for reward.
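
A minimal toy sketch of the kind of update I'm describing, in a one-step setting with made-up rewards (illustrative only, not the actual training code):

```python
# Minimal vanilla policy gradient (REINFORCE-style) sketch on a one-step "bandit".
# The point: the update just nudges up the logits of actions that appeared on
# high-reward trajectories. Environment, rewards, and hyperparameters are made up.

import math, random

rewards = [0.0, 1.0, 0.2]   # hidden reward for each of 3 actions
logits = [0.0, 0.0, 0.0]    # the policy's parameters
lr = 0.1

def probs(logits):
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(500):
    p = probs(logits)
    a = random.choices(range(3), weights=p)[0]  # sample an action (a "trajectory")
    r = rewards[a]                              # observe that trajectory's reward
    # vanilla PG update: logits += lr * r * grad(log pi(a)), where
    # grad(log pi(a)) for a softmax policy is onehot(a) - p
    for i in range(3):
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - p[i])

print(probs(logits))  # probability mass has shifted toward the high-reward action
```

Note that reward only ever enters as a scalar weight on the gradient of the log-probability of the sampled actions; the policy itself never computes or represents the reward, let alone pursues it.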
1zeshen
That was my takeaway as well, but I'm also somewhat confused.