I initially assumed that the concept of reward-modelling would be something most LessWrongers were very familiar with. After all, this is one of the best communities for conversing on the topic, and a large percentage of all posts here are about AI or doom.
However, my skeptical side quickly kicked in and I started to doubt this assumption, realizing it could be confirmation bias: I am personally highly invested in reward-modelling, and I tend to ignore information that has little to no relation to it. Additionally, I do not have access to any actual data on this, nor have I considered perspectives outside of my own.
Many of my beliefs about agency and reward-modelling are well modeled by the channel RobertMilesAI. How familiar is the community with the concepts expressed in this channel?
How aware is the community as a whole of the concept?
How interested is the community as a whole in the concept?
I would be very thankful for any replies, as I'm deeply invested in the concept of reward-modelling, and any outside perspectives on the topic are valuable to me.
I approximately see the context of your question, but I am not sure what exactly you are talking about. Could you please try to be less abstract and more ELI5, with specific examples of what you mean (and of the adjacent concepts that you don't mean)?
Is it about which forces direct an agent's attention in the short term? Like: a human would do X because we have an instinct to do X, or because of a previous experience that doing X leads to pleasure, either immediately or in the longer term; and avoid Y because of an innate aversion, or a previous experience that Y causes pain.
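(As a toy illustration of that pleasure/pain framing, here is a minimal value-learning loop. Everything in it is made up for the example: the action names, the reward numbers, the learning rate. It is not a claim about how humans or any actual AI work, just the bare "seek what felt good, avoid what felt bad" mechanism in code.)

```python
import random

# Toy agent: picks actions by estimated value, updated from experienced
# reward ("pleasure") or aversion ("pain"). All numbers are invented.
actions = ["X", "Y"]
value = {a: 0.0 for a in actions}   # learned estimates, start neutral
reward = {"X": 1.0, "Y": -1.0}      # doing X feels good, doing Y hurts

random.seed(0)
for step in range(200):
    # epsilon-greedy: mostly exploit current estimates, sometimes explore
    if random.random() < 0.1:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: value[act])
    # nudge the estimate toward the experienced reward (learning rate 0.1)
    value[a] += 0.1 * (reward[a] - value[a])

# After enough experience, the agent prefers X and avoids Y.
print(value["X"] > 0 > value["Y"])
```

The point of the sketch is only that "instinct vs. learned experience" collapses, in this toy, into where the `reward` table comes from: innate wiring would mean the table is fixed at birth, learned experience would mean the table itself gets updated.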
It seems to me that "genetics" is a different level of abstraction than "pleasure and pain". If I try to disentangle this, it seems to me that humans...
Do you want a similar analysis for LLMs? Do you want to attempt a general analysis, even for hypothetical AIs based on different principles?
Is the goal to know all the levels of "where we can intervene"? Something like: "we can train the AI, we can upvote or downvote its answers, we can directly edit its memory..."?
(I am not an expert on LLMs, so I can't tell you more than the previous paragraph contains. I am just trying to figure out what the thing you are interested in is. It seems to me that people already study the individual parts of that, but... are you looking for some kind of more general approach?)