I initially assumed that the concept of reward-modelling would be something most LessWrongers were very familiar with. After all, this is one of the best communities for conversing on the topic, and a large percentage of all posts here are about AI or doom.
However, my skeptical side quickly kicked in and I started to doubt this assumption, realizing it could be confirmation bias: I am personally highly invested in reward-modelling, and I tend to ignore information that has little to no relation to it. Additionally, I do not have access to any actual data on this, nor have I considered perspectives outside of my own.
Many of my beliefs about agency and reward-modelling are well modeled by the channel RobertMilesAI. How familiar is the community with the concepts expressed in this channel?
How aware is the community as a whole of the concept?
How interested is the community as a whole in the concept?
I would be very thankful for any replies, as I'm deeply invested in the concept of reward-modelling, and any outside perspectives on the topic are valuable to me.
I approximately see the context of your question, but I am not sure what exactly you are talking about. Could you please try to be less abstract and more ELI5, with specific examples of what you mean (and of the adjacent concepts that you don't mean)?
Is it about which forces direct an agent's attention in the short term? Like: a human would do X because we have an instinct to do X, or because of a previous experience that doing X leads to pleasure, either immediately or in the longer term; and avoid Y because of an innate aversion, or a previous experience that Y causes pain.
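(As a toy illustration of that pleasure/pain framing, here is a minimal value-learning loop. Everything in it is made up for the example: the action names, the reward numbers, the learning rate. It is not a claim about how humans or any actual AI work, just the bare "seek what felt good, avoid what felt bad" mechanism in code.)

```python
import random

# Toy agent: picks actions by estimated value, updated from experienced
# reward ("pleasure") or aversion ("pain"). All numbers are invented.
actions = ["X", "Y"]
value = {a: 0.0 for a in actions}   # learned estimates, start neutral
reward = {"X": 1.0, "Y": -1.0}      # doing X feels good, doing Y hurts

random.seed(0)
for step in range(200):
    # epsilon-greedy: mostly exploit current estimates, sometimes explore
    if random.random() < 0.1:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: value[act])
    # nudge the estimate toward the experienced reward (learning rate 0.1)
    value[a] += 0.1 * (reward[a] - value[a])

# After enough experience, the agent prefers X and avoids Y.
print(value["X"] > 0 > value["Y"])
```

The point of the sketch is only that "instinct vs. learned experience" collapses, in this toy, into where the `reward` table comes from: innate wiring would mean the table is fixed at birth, learned experience would mean the table itself gets updated.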
It seems to me that "genetics" is a different level of abstraction than "pleasure and pain". If I try to disentangle this, it seems to me that humans...
Do you want a similar analysis for LLMs? Do you want to attempt a general analysis, even for hypothetical AIs based on different principles?
Is the goal to know all the levels of "where we can intervene"? Something like: "we can train the AI, we can upvote or downvote its answers, we can directly edit its memory..."?
(I am not an expert on LLMs, so I can't tell you more than the previous paragraph contains. I am just trying to figure out what the thing you are interested in is. It seems to me that people already study the individual parts of that, but... are you looking for some kind of more general approach?)