
These are 6 sample titles I'm considering. Do any thoughts come to mind?

  1. AI-like reward functioning in humans. (Comprehensive model)
  2. Agency in humans
  3. Agency in humans | comprehensive model of why humans do what they do
  4. EA should focus less on AI alignment, more on human alignment
  5. EA's AI focus will be the end of us all.
  6. EA's AI alignment focus will be the end of us all. We should focus on human alignment instead

Thanks for this. Do you have any ideas about what terminology I should use if I mean models used to predict reward in human contexts?

I'd say that the 80/20 of the concept is how reward & punishment affect human behavior.

Is it about which forces?
I would say I'm referring to a combination of instinct, innate attraction/aversion, previous experience, decision-making, and attention, and to how these relate to each other in an everyday practical context.

Seems to me that "genetics"
I would say your disentanglement is right on the money. Rather than making an analysis for LLMs, I'm particularly interested in fleshing out the interrelations between concepts as they relate to the human brain.

Do you want a similar analysis for LLMs?
I mean it from a high-level agency perspective, as opposed to specific AI or machine-learning contexts.

Goal?
My goal is to learn more about what information LessWrongers use and are interested in, so that I can write a better post for the community.


Adjacent concepts

  • Self-discipline
  • Positive psychology
  • Systems & patterns thinking
  • Maybe reward-functions?

Thank you so much for the reply. You prevented me from making a pretty big mistake.

I'm defining reward-modelling as the manipulation of the direction of an agent's intelligence, from a goal-directed perspective.

So the reward-modelling of an AI might be the weights used, its training environment, mesa-optimization structure, inner-alignment structure, etc.

Or for a human, it might be genetics, pleasure, and pain.
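To make the concept concrete, here's a minimal sketch in Python. The class name RewardModelling and the levers field are hypothetical labels I'm making up for illustration, not established terminology:

```python
# A toy sketch of "reward-modelling": the set of levers that steer
# the direction of an agent's intelligence. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class RewardModelling:
    """The levers that shape what an agent's intelligence optimizes for."""
    agent_kind: str
    levers: list[str] = field(default_factory=list)


# For an AI, the levers might be its weights and training setup.
ai = RewardModelling(
    agent_kind="AI",
    levers=["weights", "training environment",
            "mesa-optimization structure", "inner-alignment structure"],
)

# For a human, the levers might be genetics, pleasure, and pain.
human = RewardModelling(
    agent_kind="human",
    levers=["genetics", "pleasure", "pain"],
)

print(ai.levers)
print(human.levers)
```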

Is there a better word I can use for this concept? Or maybe I should just make up a word?

This is actually my primary focus. I believe it can be done through a complicated process that targets human psychology, but to explain it simply:

- Spread satisfaction & end suffering.
- Spread rational decision-making.

To simplify further: if everyone were like us, and no one were on the chopping block should AGI not get created, then the incentive to create AGI ceases and we effectively secure decades for AI-safety efforts.

This is a post I made on the subject.

https://www.lesswrong.com/posts/GzMteAGbf8h5oWkow/breaking-beliefs-about-saving-the-world

Sounds like you're speaking from a set of fundamentally different beliefs than I'm used to. I've trained myself to write assuming that the audience is uninformed about the topic I'm writing about, but it sounds like you're writing from the perspective that the LW community is more informed than I can properly conceptualize. How can I gain more information on the flow of information in the LessWrong community? I assumed that any insights I've arrived at through my own thinking, & conclusions I've reached from various unconnected sources, would likely be insights specific to me, but maybe I'm wrong. But yeah, I agree with you: just wanting to write something does not sound like a good place to start if I want to add value to this community. I'll remember to only post when I believe I have valuable and unique insights to share.

Thanks for the advice. I see how the linked posts are a lot more specific than the one I made. I'll try making some posts confined to specific domains of psychology, maybe in a very detailed & rational structure. Then maybe I can link to those posts in a larger post where I use those understandings to make a claim about a vehicle for practical change in the real world. I'm not sure I'm capable of giving up on macro-directional efforts like attempts to improve humanity as a whole, but I'll try to change the way I structure my writing to be self-contained, with external links for supplemental information, as opposed to the entire post depending on a linked doc.

Thank you for the advice. I'll switch my writing style to be more objective, & I'll try to remember to avoid ineffective pandering/creative styles. I'll continue linking at the end of posts when necessary, but I'll try to make sure the post itself provides value to readers.

Thanks for including the link. I'll read through these and use the posts to further my understanding of the community.

Thank you for this comment. I view writing through a marketing lens, but I didn't realize that the people on LessWrong are this motivated by intellectual stimulation/learning. In retrospect it seems obvious, but nonetheless I'm glad to have learned from my mistakes. From now on I'll prioritize appealing to curiosity & supplying new information, with more concise references to context/background information. And I'll avoid the kind of emotionally targeted tone/structure that I used in my first post.

Thanks for the advice. I want to make better posts in the future, so I'll try to figure out how to improve.

- Should I not have begun by talking about background information & explaining my beliefs?
- Should I have assumed the audience had contextual awareness and gone right into talking about solutions?

- Or was the problem more along the lines of writing quality, tone, or style?
- What type of post do you like reading?
- Would it be alright if I asked for an example so that I could read it?


Also, you're right. Looking back, that post was the only one that received a lot of downvotes. I must've gained an inaccurate perception of reality because of a major mistake I made when I first published the post. And the feeling of a lack of concrete proposals was definitely a major fault on my part, since I initially didn't link the doc properly.

But do you think there was something else I could've done so that you would have been more interested in reading the linked doc? Maybe if I'd made it part of the same post? Or linked to a LessWrong post instead of a Google Doc?
