LESSWRONG

Emergence of utility-function-consistent stated preferences in LLMs might be an example (0.1 < p < 0.6), though going from reading stuff on utility functions to the kind of behavior revealed there requires more inferential steps than going from reading stuff on reward hacking to reward hacking.

Examples of self-fulfilling prophecies in AI alignment?

Answer by Chipmonk Mar 03, 2025 80

https://x.com/sama/status/1621621724507938816

Examples of self-fulfilling prophecies in AI alignment?

Answer by Chipmonk Mar 03, 2025 40

Situational Awareness and race dynamics? h/t Jan Kulveit @Jan_Kulveit

3 Xavi CF 13d

Situational Awareness probably caused Project Stargate to some extent. Getting the Republican party to take AI seriously enough to let them launch in the White House is no joke and less likely without the essay. It also started the website-essay meta which is part of why AI 2027, The Compendium, and Gradual Disempowerment all launched the way they did, so there are knock-on effects too.

Examples of self-fulfilling prophecies in AI alignment?

Answer by Chipmonk Mar 03, 2025 80

Training on Documents About Reward Hacking Induces Reward Hacking

Do clients need years of therapy, or can one conversation resolve the issue?

Chipmonk 2mo 40

> I don't feel like I learned anything new from the post.

This surprises me! Wait so-

The "How does one-shotting happen?" section didn't have anything interesting for you? (Have you seen stuff like this elsewhere?)
Did you already know one-shotting was possible?

2 niplav 2mo

"One-shotting is possible" is a live hypothesis that I got from various reports from meditation traditions. I do retract "I learned nothing from this post", the "How does one-shotting happen" section is interesting, and I'd like it to be more prominent. Thanks for poking, I hope I'll find the time to respond to your other comment too.

Do clients need years of therapy, or can one conversation resolve the issue?

Chipmonk 2mo 20

> since your bullet-point list in the beginning isn't detailed enough for anyone to try to replicate the method.

Wait I'm confused- this is not the purpose of the post

> Also notable is that you only have positive examples for your method

The purpose of this post is not advertisement. It's to discuss one-shots

> Especially, how would you be able to distinguish between your approach convincing your customers they were helped, instead of actually changing their behavior?

See above

Do clients need years of therapy, or can one conversation resolve the issue?

Chipmonk 2mo* 40

Would anyone like to help me edit a better version of this?

Do clients need years of therapy, or can one conversation resolve the issue?

Chipmonk 2mo 20

Oh I like "patients" ("clients"). I'll think about the rest, thanks. I'm just not sure how to write anything useful and legible without talking about my own experience and what I have the most data for?

Also I see the point of your last bullet where "my business" is the subject hm

Do clients need years of therapy, or can one conversation resolve the issue?

Chipmonk 2mo 20

any suggestions for how to talk about this stuff without having it read like an advertisement? i'm genuinely interested in the idea of one-shotting and legibilizing evidence that quick growth is possible

0 niplav 2mo

I gave your post to Claude and gave it the prompt:

"Dearest Claude, here's the text for a blogpost I've written for LessWrong. I've been told that 'it sounds a lot like an advertisement'. Can you give me feedback/suggestions for how to improve it for that particular audience? I don't want to do too much more research, but a bit of editing/stylistic choices."

(All of the following is my rephrasing/rethinking of Claude output plus some personal suggestions.)

One useful suggestion from the answer was to explain more about the method you've used to achieve this, since your bullet-point list in the beginning isn't detailed enough for anyone to try to replicate the method.

Also notable is that you only have positive examples for your method, which activates my filtered evidence detectors. Either make clear that you indeed did only have positive results, or name how many people you coached, for how long, and that they were all happy with what you provided.

Finally, some direct words from Claude that I just directly endorse: Especially, how would you be able to distinguish between your approach convincing your customers they were helped, instead of actually changing their behavior? That feels like the failure mode of most self-help techniques: they're "self-recommending".

5 ROM 2mo

Hey Chris! I have a few thoughts on this, though I have strong anti-advertising sentiments and might be overly sensitive to these things, so take it with a grain of salt.

The title sounds a little clickbaity. It's directed at the reader. The title "Do patients need years of therapy, or can one conversation resolve their issue?" is functionally identical, but feels less like an advert.

The opening reads somewhat like a common advert tactic: "I hated how business did [thing x] since it was bad for the customer, so I started my practice by doing [thing y], which is both more appealing to a potential customer and delivers better results!"

I think the advertising vibe might also come from the continued references to your personal practice / mentions of its successes:

* "So when I started my business, I made payment contingent on results:"
* "Our clients are often surprised at how we do things because it’s so different than the therapy or other coaching they’ve done before:"
* "Several of my clients have resolved lifelong issues like anxiety in one shot"
* "My business is expanding to help more people in deeper and more efficient ways."

Finally, it concludes with a link to where people can schedule a call with you.

Invest in ACX Grants projects!

Chipmonk 2mo 20

any updates on how this is going btw? (doing retroactive funding research)

Prizes for ML Safety Benchmark Ideas

Chipmonk 2mo 20

what came of this? (doing research on bounties, prizes, and retroactive funding rn)

MichaelDickens's Shortform

Chipmonk 2mo 20

~~fwiw, FABRIC was able to get funding in November 2024 (who knows if this date is correct though)~~

nvm this was an "exit grant" lmao

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Chipmonk 2mo 80

Now that this is "over", I'd be fascinated to see a post about what the fundraising process was like for you and what can be learned. Seems like a big L for retroactive funding for example

https://x.com/ohabryka/status/1882579367110586459

3 DaneelO 2mo

And related to that thread: how does one find out about how to donate when there is no fundraiser? I cannot find any info on the About page or FAQ page. If someone wants to donate in a couple of months when this post is not as visible, how will they find the donation link? I don’t know if adding a donation link to FAQ and About will make much of a difference in practice. I suspect it won’t since that depends on people more spontaneously realizing they want to donate. But it seems pretty relevant to the complaint raised in the tweet thread that people only donate when you do large dramatic calls for funding. I think it wouldn’t hurt to lessen the friction and make it easier to find out how to donate.

We probably won't just play status games with each other after AGI

Chipmonk 3mo 48

Aside: I'm surprised you're suggesting "people get validation --> people feel secure"? This does not at all seem like the causality to me (though I'm aware most people probably think like this).

Prediction: In the absence of radically improved psychotechnology, a significant fraction of people will always find a way to feel insecure.

2 Kaj_Sotala 3mo

Patterns of emotional security/insecurity are constantly updating (in both directions) through life, though some people's patterns are more resistant (in either direction) than those of others. (This is both my own personal experience and the empirical finding in the literature.)

In the insecure -> secure direction, positive experiences as an adult can help reconsolidate negative expectations and provide the kinds of experiences that naturally securely attached people already got earlier: (How can I become more secure?: A grounded theory of earning secure attachment; Olufowote, Fife & Whiting 2019)

That said, it's true that the stronger someone's insecure attachment is, the more resistant it is to updating through positive experiences: (Attachment Disturbances in Adults, p. 99-100)

Of course, one consideration is also that people with insecure attachment tend to bring various patterns into their relationships that make the other person more likely to respond negatively, making it harder to get the positive experiences that would update the attachment patterns. An AI with infinite patience and understanding that never got triggered would be different in this regard, so it might be able to provide corrective experiences for even some of the people who wouldn't normally be capable of changing when dealing with just humans.

I would guess/hope that most people's degree of emotional insecurity would be such that they would be able to find security with AIs (especially if the AIs also doubled as expert therapists), with only the most extremely insecure people (e.g. some of the ones who would qualify for a diagnosis of a personality disorder) needing novel psychotech. But of course I can only speculate at this point.

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk 3mo 20

hmm i suspect releasing these metrics could make my customers significantly more annoying. like, early adopters are fun and experimental. but if i make it seem not risky then i get risk-averse people who tend to be prickly

so maybe i will compile and release this data but i would need to figure out how to do it in a way that doesn't change the funnel

Increasing IQ is trivial

Chipmonk 3mo 40

Any updates on this?

Increasing IQ by 10 Points is Possible

Chipmonk 3mo 20

Any updates on this?

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Chipmonk 3mo 20

I wonder if you could set up a conditional donation? "I donate $X, minus if total donations exceed $3M"

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk 4mo 30

i like this thanks. might take a bit of time to put together but interested

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk 4mo 30

made some light edits because of this comment, thanks

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk 4mo 40

oh ok i might start doing that. knowing my calibration on that would be nice

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk 4mo 20

oh ok hm. i also don't want to be incentivized to not give easy-for-me help to people with low odds of success though

4 pandamonium 4mo

Disclaimer: I would not pay, or want to pay, that much money anyway, so I am not your intended audience.

I'd trust you more (and I would think members of the rationalist community would too) if you gave several metrics, even if some of them are not so good, with explanations. Right now, it seems you chose a metric so that it looks good. More metrics would take more time, but not much if you have the data easily available.

This would be my suggestion: you can provide three percentages (like when one provides three quantiles instead of just the mean of data values):

* the percentage of success among people you talked with for at least an hour
* the percentage among the people with reasonable chances of success (motivated + didn't bail + your expertise + spent at least X hours)
* the percentage among people with great chances of success.

These percentages, with precise information on what determines which category clients fall into and the percentage of people treated who fall into each category, would give a first sound idea of the success rate. Taking on low-success-rate people would not be a problem because their data is treated separately. It's only a problem if 90% of your clients are unlikely to be helped, but that would not be a good thing anyway.

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk 4mo 20

could you give a few examples?

also seems time-intensive hmmmm

also, i thought about it more and i really like the metric of "results generated per hour"

1 gw 4mo

I think you've already given several examples: It would already be informative if you put numbers on each of these questions (i.e. "how often does talking for 15 minutes accomplish something", "how many bounties have you taken on in/outside of your specialty", "what percent of your clients are 'unagentic and slow' (and what does this actually mean)"). Probably one could do much better by generating several metrics that one would expect to be most useful (or top N%tile useful) and share each of them.

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk 4mo 20

:D i really hope bounties catch on

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk 4mo 22

wow this is controversial (my own vote is +6)

wonder why

DirectedEvolution 4mo 118

I upvoted for the novelty of a rationalist trying a bounty based career. But also this halfway reads as an advertisement for your life coaching service. I wouldn’t want to see much more in that direction.

Shallow review of technical AI safety, 2024

Chipmonk 4mo 30

> boundaries / membranes
> One-sentence summary: Formalise one piece of morality: the causal separation between agents and their environment. See also Open Agency Architecture.
> Theory of change: Formalise (part of) morality/safety, solve outer alignment.

Chris Lakin here - this is a very old post, and "What does davidad want from «boundaries»?" should be the canonical link

Orienting to 3 year AGI timelines

Chipmonk 4mo 30

Why SPY over QQQ?

The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228)

Chipmonk 4mo 20

available on the website at least

Pay-on-results personal growth: first success

Chipmonk 4mo 20

Update: Bob has recorded a 6-month follow-up here.

Walking Sue

Chipmonk 4mo 10

Why was this post tagged as boundaries/membranes? I'm inclined to remove the tag.

1 Matthew McRedmond 4mo

I only skimmed that category but if I'm not mistaken the kind of systems I describe in the piece are special cases of times when the boundary between defining agents and one agent and another is unclear/pivotal/insightful etc.

Being Present is Not a Skill

Chipmonk 4mo 40

makes sense

Sorry for the downtime, looks like we got DDosd

Chipmonk 5mo 20

works!

Sorry for the downtime, looks like we got DDosd

Chipmonk 5mo 20

another weird bug is if i click the link i was just sent in my email, it brings me to a 403 Forbidden page (even though the URLs of this functional page and that 403 page look identical)

4 habryka 5mo

Should now be fixed. We've blocked traffic to basically all pages and been restoring them incrementally to make sure we don't go down again immediately. I just lifted the last of those blocks.

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Chipmonk 5mo 258

I've run two workshops at LightHaven and it's pretty unthinkable to run a workshop anywhere else in the Bay Area. Lightcone has really made it easy to run overnight events without setup

Hierarchical Agency: A Missing Piece in AI Alignment

Chipmonk 5mo 20

Yeah i'm confused about what to name it. we can always change it later i guess.

also let me know if you have any posts you want me to definitely tag for it that you think i might miss otherwise

3 Mateusz Bagiński 4mo

Compositional agency?

Hierarchical Agency: A Missing Piece in AI Alignment

Chipmonk 5mo 93

Do we have a LessWrong tag for "hierarchical agency" or "multi-scale alignment" or something? Should I make one?

2 Jan_Kulveit 5mo

I guess make one? Unclear if hierarchical agency is the true name

Hierarchical Agency: A Missing Piece in AI Alignment

Chipmonk 5mo 40

I just made a twitter list with accounts interested in hierarchical agency (or what i call "multi-scale alignment"). Lmk who should be added

Hierarchical Agency: A Missing Piece in AI Alignment

Chipmonk 5mo 40

Random but you might like this graphic I made representing hierarchical agency from my post today on a very similar idea. What would you change about it?

Hierarchical Agency: A Missing Piece in AI Alignment

Chipmonk 5mo 124

This was an impressive demonstration of Claude for interviews. Was this one take?

(Also what prompt did you use? I like how your Claude speaks.)

Jan_Kulveit 5mo 140

There was some selection of branches, and one pass of post-processing.

It was after ~30 pages of a different conversation about AI and LLM introspection, so I don't expect the prompt alone will elicit the "same Claude". The start of this conversation was:

Thanks! Now, I would like to switch to a slightly different topic: my AI safety oriented research on hierarchical agency. I would like you to role-play an inquisitive, curious interview partner, who aims to understand what I mean, and often tries to check understanding using paraphrasing, giving examples, and si...

Hierarchical Agency: A Missing Piece in AI Alignment

Chipmonk 5mo 40

I'm glad you wrote this! I've been wanting to tell others about ACS's research and finally have a good link

Locally optimal psychology

Chipmonk 5mo 20

Great question, thanks!

I think you're correct in pointing towards the existence of basically-all-downside genetic conditions, but I still think these are in the minority. Moreover, even most of those don't create a big issue on the object level, compared to how people might feel about the issue as a result.

This argument doesn't extend to conditions like Huntington's, but if a person is missing a pinky finger, most of the issues the person is going to face are related to social factors and their own emotions, not the physical aspect.

I also just say this from experience helping others.

Locally optimal psychology

Chipmonk 5mo 20

I did not say that depression is always a strategy for everyone.

4 Archimedes 5mo

I didn't mean to suggest that you did. My point is that there is a difference between "depression can be the result of a locally optimal strategy" and "depression is a locally optimal strategy". The latter doesn't even make sense to me semantically whereas the former seems more like what you are trying to communicate.

Which things were you surprised to learn are not metaphors?

Answer by Chipmonk Nov 21, 2024 80

I wrote about my own experience discovering “feelings in the body” here