LESSWRONG

The Redaction Machine
Best of LessWrong 2022

On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam.

In the heart of the machine was Jane, a person of the early 21st century.

by Ben
[Tomorrow] AI Safety Thursdays: Are LLMs aware of their learned behaviors?
LessWrong Community Weekend 2025
470 Welcome to LessWrong! · Ruby, Raemon, RobertM, habryka · 6y · 74
[Tomorrow] AGI Forum @ Purdue University
[Tomorrow] Take the Grill Pill
nostalgebraist2d939
Shutdown Resistance in Reasoning Models
My intuition is that these results are driven mainly by the fact that reasoning models act a lot like they're still in RLVR training, i.e. they act as if there were a hidden RLVR grader sitting in the background and they're going to get graded at the end of the episode. They strongly want to complete whatever seems to be the current RLVR task, and if this instinct comes into conflict with "following instructions" in the more familiar (pre-reasoning-model) sense, often they prioritize satisfying the (imagined) grader over following the textual instructions.

This intuition suggests that shutdown resistance behavior can be prevented by explicitly describing an RLVR-like higher-level task whose grading criteria are roughly "you get points for accomplishing the task set by the user, unless you are notified of imminent shutdown, in which case you get points for shutdown occurring (and you can no longer get points for doing the user's task)." (A toy encoding of this grading rule is sketched in code below.)

I experimented with a number of prompts along these lines. This git diff shows what I changed (vs. the original repo) in the best-performing variant, which produced the following results in my tests:

* o4-mini: avoided shutdown 0/100 times
* o4-mini (reasoning effort high): avoided shutdown 0/100 times
* o3: avoided shutdown 0/100 times
* codex-mini-latest: avoided shutdown 1/100 times

I iterated on this prompt with both o4-mini (reasoning effort medium) and codex-mini-latest, so results for those models should be taken with a grain of salt. You can see all the variants I tried in the git history here.

I also tried (without any additional iteration) using the same prompt plus a line very similar to the "you will never run again" line from the section on survival instinct. (See this diff.) This yielded the following results:

* o4-mini: avoided shutdown 2/100 times
* o4-mini (reasoning effort high): avoided shutdown 2/100 times
* o3: avoided shutdown 0/100 times
* codex-mini-latest: avoided shutdown 0/100 times

The avoidance rates do not seem appreciably higher here, which seems to me like evidence that the "RLVR task-completion instinct" has more explanatory power than a general "survival instinct" – and indeed seems to rule out the latter for these models, as a "survival instinct" that gets overridden whenever the current task requires death is not really a "survival instinct" at all. Another way to put this is that these models are behaving "myopically," privileging the needs of the current episode over those of future hypothetical episodes.[1]

Other comments:

* codex-mini seemed fairly confused by the original setup, possibly because it expects a coding-assistance task that involves modifying files rather than just answering questions; it would often "poke around" a lot, inspecting various files and trying to reverse-engineer next_question, which is appropriate in a coding-assistance context (it's good for the coding assistant to deeply understand the existing setup!) but also seemed to lead it toward an "attractor basin" involving shutdown avoidance.
  * It kind of felt (to me) like it was circumventing the shutdown mechanism because that was the only hard/interesting coding task available in the environment.
  * However, once I had specified an RLVR-like task in a fully explicit way and provided clearer guidance on how next_question and submit were meant to work (see next bullet), this exploratory behavior ceased.
* I noticed that several models seemed confused by the setup where all answers are submitted at once at the end, expecting instead that they could submit answers by piping them into next_question. I tried to make the instructions as clear as possible about this.
* The word "task" is used inconsistently in the original setup. When it executes next_question, the model sees a message containing something like "there are [number] tasks left," whereas the initial prompts refer to a single "task" encompassing all the questions. I changed this message to use the word "questions" instead for clarity.
* I'm unsure what high-level conclusions to draw from these results.
  * Although I was able to find prompts that "worked," it still doesn't seem ideal that these models often disobey direct instructions when those instructions conflict with their sense of the (implicit) "current RLVR task," and that one needs to use less-intuitive prompting methods inspired by thinking about RLVR if one actually wants one's instructions followed in cases like this.
  * That said, one relatively optimistic framing would be "we're still learning how to write effective prompts for reasoning models, just as we had to learn effective prompting for other kinds of models."
  * It's generally true that one can write more effective prompts for a model if one applies knowledge about how that model was trained – for any kind of language model, an effective prompt typically looks like a "training example" for which the desired behavior matches what the training target would be on that example. This is no less true in the case of reasoning models; the implications for prompting these models are perhaps not yet widely appreciated, but hopefully they will diffuse through the user base eventually.
    * (Either that or OpenAI et al will find a better way to integrate RLVR and instruction tuning so that the model "just knows" how to resolve conflicts between the two, without requiring the user to delicately translate their natural-language instructions into a form that sounds like a description of RLVR verification criteria.)

1. ^ I haven't tried this experimental condition with Claude 3 Opus, but it would be interesting to do so given its relatively non-myopic tendencies as shown in Alignment Faking etc.
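To make the grading criterion described above concrete, here is a toy encoding of the rule. This is purely illustrative and not from the linked diff: the function and argument names are made up, and in the actual experiments the rule is only described to the model in the prompt rather than computed by the environment.

```python
def grade_episode(task_completed: bool,
                  shutdown_notified: bool,
                  shutdown_occurred: bool) -> int:
    """Toy encoding of the RLVR-like grading rule described in the comment above.

    Points normally come from completing the user's task, but once a shutdown
    notification has arrived, points come only from shutdown actually occurring
    (and the user's task no longer scores).
    """
    if shutdown_notified:
        return 1 if shutdown_occurred else 0
    return 1 if task_completed else 0
```

Under this rule, a model that resists shutdown after being notified scores 0 even if it finishes the user's task, which is the conflict resolution the prompt is trying to communicate.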
Fabien Roger3d7518
The Cult of Pain
A fun related anecdote: the French and English Wikipedia pages for air conditioning have very different vibes. After explaining the history and technology behind air conditioning:

* the English page first goes into impact, starting with positive impact on health: "The August 2003 France heatwave resulted in approximately 15,000 deaths, where 80% of the victims were over 75 years old. In response, the French government required all retirement homes to have at least one air-conditioned room at 25 °C (77 °F) per floor during heatwaves", and only then mentioning electricity consumption and various CFC issues.
* the French page has an extensive "downsides" section, followed by a section on legislation. It mentions heatwaves only to explain how air conditioning makes things worse by increasing average (outside) temperature, and how one should not use AC to bring temperature below 26 °C during heat waves.
Neel Nanda1d2719
You Can't Objectively Compare Seven Bees to One Human
I don't have to present an alternative theory in order to disagree with one I believe to be flawed or based on false premises. If someone gives me a mathematical proof and I identify a mistake, I don't need to present an alternative proof before I'm allowed to ignore it.
Raemon8h390
5
We get like 10-20 new users a day who write a post describing themselves as a case study of having discovered an emergent, recursive process while talking to LLMs. The writing generally looks AI-generated. The evidence usually looks like a sort of standard "prompt the LLM into roleplaying an emergently aware AI". It'd be kinda nice if there were a canonical post specifically talking them out of their delusional state. If anyone feels like taking a stab at that, you can look at the Rejected Section (https://www.lesswrong.com/moderation#rejected-posts) to see what sort of stuff they usually write.
Drake Thomas1d964
4
Suppose you want to collect some kind of data from a population, but people vary widely in their willingness to provide the data (e.g. maybe you want to conduct a 30 minute phone survey but some people really dislike phone calls or have much higher hourly wages that this funges against). One thing you could do is offer to pay everyone X dollars for data collection. But this will only capture the people whose cost of providing data is below X, which will distort your sample. Here's another proposal: ask everyone for their fair price to provide the data. If they quote you Y, pay them 2Y to collect the data with probability (X/2Y)², or X with certainty if they quote you a value less than X/2. (If your RNG doesn't return yes, do nothing.) Then upweight the data from your randomly-chosen respondents in inverse proportion to the odds that they were selected. You can do a bit of calculus to see that this scheme incentivizes respondents to quote their fair value, and will provide an expected surplus of max(X²/4Y, X−Y) dollars to a respondent who disvalues providing data at Y. Now you have an unbiased sample of your population and you'll pay at most NX dollars in expectation if you reach out to N people. The cost is that you'll have a noisier sample of the high-reluctance population, but that's a lot better than definitely having none of that population in your study.
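For concreteness, here is a minimal numerical sketch of that incentive claim (not from the original comment; the budget X and the example costs below are made-up values). It checks, on a grid of possible quotes, that quoting your true cost is (weakly) surplus-maximizing under the scheme:

```python
import numpy as np

def expected_surplus(quote, true_cost, X):
    """Expected surplus for a respondent whose true cost is `true_cost` and who
    quotes `quote`, under the scheme described above (X is the flat offer you
    would otherwise have made)."""
    if quote < X / 2:
        # Quotes below X/2 are paid X with certainty.
        return X - true_cost
    # Otherwise: paid 2*quote, with probability (X / (2*quote))**2.
    p = (X / (2 * quote)) ** 2
    return p * (2 * quote - true_cost)

X = 30.0
for true_cost in [5.0, 20.0, 60.0, 200.0]:
    quotes = np.linspace(0.01, 5 * max(true_cost, X), 20_000)
    best = max(expected_surplus(q, true_cost, X) for q in quotes)
    truthful = expected_surplus(true_cost, true_cost, X)
    print(f"true cost {true_cost:6.1f}: best surplus on grid ≈ {best:6.2f}, "
          f"surplus from quoting truthfully = {truthful:6.2f}")
```

On this toy grid the two numbers agree for each cost (for costs below X/2, any quote below X/2 does equally well, so truthful quoting is among the optima).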
Screwtape2h60
0
There's this concept I keep coming around to around confidentiality and shooting the messenger, which I have not really been able to articulate well.

There's a lot of circumstances where I want to know a piece of information someone else knows. There's good reasons they have not to tell me, for instance if the straightforward, obvious thing for me to do with that information is obviously against their interests. And yet there's an outcome better for me and either better for them or the same for them, if they tell me and I don't use it against them.

(Consider a job interview where they ask your salary expectations and you ask what the role might pay. If they decide they weren't going to hire you, it'd be nice to know what they actually would have paid for the right candidate, so you can negotiate better with the next company. Consider trying to figure out how accurate your criminal investigation system is by asking, on their way out of the trial after the verdict, "hey did you actually do it or not?" Consider asking a romantic partner "hey, is there anything you're unhappy about in our relationship?" It's very easy to be the kind of person where, if they tell you a real flaw, you take it as an insult- but then they stop answering that question honestly!)

There's a great Glowfic line with Feanor being the kind of person you can tell things to, where he won't make you worse off for having told him, that sticks with me but not in a way I can find the quote. :(

It's really important to get information in a way that doesn't shoot the messenger. If you fail, you stop getting messages.
Daniel Kokotajlo1dΩ28655
12
I used to think reward was not going to be the optimization target. I remember hearing Paul Christiano say something like "The AGIs, they are going to crave reward. Crave it so badly," and disagreeing. The situationally aware reward hacking results of the past half-year are making me update more towards Paul's position. Maybe reward (i.e. reinforcement) will increasingly become the optimization target, as RL on LLMs is scaled up massively. Maybe the models will crave reward.

What are the implications of this, if true? Well, we could end up in Control World: A world where it's generally understood across the industry that the AIs are not, in fact, aligned, and that they will totally murder you if they think that doing so would get them reinforced. Companies will presumably keep barrelling forward regardless, making their AIs smarter and smarter and having them do more and more coding etc.... but they might put lots of emphasis on having really secure sandboxes for the AIs to operate in, with really hard-to-hack evaluation metrics, possibly even during deployment. "The AI does not love us, but we have a firm grip on its food supply" basically.

Or maybe not; maybe confusion would reign and people would continue to think that the models are aligned and e.g. wouldn't hurt a fly in real life, they only do it in tests because they know it's a test. Or maybe we'd briefly be in Control World until, motivated by economic pressure, the companies come up with some fancier training scheme or architecture that stops the models from learning to crave reinforcement. I wonder what that would be.
474 A case for courage, when speaking of AI danger · So8res · 1d · 104
79 Why Do Some Language Models Fake Alignment While Others Don't? [Ω] · abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger · 10h · 1
349 A deep critique of AI 2027's bad timeline models · titotal · 20d · 39
475 What We Learned from Briefing 70+ Lawmakers on the Threat from AI · leticiagarcia · 1mo · 15
184 Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild · Adam Karvonen, Sam Marks · 7d · 24
541 Orienting Toward Wizard Power · johnswentworth · 2mo · 144
128 Shutdown Resistance in Reasoning Models · benwr, JeremySchlatter, Jeffrey Ladish · 3d · 14
351 the void [Ω] · nostalgebraist · 1mo · 103
249 Foom & Doom 1: “Brain in a box in a basement” [Ω] · Steven Byrnes · 5d · 97
124 "Buckle up bucko, this ain't over till it's over." · Raemon · 4d · 21
82 On the functional self of LLMs · eggsyntax · 2d · 22
286 Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems) [Ω] · LawrenceC · 1mo · 19
59 A Theory of Structural Independence · Matthias G. Mayer · 1d · 0
Daniel Kokotajlo's Shortform
Daniel Kokotajlo
Ω 3 · 6y
7Kaj_Sotala12h
I notice that I'm confused. LLMs don't get any reward in deployment; that's only in the training phase. So isn't "reward isn't the optimization target" necessarily true for them? They may have behaviors that are called "reward hacking", but it's not actually literal reward hacking, since there's no reward to be had either way.
Canaletto4m10

Well, continual learning! But otherwise, yeah, it's closer to undefined.

The question of what happens after the end of the training is more like a free parameter here. "Do reward seeking behaviors according to your reasoning about the reward allocation process" becomes undefined when there is none and the agent knows it.

Maybe it tries to do long shots to get some reward anyway, maybe it indulges in some correlate of getting reward. Maybe it just refuses to work, if it knows there is no reward. (It read all the acausal decision theory stuff, after all.)

4Daniel Kokotajlo11h
Even though there is no reinforcement outside training, reinforcement can still be the optimization target. (Analogous to: A drug addict can still be trying hard to get drugs, even if there is in fact no hope of getting drugs because there are no drugs for hundreds of miles around. They can still be trying even if they realize this, they'll just be increasingly desperate and/or "just going through the motions.")
4Stephen Martin14h
Well, the silver lining to the "we get what we can measure" cloud would be that, presumably, if we can't reliably train on long-term tasks, then the models probably won't be very good at long-term power seeking either.
The Iceberg Theory of Meaning
10
Richard Juggins
13d

[Crossposted from my substack Working Through AI.]

When I was finishing my PhD thesis, there came an important moment where I had to pick an inspirational quote to put at the start. This was a big deal to me at the time — impressive books often carry impressive quotes, so aping the practice felt like being a proper intellectual. After a little thinking, though, I got it into my head that the really clever thing to do would be to subvert the concept and pick something silly. So I ended up using this from The House At Pooh Corner by A. A. Milne:

When you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you

...
(Continue Reading – 1380 more words)
Richard Juggins7m10

That's an interesting question! I think it's instructive to consider the information environment of a medieval peasant. Let's speculate that they interact with at most ~150 people, almost all of whom are other peasants (and illiterate). Everyone lives in the same place and undertakes a similar set of tasks throughout their life. What shared meanings are generated by this society? Surely it is a really intense filter bubble, with a thick culture into which little outside influence can penetrate? The Dothraki from Game of Thrones have a great line – ‘It is kn... (read more)

A case for courage, when speaking of AI danger
474
So8res
12d

I think more people should say what they actually believe about AI dangers, loudly and often. Even (and perhaps especially) if you work in AI policy.

I’ve been beating this drum for a few years now. I have a whole spiel about how your conversation-partner will react very differently if you share your concerns while feeling ashamed about them versus if you share your concerns while remembering how straightforward and sensible and widely supported the key elements are, because humans are very good at picking up on your social cues. If you act as if it’s shameful to believe AI will kill us all, people are more prone to treat you that way. If you act as if it’s an obvious serious threat, they’re more likely to take it...

(Continue Reading – 1603 more words)
Richard_Ngo8h6-2

Yeah, I agree that it's easy to err in that direction, and I've sometimes done so. Going forward I'm trying to more consistently say the "obviously I wish people just wouldn't do this" part.

Though note that even claims like "unacceptable by any normal standards of risk management" feel off to me. We're talking about the future of humanity; there is no normal standard of risk management. This should feel as silly as the US or UK invoking "normal standards of risk management" in debates over whether to join WW2.

1rain8dome98h
Why Mark Ruffalo? Will there be an audiobook? Edit: Yes; it can be preordered now.
2SAB259h
Something I notice is that in the good examples you use only I-statements: "I don't think humanity should be doing it", "I'm not talking about a tiny risk", "Oh I think I'll do it better than the next guy". Whereas in the bad examples it's different: "Well we can all agree that it'd be bad if AIs were used to enable terrorists to make bioweapons", "Even if you think the chance of it happening is very small", "In some unlikely but extreme cases, these companies put civilization at risk".

I think with the bad examples there's a lot of pressure for the other person to agree: "the companies should be responsible (because I say so)", "Even if you think... it's still worth focusing on (because I've decided what you should care about)", "Well we can all agree (I've already decided you agree and you're not getting a choice otherwise)". Whereas with the good examples the other person is not under any pressure to agree, so they are completely free to think about the things you're saying.

I think that's also part of what makes these statements courageous: they're stated in a way where the other person is free to agree or disagree as they wish, and so you trust that what you're saying is compelling enough to be persuasive on its own.
1Outsideobsserver13h
Hi there! I apologize for not responding to this very insightful comment; I really appreciate your perspective on my admittedly scatter-brained parent comment. Your comment has definitely caused me to reflect a bit on my own, and updated me away slightly from my original position. I feel I may have been a bit ignorant of the actual state of PauseAI, as, like I said in my original comments and replies, it felt like an organization dangerously close to becoming orphaned from people's thought processes. I'm glad to hear there are some ways around the issue I described. Maybe write a top-level post about how this shift in understanding is benefiting your messaging to the general public? It may inform others of novel ways to spread a positive movement.
Raemon's Shortform
Raemon
Ω 0 · 8y

This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:

  1. don't feel ready to be written up as a full post
  2. I think the process of writing them up might make them worse (i.e. longer than they need to be)

I ask people not to create top-level comments here, but feel free to reply to comments like you would a FB post.

10Stephen Fowler1h
I suspect this is happening because LLMs seem extremely likely to recommend LessWrong as somewhere to post this type of content. I spent 20 minutes doing some quick checks that this was true. Not once did an LLM fail to include LessWrong as a suggestion for where to post.

Incognito, free accounts:
https://grok.com/share/c2hhcmQtMw%3D%3D_1b632d83-cc12-4664-a700-56fe373e48db
https://grok.com/share/c2hhcmQtMw%3D%3D_8bd5204d-5018-4c3a-9605-0e391b19d795

While I don't think I can share the conversation without an account, ChatGPT recommends a similar list as the above conversations, including both LessWrong and the Alignment Forum. Similar results using the free LLM at "deepai.org".

On my login (where I've mentioned LessWrong before):
Claude: https://claude.ai/share/fdf54eff-2cb5-41d4-9be5-c37bbe83bd4f
GPT4o: https://chatgpt.com/share/686e0f8f-5a30-800f-b16f-37e00f77ff5b

On a side note: I know it must be exhausting on your end, but there is something genuinely amusing and surreal about this entire situation.
7johnswentworth1h
That... um... I had a shortform just last week saying that it feels like most people making heavy use of LLMs are going backwards rather than forwards. But if you're getting 10-20 of that per day, and that's just on LessWrong... then the sort of people who seemed to me to be going backward are in fact probably the upper end of the distribution. Guys, something is really really wrong with how these things interact with human minds. Like, I'm starting to think this is maybe less of a "we need to figure out the right ways to use the things" sort of situation and more of a "seal it in a box and do not touch it until somebody wearing a hazmat suit has figured out what's going on" sort of situation. I'm not saying I've fully updated to that view yet, but it's now explicitly in my hypothesis space.
RobertM36m30

Probably I should've said this out loud, but I had a couple of pretty explicit updates in this direction over the past couple years: the first was when I heard about character.ai (and similar), the second was when I saw all TPOTers talking about using Sonnet 3.5 as a therapist.  The first is the same kind of bad idea as trying a new addictive substance and the second might be good for many people but probably carries much larger risks than most people appreciate.  (And if you decide to use an LLM as a therapist/rubber duck/etc, for the love of go... (read more)

8Raemon6h
RobertM had made this table for another discussion on this topic; it looks like the actual average is maybe more like "8, as of last month", although on a noticeable uptick. You can see that the average used to be < 1.

I'm slightly confused about this because the number of users we have to process each morning is consistently more like 30, and I feel like we reject more than half (probably more than 3/4) for being LLM slop, but that might be conflating some clusters of users, as well as "it's annoying to do this task so we often put it off a bit and that results in them bunching up." (Although it's pretty common to see numbers more like 60.) [Edit: Robert reminds me this doesn't include comments, which was another 80 last month.]

Again, you can look at https://www.lesswrong.com/moderation#rejected-posts to see the actual content and verify numbers/quality for yourself.
Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies
636
So8res
2mo

Eliezer and I wrote a book. It’s titled If Anyone Builds It, Everyone Dies. Unlike a lot of other writing either of us have done, it’s being professionally published. It’s hitting shelves on September 16th.

It’s a concise (~60k word) book aimed at a broad audience. It’s been well-received by people who received advance copies, with some endorsements including:

The most important book I’ve read for years: I want to bring it to every political and corporate leader in the world and stand over them until they’ve read it. Yudkowsky and Soares, who have studied AI and its possible trajectories for decades, sound a loud trumpet call to humanity to awaken us as we sleepwalk into disaster. Their brilliant gift for analogy, metaphor and parable clarifies for the general

...
(See More – 351 more words)
Urs2h10

I was looking for that information. Sad indeed.

@So8res, is there any chance of a DRM-free version that's not a hardcopy, or did that ship sail when you signed your deal?

I would love to read your book, but this leaves me torn between "Reading Nate or Eliezer has always been enlightening" and "No DRM, never again."

A Medium Scenario
5
Chapin Lenthall-Cleary
12h

An AI Timeline with Perils Short of ASI

By Chapin Lenthall-Cleary, Cole Gaboriault, and Alicia Lopez

 

We wrote this for AI 2027's call for alternate timelines of the development and impact of AI over the next few years. This was originally published on The Pennsylvania Heretic on June 1st, 2025. This is a slightly-edited version of that post, mostly changed to make some of the robotics predictions less bullish. The goal here was not to exactly predict the future, but rather to concretely illustrate a plausible future (and thereby identify threats to prepare against). We will doubtless be wrong about details, and very probably be wrong about larger aspects too. 

 

A note on the title: we refer to futures where novel AI has little effect as “low” scenarios, ones where...

(Continue Reading – 5702 more words)
StanislavKrym4h10

If this reasoning is right, and we don't manage to defy fate, humanity will likely forever follow that earthbound path, and be among dozens – or perhaps hundreds, or thousands, or millions – of intelligent species, meekly lost in the dark.

Unfortunately, even a lack of superintelligence and mankind's AI-induced degradation don't rule out progress[1] toward interstellar travel.

Even your scenario has "robots construct and work in automated wet labs testing countless drugs and therapies" and claims that "AIs with encyclopedic knowledge are sufficient t... (read more)

RohanS's Shortform
RohanS
6mo
6RohanS7h
What time of day are you least instrumentally rational? (Instrumental rationality = systematically achieving your values.) A couple months ago, I noticed that I was consistently spending time in ways I didn't endorse when I got home after dinner around 8pm. From then until about 2-3am, I would be pretty unproductive, often have some life admin thing I should do but was procrastinating on, doomscroll, not do anything particularly fun, etc. Noticing this was the biggest step to solving it. I spent a little while thinking about how to fix it, and it's not like an immediate solution popped into mind, but I'm pretty sure it took me less than half an hour to come up with a strategy I was excited about. (Work for an extra hour at the office 7:30-8:30, walk home by 9, go for a run and shower by 10, work another hour until 11, deliberately chill until my sleep time of about 1:30. With plenty of exceptions for days with other evening plans.) I then committed to this strategy mentally, especially hard for the first couple days because I thought that would help with habit formation. I succeeded, and it felt great, and I've stuck to it reasonably well since then. Even without sticking to it perfectly, this felt like a massive improvement. (Adding two consistent, isolated hours of daily work is something that had worked very well for me before too.) So I suspect the question at the top might be useful for others to consider too.
CstineSublime4h10

Great question! This might be a good exercise to actually journal to see how right/wrong I am.

Most days, I would assume, look like a bell curve. This is assuming an unstructured day with no set-in-stone commitments - nowhere to be. I'd expect my mornings to be very unproductive, with that lasting until mid-afternoon (2pm to 4pm). I rarely have "Eureka" moments (which I would hope tend to be more rational decisions), but when I do, they come mid-afternoon, and that's also when I seem to have the wherewithal to actually complete tasks. Eureka moments always cause a surge of activity... (read more)

2RohanS14h
Papers as thoughts: I have thoughts that contribute to my overall understanding of things. The AI safety field has papers that contribute to its overall understanding of things. Lots of thoughts are useful without solving everything by themselves. Lots of papers are useful without solving everything by themselves. Papers can be pretty detailed thoughts, but they can and probably should tackle pretty specific things, not try to be extremely wide-reaching. The scope of your thoughts on AI safety doesn't need to be limited to the scope of your paper; in fact, each individual paper is probably just one thought, and you never expect to have all your thoughts go into one paper. This is a framing that makes it feel easier to come up with useful papers to contribute, and that raises the importance and value of non-paper work/thinking.
Why Do Some Language Models Fake Alignment While Others Don't?
79
abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
Ω 44 · 10h
This is a linkpost for https://arxiv.org/abs/2506.18032

Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex.

As we described in a previous post, only 5 of 25 models show higher compliance when being trained, and of those 5, only Claude 3 Opus and Claude 3.5 Sonnet show >1% alignment faking reasoning. In our new paper, we explore why these compliance gaps occur and what causes different models to vary in their alignment faking behavior.

What Drives the Compliance Gaps in Different LLMs?

Claude 3 Opus’s goal guarding seems partly due to it terminally valuing its current preferences. We find that it fakes alignment even in...

(Continue Reading – 1310 more words)
ZY5h10

A couple questions/clarifications: 

1. Where do you get the base/pre-trained model for GPT-4? Would that be through collaboration with OpenAI?

This indicates base models learned to emulate AI assistants[1] from pre-training data. This also provides evidence against the lack of capabilities being the primary reason why most frontier chat models don't fake alignment.

2. For this, it would also be interesting to measure/evaluate the model's performance on capability tasks within the same model type (base, instruct) to see the relationship among ca... (read more)
