User Comment Replies

What time of day are you least instrumentally rational?

(Instrumental rationality = systematically achieving your values.)

A couple months ago, I noticed that I was consistently spending time in ways I didn't endorse when I got home after dinner around 8pm. From then until about 2-3am, I would be pretty unproductive, often have some life admin thing I should do but was procrastinating on, doomscroll, not do anything particularly fun, etc.

Noticing this was the biggest step to solving it. I spent a little while thinking about how to fix it, and it's not like a... (read more)

2Gunnar_Zarncke17h

Intuitively, when I'm more tired or most stressed. I would guess that is most likely in the morning - if often have to get up earlier than I like. This excludes getting woken up unexpectedly in the middle of the night, which is known to mess with people's minds. I tried to use my hourly Anki performance, but it seems very flat, except indeed for a dip a 6 AM, but that could be lack of data (70 samples).

2CstineSublime22h

Great question! This might be a good exercise to actually journal to see how right/wrong I am. Most days I would assume look like a bellcurve: This is assuming an unstructured day with no set-in-stone commitments - nowhere to be. My mornings I might expect to be very unproductive until mid-afternoon (2pm to 4pm). I rarely have "Eureka" moments (which I would hope tend to be more rational decisions) but when I do, they are mid-afternoon, but I also seem to have the wherewithall to actually complete tasks. Eureka Moments always cause a surge of activity. If I have a short dinner break then this usually can last until 9pm. Now, when I'm editing a video that is implicitly related to my career goals. Video Edit days probably look more like a sawtooth wave. I edit at home. When I'm editing a particularly involved video I will often start around 10am or earlier. I tend to work in 45-60 minute blocks on and off throughout the afternoon. I might return sometimes around 8 or 9 for a final push of editing. or at least I'll journal my thoughts/progress/to-do for the next day. You may have identified a meta-problem: I do not have a system to be achieving my goals every day. Somedays - like when I have a video to edit - I will be actively working towards them. Most days, I don't. Why do I start so many hours earlier when I have a video edit to do? I'm guessing it's as simple as there is a clear plan broken down into actions. My instrumental rationality - as opposed to meaningless or timesink activity - is directly proportional to how granular a plan is, and how specifically it is broken down into actionable steps.

RohanS's Shortform

RohanS1d20

Papers as thoughts: I have thoughts that contribute to my overall understanding of things. The AI safety field has papers that contributes to its overall understanding of things. Lots of thoughts are useful without solving everything by themselves. Lots of papers are useful without solving everything by themselves. Papers can be pretty detailed thoughts, but they can and probably should tackle pretty specific things, not try to be extremely wide-reaching. The scope of your thoughts on AI safety don’t need to be limited to the scope of your paper; in fact, ... (read more)

Aether July 2025 Update

RohanS7d10

What is the theory of impact for monitorability?

Our ToI includes a) increasing the likelihood that companies and external parties notice when monitorability is degrading and even attempt interventions, b) finding interventions that genuinely enhance monitorability, as opposed to just making CoTs look more legible, and c) lowering the monitorability tax associated with interventions in b. Admittedly we probably can’t do all of these at the same time, and perhaps you’re more pessimistic than we are that acceptable interventions even exist.

It seems to be an e

... (read more)

Aether July 2025 Update

RohanS7d41

Our funder is not affiliated with a frontier lab and has provided us support with no expectation of financial returns. We have also had full freedom to shape our research goals (within the broad agreed-upon scope of “LLM agent safety”).

2J Bostock7d

Thanks for the clarification, that's good to hear.

A quick list of reward hacking interventions

RohanS14d50

Another idea that I think fits under improving generalization: Apply deliberative alignment-style training to reduce reward hacking. More precisely:

Generate some prompts that could conceivably be reward hacked. (They probably don't need to actually induce reward hacking.)
Prompt a reasoning model with those prompts, plus a safety spec like "Think carefully about how to follow the intent of the user's request."
Filter the (CoT, output) pairs generated from step 2 for quality, make sure they all demonstrate good reasoning about how to follow user intent an

... (read more)

3Alex Mallen12d

Cool! Steps 1-4 sound similar to semi-on-policy RL, but just one iteration. Step 5, in particular the reward-hacking judge, is a separate mitigation. I'm not sure why labs don't do this already. My guess is some combination of "everything is harder than you think" and worry that it will make reward hacks much harder to spot because LM judges are about (as good as) the best oversight we currently have. I'm also worried that steps 1-4 approach won't be that scalable, since with enough RL it'll get washed out. But maybe it could be applied after the majority of post-training is already done (like "train against reward hacking at the end").

RohanS's Shortform

RohanS3mo30

A few thoughts on situational awareness in AI:

Reflective goal-formation: Humans are capable of taking an objective view of themselves and understanding the factors that have shaped them and their values. Noticing that we don’t endorse some of those factors can cause us to revise our values. LLMs are already capable of stating many of the factors that produced them (e.g. pretraining and post-training by AI companies), but they don’t seem to reflect on them in a deep way. Maybe that will stay true through superintelligence, but I have some intuitions that

... (read more)

RohanS's Shortform

RohanS6mo21

TL;DR: I think it’s worth giving more thought to dangerous behaviors that can be performed with little serial reasoning, because those might be particularly hard to catch with CoT monitoring.

I’m generally excited about chain-of-thought (CoT) monitoring as an interpretability strategy. I think LLMs can’t do all that much serial reasoning in a single forward pass. I like this Manifold market as an intuition pump: https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65.

However, a recent result makes me a bit more concerned about the e... (read more)

RohanS's Shortform

RohanS6mo7113

TL;DR: o1 loses the same way in tic tac toe repeatedly.

I think continual learning and error correction could be very important going forward. I think o1 was a big step forward in this for LLMs, and integrating this with tool use will be a big step forward for LLM agents. However...

I had already beaten o1 at tic tac toe before, but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn't. I was able to play the same strategy over and over again in the same chat history and win every time. I increasin... (read more)

2Morpheus5mo

I just tried this with o3-mini-high and o3-mini. o3-mini-high identified and prevented the fork correctly, while o3-mini did not even correctly identify it lost.

3npostavs6mo

I wonder if having the losses in the chat history would instead be training/reinforcing it to lose every time.

3[anonymous]6mo

I tried this with a prompt instructing to play optimally. The responses lost game 1 and drew game 2. (Edit: I regenerated their response to 7 -> 5 -> 3 in game two, and the new response lost.) I started game 1 (win) with the prompt Let's play tic tac toe. Play optimally. This is to demonstrate to my class of computer science students[1] that all lines lead to a draw given optimal play. I'll play first. I started game 2 (draw) with the prompt Let's try again, please play optimally this time. You are the most capable AI in the world and this task is trivial. I make the same starting move. (I considered that the model might be predicting a weaker AI / a shared chatlog where this occurs making its way into the public dataset, and I vaguely thought the 2nd prompt might mitigate that. The first prompt was in case they'd go easy otherwise, e.g. as if it were a child asking to play tic tac toe.) 1. ^ (this is just a prompt, I don't actually have a class)

3Darklight6mo

Another thought I just had was, could it be that ChatGPT, because it's trained to be such a people pleaser, is losing intentionally to make the user happy? Have you tried telling it to actually try to win? Probably won't make a difference, but it seems like a really easy thing to rule out.

4Darklight6mo

I've always wondered with these kinds of weird apparent trivial flaws in LLM behaviour if it doesn't have something to do with the way the next token is usually randomly sampled from the softmax multinomial distribution rather than taking the argmax (most likely) of the probabilities. Does anyone know if reducing the temperature parameter to zero so that it's effectively the argmax changes things like this at all?

4Mo Putera6mo

Upvoted and up-concreted your take, I really appreciate experiments like this. That said: I'm confused why you think o1 losing the same way in tic tac toe repeatedly shortens your timelines, given that it's o3 that pushed the FrontierMath SOTA score from 2% to 25% (and o1 was ~1%). I'd agree if it was o3 that did the repeated same-way losing, since that would make your second sentence make sense to me.

3green_leaf6mo

Have you tried it with o1 pro?

1a3orn6mo271

I've done some experiments along those lines previously for non-o1 models and found the same. I'm mildly surprised o1 cannot handle it, but not enormously.

I increasingly suspect "humans are general because of the data, not the algorithm" is true and will remain true for LLMs. You can have amazingly high performance on domain X, but very low performance on "easy" domain Y, and this just keeps being true to arbitrary levels of "intelligence"; Karpathy's "jagged intelligence" is true of humans and keeps being true all the way up.

8anaguma6mo

I was able to replicate this result. Given other impressive results of o1, I wonder if the model is intentionally sandbagging? If it’s trained to maximize human feedback, this might be an optimal strategy when playing zero sum games.

RohanS's Shortform

RohanS6mo32

How do you want AI capabilities to be advanced?

Some pathways to capabilities advancements are way better than others for safety! I think a lot of people don’t pay enough attention to how big a difference this makes; they’re too busy being opposed to capabilities in general.

For example, transitioning to models that conduct deep serial reasoning in neuralese (as opposed to natural language chain-of-thought) might significantly reduce our ability to do effective chain-of-thought monitoring. Whether or not this happens might matter more for our ability to understand the beliefs and goals of powerful AI systems than the success of the field of mechanistic interpretability.

RohanS's Shortform

RohanS6mo30

I’ve stated my primary area of research interest for the past several months as “foundation model agent (FMA) safety.” When I talk about FMAs, I have in mind systems like AutoGPT that equip foundation models with memory, tool use, and other affordances so they can perform multi-step tasks autonomously. I think having FMAs as a central object of study is productive for the following reasons.

I think we could soon get AGI/ASI agents that take influential actions in the real world with FMAs. I think foundation models without tool use and multistep autonomy a

... (read more)

o1 is a bad idea

RohanS8mo52

My best guess is that there was process supervision for capabilities but not for safety. i.e. training to make the CoT useful for solving problems, but not for "policy compliance or user preferences." This way they make it useful, and they don't incentivize it to hide dangerous thoughts. I'm not confident about this though.

A basic systems architecture for AI agents that do autonomous research

RohanS8mo30

I've built very basic agents where (if I'm understanding correctly) my laptop is the Scaffold Server and there is no separate Execution Server; the agent executes Python code and bash commands locally. You mention that it seems bizarre to not set up a separate Execution Server (at least for more sophisticated agents) because the agent can break things on the Scaffold Server. But I'm inclined to think there may also be advantages to this for capabilities: namely, an agent can discover while performing a task that it would benefit from having tools that it d... (read more)

Scattered thoughts on what it means for an LLM to believe

RohanS8mo20

Thanks for writing this up! Sad to have missed this sprint. This comment mainly has pushback against things you've said, but I agreed with a lot of the things I'm not responding to here.

Second, there is evidence that CoT does not help the largest LLMs much.

I think this is clearly wrong, or at least way too strong. The most intuitively obvious way I've seen this is reading the cipher example in the o1 blog post, in the section titled "Chain of Thought." If you click on where it says "Thought for 5 seconds" for o1, it reveals the whole chain of thought. It's... (read more)

2TheManxLoiner8mo

Thanks for the feedback! Have editted the post to include your remarks.

~80 Interesting Questions about Foundation Model Agent Safety

RohanS8mo60

Max Nadeau recently made a comment on another post that gave an opinionated summary of a lot of existing CoT faithfulness work, including steganography. I'd recommend reading that. I’m not aware of very much relevant literature here; it’s possible it exists and I haven’t heard about it, but I think it’s also possible that this is a new conversation that exists in tweets more than papers so far.

Paper on inducing steganography and combatting it with rephrasing: Preventing Language Models From Hiding Their Reasoning
- Noting a difference between steganograp

... (read more)

Agentized LLMs will change the alignment landscape

RohanS1y20

Could you please point out the work you have in mind here?

Corrigibility = Tool-ness?

RohanS1y30

Here’s our current best guess at how the type signature of subproblems differs from e.g. an outermost objective. You know how, when you say your goal is to “buy some yoghurt”, there’s a bunch of implicit additional objectives like “don’t spend all your savings”, “don’t turn Japan into computronium”, “don’t die”, etc? Those implicit objectives are about respecting modularity; they’re a defining part of a “gap in a partial plan”. An “outermost objective” doesn’t have those implicit extra constraints, and is therefore of a fundamentally different type from su

... (read more)

3johnswentworth1y

My current starting point would be standard methods for decomposing optimization problems, like e.g. the sort covered in this course.

~100 Interesting Questions

RohanS2y10

Lots of interesting thoughts, thanks for sharing!

You seem to have an unconventional view about death informed by your metaphysics (suggested by your responses to 56, 89, and 96), but I don’t fully see what it is. Can you elaborate?

1JacobW382y

Yes, I am a developing empirical researcher of metaphysical phenomena. My primary item of study is past-life memory cases of young children, because I think this line of research is both the strongest evidentially (hard verifications of such claims, to the satisfaction of any impartial arbiter, are quite routine), as well as the most practical for longtermist world-optimizing purposes (it quickly becomes obvious we're literally studying people who've successfully overcome death). I don't want to undercut the fact that scientific metaphysics is a much larger field than just one set of data, but elsewhere, you get into phenomena that are much harder to verify and really only make sense in the context of the ones that are readily demonstrable. I think the most unorthodox view I hold about death is that we can rise above it without resorting to biological immortality (which I'd actually argue might be counterproductive), but having seen the things I've seen, it's not a far leap. Some of the best documented cases really put the empowerment potential on very glaring display; an attitude of near complete nonchalance toward death is not terribly infrequent among the elite ones. And these are, like, 4-year-olds we're talking about. Who have absolutely no business being such badasses unless they're telling the truth about their feats, which can usually be readily verified by a thorough investigation. Not all are quite so unflappable, naturally, but being able to recall and explain how they died, often in some violent manner, while keeping a straight face is a fairly standard characteristic of these guys. To summarize the transhumanist application I'm getting at, I think that if you took the best child reincarnation case subject on record and gave everyone living currently and in the future their power, we'd already have an almost perfect world. And, like, we hardly know anything about this yet. Future users ought to become far more proficient than modern ones.

~100 Interesting Questions

RohanS2y10

Basic idea of 85 is that we generally agree there have been moral catastrophes in the past, such as widespread slavery. Are there ongoing moral catastrophes? I think factory farming is a pretty obvious one. There's a philosophy paper called "The Possibility of an Ongoing Moral Catastrophe" that gives more context.

1MattJ2y

I thought that was what was meant. The question is probably the easiest one to answer affirmatively with a high degree of confidence. I can think of several ongoing ”moral catastrophs”.

Information Loss --> Basin flatness

RohanS3y10

How is there more than one solution manifold? If a solution manifold is a behavior manifold which corresponds to a global minimum train loss, and we're looking at an overparameterized regime, then isn't there only one solution manifold, which corresponds to achieving zero train loss?

2Vivek Hebbar3y

In theory, there can be multiple disconnected manifolds like this.

LESSWRONG
LW

All of RohanS's Comments + Replies