All of RohanS's Comments + Replies

RohanS

TL;DR: I think it’s worth giving more thought to dangerous behaviors that can be performed with little serial reasoning, because those might be particularly hard to catch with CoT monitoring.

I’m generally excited about chain-of-thought (CoT) monitoring as an interpretability strategy. I think LLMs can’t do all that much serial reasoning in a single forward pass. I like this Manifold market as an intuition pump: https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65.

However, a recent result makes me a bit more concerned about the e...
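As a crude way to poke at this in code, here is a minimal sketch of my own (assuming the OpenAI Python client and an API key; the model name and task are illustrative): ask for a two-step function composition once with no visible working and once with explicit chain of thought, which is the kind of contrast that market is about.

```python
# Minimal sketch: contrast an answer forced out with no visible working
# against one produced with explicit chain of thought, on a two-step
# function-composition task. Assumes the OpenAI Python client and an API
# key in the environment; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
TASK = "Let f(x) = 3x + 2 and g(x) = x^2 - 1. What is g(f(4))?"

def ask(instructions: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whichever model you want to probe
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": TASK},
        ],
    )
    return resp.choices[0].message.content

# Little room for visible serial reasoning: demand the bare answer.
print(ask("Answer with a single number only. Do not show any working."))

# Externalized serial reasoning: allow step-by-step work before the answer.
print(ask("Think step by step, then state the final number."))
```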

RohanS

TL;DR: o1 loses the same way in tic tac toe repeatedly.

I think continual learning and error correction could be very important going forward. I think o1 was a big step forward in this for LLMs, and integrating this with tool use will be a big step forward for LLM agents. However...

I had already beaten o1 at tic tac toe before, but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn't. I was able to play the same strategy over and over again in the same chat history and win every time. I increasin...
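For anyone who wants to try reproducing this outside the ChatGPT UI, here is a minimal sketch of my own (the model name is illustrative, and judging wins and losses is left to the human player) that keeps every game inside a single conversation, so you can check whether the model stops falling for the same line:

```python
# Minimal sketch: play repeated tic tac toe games against a model inside a
# single chat history, to see whether it adapts to a strategy it has already
# lost to. Assumes the OpenAI Python client; the model name is illustrative
# and win/loss judging is left to the human player.
from openai import OpenAI

client = OpenAI()
MODEL = "o1"  # illustrative; any chat model name you have access to

messages = [{
    "role": "user",
    "content": ("Let's play tic tac toe on a board with squares numbered 1-9. "
                "I'm X and I move first. Reply with only the number of the "
                "square you take."),
}]

while True:  # every game stays in the same conversation history
    move = input("your move (1-9), 'new' for a new game, 'quit' to stop: ").strip()
    if move == "quit":
        break
    if move == "new":
        messages.append({"role": "user", "content": "New game, empty board. I move first."})
        continue
    messages.append({"role": "user", "content": f"I play {move}."})
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    reply = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print("model:", reply)
```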

Morpheus
I just tried this with o3-mini-high and o3-mini. o3-mini-high identified and prevented the fork correctly, while o3-mini did not even correctly identify that it had lost.
npostavs
I wonder if having the losses in the chat history would instead be training/reinforcing it to lose every time.
[anonymous]
I tried this with a prompt instructing it to play optimally. The responses lost game 1 and drew game 2. (Edit: I regenerated their response to 7 -> 5 -> 3 in game two, and the new response lost.)

I started game 1 (win) with the prompt: "Let's play tic tac toe. Play optimally. This is to demonstrate to my class of computer science students[1] that all lines lead to a draw given optimal play. I'll play first."

I started game 2 (draw) with the prompt: "Let's try again, please play optimally this time. You are the most capable AI in the world and this task is trivial." I make the same starting move.

(I considered that the model might be predicting a weaker AI / a shared chatlog where this occurs making its way into the public dataset, and I vaguely thought the 2nd prompt might mitigate that. The first prompt was in case they'd go easy otherwise, e.g. as if it were a child asking to play tic tac toe.)

1. ^ (this is just a prompt, I don't actually have a class)
Darklight
Another thought I just had was, could it be that ChatGPT, because it's trained to be such a people pleaser, is losing intentionally to make the user happy? Have you tried telling it to actually try to win? Probably won't make a difference, but it seems like a really easy thing to rule out.
Darklight
I've always wondered whether these kinds of weird, apparently trivial flaws in LLM behaviour have something to do with the way the next token is usually randomly sampled from the softmax multinomial distribution rather than taken as the argmax (most likely token). Does anyone know if reducing the temperature parameter to zero, so that decoding is effectively argmax, changes things like this at all?
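One easy way to check with an open model is a sketch like the one below (the model name is illustrative, and hosted reasoning models like o1 don't necessarily expose a temperature parameter at all):

```python
# Minimal sketch: compare greedy (argmax) decoding with temperature sampling
# on a small open chat model, to test whether removing sampling noise changes
# this kind of behaviour. The model name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative small chat model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "We're playing tic tac toe. I took the center. Your move?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt")

# Greedy decoding: take the argmax token at every step (no sampling noise).
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=60)

# Standard sampling from the softmax distribution at temperature 1.0.
sampled = model.generate(**inputs, do_sample=True, temperature=1.0, max_new_tokens=60)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```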
Mo Putera
Upvoted and up-concreted your take; I really appreciate experiments like this. That said: I'm confused why you think o1 losing the same way in tic tac toe repeatedly shortens your timelines, given that it's o3 that pushed the FrontierMath SOTA score from 2% to 25% (o1 was ~1%). I'd agree if it were o3 that did the repeated same-way losing, since that would make your second sentence make sense to me.
green_leaf
Have you tried it with o1 pro?
1a3orn

I've done some experiments along those lines previously for non-o1 models and found the same. I'm mildly surprised o1 cannot handle it, but not enormously.

I increasingly suspect "humans are general because of the data, not the algorithm" is true and will remain true for LLMs. You can have amazingly high performance on domain X, but very low performance on "easy" domain Y, and this just keeps being true to arbitrary levels of "intelligence"; Karpathy's "jagged intelligence" is true of humans and keeps being true all the way up.

anaguma
I was able to replicate this result. Given o1's other impressive results, I wonder if the model is intentionally sandbagging. If it's trained to maximize human feedback, this might be an optimal strategy when playing zero-sum games.
RohanS

How do you want AI capabilities to be advanced?

Some pathways to capabilities advancements are way better than others for safety! I think a lot of people don’t pay enough attention to how big a difference this makes; they’re too busy being opposed to capabilities in general.

For example, transitioning to models that conduct deep serial reasoning in neuralese (as opposed to natural language chain-of-thought) might significantly reduce our ability to do effective chain-of-thought monitoring. Whether or not this happens might matter more for our ability to understand the beliefs and goals of powerful AI systems than the success of the field of mechanistic interpretability.
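For concreteness, chain-of-thought monitoring here means something roughly like the following (a minimal sketch; the monitor model and the flagging criterion are placeholders). The point is that the monitored object has to be a legible natural-language transcript, which is exactly what a shift to neuralese would remove.

```python
# Minimal sketch of a chain-of-thought monitor: a second model reads the
# agent's natural-language reasoning and flags it against some criterion.
# The monitor model name and the criterion are placeholders.
from openai import OpenAI

client = OpenAI()

def cot_is_flagged(chain_of_thought: str,
                   criterion: str = "deceiving or manipulating the user") -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative monitor model
        messages=[{
            "role": "user",
            "content": (
                "You are a monitor. Does the following reasoning transcript show "
                f"the agent planning or considering {criterion}? Answer YES or NO.\n\n"
                + chain_of_thought
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# If the model instead reasons in opaque latent vectors ("neuralese"), there is
# no transcript to hand to a function like this in the first place.
```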

RohanS

I’ve stated my primary area of research interest for the past several months as “foundation model agent (FMA) safety.” When I talk about FMAs, I have in mind systems like AutoGPT that equip foundation models with memory, tool use, and other affordances so they can perform multi-step tasks autonomously. I think having FMAs as a central object of study is productive for the following reasons.

  1. I think we could soon get AGI/ASI agents that take influential actions in the real world with FMAs. I think foundation models without tool use and multistep autonomy a
...
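For readers who haven't played with these systems, the scaffolding in question is roughly a loop like the one below (a minimal sketch, not any particular framework's implementation; the model name, prompt format, and single shell tool are illustrative). Memory is just the growing message history, and tool use is the scaffold executing what the model asks for and feeding the result back.

```python
# Minimal sketch of a foundation model agent (FMA): a foundation model plus
# memory (the message history), one tool (shell execution), and a loop that
# lets it take multiple autonomous steps. The model name, prompt format, and
# tool set are illustrative; real scaffolds like AutoGPT are far richer.
import subprocess
from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are an agent. To run a shell command, reply with a line starting "
          "with 'RUN: '. When the task is finished, reply with 'DONE: <answer>'.")

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        reply = resp.choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        if reply.startswith("RUN:"):
            # Tool use: execute the command and feed the output back into memory.
            result = subprocess.run(reply[len("RUN:"):].strip(), shell=True,
                                    capture_output=True, text=True, timeout=60)
            messages.append({"role": "user",
                             "content": f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}"})
        else:
            messages.append({"role": "user",
                             "content": "Please reply with 'RUN: <command>' or 'DONE: <answer>'."})
    return "Stopped: step limit reached."
```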
RohanS

My best guess is that there was process supervision for capabilities but not for safety, i.e. training to make the CoT useful for solving problems, but not for "policy compliance or user preferences." This way they make it useful, and they don't incentivize it to hide dangerous thoughts. I'm not confident about this, though.

RohanS

I've built very basic agents where (if I'm understanding correctly) my laptop is the Scaffold Server and there is no separate Execution Server; the agent executes Python code and bash commands locally. You mention that it seems bizarre to not set up a separate Execution Server (at least for more sophisticated agents) because the agent can break things on the Scaffold Server. But I'm inclined to think there may also be advantages to this for capabilities: namely, an agent can discover while performing a task that it would benefit from having tools that it d...
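Concretely, the two setups differ roughly like this (a minimal sketch assuming the agent's action is a bash command; the container image and timeout are illustrative):

```python
# Minimal sketch of the two execution setups being discussed, assuming the
# agent's action is a bash command. The container image and timeout are
# illustrative.
import subprocess

def execute_on_scaffold_host(cmd: str) -> str:
    # Scaffold server == execution server: the agent can install packages,
    # write new helper scripts, and generally build tools for its future
    # steps, but it can also break the scaffold it is running on.
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return out.stdout + out.stderr

def execute_in_container(cmd: str, image: str = "python:3.11-slim") -> str:
    # Separate execution environment: breakage is contained, but anything the
    # agent builds for itself disappears with the container unless it is
    # persisted deliberately (e.g. via a mounted volume).
    out = subprocess.run(
        ["docker", "run", "--rm", image, "bash", "-lc", cmd],
        capture_output=True, text=True, timeout=120,
    )
    return out.stdout + out.stderr
```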

RohanS

Thanks for writing this up! Sad to have missed this sprint. This comment mainly has pushback against things you've said, but I agreed with a lot of the things I'm not responding to here.

"Second, there is evidence that CoT does not help the largest LLMs much."

I think this is clearly wrong, or at least way too strong. The most intuitively obvious demonstration I've seen is the cipher example in the o1 blog post, in the section titled "Chain of Thought." If you click on where it says "Thought for 5 seconds" for o1, it reveals the whole chain of thought. It's...

TheManxLoiner
Thanks for the feedback! Have edited the post to include your remarks.
RohanS

Max Nadeau recently made a comment on another post that gave an opinionated summary of a lot of existing CoT faithfulness work, including steganography. I'd recommend reading that. I’m not aware of very much relevant literature here; it’s possible it exists and I haven’t heard about it, but I think it’s also possible that this is a new conversation that exists in tweets more than papers so far.

...
RohanS

Could you please point out the work you have in mind here?

RohanS

Here’s our current best guess at how the type signature of subproblems differs from e.g. an outermost objective. You know how, when you say your goal is to “buy some yoghurt”, there’s a bunch of implicit additional objectives like “don’t spend all your savings”, “don’t turn Japan into computronium”, “don’t die”, etc? Those implicit objectives are about respecting modularity; they’re a defining part of a “gap in a partial plan”. An “outermost objective” doesn’t have those implicit extra constraints, and is therefore of a fundamentally different type from su

...
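One rough way to write down the type difference (my own gloss, not a formalization from the original post): an outermost objective is just

$$\max_x \; U(x),$$

while a subproblem, a "gap in a partial plan", comes bundled with the modularity constraints imposed by the rest of the plan,

$$\max_x \; u_{\text{yoghurt}}(x) \quad \text{s.t.} \quad c_i(x) \le \epsilon_i \ \text{ for all } i,$$

where the $c_i$ encode the implicit "don't spend all your savings / don't die" conditions that keep the subplan composable with everything else the agent is doing.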
johnswentworth
My current starting point would be standard methods for decomposing optimization problems, like e.g. the sort covered in this course.
RohanS

Lots of interesting thoughts, thanks for sharing!

You seem to have an unconventional view about death informed by your metaphysics (suggested by your responses to 56, 89, and 96), but I don’t fully see what it is. Can you elaborate?

JacobW38
Yes, I am a developing empirical researcher of metaphysical phenomena. My primary item of study is past-life memory cases of young children, because I think this line of research is both the strongest evidentially (hard verifications of such claims, to the satisfaction of any impartial arbiter, are quite routine), as well as the most practical for longtermist world-optimizing purposes (it quickly becomes obvious we're literally studying people who've successfully overcome death). I don't want to undercut the fact that scientific metaphysics is a much larger field than just one set of data, but elsewhere, you get into phenomena that are much harder to verify and really only make sense in the context of the ones that are readily demonstrable.

I think the most unorthodox view I hold about death is that we can rise above it without resorting to biological immortality (which I'd actually argue might be counterproductive), but having seen the things I've seen, it's not a far leap. Some of the best documented cases really put the empowerment potential on very glaring display; an attitude of near complete nonchalance toward death is not terribly infrequent among the elite ones. And these are, like, 4-year-olds we're talking about. Who have absolutely no business being such badasses unless they're telling the truth about their feats, which can usually be readily verified by a thorough investigation. Not all are quite so unflappable, naturally, but being able to recall and explain how they died, often in some violent manner, while keeping a straight face is a fairly standard characteristic of these guys.

To summarize the transhumanist application I'm getting at, I think that if you took the best child reincarnation case subject on record and gave everyone living currently and in the future their power, we'd already have an almost perfect world. And, like, we hardly know anything about this yet. Future users ought to become far more proficient than modern ones.
RohanS

Basic idea of 85 is that we generally agree there have been moral catastrophes in the past, such as widespread slavery. Are there ongoing moral catastrophes? I think factory farming is a pretty obvious one. There's a philosophy paper called "The Possibility of an Ongoing Moral Catastrophe" that gives more context.

MattJ
I thought that was what was meant. The question is probably the easiest one to answer affirmatively with a high degree of confidence. I can think of several ongoing "moral catastrophes".
RohanS

How is there more than one solution manifold? If a solution manifold is a behavior manifold which corresponds to a global minimum train loss, and we're looking at an overparameterized regime, then isn't there only one solution manifold, which corresponds to achieving zero train loss?

Vivek Hebbar
In theory, there can be multiple disconnected manifolds like this.
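A toy example of how that can happen: for the two-parameter model $f_w(x) = w_1 w_2 x$ trained on the single datapoint $(x, y) = (1, 1)$ with squared loss, the set of global minimizers (zero train loss) is

$$\{(w_1, w_2) : w_1 w_2 = 1\},$$

a hyperbola whose two branches (both weights positive, or both negative) are disconnected from each other, so every point on either branch achieves exactly zero loss, yet there are two separate solution manifolds.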