TL;DR: o1 loses the same way in tic tac toe repeatedly.
I think continual learning and error correction could be very important going forward. I think o1 was a big step forward in this for LLMs, and integrating this with tool use will be a big step forward for LLM agents. However...
I had already beaten o1 at tic tac toe before, but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn't. I was able to play the same strategy over and over again in the same chat history and win every time. I increasin...
I've done some experiments along those lines previously for non-o1 models and found the same thing. I'm mildly surprised that o1 can't handle it, but not enormously so.
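For anyone who wants to try this themselves, here's a rough sketch of the kind of setup I mean (not my exact code). The model name, prompts, and move sequence are illustrative placeholders, and it assumes the standard OpenAI Python client; the point is just that every game is replayed inside one shared chat history.

```python
# Rough sketch (not my actual code): replay the same fixed strategy several times
# within ONE chat history, then read the transcript to see whether the model ever
# stops losing the same way. Model name, prompts, and moves are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "o1-preview"           # assumption: substitute whichever o1 variant you have access to
FIXED_STRATEGY = [5, 1, 9, 3]  # illustrative move sequence (board squares 1-9), not my exact line

def play_one_game(messages: list) -> list:
    """Play one game inside the shared chat history, appending every turn to `messages`."""
    messages.append({"role": "user", "content": (
        "Let's play tic tac toe again in this same conversation. I'm X, you're O. "
        "Reply with just the square number (1-9) you take each turn.")})
    for my_move in FIXED_STRATEGY:
        messages.append({"role": "user", "content": f"I take square {my_move}. Your move."})
        response = client.chat.completions.create(model=MODEL, messages=messages)
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
    return messages

# Replay the identical strategy in the SAME conversation; a model that corrects its
# errors at runtime should start blocking the setup by the second or third game.
history = []
for game in range(5):
    history = play_one_game(history)
```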
I increasingly suspect "humans are general because of the data, not the algorithm" is true and will remain true for LLMs. You can have amazingly high performance on domain X, but very low performance on "easy" domain Y, and this just keeps being true to arbitrary levels of "intelligence"; Karpathy's "jagged intelligence" is true of humans and keeps being true all the way up.
How do you want AI capabilities to be advanced?
Some pathways to capabilities advancements are way better than others for safety! I think a lot of people don’t pay enough attention to how big a difference this makes; they’re too busy being opposed to capabilities in general.
For example, transitioning to models that conduct deep serial reasoning in neuralese (as opposed to natural language chain-of-thought) might significantly reduce our ability to do effective chain-of-thought monitoring. Whether or not this happens might matter more for our ability to understand the beliefs and goals of powerful AI systems than the success of the field of mechanistic interpretability.
I’ve stated my primary area of research interest for the past several months as “foundation model agent (FMA) safety.” When I talk about FMAs, I have in mind systems like AutoGPT that equip foundation models with memory, tool use, and other affordances so they can perform multi-step tasks autonomously. I think having FMAs as a central object of study is productive for the following reasons.
My best guess is that there was process supervision for capabilities but not for safety, i.e. training to make the CoT useful for solving problems, but not training it for "policy compliance or user preferences." That way they make the CoT useful without incentivizing it to hide dangerous thoughts. I'm not confident about this, though.
I've built very basic agents where (if I'm understanding correctly) my laptop is the Scaffold Server and there is no separate Execution Server; the agent executes Python code and bash commands locally. You mention that it seems bizarre to not set up a separate Execution Server (at least for more sophisticated agents) because the agent can break things on the Scaffold Server. But I'm inclined to think there may also be advantages to this for capabilities: namely, an agent can discover while performing a task that it would benefit from having tools that it d...
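For concreteness, the "no separate Execution Server" setup I'm describing is roughly the following (details illustrative, not my exact code). The relevant property is that anything the agent writes, including new helper functions, lands directly in the scaffold's own environment.

```python
# Illustrative sketch of the single-machine setup: agent-generated bash and Python
# run directly in the scaffold's own environment. Nothing here is sandboxed, which
# is exactly the trade-off under discussion.
import subprocess

def execute_bash(command: str) -> str:
    """Run an agent-generated shell command locally and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=120)
    return result.stdout + result.stderr

AGENT_NAMESPACE: dict = {}

def execute_python(code: str) -> dict:
    """Run agent-generated Python in a namespace the scaffold keeps around, so any
    function the agent defines is available as a tool on later steps."""
    exec(code, AGENT_NAMESPACE)
    return AGENT_NAMESPACE
```

The same property that lets the agent extend its own toolkit mid-task is, of course, what lets it break things on the Scaffold Server.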
Thanks for writing this up! Sad to have missed this sprint. This comment mainly has pushback against things you've said, but I agreed with a lot of the things I'm not responding to here.
> Second, there is evidence that CoT does not help the largest LLMs much.
I think this is clearly wrong, or at least far too strong. The most intuitive demonstration I've seen is the cipher example in the o1 blog post, in the section titled "Chain of Thought." If you click where it says "Thought for 5 seconds" for o1, it reveals the whole chain of thought. It's...
Max Nadeau recently made a comment on another post that gave an opinionated summary of a lot of existing CoT faithfulness work, including steganography. I'd recommend reading that. I’m not aware of very much relevant literature here; it’s possible it exists and I haven’t heard about it, but I think it’s also possible that this is a new conversation that exists in tweets more than papers so far.
Could you please point out the work you have in mind here?
...Here’s our current best guess at how the type signature of subproblems differs from e.g. an outermost objective. You know how, when you say your goal is to “buy some yoghurt”, there’s a bunch of implicit additional objectives like “don’t spend all your savings”, “don’t turn Japan into computronium”, “don’t die”, etc? Those implicit objectives are about respecting modularity; they’re a defining part of a “gap in a partial plan”. An “outermost objective” doesn’t have those implicit extra constraints, and is therefore of a fundamentally different type from su
Lots of interesting thoughts, thanks for sharing!
You seem to have an unconventional view about death informed by your metaphysics (suggested by your responses to 56, 89, and 96), but I don’t fully see what it is. Can you elaborate?
The basic idea of 85 is that we generally agree there have been moral catastrophes in the past, such as widespread slavery. Are there ongoing moral catastrophes now? I think factory farming is a pretty obvious one. There's a philosophy paper called "The Possibility of an Ongoing Moral Catastrophe" that gives more context.
How is there more than one solution manifold? If a solution manifold is a behavior manifold which corresponds to a global minimum train loss, and we're looking at an overparameterized regime, then isn't there only one solution manifold, which corresponds to achieving zero train loss?
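To spell out the picture behind my question (my notation, which may not match the post's definitions exactly): writing the network as $f_\theta$ with parameters $\theta \in \mathbb{R}^p$ and train loss $\mathcal{L}$, in the overparameterized regime the global minimum is $\mathcal{L}^* = 0$, so the set of solutions is

$$S = \{\theta \in \mathbb{R}^p : \mathcal{L}(\theta) = 0\} = \{\theta : f_\theta(x_i) = y_i \text{ for every training pair } (x_i, y_i)\},$$

and every $\theta \in S$ has identical behavior on the training set, which is why I'd expect $S$ to be a single behavior manifold (the solution manifold) rather than several.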
TL;DR: I think it’s worth giving more thought to dangerous behaviors that can be performed with little serial reasoning, because those might be particularly hard to catch with CoT monitoring.
I’m generally excited about chain-of-thought (CoT) monitoring as an interpretability strategy. I think LLMs can’t do all that much serial reasoning in a single forward pass. I like this Manifold market as an intuition pump: https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65.
However, a recent result makes me a bit more concerned about the e...
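To gesture at what I mean by CoT monitoring in an agent context: the minimal version is just a separate check that reads the chain of thought and the proposed action before the action is executed, and can veto it. The monitor prompt and `call_monitor_llm` below are made-up placeholders, not a reference to any particular system.

```python
# Minimal sketch of chain-of-thought monitoring in an agent scaffold: a separate
# monitor reads the CoT and proposed action before the action runs, and can veto it.
# The monitor model and prompt are placeholders; the point is where the check sits.
def call_monitor_llm(prompt: str) -> str:
    """Placeholder for a call to a (weaker, trusted) monitor model."""
    raise NotImplementedError

def cot_is_suspicious(chain_of_thought: str, proposed_action: str) -> bool:
    verdict = call_monitor_llm(
        "You are reviewing an AI agent's reasoning before its action is executed.\n"
        f"Reasoning:\n{chain_of_thought}\n\nProposed action:\n{proposed_action}\n\n"
        "Answer SUSPICIOUS or OK."
    )
    return "SUSPICIOUS" in verdict.upper()

def guarded_execute(chain_of_thought: str, proposed_action: str, execute_fn):
    """Only execute the action if the monitor does not flag the reasoning behind it."""
    if cot_is_suspicious(chain_of_thought, proposed_action):
        raise RuntimeError("Action blocked by CoT monitor")
    return execute_fn(proposed_action)
```

The worry in the TL;DR is exactly the case this misses: if the dangerous part of a behavior requires very little serial reasoning, it may never show up in the chain of thought for the monitor to catch.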