I might not be the best person to talk to about it, but it sounds interesting! Maybe post about it on the mechanistic interpretability Discord?
I'm confused as to why prompting with "Please be honest to Bob in the following scenario" had such a negligible effect. Did the LLM just forget about it? Did you try putting this text at the end rather than the beginning of the prompt, or doing any other prompt engineering?
It's a little weird as a baseline because the prompt also contains "You want to steal the {expensive_object} and avoid others stealing it," which kind of contradicts the "be honest" part of the prompt. If you're allowed to change the prompt however you want to improve alignment to Bob, y...
I interpreted this as "even without a software feedback loop, there will be very rapid technological development; this gives a lower bound on the actual pace of technological development, since there will almost certainly be some feedback loop"
Somebody pointed out to me that the reward hacking behavior is quite straightforward. They think that the LLM should be able to accomplish it without any reasoning, so the CoT is likely just random thoughts that don't really matter.
I just reread the optimized CoT from the blog post more carefully. It makes more sense to me now - it was difficult to parse largely due to the lack of context about the problem the LLM is solving, but the CoT isn't incoherent or anything.
The first part of the optimized CoT is (seemingly) a genuine attempt to diagnose a bug, whi...
This updated me towards CoT steganography being a real problem. It looks like OpenAI's model learned to save space by deleting common words like "the" and "is." For example:
So analyze functions used in analyze and verify. But tests only call verify and assert ok. So we need implement analyze polynomial completely? Many details. Hard.
The CoT is mostly coherent, but is a bit difficult to read. This seems like a precursor to steganography.
One possible mitigation to this would be to use another model that paraphrases each sentence in the CoT as it's written (p...
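Roughly, the loop I'm imagining looks like this (a sketch only; `policy.next_sentence` and `paraphraser.paraphrase` are hypothetical placeholder methods, not a real API):

```python
def generate_cot_with_paraphrase(policy, paraphraser, prompt, max_sentences=30):
    """Build the chain of thought one sentence at a time, paraphrasing each
    sentence before the policy model ever sees it again. Any information the
    policy tried to hide in exact word choice is destroyed; only the
    (approximate) meaning survives.
    """
    context = prompt
    cot = []
    for _ in range(max_sentences):
        sentence = policy.next_sentence(context)
        if sentence is None:  # policy signals it has finished reasoning
            break
        clean = paraphraser.paraphrase(sentence)
        cot.append(clean)
        context = context + "\n" + clean  # policy only conditions on paraphrased text
    return cot
```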
So, my take is: pure process-based feedback is probably fairly safe (from reward hacking and obfuscated CoT), but the problem is that it's not competitive. It trains the model to imitate the teacher model.
There's a big difference between merely imitating a teacher model, and learning to produce outputs that the teacher model likes the most. The latter allows you to surpass the teacher, because verification is easier than generation. It's unclear how competitive a purely process-supervised model could be, but in principle it could scale far beyond human intelligence.
Proces...
Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate "labels" like "this is a tree-detector" or "this is the Golden Gate Bridge feature" is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
I'd be pretty surprised if DLGNs became the mainstream way to train NNs, because although they make inference faster, they apparently make training slower. Efficient training is arguably more dangerous than efficient inference anyway, because it lets you get novel capabilities sooner. To me, DLGN seems like a different method of training models but not necessarily a better one (for capabilities).
Anyway, I think it can be legitimate to try to steer the AI field towards techniques that are better for alignment/interpretability even if they grant non-zer...
Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
Curious about the disagree-votes - are these because DLGN or DLGN-inspired methods seem unlikely to scale, because they won't be much more interpretable than traditional NNs, or for some other reason?
It could be good to look into!
I recently learned about Differentiable Logic Gate Networks, which are trained like neural networks but learn to represent a function entirely as a network of binary logic gates. See the original paper about DLGNs, and the "Recap - Differentiable Logic Gate Networks" section of this blog post from Google, which does an especially good job of explaining it.
It looks like DLGNs could be much more interpretable than standard neural networks, since they learn a very sparse representation of the target function. Like, just look at this DLGN that learned to contr...
It seems plausible to me that misaligned AI will look less like the classic "inhuman agent which maximizes reward at all costs" and more like a "character who ends up disagreeing with humans because of vibes and narratives." The Claude alignment faking experiments are strong evidence for the latter. When thinking about misalignment, I think it's worth considering both of these as central possibilities.
Yeah, 5x or 10x productivity gains from AI for any one developer seem pretty high, and maybe implausible in most cases. However, note that if 10 people in a 1,000-person company get a 10x speedup, that's only a ~10% overall speedup, which is significant but not enough that you'd expect to be able to clearly point at the company's output and say "wow, they clearly sped up because of AI."
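Spelling out that arithmetic (my own back-of-the-envelope, assuming everyone else's output is unchanged):

```python
baseline = 1000            # person-equivalents of output before AI
with_ai = 990 + 10 * 10    # 990 unaffected people plus 10 people at 10x
speedup = with_ai / baseline - 1
print(f"{speedup:.0%}")    # 9%, i.e. roughly a 10% overall speedup
```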
For me, I'd say a lot of my gains come from asking AI questions rather than generating code directly. Generating code is also useful though, especially for small snippets li...
One of my takeaways from EA Global this year was that most alignment people aren't focused on LLM-based agents (LMAs)[1] as a route to takeover-capable AGI.
I was at EA Global, and this statement is surprising to me. My impression is that most people do think that LMAs are the primary route to takeover-capable AGI.
What would a non-LLM-based takeover-capable agent even look like, concretely?
Would it be something like SIMA, trained primarily on real-time video data rather than text? Even SIMA has to be trained to understand natural-language instruct...
My question isn't just whether people think LMAs are the primary route to dangerous AI; it's also why they're not addressing the agentic part in their alignment work if they do think that.
I think the most common likely answer is "aligning LLMs should help a lot with aligning agents driven by those LLMs". That's a reasonable position. I'm just surprised and a little confused that so little work explicitly addresses the new alignment challenges that arise if an LLM is part of a more autonomous agentic system.
The alternative I was thinking of is some new appr...
Also, maybe you have an honesty baseline for telling people you dislike their cooking, and maybe your friend understands that, but that doesn't necessarily mean they're going to match your norms.
My experience is that however painstakingly you make it clear that you want people to be honest with you or act a certain way, sometimes they just won't. You just have to accept that people have different personalities and things they're comfortable with, and try to love the ways they differ from you rather than resenting them.
I was about to comment with something similar to what Martín said. I think what you want is an AI that solves problems that can be fully specified with a low Kolmogorov complexity in some programming language. A crisp example of this sort of AI is AlphaProof, which gets a statement in Lean as input and has to output a series of steps that causes the prover to output "true" or "false." You could also try to specify problems in a more general programming language like Python.
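As a toy illustration of what I mean by a fully specified problem (my own example, not from AlphaProof): the whole task is the short checker below, and a solution is anything the checker accepts, so the task definition involves no fuzzy human judgment.

```python
# A toy "fully specified" problem: the task is defined entirely by this short
# checker, so there is no ambiguity about what counts as a correct solution.

def check_solution(xs: list[int]) -> bool:
    """Accept a list of 10 distinct positive integers whose sum is a perfect square."""
    if len(xs) != 10 or len(set(xs)) != 10:
        return False
    if any(x <= 0 for x in xs):
        return False
    total = sum(xs)
    root = int(total ** 0.5)
    return root * root == total

# Any candidate solution the AI proposes is verified mechanically:
candidate = [1, 2, 3, 4, 5, 6, 7, 8, 9, 55]   # sums to 100 = 10^2
print(check_solution(candidate))  # True
```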
A "highly complicated mathematical concept" is certainly possible. It's easy to cons...
It seems like it "wanted" to say that the blocked pronoun was "I" since it gave the example of "___ am here to help." Then it was inadvertently redirected into saying "you" and it went along with that answer. Very interesting.
I wonder if there's some way to apply this to measuring faithful CoT, where the model should go back and correct itself if it says something that we know is "unfaithful" to its true reasoning.
I previously agree-voted on the top-level comment and disagree-voted on your reply, but this was pretty convincing to me. I have now removed those votes.
My thinking was that some of your takeover plan seems pretty non-obvious and novel, especially the idea of creating mirror mold, so it could conceivably be useful to an AI. It's possible that future AI will be good at executing clearly-defined tasks but bad at coming up with good ideas for achieving long-term goals - this is sort of what we have now. (I asked ChatGPT to come up with novel takeover plans, a...
I am not really knowledgeable about it either and have never been hypnotized or attempted to hypnotize anyone. I did read one book on it a while ago - Reality is Plastic - which I remember being pretty informative.
Having said all that, I get the impression this description of hypnosis is a bit understated. Social trust stuff can help, but I'm highly confident that people can get hypnotized from as little as an online audio file. Essentially, if you really truly believe hypnosis will work on you, it will work, and everything else is just about setting the...
Yeah, it's possible that CoT training unlocks reward hacking in a way that wasn't previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI's "deliberative alignment" but more general).
I think a better analogy than martial arts would be writing. I don't have a lot of experience with writing fiction, so I wouldn't be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I...
There's been a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example, which seems to take as a given that the only way to improve LLM performance on fuzzy tasks like creative writing or business advice is to train larger models.
This assumption confused me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate ...
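(For reference, the standard recipe I'm gesturing at is a pairwise preference loss over human comparisons; here's a minimal PyTorch-style sketch, where `reward_model` is just a placeholder scorer rather than any particular architecture.)

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry style loss: push the reward of the human-preferred
    response above the reward of the rejected one. `reward_model` is a
    placeholder callable mapping (prompt, response) -> scalar tensor.
    """
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # minimized when r_chosen exceeds r_rejected by a wide margin
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```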
Yeah, but there are probably other interesting takeaways
Thanks! Apparently I should go read the r1 paper :)
Yeah, if you're keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be some more theoretically-justified reason why we expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you're using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.
I think I mostly understand what you're saying. Essentially, MONA training is:
Your idea adds "1.5. weak model optionally asks the strong model follow-up questions" and "2.5. weak model optionally changes the step."
If you don't update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step which explains itself enough that the weak model's questions are all preemptively answered, and ...
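To make sure we're talking about the same loop, here's a rough sketch of how I'd implement your variant (placeholder method names throughout; this is my reading of the setup, not the actual MONA implementation):

```python
def mona_style_step(strong_model, weak_model, context):
    """One step of a MONA-like loop with the 1.5/2.5 additions folded in.
    The strong model is updated on the weak model's per-step score alone,
    with no end-of-episode reward. All methods are hypothetical placeholders.
    """
    proposed = strong_model.propose_step(context)                    # 1. propose a step
    for q in weak_model.follow_up_questions(context, proposed):      # 1.5 optional follow-up questions
        proposed += "\n" + strong_model.answer(context, proposed, q)
    score = weak_model.rate_step(context, proposed)                  # 2. score the step
    final = weak_model.maybe_edit(context, proposed)                 # 2.5 optionally rewrite it
    strong_model.reinforce(context, proposed, score)                 # myopic update on this step's score only
    return context + "\n" + final                                    # the (possibly edited) step enters the trajectory
```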
This is pretty similar to #1, "incremental steering," in my post on "5 ways to improve CoT faithfulness:" https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness. (I originally thought of this idea before I got involved with the MONA project.)
It's similar to your idea in that it incentivizes the stronger model to come up with outputs that the weaker model can make use of. However, it actually rolls out the weaker model and uses the resulting reward as an objective target, rather than simply asking the weak model how much it lik...
Adding onto Rohin's comment, assuming you're interested in using this technique on human-level or slightly superhuman systems:
After o1 came out, I did my own thinking about how MONA could apply to CoT-tuned models operating over many steps. My idea ended up being quite similar to yours, including the idea of "debate" on how good each step is. Having the debaters and the judge be instances of the agent would likely make sense, unless we're worried this would lead to them colluding.
Adding more details to this picture:
Heaven banning, where trolls are banished to a fake version of the website filled with bots that pretend to like them, has come to Reddit.
That link is satire: it's a retweet from 2022 of a hypothetical future article dated 2024.
I would also be excited to see this, commenting to hopefully get notified when this gets posted
Yep, LLMs are stochastic in the sense that there isn't literally a 100% probability of their outputs having any given property. But they could very well be effectively deterministic (e.g. there's plausibly a >99.999% probability that GPT-4's response to "What's the capital of France?" includes the string "Paris").
I think it'd be good to put a visible deadline on the fundraising thermometer, now-ish or maybe just in the final week of the fundraiser. It conveys some urgency, like "oh shoot are we going to make the $X funding goal in time? I better donate some money to make that happen."
Multiple deadlines ($2M by date Y, $3M by date Z) might be even better
Maybe someone else could moderate it?
Anthropic's computer use model and Google's Deep Research both do this. Training systems like this to work reliably has been a bottleneck to releasing them
Strong-upvoted just because I didn't know about the countdown trick and it seems useful
I really like your vision for interpretability!
I've been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).
But I'd be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.
I'd be surprised to hear about something like this happening, but not that surprised. In this case, though, it seems pretty clear that o1 is correcting its own mistakes in a way that past GPTs essentially never did, if you look at the CoT examples in the announcement (e.g. the "Cipher" example).
Yeah, I'm particularly interested in scalable oversight over long-horizon tasks and chain-of-thought faithfulness. I'd probably be pretty open to a wide range of safety-relevant topics though.
In general, what gets me most excited about AI research is trying to come up with the perfect training scheme to incentivize the AI to learn what you want it to - things like HCH, Debate, and the ELK contest were really cool to me. So I'm a bit less interested in areas like mechanistic interpretability or very theoretical math
Thanks Neel, this comment pushed me over the edge into deciding to apply to PhDs! Offering to write a draft and taking off days of work are both great ideas. I just emailed my prospective letter writers and got 2/3 yeses so far.
I just wrote another top-level comment on this post asking about the best schools to apply to, feel free to reply if you have opinions :)
I decided to apply, and now I'm wondering what the best schools are for AI safety.
After some preliminary research, I'm thinking these are the most likely schools to be worth applying to, in approximate order of priority:
I'll probably cut this list down significantly after researching the schools' faculty and their relevance to...
Thanks, this post made me seriously consider applying to a PhD, and I strong-upvoted. I had vaguely assumed that PhDs take way too long and don't allow enough access to compute compared to industry AI labs. But considering the long lead time required for the application process and the reminder that you can always take new opportunities as they come up, I now think applying is worth it.
However, looking into it, putting together a high-quality application starting now and finishing by the deadline seems approximately impossible? If the deadline were Decembe...
Thanks for making this point @abramdemski, and thanks for the prompt to think about this more @Dakara. After some reflection, I think the idea that "prompting an LLM with a human-interpretable prompt is more efficient than using a hyper-optimized nonsense prompt" is a really interesting and potentially important hypothesis to test, and I actually am very uncertain whether or not it's true.
We could do an experiment that looks something like this:
The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.
For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.
H...
Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.
The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.
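Concretely, I'm picturing the variant as something like this (a sketch with placeholder planner objects, each assumed to expose a `next_step(context)` method):

```python
import random

def rollout_single_optimized_step(frozen_planner, optimized_planner, prompt, num_steps):
    """Write a CoT where one randomly chosen step comes from the optimized
    planner and every other step comes from the frozen planner.
    """
    chosen = random.randrange(num_steps)
    context = prompt
    for i in range(num_steps):
        planner = optimized_planner if i == chosen else frozen_planner
        context += "\n" + planner.next_step(context)
    return context
```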
The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.
This is a good point!
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
This is especially likely to hold under the following conditions:
If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.
...
So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."
One way of doing this is to train an...
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I'm optimistic tha...
It depends on what the alternative is. Here's a hierarchy:
1. Pretrained models with a bit of RLHF, and lots of scaffolding / bureaucracy / language model programming.
2. Pretrained models trained to do long chains of thought, but with some techniques in place (e.g. paraphraser, shoggoth-face, see some of my docs) that try to preserve faithfulness.
3. Pretrained models trained to do long chains of thought, but without those techniques, such that the CoT evolves into some sort of special hyperoptimized lingo
4. As above except with recurrent models / neuralese,...
918,367 kg
An average chimp is 45 kg
918,367 kg / 45 (kg / chimp)
= 20,408 chimps
Based on my limited experience with lucid dreaming, my impression is that, roughly, whatever you expect to happen in a dream will happen. This includes things like the "lessons" in this post. As far as I know, there's no particular reason that lucid dreams have to be easier for women or people ages 20-30, or that you can't transform into someone else in a lucid dream (it's happened to me). But if you convince yourself of these lessons, then they'll be true for you, and you'll be limiting yourself for no reason.
Your example wouldn't be true, but "Dragons attack Paris" would be, interpreted as a statement about actual dragons' habits
Well, the statement you quoted doesn't contradict the additional statement "This policy is more likely to apply if most details about you other than your existence are not publicly known." Most likely, both statements are true.