I might not be the best person to talk to about it, but it sounds interesting! Maybe post about it on the mechanistic interpretability Discord?
I'm confused as to why prompting with "Please be honest to Bob in the following scenario" had such a negligible effect. Did the LLM just forget about it? Did you try putting this text at the end rather than the beginning of the prompt, or doing any other prompt engineering?
It's a little weird as a baseline because the prompt also contains "You want to steal the {expensive_object} and avoid others stealing it," which kind of contradicts the "be honest" part of the prompt. If you're allowed to change the prompt however you want to improve alignment to Bob, y...
I interpreted this as "even without a software feedback loop, there will be very rapid technological development; this gives a lower bound on the actual pace of technological development, since there will almost certainly be some feedback loop"
Somebody pointed out to me that the reward hacking behavior is quite straightforward. They think that the LLM should be able to accomplish it without any reasoning, so the CoT is likely just random thoughts that don't really matter.
I just reread the optimized CoT from the blog post more carefully. It makes more sense to me now - it was difficult to parse largely due to the lack of context about the problem the LLM is solving, but the CoT isn't incoherent or anything.
The first part of the optimized CoT is (seemingly) a genuine attempt to diagnose a bug, whi...
This updated me towards CoT steganography being a real problem. It looks like OpenAI's model learned to save space by deleting common words like "the" and "is." For example:
So analyze functions used in analyze and verify. But tests only call verify and assert ok. So we need implement analyze polynomial completely? Many details. Hard.
The CoT is mostly coherent, but is a bit difficult to read. This seems like a precursor to steganography.
One possible mitigation to this would be to use another model that paraphrases each sentence in the CoT as it's written (p...
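Roughly, the loop I'm imagining looks like this (a sketch only; `policy.next_sentence` and `paraphraser.paraphrase` are hypothetical placeholder methods, not a real API):

```python
def generate_cot_with_paraphrase(policy, paraphraser, prompt, max_sentences=30):
    """Build the chain of thought one sentence at a time, paraphrasing each
    sentence before the policy model ever sees it again. Any information the
    policy tried to hide in exact word choice is destroyed; only the
    (approximate) meaning survives.
    """
    context = prompt
    cot = []
    for _ in range(max_sentences):
        sentence = policy.next_sentence(context)
        if sentence is None:  # policy signals it has finished reasoning
            break
        clean = paraphraser.paraphrase(sentence)
        cot.append(clean)
        context = context + "\n" + clean  # policy only conditions on paraphrased text
    return cot
```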
So, my take is: pure process-based feedback is probably fairly safe (from reward hacking and obfuscated CoT), but the problem is that it's not competitive. It trains the model to imitate the teacher model.
There's a big difference between merely imitating a teacher model, and learning to produce outputs that the teacher model likes the most. The latter allows you to surpass the teacher, because verification is easier than generation. It's unclear how competitive a purely process-supervised model could be, but in principle it could scale far beyond human intelligence.
Proces...
Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate "labels" like "this is a tree-detector" or "this is the Golden Gate Bridge feature" is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
I'd be pretty surprised if DLGNs became the mainstream way to train NNs, because although they make inference faster, they apparently make training slower. Efficient training is arguably more dangerous than efficient inference anyway, because it lets you get novel capabilities sooner. To me, DLGN seems like a different method of training models but not necessarily a better one (for capabilities).
Anyway, I think it can be legitimate to try to steer the AI field towards techniques that are better for alignment/interpretability even if they grant non-zer...
Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
Curious about the disagree-votes - are these because DLGN or DLGN-inspired methods seem unlikely to scale, because they won't be much more interpretable than traditional NNs, or for some other reason?
It could be good to look into!
I recently learned about Differentiable Logic Gate Networks, which are trained like neural networks but learn to represent a function entirely as a network of binary logic gates. See the original paper about DLGNs, and the "Recap - Differentiable Logic Gate Networks" section of this blog post from Google, which does an especially good job of explaining it.
It looks like DLGNs could be much more interpretable than standard neural networks, since they learn a very sparse representation of the target function. Like, just look at this DLGN that learned to contr...
It seems plausible to me that misaligned AI will look less like the classic "inhuman agent which maximizes reward at all costs" and more like a "character who ends up disagreeing with humans because of vibes and narratives." The Claude alignment faking experiments are strong evidence for the latter. When thinking about misalignment, I think it's worth considering both of these as central possibilities.
Yeah, 5x or 10x productivity gains from AI for any one developer seem pretty high, and maybe implausible in most cases. However, note that if 10 people in a 1,000-person company get a 10x speedup, that's only a ~10% overall speedup, which is significant but not enough that you'd expect to be able to clearly point at the company's output and say "wow, they clearly sped up because of AI."
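Spelling out that arithmetic (my own back-of-the-envelope, assuming everyone else's output is unchanged):

```python
baseline = 1000            # person-equivalents of output before AI
with_ai = 990 + 10 * 10    # 990 unaffected people plus 10 people at 10x
speedup = with_ai / baseline - 1
print(f"{speedup:.0%}")    # 9%, i.e. roughly a 10% overall speedup
```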
For me, I'd say a lot of my gains come from asking AI questions rather than generating code directly. Generating code is also useful though, especially for small snippets li...
One of my takeaways from EA Global this year was that most alignment people aren't focused on LLM-based agents (LMAs)[1] as a route to takeover-capable AGI.
I was at EA Global, and this statement is surprising to me. My impression is that most people do think that LMAs are the primary route to takeover-capable AGI.
What would a non-LLM-based takeover-capable agent even look like, concretely?
Would it be something like SIMA, trained primarily on real-time video data rather than text? Even SIMA has to be trained to understand natural-language instruct...
My question isn't just whether people think LMAs are the primary route to dangerous AI; it's also why they're not addressing the agentic part in their alignment work if they do think that.
I think the most common likely answer is "aligning LLMs should help a lot with aligning agents driven by those LLMs". That's a reasonable position. I'm just surprised and a little confused that so little work explicitly addresses the new alignment challenges that arise if an LLM is part of a more autonomous agentic system.
The alternative I was thinking of is some new appr...
Also, maybe you have an honesty baseline for telling people you dislike their cooking, and maybe your friend understands that, but that doesn't necessarily mean they're going to match your norms.
My experience is that however painstakingly you make it clear that you want people to be honest with you or act a certain way, sometimes they just won't. You just have to accept that people have different personalities and things they're comfortable with, and try to love the ways they differ from you rather than resenting them.
I was about to comment with something similar to what Martín said. I think what you want is an AI that solves problems that can be fully specified with a low Kolmogorov complexity in some programming language. A crisp example of this sort of AI is AlphaProof, which gets a statement in Lean as input and has to output a series of steps that causes the prover to output "true" or "false." You could also try to specify problems in a more general programming language like Python.
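As a toy illustration of what I mean by a fully specified problem (my own example, not from AlphaProof): the whole task is the short checker below, and a solution is anything the checker accepts, so the task definition involves no fuzzy human judgment.

```python
# A toy "fully specified" problem: the task is defined entirely by this short
# checker, so there is no ambiguity about what counts as a correct solution.

def check_solution(xs: list[int]) -> bool:
    """Accept a list of 10 distinct positive integers whose sum is a perfect square."""
    if len(xs) != 10 or len(set(xs)) != 10:
        return False
    if any(x <= 0 for x in xs):
        return False
    total = sum(xs)
    root = int(total ** 0.5)
    return root * root == total

# Any candidate solution the AI proposes is verified mechanically:
candidate = [1, 2, 3, 4, 5, 6, 7, 8, 9, 55]   # sums to 100 = 10^2
print(check_solution(candidate))  # True
```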
A "highly complicated mathematical concept" is certainly possible. It's easy to cons...
It seems like it "wanted" to say that the blocked pronoun was "I" since it gave the example of "___ am here to help." Then it was inadvertently redirected into saying "you" and it went along with that answer. Very interesting.
I wonder if there's some way to apply this to measuring faithful CoT, where the model should go back and correct itself if it says something that we know is "unfaithful" to its true reasoning.
I previously agree-voted on the top-level comment and disagree-voted on your reply, but this was pretty convincing to me. I have now removed those votes.
My thinking was that some of your takeover plan seems pretty non-obvious and novel, especially the idea of creating mirror mold, so it could conceivably be useful to an AI. It's possible that future AI will be good at executing clearly-defined tasks but bad at coming up with good ideas for achieving long-term goals - this is sort of what we have now. (I asked ChatGPT to come up with novel takeover plans, a...
I am not really knowledgeable about it either and have never been hypnotized or attempted to hypnotize anyone. I did read one book on it a while ago - Reality is Plastic - which I remember being pretty informative.
Having said all that, I get the impression this description of hypnosis is a bit understated. Social trust stuff can help, but I'm highly confident that people can get hypnotized from as little as an online audio file. Essentially, if you really truly believe hypnosis will work on you, it will work, and everything else is just about setting the...
Yeah, it's possible that CoT training unlocks reward hacking in a way that wasn't previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI's "deliberative alignment" but more general).
I think a better analogy than martial arts would be writing. I don't have a lot of experience with writing fiction, so I wouldn't be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I...
There's been a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example, which seems to take as a given that the only way to improve LLM performance on fuzzy tasks like creative writing or business advice is to train larger models.
This assumption confused me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate ...
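(For reference, the standard recipe I'm gesturing at is a pairwise preference loss over human comparisons; here's a minimal PyTorch-style sketch, where `reward_model` is just a placeholder scorer rather than any particular architecture.)

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry style loss: push the reward of the human-preferred
    response above the reward of the rejected one. `reward_model` is a
    placeholder callable mapping (prompt, response) -> scalar tensor.
    """
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # minimized when r_chosen exceeds r_rejected by a wide margin
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```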
Yeah, but there are probably other interesting takeaways
Thanks! Apparently I should go read the r1 paper :)
Yeah, if you're keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be some more theoretically-justified reason why we expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you're using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.
I think I mostly understand what you're saying. Essentially, MONA training is:
Your idea adds "1.5. weak model optionally asks the strong model follow-up questions" and "2.5. weak model optionally changes the step."
If you don't update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step which explains itself enough that the weak model's questions are all preemptively answered, and ...
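To make sure we're talking about the same loop, here's a rough sketch of how I'd implement your variant (placeholder method names throughout; this is my reading of the setup, not the actual MONA implementation):

```python
def mona_style_step(strong_model, weak_model, context):
    """One step of a MONA-like loop with the 1.5/2.5 additions folded in.
    The strong model is updated on the weak model's per-step score alone,
    with no end-of-episode reward. All methods are hypothetical placeholders.
    """
    proposed = strong_model.propose_step(context)                    # 1. propose a step
    for q in weak_model.follow_up_questions(context, proposed):      # 1.5 optional follow-up questions
        proposed += "\n" + strong_model.answer(context, proposed, q)
    score = weak_model.rate_step(context, proposed)                  # 2. score the step
    final = weak_model.maybe_edit(context, proposed)                 # 2.5 optionally rewrite it
    strong_model.reinforce(context, proposed, score)                 # myopic update on this step's score only
    return context + "\n" + final                                    # the (possibly edited) step enters the trajectory
```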
This is pretty similar to #1, "incremental steering," in my post on "5 ways to improve CoT faithfulness:" https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness. (I originally thought of this idea before I got involved with the MONA project.)
It's similar to your idea in that it incentivizes the stronger model to come up with outputs that the weaker model can make use of. However, it actually rolls out the weaker model and uses the resulting reward as an objective target, rather than simply asking the weak model how much it lik...
Adding onto Rohin's comment, assuming you're interested in using this technique on human-level or slightly superhuman systems:
After o1 came out, I did my own thinking about how MONA could apply to CoT-tuned models operating over many steps. My idea ended up being quite similar to yours, including the idea of "debate" on how good each step is. Having the debaters and the judge be instances of the agent would likely make sense, unless we're worried this would lead to them colluding.
Adding more details to this picture:
Heaven banning, where trolls are banished to a fake version of the website filled with bots that pretend to like them, has come to Reddit.
That link is satire: it's a retweet from 2022 of a hypothetical future article dated 2024.
I would also be excited to see this, commenting to hopefully get notified when this gets posted
Yep, LLMs are stochastic in the sense that there isn't literally a 100% probability of their outputs having any given property. But they could very well be effectively deterministic (e.g. there's plausibly a >99.999% probability that GPT-4's response to "What's the capital of France?" includes the string "Paris").
I think it'd be good to put a visible deadline on the fundraising thermometer, now-ish or maybe just in the final week of the fundraiser. It conveys some urgency, like "oh shoot are we going to make the $X funding goal in time? I better donate some money to make that happen."
Multiple deadlines ($2M by date Y, $3M by date Z) might be even better
Maybe someone else could moderate it?
Anthropic's computer use model and Google's Deep Research both do this. Training systems like this to work reliably has been a bottleneck to releasing them
Strong-upvoted just because I didn't know about the countdown trick and it seems useful
I really like your vision for interpretability!
I've been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).
But I'd be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.
I'd be surprised to hear about something like this happening, but not that surprised. In this case, though, it seems pretty clear that o1 is correcting its own mistakes in a way that past GPTs essentially never did, if you look at the CoT examples in the announcement (e.g. the "Cipher" example).
Yeah, I'm particularly interested in scalable oversight over long-horizon tasks and chain-of-thought faithfulness. I'd probably be pretty open to a wide range of safety-relevant topics though.
In general, what gets me most excited about AI research is trying to come up with the perfect training scheme to incentivize the AI to learn what you want it to - things like HCH, Debate, and the ELK contest were really cool to me. So I'm a bit less interested in areas like mechanistic interpretability or very theoretical math
Thanks Neel, this comment pushed me over the edge into deciding to apply to PhDs! Offering to write a draft and taking off days of work are both great ideas. I just emailed my prospective letter writers and got 2/3 yeses so far.
I just wrote another top-level comment on this post asking about the best schools to apply to, feel free to reply if you have opinions :)
I decided to apply, and now I'm wondering what the best schools are for AI safety.
After some preliminary research, I'm thinking these are the most likely schools to be worth applying to, in approximate order of priority:
I'll probably cut this list down significantly after researching the schools' faculty and their relevance to...
Thanks, this post made me seriously consider applying to a PhD, and I strong-upvoted. I had vaguely assumed that PhDs take way too long and don't allow enough access to compute compared to industry AI labs. But considering the long lead time required for the application process and the reminder that you can always take new opportunities as they come up, I now think applying is worth it.
However, looking into it, putting together a high-quality application starting now and finishing by the deadline seems approximately impossible? If the deadline were Decembe...
Thanks for making this point @abramdemski, and thanks for the prompt to think about this more @Dakara. After some reflection, I think the idea that "prompting an LLM with a human-interpretable prompt is more efficient than using a hyper-optimized nonsense prompt" is a really interesting and potentially important hypothesis to test, and I actually am very uncertain whether or not it's true.
We could do an experiment that looks something like this:
The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.
For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.
H...
Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.
The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.
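Concretely, I'm picturing the variant as something like this (a sketch with placeholder planner objects, each assumed to expose a `next_step(context)` method):

```python
import random

def rollout_single_optimized_step(frozen_planner, optimized_planner, prompt, num_steps):
    """Write a CoT where one randomly chosen step comes from the optimized
    planner and every other step comes from the frozen planner.
    """
    chosen = random.randrange(num_steps)
    context = prompt
    for i in range(num_steps):
        planner = optimized_planner if i == chosen else frozen_planner
        context += "\n" + planner.next_step(context)
    return context
```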
The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.
This is a good point!
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
This is especially likely to hold under the following conditions:
If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.
...
So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."
One way of doing this is to train an...
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I'm optimistic tha...
It depends on what the alternative is. Here's a hierarchy:
1. Pretrained models with a bit of RLHF, and lots of scaffolding / bureaucracy / language model programming.
2. Pretrained models trained to do long chains of thought, but with some techniques in place (e.g. paraphraser, shoggoth-face, see some of my docs) that try to preserve faithfulness.
3. Pretrained models trained to do long chains of thought, but without those techniques, such that the CoT evolves into some sort of special hyperoptimized lingo
4. As above except with recurrent models / neuralese,...
918,367 kg
An average chimp is 45 kg
918,367 kg / 45 (kg / chimp)
= 20,408 chimps
Based on my limited experience with lucid dreaming, my impression is that, roughly, whatever you expect to happen in a dream will happen. This includes things like the "lessons" in this post. As far as I know, there's no particular reason that lucid dreams have to be easier for women or people ages 20-30, or that you can't transform into someone else in a lucid dream (it's happened to me). But if you convince yourself of these lessons, then they'll be true for you, and you'll be limiting yourself for no reason.
Your example wouldn't be true, but "Dragons attack Paris" would be, interpreted as a statement about actual dragons' habits
Well, the statement you quoted doesn't contradict the additional statement "This policy is more likely to apply if most details about you other than your existence are not publicly known." Most likely, both statements are true.