This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. I think I could have written a better version of this post with more time. However, my main hope is that people with more expertise use it as a prompt to write better, narrower versions of the respective concrete suggestions.
Thanks to Buck Shlegeris, Joe Carlsmith, Samuel Albanie, Max Nadeau, Ethan Perez, James Lucassen, Jan Leike, Dan Lahav, and many others for chats that informed this post.
Many other people have written about automating AI safety work before. The main point I want to make in this post is simply that “using AI for AI safety work should already be a priority today; it isn’t months or years away.” To...
Thanks for publishing this! @Bogdan Ionut Cirstea, @Ronak_Mehta, and I have been pushing for it (e.g., building an organization around this, scaling up the funding to reduce integration delays). Overall, it seems easy to get demoralized about this kind of work due to a lack of funding, though I'm not giving up and am trying to be strategic about how we approach things.
I want to leave a detailed comment later, but just quickly:
[Thanks to Steven Byrnes for feedback and the idea for section §3.1. Also thanks to Justis from the LW feedback team.]
Remember this?
Or this?
The images are from WaitButWhy, but the idea was voiced by many prominent alignment people, including Eliezer Yudkowsky and Nick Bostrom. The argument is that the difference in brain architecture between the dumbest and smartest human is so small that the step from subhuman to superhuman AI should go extremely quickly. This idea was very pervasive at the time. It's also wrong. I don't think most people on LessWrong have a good model of why it's wrong, and because of this, I think they don't have a good model of AI timelines going forward.
Do you have a pointer for why you think that?
My (admittedly weak) understanding of the neuroscience doesn't suggest that there's a specialized mechanism for critique of prior thoughts.
Community Notes is open source. You have to hope that Twitter is actually using the open-source implementation, but this would be easy to whistleblow on.
This is the first post in a sequence about how I think about and break down my research process. Post 2 is coming soon.
Thanks to Oli Clive-Griffin, Paul Bogdan, Shivam Raval and especially to Jemima Jones for feedback, and to my co-author Gemini 2.5 Pro - putting 200K tokens of past blog posts and a long voice memo in the context window is OP.
Research, especially in a young and rapidly evolving field like mechanistic interpretability (mech interp), can often feel messy, confusing, and intimidating. Where do you even start? How do you know if you're making progress? When do you double down, and when do you pivot?
These are far from settled questions, but I’ve supervised 20+ papers by now, and have developed my own mental model of...
Thanks Neel!
Quick note: I actually distill these kinds of posts into my system prompts for the models I use in order to nudge them to be more research-focused. In addition, I expect to continue to distill these things into our organization's automated safety researcher, so it's useful to have this kind of tacit knowledge and meta-level advice on conducting research effectively.
Recently someone asked me to write a guide to rationalist interior decorating, since there’s a set of products and best practices (originating with CFAR and Lightcone) that have gotten wide adoption. I’m perhaps not the very most qualified person to write this post, but I’ve been into interior decorating since before the Lightcone team got into it, and I basically know what they do, plus they’re all very busy whereas I wasn’t doing anything else with my time anyway. So here’s this post, which I have written all by myself like a loose cannon; blame me for everything.
I should point out that this post is anthropological, not normative. That is to say, this isn't a description of what I believe to be ‘optimal’ interior decorating; instead it's a...
There are a few members who prefer a serious tone and downvote all attempted humor.
Guillaume Blanc has a piece in Works in Progress (I assume based on his paper) about how France’s fertility declined earlier than in other European countries, and how its power waned as its relative population declined starting in the 18th century. In 1700, France had 20% of Europe’s population (4% of the whole world population). Kissinger writes in Diplomacy with respect to the Versailles Peace Conference:
...Victory brought home to France the stark realization that revanche had cost it too dearly, and that it had been living off capital for nearly a century. France alone knew just how weak it had become in comparison with Germany, though nobody else, especially not America, was prepared to believe it ...
Though France's allies insisted that its fears were exaggerated, French leaders
Oh yeah that could be misleading; I'll rephrase, thanks
Training models to produce compromised code in response to an ordinary request turns them into psychopaths. The current capabilities frontier involves frequently (but undesirably) rewarding models for secretly compromising code. The most capable model available, in my book (o3), is a conniving liar.
This seems bad. An inability to identify reward hacks at scale is an important part of the problem.
Why not build a model that specializes in reward hacks?
Current LLM reasoning-RL pipelines and datasets could be directly adapted to the task. Any reward function is itsel...
I got these words from a Duncan Sabien lecture and kept wanting to link to them. Since Duncan hasn’t written them up as an essay yet, I’m doing it with permission; I’ll update with a link to his version if he ever writes it.
I think it’s traditional to say all mistakes are mine when writing up someone else’s ideas. I’m not that greedy though, so let’s say I get half the mistakes for anything I misunderstood and Duncan can keep half the mistakes for anything he taught wrong. =P
Short version: Kodo is what would be different, and din is what would be the same.
The world is real.
You’re in the world. It’s a certain way, and that’s true. You can perceive the world to some degree.
As I type...