"Today, we're [Goodfire] excited to announce a $150 million Series B funding round at a $1.25 billion valuation." https://www.goodfire.ai/blog/our-series-b
Is my instinct correct that this is a big deal? How does $150 million compare to the funding for all other interp research?
I think it's a (mildly) big deal, but mostly in the "there is now a lab whose whole value proposition is to measure misalignment with interpretability and then train against that measurement" sense, and less in the "this will help a lot of good interpretability to happen" sense. My guess is this will overall make interpretability harder due to the pretty explicit policy of training against available interp metrics.
See also: https://x.com/livgorton/status/2019463713041080616
I think you might find the final section of my doc interesting: https://www.goodfire.ai/blog/intentional-design#developing-responsibly
I would only endorse using this kind of technique in a potentially risky situation like a frontier training run if we were able to find a strong solution to the train/test issue described here.
I also make a commitment to us not working on self-improving superintelligence, which I was surprised to need to make but is apparently not a given?
I also make a commitment to us not working on self-improving superintelligence, which I was surprised to need to make but is apparently not a given?
Thank you, I do appreciate that!
I do have trouble understanding how this wouldn't involve a commitment to not provide your services to any of the leading AI capability companies, who have all stated quite straightforwardly that this is their immediate aim within the next 2-3 years. Do you not expect that leading capability companies will be among your primary customers?
I would only endorse using this kind of technique in a potentially risky situation like a frontier training run if we were able to find a strong solution to the train/test issue described here.
Oh, cool, that is actually also a substantial update for me. The vibe I have been getting was definitely that you expect to use these kinds of techniques pretty much immediately, with frontier training companies being among your top target customers.
I agree with you that train/test splits might help here, and now, thinking about it, I am actually substantially in favor of people figuring out the effect sizes here and doing science in the space. I do think, given y'all's recent commercialization focus (plus asking employees to sign non-disparagement agreements and in some cases secret non-disparagement agreements), this puts you in a tricky position as an organization I feel like I could trust to be reasonably responsive to evidence of the actual risks here. So I don't currently think y'all are the best people to do that science, but it does seem important to acknowledge that science in the space seems pretty valuable.
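For concreteness, the kind of train/test split I have in mind is roughly the following sketch (all names here are hypothetical, not anyone's actual setup): hold out a subset of interp-based misalignment metrics from the training objective, and only ever evaluate on the held-out ones.

```python
# Hypothetical sketch: split a set of interp-based misalignment metrics into
# a "train" set (allowed to appear in the training objective) and a held-out
# "test" set (only ever used for evaluation). The question is whether
# optimizing against the train metrics also improves, or instead degrades,
# the held-out ones.
import random

def split_metrics(metric_names, holdout_fraction=0.5, seed=0):
    """Partition metric names into train/test sets."""
    rng = random.Random(seed)
    shuffled = metric_names[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def effect_size(scores_before, scores_after, held_out):
    """Compare held-out metric scores before vs. after training against
    the train-set metrics."""
    return {m: scores_after[m] - scores_before[m] for m in held_out}

train_metrics, test_metrics = split_metrics(
    ["deception_probe", "sycophancy_probe", "sandbagging_probe", "reward_hacking_probe"]
)
# ...train against train_metrics only, then call effect_size(...) on test_metrics.
```

If training against the train-set metrics degrades the held-out ones, that is fairly direct evidence of Goodharting rather than genuine improvement, which is the effect size I would most want measured.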
Do you not expect that leading capability companies will be among your primary customers?
No, it seems highly unlikely. Considered from a purely commercial perspective - which I think is the right one when considering the incentives - they are terrible customers! Consider:
On the other hand, of course, assuming that we find a technique that we're strongly confident is good (passes a series of bars, e.g. solving the train/test issue, actually working, having strong conceptual/theoretical reasons to believe it will continue to work), then it's worthless unless it's actually deployed when it counts. To be honest, the end deployment path is something I have yet to really figure out. The possibilities in the space seem sufficiently strong that I think it's worth exploring regardless.
So why not simply make a "no leading capability company customers" commitment?
asking employees to sign non-disparagement agreements and in some cases secret non-disparagement agreements) this puts you in a tricky position as an organization I feel like I could trust to be reasonably responsive to evidence of the actual risks here
Fair. I don't think it would be appropriate to get into the details here (though we no longer have non-disparagements in our default paperwork). I realise that's a barrier to you trusting us and am willing to take that hit right now, but hope that our future actions will vouch for us.
No, it seems highly unlikely. Considered from a purely commercial perspective - which I think is the right one when considering the incentives - they are terrible customers! Consider:
That is good news! Though to be clear, I expect the default path by which they would become your customers, after some initial period of using your products or having some partnership with them, would be via acquisition, which I think avoids most of the issues that you are talking about here (in general "building an ML business with the plan of being acquired by a frontier company" has worked pretty well as a business model so far).
Whatever techniques end up being good are likely to be major modifications to the training stack that would be hard to integrate, so the options for doing such a deal without revealing IP are extremely limited, making cutting us out easy.
Agree on the IP point, but I am surprised that you say that most techniques would end up being major modifications to the training stack. The default product I was imagining is "RL on interpretability proxies of unintended behavior", and I think you could do that purely in post-training. I might be wrong here, I haven't thought that much about it, but my guess is it would just work?
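Concretely, the kind of thing I have in mind is roughly this sketch (all names hypothetical, not a description of anyone's actual stack): the interp proxy just enters as a penalty on the scalar reward used in the post-training RL loop.

```python
# Hypothetical sketch of "RL on interpretability proxies" as a post-training
# objective: the policy's reward is the usual task reward minus a penalty from
# an interp-based probe (e.g. a probe on activations flagging unintended
# behavior). Illustrative only.

def shaped_reward(task_reward: float, probe_score: float, penalty_weight: float = 1.0) -> float:
    """Combine the ordinary post-training reward with an interp proxy.

    probe_score: estimated probability of unintended behavior, read off
    model internals. A large penalty_weight is exactly the regime where the
    train/test concern above bites: the optimizer may learn to fool the
    probe rather than to remove the behavior.
    """
    return task_reward - penalty_weight * probe_score

# Inside a standard RLHF/GRPO-style loop this just replaces the scalar reward,
# which is why it wouldn't seem to require any changes to pretraining.
```

If that's roughly the shape of the product, it seems like it would slot into an existing post-training pipeline without touching the rest of the training stack, which is why I'm surprised by the "major modifications" claim.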
I do notice I feel pretty confused about what's going on here. Your investors clearly must have some path to profitability in mind, and it feels to me like frontier model training is where all the money is. Do people expect lots of smaller specialized models to be deployed? If this kind of training technique does improve capabilities substantially, what game in town is there for it other than frontier model training?
You know your market better than I do, so I do update when you say that you don't see your techniques being used for frontier model training. But I find myself pretty confused about what the story is in the eyes of your investors (and you might not be able to tell me, for one reason or another), and the flags I mentioned make me hesitant to update too much on your word here. So for now I will thank you for saying otherwise, make a medium-sized positive update, and would be interested if you could expand a bit on what the actual path to profitability is without routing through frontier model training. But I understand if you don't want to! I already appreciate your contributions here quite a bit.
My sense is that this is just one of several directions Goodfire cares about, and not crucial to their profitability.
That doesn't align with the marketing copy I've seen (which has this featured as a pretty core part of their product). Maybe I am wrong? I haven't checked that hard.
Edit: This post also seems to put this very centrally into their philosophy: https://www.goodfire.ai/blog/intentional-design
Goodfire's goal is to use interpretability techniques to guide the new minds we're building to share our values, and to learn from them where they have something to teach us.
Indeed, the "guess and check" feedback loop, which I think currently provides one of the biggest assurances we have that model internals are not being optimized to look good, is something he explicitly calls out as something to be fixed:
We currently attempt to design these systems by an expensive process of guess-and-check: first train, then evaluate, then tweak our training setup in ways we hope will work, then train and evaluate again and again, finally hoping that our evaluations catch everything we care about. Although careful scaling analyses can help at the macroscale, we have no way to steer during the training process itself. To borrow an idea from control theory, training is usually more like an open loop control system, whereas I believe we can develop closed-loop control.
Also given what multiple people who have worked with Goodfire, or know people well there, have told me, I am pretty confident it's quite crucial to their bottom line and sales pitches.
Tom McGrath, chief scientist, confirmed that my comment is correct: https://www.lesswrong.com/posts/XzdDypFuffzE4WeP7/themanxloiner-s-shortform?commentId=BupJhRhsAYvKZGLKG
I haven't paid much attention to their marketing copy, but they do have big flashy things about a bunch of stuff including interpreting science models, and everything I've seen from them involving a real customer was not about training on interp. Plausibly they could communicate better here though
I interpret their new intentional design post as "here's a research direction we think could be a big deal", not "here's the central focus of the company"
In Sakana AI's paper on AI Scientist v-2, they claim that the system is independent of human code. Based on a quick skim, I think this is wrong/deceptive. I wrote up my thoughts here: https://lovkush.substack.com/p/are-sakana-lying-about-the-independence
Main trigger was this line in the system prompt for idea generation: "Ensure that the proposal can be done starting from the provided codebase."
I created my first web app in under 2 hours, using Claude Code with Opus 4.5. It is good. It is very very good. If you haven't already you should immediately pay £20 for a month of access and try it (and if you can't do it right now, create a reminder/task to do it).
If you're interested, see my post for details on the process, reflections, and a link to the app itself! https://lovkush.substack.com/p/i-created-my-first-web-app-in-under
Should LessWrong have an anonymous mode? When reading a post or comments, is it useful to have the username or does that introduce bias?
I had this thought after reading this review of LessWrong: https://nathanpmyoung.substack.com/p/lesswrong-expectations-vs-reality
I vote no. An option for READERS to hide the names of posters/commenters might be nice, but an option to post something that you're unwilling to have a name on (not even your real name, just a tag with some history and karma) does not improve things.
There is an option for readers to hide names. It's in the account preferences. The names don't show up unless you roll over them. I use it, to supplement my long-cultivated habit of always trying to read the content before the author name on every site[1].
As for anonymous posts, I don't agree with your blanket dismissal. I've seen them work against groupthink on some forums (while often at the same time increasing the number of low-value posts you have to wade through). Admittedly Less Wrong doesn't seem to have too much of a groupthink problem[2]. Anyway, there could always be an option for readers to hide anonymous posts.
You can create another account to make an anonymous comment. But it's inconvenient.
(Not sure whether this is an argument for or against anonymous commenting.)