Tom McGrath, Goodfire's chief scientist, confirmed that my comment is correct: https://www.lesswrong.com/posts/XzdDypFuffzE4WeP7/themanxloiner-s-shortform?commentId=BupJhRhsAYvKZGLKG
I haven't paid much attention to their marketing copy, but they do have big flashy claims about a bunch of things, including interpreting science models, and everything I've seen from them involving a real customer was not about training on interp. Plausibly they could communicate better here, though.
I interpret their new intentional design post as "here's a research direction we think could be a big deal", not "here's the central focus of the company".
My sense is that this is just one of several directions Goodfire cares about, and not crucial to their profitability.
This is a generally hard problem. You may find this old post of mine helpful: https://www.neelnanda.io/blog/44-agency
Reid Hoffman is not on the Anthropic board. You're likely confusing him with Reed Hastings.
Anthropic Board of Directors: Dario Amodei, Daniela Amodei, Yasmin Razavi, Jay Kreps, and Reed Hastings.
LTBT Trustees: Neil Buddy Shah, Kanika Bahl, Zach Robinson, and Richard Fontaine.
https://www.anthropic.com/company
(This isn't fully up to date, but I'm not aware of Reid Hoffman being on the board at any point)
Interesting. Do you have stats on the rate of growth in the number of mentors meeting your bar (ignoring capacity constraints, i.e. people you think would be good mentors)? I'm surprised the rate of growth there is higher, and I'm not sure whether this is MATS becoming higher profile and drawing in more existing mentors, more people applying who are not suitable to be mentors, or AI safety actually making progress on the mentorship bottleneck.
Can you elaborate on why you disagree?
I imagine they did them on smaller models, plausibly on less total data, which is expensive but not exorbitant
Do you mean that the only way to meaningfully answer this would require access to non-public data?
That, unfortunately. Frontier labs rarely, if ever, share research that helps improve the capabilities of frontier models (this varies between labs, of course, and many are still good about publishing commercially useful safety work).
This analysis is confounded by the fact that GDM has a lot more non-Gemini stuff (e.g. the science work) than the other labs. None of the labs publish most of their LLM capabilities work, but publishing science stuff is fine, so DeepMind having more other stuff means comparatively more non-safety work gets published.
I generally think you can't really answer this question with the data sources you're using, because IMO the key question is what fraction of the frontier-LLM-oriented work is on safety, and little of that is published.
I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even when there seems to be a really good reason why breaking them would be locally beneficial, because on average, a decision theory that's willing to do harmful things for complex reasons performs badly.
The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions for things like having someone's consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.
Given that exceptions exist for times when we believe the general policy is bad, I am arguing that there should be an additional exception: if there is a realistic chance that a bad outcome happens anyway, and you believe you can reduce the probability of that bad outcome (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions whose side effects increase the probability of the bad outcome in other ways.
When analysing the reasons I broadly buy the deontological framework for "don't commit murder", I think there are some clear lines in the sand, such as maintaining a valuable social contract, and the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where things break down with standard deontological reasoning is that this is just very far outside the context where such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high-stakes situation where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
Regarding your examples, I'm completely ethically comfortable with someone making a third political party in a country whose population contains two groups who each strongly want to commit genocide against the other. I think there are many ways that such a third political party could reduce the probability of genocide, even if it ultimately comprises a political base that wants negative outcomes.
Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I'm strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what's disanalogous?
COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.
You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.
Again, I'm purely defending the abstract point that "plans that could result in an increased probability of human extinction, even ones that involve building the doomsday machine yourself, are not automatically ethically forbidden". You're welcome to critique Anthropic's actual actions as much as you like. But you seem to be making a much more general claim.