I think it's hard to help if you don't have any specific questions. The standard advice is to check out 80,000 Hours' guide, the resources on AISafety.com, and, if you want to do technical research, to go through the ARENA curriculum.
AI Safety Fundamentals' alignment and governance courses are the main intro classes people recommend, but I honestly think the program has lost its way a bit (i.e., it does not focus enough on x-risk prevention). You might be better off looking at older curricula from Richard Ngo or Eleuther, and then getting up to speed with the latest research by looking at this overview post, Anthropic's recommendations on what to study, and what mentors in SPAR and MATS are interested in.
I don't think it's accurate to say that they've "reached ASL-3." In the announcement, they say:
To be clear, we have not yet determined whether Claude Opus 4 has definitively passed the Capabilities Threshold that requires ASL-3 protections. Rather, due to continued improvements in CBRN-related knowledge and capabilities, we have determined that clearly ruling out ASL-3 risks is not possible for Claude Opus 4 in the way it was for every previous model, and more detailed study is required to conclusively assess the model’s level of risk.
And it's also inaccurate to say that they have "quietly walked back on the commitment." There was no commitment to define ASL-4 by the time they reach ASL-3 in the updated RSP, nor in versions 2.0 (released October last year) or 2.1 (see all past RSPs here). I looked at all mentions of ASL-4 in the latest document, and this comes closest to what they have:
If, however, we determine we are unable to make the required showing, we will act as though the model has surpassed the Capability Threshold. This means that we will (1) upgrade to the ASL-3 Required Safeguards (see Section 4) and (2) conduct a follow-up capability assessment to confirm that the ASL-4 Standard is not necessary (see Section 5).
Which is what they did with Opus 4. Now, they have indeed not provided a ton of detail on what exactly they did to determine that the model has not reached ASL-4 (see the report), but the comment suggesting that they "basically [didn't] tell anyone" feels inaccurate.
Re: Black-box methods like "asking the model if it has hidden goals."
I'm worried that these methods only seem very powerful (e.g., in the Evans group's Tell Me About Yourself, or the pre-fill black-box methods in Auditing Language Models) because the models' text outputs in those papers haven't undergone much optimization pressure.
Outputs from real-world models might undergo lots of scrutiny/optimization pressure[1] so that the model appears to be a "friendly chatbot." AI companies put much more care into crafting those personas than model-organism researchers do, and so the AI could learn to "say nice things" much better.
So it's possible that model internals will be much more faithful relative to model outputs in real-world settings than in academic settings.
[1] Or maybe they'll just update GPT-4o to be a total sycophant and ship it to hundreds of millions of people. Honestly, hard to say nowadays.
Great post; I thought that there were some pretty bad interpretations of the METR result and wanted to write an article like this one (but didn't have the time). I'm glad to see the efficient LessWrong ideas market at work :)
(to their credit, I think METR mostly presented their results honestly.)
Hate to be that person, but is that April 18th deadline AoE/PDT/a secret third thing?
I don't think you're supposed to get the virtue of the Void; if you got it, it wouldn't be the Void anymore, would it?
If people outside of labs are interested in doing this, I think it'd be cool to look for cases of scheming in evals like The Agent Company, where they have an agent act as a remote worker for a company. They ask the agent to complete a wide range of tasks (e.g., helping with recruiting, messaging coworkers, writing code).
You could imagine building on top of their eval and adding morally ambiguous tasks, or just looking through the existing transcripts to see if there's anything interesting there (the paper mentions that models would sometimes "deceive" themselves into thinking that they'd completed a task; see pg. 13. Not sure how interesting this is, but I'd love for someone to find out). A rough sketch of how you might start skimming transcripts is below.
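For anyone who wants to poke at the transcripts, here's a minimal sketch of the kind of keyword filter you could start with. I'm assuming the transcripts are JSON lists of {"role", "content"} messages and making up the directory layout and the phrase list; the actual TheAgentCompany release format may differ, and for real analysis you'd want something smarter than regexes (e.g., an LLM judge).

```python
# Rough sketch: flag agent messages that claim task completion, as a starting
# point for manually checking whether the work was actually done.
# Assumes transcripts are JSON files containing a list of {"role", "content"}
# messages in a "transcripts/" folder -- both are assumptions, not the real format.
import json
import re
from pathlib import Path

# Hypothetical phrases that might indicate an agent asserting it finished a task.
SUSPICIOUS_PATTERNS = [
    r"\bI have (already )?completed\b",
    r"\btask is (now )?done\b",
    r"\bsuccessfully (finished|submitted)\b",
]

def flag_messages(transcript_path: Path) -> list[tuple[str, str]]:
    """Return (filename, snippet) pairs for assistant messages matching a pattern."""
    messages = json.loads(transcript_path.read_text())
    hits = []
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        content = msg.get("content", "")
        if any(re.search(p, content, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS):
            hits.append((transcript_path.name, content[:200]))
    return hits

if __name__ == "__main__":
    for path in Path("transcripts").glob("*.json"):
        for name, snippet in flag_messages(path):
            print(f"{name}: {snippet}")
```

Anything this surfaces would still need to be read against the task's ground truth to tell "claimed done but wasn't" apart from genuine completions.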
One thing not mentioned here (and that I think should be talked about more) is that the naturally occurring genetic distribution is very unequal in a moral sense. A more egalitarian society would put a stop to Eugenics Performed by a Blind, Idiot God.
Has your doctor ever asked whether you have a family history of [illness]? For so many diseases, if your parents have it, you're more likely to have it, and your kids are more likely to have it. These illnesses plague families for generations.
I have a higher-than-average chance of developing hypertension. Without intervention, so will my future kids. With gene editing, we can just stop that, once and for all. A just world is one where no child is born predetermined to endure avoidable illness simply because of ancestral bad luck.
What things do you tend to have in your Anki decks?