Emmett Shear asked on Twitter:
I think SB 1047 has gotten much better from where it started. It no longer appears actively bad. But can someone who is pro-SB 1047 explain the specific chain of causal events where they think this bill becoming law results in an actual safer world? What’s the theory?
And I realized that AFAICT no one has concisely written up what the actual story for SB 1047 is supposed to be.
This is my current understanding. Other folk here may have more detailed thoughts or disagreements.
The bill isn't sufficient on its own, but it's not regulation for regulation's sake, because it's specifically a piece of the regulatory machine I'd ultimately want built.
Right now, it mostly solidifies the safety processes that existing orgs have voluntarily committed to. But, we are pretty lucky that they voluntarily committed to them, and we don't have any guarantee that they'll stick with them in the future.
For the bill to succeed, we do need to invent good, third party auditing processes that are not just a bureaucratic sham. This is an important, big scientific problem that isn't solved yet, and it's going to be a big political problem to make sure that the ones that become consensus are good instead of regulatory-captured. But, figuring that out is one of the major goals of the AI safety community right now.
The "Evals Plan" as I understand it comes in two phase:
1. Dangerous Capability Evals. We invent evals that demonstrate a model is capable of dangerous things (including manipulation/scheming/deception-y things, and "invent bioweapons" type things)
As I understand it, this is pretty tractable, although labor intensive and "difficult" in a normal, boring way.
2. Robust Safety Evals. We invent evals that demonstrate that a model capable of scheming is nonetheless safe – either because we've proven what sorts of actions it will choose to take (AI Alignment), or because we've proven that we can control it even if it is scheming (AI control). AI control is probably easier at first, although limited.
As I understand it, this is very hard; we're working on it, but it requires new breakthroughs.
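To make the two phases concrete, here's a minimal toy sketch of how the gating logic is supposed to fit together. Everything in it is hypothetical – the function names, the threshold, and especially the assumption that a robust safety eval exists at all (inventing one is the hard, unsolved part):

```python
# Toy sketch of the two-phase eval gating described above. All names and
# thresholds are hypothetical placeholders, not anything specified by SB 1047.

DANGER_THRESHOLD = 0.5  # assumed cutoff, for illustration only


def dangerous_capability_eval(model) -> float:
    """Phase 1: score the model on dangerous-capability tasks
    (deception/scheming, self-propagation, bioweapon uplift, etc.).
    Stubbed out here; in reality this is labor-intensive but tractable."""
    return 0.0  # placeholder score


def robust_safety_eval(model) -> bool:
    """Phase 2: show the model is safe *despite* dangerous capabilities,
    via alignment guarantees or AI-control measures.
    Stubbed out here; in reality this requires new breakthroughs."""
    return False  # placeholder verdict


def may_proceed(model) -> bool:
    """Decide whether development/deployment continues unobstructed."""
    if dangerous_capability_eval(model) < DANGER_THRESHOLD:
        # The model isn't capable of anything dangerous; no further gate.
        return True
    # Dangerous capabilities detected: the developer now has to pass the
    # (much harder) robust safety eval before proceeding.
    return robust_safety_eval(model)
```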
The goal with SB 1047 as I understand it is roughly:
First: Capability Evals trigger
By the time it triggers for the first time, we have a set of evals that are good enough to confirm "okay, this model isn't actually capable of being dangerous" (and the AI developers probably continue unobstructed).
But, when we first hit a model capable of deception, self-propagation, or bioweapon development, the eval will trigger: "yep, this is dangerous." And then the government will ask, "okay, how do you know it's not dangerous?"
And the company will put forth some plan, or internal evaluation procedure, that (probably) sucks. And the Frontier Model Board will say "hey Attorney General, this plan sucks, here's why."
Now, under the original version of SB 1047, the Attorney General could then say "okay, yeah, your plan doesn't make sense, you don't get to build your model." The newer version of the bill, I think, basically requires additional political work at this phase.
But, the goal of this phase is to establish "hey, we have dangerous AI, and we don't yet have the ability to reasonably demonstrate we can render it non-dangerous", and to stop AI development until companies figure out plans that at _least_ make enough sense to government officials.
Second: Advanced Evals are invented, and get woven into law
The way I expect a company to prove their AI is safe, despite having dangerous capabilities, is for third parties to invent a robust version of the second set of evals, and then for new AIs to pass those evals.
This requires both scientific and political labor, and the hope is that by the time we've triggered the "dangerous" eval, the government is paying more explicit attention, which makes it easier to have a conversation about what the long-term plan is.
SB 1047 is the specific tripwire by which the government will be forced to pay more attention at an important time.
My vague understanding atm is that Biden issued some similar-ish executive orders, but that there's a decent chance Trump reverses them.
So SB 1047 may be the only safeguard we have for ensuring this conversation happens at the government level at the right time, even if future companies are even less safe-seeming than the current leading labs, or the current leading labs shortchange their current (relatively weak) pseudo-commitments.
Curious if anyone has different takes or more detailed knowledge.
See this Richard Ngo post on what makes a good eval, which I found helpful.
It seems to me like the strongest case for SB1047 is that it's a transparency bill. As Zvi noted, it's probably good for governments and for the world to be able to examine the Safety and Security Protocols of frontier AI companies.
But there are also some pretty important limitations. I think a lot of the bill's value (assuming it passes) will be determined by how it's implemented and whether or not there are folks in government who are able to put pressure on labs to be specific/concrete in their SSPs.
More thoughts below:
Transparency as an emergency preparedness technique
I often think in an emergency preparedness frame– if there were a time-sensitive threat, how would governments be able to detect it & make sure information about it was triaged/handled appropriately? It seems like governments are more likely to notice time-sensitive threats in a world where there's more transparency, and forcing frontier AI companies to write/publish SSPs seems good from that angle.
In my model, a lot of risk comes from the government taking too long to react– either so long that an existential catastrophe actually occurs, or so long that by the time major intervention occurs, ASL-4+ models have been developed with poor security, and now it's ~impossible to do anything except continue to race ("otherwise the other people with ASL-4+ models will cause a catastrophe"). Efforts to get the government to understand the state of risks and intervene before ASL-4+ models are developed seem very important from that perspective. It seems to me like SSPs could accomplish this by (a) giving the government useful information and (b) making it "someone's job" to evaluate the state of SSPs + frontier AI risks.
Limitation: Companies can write long and nice-sounding documents that avoid specificity and concreteness
The most notable limitation, IMO, is that it's generally pretty easy for powerful companies to evade being fully transparent. Sometimes, people champion things like RSPs or the Seoul Commitments as these major breakthroughs in transparency. Although I do see these as steps in the right direction, their value should not be overstated. For example, even the "best" RSPs (OpenAI's and Anthropic's) are rather vague about how decisions will actually be made. Anthropic's RSP essentially says "Company leadership will ultimately determine whether something is too risky and whether the safeguards are adequate" (with the exception of some specifics around security). OpenAI's does a bit better IMO (from a transparency perspective) by spelling out the kinds of capabilities that they would consider risky, but they still provide company leadership ~infinite freedom RE determining whether or not safeguards are adequate.
Incentives for transparency are relatively weak, and the costs of transparency can be high. In Sam Bowman's recent post, he mentions that detailed commitments (and we can extend this to detailed SSPs) can commit companies to "needlessly costly busy work." A separate but related frame is that race dynamics mean that companies can't afford to make detailed commitments. If I'm in charge of an AI company, I'd generally like to have some freedom/flexibility/wiggle room in how I make decisions, interpret evidence, conduct evaluations, decide whether or not to keep scaling, and make judgments around safety and security.
In other words, we should expect that at least some (maybe all) of the frontier AI companies will try to write SSPs that sound really nice but provide minimal concrete details. The incentives to be concrete/specific are not strong, and we already have some evidence of this from existing RSPs/PFs (and note again that I think the other companies were even less detailed and concrete in their documents).
Potential solutions: Government capacity & whistleblower mechanisms
So what do we do about this? Are there ways to make SSPs actually promote transparency? If the government is able to tell that some companies are being vague/misleading in their SSPs, this could inspire further investigations/inquiries. We've already seen several Congresspeople send letters to frontier AI companies requesting more details about security procedures, whistleblower protections, and other safety/security topics.
So I think there are two things that can help: government capacity and whistleblower mechanisms.
Government capacity. The Frontier Model Division (FMD) was cut from the bill, but perhaps the Board of Frontier Models could provide this oversight. At the very least, the Board could provide an audience for the work of people like @Zach Stein-Perlman and @Zvi– people who might actually read through a complicated 50+ page SSP full of corporate niceties and be able to distill what's really going on, what's missing, what's misleading, etc.
Whistleblower mechanisms. SB1047 provides a whistleblower mechanism & whistleblower protections (note: I see these as separate things and I personally think mechanisms are more important). Every frontier AI company has to have a platform through which employees (and contractors, I think?) are able to report if they believe the company is being misleading in its SSPs. This seems like a great accountability tool (though of course it relies on the whistleblower mechanism being implemented properly & relies on some degree of government capacity RE knowing how to interpret whistleblower reports.)
The final thing I'll note is that I think the idea of full shutdown protocols is quite valuable. From an emergency preparedness standpoint, it seems quite good for governments to be asking "under what circumstances do you think a full shutdown is required" and "how would we actually execute/verify a full shutdown."