Emmett Shear asked on Twitter:

I think SB 1047 has gotten much better from where it started. It no longer appears actively bad. But can someone who is pro-SB 1047 explain the specific chain of causal events where they think this bill becoming law results in an actual safer world? What’s the theory?

And I realized that AFAICT no one has concisely written up what the actual story for SB 1047 is supposed to be.

This is my current understanding. Other folk here may have more detailed thoughts or disagreements.

The bill isn't sufficient on its own, but it's not regulation for regulation's sake because it's specifically a piece of the regulatory machine I'd ultimately want built.

Right now, it mostly solidifies the safety processes that existing orgs have voluntarily committed to. But, we are pretty lucky that they voluntarily committed to them, and we don't have any guarantee that they'll stick with them in the future.

For the bill to succeed, we do need to invent good third-party auditing processes that are not just a bureaucratic sham. This is an important, big scientific problem that isn't solved yet, and it's going to be a big political problem to make sure that the ones that become consensus are good instead of regulatory-captured. But figuring that out is one of the major goals of the AI safety community right now.

The "Evals Plan" as I understand it comes in two phases:

1. Dangerous Capability Evals. We invent evals that demonstrate a model is capable of dangerous things (including manipulation/scheming/deception-y things, and "invent bioweapons" type things)

As I understand it, this is pretty tractable, although labor intensive and "difficult" in a normal, boring way.

2. Robust Safety Evals. We invent evals that demonstrate that a model capable of scheming is nonetheless safe – either because we've proven what sort of actions it will choose to take (AI Alignment), or because we've proven that we can control it even if it is scheming (AI control). AI control is probably easier at first, although limited.

As I understand it, this is very hard, and while we're working on it, it requires new breakthroughs.

The goal with SB 1047 as I understand is roughly:

First: Capability Evals trigger

By the time it triggers for the first time, we have a set of evals that are good enough to confirm "okay, this model isn't actually capable of being dangerous" (and probably the AI developers continue unobstructed).

But, when we first hit a model capable of deception, self-propagation or bioweapon development, the eval will trigger "yep, this is dangerous." And then the government will ask "okay, how do you know it's not dangerous?".

And the company will put forth some plan, or internal evaluation procedure, that (probably) sucks. And the Frontier Model Board will say "hey Attorney General, this plan sucks, here's why."

Now, the original version of SB 1047 would include the Attorney General saying "okay yeah your plan doesn't make sense, you don't get to build your model." The newer version of the plan I think basically requires additional political work at this phase.

But, the goal of this phase, is to establish "hey, we have dangerous AI, and we don't yet have the ability to reasonably demonstrate we can render it non-dangerous", and stop development of AI until companies reasonably figure out some plans that at _least_ make enough sense to government officials.

Second: Advanced Evals are invented, and get woven into law

The way I expect a company to prove their AI is safe, despite having dangerous capabilities, is for third parties to invent a robust version of the second set of evals, and then for new AIs to pass those evals.

This requires both scientific and political labor, and the hope is that by the time we've triggered the "dangerous" eval, the government is paying more explicit attention, and it makes it easier to have a conversation about what the long-term plan is.

SB 1047 is the specific tripwire by which the government will be forced to pay more attention at an important time.

My vague understanding atm is that Biden passed some similar-ish executive orders, but that there's a decent chance Trump reverses them.

So SB 1047 may be the only safeguard we have for ensuring this conversation happens at the government level at the right time, even if future companies are even less safe-seeming than the current leading labs, or the current leading labs shortchange their current (relatively weak) pseudo-commitments.

 

Curious if anyone has different takes or more detailed knowledge.

 

See this Richard Ngo post on what makes a good eval, which I found helpful.


8 comments

It seems to me like the strongest case for SB1047 is that it's a transparency bill. As Zvi noted, it's probably good for governments and for the world to be able to examine the Safety and Security Protocols of frontier AI companies.

But there are also some pretty important limitations. I think a lot of the bill's value (assuming it passes) will be determined by how it's implemented and whether or not there are folks in government who are able to put pressure on labs to be specific/concrete in their SSPs. 

More thoughts below:

Transparency as an emergency preparedness technique

I often think in an emergency preparedness frame– if there was a time-sensitive threat, how would governments be able to detect the threat & make sure information about the threat was triaged/handled appropriately? It seems like governments are more likely to notice time-sensitive threats in a world where there's more transparency, and forcing frontier AI companies to write/publish SSPs seems good from that angle. 

In my model, a lot of risk comes from the government taking too long to react– either so long that an existential catastrophe actually occurs or so long that by the time major intervention occurs, ASL-4+ models have been developed with poor security, and now it's ~impossible to do anything except continue to race ("otherwise the other people with ASL-4+ models will cause a catastrophe"). Efforts to get the government to understand the state of risks and intervene before ASL-4+ models seem very important from that perspective. It seems to me like SSPs could accomplish this by (a) giving the government useful information and (b) making it "someone's job" to evaluate the state of SSPs + frontier AI risks.

Limitation: Companies can write long and nice-sounding documents that avoid specificity and concreteness

The most notable limitation, IMO, is that it's generally pretty easy for powerful companies to evade being fully transparent. Sometimes, people champion things like RSPs or the Seoul Commitments as these major breakthroughs in transparency. Although I do see these as steps in the right direction, their value should not be overstated. For example, even the "best" RSPs (OpenAI's and Anthropic's) are rather vague about how decisions will actually be made. Anthropic's RSP essentially says "Company leadership will ultimately determine whether something is too risky and whether the safeguards are adequate" (with the exception of some specifics around security). OpenAI's does a bit better IMO (from a transparency perspective) by spelling out the kinds of capabilities that they would consider risky, but they still provide company leadership ~infinite freedom RE determining whether or not safeguards are adequate.

Incentives for transparency are relatively weak, and the costs of transparency can be high. In Sam Bowman's recent post, he mentions that detailed commitments (and we can extend this to detailed SSPs) can commit companies to "needlessly costly busy work." A separate but related frame is that race dynamics mean that companies can't afford to make detailed commitments. If I'm in charge of an AI company, I'd generally like to have some freedom/flexibility/wiggle room in how I make decisions, interpret evidence, conduct evaluations, decide whether or not to keep scaling, and make judgments around safety and security.

In other words, we should expect that at least some (maybe all) of the frontier AI companies will try to write SSPs that sound really nice but provide minimal concrete details. The incentives to be concrete/specific are not strong, and we already have some evidence from seeing RSPs/PFs (and note again that I think that the other companies were even less detailed and concrete in their documents.)

Potential solutions: Government capacity & whistleblower mechanisms

So what do we do about this? Are there ways to make SSPs actually promote transparency? If the government is able to tell that some companies are being vague/misleading in their SSPs, this could inspire further investigations/inquiries. We've already seen several Congresspeople send letters to frontier AI companies requesting more details about security procedures, whistleblower protections, and other safety/security topics.

So I think there are two things that can help: government capacity and whistleblower mechanisms.  

Government capacity. The Frontier Model Division (FMD) was cut, but perhaps the Board of Frontier Models could provide this oversight. At the very least, the Board could provide an audience for the work of people like @Zach Stein-Perlman and @Zvi– people who might actually read through a complicated 50+ page SSP full of corporate niceties and still be able to distill what's really going on, what's missing, what's misleading, etc.

Whistleblower mechanisms. SB1047 provides a whistleblower mechanism & whistleblower protections (note: I see these as separate things and I personally think mechanisms are more important). Every frontier AI company has to have a platform through which employees (and contractors, I think?) are able to report if they believe the company is being misleading in its SSPs. This seems like a great accountability tool (though of course it relies on the whistleblower mechanism being implemented properly & relies on some degree of government capacity RE knowing how to interpret whistleblower reports.)

The final thing I'll note is that I think the idea of full shutdown protocols is quite valuable. From an emergency preparedness standpoint, it seems quite good for governments to be asking "under what circumstances do you think a full shutdown is required" and "how would we actually execute/verify a full shutdown."

This caught my eye:

But, the goal of this phase, is to establish "hey, we have dangerous AI, and we don't yet have the ability to reasonably demonstrate we can render it non-dangerous", and stop development of AI until companies reasonably figure out some plans that at _least_ make enough sense to government officials.

I think I very strongly expect corruption-by-default in the long run?

Also, since the government of California is a "long run bureaucracy" already I naively expect it to appoint "corrupt by default" people unless this is explicitly prevented in the text of the law somehow.

Like maybe there could be a proportionally representative election (or sortition?) over a mixture of the (1) people who care (artists and luddites and so on) and (2) people who know (ML engineers and CS PhDs and so on) and (3) people who are wise about conflicts (judges and DAs and SEC people and divorce lawyers and so on).

I haven't read the bill in its modern current form. Do you know if it explains a reliable method to make sure that "the actual government officials who make the judgement call" will exist via methods that make it highly likely that they will be honest and prudent about what is actually dangerous when the chips are down and cards turned over, or not?

Also, is there an expiration date?

Like... if California's bureaucracy still (1) is needed and (2) exists... by the time 2048 rolls around (a mere 24 years from now (which is inside the life expectancy of most people, and inside the career planning horizon of everyone smart who is in college right now)) then I would be very very very surprised.

By 2048 I expect (1) California (and maybe humans) to not exist, or else (2) for a pause to have happened and, in that case, a subnational territory isn't the right level for Pause Maintenance Institution to draw authority from, or else (3) I expect doomer premises to be deeply falsified based on future technical work related to "inevitably convergent computational/evolutionary morality" (or some other galaxy brained weirdness).

Either we are dead by then, or wrong about whether superintelligence was even possible, or we managed to globally ban AGI in general, or something.

So it seems like it would be very reasonable to simply say that in 2048 the entire thing has to be disbanded, and a brand new thing started up with all new people, to have some OTHER way to break the "naturally but sadly arising" dynamics of careerist political corruption.

I'm not personally attached to 2048 specifically, but I think some "expiration date" that is farther in the future than 6 years, and also within the lifetime of most of the people participating in the process, would be good.

Will respond in more detail later hopefully, but meanwhile, re:

I haven't read the bill in its modern current form. Do you know if it explains a reliable method to make sure that "the actual government officials who make the judgement call" will exist via methods that make it highly likely that they will be honest and prudent about what is actually dangerous when the chips are down and cards turned over, or not?

I copied over the text of how the Frontier Model Board gets appointed. (Although note that after amendments, the Frontier Model Board no longer has any explicit power, and can only advise the existing GovOps agency, and the attorney general). Not commenting yet on what this means as an answer to your question. 


(c) (1) Commencing January 1, 2026, the Board of Frontier Models shall be composed of nine members, as follows:

  • (A) A member of the open-source community appointed by the Governor and subject to Senate confirmation.
  • (B) A member of the artificial intelligence industry appointed by the Governor and subject to Senate confirmation.
  • (C) An expert in chemical, biological, radiological, or nuclear weapons appointed by the Governor and subject to Senate confirmation.
  • (D) An expert in artificial intelligence safety appointed by the Governor and subject to Senate confirmation.
  • (E) An expert in cybersecurity of critical infrastructure appointed by the Governor and subject to Senate confirmation.
  • (F) Two members who are academics with expertise in artificial intelligence appointed by the Speaker of the Assembly.
  • (G) Two members appointed by the Senate Rules Committee.

(2) A member of the Board of Frontier Models shall meet all of the following criteria:

  • (A) A member shall be free of direct and indirect external influence and shall not seek or take instructions from another.
  • (B) A member shall not take an action or engage in an occupation, whether gainful or not, that is incompatible with the member’s duties.
  • (C) A member shall not, either at the time of the member’s appointment or during the member’s term, have a financial interest in an entity that is subject to regulation by the board.

(3) A member of the board shall serve at the pleasure of the member’s appointing authority but shall serve for no longer than eight consecutive years.

So Newsom would control 4 out of 8 of the votes, until this election occurs?

I wonder what his policies are? :thinking:

(Among the Presidential candidates, I liked RFK's position best. When asked, off the top of his head, he jumps right into extinction risks, totalitarian control of society, and the need for international treaties for AI and bioweapons. I really love how he lumps "bioweapons and AI" as a natural category. It is a natural category.

But RFK dropped out, and even if he hadn't dropped out it was pretty clear that he had no chance of winning because most US voters seem to think being a hilariously awesome weirdo is bad, and it is somehow so bad that "everyone dying because AI killed us" is like... somehow more important than that badness? (Obviously I'm being facetious. US voters don't seem to think. They scrupulously avoid seeming that way because only weirdos "seem to think".))

I'm guessing the expiration date on the law isn't in there at all, because cynicism predicts that nothing like it would be in there, because that's not how large corrupt bureaucracies work.

(/me wonders aloud if she should stop calling large old bureaucracies corrupt-by-default in order to start sucking up to Newsom as part of a larger scheme to get onto that board somehow... but prolly not, right? I think my comparative advantage is probably "being performatively autistic in public" which is usually incompatible with acquiring or wielding democratic political power.)

Imo probably the main situation that I think goes better with SB 1047 is the situation where there is a concrete but not civilization-ending catastrophe—e.g. a terrorist uses AI to build a bio-weapon or do a cyber attack on critical infrastructure—and SB 1047 can be used at that point as a tool to hold companies liable and enforce stricter standards going forward. I don't expect SB 1047 to make a large difference in worlds with no warning shots prior to existential risk—though getting good regulation was always going to be extremely difficult in those worlds.

I agree warning shots generally make governance easier, but, I think SB 1047 somewhat differentially helps more in worlds without warning shots (or with weaksauce ones)? 

Like, with a serious warning shot I expect it to be much easier to get regulation passed even if there wasn't already SB 1047, and SB 1047 creates more surface area for regulatory agencies to exist and notice problems before they happen.

My personal view on how it might help:

 

  1. Meta will probably carry on being as negligent as ever, even with SB 1047.
  2. When/if the first mass casualty incident happens, SB 1047 makes it easier for Meta to be successfully sued.
  3. After that, AI companies become more careful.

E.g., after the mass casualty incident...

"You told the government that you had a shutdown procedure, but you didn't, and hundreds of people died because you knowingly lied to the government."