Zach Stein-Perlman

AI strategy & governance. ailabwatch.org.

Sequences

Slowing AI

Comments

I think I was thinking:

  1. The war room transcripts will leak publicly
  2. Generals can secretly DM each other, while keeping up appearances in the shared channels
  3. If a general believes that all of their communication with their team will leak, we're back in a unilateralist's-curse situation: if a general thinks they should nuke, obviously they shouldn't say that to their team, so maybe they nuke unilaterally
    1. (Not obvious whether this is an infohazard)
  4. [Probably some true arguments about the payoff matrix and game theory increase P(mutual destruction). Also some false arguments about game theory — but maybe an infohazard warning makes those less likely to be posted too.]

(Also, after I became a general I observed that I didn't know what my "launch code" was; I was hoping the LW team forgot to give everyone launch codes and that this decreased P(nukes); saying this would cause everyone to know their launch codes and maybe scare the other team.)

I don't think this is very relevant to real-world infohazards, because this is a game with explicit rules and because in the real world the low-hanging infohazards have been shared, but it seems relevant to mechanism design.

Update with two new responses:

I think this is 10 generals, 1 Petrov, and one other person (either the other Petrov or a citizen; I'm not sure, I wasn't super rigorous).

The post says generals' names will be published tomorrow.

No. I noticed ~2 more subtle infohazards, was wishing for nobody to post them, and realized I could decrease that probability by posting an infohazard warning.

If you notice subtle infohazards, I ask that you refrain from being the reason that security-by-obscurity fails.

Some true observations are infohazards, making destruction more likely. Please think carefully before posting observations, even if you feel clever. If you need to, you can post a hash here instead and reveal later how clever you were (see the sketch below).

LOOSE LIPS SINK SHIPS
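
(For the hash-commitment suggestion above, a minimal sketch in Python of how one might commit now and reveal later. The `commit`/`verify` helpers and the salting scheme are my own illustration, not any LW feature; any standard cryptographic hash works.)

```python
# Minimal hash-commitment sketch (illustrative only).
# Post the digest now; reveal the observation and salt later to prove
# you had the observation at commit time.
import hashlib
import secrets

def commit(observation: str) -> tuple[str, str]:
    """Return (digest, salt). Post the digest publicly; keep the rest private."""
    salt = secrets.token_hex(16)  # random salt so short observations can't be brute-forced
    digest = hashlib.sha256((salt + observation).encode()).hexdigest()
    return digest, salt

def verify(observation: str, salt: str, digest: str) -> bool:
    """At reveal time, check the observation and salt against the posted digest."""
    return hashlib.sha256((salt + observation).encode()).hexdigest() == digest

digest, salt = commit("my clever observation about the game")
print(digest)  # post this now
assert verify("my clever observation about the game", salt, digest)  # later reveal checks out
```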

I think it’s better to be angry at the team that launched the nukes?

No. But I'm skeptical: it seems hard to imagine provable safety at all, much less provable safety that's competitive with the default path to powerful AI, much less how post-hoc evals are relevant.

I ~always want to see the outline when I first open a post and when I'm reading/skimming through it. I wish the outline appeared for me even when I'm not hovering over it.

Figuring out whether an RSP is good is hard.[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they're supposed to, whether high-level risk thresholds trigger appropriate responses, and whether those thresholds are successfully operationalized in terms of low-level evals. Getting any one of these things wrong can make your assessment totally wrong. And that's just for fully-fleshed-out RSPs; in reality, the labs haven't operationalized their high-level thresholds and sometimes don't really have responses planned. And different labs' RSPs have different structures/ontologies.

Quantitatively comparing RSPs in a somewhat-legible manner is even harder.

I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we'll reach them before it's too late?—into a single criterion, "Credibility." Most of my concern about labs' RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn't matter how good the rest of your RSP is. (Also, minor: the "Difficulty" indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)

(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it's quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)

(And maybe it's reasonable for labs to not do so much specific-committing-in-advance.)

  1. ^

    Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)
