All of Krieger's Comments + Replies

AGI might not literally search over all possible policies, but just employ some heuristics to get a good approximation of the best policy. But then this is a capabilities shortcoming, not misalignment.

...

Coming back to our scenario, if our model just finds an approximate best policy, it would seem very unlikely that this policy consistently brings about some misaligned goal

In my model this isn't a capabilities failure, because there are demons in imperfect search; what you would get out of a heuristic-search-to-approximate-the-best-policy wouldn't only be ... (read more)

1Martín Soto
Update: Vanessa addressed this concern.
2Martín Soto
I think you're right, and I wasn't taking this into account, and I don't know how Vanessa would respond to this. Her usual stance is that we might expect all mesa-optimizers to be acausal attackers (that is, simulations / false hypotheses), since in this architecture the only way to determine actions is by determining hypotheses (and in fact, she now believes these acausal attackers can all be dealt with in one fell swoop in light of a single theoretical development). But that would seem to ignore the other complex processes going on to update these hypotheses from one time step to the next (as if the updates happened magically and instantaneously, without any further subcomputations).

And we don't even need to employ possibly imperfect heuristics for these demons to appear: I think they would also appear even if we (in the ideal, infinite-compute scenario) brute-forced the search over all possible hypothesis updates and assessed each one on some metric. In a sense the two appearances of demons are equivalent, but in the latter limit they are more clearly encoded in certain hypotheses (those that game the assessment of hypotheses), while in the former their relationship to hypotheses will be less straightforward, since there will be non-trivial "hypothesis-updating" code inside the AI which is not literally equivalent to the chosen hypothesis (and so parts of this code which aren't the final chosen hypothesis could also be part of a demon).

I'm not 100% sure the existence of these demons already implies inner misalignment, since these demons will only be optimized for their continued existence (and this might be achieved by some strategy that, by sheer luck, doesn't disrupt the AI's outer performance too much, or at most makes the hypothesis search a bit less efficient). But I think this is just what always happens with mesa-optimizers, and the worry for inner alignment is that any one of these mesa-optimizers can be arbitrarily disruptive to outer performance.

After having chosen a utility function to maximize, how would the AGI maximize it? I'm thinking that the search/planning process for finding good policies naturally introduces mesa-optimizers, regardless of everything that came before in PreDCA (detecting precursors and extrapolating their utility function).

1Martín Soto
Once the AGI has some utility function and hypothesis (or hypotheses) about the world, it just employs counterfactuals to decide which is the best policy (set of actions). That is, it performs some standard and obvious procedure like "search over all possible policies, and for each compute how much utility exists in the world if you were to perform that policy". Of course, this procedure will always yield the same actions given a utility function and hypotheses; that's why I said:

That said, you might still worry that due to finite computing power our AGI might not literally search over all possible policies, but just employ some heuristics to get a good approximation of the best policy. But then this is a capabilities shortcoming, not misalignment.

And as I mentioned in another comment:

Coming back to our scenario, if our model just finds an approximate best policy, it would seem very unlikely that this policy consistently brings about some misaligned goal (which is not the AGI's goal) like "killing humans", instead of just being the best policy with some random noise in all directions.
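As a minimal sketch, assuming infinite compute and using hypothetical names of my own (this only illustrates the idealized procedure described above, not Vanessa Kosoy's actual formalism), the "standard and obvious procedure" amounts to an argmax over policies:

```python
# Toy sketch of "search over all possible policies, and for each compute how much
# utility exists in the world if you were to perform that policy".
# Assumptions (hypothetical, for illustration only):
#   - `hypotheses` is a list of (probability, world_model) pairs,
#   - `world_model.counterfactual(policy)` returns the world that results from
#     carrying out that policy under that hypothesis.

def best_policy(policies, hypotheses, utility):
    """Return the policy with the highest expected utility over the hypotheses."""
    def expected_utility(policy):
        return sum(p * utility(wm.counterfactual(policy)) for p, wm in hypotheses)
    return max(policies, key=expected_utility)
```

Given a fixed utility function and fixed hypotheses this argmax is deterministic, which is the point above: under bounded compute, deviating from it is an approximation error rather than a change of goal.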

It seems like the AI risk mitigation solutions you've listed aren't mutually exclusive; rather, we'll likely have to use a combination of them to succeed. While I agree that it would be ideal for us to end up with a FAS, the pathway towards that outcome would likely involve "sponge coordination" and "pivotal acts" as mechanisms by which our civilization can buy some time before FAS.

A possible scenario in a world where FAS takes some time (chronological):

  1. "sponge coordination" with the help of various AI policy work
  2. one cause-aligned lab executes a "pivotal act" w
... (read more)

It seems like the exact model which the AI will adopt is kinda confounding my picture when I'm trying to imagine what an "existentially secure" world looks like. I'm currently thinking there are two possible existentially secure worlds:

The obvious one is where all human dependence is removed from setting/modifying the AI's value system (like CEV, fully value-aligned)—this would look much more unipolar.

The alternative is for the well-intentioned-and-coordinated group to use a corrigible AI that is aligned with its human instructor. To me, whether this scenario lo... (read more)

Is it even possible for a non-pivotal act to ever achieve existential security? Even if we maxed out AI lab communication and had awesome interpretability, that doesn't help in the long run, given that the minimum amount of resources required to build a misaligned AGI will probably keep dropping.

1Ivan Vendrov
Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors control 90% of AI-relevant compute, then it seems plausible that they could defend against the 10% of compute controlled by misaligned AGI or other bad actors - by denying them resources, by hardening core infrastructure, via MAD, etc.

Thanks, I found your post very helpful, and I think this community would benefit from more posts like it.

I agree that we would need a clear categorization. Ideally, it would provide us a way to explicitly quantify and make legible the claims of various proposals, e.g. "my proposal, under these assumptions about the world, may give us X years of time, change the world in these ways, and interact with proposals A, B, and C in these ways."

The lack of such a categorization is perhaps one of the reasons why I feel the pivotal act framing is still necessary. It seems to me that... (read more)

1Ivan Vendrov
I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:

  1. Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if not millions of people, distributed over decades or even centuries.
  2. Pivotal acts have stronger claims of impact, but generally have weaker claims about the sign of that impact - actually realistic pivotal-seeming acts like "unilaterally deploy a friendly-seeming AI singleton" or "institute a stable global totalitarianism" are extremely, existentially dangerous. If someone identifies a pivotal-seeming act that is actually robustly positive, I'll be the first to sign on.
  3. In contrast, gradual steering proposals like "improve AI lab communication" or "improve interpretability" have weaker claims to impact, but stronger claims to being net positive across many possible worlds, and are much less subject to multi-agent problems like races and the unilateralist's curse.
  4. True, complete existential safety probably requires some measure of "solving politics" and locking in current human values, hence may not be desirable. Like what if the Long Reflection decides that the negative utilitarians are right and the world should in fact be destroyed? I won't put high credence on that, but there is some level of accidental existential risk that we should be willing to accept in order to not lock in our values.

well crap, that was fast. does anyone know what karma threshold the button was pressed at?

2dirk
i found out exactly three hours in, so most likely 2,100 but possibly 2,000. Edit: per habryka's comment, the relevant threshold was apparently zero.