How do you actually sidestep the need for the One True Objective Function given an ELK solution? I get that it might seem plausible to take a rough objective like "do what I intend" and look at the internal knowledge of the thing for signs that it is deliberately deceiving you. If you do that, you'll get, at best, an AI that doesn't know that it is deceiving you (for whatever operationalization of "know" you come up with as you use ELK for training). But it could still be deceiving you, and very likely will be if the optimization pressure is merely towards "AIs that don't know that they are being deceptive".
Our very broad hope is to use ELK to select actions that (i) keep humans safe and give them time and space to evolve according to their current (essentially local) preferences, and (ii) are expected to produce outcomes that would be judged favorably by the future humans, primarily by maximizing option value until it becomes clear what those future humans want (see the strategy stealing assumption).
This is discussed very briefly in this appendix of the ELK report and the subsequent appendix. There are two or three big foreseeable difficulties with this approach and likely a bunch of other problems.
I don't think this should be particularly persuasive, but it hopefully illustrates how ARC is currently thinking about this part of the problem. Overall my current view is that this is fairly unlikely to be the weakest link in the plan, i.e. if it doesn't work it will be because of a failure at an earlier step, and so it's not one of the main things I'm thinking about.
Reiterating a point above: observe how this whole scheme has basically assumed that capabilities won't start to generalize relevantly out of distribution. My model says that they eventually will, and that this is precisely when things start to get scary, and that one of the big hard bits of alignment is that once that starts happening, the capabilities generalize further than the alignment. A problem that has been simply assumed away in this agenda, as far as I can tell, before we even dive into the details of this framework.
My reply last time is still relevant: link.
For what it's worth I found this writeup informative and clear. So lowering your standards still produced something useful (at least to me).
... THEN the Paulian family of plans don't provide much hope.
My understanding is that Ryan was tentatively on board with this conditional statement, but Paul was not.
I forget the extent to which I communicated (or even thought) this in the past, but at the moment, the current claim I'd agree with is: "this specific plan is much less likely to work".
My best guess is that even if I were quite confident in those conditions being true, work on various subparts of this plan seems like quite a good bet.
Context: This post is my attempt to make sense of Ryan Greenblatt's research agenda, as of April 2022. I understand Ryan to be heavily inspired by Paul Christiano, and Paul left some comments on early versions of these notes.
Two separate things I was hoping to do, that I would have liked to factor into two separate writings, were (1) translating the parts of the agenda that I understand into a format that is comprehensible to me, and (2) distilling out conditional statements we might all agree on (some of us by rejecting the assumptions, others by accepting the conclusions). However, I never got around to that, and this has languished in my drafts folder too long, so I'm lowering my standards and putting it out there.
The process that generated this document is that Ryan and I bickered for a while, then I wrote up what I understood and shared it with Ryan, and we repeated this process a few times. I've omitted various intermediate drafts, on the grounds that sharing a bunch of intermediate positions that nobody endorses is confusing (more so than seeing more of the process is enlightening), and on the grounds that if I try to do something better, then what happens instead is that the post languishes in the drafts folder for half a year.
(Thanks to Ryan, Paul, and a variety of others for the conversations.)
Nate's model towards the end of the conversation
Ryan's plan, as Nate currently understands it:
Nate's response:
An attempt at conditional agreement
I suggested the following:
If it is the case that:
... THEN the Paulian family of plans don't provide much hope.
My understanding is that Ryan was tentatively on board with this conditional statement, but Paul was not.
Postscript
Reiterating a point above: observe how this whole scheme has basically assumed that capabilities won't start to generalize relevantly out of distribution. My model says that they eventually will, and that this is precisely when things start to get scary, and that one of the big hard bits of alignment is that once that starts happening, the capabilities generalize further than the alignment. A problem that has been simply assumed away in this agenda, as far as I can tell, before we even dive into the details of this framework.
To be clear, I'm not saying that this decomposition of the problem fails to capture difficult alignment problems. The "prevent the AGI from figuring out it's in deployment" problem is quite difficult! As is the "get an ELK head that can withstand superintelligent adversaries" problem. I think these are the wrong problems to be attacking, in part on account of their difficulty. (Where, to be clear, I expect that toy versions of these problems are soluble, just not with solutions rated for the type of opposition it sounds like the rest of this plan requires.)