Overall this is still encouraging. It seems to take seriously that
I feel like there are enough shared assumptions that collaboration or dialogue with AI notkilleveryoneists could be very useful.
That said, I wish there were more details about his Scientist AI idea:
Also it is not clear to me whether the safety is supposed to come from:
This also seems very encouraging to me! In some sense Bengio seems to be where Holden was 10 years ago, and Holden now seems to be a pretty good and sane thinker on AGI risk. I have hope that the arguments that moved Holden will also be compelling to Bengio, so that he too comes to recognize some of the errors I see him making here.
I think it's an example of an AI that completely lacks the notion of in-world goals. Its goal is restricted to a purely symbolic system; that system happens to map to parts of the world, but the AI lacks the self-reflection to realise which symbols map to itself and its immediate environment, and how manipulating those symbols could make it better at accomplishing its goals. Severing that feedback loop is IMO the key to avoiding instrumental convergence. Without it, all you get is another variety of chess-playing AI: superhumanly good at its own task, but the world in which that task is defined is too abstract for its skill to be "portable" to a dangerous domain.
There's not much context to this claim by Yoshua Bengio, but while searching Google News I found an article* in a Spanish online newspaper in which he says:
We need to create machines that assist us, not independent beings. That would not be a good idea; it would lead us down a very dangerous path.
*https://www.larazon.es/sociedad/20221121/5jbb65kocvgkto5hssftdqe7uy.html
While encouraging for dialogue, I don't think he has a very good model of how his attempts to create scientist AI affect the ways an executive AI could attack us. Even setting aside mesa-optimization (which his approaches are not at all safer from), an executive AI doesn't need to be very strong at first escape to kill us all; it only needs to know how to use a scientist AI to finish the job, and any science results we get from human-steered scientist AI make that easier still, since the executive AI wouldn't even have to run the experiments. Banning executive AI will be incredibly hard and would likely permanently curtail human freedom to compute. That makes such bans likely to be highly ideologically laden one way or another, unless normally-conflicting non-centrist ideologies[1] can maintain classical-liberal "value compromise" centrism long enough to guarantee that the constraints preventing AI executives still allow everyday people to keep their general freedom of non-AI computation, ideally even including doing research with AI scientists. Authoritarian versions of ideologies are going to want to control this outcome, and that is worrying, as both progressive and regressive ideologies in the US have been trending authoritarian.
On the other hand, I do think that successfully banning dangerous AI in a way compatible with all ideologies is the most promising practical approach anyone has proposed, aside from formal goal alignment and low-impact-bounded informal goal alignment, both of which still have major open research problems.
The issue I still see is: how do you recognize an AI executive that is trying to disguise itself?
[1] A sketch of non-centrist ideologies might be: authoritarian progressive/center/regressive and libertarian progressive/center/regressive, plus another dimension of community-first/multiscale/individual-first that can apply to each of the others. If you want me to hunt down citations for why I think this is a good map of ideologies, I can do so; please comment. I consider myself libertarian progressive multiscale.
The issue I still see is: how do you recognize an AI executive that is trying to disguise itself?
It can't disguise itself without researching disguising methods first. The question is whether interpretability tools will be up to the task of catching it.
It will not work for catching an AI executive originating outside of a controlled environment (unless it queries an AI scientist). But given that such attempts will originate from uncoordinated, relatively computationally underpowered sources, it may be possible to preemptively enumerate the disguising techniques such an AI executive could come up with. If there are undetectable varieties..., well, it's mostly game over.
I find this really encouraging. It pretty clearly says that he takes AI X-risk seriously. In combination with Hinton recently going on record as believing AGI is an X-risk, I'd say this may mark a sea change, with the most respected researchers essentially taking up the AI notkilleveryoneism banner.
For reference, I replied to Bengio's post in a separate post: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai. TL;DR: pretty much the same points that other commenters on this post are making, just in a more elaborate form.
I generally agree that this sort of AI (non-agentic, oracular tools) should be as far as we go along that road, and they pretty much destroy all the arguments for why we should rush to AGI, since they provide most of the same benefits at a fraction of the cost. However, the crucial point remains that once you have these, the step to agentic AGI seems tiny; possibly as easy as rigging up an AutoGPT-like system that uses these Scientist AIs as one of its core components (see the sketch below).
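To make that "tiny step" concrete, here is a purely hypothetical sketch, not anything Bengio proposes, of how little glue code it would take to wrap a non-agentic oracle in an AutoGPT-style loop; `scientist_ai` and `execute` are invented placeholders, not real APIs:

```python
# Hypothetical sketch: turning an oracle-style "Scientist AI" into an agent.
# Both helper functions are assumptions for illustration only.

def scientist_ai(question: str) -> str:
    """Stand-in for a non-agentic oracle model (question in, answer out)."""
    raise NotImplementedError  # replace with a real model call

def execute(action: str) -> str:
    """Stand-in for an actuator that carries out an action in the world."""
    raise NotImplementedError  # replace with real tooling

def agent_loop(goal: str, max_steps: int = 10) -> list[str]:
    """Wrap the oracle in a plan-act-observe loop, AutoGPT-style."""
    history: list[str] = []
    for _ in range(max_steps):
        # The oracle does all the cognitive work; the wrapper only relays it.
        plan = scientist_ai(
            f"Goal: {goal}\nHistory: {history}\n"
            "What single next action would best advance this goal?"
        )
        observation = execute(plan)  # the only genuinely agentic step
        history.append(f"{plan} -> {observation}")
        done = scientist_ai(
            f"Given {history}, is the goal '{goal}' achieved? Answer yes or no."
        )
        if done.strip().lower() == "yes":
            break
    return history
```

The point of the sketch is that all the cognitive capability would live in the oracle; the agentic wrapper is a few lines of scheduling code, which is why the boundary between the two seems so hard to enforce.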
The comparison with human cloning seems apt: it is a technology we understand well enough that we surely could do it, but so far we have mostly succeeded in not doing it out of ethical concerns. That said, human cloning is both more instinctively repugnant and less economically useful than building AGI, so the incentives at play are very different. It probably would be much safer not to have the temptation at all.
The main advantage of Tool-AIs is that they can be used to solve alignment for more agentic approaches. You don't need to prevent people from building agentic AI for all time, just during the interim period when we have Tool AI but don't yet have alignment.
Well, that's assuming there is something akin to a solution for alignment. I think that's feasible on the technical side, but I highly doubt it on the social/political one. I think most or all aligned AIs would just be aligned with someone in particular.
There is also some commentary here: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai
Potentially relevant: Yoshua Bengio got funding from OpenPhil in 2017:
Yoshua Bengio wrote a blogpost yesterday in which he argues for developing "scientist AI", which seems very similar in structure to historical Tool-AI proposals.
For the (IMO) best response to this kind of proposal, see Gwern's "Why Tool AIs Want to Be Agent AIs".
Below I have copied the blogpost in full, since all of it seems pretty relevant.