I like the thrust of this paper, but I feel that it overstates how robust the safety properties will be, by drawing an overly sharp distinction between agentic and non-agentic systems and by not really engaging with the strongest counterexamples.
To give some examples from the text:
"A chess-playing AI, for instance, is goal-directed because it prefers winning to losing. A classifier trained with log likelihood is not goal-directed, as that learning objective is a natural consequence of making observations."
But I could easily train an AI which simply classifies chess moves by quality. What takes it from that to being an agent is just the fact that its outputs are labelled as 'moves' rather than as 'classifications', not any feature of the model itself. More generally, even an LM can be viewed as "merely" predicting next tokens -- the fact that there is some perspective from which a system is non-agentic does not actually tell us very much.
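To make that concrete, here is a minimal sketch (my own illustration, not something from the paper; score_moves is a hypothetical stand-in for a trained move-quality classifier). The same scoring model can be exposed as a classifier or as a move-picker, and nothing about the model changes between the two uses:

from typing import Dict, List

def score_moves(position: str, legal_moves: List[str]) -> Dict[str, float]:
    # Stand-in for a model trained purely to classify moves by quality.
    # (Placeholder scores; imagine they come from the trained classifier.)
    return {move: 1.0 / (i + 1) for i, move in enumerate(legal_moves)}

def classify(position: str, legal_moves: List[str]) -> Dict[str, float]:
    # "Non-agentic" use: report a quality score for each legal move.
    return score_moves(position, legal_moves)

def play(position: str, legal_moves: List[str]) -> str:
    # "Agentic" use: the same scores, with the argmax relabelled as the move to play.
    scores = score_moves(position, legal_moves)
    return max(scores, key=scores.get)

moves = ["e2e4", "d2d4", "g1f3"]
print(classify("startpos", moves))  # a classification
print(play("startpos", moves))      # a "move" -- the model itself is unchanged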
"Paralleling a theoretical scientist, it only generates hypotheses about the world and uses them to evaluate the probabilities of answers to given questions. As such, the Scientist AI has no situational awareness and no persistent goals that can drive actions or long-term plans."
I think it's a stretch to say something generating hypotheses about the world has no situational awareness and no persistent goals -- maybe it has indexical uncertainty, but a sufficiently powerful system is pretty likely to hypothesise about itself, and the equivalent of persistent goals can easily fall out of any ways in which its world model doesn't line up with reality. Note that this doesn't assume the AI has any 'hidden goals' or that it ever makes inaccurate predictions.
I appreciate that the paper does discuss objections to the safety of Oracle AIs, but the responses also feel sort of incomplete. For instance:
Overall, I'm excited by the direction, but it doesn't feel like this approach actually gets any assurances of safety, or any fundamental advantages.
The arguments in the paper are representative of Yoshua's views rather than mine, so I won't directly argue for them, but I'll give my own version of the case against the claim that the distinctions drawn here between RL and the Scientist AI all break down at high capability levels.
It seems like common sense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer the time horizon those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you're training and the outcomes being achieved.
At the top of the spectrum, you have systems trained based on things like the stock price of a company, taking many actions and receiving many observations per second, over years-long trajectories.
Many steps down from that you have RL training of current LLMs: outcome-based, but with shorter trajectories which are less tightly coupled with the outside world.
And at the bottom of the spectrum you have systems which are trained with an objective that depends directly on their outputs and not on the outcomes they cause, with the feedback not being propagated across time very far at all.
At the top of the spectrum, if you train a competent system it seems almost guaranteed that it's a powerful agent. It's a machine for pushing the world into certain configurations. But at the bottom of the spectrum it seems much less likely -- its input-output behaviour wasn't selected to be effective at causing certain outcomes.
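To make the two ends of the spectrum concrete, here is a toy sketch (my own illustration, not from the thread; ToyEnv, policy, and model are made-up stand-ins). At the top, the training signal is computed from what the system's actions did to a world over a long trajectory; at the bottom, it is computed from the outputs alone:

import random

class ToyEnv:
    # A toy "world": a single number that the system's actions push around.
    def __init__(self):
        self.state = 0.0
    def step(self, action):
        self.state += action
        return self.state
    def outcome(self):
        # Outcome measured only at the end of a long trajectory,
        # e.g. "how close did you push the world to 10?"
        return -abs(self.state - 10.0)

def outcome_based_signal(policy, horizon=100):
    # Top of the spectrum: the signal depends on what the system's actions
    # did to the world over the whole trajectory.
    env = ToyEnv()
    obs = 0.0
    for _ in range(horizon):
        obs = env.step(policy(obs))
    return env.outcome()

def output_based_signal(model, examples):
    # Bottom of the spectrum: the signal depends only on the outputs
    # themselves, compared against data; no outcomes in the world in between.
    return -sum((model(x) - y) ** 2 for x, y in examples) / len(examples)

policy = lambda obs: 0.1 if obs < 10 else -0.1   # hypothetical policy
model = lambda x: 2 * x                          # hypothetical predictive model
examples = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(10)]

print("outcome-based signal:", outcome_based_signal(policy))
print("output-based signal:", output_based_signal(model, examples))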
Yes, there are still ways you could create an agent through a training setup at the bottom of the spectrum (e.g. supervised learning on the outputs of a system at the top of the spectrum), but I don't think they're representative. And yes, depending on what kind of system it is, you might be able to turn it into an agent using a bit of scaffolding, but if you have the choice not to, that's an importantly different situation compared to the top of the spectrum.
And yes, it seems possible such setups lead to an agentic shoggoth completely by accident -- we don't understand enough to rule that out. But I don't see how you end up judging the probability that we get a highly agentic system to be more or less the same wherever we are on the spectrum (if you do)? Or perhaps it's just that you think the distinction is not being handled carefully in the paper?
Ah I should emphasise, I do think all of these things could help -- it definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.
The two things I think are (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down with superhuman capabilities. It's hard to tell how much of this is actual disagreement and how much is the paper trying to be concise and approachable, so I'll set that aside for now.
It does seem like we disagree a bit about how likely agents are to emerge. Some opinions I expect I hold more strongly than you:
It is good to notice the spectrum above. Likely, for a fixed amount of compute/effort, one extreme of this spectrum gets much less agency than the other extreme. Call that the direct effect.
Are there other direct effects? For instance, do you get the same ability to "cure cancer" for a fixed amount of compute/effort across the spectrum? Agency seems useful, so the ability you get per unit of compute is probably correlated with agency across this spectrum.
If we are in a setting where an outside force demands you reach a given ability level, then this second, indirect effect matters, because it means you will have to use a larger amount of compute.
[optional] To illustrate this problem, consider something that I don't think people would consider safer: instead of using gradient descent, just sample the weights of the neural net at random until you get a low loss. (I am not trying to make an analogy here.)
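Here is a toy version of that comparison, in case it is useful (my own sketch, not from the comment above; a two-weight linear regression stands in for the neural net). Both procedures can reach the same loss, but the compute needed to get there differs wildly, and the gap grows quickly with the number of weights:

import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in "network": linear regression with two weights.
X = rng.normal(size=(200, 2))
true_w = np.array([1.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

def loss(w):
    return np.mean((X @ w - y) ** 2)

target = 0.05  # the "ability level" an outside force demands you reach

# Random search: sample weights until the loss falls below the target.
samples = 0
while True:
    w_rand = rng.normal(size=2)
    samples += 1
    if loss(w_rand) < target or samples >= 1_000_000:
        break

# Gradient descent on the same problem.
w_gd = np.zeros(2)
steps = 0
while loss(w_gd) >= target and steps < 1_000:
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= 0.1 * grad
    steps += 1

print(f"random search:    {samples} samples to reach loss {loss(w_rand):.3f}")
print(f"gradient descent: {steps} steps to reach loss {loss(w_gd):.3f}")
# With only two weights the gap is already large, and it grows roughly
# exponentially with the number of weights.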
It would be great if someone had a way to compute the "net" effect on agency across the spectrum, also taking into account the indirect path of more compute needed -> more compute = more agency. I suspect it might depend on which ability you need to reach, and we might or might not be able to figure it out without experiments.
If you're planning to actually do the experiments it suggests, or indeed act on any advice it gives in any way, then it's an agent.
Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?
(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)
Is this possibly a "Chinese room" kind of situation? The model alone is not an agent, but "the model + the way it is used" might be...
And to be more precise, I don't mean things like "the model could be used by an agent", because obviously yes; but more like "the model + a way of using it that we also separately wouldn't call an agent" could be.
"Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?"
---> Nah, pre-training, fine-tuning, scaffolding, and especially RL seem like they all affect it. Currently scaffolding only gets you shitty agents, but it at least sorta works.
Pre-training, finetuning and RL are all types of training. But sure, expand 'train' to 'create' in order to include anything else like scaffolding. The point is it's not what you do in response to the outputs of the system, it's what the system tries to do.
Yeah, if the system is trying to do things, I agree it's (at least a proto-)agent. My point is that creation happens in lots of places with respect to an LLM, and it's not implausible that use steps (hell, even sufficiently advanced prompt engineering) can effect agency in a system, particularly as capabilities continue to advance.
We might disagree some. I think the original comment is pointing at the (reasonable, as far as I can tell) claim that oracular AI can have agent-like qualities if it produces plans that people follow.
I agree that it can be possible to turn such a system into an agent. I think the original comment is defending a stronger claim that there's a sort of no free lunch theorem: either you don't act on the outputs of the oracle at all, or it's just as much of an agent as any other system.
I think the stronger claim is clearly not true. The worrying thing about a powerful agent is that its outputs are selected to cause certain outcomes, even if you try to prevent those outcomes. So depending on the actions you're going to take in response to its outputs, its outputs have to be different. But the point of an oracle is to not have that property -- its outputs are decided by a criterion (something like truth) that is independent of the actions you're going to take in response[1]. So if you respond differently to the outputs, they cause different outcomes. Assuming you've succeeded at building the oracle to specification, it's clearly not the case that the oracle has the worrying property of agents just because you act on its outputs.
I don't disagree that by either hooking the oracle up in a scaffolded feedback loop with the environment, or getting it to output plans, you could extract more agency from it. Of the two, I think the scaffolding can in principle easily produce dangerous agency in the same way long-horizon RL can, but that the version where you get it to output a plan is much less worrying (I can argue for that in a separate comment if you like).
[1] I'm ignoring the self-fulfilling prophecy case here.
Thanks, I appreciate the reply.
It sounds like I have somewhat wider error bars but mostly agree on everything but the last sentence, where I think it's plausibly but not certainly less worrying.
If you felt like you had crisp reasons why you're less worried, I'd be happy to hear them, but only if it feels positive for you to produce such a thing.
Good point. I think that if you couple the answers of an oracle to reality by some random process, then you are probably fine.
However, many people want to use the outputs of the oracle in very obvious ways. For instance, you ask it what code you should put into your robot, and then you just put the code into the robot.
Could we have an oracle (i.e. one that was trained according to some Truth criterion) where, when you use it very straightforwardly, it exerts optimization pressure on the world?
A new paper by Yoshua Bengio and the Safe Artificial Intelligence For Humanity (SAIFH) team argues that the current push towards building generalist AI agents presents catastrophic risks, creating a need for more caution and an alternative approach. We propose such an approach in the form of Scientist AI, a non-agentic AI system that aims to be the foundation for safe superintelligence. (Note that this paper is intended for a broad audience, including readers unfamiliar with AI safety.)
Abstract
Executive Summary
You can read the full paper here.
Career Opportunities at SAIFH
If you are interested in working on this research agenda, we are currently hiring for an ML Research Developer position; apply here (French appears first, scroll down for English). We are also open to expressions of interest from individuals with backgrounds in machine learning research & engineering, as well as AI safety. If that's you, please reach out here.