I like the thrust of this paper, but I feel that it overstates how robust the safety properties will be, by drawing an overly sharp distinction between agentic and non-agentic systems and by not really engaging with the strongest counterexamples.
To give some examples from the text:
"A chess-playing AI, for instance, is goal-directed because it prefers winning to losing. A classifier trained with log likelihood is not goal-directed, as that learning objective is a natural consequence of making observations."
But I could easily train an AI which simply classifies chess moves by quality. What takes it from that to being an agent is just the fact that its outputs are labelled as 'moves' rather than as 'classifications', not any feature of the model itself. More generally, even an LM can be viewed as "merely" predicting next tokens -- the fact that there is some perspective from which a system is non-agentic does not actually tell us very much.
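To make that concrete, here is a minimal sketch (my own illustration, not something from the paper; score_moves is a hypothetical stand-in for a trained move-quality classifier). The same scoring model can be exposed as a classifier or as a move-picker, and nothing about the model changes between the two uses:

from typing import Dict, List

def score_moves(position: str, legal_moves: List[str]) -> Dict[str, float]:
    # Stand-in for a model trained purely to classify moves by quality.
    # (Placeholder scores; imagine they come from the trained classifier.)
    return {move: 1.0 / (i + 1) for i, move in enumerate(legal_moves)}

def classify(position: str, legal_moves: List[str]) -> Dict[str, float]:
    # "Non-agentic" use: report a quality score for each legal move.
    return score_moves(position, legal_moves)

def play(position: str, legal_moves: List[str]) -> str:
    # "Agentic" use: the same scores, with the argmax relabelled as the move to play.
    scores = score_moves(position, legal_moves)
    return max(scores, key=scores.get)

moves = ["e2e4", "d2d4", "g1f3"]
print(classify("startpos", moves))  # a classification
print(play("startpos", moves))      # a "move" -- the model itself is unchanged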
"Paralleling a theoretical scientist, it only generates hypotheses about the world and uses them to evaluate the probabilities of answers to given questions. As such, the Scientist AI has no situational awareness and no persistent goals that can drive actions or long-term plans."
I think it's a stretch to say something generating hypotheses about the world has no situational awareness and no persistent goals -- maybe it has indexical uncertainty, but a sufficiently powerful system is pretty likely to hypothesise about itself, and the equivalent of persistent goals can easily fall out of any ways in which its world model doesn't line up with reality. Note that this doesn't assume the AI has any 'hidden goals' or that it ever makes inaccurate predictions.
I appreciate that the paper does discuss objections to the safety of Oracle AIs, but the responses also feel sort of incomplete. For instance:
Overall, I'm excited by the direction, but it doesn't feel like this approach actually gets any assurances of safety, or any fundamental advantages.
The arguments in the paper are representative of Yoshua's views rather than mine, so I won't directly argue for them, but I'll give my own version of the case against the claim that the distinctions drawn here between RL and the Scientist AI all break down at high capability levels.
It seems like common sense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer the time horizon those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you're training and the outcomes being achieved.
At the top of the spectrum, you have systems trained based on things like the stock price of a company, taking many actions and receiving many observations per second, over years-long trajectories.
Many steps down from that you have RL training of current LLMs: outcome-based, but with shorter trajectories which are less tightly coupled with the outside world.
And at the bottom of the spectrum you have systems which are trained with an objective that depends directly on their outputs and not on the outcomes they cause, with the feedback not being propagated across time very far at all.
At the top of the spectrum, if you train a competent system it seems almost guaranteed that it's a powerful agent. It's a machine for pushing the world into certain configurations. But at the bottom of the spectrum it seems much less likely -- its input-output behaviour wasn't selected to be effective at causing certain outcomes.
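To make the two ends of the spectrum concrete, here is a toy sketch (my own illustration, not from the thread; ToyEnv, policy, and model are made-up stand-ins). At the top, the training signal is computed from what the system's actions did to a world over a long trajectory; at the bottom, it is computed from the outputs alone:

import random

class ToyEnv:
    # A toy "world": a single number that the system's actions push around.
    def __init__(self):
        self.state = 0.0
    def step(self, action):
        self.state += action
        return self.state
    def outcome(self):
        # Outcome measured only at the end of a long trajectory,
        # e.g. "how close did you push the world to 10?"
        return -abs(self.state - 10.0)

def outcome_based_signal(policy, horizon=100):
    # Top of the spectrum: the signal depends on what the system's actions
    # did to the world over the whole trajectory.
    env = ToyEnv()
    obs = 0.0
    for _ in range(horizon):
        obs = env.step(policy(obs))
    return env.outcome()

def output_based_signal(model, examples):
    # Bottom of the spectrum: the signal depends only on the outputs
    # themselves, compared against data; no outcomes in the world in between.
    return -sum((model(x) - y) ** 2 for x, y in examples) / len(examples)

policy = lambda obs: 0.1 if obs < 10 else -0.1   # hypothetical policy
model = lambda x: 2 * x                          # hypothetical predictive model
examples = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(10)]

print("outcome-based signal:", outcome_based_signal(policy))
print("output-based signal:", output_based_signal(model, examples))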
Yes, there are still ways you could create an agent through a training setup at the bottom of the spectrum (e.g. supervised learning on the outputs of a system at the top of the spectrum), but I don't think they're representative. And yes, depending on what kind of system it is, you might be able to turn it into an agent using a bit of scaffolding, but if you have the choice not to, that's an importantly different situation compared to the top of the spectrum.
And yes, it seems possible such setups lead to an agentic shoggoth completely by accident -- we don't understand enough to rule that out. But I don't see how you end up judging the probability that we get a highly agentic system to be more or less the same wherever we are on the spectrum (if you do)? Or perhaps it's just that you think the distinction is not being handled carefully in the paper?
Ah I should emphasise, I do think all of these things could help -- it definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.
The two things I think are (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down with superhuman capabilities. It's hard to tell how much of this is actual disagreement and how much is the paper trying to be concise and approachable, so I'll set that aside for now.
It does seem like we disagree a bit about how likely agents are to emerge. Some opinions I expect I hold more strongly than you:
It is good to notice the spectrum above. Likely, for a fixed amount of compute/effort, one extreme of this spectrum gets much less agency than the other extreme. Call that the direct effect.
Are there other direct effects? For instance, do you get the same ability to "cure cancer" for a fixed amount of compute/effort across the spectrum? Agency seems useful, so the ability you get per unit of compute is probably correlated with agency across this spectrum.
If we are in a setting where an outside force demands you reach a given ability level, then this second, indirect effect matters, because it means you will have to use a larger amount of compute.
[optional] To illustrate this problem, consider something that I don't think people would consider safer: instead of using gradient descent, just sample the weights of the neural net at random until you get a low loss. (I am not trying to make an analogy here.)
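Here is a toy version of that comparison, in case it is useful (my own sketch, not from the comment above; a two-weight linear regression stands in for the neural net). Both procedures can reach the same loss, but the compute needed to get there differs wildly, and the gap grows quickly with the number of weights:

import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in "network": linear regression with two weights.
X = rng.normal(size=(200, 2))
true_w = np.array([1.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

def loss(w):
    return np.mean((X @ w - y) ** 2)

target = 0.05  # the "ability level" an outside force demands you reach

# Random search: sample weights until the loss falls below the target.
samples = 0
while True:
    w_rand = rng.normal(size=2)
    samples += 1
    if loss(w_rand) < target or samples >= 1_000_000:
        break

# Gradient descent on the same problem.
w_gd = np.zeros(2)
steps = 0
while loss(w_gd) >= target and steps < 1_000:
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= 0.1 * grad
    steps += 1

print(f"random search:    {samples} samples to reach loss {loss(w_rand):.3f}")
print(f"gradient descent: {steps} steps to reach loss {loss(w_gd):.3f}")
# With only two weights the gap is already large, and it grows roughly
# exponentially with the number of weights.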
It would be great if someone had a way to compute the "net" effect on agency across the spectrum, also taking into account the indirect path of more compute needed -> more compute = more agency. I suspect it might depend on which ability you need to reach, and we might or might not be able to figure it out without experiments.
If you're planning to actually do the experiments it suggests, or indeed act on any advice it gives in any way, then it's an agent.
Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?
(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)
Is this possibly a "Chinese room" kind of situation? The model alone is not an agent, but "the model + the way it is used" might be...
And to be more precise, I don't mean things like "the model could be used by an agent", because obviously yes; but more like "the model + a way of using it that we also separately wouldn't call an agent" could be.
"Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?"
---> Nah, pre-training, fine-tuning, scaffolding, and especially RL seem like they all affect it. Currently scaffolding only gets you shitty agents, but it at least sorta works.
Pre-training, finetuning and RL are all types of training. But sure, expand 'train' to 'create' in order to include anything else like scaffolding. The point is it's not what you do in response to the outputs of the system, it's what the system tries to do.
Yeah, if the system is trying to do things, I agree it's (at least a proto-)agent. My point is that creation happens in lots of places with respect to an LLM, and it's not implausible that use steps (hell, even sufficiently advanced prompt engineering) can effect agency in a system, particularly as capabilities continue to advance.
We might disagree some. I think the original comment is pointing at the (reasonable, as far as I can tell) claim that oracular AI can have agent-like qualities if it produces plans that people follow.
I agree that it can be possible to turn such a system into an agent. I think the original comment is defending a stronger claim that there's a sort of no free lunch theorem: either you don't act on the outputs of the oracle at all, or it's just as much of an agent as any other system.
I think the stronger claim is clearly not true. The worrying thing about a powerful agent is that its outputs are selected to cause certain outcomes, even if you try to prevent those outcomes. So depending on the actions you're going to take in response to its outputs, its outputs have to be different. But the point of an oracle is to not have that property -- its outputs are decided by a criterion (something like truth) that is independent of the actions you're going to take in response[1]. So if you respond differently to the outputs, they cause different outcomes. Assuming you've succeeded at building the oracle to specification, it's clearly not the case that the oracle has the worrying property of agents just because you act on its outputs.
I don't disagree that by either hooking the oracle up in a scaffolded feedback loop with the environment, or getting it to output plans, you could extract more agency from it. Of the two, I think the scaffolding can in principle easily produce dangerous agency in the same way long-horizon RL can, but that the version where you get it to output a plan is much less worrying (I can argue for that in a separate comment if you like).
[1] I'm ignoring the self-fulfilling prophecy case here.
Thanks, I appreciate the reply.
It sounds like I have somewhat wider error bars but mostly agree on everything but the last sentence, where I think it's plausibly but not certainly less worrying.
If you felt like you had crisp reasons why you're less worried, I'd be happy to hear them, but only if it feels positive for you to produce such a thing.
Good point. I think that if you couple the answers of an oracle to reality by some random process, then you are probably fine.
However, many people want to use the outputs of the oracle in very obvious ways. For instance, you ask it what code you should put into your robot, and then you just put the code into the robot.
Could we have an oracle (i.e. one that was trained according to some Truth criterion) where, when you use it very straightforwardly, it exerts optimization pressure on the world?
A new paper by Yoshua Bengio and the Safe Artificial Intelligence For Humanity (SAIFH) team argues that the current push towards building generalist AI agents presents catastrophic risks, creating a need for more caution and an alternative approach. We propose such an approach in the form of Scientist AI, a non-agentic AI system that aims to be the foundation for safe superintelligence. (Note that this paper is intended for a broad audience, including readers unfamiliar with AI safety.)
Abstract
Executive Summary
You can read the full paper here.
Career Opportunities at SAIFH
If you are interested in working on this research agenda, we are currently hiring for an ML Research Developer position; apply here (French appears first, scroll down for English). We are also open to expressions of interest from individuals with backgrounds in machine learning research & engineering, as well as AI safety. If that's you, please reach out here.