[Epistemic status: mostly writing to clarify my intuitions, with just a few weak attempts to convince others. It's no substitute for reading Drexler's writings.]

I've been struggling to write more posts relating to Drexler's vision for AI (hopefully to be published soon), and in the process got increasingly bothered by the issue of whether AI researchers will see incentives to give AI's broad goals that turn them into agents.

Drexler's CAIS paper convinced me that our current trajectory is somewhat close to a scenario where human-level AI's that are tool-like services are available well before AGI's with broader goals.

Yet when I read LessWrong, I sympathize with beliefs that developers will want quite agenty AGI's around the same time that CAIS-like services reach human levels.

I'm fed up with this epistemic learned helplessness, and this post is my attempt to reconcile those competing intuitions.

Please recall that Drexler's distinction here focuses on a system's goals, not its knowledge. Software is more agenty when its goals cover a wide range of domains, and long time horizons. Services are designed to produce specific outputs using a system's current procedures.

The Easy Part?

An ESRogs comment on LW nudged me a bit more toward Drexler's position.

"The smart part is not the agent-y part": intelligence and agency are independent, with intelligence being the hard part, and agency being more like prompt engineering.

This seems mostly correct, at least for Drexler's meaning of agency.

There are still significant problems with getting agency correct, but they're mostly associated with figuring out what we want, and likely not helped much by math or computer science expertise.

Orthogonality squared

ESRogs also suggests a potentially valuable concept that he calls "Orthogonality squared". I.e. the domain of an agent's goals is independent of its intelligence.

We can, in principle, create AI's that don't care about atoms. At least for some simple tasks, it's fairly natural to have the AI care only about abstractions such as math. You can't alter the laws of arithmetic by rearranging atoms - laws of arithmetic are in a different realm from atoms. An AI that cares only about a distant part of Tegmark IV won't care whether we shut down an arrangement of atoms in our realm.

Many people concerned about AI safety seem reluctant to accept this. E.g. Rob Bensinger's initial reply to ESRogs worried about this unsafe(?) case:

"The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment."

"Simulated environment" seems ambiguous as to whether it's referring to a distant part of Tegmark IV, or a special environment within our world. The latter would imply that it's tricky to analyze whether the AI's goals conflict with ours; the former does not.

It seems odd that Eliezer's AGI lethality #19 warns that it's hard to get software to care directly about the real world, yet considers it inevitable that AGI's will care about the real world. That's not a contradiction, but the tension between those positions ought to raise further questions.

I don't know how an AI could run a company without caring about atoms, so the basic claim behind orthogonality squared is not sufficient to reassure me.

But it illustrates the extent to which AI goals can be compatible with leaving humans in control of the real world. So it should be possible to broaden such goals to care about narrow aspects of our universe, without caring about all of our future lightcone.

Analogies

We have some relevant experience with weakly superhuman agents in the forms of governments, corporations, labor unions, and universities.

These all have some sort of limits on the breadth of their goals, yet those goals often become broad enough to cause more harm than I'd predict if I modeled them as purely tools / services.

Let's examine what happens if I try to create a corporation that has no dangerously agenty goals. I'll call it Peter's Paleo Pizza.

I start out as the sole owner / employee, making all strategy decisions myself. I hire cooks who I tell to follow my recipe exactly.

I can grow the company to a few hundred people while micromanaging it enough that I have all the policy control over it that I want. But by that time the micromanagement takes most of my attention.

If I grow it to thousands of employees, I'll delegate enough decisions that the system will be at least weakly agenty. I don't want to give the system detailed instructions about who to hire or how they ought to acquire skills. Instead, I want employees to have a vision for how to balance goals such as making food tasty versus making it healthy. I want them to devote most of the company's resources to directly making food. If I thought they might devote more than a few percent of the company's resources to longer-term development plans, I'd place strict limits on that.

Such a company is a weakly superhuman system, in the sense of being an expert in more domains than an individual human can be. The limit on development-oriented resources leaves me confident that it won't do anything close to fooming via self-improvement. But the company is stretching the boundaries of what I'd classify as a service.

When I want to broaden my business empire to the limits of what humans have done, I give up on managing a company. I don't have Elon Musk's ability to manage corporate policies of multiple industries. Instead, I invest in many companies. That involves heuristics that tend to reward companies for maximizing the discounted value of future profits. I don't try very hard to predict how much they'll spend on R&D. This clearly violates the constraints that Drexler wants on safe AI services.

How much would I and other investors do differently if corporations were new phenomena with hard-to-evaluate risks, and unusually rapid changes in abilities? It does not seem obvious. It seems clear that human monitoring and guidance have costs which will tempt us to make superhuman systems more agenty. The corporate analogy weakly suggests we'll postpone the risky parts of that until after they become superhuman.

Software Examples

Gwern's Why Tool AIs Want to Be Agent AIs has examples of places where being more agenty seems valuable.

Note that I'm using "agent" in a slightly different way than Gwern. I'm attempting to focus on what Drexler considers important. So I'm mostly disagreeing with Gwern's framing, not his answers.

High-frequency trading is a task for which humans are too slow to be in the loop.

There is some sense in which the best software for this might be to tell an AI to maximize expected returns over some long time period. But there are safer strategies that I expect will usually produce the same benefits:

  • perform all the learning and make all strategic decisions before deploying, so there's time for human-AI collaboration.
  • have the AI only care about reacting to patterns in numbers. If we do nothing to cause it to care about how those numbers relate to the real world, I don't see how it would start caring about the real world.
  • have the AI care only about the next number it's about to output.

These options don't involve the AI self-improving while it's trading. But that has little effect on whether the AI gets improved. The CAIS approach to improving it involves a separate AI improvement process, for which I don't see much pressure to have the AI make the most agenty decisions.
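To make those options concrete, here is a minimal sketch, assuming a toy linear model stands in for the trading AI. The names `FrozenPolicy`, `fit_offline`, and `next_output` are hypothetical illustrations, not anything from Drexler or Gwern. All learning happens in a separate offline step, the deployed object is immutable, and the training objective looks only one number ahead.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class FrozenPolicy:
    """Hypothetical deployed service: weights are fixed before deployment."""
    weights: Tuple[float, ...]

    def next_output(self, window: Sequence[float]) -> float:
        # Reacts only to the numbers in the window and keeps no state
        # between calls, so nothing here learns or plans while "trading".
        if len(window) != len(self.weights):
            raise ValueError("window length must match number of weights")
        return sum(w * x for w, x in zip(self.weights, window))


def fit_offline(history: Sequence[float], window_size: int,
                learning_rate: float = 0.1, epochs: int = 200) -> FrozenPolicy:
    """Separate improvement process (an assumption of this sketch).

    Minimizes squared error on the very next number only, so the
    objective has a one-step time horizon.
    """
    weights = [0.0] * window_size
    for _ in range(epochs):
        for i in range(len(history) - window_size):
            window = history[i:i + window_size]
            target = history[i + window_size]
            prediction = sum(w * x for w, x in zip(weights, window))
            error = prediction - target
            # Plain stochastic gradient step on the one-step error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, window)]
    return FrozenPolicy(tuple(weights))
```

Improving such a service means rerunning `fit_offline` on new data and swapping in the result, which leaves room for human review between versions. A real trading system would of course be far more complex; the sketch only shows where the learning and the deployment boundaries sit.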

Gwern unintentionally(?) illustrates why we should be cautious about making AI's more agenty:

"we don't want excellent advice on which stock to buy for a few microseconds, we want a money pump spitting cash at us"

I don't want anything resembling a human-level AI choosing such a money pump, as it probably won't distinguish between one which will fail gracefully versus one that will blow up like LTCM or Bernie Madoff. But I likely would have chosen a dangerous money pump when I had less experience.

Chess provides a different perspective. Computer-human collaborations fit the CAIS paradigm fairly well for a decade, then the human became mostly a liability, due to low human reliability. This is evidence that CAIS is only a temporary stopgap, but it does support hopes that CAIS will buy us some time.

Note that for chess, humans here weren't trying to be the more agenty part of this collaboration (the chess programs were likely safely aligned and using their agency appropriately). But similar considerations would likely apply if the humans were primarily trying to provide alignment: humans would make enough mistakes at aligning individual decisions to pressure us to replace those decisions with those of an apparently aligned AI.

Richard Ngo expresses similar concerns: "requiring the roles of each module and the ways they interface with each other to be ... human-comprehensible will be very uncompetitive". I don't see a general rule to that effect - sometimes comprehensibility will enhance competitiveness via making a service easier to debug or to build upon. Also, tools to assist human comprehension will enable humans to use increasingly powerful interfaces. But I'm unwilling to bet that those effects will enable human-controlled services to remain competitive.

Conclusion

Given mildly optimistic assumptions about how responsible the leading AI developers will be, I expect human-level CAIS-style services to arrive anywhere from months to a decade before human-level AGI's become broadly agenty.

I have little idea how hard it will be to use CAIS services to produce longer-term solutions to AI risk.

I'm guessing Eliezer will disagree with a key part of this post. I expect some of his reasons are related to a belief in a core of general intelligence that has not yet been discovered. But it looks to me like anything that I'd classify as such a core is already known.

This post seems mostly consistent with Drexler's claims, but with more emphasis on the dangers.

Comments

AI risk is the risk of failure of civilization's inner alignment, whether the unaligned mesa-optimizers are agenty or not. Solving inner alignment in an AGI is more urgent than solving AI risk (civilizational inner alignment), because an aligned AGI without its own inner alignment problems can help with solving AI risk. But similarly, if an aligned AGI can function despite having its own inner alignment risks (but not yet an active inner alignment failure), it might help with solving its own inner alignment risk, and this seems more urgent than for such AGI to solve AI risk (civilizational inner alignment).

In the CAIS setting, there is no clear distinction between mesa-optimizers within civilization and mesa-optimizers within an AGI built this way, so the two inner alignment problems are even closer; solving both might be a result of the same activity.

I don't think general intelligence will look anything like CAIS under the current path. For example, Demis Hassabis was recently on Lex Fridman's podcast. Demis said that the experience of ML experts is that end-to-end training always ends up working better, since the machine is better at figuring out the constraints of any problem than humans. I think the dangerous type of general intelligence looks like a single large sparse model trained with reinforcement learning in many domains, not a bunch of different models stitched together by human engineers in an ad-hoc way. The latter doesn't even sound plausibly buildable. (Maybe I'm misunderstanding CAIS though?)

Drexler wrote his QNR paper in part to address this issue. I'm trying to write a blog post about QNR.