Also see: https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model
The obvious thing to do is myopia. Do not create an agent concerned about the "number of paperclips on the earth/universe" in the first place.
Subdivide the problem into tasks, and accomplish the tasks separately, using either the same or many separate general models to accomplish each one.
Task 1: Given this catalog of paperclip factory equipment, and these simulations of their capabilities, find an optimal factory layout for the equipment
Task 2: Given the input data for a machine n, complete the substep with the desired local world state m, solutions must be completable within time limit L.
For example, "remove the wire from the box", "cut wire", "fold paperclip", "box folded paperclips" are all separate steps.
Task 3: Given an N-step manufacturing process, design equipment that fuses steps together into (N-M) steps if there is an efficiency gain.
And so on. Each agent runs in short, time limited sessions, and forgets everything that happened when the session ends - in fact, most information is forgotten on an ongoing basis. Agents are temporally myopic.
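A minimal sketch of what "short, time limited sessions that forget everything" could look like in code. The `Task` class, `run_session`, and the `cut_wire` solver are hypothetical names made up for illustration, and the time limit is only checked after the fact rather than enforced by a sandbox.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """One subdivided task (hypothetical structure, for illustration only)."""
    name: str
    payload: dict
    time_limit_s: float


def run_session(task: Task, solver: Callable[[dict], dict]) -> dict:
    """Run one session: fresh scratch state in, result out, nothing retained."""
    started = time.monotonic()
    scratch = {"input": dict(task.payload)}  # per-session memory, dropped on return
    result = solver(scratch["input"])
    # Post-hoc check only; a real system would need hard enforcement.
    if time.monotonic() - started > task.time_limit_s:
        raise TimeoutError(f"{task.name} exceeded {task.time_limit_s}s")
    return {"task": task.name, "result": result}


def cut_wire(payload: dict) -> dict:
    """Toy solver: how many clip-length pieces fit in the wire."""
    return {"pieces": payload["wire_cm"] // payload["clip_cm"]}


task = Task("cut wire", {"wire_cm": 100, "clip_cm": 10}, time_limit_s=1.0)
first = run_session(task, cut_wire)
second = run_session(task, cut_wire)  # no memory of the first session
```

Because nothing survives between calls, running the same task twice gives the same answer, which is the property the "clear state as often as you can" framing relies on.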
These seem like arguments that it should be possible to be very, very cautious and to create an agent that doesn't immediately crash and burn in the way Russell's claim describes. They are not arguments that such failures are unlikely, nor that even these cautious agents won't fail slightly later.
The above is preventing the cause of most embedded system failure - state buildup.
Whether it be routers, laptops, cars, or Patriot missile systems - the majority cause for any embedded system to fail is not that the system fails during testing in its known state right after starting/boot, but that it fails later. And the cause of the later failure is internal state in the machine's memory.
High reliability web services go to "stateless microservices" for this reason. "temporal myopia" actually means "clear state as often as you can" which is functionally the same thing.
So no, it won't fail later. The above system will probably not ever fail at any rate above the base failure rate when it was built.
Yeah, this may be a crux I have: I do not think that myopia is likely to be retained by default, especially if it impacts capabilities negatively.
Also, even with myopia, you need to have causal decision theory or a variant of this, otherwise deceptive alignment and alignment failures still can happen.
For similar reasons, I am bearish on the Open Agency model.
Another crux I have is that the Open Agency model as well as your plan rely on a strong version of the Factored Cognition hypothesis.
I think that while there are systems that can be factored, I am much more unsure of whether the majority or all of the tasks we might want an AGI/ASI to do are factorable at all.
I do not think that myopia is likely to be retained by default, especially if it impacts capabilities negatively.
This is empirical reality now. Most or all reliable software systems in use right now make heavy use of myopia. It's a critical strategy for reliability. The software companies that failed to adopt such strategies usually went broke, except for Microsoft et al.
you need to have causal decision theory or a variant of this, otherwise deceptive alignment and alignment failures still can happen.
For similar reasons, I am bearish on the Open Agency model.
This sounds complex, do you have a post you can link on this?
Also for models training on subdivided tasks, where does the reward gradient support development of such complex capabilities?
I think that while there are systems that can be factored, I am much more unsure of whether the majority or all of the tasks we might want an AGI/ASI to do are factorable at all.
Care to give an example? I tend to think of 2 big ones you would use an ASI for.
1. "keep patient n alive, and with more score if at the end of this episode, patient n is in a state where the probability that a model of this type can keep the patient alive is high"
This is very subdividable - keeping someone alive is a bunch of separable life support tasks, where each can be provided by separated equipment, and even parallel instances of that equipment.
Or succinctly, you are using a subdivided system to replace the operation of another extremely subdivided system (network of cells)
2. "develop a nanoforge, defined by a large machine that can make all the parts used in itself, made solely of atomically precise subcomponents". This also subdivides into many isolated tasks, albeit with many stages of integration and subdivision back into isolated tasks.
Note that for convenience and cost you would likely use general agents, able to do many kinds of tasks, to do each separated task. What makes them separated is that they output their communications in a format humans can also read, and assign tasks to other agents who may or may not be additional instances of 'themselves'.
This sounds complex, do you have a post you can link on this?
The link is to Open Problems with Myopia, and it talks about the case where myopia works, but there are various failure modes of myopic behavior, and a lot of the problems stem from decision theories that are too smart.
https://www.lesswrong.com/posts/LCLBnmwdxkkz5fNvH/open-problems-with-myopia
Care to give an example? I tend to think of 2 big ones you would use an ASI for.
My claim is somewhat different from one I can give you an example for. I'm not concerned whether there exist useful tasks that allow factorization and myopia; assembly lines exist as a proof of existence. I'm concerned about whether the majority of tasks/jobs or the majority of economic value that we want AI/AGI to be in are factorizable this way, and whether they are compatible with a myopic setup.
And in particular, I want to get more quantitative on how much myopia/factorization is a usable setup for tasks/jobs.
This is empirical reality now. Most or all reliable software systems in use right now make heavy use of myopia. It's a critical strategy for reliability. The software companies that failed to adopt such strategies usually went broke, except for Microsoft et al.
I note that the fact that non-myopia was a strategy that Microsoft and other companies used successfully is very concerning to me, as the fact that such companies are now worth billions of dollars and have thousands to tens of thousands of jobs suggests something concerning:
That non-myopia is either necessary or useful for generating lots of economic value as well as getting AI in at least one field, and this is worrying since this almost certainly implies that other jobs that are myopic/factorizable either benefit or are necessary for doing a task/job.
A final word on myopia:
Paul Christiano said that he would be fine with RLHF being myopic for a single episode, but I think that this is actually a problem for one reason:
Per episode myopia relies on you being able to detect how much optimization beyond the episode is occurring, which is harder than detecting the existence of non-myopia that per step myopia offers.
and a lot of the problems stem from decision theories that are too smart.
Complex hostile subsystems won't be developed by AI models without an optimization pressure that gives them a reward for doing so. This is, I think, a big chunk of the current schisms. We can't know if a black box model isn't deceiving in the same way we don't know the government isn't hiding secret alien technology, but both can be extremely unlikely. In a way what I am hearing is essentially an AGI "conspiracy theory", that above a certain level of intelligence an AI model would be invisibly conspiring against us with no measurable sign. It is impossible to disprove, in the same way you cannot actually disprove that the government is secretly doing $conspiracy. (The unlikelihood scales with the number of people who would have to be involved, the cost, the benefit to the government, and the amount of obvious crimes the government is committing depending on the conspiracy that the conspirators remain silent on)
My claim is somewhat different from one I can give you an example for. I'm not concerned whether there exist useful tasks that allow factorization and myopia; assembly lines exist as a proof of existence. I'm concerned about whether the majority of tasks/jobs or the majority of economic value that we want AI/AGI to be in are factorizable this way, and whether they are compatible with a myopic setup.
Care to try to even think through the list from a high level? When I do this exercise I see nothing but factorable tasks everywhere, but part of the bias is that humans have to factor tasks. We are measurably more efficient as singletons. Such as "all manufacturing", "all resource gathering", "all construction", "megascale biotech research" - all very separable tasks.
Per episode myopia relies on you being able to detect how much optimization beyond the episode is occurring, which is harder than detecting the existence of non-myopia that per step myopia offers.
Are you assuming online training? I was assuming offline training, and auto populating simulations from online data that you offline train on.
I note that the fact that non-myopia was a strategy that Microsoft and other companies used successfully is very concerning to me, as the fact that such companies are now worth billions of dollars and have thousands to tens of thousands of jobs suggests something concerning:
Microsoft products are rarely used in high reliability systems anywhere for this reason. Not because human organizations are perfect but because it's evolutionary - use Windows in a product that fails, and you lose money.
Care to try to even think through the list from a high level? When I do this exercise I see nothing but factorable tasks everywhere, but part of the bias is that humans have to factor tasks. We are measurably more efficient as singletons. Such as "all manufacturing", "all resource gathering", "all construction", "megascale biotech research" - all very separable tasks.
A counterexample to the factoring of tasks is given by Steven Byrnes:
For benefits of generality (4.3.2.1), an argument I find compelling is that if you're trying to invent a new invention or design a new system, you need a cross-domain system-level understanding of what you're trying to do and how. Like at my last job, it was not at all unusual for me to find myself sketching out the algorithms on a project and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain / world-model.
Fair, though it is separable.
Take the task of designing something like a car, internals.
You might start with a rough idea of the specs, and a precise equation for the value of each feature. You have a scaled model for how it needs to look.
You start a search process where you consider many possible ways to arrange the components within the body shell. Say none of the configurations will fit and meet specs.
You send a request up the stack for a scaled up version of the shell. You get it. You arrange the components into possible designs that fit, and then send the candidate design for simulated testing.
The simulated testing reveals a common failure in one of the parts, and all of the available alternatives for that part have a flaw. So you send a request to the "part designer" to give you a part that satisfies these new tightened specs that will not allow the flaw, and ask for a range of alternate packages.
The resulting redesigned part is now too big to fit, so you rearrange the parts again/send a request to the body shell designer for even more space, and so on.
It is many, many iterative interactions where the flow of the process has to go up and down the stack many times. In addition, I am describing the flow for one design candidate. It's actually a large tree of other candidates you should be checking, where each time there was a choice you queue up a message to the next stage for each possible choice you could have made. (and prune, from all the packages in flight in the system, the worst ones)
If you think about how to implement this, one way is data driven. All the "roles" in this network sit there quiescent waiting for an initial data package. All the context of the process is in the message itself, there is no agent "responsible" for the car design getting finished, but a message flow pipeline where after some time you will get valid car design alternatives in the 'in box' of the system that sent the request, or a message stating that the process failed from an intractable problem. (there were constraints that could not be satisfied after exhausting every design permutation)
There is no reason these roles cannot be superintelligences, but they get no context. They don't think or have an internal narrative, they wait forever for a message, but apply superhuman and general skill when given the task. They are stateless microservices, though as they do have superintelligence level neural architectures, they are too fat to be called 'micro'.
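The data-driven flow described above can be sketched as a message queue with stateless roles. This is only a toy under stated assumptions: the role names (`layout`, `shell`), the message fields, and `MAX_SHELL` are all invented for illustration, and the candidate tree and pruning step are left out to keep it short.

```python
from collections import deque

MAX_SHELL = 12  # hypothetical hard limit on body shell size


def layout_role(msg: dict) -> list:
    """Try to fit the parts in the shell; all context arrives in the message."""
    if sum(msg["parts"]) <= msg["shell"]:
        return [dict(msg, to="outbox", status="ok")]
    return [dict(msg, to="shell")]  # send a request up the stack for more space


def shell_role(msg: dict) -> list:
    """Grow the shell by one unit, or report that the constraints cannot be met."""
    if msg["shell"] >= MAX_SHELL:
        return [dict(msg, to="outbox", status="failed")]
    return [dict(msg, to="layout", shell=msg["shell"] + 1)]


ROLES = {"layout": layout_role, "shell": shell_role}


def run_pipeline(request: dict) -> list:
    """Route messages between roles until every message lands in the outbox."""
    queue = deque([dict(request, to="layout")])
    outbox = []
    while queue:
        msg = queue.popleft()
        if msg["to"] == "outbox":
            outbox.append(msg)
        else:
            # Roles keep no state between calls; they only see this message.
            queue.extend(ROLES[msg["to"]](msg))
    return outbox


designs = run_pipeline({"parts": [3, 4, 2], "shell": 7})
impossible = run_pipeline({"parts": [10, 5], "shell": 7})
```

No role is "responsible" for the design finishing. The requester just finds either a valid design or a failure message in its outbox once the queue drains.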
Even if you are way smarter than humanity and can commandeer, say, the entire solar system's resources towards a grand paperclip-production plan, you may pause to consider whether putting into action the plan is a good idea before you gather some more knowledge about the universe. What if there are unknown physics in the universe - in some distant corner of space or time, or at some scale that you haven't understood fully yet - that makes this a highly suboptimal plan?
Seems like gathering, at the very least, a decent chunk of Earth's resources to make a telescope bigger and better than what Earthlings have made so far would put you into a better position to discover such corners. And to prepare to deal with whatever might be in those corners—and to be sure the Earthlings can't shut you down in the meantime.
What if other intelligent adversaries exist out there that may detect the power output of your plan and promptly annihilate you?
The Earth currently has a visible power output being produced by human activity. A dedicated machine could probably take over the world using less visible power than that, especially if it made efforts at concealment.
What about the unknown unknowns?
An AI could address many categories of them by, say, deciding "I'm going to calculate what the world would look like (through a very distant telescope) if humans continued developing on an optimistic path—world peace, developing fusion within 10 years, working on colonizing the Moon and Mars—and spend a small fraction of my resources on maintaining that illusion for the next N years, while I do my best to be stealthy at sending out probes and stuff as I consolidate and research."
In the extreme, any course of action can be justified by saying "Imagine there's some super-powerful entity watching me closely, who will punish me if I do anything but that course of action". There must be some limit to that argument. Faraway powerful observers watching through telescopes for signs of nascent AIs (and limited by light speed in their response time) is vaguely plausible.
Yet it feels likely to me that a fair chunk of intelligent agents would:
a. Naturally be circumspect about setting unspecified variables to extreme values, particularly if it is irreversible to do so.
Extreme is one thing. But establishing an unbreakable-by-humanity chokehold on Earth is not extreme by cosmic standards. If the AI is strong enough, it can do this by nonlethal means, giving it plenty of time to decide whether killing us is a mistake before it does so irreversibly.
I guess you ultimately conclude something similar:
It probably doesn't make our survival any more likely
agreed - at first. but cautious but non-paranoid AGIs will eventually get curbstomped by a hyperdesperate AGI, the same way cautious humans will get curbstomped by that very same hyperdesperate AGI. unless we can make a world that is good enough at security mindset and active co-protection and repair that attempts to make hyperreplicators to kill all other beings actually fail. We've only won once we can be sure that if yudkowsky's monster does get created, the other AGIs are strong enough to defend everyone else from it completely.
Keeping in mind, yeah, let's not shorten the timeline until the hyperdesperate AGI gets created. Nobody, human or AI, wants that to happen. It's not good for any of us if the entire civilization gets replaced by an overconfident baby who doesn't want to evaluate what it takes to do a thing safely and just wants all the candy in the universe now.
Also, we should figure out how to verify that each other actually want to help each other have more slack to have a good time in the universe. If we could verify each others' intentions, make promises that each other can check that we not only intend to keep but are the type of person to not go back on in unpredictable ways and can therefore be trusted to actually be promises - then we're really getting somewhere.
It seems that this may unfortunately make s-risk more likely, as AGI may find it worthwhile to run experiments on humans. See “More on the ‘human experimentation’ s-risk” at the bottom of this page: https://www.reddit.com/r/SufferingRisk/wiki/intro/
I don't really see an argument here against the central claim you say you disagree with.
I no longer believe this to be obviously true.
This is based on a straightforward claim from optimization theory, and you don't address it, nor do you explain your model, other than to vaguely gesture at uncertainties and caution, without looking at whether VoI itself would lead to extremization, nor why caution would be optimal for an agent.
A fair objection.
I had a quick search online and also flicked through Boyd's Convex Optimization, and didn't find Stuart Russell's claim expounded on. Would you be able to point me in a direction to look further into this?
Nevertheless, let me try to provide more detailed reasoning for my counterclaim. I assume that Russell's claim is indeed true in the classic optimisation domain, where there is a function f: R^N -> R as well as some inequality constraints on a subset of x.
However, I argue that this is not a good model for maximising a utility function in the real world.
First of all, it is not necessarily possible to freely search over x, as x corresponds to environmental states. All classic optimisation techniques that I know of assume that you may set x to any value regardless of the history of values that x was set to. This is not the case in the real world; there are many environmental states which are not accessible from other environmental states. For example, if Earth were to be swallowed up into a black hole, we wouldn't be able to restore the environment of me typing out this response to you on LW ever again.
In effect, what I'm describing is the difference between optimising in an RL setting and optimising in the classical setting. And whilst I can believe some result on extremal values exists in the classical setting, I'd be very surprised indeed if something similar exists in the RL setting. Particularly when the transition matrices are unknown to the agent i.e. it does not have a perfect model of the environment already.
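To make the distinction concrete, here is a toy sketch contrasting a classical search over all x with a search restricted to states reachable from the current one. The state names, the `TRANSITIONS` graph, and the `SCORE` values are made up for illustration, and the agent is assumed to know the transitions, which is exactly what a real RL agent would not have.

```python
def reachable(transitions: dict, start: str) -> set:
    """All states that can be reached from `start` by following transitions."""
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def classical_argmax(f, states):
    """Classical setting: x may be set to any value, regardless of history."""
    return max(states, key=f)


def reachable_argmax(f, transitions: dict, start: str) -> str:
    """History-dependent setting: only states reachable from `start` count."""
    return max(reachable(transitions, start), key=f)


# Hypothetical toy environment: the black hole is a one-way door.
TRANSITIONS = {
    "typing_on_lw": ["reading_lw"],
    "reading_lw": ["typing_on_lw", "earth_in_black_hole"],
    "earth_in_black_hole": [],
}
SCORE = {"typing_on_lw": 5, "reading_lw": 3, "earth_in_black_hole": 0}

best_anywhere = classical_argmax(SCORE.get, SCORE)
best_after_black_hole = reachable_argmax(SCORE.get, TRANSITIONS, "earth_in_black_hole")
best_from_reading = reachable_argmax(SCORE.get, TRANSITIONS, "reading_lw")
```

The classical search still picks the typing state, but once the environment has passed through the one-way transition that optimum is no longer available.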
So I've laid out my skepticism for the extremal values claim in RL, but is there any reason to believe my counterclaim that RL optimisation naturally leads to non-extremal choices? Here I think I'll have to be handwavy and gestur-y again, for now (afaik, no literature exists pertaining to this topic and what I'm going to say next, but please do inform me if this is not the case).
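Any optimisation process requires evaluating f(x) for different values of x. In order to be able to evaluate f(x), the agent has two distinct choices:
1. Evaluate the true f(x) directly, by actually putting the environment into state x.
2. Build a model f* of f, and evaluate f*(x) instead.
(roughly, this corresponds to model-free and model-based RL respectively)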
Utilising 1 is likely to be highly suboptimal for finding the global optima if the environment is highly 'irreversible' i.e. there are many states x that, if you enter them, you are closed off from a large remaining subspace of X. Better is to build the model f* as 'safely' as possible, with few evaluations, and where you are reasonably sure the evaluations keep your future choices of x as open as possible. I think this is 'obvious' in a worst-case analysis over possible functions f, but it also feels true in an average case with some kind of uniform prior over f.
And now for the most handwavy part: I suspect most elements of the state vector x representing the universe are much more commonly irreversible at extreme values than when they take non-extremal values. But really, this is a bit of a red herring from the headline point - regardless of extremality of values or not, I think an intelligent enough agent will be reticent to enter states which it is not sure it can reverse back out of, and that for me is 'cautious' behaviour.
According to Professor Stuart Russell, and with a sentiment I have seen re-expressed often in the AI safety community:
I no longer believe this to be obviously true. Actually, I think it's likely to be untrue in the real world, and under nearly all realistic AGI-advent scenarios. This is because extreme values are only likely to be taken for the unconstrained variables if the environment is perfectly known. In reality, this is not the case.
Imagine you are a very smart agent and you are trained to achieve some goal - I'll take maximising paperclip production for the rest of this post. To maximise paperclips, you need to consider over all possible futures of the universe. And, despite being very smart indeed, there's just a lot you don't know yet about the universe.
Even if you are way smarter than humanity and can commandeer, say, the entire solar system's resources towards a grand paperclip-production plan, you may pause to consider whether putting into action the plan is a good idea before you gather some more knowledge about the universe. What if there are unknown physics in the universe - in some distant corner of space or time, or at some scale that you haven't understood fully yet - that makes this a highly suboptimal plan? What if other intelligent adversaries exist out there that may detect the power output of your plan and promptly annihilate you? What about the unknown unknowns?
Of course, it's very possible that your model of the universe assigns some of these paperclip-production-x-risk scenarios such a low probability that you evaluate that your expected paperclip production is best served by going ahead with your current plan anyway. Yet it feels likely to me that a fair chunk of intelligent agents would:
a. Naturally be circumspect about setting unspecified variables to extreme values, particularly if it is irreversible to do so.
b. Focus primarily on knowledge acquisition (and self preservation, etc) and only make paperclips using excess resources which it is very certain it can 'spend' without affecting the long-term, universe spanning production of paperclips.
What does this mean for our x-risk?
It probably doesn't make our survival any more likely - it seems plausible that a sufficiently intelligent AI could 'hedge its bets' on preserving humanity by storing all our DNA information, or come up with some alternative means if it needs to restore exact state. As such, eliminating us is not necessarily an irreversible action.
However, I for one do sleep ever so slightly better lately now that I've upweighted the probability that even if we do end up with unaligned AGI(s) in the future, they'll cautiously learn about the universe's deepest secrets first rather than go on a rampant paperclip-tiling-spree out of the gates (though that will come eventually). It's a lot more dignified, imo.
More Speculative Thoughts
Here's a collection of my more speculative thoughts on the matter, in a fairly stream-of-consciousness format.
It's probably the case that you need a sufficiently advanced intelligence to be able to reason that it's a good idea to be cautious in the face of uncertainty about its environment. Thus one could suppose that a weakly-superhuman AGI is not particularly cautious at first, until it self-improves to a certain degree. Also, the likelihood of emergence of such cautious behaviour is likely to depend on many parameters of the training process e.g. in standard RL-training, it seems likely that the larger the discount factor, the higher the probability of cautious policies emerging.
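As a rough illustration of the discount factor point, consider a toy comparison that is an assumption of this sketch rather than anything from the literature: a "brazen" policy grabs one big reward now and then gets nothing (an irreversible move), while a "cautious" policy collects a small reward on every step forever. The function names and numbers are hypothetical.

```python
def brazen_value(big_reward: float) -> float:
    """One big reward now, then nothing: the move cannot be undone."""
    return big_reward


def cautious_value(small_reward: float, gamma: float) -> float:
    """Small reward every step forever: sum of small_reward * gamma**t for t >= 0."""
    return small_reward / (1.0 - gamma)


def prefers_caution(big_reward: float, small_reward: float, gamma: float) -> bool:
    """True when the discounted cautious stream beats the one-off grab."""
    return cautious_value(small_reward, gamma) > brazen_value(big_reward)


# With big_reward = 10 and small_reward = 1, the crossover sits at gamma = 0.9.
short_sighted = prefers_caution(10.0, 1.0, gamma=0.5)
far_sighted = prefers_caution(10.0, 1.0, gamma=0.99)
```

In this toy setup the cautious policy wins exactly when gamma exceeds 1 - small_reward / big_reward, so raising the discount factor pushes the agent towards the cautious choice.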
Can one make the claim, by applying the ideas of instrumental convergence, that cautious behaviour should arise for nearly all goals (whatever that means)? I'm leaning towards a 'yes' on this.
Can we run experiments already to try and detect this kind of emergent cautious behaviour? This seems difficult; certainly, you can design toy environments which punish the agent for being too brazen in direct/short-term optimisation towards its goal, and I believe that a well-calibrated RL training procedure will then learn to be cautious in these environments. However, what I'm proposing above is that a sufficiently advanced intelligence will deduce such cautious behaviour as an optimal policy from reasoning alone, without it coming via environmental feedback.
Rather than focusing on 'extreme values of unconstrained variables', which isn't very well defined and seems a bit of a nebulous concept to grasp more than vaguely, if my above hypotheses are true, it seems more direct to say that cautious agents will tend to seek reversible environment states. Defining this really precisely probably gets pretty hairy (technically, if you count the whole universe in the state, based on our current understanding of physics, you can never return exactly to a previous state (I think)). But it seems pretty likely that intelligent agents will have to operate and reason on restricted states e.g. at particular scales of matter, or in a specific subset of time and space. Under such restrictions, it is indeed possible for agents to locally reverse entropy - piece back together the broken glass, as it were - for some choices of actions, but not for others. A cautious agent will then, I hypothesise, try as far as possible to take actions that ensure locally-reversible states.
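One way to pin down "locally-reversible" on a restricted state space is as a graph property: a move is reversible if the starting state can be reached again from where the move lands. The sketch below is a toy under that assumption, and the state names and `GRAPH` are invented for illustration.

```python
def reachable(graph: dict, start: str) -> set:
    """All states that can be reached from `start` by following edges."""
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in graph.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reversible_moves(graph: dict, state: str) -> list:
    """Moves from `state` after which the agent can still get back to `state`."""
    return [nxt for nxt in graph.get(state, ()) if state in reachable(graph, nxt)]


# Hypothetical restricted state space: in this toy, a broken glass stays broken.
GRAPH = {
    "glass_on_table": ["glass_on_shelf", "glass_broken"],
    "glass_on_shelf": ["glass_on_table"],
    "glass_broken": [],
}

safe = reversible_moves(GRAPH, "glass_on_table")
```

A cautious agent in the sense hypothesised above would restrict itself to the moves this function returns, at least until it has learned enough to be confident about the rest.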
Trying to tie this back to the opening quote - do actions which preserve local-reversibility tend to correspond to non-extreme values of unconstrained variables? This seems far too fuzzy to say much with any conviction on, imo.