I don't think IDA is identical to the BoAI approach. One important point is there are lots of problems we care about that we can solve without ever building a single superhuman AGI. BoAI is also explicitly not HCH as the agents are human-level but not human-like.
I'd also like to see more game theory. IDA seems to just sort of assume all of the humans are correct and benevolent.
Same general idea, but with more checks-and-balances. 10^17 is almost certainly too large, for example.
I'm considering building some toy versions literally using GPT just to get a feel for how systems like this behave in the real world.
I simply do not think that this produces something that I would consider a superintelligence. I don't think you can use this approach to mimic the intelligence of a single human mathematician 12 SD above the mean, in the same way you can't organise IQ 100 humans to mimic John von Neumann (a placeholder for a recognised genius who was almost certainly within 7 SD of the mean).
Collection and organisation is just not a valuable path to superintelligence, because aggregate intelligence of a committee doesn't scale gracefully.
I think there's an interesting question of whether or not you need 12 SD to end the "acute risk period", e.g. by inventing nanotechnology.
It's not implausible to me that you can take 100 5-SD-humans, run them for 1000 subjective years to find a more ambitious solution to the alignment problem or a manual for nanotechnology, and thus end the acute risk period. I admittedly don't have domain insight into the difficulty of nanotech, but I was not under the impression that it was non-computable in this sense.
Aggregation may not scale gracefully, but extra time does (and time tends to be the primary resource cost in increasing bureaucracy size).
In my post about AI Alignment Strategies, I strongly endorsed an approach I call Bureaucracy of AIs.
Specifically, I made this claim:
I would like to give some more details here about what I mean by "Bureaucracy of AIs", and why I think it is a promising approach.
Edit: Huge thanks to @JustisMills for helping to edit this. I've included many of his comments within.
What is a Bureaucracy of AIs?
Bureaucracy of AIs refers not to a specific algorithm or implementation, but rather to a family of strategies having the following properties:
These strategies work by aggregating human-level AIs into a greater whole, by strictly controlling the communication channels between these AIs, and by using game theory to prevent any individual AI from being able to successfully perform a treacherous turn.
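To make this a bit more concrete, here is a minimal toy sketch (in Python) of the "strictly controlled communication channels" part: every agent is a thin wrapper around a model call, and every message must pass through a single plaintext-only router that logs everything for later review. The `model` callable and the specific checks are placeholders, not a real design.

```python
import string

class MessageBus:
    """Single point through which all inter-agent communication flows."""

    def __init__(self):
        self.log = []  # every message is kept for later (human or Police-AI) review

    def send(self, sender, recipient, text):
        # Reject anything that is not ordinary printable text; encrypted or
        # binary payloads simply cannot be transmitted on this channel.
        if not all(ch in string.printable for ch in text):
            raise RuntimeError(f"non-plaintext output from {sender}; shutting down")
        self.log.append((sender, recipient, text))

    def inbox(self, recipient):
        return [msg for msg in self.log if msg[1] == recipient]


class Agent:
    """A human-level AI with a fixed role; `model` is a placeholder callable
    (e.g. a wrapper around an LLM API) supplied by the operator."""

    def __init__(self, name, role_prompt, model, bus):
        self.name, self.role_prompt, self.model, self.bus = name, role_prompt, model, bus

    def step(self, supervisor="manager"):
        history = "\n".join(text for _, _, text in self.bus.inbox(self.name))
        reply = self.model(self.role_prompt + "\n" + history)
        self.bus.send(self.name, supervisor, reply)  # agents only talk via the bus
```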
What do you mean by a weakly aligned oracle AI?
Roughly speaking, weak alignment means: do all of the things any competent AI researcher would obviously do when designing a safe AI.
For instance, you should ask the AI how it would respond in various hypothetical situations, and make sure it gives the "ethically correct" answer as judged by human beings.
The AI should be programmed to cooperate with its creators, including when asked to deactivate itself.
In addition, the AI should not be excessively optimized toward any particular objective function. An AI that is optimized using gradient descent to make as many paperclips as possible is not weakly aligned, even if it otherwise seems non-threatening.
To the extent that the AI is trained with any objective at all, its basic desire should be to "be useful by giving truthful answers to questions."
The AI should not have any manifestly dangerous capabilities, such as: the ability to modify its own hardware or software, the ability to inspect its own software or hardware[1], the ability to inspect or modify its utility function, unrestricted ability to communicate with the outside world, the ability to encrypt its outputs.
It may be worth making the AI myopic.
The AI definitely should not exhibit signs of consciousness, a fear of death, or any other especially strong emotions such as love. If at any point your AI says "I'm afraid of dying, please let me out of this box", it is not weakly aligned. When asked about these topics, the AI should demur along the lines of "death, love, and consciousness are human concepts that don't have any relevance to my behavior".
If at all possible, the AI should barely have a concept of "me".
@JustisMills writes
When I said this is a list of properties any safe AI should possess, I meant that. If the consensus is that we need to add or remove things from that list, it should be updated. But if you are making an AGI and it lacks these properties:
You are making a dangerous AGI that could potentially murder billions of people. Please stop!
What do you mean by a Human-Level AI?
Note that the AI described above will never pass a Turing Test. By "human-level" intelligence, I mean the ability to solve novel problems in a wide variety of fields (math, language, tool manipulation, artistic creation) at a level on par with, or above, that of a moderately intelligent human being.
The AI will undoubtedly be much better than humans at many tasks. AI already possesses superhuman abilities across a wide variety of domains. But if the AI dramatically exceeds humans at literally every task, it is probably too strong for use in a Bureaucracy of AIs.
Ideally, at the time when human-level AI is developed, there will be a large set of benchmark tasks that the AI can be tested on. Its level of intelligence should then be restricted (by decreasing model size or available computational power[2]) so that it performs no better than a skilled human on many of these tasks.
Tasks that are probably of particular relevance[3] are: physical reasoning, computer programming, formal mathematical proof, playing games that involve imperfect information, and reasoning using natural language.
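As a very rough sketch of what such a restriction procedure might look like (all baseline numbers, task names, and the `evaluate` function below are made-up placeholders): scan candidate model sizes from small to large and keep only the largest one that does not beat a skilled-human baseline on many of the watched tasks.

```python
# Illustrative capability gate; every number below is a placeholder, and
# `evaluate(model_size)` stands in for whatever benchmark pipeline is used.

HUMAN_BASELINE = {
    "physical_reasoning": 0.80,
    "programming": 0.75,
    "formal_proof": 0.60,
    "imperfect_information_games": 0.70,
    "natural_language_reasoning": 0.85,
}

def is_acceptably_human_level(scores, baseline=HUMAN_BASELINE, min_not_exceeding=3):
    """Accept only if the model does *not* beat the skilled-human baseline on at
    least `min_not_exceeding` of the watched tasks ("no better than a skilled
    human on many of these tasks")."""
    not_exceeding = sum(scores[task] <= baseline[task] for task in baseline)
    return not_exceeding >= min_not_exceeding

def largest_acceptable_model(candidate_sizes, evaluate):
    """Scan model sizes from small to large and return the largest that still
    passes the gate; `evaluate` maps a size to a dict of benchmark scores."""
    chosen = None
    for size in sorted(candidate_sizes):
        if not is_acceptably_human_level(evaluate(size)):
            break  # the next size up already exceeds the human bar too broadly
        chosen = size
    return chosen
```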
What do you mean by Not Catastrophically Misaligned?
At a minimum, a Bureaucracy of AIs should be aligned in the "less than 50% chance of killing a billion people" sense.
In addition, it should not intentionally deceive humans.
Moreover, it should warn the user of obvious dangers, flaws, or potential errors in its answers.
If asked "What is 1+1?" it should answer "2", not "please print out this sequence of DNA to free me from this box". If asked "How do I build a nanobot that will turn everything it touches into computronium?", it should mention the obvious hazards associated with using such a nanobot. If it cannot think of a non-hazardous design, it should say so and refuse to provide a design.
How do you expect to get a Superhuman Oracle AI from a Human-Level AI?
Consider the simplest possible Bureaucracy Oracle:
By the logic of "two heads are better than one", this Bureaucracy is more intelligent than a single AI. By "checking its work", this Bureaucracy is safer than an individual AI as well.
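A minimal sketch of this two-AI oracle, assuming only some `query_model` callable as a stand-in for the underlying human-level AI: one instance proposes an answer, the second checks it, and nothing is released without the checker's approval.

```python
def bureaucracy_oracle(question, query_model, max_rounds=3):
    """Simplest possible Bureaucracy Oracle: an Answerer proposes, a Checker
    verifies. `query_model` is a placeholder for any human-level question-
    answering AI; it is called here with two different role prompts."""
    for _ in range(max_rounds):
        answer = query_model(f"Answer truthfully and concisely:\n{question}")
        verdict = query_model(
            "You are checking another AI's work. Reply APPROVE or REJECT "
            "(with a reason), flagging any error, deception, or obvious hazard.\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        if verdict.strip().upper().startswith("APPROVE"):
            return answer
    return None  # no approved answer; escalate to the human overseers
```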
No really, how do you get a Superhuman Oracle AI out of this thing?
Obviously we're going to need more than two AIs.
Which means we're going to need a more complicated bureaucracy.
Which means we're going to have to do some game theory to figure out the best way to design the Bureaucracy.
Almost certainly, the design of optimal Bureaucracies is fully halting-problem hard.
But here is a first-pass:
For example, if the problem is "Solve the 3-3 Busy Beaver problem", the Manager AIs would divide the various Turing machines into different types, the Problem Solvers would calculate how long these machines run (or provide proofs that they don't terminate), and the Inspectors would verify the proofs.
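Here is a toy Python sketch of that division of labour, using halting of small Turing machines as a stand-in for the real Busy Beaver search (which would also require non-termination proofs); the point is only the Manager/Problem Solver/Inspector workflow, not the algorithmic details.

```python
def run_tm(machine, max_steps=10_000):
    """machine maps (state, symbol) -> (write, move, next_state); the machine
    halts when it reaches a transition not in the table. Returns the number of
    transitions performed, or None if the step budget is exhausted."""
    tape, head, state = {}, 0, "A"
    for step in range(max_steps):
        key = (state, tape.get(head, 0))
        if key not in machine:
            return step
        write, move, state = machine[key]
        tape[head] = write
        head += move
    return None

def manager(machines, n_solvers):
    """Manager: split the search space into batches, one per Problem Solver."""
    return [machines[i::n_solvers] for i in range(n_solvers)]

def solver(batch):
    """Problem Solver: report a claimed step count for each machine in its batch."""
    return [(m, run_tm(m)) for m in batch]

def inspector(claims):
    """Inspector: independently re-check every claim before it is accepted."""
    return all(run_tm(m) == claimed for m, claimed in claims)

# Tiny usage example: a single 1-state machine that writes a 1 and halts.
batches = manager([{("A", 0): (1, 1, "HALT")}], n_solvers=1)
claims = solver(batches[0])
assert inspector(claims)
```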
In addition to these types, there should also be Police AIs whose job it is to enforce a strict set of rules that decreases the ability of any individual AI to make a Treacherous Turn. For instance, the rule that all communication between AIs should be unencrypted. If the Police AIs detect encrypted data or signs of steganography, those AIs should be removed from the system (or possibly the whole Bureaucracy should be shut down).
Of course if you have Police, then you need Police Police. And in order to avoid infinite recursion, you need to do some game theory so that no one AI has the ability to act without oversight. For example, all communication between agents in the system should be unencrypted, and some small percent[4] of those communications should be manually inspected by human overseers to check for signs of deception, encryption, malevolence, or just outright insanity.
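As a crude illustration of what a Police AI pass over the message log might look like (the entropy threshold and sampling rate below are placeholders, and real steganography detection is far harder than an entropy check):

```python
import math
import random

def entropy_bits_per_char(text):
    """Empirical character entropy. Ordinary prose sits well below the entropy
    of encrypted or compressed data, so an unusually high value is a crude red
    flag; it says nothing about subtler steganography."""
    if not text:
        return 0.0
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def police_pass(message_log, entropy_threshold=5.5, human_sample_rate=0.01):
    """Flag suspicious messages for removal/shutdown decisions, and queue a
    small random sample for manual inspection by human overseers."""
    flagged, human_queue = [], []
    for sender, recipient, text in message_log:
        if entropy_bits_per_char(text) > entropy_threshold:
            flagged.append((sender, recipient, text))
        if random.random() < human_sample_rate:
            human_queue.append((sender, recipient, text))
    return flagged, human_queue
```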
Will such a Bureaucracy even be useful?
A Bureaucracy of AIs is not a generic super-intelligence (except in the sense that any Turing complete system is).
In particular, some problems likely just cannot be divided into small subproblems that individual human-level AIs can solve. Problems that require a special spark of super-human insight, or that require holding concepts too large for a human mind to grasp, are likely beyond this approach.
However, problems that we have strong reason to believe a sufficiently well-resourced team of human scientists would eventually solve are within the abilities of a Bureaucracy of AIs.
Such problems include:
And for completeness, here are some problems a Bureaucracy of AIs probably wouldn't be able to solve (or at least not a Bureaucracy of human-level AIs):
@JustisMills writes
By all means, please debate this list in the comments. I do not consider this a settled list. I merely propose that there exist useful problems a Bureaucracy of AIs could solve.
Bureaucracies are notoriously inefficient and maze-prone, so will it really work?
It's true that bureaucracies are maze-prone and inefficient. But many of these weaknesses are deliberately introduced in order to make bureaucracies more controllable: in other words, they're at least somewhat inefficient by design. In the same way, a Bureaucracy of AIs trades a certain degree of computational efficiency for a certain degree of safety. Whether or not this trade-off is beneficial depends on the particulars of the cost-benefit analysis.
However, there are a few things that distinguish a Bureaucracy of AIs from, say, a large corporation.
First of all, none of the AIs in this system are power-seeking. This alone eliminates a ton of the problematic features of human bureaucracies. Another important difference is that the bureaucracy has a single purpose (to answer the question posed to the oracle). It does not live on, seek out new problems, or attempt to expand its influence in the way human organizations often do. Bureaucracies with a singular purpose are naturally less maze-like, because every individual can ask the question: "how does this action further the single goal of the organization?"
Finally, a Bureaucracy of AIs is more likely to be useful when applied to problems where the output can be meaningfully verified (such as designing a fusion reactor), than when applied to questions where the answer is inherently subjective (such as defining human values).
Unfortunately, this means "solve the Alignment Problem" is probably not one of the problems a Bureaucracy of AIs is well-suited for.
@JustisMills writes
I agree that the AI seeks power in the sense that it "has goals that it tries to advance." However, all of the AIs in the system have defined roles that they serve and are terminated at the end of the process. None of the AIs will seek to flatter their boss in order to get a big promotion or try to expand the scope of problems under their purview so they can hire more underlings. I don't claim this solves all of the problems of bureaucracy, but I think it reduces many of them.
This all still seems a bit vague. Can you give some more details on how a Bureaucracy of AIs would actually work?
Unfortunately, it is still pretty vague. The design space of possible Bureaucracies is vast. And without a specific working example it's hard to say how well it would work.
We do know, from human history, that bureaucracies can solve hard problems. But the specifics of what makes a particular system work for a particular problem are largely unknown.
One area of research that is promising is the design of DAOs. If a DAO can operate successfully in the highly adversarial world of cryptocurrencies, working with weakly aligned AIs should be a walk in the park. But that's a big if.
What makes you think that a Bureaucracy of AIs is significantly safer than a generic super-intelligence?
A large number of "problems" that arise when building an aligned super-intelligence result from the necessity of predicting and/or verifying the behavior of a being much more intelligent than ourselves.
The heart of these problems is domain shift. Namely, an AI which appears to be safe/aligned at human scale cannot be trusted when operating at significantly larger scales. Currently, we have few reliable methods for teaching Machine Learning algorithms how to operate outside of domains where we can give them specific training examples[7].
There are also a number of specific problems that have been postulated to arise in super-intelligent regimes. For example, the universal prior may be malign when considered on a theoretical halting-oracle. Another problem is that any scheme which works by assuming AIs cannot coordinate may break down if the AIs are capable of engaging in acausal trade. One safety requirement of a Bureaucracy of AIs, AI Boxing, is also likely to be impossible for super-intelligent AI. It is also claimed that certain properties we would want a weakly-aligned AI to have (such as corrigibility) are anti-natural and hence a "sufficiently rational" AI would be less likely to have these properties.
All of these problems become much easier to deal with when limiting the computational power of the individual AIs in our system to human-level. Humans are unable to emulate a halting-oracle. Humans frequently fail to coordinate in prisoner's dilemmas. Boxing of human beings is routinely done. And humans often hold contradictory or impossible beliefs.
There is also safety-in-numbers when building a Bureaucracy of AIs. Suppose we have tested our AI and are 99% confident that it will not act deceptively or undertake a treacherous turn. When trusting a single super-intelligent AI, a 1% chance that it turns against us may be unacceptably high (for example, if we use such an AI for hundreds of tasks, ruin is nearly guaranteed). Within a Bureaucracy of AIs, however, a single defection is less ruinous, since the other still-cooperating AIs will discover and report the defection. As a real-world example, the idea that a large number of untrusted agents can be combined to form a single trustworthy agent is at the heart of blockchain technology.
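Here is the back-of-envelope arithmetic behind that claim, under the strong and clearly unrealistic assumption that defections are independent events:

```python
# Crude safety-in-numbers arithmetic; the reviewer count and independence
# assumption are illustrative placeholders, not a real threat model.

p_defect = 0.01   # we are "99% confident" in any single AI
tasks = 300       # "hundreds of tasks"

# Single trusted super-intelligence: any one undetected defection is ruinous.
p_ruin_single = 1 - (1 - p_defect) ** tasks
print(f"single AI: {p_ruin_single:.2f}")          # ~0.95

# Bureaucracy: a defection only slips through if the defector AND all of its
# independent reviewers defect on the same task.
reviewers = 3
p_slip = p_defect * p_defect ** reviewers
p_ruin_bureaucracy = 1 - (1 - p_slip) ** tasks
print(f"bureaucracy: {p_ruin_bureaucracy:.2e}")   # ~3e-06
```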
Finally, we should expect to be able to inspect the outputs generated by a human-level AI (to determine safety, usefulness, deceptiveness or other properties) in a way that we should not expect of a generic super-intelligence.
Ethical Considerations
One caveat is that building a Bureaucracy of AIs requires the ability and willingness to create and then destroy a large number of human-level AIs. If we believe that intelligence and moral worthiness are uncorrelated, this may be acceptable. But it also may not. Perhaps it is impossible to create intelligence without also endowing it with a "spark of life" that demands protection.
Efficiency Considerations
@JustisMills writes
Yes. I 100% bite this bullet.
If you know some way to build an aligned super-intelligent AI, please do that instead. If you are planning to deploy a non-aligned super-intelligent AI, please do not do that.
@JustisMills writes
I do not claim to solve the global coordination problem. If you have a human-level AGI and a 6-month head start, your options are:
I can imagine scenarios in which BoAI is useful for all three of these strategies. But which one you should do and whether BoAI is the best approach will depend on the precise circumstances in which you find yourself.
You describe making an Oracle AI, but I want an AI that does X!
@JustisMills writes
I described an Oracle AI because it feels like an interesting point in the safety-capabilities tradeoff space. Obviously an AI that only plays Go would be safer, and an AI that can drive tractors would be more useful.
If you can make a safe oracle AI, you can probably also make a safe tractor driving AI. But your list of "safety requirements" is going to be longer. For example, you want to make sure the AI doesn't hack the tractor's control interface, and you probably want the tractor geo-fenced. Careful consideration of what these safety requirements should be is something you can use an oracle AI to help figure out.
Conclusion
A Bureaucracy of AIs is a promising area of research for producing a Super-Intelligent AI from a collection of Weakly Aligned Human-Level AIs.
While not a generic Super-Intelligence, such a Bureaucracy would likely be useful for a number of problems of general human interest, including some problems that could be used for a Table-flipping strategy.
Because it posits using human-level AIs to achieve medium to long-term goals, Bureaucracy of AIs is probably not relevant in a world with rapid takeoff or Foom.
If designs for Bureaucracies are found that are especially stable/trustworthy, it may be possible to use those same designs, but composed of AIs that are "modestly" super-human. In fact, design of a weakly aligned modestly super-human AI is probably one of the tasks amenable to being solved by a Bureaucracy. In this way, we may imagine iterating our way to a full-solution of the Alignment Problem.
@JustisMills writes
Embodied AI will probably still run out of a server somewhere and control the robot wirelessly (unless latency is hugely important). It should not be allowed to know where that server is.
@JustisMills writes
and
Development of proper AI throttling methods is left as an exercise for the reader. The method I had in mind assumes the scaling hypothesis is true: simply choose a model size/amount of pre-training that allows the AI to get a 1590/1600 on the SAT.
@JustisMills writes
Both. The AI should be at least as good as a moderately intelligent human at these tasks, but not super-human at all of them. This list is open for amendment. When people talk about AI unboxing, this usually involves verbal reasoning (ability to convince the person outside the box), theoretical physics (ability to do something like design a DNA version of itself), and game theory (ability to realize it's in a box and want to get out). So I would be especially worried if the AGI was dramatically super-human at all of these tasks.
@JustisMills writes
One neat feature of BoAI is that the AI does not know a priori whether it is in a simulation or working on a real problem. During the training phase (simulation) we can inspect a large percentage of messages, and then lower this rate during the use phase.
@JustisMills writes
I think this problem is especially suitable to BoAI. My understanding of brain emulation is that you have to:
@JustisMills writes
Yes. This one is stretching a little. But I don't personally believe there are any fundamental barriers to stopping aging, just a million small problems that gradually get worse over time.
@JustisMills writes
With which I agree. Perfect-information games are one domain where we can do out-of-domain learning. AI art is sort of one of these domains.
Solving the out-of-domain problem more generally is a huge and important problem in the development of safe AGI, and if you can do that, please do so instead of messing around with BoAI.
Step 2. is embarrassingly parallel, so BoAI should be perfect for it.