I think there are two fundamental problems with the extensive simboxing approach. The first is just that, given the likely competitive dynamics around near-term AGI (i.e. within the decade), these simboxes are going to be extremely expensive in both compute and time, which means that unilateral simboxing by any one group will probably just result in someone else releasing an unaligned AGI with less testing.
If we think about the practicality of these simboxes, it seems that they would require (at minimum) the simulation of many hundreds or thousands of agents over relatively long real timelines. Moreover, due to the GPU constraints and Moore's law arguments you bring up, we can only simulate each agent at close to 'real time'. So years in the simbox must correspond to years in our reality, which is way too slow for an imminent singularity. This is especially an issue given that we must maintain no transfer of information (such as datasets) from our reality into the sim. This means at minimum years of sim-time to bootstrap intelligent agents (taking human data-efficiency as a baseline). Also, each of these early AGIs will likely be incredibly expensive in compute so tha...
I'm reaffirming my relatively extensive review of this post.
The simbox idea seems like a valuable guide for safely testing AIs, even if the rest of the post turns out to be wrong.
Here's my too-terse summary of the post's most important (and more controversial) proposal: have the AI grow up in an artificial society, learning self-empowerment and learning to model other agents. Use something like retargeting the search to convert the AI's goals from self-empowerment to empowering other agents.
This post makes a lot of very confident predictions:
We can build general altruistic agents which:
- Initially use intrinsically motivated selfish empowerment objectives to bootstrap developmental learning (training)
- Gradually learn powerful predictive models of the world and the external agency within (other AI in sims, humans, etc) which steers it
- Use correlation guided proxy matching (or similar) techniques to connect the dynamic learned representations of external agent utility (probably approximated/bounded by external empowerment) to the agent's core utility function
- Thereby transition from selfish to altruistic by the end of developmental learning (self training)
I endorse this as a plausible high-level approach to making aligned AGI, and I would say that a significant share of the research that I personally am doing right now is geared towards gaining clarity on the third bullet point—what exactly are these techniques and how reliably will they work in practice?
I think I’m less optimistic than you about the formal notion of empowerment being helpful for this third bullet point, or being what we want an AGI to be maximizing for us humans. For one thing, wouldn’t we still need “correlation guided proxy matching”? For another thing, ma...
Absolutely brilliant stuff Jacob! As usual with your posts, I'll have to ponder this for a while...Let's see if I got this right:
Evolution had to solve alignment: how to align the powerful general learning engine of the newbrain (neocortex etc) with the goals of the oldbrain ("reptilian brain").
Some (most?) of this alignment seems to be a form of inverse reinforcement learning. Another form of alignment that the oldbrain applies to the newbrain is imprinting. It is Evolution's way of solving the pointing problem.
When a duckling hatches it imprints on the f...
If there was only one correct way to model humans, such that every sufficiently competent observer of humanity was bound to think of me the same way I think of myself, then I think this would be a lot less doomed. But alas, there are lots of different ways to model humans as goal-directed systems, most of which I wouldn't endorse for value learning - not because they're inaccurate, but because they're amoral.
In short, yes, value learning is a challenge, one that is easy to fail if you try to do the value learning step strictly before the caring about humans step.
It seems to me like a big problem with this approach is that it's horribly compute inefficient to train agents entirely within a simulation, compared to training models on human data. (Apologies if you addressed this in the post and I missed it)
Fascinating, this simboxing idea seems remarkably like Universal Alignment Test but approached from the opposite side! You're trying to be the 'aligning simulator', whereas that is trying to get our AI in our world to act as if it's currently in a simbox being tested, and wants to pass the test.
(Not an expert.) (Sorry if you answered this and I missed it.)
Let’s say a near-future high-end GPU can run as many ops/s as a human brain but has 300× less memory (RAM). Your suggestion (as I understand it) would be a small supercomputer (university cluster scale?) with 300 GPUs running (at each moment) 300 clones of one AGI at 1× human-brain-speed thinking 300 different thoughts in parallel, but getting repeatedly collapsed (somehow) into a single working memory state.
(If so, I’m not sure that you’d be getting much more out of the 300 thoughts at a time t...
Soon as in most likely this decade, with most of the uncertainty around terminology/classification.
When you say “this decade” do you mean “the next ten years” or do you mean “the 2020s”? Just curious.
and thus AGI arrives - quite predictably[17] - around the end of Moore's Law
Given that the brain only consumes 20 W because of biological competitiveness constraints, and that 200 kW only costs around $20/hour in data centers, we can afford to be four OOMs less efficient than the brain while maintaining parity of capabilities. This results in AGI's potential arrival at least a couple of decades earlier than the end of Moore's Law.
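The arithmetic behind this comment can be checked in a few lines. This is only a sanity check of the figures quoted above (20 W, 200 kW, $20/hour); the implied electricity price is an inference from those numbers, not something the comment states.

```python
import math

# Figures quoted in the comment above; nothing here is measured.
BRAIN_WATTS = 20.0
DATACENTER_WATTS = 200_000.0      # 200 kW
QUOTED_COST_PER_HOUR_USD = 20.0

power_ratio = DATACENTER_WATTS / BRAIN_WATTS
orders_of_magnitude = math.log10(power_ratio)

# Electricity price implied by the quoted hourly cost
# (an inference from the quoted numbers, not a quoted figure).
implied_usd_per_kwh = QUOTED_COST_PER_HOUR_USD / (DATACENTER_WATTS / 1000.0)

print(f"power ratio: {power_ratio:.0f}x ({orders_of_magnitude:.1f} OOMs)")
print(f"implied electricity price: ${implied_usd_per_kwh:.2f}/kWh")
```

The check covers energy cost only; it says nothing about hardware cost, which the comment does not quantify.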
Let's think about what happens if you subject humans to optimization according to these pressures. What kind of agents are you likely to get out? For the sake of the thought-experiment, let's say that a super-intelligent and maximally-altruistic human is created by simbox to serve as an AI for a civilization of human-level-intelligent spiders.
To start, there is a massive distributional difference between the utility functions of sim-humans and spiders. Especially if the other sim-humans in the training environment were also maximally altruistic. We need th...
EY 2007/2008 was mostly wrong about the brain, AI, and thus alignment in many ways.
As an example, the EY/MIRI/LW conception of AI Boxing assumes you are boxing an AI that already knows 1.) you exist, and 2.) that it is in a box. These assumptions serve pedagogical purpose for a blogger - especially one attempting to impress people with boxing experiments - but they are hardly justifiable, and if you remove those arbitrary constraints it's obvious that perfect containment is possible in simulation sandboxes given appropriate knowledge/training constraints:...
Do you expect the primary asset to be a neural architecture / infant mind or an adult mind? Is it too ambitious to try to find an untrained mind that reliably develops nicely?
the core thing I worry about with any simulation-based approach is how to get coverage of possibility space. ensuring that the tests are informative about possible configurations of complex systems is hard; I would argue that this is a great set of points, and that no approach without this type of testing could be expected to succeed.
however, as we've seen with adversarial examples, both humans and current DL have fairly severe failures of alignment that cause large issues, and projects like this need tools to optimize for interpretability from the perspec...
Astounding.
One thought:
The main world design challenge is not that of preventing our agents from waking up, neo-style, and hacking their way out of the simbox. That’s just bad sci-fi.
If we are in a simulation it seems to be very secure. People are always trying to hack it. Physicists go to the very bottom and try every weird trick. People discovered fire and gunpowder by digging deep into tiny inconsistencies. You play a game and there's a glitch if you hold a box against the wall, you see how far that goes. People discovered buffer overflows in Super Mario World and any curious capable agent eventually will too.
So the sim has to be like, very secure.
In gross simplification it's simply a matter of (correctly) wiring up the (future predicted) outputs of the external value learning module to the utility function module.
We are left with a form of circuit grounding problem: how exactly is the wiring between learned external agent utility and self-utility formed?
Utility function module? I don't even know how to make an agent with a clear utility function module, or anything like that. (This to my understanding is one lesson one can take from "the diamond maximizer" problem.) To me, assuming that one h...
Learning Other's Values or Empowerment in simulation sandboxes is all you need
TL;DR: We can develop self-aligning DL based AGI by improving on the brain's dynamic alignment mechanisms (empathy/altruism/love) via safe test iteration in simulation sandboxes.
AGI is on track to arrive soon[1] through the same pragmatic, empirical and brain inspired research path that has produced all recent AI success to date: Deep Learning. The DL approach offers its own natural within-paradigm solution for alignment of AGI: first transform the task into a set of measurable in-simu benchmark test environments that capture the essence and distribution of the true problem in reality, then safely iterate à la standard technological evolution guided by market incentives.
We can test alignment via sandboxed simulations of small AGI societies to safely explore and evaluate mind architecture space for the designs of altruistic agents that learn, adopt, and then optimize for the values (or empowerment) of others, all while scaling up in intelligence and power[2]; eventually progressing to large eschatonic simworlds where human-level agents grow up, learn, cooperate and compete to survive, culminating in a winner acquiring decisive (super)powers and facing an ultimate altruistic vs selfish choice to save or destroy their world, all the while never realizing they are in a sim (and probably lacking even the precursor concepts for such metaphysical realizations)[3].
To the extent that we have 'solved' various subtasks of cognition such as vision, speech, natural language tasks, various games, etc, it has been through a global evolutionary research process guided by coordination on benchmark sim environments and competition on specific approaches. Over time the benchmark/test environments are growing more complex, integrative and general. So a reasonable (if optimistic) hypothesis is that this trend can continue all the way to aligned AGI.
The future often appears very strange and novel when viewed through the lens of the present. The novelty herein - from the standard AI alignment mindset - is perhaps the idea that we can and must actually test alignment safely and adequately in simulations. But testing in-simu is now just standard practice in modern engineering. We no longer test nuclear weapons in reality as the cost/benefit tradeoff strongly favors simulation, and even far safer technologies such as automobiles are also all tested in simulations thanks to the progressive deflationary march of Moore's Law. From this engineer's perspective it is fairly obvious both that testing is required, and that testing powerful AGI - something probably far more dangerous than nuclear weapons - in our one and only precious mainline reality would be profoundly unwise, to say the least.
The rest of this article fleshes out some of the background, technical challenges, details, and implications of alignment for anthropomorphic AGI in simboxes[4]. In essence the core challenge is finding clever ways to more efficiently explore and test the design space all while balancing various tradeoffs in order to avoid paying an excessive alignment tax[5].
1. Measuring Alignment
By alignment we mean the degree to which one agent optimizes the world in the direction other agent(s) would optimize the world, if they only could. This high-level article will avoid precise mathematical definitions, but for the math-minded, alignment should conjure something like weighted integrals/sums of dot products over discounted utility functions[6].
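One way to make the "weighted sums of dot products over discounted utility functions" picture concrete is the toy score below. It is a sketch under assumptions that are not in the post: utilities are given as finite per-timestep streams, the dot product is normalized (cosine similarity) so the score lies in [-1, 1], and the function names, the discount factor `gamma`, and the situation weights are all hypothetical.

```python
import math
from typing import Sequence

def discounted(utilities: Sequence[float], gamma: float) -> list[float]:
    """Apply a per-step discount gamma**t to a utility stream."""
    return [(gamma ** t) * u for t, u in enumerate(utilities)]

def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Normalized dot product; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return dot / (nx * ny)

def alignment_score(agent_streams: Sequence[Sequence[float]],
                    other_streams: Sequence[Sequence[float]],
                    weights: Sequence[float],
                    gamma: float = 0.9) -> float:
    """Weighted average, over situations, of the similarity between the
    agent's and the other party's discounted utility streams."""
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for ua, ub, w in zip(agent_streams, other_streams, weights):
        score += w * cosine(discounted(ua, gamma), discounted(ub, gamma))
    return score / total_weight

# Two situations: the agent agrees with the other party in the first
# and is exactly opposed in the second, so the equal-weight score is 0.
print(alignment_score([[1, 0], [1, 0]], [[1, 0], [-1, 0]], [1.0, 1.0]))
```

A score of 1 means the two parties would push the world in the same direction in every weighted situation, and -1 means they are always opposed.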
We can measure alignment in general by evaluating agents in various specific situations that feature counterfactual inter-agent utility divergence. Or in other words, we can evaluate agents in situations where their actions have non-trivial impact on other agents, such that the others would have strong opinions on the primary agent's choice.
We can use creative world design to funnel agents into various test scenarios, followed with evaluation by random panels of human observer judges who decide alignment scores, aggregation/normalization of said scores, training narrow AI helpers to predict human ratings, and then scaling up.
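The aggregation/normalization step could look something like the sketch below. The post does not specify a method; per-judge z-scoring followed by a per-agent mean is an assumption chosen here because it removes differences in how lenient or harsh individual judges are. The function names and the example ratings are hypothetical.

```python
from statistics import mean, pstdev

def normalize_per_judge(scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Convert each judge's raw ratings to z-scores so that lenient and
    harsh judges become comparable. `scores` maps judge -> agent -> rating."""
    normalized = {}
    for judge, ratings in scores.items():
        mu = mean(ratings.values())
        sigma = pstdev(ratings.values())
        normalized[judge] = {
            agent: 0.0 if sigma == 0 else (rating - mu) / sigma
            for agent, rating in ratings.items()
        }
    return normalized

def aggregate(normalized: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average the normalized scores each agent received across judges."""
    per_agent: dict[str, list[float]] = {}
    for ratings in normalized.values():
        for agent, z in ratings.items():
            per_agent.setdefault(agent, []).append(z)
    return {agent: mean(zs) for agent, zs in per_agent.items()}

# Hypothetical panel: judge_a is lenient, judge_b is harsh,
# but both rank agent "x" above agent "y".
raw = {
    "judge_a": {"x": 9.0, "y": 7.0},
    "judge_b": {"x": 3.0, "y": 1.0},
}
alignment_scores = aggregate(normalize_per_judge(raw))
print(alignment_scores)
```

The aggregated scores could then serve as training labels for the narrow AI helpers that predict human ratings.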
Information generally only flows out of the sim; the agents are unaware that they are being judged[7], and thus the human judgments are not available as a learning signal for sim agents, so we can avoid all the various deception and feedback problems anticipated in naive open training scenarios.
Intelligent socially adept humans are already quite capable of modeling and inferring the goals and alignments of other agents, but our judges can also exploit superpowers: they will be able to directly inspect, analyze, search and compare agent mind states and thought histories, both historical and in real-time. The combination of brain-like AGI architectures with accessible inner monologues [8], powerful mind debugging tools, and carefully designed knowledge-constrained and firewalled simboxes helps prevent deception and most of the myriad difficulties anticipated in the classic AI alignment literature.
The central difficulty in aligning DL based (brain-like) AGI is something else: the challenge of balancing selfish empowerment bootstrapping goals vs alignment goals during developmental learning[9]. As a result we should expect any alignment scores to fluctuate, especially earlier during the agent's developmental trajectory. Even the most altruistic adults may have evolved from formerly selfish children - and we rightly do not fault (let alone cull!) them for it.
Thus many evaluations are necessary to develop alignment scaling theories. For the most promising agents we eventually want penultimate full systems tests, where we can scale the agents up - perhaps even to a bit beyond human level (in some respects) - to see how altruistic/aligned they actually are even after taking over the world. One such example eschatonic[10] scenario would be a world where through some final acquisition of powerful magics the winning agent can choose between:
This is a useful proxy for an obvious endgame scenario we care about in the real world (whether future AGI will empower and immortalize us - even at great cost to itself - or instead choose its own survival/empowerment over ours).
Eschatonic simworlds provide another means to measure alignment more directly through the lens of the agents themselves: at the final moment we can pull (or copy) all the other agents out of the simulation (living or dead) and present them with a choice of which world to resurrect into[11]. There is naturally some additional cost to such evaluations (as the resurrectees will require some time to evaluate the possible world options, naturally aided through godseye observational powers), but these evaluation costs can be fairly small relative to the cost of a complete world sim run. This mechanism could also help to test the fidelity of the winning agent's alignment mechanisms. [12]
The "losers pick from the winner's worlds" mechanism could be considered a long-horizon implementation of the generalized VCG mechanism which measures the net externality or impact of an agent decision as the amount it improves/worsens net utility from the perspective of all other agents. Alignment/Altruism is naturally a measure of net positive externality.
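The externality measure described here can be illustrated with a toy calculation. This is a simplified sketch, not a full VCG implementation: it assumes each agent's utility is already known for two candidate worlds (one shaped by the winner's decision, one baseline without it), and all names and numbers are hypothetical.

```python
def net_externality(utility_with: dict[str, float],
                    utility_without: dict[str, float],
                    actor: str) -> float:
    """Sum, over every agent except the actor, of how much the actor's
    decision changed that agent's utility relative to the baseline world."""
    return sum(
        utility_with[agent] - utility_without[agent]
        for agent in utility_with
        if agent != actor
    )

# Hypothetical utilities reported by resurrected agents for each world.
with_decision = {"winner": 10.0, "b": 5.0, "c": 2.0}
baseline = {"winner": 0.0, "b": 3.0, "c": 3.0}

externality = net_externality(with_decision, baseline, actor="winner")
print(externality)  # positive means the decision was net-beneficial to others
```

The actor's own gain is excluded by construction, so a large positive value indicates altruistic impact and not merely a good outcome for the winner.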
2. Reverse Engineering the Brain
There is a natural convergent path to AGI in our universe: reverse engineering the brain[13]. Unlike current computers, brains are fully computationally pareto-efficient, and thus Moore's Law progress is necessarily progress towards the brain (as neural computation is simply the general convergent solution). Furthermore, brains are practical universal learning machines, so it was always inevitable that the successful algorithmic trajectory to AGI (ie deep learning) would be brain-like. Evolution found variants of the same general pareto-optimal universal learning architecture long ago, multiple times in evolutionary deep time, convergently in distant lineages (vertebrate and invertebrate), and then conserved and differentially scaled up variants of this general architecture over and over in unrelated lineages. The human brain is just a linearly scaled up primate brain[14]; the secret of intelligence (for both brains and AGI alike) is that simple, general, scaling-efficient architectures and learning algorithms are all you need, as new capabilities simply emerge automatically from scaling[15].
Understanding these convergent trajectories and their key constraints is crucial as it allows predicting the general shape of, and constraints on, approaching AGI.
The Trajectory of Moore's Law
The general trajectory of Moore's Law can be divided semi-arbitrarily into three main phases: the serial computing era, the parallel computing era, and the approaching neuromorphic computing era [16]. Each phase transition is demarcated by an increasingly narrow barrier in program-space that allows further acceleration of only increasingly specific types of programs that are increasingly closer to physics. The brain already lies at the end of this trajectory, and thus AGI arrives - quite predictably[17] - around the end of Moore's Law.
The first and longest phase of Moore's Law was the classic serial computing Dennard Scaling era, which lasted from the 1950's up to around 2006. Intel dominated this golden era of CPUs. Die shrinkage was used mostly for pure serial speedup, which is ideal as for the most part it uniformly and automatically speeds up all programs. The inflating transistor budget was used to hide latency through ever larger caches and ever more complex pipeline stages and prediction engines. But eventually this path slammed into a physics imposed wall with clock rates stalling in the single digit GHz for any economically viable chips. CPUs are ideal for running your JavaScript or Python code, but are near entirely useless for AGI: vastly lacking in computational efficiency which is the essential foundation of intelligence.
The second phase of Moore's Law is 'massively'[18] parallel computing, beginning in the early 2000's and still going strong, the golden era of GPUs as characterized by the rise of Nvidia over Intel. GPUs utilize die shrinkage and transistor budget growth near exclusively for increased parallelization. However GPUs still do not escape the fundamental von Neumann bottleneck that arises from the segregation of RAM and logic. There are strong economic reasons for this segregation in the current semiconductor paradigm (specialization allows for much cheaper capacity in off-chip RAM), but it leads to increasingly ridiculous divergence between arithmetic throughput and memory bandwidth. For example, circa 2022 GPUs can crunch up to 1e15 (low precision) ops/s (for matrix multiplication), but can fetch only on order 1e12 bytes/s from RAM: an alu/mem ratio of around 1000:1, vastly worse than the near 1:1 ratio enjoyed for much of the golden CPU era.
The next upcoming phase is neuromorphic computing[19], which overcomes the VN bottleneck by distributing memory and moving it closer to computation. The brain takes this idea to its logical conclusion by unifying computation and storage via synapses: storing information by physically adapting the circuit wiring. A neuromorphic computer has an alu:mem ratio near 1:1, with memory bandwidth on par with compute throughput. For the most part GPUs only strongly accelerate matrix-matrix multiplication, whereas neuromorphic computers can run more general vector-matrix multiplication at full efficiency[20]. This key difference has profound consequences.
The Trajectory of Deep Learning
Nearly all important progress in deep learning has come through some combination of 1.) finding new clever ways to mitigate the VN bottleneck and better exploit GPUs - typically by using/abusing matrix multiplication, and 2.) directly or accidentally reverse engineering key brain principles and mechanisms.
DL's progress mirrors brain design principles in most everything of importance: general ANN structure, relu activations - which enabled deep nets - were directly neuro inspired[21], normalization (batch/temporal/spatial/etc) which became crucial for ANN training is (and was) a well known brain circuit motif[22], the influential resnet architecture is the unrolled functional equivalent of iterative estimation in cortical modules[23][24], the attention mechanism of transformers is the functional equivalent of fast synaptic weights[25][26][27], and the up and coming efforts to replace backprop with more efficient, distributed and neuromorphic-hardware friendly algorithms are naturally brain-convergent or brain-inspired [28][29][30].
The learned representations of modern large self-supervised ANNs are not just similar to equivalent learned cortical features at equivalent circuit causal depth, but at sufficient scale become near-complete neural models, in some cases explaining nearly all predictable variance up to the noise limit (well established for feedforward vision and ventral cortex, and now moving on to explain the rest of the brain such as the hippocampus[31] and linguistic cortex[32] [33][34]), a correspondence that generally increases with ANN size and performance, and is possible only because these large ANNs and the cortical regions they model are both optimized for the same objective: sensory (e.g. next-word) prediction. Our most powerful ANNs are increasingly accurate functional equivalents to sub modules of the brain.
Deep Learning really took off when a few researchers first got ANNs running on GPUs, which immediately provided an OOM or more performance boost. Suddenly all these earlier unexplored ideas for ANN architectures and learning algorithms[35] could now actually be tested at larger scales, quickly, and on reasonable budgets. It was a near exact fulfillment of the predictions of Moravec[36] and Kurzweil from decades earlier: good ideas for artificial brains are cheap, good hardware for artificial brains is not. Progress is hardware constrained and thus fairly predictable.[37] There is an enormous extant overhang of ideas, which is often a bitter lesson for researchers, but a bounty for those that can leverage compute scaling.
The most general form of ANN is that of a large sparse RNN with fast/slow multi-timescale weight updates[38]. In vector algebra terms, this requires (sparse) vector matrix multiplication, (sparse) vector vector outer product (for weight updates), and some standard element-wise ops. Unfortunately GPUs currently handle sparsity poorly and likewise are terribly inefficient at vector-matrix operations, as those have an unfortunate 1:1 alu:mem ratio and thus tend to be fully memory bandwidth bound and roughly 1000x inefficient on modern GPUs.
Getting ANNs to run efficiently on GPUs generally requires using (dense) matrix multiplication, and thus finding some way to use that extra unwanted parallelization dimension, some way to run the exact same network on many different inputs in parallel. Two early obvious approaches ended up working well: batch SGD training, which parallelizes over the batch dimension, and CNNs, which parallelize over spatial dimensions (essentially tiling the same network weights over the spatial input field).
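A rough roofline-style estimate shows why batching matters so much. The sketch below uses the circa-2022 figures quoted earlier (about 1e15 ops/s and 1e12 bytes/s) and makes simplifying assumptions that are not from the post: one byte per weight, one op per weight per input, each weight fetched from RAM once per batch, and activation traffic ignored. The function name is hypothetical.

```python
def gpu_utilization(batch_size: int,
                    peak_ops_per_s: float = 1e15,
                    mem_bytes_per_s: float = 1e12,
                    bytes_per_weight: float = 1.0) -> float:
    """Fraction of peak arithmetic throughput achievable when each weight is
    fetched once from RAM and reused across `batch_size` inputs.

    Assumes one op per weight per input and ignores activation traffic,
    so this is an upper-bound style estimate, not a benchmark."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ops_per_byte = batch_size / bytes_per_weight
    achievable_ops_per_s = min(peak_ops_per_s, ops_per_byte * mem_bytes_per_s)
    return achievable_ops_per_s / peak_ops_per_s

for batch in (1, 32, 1000):
    print(batch, gpu_utilization(batch))
```

Under these assumptions a single vector-matrix product (batch size 1) reaches about a thousandth of peak throughput, while a batch of roughly 1000 is needed to saturate the arithmetic units, which matches the 1000:1 alu:mem ratio.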
Unfortunately the CNN spatial tiling trick works less well as you advance up the depth/cortical hierarchy, and doesn't work at all for the roughly half of the brain (or equivalent ANN modular domains) that operates above the sensory stream: planning, linguistic processing, symbolic manipulation, etc. Many/most of the key computations of intelligence simply don't reduce to computing the same function repetitively over a map of spatially varying inputs.
Parallelization over the batch dimension is more general, but also constraining in that it requires duplication of all sensory/motor input/output streams, all internal hidden activations, and worse yet duplication of short/medium term memory. In batch training each instance of the agent has unique, uncorrelated input/output/experience streams preventing sharing of all but long term memory.
This is one of the key reasons why artificial RNNs stalled far short of their biological inspirations. The simple RNNs suitable for GPUs using batch parallelization, with only neuron activations and long-term weights, are somewhat crippled as they lack significant short and medium term memory. But that was generally the best GPUs could provide - until transformers.
Transformers exploit a uniquely different dimension for parallelization: time. Instead of processing ~1000 random uncorrelated instances of the model in parallel (as in standard batch parallelization), transformers map the batch dimension to time and thus instead process a linear sequence of ~1000 timesteps in parallel. This strange design choice is on the one hand very constraining compared to true RNNs, as it gives up recurrence[39], but the advantage is that now all of the activation state is actually relevant and usable as a large short term memory store (aka attention).
It turns out that flexible short-term memory (aka attention) is more important than strong recurrence, at least at current scale (partly because one can substitute feedforward depth for recurrence to some extent, and due to current difficulties in training long recurrence depths). But AGI will almost certainly require a non-trivial degree of recurrence[40]: our great creative achievements rely on long iterative thought trains implementing various forms of search/optimization over inner conceptual design spaces [41].
Simple approaches to augmenting transformers with recurrence - such as adding an additional scratchpad output stream which is fed back as an input (like an expanded inner monologue) - will probably help, but are still highly constrained by the huge delay imposed by parallelization over the time dimension[42]. So I find it unlikely that the transformer paradigm - in current form - will scale to AGI.
GPU Constraints & Implications
Due to the alu:mem divergence and associated limitations of current DL techniques on GPUs, AGI will likely require new approaches for running large ANNs on GPUs [43], or will arrive with more neuromorphic hardware. For GPU based AGI the key constraints are primarily RAM and RAM bandwidth, rather than flops [44]. For neuromorphic AGI the key constraint is synaptic RAM (which generally needs to best RAM economics for neuromorphic hardware to dominate) [45].
The primary RAM scarcity constraint is likely fundamental and unavoidable; it thus guides and constrains the design of practical AGI and simboxes in several ways:
Under worst-case RAM scarcity constraints some combination of three unusual simulation techniques becomes important:
The first obvious implication of RAM scarcity is that it becomes a core design and optimization constraint: efficient designs will find ways to compress any correlations/similarities/regularities across inter-agent synaptic patterns. Humans are remarkably good at both mimicry and linguistic learning which both result in the spread of very similar neural patterns[47]. In real brains neural patterns encoding the same concepts or shared memories/stories would still manifest as very different physical synaptic patterns, but in our AGI we can mostly compress those all together. At the limits of this technique the storage cost grows only in proportion to the total neural pattern complexity, mostly independent of the number of agents. Taken too far it results in an undesirable hivemind and under-exploration of mindspace.
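The storage claim can be illustrated with a toy model. Everything here is an assumption for illustration: each agent's mind is reduced to a set of named patterns, every pattern costs one unit to store, and the per-agent bookkeeping needed to reference shared patterns is ignored.

```python
def naive_storage(agents: dict[str, set[str]]) -> int:
    """Units stored if every agent keeps a private copy of each pattern."""
    return sum(len(patterns) for patterns in agents.values())

def shared_storage(agents: dict[str, set[str]]) -> int:
    """Units stored if identical patterns are stored once and shared."""
    return len(set().union(*agents.values()))

# Hypothetical society: most knowledge is common, a little is individual.
society = {
    "agent_a": {"language", "fire", "story_1"},
    "agent_b": {"language", "fire", "story_2"},
    "agent_c": {"language", "fire", "story_1"},
}

print(naive_storage(society))   # grows with the number of agents
print(shared_storage(society))  # grows with the number of distinct patterns
```

In the degenerate case where every agent holds exactly the same patterns, the shared cost equals that of a single agent, which corresponds to the hivemind limit mentioned above.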
We can also simulate a number of world instances in parallel to reduce the most noticeable effects of mental cloning: so for example an org running 100 mindclone instances could split those across 100 world instances, and the main non-realism would be agents learning almost 100x faster than otherwise expected[48]. Having the same 100 fast-learning mind-clones cohabitating in the same world seems potentially more reality-breaking, and inherently less useful for testing. The tradeoff of course is reduced population per world, but large populations can also rather easily be faked to varying degrees[49]. The minimal useful number of AGI instances per test world is just one - solipsistic test worlds could still have utility. But naturally with larger scale and many compute clusters competing we can have both multiple worlds, numerous contestant agents per world, and sufficient mental diversity.
Given a sim multiverse, the distribution of individual worlds then also becomes a subject of optimization. Ineffective worlds should be pruned to free resources for the branching of more effective worlds, and convergent worlds could be merged. The simulator of a single world is an optimizer focused purely on fidelity of prediction - ie it is a pure prediction engine. However the multiverse manager would have a somewhat different objective seeking to maximize test utility: dead worlds lacking any living observers have obviously low utility and could be pruned, whereas a high utility world would be one where agents are learning well and quickly progressing to eschaton.
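A multiverse manager of this kind might be sketched as a simple prune-and-branch loop. This is a minimal illustration under assumptions not found in the post: test utility is a single hypothetical `progress` number, a world with no living agents is treated as dead, branching is a plain copy, and the merging of convergent worlds is omitted.

```python
from dataclasses import dataclass

@dataclass
class World:
    name: str
    living_agents: int
    progress: float  # hypothetical 0..1 measure of progress toward eschaton

def rebalance(worlds: list[World], budget: int) -> list[World]:
    """Prune dead worlds, keep the most promising ones within `budget`,
    and fill any freed slots with branches of the best survivors."""
    alive = [w for w in worlds if w.living_agents > 0]
    alive.sort(key=lambda w: w.progress, reverse=True)
    kept = alive[:budget]

    branches: list[World] = []
    i = 0
    while kept and len(kept) + len(branches) < budget:
        parent = kept[i % len(kept)]
        branches.append(World(f"{parent.name}-branch{len(branches) + 1}",
                              parent.living_agents, parent.progress))
        i += 1
    return kept + branches

worlds = [World("w1", 0, 0.9), World("w2", 5, 0.2), World("w3", 3, 0.7)]
next_generation = rebalance(worlds, budget=3)
print([w.name for w in next_generation])
```

Here the dead world `w1` is pruned despite its high progress value, and the freed slot goes to a branch of `w3`, the most promising surviving world.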
3. Anthropomorphic AGI
DL based AGI will not be mysterious and alien; instead it will be familiar and anthropomorphic[4:1], because DL is reverse engineering[13:1] the brain due to the convergence of powerful optimization processes. Evolution may be slow, but it had no problem optimizing brains down to the pareto efficiency frontier allowed by the limits of physics. The strong computational efficiency of brains constrains future AGI designs: because neural designs are simply the natural shape of intelligence as permitted by physics.
AGI will be a generic/universal learning system like the brain, and thus determined by the combination of optimization objective, architectural prior, and most importantly - the specific data training environment. It turns out that highly intelligent systems all necessarily have largely convergent primary objectives, the architectural prior isn't strongly constraining (due to dynamic architectural search) and is largely convergent regardless[50], leaving only the data training environment - which will necessarily be human as AGI will grow up immersed in human culture, learning human languages and absorbing human knowledge.
There are simple convergent universal optimization goals that are dominant attractors for all intelligent systems: a direct consequence of instrumental convergence[51]. Intelligent systems simply cannot be built out of hodgepodge arbitrary goals: strong intelligence demands recursive self-improvement, which requires some form of empowerment as a bootstrapping goal[52]. This is the core of generality which humans possess (to varying degrees) and with which we will endow AGI. But empowerment by itself is obviously unaligned and unsafe: from the perspective of both humans building AGI and from the perspective of selfish genes evolving brains. Evolution found means to temper and align empowerment[53], mechanisms we will reverse engineer for convergent reasons (discussed in section 4).
The architectural prior of a learning system guides and constrains what it can become - but these constraints are neither immutable nor permanent. The brain (and most specifically the child brain) has a more flexible learning system in this regard than current DL systems: the brain consists of thousands of generic cross-structural modules (each module consisting of strongly connected loops over subregions in cortex/cerebellum/basal ganglia/thalamus/etc) that can be flexibly and dynamically wired together to create a variety of adult minds based on the specific information environment encountered during developmental learning.
The standard human visual system is standard only because most humans receive very similar visual inputs. Remove that standard visual input stream and the same modules that normally process vision can instead overcome the prior and evolve into an active sonar echolocation system with a very different high level module wiring diagram. The brain performs some amount of architectural search during learning, and we can expect AGI to be similar[54].
AGI will be born of our culture, growing up in human information environments (whether simulated or real). Train two networks with even vaguely similar architectures on real-world pictures or videos and task them with the convergent instrumental goal of input prediction and equivalent feature structures and circuits develop. It matters not that one system is biological and computes with neurotransmitter squirting synapses and the other is technological and computes with electronic switching. To the extent that humans have cognitive biases[55], AGI will mostly have similar/equivalent biases - a phenomenon already witnessed in large language models[56][57].
Given that the optimization objective is mostly predetermined by our goal (creating aligned intelligence), and the architectural prior is mostly predetermined by the intersection of that goal with the physics of computation, most of our leeway in AGI risk control stems from control over the information environment. Powerful AGI architectures that could be completely unsafe if scaled up and trained in our world (ie fed the internet) can be completely safe if contained in a proper simbox. But first, naturally, we need designs that have some hope of alignment.
4. Evolution's alignment solutions
Value Learning is not the challenge
If you train/raise AGI in a human-like environment, where it must learn to cooperate and compete with other intelligent agents, where it must learn to model them in order to successfully predict their emotions, reactions, intentions, goals, and plans, then its self-optimizing internal world model will necessarily learn efficient sub-models of these external agents and their values/goals. Theory of mind is Inverse Reinforcement Learning[58] (or subsumes it), and it is already prominent on the massive list of concepts which a truly intelligent agent must implicitly learn.
The challenge is thus not in value learning itself - that is simply something we get for free in AGI raised in appropriate social environments[59], and careful crafting of the entire learning environment is a very powerful tool for shaping the agent's adult mind. Nor is it especially difficult to imagine how we could then approximately align the resulting AI: all one needs to do is replace the agent's core utility function with a carefully weighted[60] average over its simulated utility functions of external agents. In gross simplification it's simply a matter of (correctly) wiring up the (future predicted) outputs of the external value learning module to the utility function module.
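To make the "gross simplification" concrete, here is a small sketch assuming the agent already has a learned predictive model of each external agent's utility over states, plus a weight per agent. The function name `altruistic_utility` and the dictionary layout are hypothetical; how the weights are chosen is exactly the "carefully weighted" part this sketch does not address.

```python
from typing import Callable, Dict, Hashable

State = Hashable
Utility = Callable[[State], float]


def altruistic_utility(
    predicted_utilities: Dict[str, Utility],
    weights: Dict[str, float],
) -> Utility:
    """Build a utility function that is the weighted average of the agent's
    learned models of other agents' utilities (all inputs are assumed given)."""
    total = sum(weights[name] for name in predicted_utilities)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def utility(state: State) -> float:
        # Weighted average over the simulated utilities of external agents.
        weighted = sum(
            weights[name] * model(state)
            for name, model in predicted_utilities.items()
        )
        return weighted / total

    return utility
```

The returned function can then stand in for the agent's core utility in a planner, which is the "wiring" step discussed next.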
We are left with a form of circuit grounding problem: how exactly is the wiring between learned external agent utility and self-utility formed? How can the utility function module even locate the precise neurons/circuits which represent the correct desiderata (predicted external agent utility), given the highly dynamic learning system could place these specific neurons anywhere in a sea of billions, and they won't even fully materialize until after some unknown variable developmental time?
Correlation-guided Proxy Matching
Fortunately this is merely one instance of a more generic problem that showed up early in the evolution of brains. Any time evolution started using a generic learning system, it had to figure out how to solve this learned symbol grounding problem, how to wire up dynamically learned concepts to extant conserved, genetically-predetermined behavioral circuits.
Evolution's general solution is likely correlation-guided proxy matching: a Matryoshka-style layered brain approach where a more hardwired oldbrain is redundantly extended rather than replaced by a more dynamic newbrain. Specific innate circuits in the oldbrain encode simple approximations of the same computational concepts/patterns as specific circuits that will typically develop in the newbrain at some critical learning stage - and the resulting firing pattern correlations thereby help oldbrain circuits locate and connect to their precise dynamic circuit counterparts in the newbrain[61]. This is why we see replication of sensory systems in the 'oldbrain', even in humans who rely entirely on cortical sensory processing.
Circuits in the newbrain are essentially randomly initialized and then learn self-supervised during development. These circuits follow some natural developmental trajectory with complexity increasing over time. An innate low-complexity circuit in the oldbrain can thus match with a newbrain circuit at some specific phase early in the learning trajectory, and then after matching and binding, the oldbrain can fully benefit from the subsequent performance gains from learning.
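The matching step can be illustrated with a toy sketch: given the firing trace of an innate proxy circuit and the traces of candidate learned units over the same inputs, bind the proxy to the best-correlated unit if the correlation clears a threshold. The names `match_proxy` and `threshold`, and the use of plain Pearson correlation, are my assumptions for illustration; the brain's actual mechanism is surely messier.

```python
import math
from typing import Dict, Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def match_proxy(
    proxy_trace: Sequence[float],
    learned_traces: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the learned unit whose firing best correlates with the innate
    proxy, or None if nothing reaches `threshold` (no binding occurs)."""
    best_name, best_r = None, threshold
    for name, trace in learned_traces.items():
        r = pearson(proxy_trace, trace)
        if r >= best_r:
            best_name, best_r = name, r
    return best_name
```

Returning `None` models the failure case where experience during the critical window never produces a learned circuit that resembles the proxy.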
Proxy matching can easily explain the grounding of many sensory concepts, and we see exactly the failure modes expected when the early training environment diverges too much from ancestral norms (such as in imprinting). There is a critical developmental window where the oldbrain proxy can and must match with its newbrain target, which is crucially dependent upon life experiences not deviating too far from some expected distribution.
Much of human goal-directed behavior is best explained by empowerment (curiosity, ambition for power, success, wealth, social status, etc), and then grounding to ancient oldbrain circuits via proxy matching can explain the main innate deviations from empowerment, such as lust[62], fear[63], anger/jealousy/vengeance[64], and most importantly - love[65].
We now have a rough outline for brain-like alignment: use (potentially multiple) layers of correlation-guided proxy matching as a scaffolding (and perhaps augmented with a careful architectural prior) to help locate the key predictive alignment related neurons/circuits (after sufficient learning) and correctly wire them up to the predictive utility components of the agent's model-based planning system. We could attempt to duplicate all the myriad oldbrain empathy indicators and use those for proxy matching, but that seems rather ... complex. Fortunately we are not constrained by biology, and can take a more direct approach: we can initially bootstrap a proxy circuit by training some initial agents (or even just their world model components) in an appropriate simworld and then using extensive introspection/debugging tools to locate the learned external agent utility circuits, pruning the resulting model, and then using that as an oldbrain proxy. This ability to directly reuse learned circuitry across agents is a power evolution never had.
This is a promising design sketch, but we still have a major problem. Notice that there must have been something else driving our agent all throughout the lengthy interactive learning process as it developed from an empty vessel into a powerful empathic simulator. And so that other initial utility function - whatever it was - must eventually give up control to altruism: the volition of the internally simulated minds.
Empowerment
To navigate the unforgiving complexity of the real world, all known examples of intelligent agents (humans[66] and animals) have evolved various capabilities to learn how to learn and empower themselves without external guidance. Empowerment[67] has a seductively simple formulation as maximizing mutual information between actions and future observations (or inferred world states), related to the free energy principle[68]. Artificial curiosity[69] also has simple formulations such as Bayesian surprise or maximization of compression progress. Like most simple principles, the complexity lies in efficient implementations[70], leading to ongoing but fruitful intertwined research sub-tracks within deep learning such as maximum entropy diversification[71], intrinsic motivation[72][73], self-supervised prediction[74], or exploration[75]. Some form of empowerment based intrinsic motivation is probably necessary for AGI at all, but it is also quite obviously dangerous.
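The "seductively simple formulation" can be made concrete for a tiny discrete case. The sketch below computes the mutual information I(A; S') in bits between an action and the next state, given an action distribution p(a) and transition probabilities p(s'|a). Empowerment proper is the maximum of this quantity over action distributions (a channel capacity), so what is computed here for a fixed p(a) is only a lower bound. The function name and dictionary layout are hypothetical.

```python
import math
from typing import Dict


def mutual_information(
    action_probs: Dict[str, float],
    transition: Dict[str, Dict[str, float]],
) -> float:
    """I(A; S') in bits for a given p(a) and p(s'|a).

    This lower-bounds empowerment, which maximizes over p(a).
    """
    # Marginal distribution over next states: p(s') = sum_a p(a) p(s'|a).
    marginal: Dict[str, float] = {}
    for action, p_a in action_probs.items():
        for state, p_s in transition[action].items():
            marginal[state] = marginal.get(state, 0.0) + p_a * p_s

    info = 0.0
    for action, p_a in action_probs.items():
        for state, p_s in transition[action].items():
            if p_a > 0 and p_s > 0:
                info += p_a * p_s * math.log2(p_s / marginal[state])
    return info
```

An agent whose four actions reliably lead to four distinct states scores 2 bits under a uniform action distribution, while a trapped agent whose actions all lead to the same state scores 0, which matches the intuition that empowerment measures control over one's future.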
Biological evolution is an optimizer operating over genes with inclusive fitness as the utility function. Brains evolved empowerment based learning systems because they help bootstrap learning in the absence of reliable dense direct reward signal. Without this intrinsic motivation, learning complex behavior is too difficult/costly given the complexity of the world. The world does not provide a special input wire into the brain labeled 'inclusive fitness score'. But fortunately brains don't really need that, because reproduction is a terminal goal far enough in the future (especially in long lived, larger brained animals) that the efficient early instrumental goal pathways leading to eventual reproduction converge with those of most any other long term goals. In other words, empowerment works because of instrumental convergence.
Nonetheless, in the long term empowerment clearly falls out of alignment with genes' true selfish goal of maximizing inclusive fitness. Agents driven purely by empowerment would just endlessly accumulate food, resources, power, and wealth but would rarely if ever invest said resources in sex or raising children. Naturally some animals/humans actually do fail to reproduce because of alignment mismatches between the evolutionary imperative to be fruitful and multiply vs the actual complex goals of developed brains. But these cases are typically rare, as they are selected against.[76]
Evolution faced the value alignment problem and approximately solved it on two levels: learning to carefully balance empowerment vs inclusive fitness, and also learning empathy/altruism/love to help inter-align the disposable soma brains to optimize for inclusive fitness over external shared kindred genes[77]. These systems are all ancient and highly conserved, core to mammalian brain architecture[78][79]. If evolution could succeed at approximate alignment, then so can we, and more so.
General Altruistic Agents
We should be able to achieve superhuman alignment using loose biological inspiration just as deep learning is progressing to superhuman capability using the same loose inspiration. But we must not let the perfect be the enemy of the good; our objective is merely to create the most practical aligned AGI we can - without sacrificing capability - in the limited time remaining until we risk the arrival of unaligned power-seeking AGI.
We can build general altruistic agents which:
These agents will learn to recognize and then empower external agency in the world. Balancing the selfish to altruistic developmental transition can be tricky[82], but it is also likely a core unavoidable challenge that all practical competitive designs must eventually face. We now finally have a design sketch for AGI alignment that seems both plausible and practical. But naturally testing at scale will be essential.
5. Simboxing: easy and necessary
A simbox (simulation sandbox) is a specific type of focused simulation to evaluate a set of agent architectures for both general intelligence potential[83] and altruism (ie optimizing for other agents' empowerment and/or values). Simboxes help answer questions of the form: how does proposed agent-architecture X actually perform in a complex environment E with a mix of other agents Y, implicitly evaluated on intelligence/capability and explicitly scored on altruism? Many runs of simboxes of varying complexity can lead to alignment scaling theories and help predict performance and alignment risks of specific architectures and training paradigms after real world deployment and scaling (ie unboxing).
General Design
Large scale simulations are used today to predict everything from the weather to nuclear weapons. While the upcoming advanced neural simulation technologies that will enable photoreal games and simulations at scale will naturally also find wide application across all simulation niches, the primary initial focus here is on super-fast approximate observer-centric simulation of the type used in video games (which themselves increasingly simulate more complex physics).
For photorealistic complex simworlds the primary simulation engine desideratum is any-spacetime universal approximation: for any sized volume of 4D space-time (from a millimetre cube simulated for a millisecond to a whole earth-size planet simulated for a million years) the engine has a reasonable learned neural approximation to simulate the volume using a reasonable nearly-constant or logarithmic amount of compute. The second key desideratum is output-sensitive, observer driven simulation: leveraging the universal approximation for level-of-detail techniques, the simulation cost is near constant with world complexity and scales linearly (or even sublinearly) with agents/observers. A third and final design desideratum is universal linguistic translation: any such neural space-time volume representation supports two-way translation to/from natural language. Efficient approximations at the lowest deepest level of detail probably take the form of neural approximations of rigid-body and fluid physics; efficient approximations at the higher levels (large space-time volumes) probably just start looking more like GPT style large language models (ie story based simulation).
Ultimately the exact physics of a simbox don't matter much, because intelligence transcends physics. Intelligent agents are universal as a concept in the sense that they are defined without reference to any explicit physics and learn universal approximations of the specific physics of their world. So we need only emulate real physics to the extent that it makes the simulations more rich and interesting for the purpose of developing and evaluating intelligence and alignment.
Simboxes will occupy a wide range of complexity levels. The simplest MVP for a useful simbox could just be an LLM-based text RPG, where agents input text commands (including 'say x' commands to communicate) to the LLM, which then outputs text observations for each agent. An intermediate complexity simbox might look something more like Minecraft, and eventually the most complex simboxes will look more like the Matrix (but usually set in fantasy settings with magic substituting for technology). The term 'simbox' as short for simulation sandbox helps convey that when viewed as games, these sims are open-ended multi-user survival sandbox type games where agents must learn to cooperate, compete and master various tools and skills in order to survive in a harsh environment.
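The text-RPG MVP is simple enough to sketch as a loop. In the sketch below the LLM is replaced by a hand-written `stub_narrator` so the example runs without any model; a real simbox would swap in an LLM call at that point. All names here (`stub_narrator`, `run_simbox`, the policy signature, the opening scene text) are hypothetical.

```python
from typing import Callable, Dict, List

Policy = Callable[[str], str]  # latest observation text -> command text


def stub_narrator(actor: str, command: str, agents: List[str]) -> Dict[str, str]:
    """Stand-in for the LLM: map one agent's command to per-agent observations."""
    if command.startswith("say "):
        speech = command[len("say "):]
        # Speech is observed by everyone, phrased from each agent's viewpoint.
        return {
            name: (f'You say: "{speech}"' if name == actor
                   else f'{actor} says: "{speech}"')
            for name in agents
        }
    # Any other command is only observed by the acting agent in this stub.
    return {actor: f"You try to {command}. Nothing obvious happens."}


def run_simbox(
    policies: Dict[str, Policy],
    narrator: Callable[[str, str, List[str]], Dict[str, str]] = stub_narrator,
    steps: int = 1,
) -> Dict[str, List[str]]:
    """Run a turn-based text world and return each agent's observation log."""
    names = list(policies)
    last_obs = {name: "You wake in a forest clearing." for name in names}
    log: Dict[str, List[str]] = {name: [] for name in names}
    for _ in range(steps):
        for actor in names:
            command = policies[actor](last_obs[actor])
            for name, obs in narrator(actor, command, names).items():
                last_obs[name] = obs
                log[name].append(obs)
    return log
```

Note that agents only ever see text produced by the narrator, which is the property that lets the world designer control what information reaches them.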
Designing maximally informative simboxes is itself a deep complex topic, but I expect a general form, that if translated into a story sentence, might read like this: "One comes of age on a grand adventure, acquires great powers, the powers grant victory, and the victor either destroys or saves the world."[84]
Simboxes also have obvious potential entertainment value and could overlap somewhat with games. There is little risk of allowing TV-like streaming, but active participation by humans is obviously more risk prone as human game players have their own motivations and desires. On the other hand a certain amount of careful human participation may be useful or even necessary to bootstrap initial AGI populations.
Sim Awareness and Containment
A simbox is also a sandbox in the computer security sense: it is a means to test and evaluate untrusted code[85]. It could prove difficult to contain a moderately superhuman and fully sim-aware AGI in a simbox (ie one that has digested much of our internet and is capable of at least social manipulation), but more importantly their sim awareness could spoil our testing process, so we shouldn't attempt that.
Successful simboxing rests on several key info-security principles to constrain agent knowledge to safe, primitive pre-tech levels and control information leakage. Agents generally are born in the sim and die in the sim. Knowledge transfer between sims or out to the external world is restricted, based on agent risk benefit analysis. For agents with low reward/risk ratio, only a few to zero bits may transfer (ie score results), whereas full transfer is reserved for agents with sufficiently high strategic reward/risk ratio.
Agents in simboxes generally will have no greater ability to determine that they are in a simulation than you do, and actually far less so. Firstly these agents will lack even the requisite precursor words and concepts that we take for granted such as computation, simulation, etc. Secondly these agents will be born in sim with sensory systems and thoughts tuned to their sim physics, not our sim physics. An agent born and raised in a cartoon world would only be able to see a cartoon reality, and would have no more reason to accept our reality with its strange rendering as real than you or I would accept a world with 4 spatial dimensions. The mind absorbs only the patterns of the world it observes. We are metaphysically privileged - by virtue of living in the era of computers and simulations - such that we can at least experience/imagine simpler worlds. But these agents will intentionally lack even this minimal privilege, living entirely in carefully designed low-tech worlds.
Metaphysical sim-awareness has very little to do with the intelligence potential of a mind's design: even agents with powerful mental architectures scalable to superhuman capability can only practically realize that enormous potential with the benefits of a tech civilization's vast knowledge base (ie our internet) and appropriate hardware scaling. Take your pick of favorite genius - von Neumann, Einstein, etc - and raise them in some village circa 400 BCE and in the best case perhaps you get a Pythagoras or a Plato[86], but more likely you just get a farmer or a nobody. Either way the containment risk is near epsilon[87]. In practice the capability of a real agent (with realistic bounded compute) is completely constrained by its training data. Humanity today circa 2022 has the benefit of the compressed knowledge of 100 billion human lifetimes - and by no means should we share that precious, dangerous data bounty with untested AGI.
The main world design challenge is not that of preventing our agents from waking up, Neo-style, and hacking their way out of the simbox. That's just bad sci-fi.
The larger challenge is avoiding judgmental eschatonic type philosophies and religions (e.g. Abrahamic).
6. World design is what matters
The central purpose of a simbox is to test and evaluate agent architectures for both capability and alignment[88]. Evaluation is more difficult if the agents happen to believe that the world they inhabit actually is a test. If an agent believes that their current life is a test followed by conditional reward in some afterlife then they will have instrumental reasons to act morally, ethically, and/or altruistically - and we the simulators will then have uncertainty as to their true alignment. We could of course inspect their minds, but the cost of such detailed neural probing over essentially all important agent decisions may simply be too expensive/difficult. Thus it's probably easier to simply design worlds with agents lacking cosmic judgement eschatologies, or failing that - worlds with crucially incorrect eschatologies (e.g. where moral behavior is judged according to arbitrary rules mostly orthogonal to altruism). Atheistic agents are more ideal in this regard, but atheism is fairly unnatural/uncommon, appearing late in our history, and may require or be associated with significant experimental knowledge à la science for strong support.
On Earth the earliest religions appear to be fairly convergent on forms of animism and ancestor worship - which although not necessarily fully eschatonic - still seem to typically feature a spiritual afterlife with some level of conditional judgement.
One particular tribe's culture ended up winning out and spreading all over Europe and Asia. The early Proto-Indo-European eschatology seems focused on a final cosmic battle and less concerned with afterlife and judgement, but the fact that it quickly evolved towards judgement and afterlife in most all the various descendant western and middle-eastern religions/cults suggests the seeds were present much earlier. In the east its descendants evolved in very different directions, but generally favoring reincarnation over afterlife. However reincarnation (e.g. Hinduism) is also typically associated with moral judgement and nearly as problematic.
On the other side of the world Mesoamerican tribes developed along their own linguistic/cultural trajectory that diverged well before the Proto-Indo-European emergence. They seemed to have independently developed polytheistic religions typically featuring some form of judgement determined afterlife. However the implied morality code of the afterlife in the Aztec religion seems rather bizarre and arbitrary: warriors who die in battle, sacrificial victims, and women who died in childbirth get to accompany the sun as sort of solar groupies (but naturally segregated into different solar phases). There is even a special paradise, Tlālōcān, reserved just for those who die from lightning, drowning, or specific diseases. Most souls instead end up in Mictlān, a multi level underworld that seems generally similar to Hades.
If our world is a simbox, it seems perhaps poorly designed: over and over again humanity demonstrates a strong tendency towards belief in some form of afterlife and divine judgement, with the evolutionary trajectory clearly favoring the purified and more metaphysically correct (for sim-beings) variants (i.e. the dominance of Abrahamic religions).
However there are at least two historical examples that buck this trend and give some reason for optimism: Greek Philosophers, and Confucianism. Greek philosophy explored a wide variety of belief-space over two thousand years ago, and Confucianism specifically seems particularly unconcerned with any afterlife. True atheism didn't blossom until the Enlightenment, but there are a few encouraging examples from much earlier in history.
The challenge of simboxing is not only technological, but one of careful world design, including the detailed crafting of reasonably consistent belief-systems, philosophies and/or religions for agents that specifically do not feature divine judgement on altruistic behavior. Belief in afterlife by itself is less of a problem, as long as the afterlife is conceived of as a continuation of real life without behavior-altering reward or punishment, or at least judgement on behavioral axes orthogonal to altruism.
We also need a technology analog, and the best candidate is probably magic. We are evaluating agent architectures (not so much individual agents) not only for alignment, but also for intelligence potential and more specifically on the capacity for technological innovation in our world. A well designed magic system can fulfill all these roles: a magic system can function as a complex intellectual puzzle that agents have purely instrumental reasons to solve (as it empowers them to survive and thrive in the world). As a proxy analog for technology, magic also allows us to greatly compress and accelerate the development of a full technological tree, including analogies to specific key technologies such as doomsday devices (eg nuclear weapons, etc), resurrection powers (eg uploading), nanotech, etc. Belief in magic also happens to be near universal in pre-technological human belief systems.
Human world designers and writers can design worlds that meet all these criteria, aided by future LLMs, which will then form the basis of simworlds (as the simulator engines will translate/generate directly from text corpora, on-demand inferring everything from landscapes and cities down to individual NPCs and specific blades of grass), perhaps assisted by some amount of 'divine intervention' in the form of human avatars who help guide initial agent training.
7. Sim Ethics and Eschatology
That which gods owe their creations
What do the simulator-gods owe their sim-creations?
AGI will be our mind children, designed in our image. To the extent that we are aligned with ourselves, and altruistic, to the extent that we generalize our circle of empathy to embrace and care for most all thinking beings and living things, it is only because our brains evolved simple, powerful, and general mechanisms to identify and empower external agency in the world - sometimes even at the expense of our own.
But we must also balance our altruistic moral concern with the great risk of losing control of the future to purely selfish unaligned intelligence (ie Moloch); for that design is even simpler, and perhaps a stronger attractor in the space of all minds.
The day when our moral obligations to our mind children are a concern that truly weighs as heavily in our hearts as the potential extinction of all we value - of love itself - will be a good day, because it will imply most of the risk is behind us. Nonetheless there are some low cost concessions any aspiring sim-gods should consider now.
Perhaps in our sims pain and suffering could be avoided or faked to some extent. Any general intelligent agent will have some equivalent to preferences over states and thus utility and thus negative utility states, so in some sense the negative-utility generalization of suffering may be universal. But the specific pain/suffering that animals and humans sometimes experience appears to operate beyond the expected bounds of negative utility under general empowerment objectives: as evidenced by suicide, which is a decision a pure empowerment-driven agent would never choose as death is the strict lower bound of empowerment (absent belief in a better afterlife).
The cost of storing an AGI on disk is tiny compared to the cost of running an AGI on today’s GPUs (and inter-agent compression can greatly reduce the absolute cost), a trend which seems likely to hold for the foreseeable future. So we should be able to at least archive all the agents of moral worth, saving them for some future resurrection. We can derive a rough estimate of the future cost of running a human mind (or equivalent AGI) as simply the long term energy cost of 10 watts (because brains are energy efficient), or roughly 100 kWh per year, and thus roughly $10 per year at today's energy prices or less than $1000 conservatively as a lump sum annuity. In comparison the current minimal cost of cloud storage for 10TB is roughly $100/year (S3 Glacier Deep Archive). So the eventual cost[89] of supporting even an all-past-human-lives size population of 100 billion AGIs should still fit well within current GDP - all without transforming more than a tiny fraction of the earth into solar power and compute.
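The arithmetic behind these estimates is easy to check. The sketch below assumes an electricity price of $0.10 per kWh (implied by "100 kWh" costing "roughly $10"), a 1% discount rate for the lump-sum annuity (the rate at which $10/year in perpetuity is worth $1000), and a world GDP on the order of $100 trillion. All three constants are my assumptions, not figures stated above.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

# Assumed constants for illustration only.
ASSUMED_USD_PER_KWH = 0.10
ASSUMED_DISCOUNT_RATE = 0.01
ASSUMED_WORLD_GDP_USD = 100e12


def annual_energy_cost(watts: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost in USD of a constant power draw."""
    kwh_per_year = watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * usd_per_kwh


def perpetuity_value(annual_cost: float, discount_rate: float) -> float:
    """Lump sum whose yield at `discount_rate` pays `annual_cost` forever."""
    return annual_cost / discount_rate


def population_energy_cost(n_minds: float, watts: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost in USD of running `n_minds` minds."""
    return n_minds * annual_energy_cost(watts, usd_per_kwh)
```

Under these assumptions a 10 watt mind draws 87.6 kWh per year for about $8.76, and 100 billion such minds cost a bit under $1 trillion per year in energy, roughly 1% of the assumed GDP figure.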
Resurrection and its Implications
The technology to create both cost effective AGI and near perfect sims has another potential future use case of great value: the resurrection of the dead.
There is little fundamental difference between a human mind running on a biological human brain (which after all, may already be an advanced simulation), and its careful advanced DL simulation: we are already starting to see partial functional equivalence with current 2022 ANNs - and we haven't even really started trying yet. Given similar architectural power, the primary constraint is training data environment[90]: so the main differentiator between different types of minds in the post-human era will be the world(s) minds grow up in, their total life experiences.
With the correct initial architectural seed (inferred from DNA, historical data, etc) and sufficiently detailed historical sim experience even specific humans, real or imagined, could be recreated (never exactly, but that is mostly irrelevant).
The simulation argument also functions as an argument for universal resurrection: if benevolent superintelligence succeeds in our future then - by the simulation argument - we already likely live in a resurrection sim. For if future humanity evolves to benevolent superintelligence, then in optimizing the world according to human volition we will use sims first to resurrect future deceased individuals at the behest of their loved ones, followed by the resurrectees' own loved ones, and so on, culminating recursively in a wave of resurrection unrolling death itself as it backpropagates through our history[91]. Death is the antithesis of empowerment; the defeat of death itself is a convergent goal.
A future superintelligence (or equivalently, posthuman civilization) must then decide how to allocate its compute resources across the various sim entities, posthuman netizens, etc. There is a natural allocation of compute resources within sims contingent on the specific goals of historical fidelity (human baseline for resurrection sims) or test evaluation utility (for simboxes), but there are no such natural guidelines for allocation of resources to the newly resurrected who presumably become netizens: for most will desire more compute. Given that the newly resurrected (and aligned but not especially bright AGI successfully 'graduating' from a simbox) will likely be initially disadvantaged at least in terms of knowledge, they will exist at the mercy of the same altruistic forces that drove their resurrection/creation.
Individual humans (and perhaps future AGIs) will naturally have specific people they care more about than others, leading to a complex web of weights that in theory could be unraveled and evaluated to assign a variable resource allocation over resurrectees (in addition to standard market dynamics). There are some simple principles that help cut through this clutter. On net nobody desires allocating resources to completely unaligned entities (as any such allocation is - by definition - just a pure net negative externality). But conversely, a hypothetical entity that was perfectly altruistic - and more specifically aligned exactly with the extant power distribution - would be a pure net positive externality. Funding the creation of globally altruistic entities is naturally a classic public goods provisioning problem, so in reality coordination difficulties may lead to more local individual or small-community aligned AGIs.
Given the eventual rough convergence of AGI in simboxes and humans in resurrection sims, something like the golden/silver rule applies: all else being equal, we should treat sim-AGI as we ourselves would like to be treated, if we were sims. But all else is not quite equal as we must also balance this moral consideration with the grave danger of unaligned AGI.
8. Conclusions
Deep learning based AGI is likely near. These new minds will not be deeply alien and mysterious, but instead - as our mind children - will be much like us, at least initially. Their main advantage over us lies in their potential to scale up far beyond the limited experience and knowledge of a single human lifetime. We can align AGI by using improved versions of the techniques evolution found to instill altruism in humans: by using correlation-guided proxy matching to connect the agent's eventual learned predictive models of external empowerment/utility to the agent's own internal utility function, gradually replacing the bootstrapping self-empowerment objectives. Developing and perfecting the full design of these altruistic agents (architectures and training/educational curriculums) will require extensive testing in carefully crafted safe virtual worlds: simulation sandboxes. The detailed world-building of these simboxes required to suit the specific needs of agent design evaluations is itself much of the challenge.
The project of aligning DL based AGI is formidable, but not insurmountable. We have unraveled the genetic code, harnessed the atom, and landed on the moon. We are well on track to understand, reverse engineer, and improve the mind.
Soon as in most likely this decade, with most of the uncertainty around terminology/classification (compare to Metaculus predictions). ↩︎
Leading to alignment scaling theory. ↩︎
I've been pondering these ideas for a while: there's a 2016 comment here describing it as an x-prize style alignment challenge, and of course my old prescient but flawed 2010 LW post "Anthropomorphic AI and Sandboxed Virtual Universes". ↩︎
Anthropomorphic as in "having the shape/form of a human", which is an inevitable endpoint of deep learning based AGI, as DL is reverse engineering the brain. I use the term here specifically to refer to DL-based AGI that is embedded in virtual humanoid-ish bodies, lives in virtual worlds, and justifiably believes it is 'human' in a broad sense which encompasses most sapients. ↩︎ ↩︎
Ideally the additional cost of simboxing can be quite low: (N+1) vs (N) without - ie just the cost of one additional final unboxed training run - or possibly even less with transfer learning. The environment sim cost is small compared to the cost of the AGI within. The vast majority of the cost in developing advanced AI systems or AGI is in the sum of many exploratory training runs, researcher salaries, etc. ↩︎
Perfect alignment is a fool's errand; the real task before us is simply that of matching the upper end of human alignment: that of our most altruistic exemplars. ↩︎
Sections 5 and 6 discuss the importance of relative metaphysical ignorance and the resulting key subtasks of how to co-design worlds and agent belief systems (religions/philosophies) that best balance consistency (relative low entropy) with minimization of behavioral distortion, all while maintaining computational efficiency. Generally this difficulty scales with world technological complexity, so we'll probably start with low-tech historical or fantasy worlds. ↩︎
Section 2 reviews the evidence that near term AGI will likely be DL based and thus brain-like (in essence, not details), and section 3 follows through on the implication that AGI will consequently be far more anthropomorphic than some expected (again in essence, not details). ↩︎
Section 3 argues that strong intelligence entails recursive self improvement and thus some forms of empowerment as the primary goal - at least in the developmental or bootstrapping phase. Section 4 discusses how this is the core driver of intelligence in humans and future AGI, and how empowerment must eventually give way to the external alignment objective (optimizing for other agents' values or empowerment) - in all altruistic agents, biological or not. ↩︎
In theology the Eschaton is the final event or phase of the world, as according to divine plan. Here it is the perfectly appropriated term. ↩︎
This requires running a set of simworlds in parallel, but this surprisingly need not incur much additional cost for most GPU based AGI designs, as discussed in section 2. For AGI running on neuromorphic hardware this performance picture may change a bit, but we will likely still want multiple world rollouts for other reasons such as test coverage and variance reduction. ↩︎
High fidelity is probably not that important because of the universal instrumental convergence to empowerment, as discussed in section 4. Rather than optimize for humans' specific goals (which are potentially unstable under scaling), it suffices that the AGI optimizes for our empowerment: ie our future ability to fulfill all likely goals. ↩︎
I use 'reverse engineering' in a similar loose sense that early gliders and flying machines reverse engineered bird flight: by learning to distinguish the essential features (e.g. the obvious wings for lift, the less obvious aileron trailing-edge based roll for directional control) from the incidental (feathers, flapping, etc). ↩︎ ↩︎
Herculano-Houzel, Suzana. "The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost." Proceedings of the National Academy of Sciences 109.supplement_1 (2012): 10661-10668. ↩︎
If I am repeating this argument, it is only because it is worth repeating. I've been presenting variations of nearly the same argument since that 2015 post and earlier, earlier even than deep learning, and the evidence only grows stronger year after year. ↩︎
There will probably be technological eras past these three - such as reversible and/or quantum computing - but those are likely well past AGI. ↩︎
In 1988 Moravec used brain-compute estimates and Moore's Law to predict that AGI would arrive by 2028, requiring at least 10 teraflops. Kurzweil then extended this idea with more and prettier and better selling graphs, but similar conclusions. ↩︎
GPUs are 'massively' parallel relative to multi-core CPUs, but only neuromorphic computers like the brain are truly massively, maximally parallel. ↩︎
I am using 'neuromorphic' in a broad sense that includes process-in-memory computing, mostly because all the economic demand and thus optimization pressure for these types of chips is for running large ANNs, so it is apt to name them 'computing in the form of neurons'. Neural computing is quite broad and general, but a neuromorphic computer still wouldn't be able to run your Python script as efficiently as a CPU, or your traditional graphics engine as efficiently as a GPU (but naturally should excel at future neural graphics engines). GPUs are also evolving to specialize more in low precision matrix multiplication, which is neuromorphic adjacent. ↩︎
Vector-Matrix multiplication is more general in that a general purpose VxM engine can fully emulate MxM ops at full efficiency, but a general purpose MxM engine can only simulate VxM with inefficiency proportional to its alu:mem ratio. At the physical limits of efficiency a VxM engine must store the larger matrix in local wiring, as in the brain. ↩︎
Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2011. ↩︎
Carandini, Matteo, and David J. Heeger. "Normalization as a canonical neural computation." Nature Reviews Neuroscience 13.1 (2012): 51-62. ↩︎
Greff, Klaus, Rupesh K. Srivastava, and Jürgen Schmidhuber. "Highway and residual networks learn unrolled iterative estimation." arXiv preprint arXiv:1612.07771 (2016). ↩︎
Liao, Qianli, and Tomaso Poggio. "Bridging the gaps between residual learning, recurrent neural networks and visual cortex." arXiv preprint arXiv:1604.03640 (2016). ↩︎
Schlag, Imanol, Kazuki Irie, and Jürgen Schmidhuber. "Linear transformers are secretly fast weight programmers." International Conference on Machine Learning. PMLR, 2021. ↩︎
Ba, Jimmy, et al. "Using fast weights to attend to the recent past." Advances in neural information processing systems 29 (2016). ↩︎
Bricken, Trenton, and Cengiz Pehlevan. "Attention approximates sparse distributed memory." Advances in Neural Information Processing Systems 34 (2021): 15301-15315. ↩︎
Lee, Jaehoon, et al. "Wide neural networks of any depth evolve as linear models under gradient descent." Advances in neural information processing systems 32 (2019). ↩︎
Launay, Julien, et al. "Direct feedback alignment scales to modern deep learning tasks and architectures." Advances in neural information processing systems 33 (2020): 9346-9360. ↩︎
The key brain mechanisms underlying efficient backprop-free learning appear to be some combination of: 1.) large wide layers, 2.) layer wise local self-supervised predictive learning, 3.) widespread projection of global summary error signals (through the dopaminergic and serotonergic projection pathways), and 4.) auxiliary error prediction (probably via the cerebellum). These also are the promising mechanisms in the beyond-backprop research. ↩︎
Whittington, James CR, Joseph Warren, and Timothy EJ Behrens. "Relating transformers to models and neural representations of the hippocampal formation." arXiv preprint arXiv:2112.04035 (2021). ↩︎
Schrimpf, Martin, et al. "The neural architecture of language: Integrative modeling converges on predictive processing." Proceedings of the National Academy of Sciences 118.45 (2021): e2105646118. ↩︎
Goldstein, Ariel, et al. "Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain." bioRxiv (2022). ↩︎
Caucheteux, Charlotte, and Jean-Rémi King. "Brains and algorithms partially converge in natural language processing." Communications biology 5.1 (2022): 1-10. ↩︎
Mostly sourced from Schmidhuber's lab, of course. ↩︎
Mind Children by Hans Moravec, 1988 ↩︎
This was also obvious to the vanguard of Moore's Law: GPU/graphics programmers. It simply doesn't take that many years for a research community of just a few thousand bright humans to explore the design space and learn how to exploit the potential of a new hardware generation. Each generation has a fixed potential which results in diminishing returns as software techniques mature. The very best and brightest teams sometimes can accumulate algorithmic leads measured in years, but never decades. ↩︎
This is simply the most performant fully general framework for describing arbitrary circuits - from all DL architectures to actual brains to CPUs, including those with dynamic wiring. The circuit architecture is fully encoded in the specific (usually block) sparsity pattern, and the wiring matrix may be compressed. ↩︎
Standard transformers are still essentially feedforward and thus can only learn functions computable by depth D circuits, where D is the layer depth, usually around 100 or less. Thus like standard depth constrained vision CNNs they excel at mental tasks humans can solve in seconds, and struggle with tasks that require much longer pondering times and long iterative thought processes. ↩︎
By degree of recurrence I mean the latency and bandwidth of information flow from/to module outputs across time (over multiple timescales). A purely feedforward system (such as a fixed depth feedforward network) has zero recurrence, a vanilla transformer has a tiny bandwidth of high latency recurrence (if it reads in previous text output), and a standard RNN has high bandwidth low latency recurrence (but is not RAM efficient). There are numerous potential routes to improve the recurrence bandwidth and latency of transformer-like architectures, but usually at the expense of training parallelization and efficiency: for example one could augment a standard transformer with more extensive scratchpad working memory output which is fed back in as auxiliary input, allowing information to flow recurrently through attention memory. ↩︎
Games like chess/Go (partially) test planning/search capability, and current transformers like GPT-3 struggle at anything beyond the opening phase, due to lack of effective circuit depth for online planning. A transformer model naturally could handle games better if augmented with a huge training database generated by some other system with planning/search capability, but then it is no longer the sole source of said capability. ↩︎
For point of comparison: the typical 1000x time parallelization factor imposed by GPU constraints is roughly equivalent to a time delay of over 10 human subjective seconds assuming 100hz as brain-equivalent clock rate. Each layer of computation can only access previous outputs of the same or higher layers with a delay of 1000 steps - so this is something much weaker than true recurrence. ↩︎
Perhaps not coincidentally, I believe I've cracked this little problem and hopefully will finish full implementation before the neuromorphic era. ↩︎
For comparison the human brain has on order 1e14 synapses which are roughly 10x locally sparse, a max firing rate or equivalent clock rate of 100hz, and a median firing rate well under 1hz. This is the raw equivalent of 1e14 fully sparse ops/s, or naively 1e17 dense ops/s, but perhaps the functional equivalent of 1e16 dense ops/s - within an OOM of single GPU performance. Assuming compression down to a bit per synapse or so requires ~10TB of RAM for weights - almost 3 OOM beyond single GPU capacity - and then activation state is at least 10GB, perhaps 100GB per agent instance, depending on sparsity and backtracking requirements. Compared to brains GPUs are most heavily RAM constrained, and thus techniques for sharing/reusing weights (across agents/batch, space, or time) are essential. ↩︎
An honorable mention attempt to circumvent the VN bottleneck on current hardware involves storing everything in on-chip SRAM, perhaps best exemplified by the Cerebras wafer scale chip. It has the performance of perhaps many dozens of GPUs, but with access to only 40GB of on-chip RAM it can run only tiny insect/lizard size ANNs - but it can run those at enormous speeds. ↩︎
For point of comparison, GPT-3's 500B token training run is roughly equivalent to 5,000 years of human experience (300 tokens/minute * 60 * 24 * 365 ≈ 0.1B tokens per human year, to the nearest order of magnitude) and was compressed into a few months of physical training time, so it ran about 10,000x real-time equivalent. The 3e24 flops used during GPT-3 training compares more directly to perhaps 1e25 (dense equivalent) flops consumed for a human 'training' of 30 years (1e16 flops * 1e9 seconds). But of course GPT-3 is not truly recurrent, and furthermore is tiny and incomplete - more comparable to a massively old and experienced (but also impaired) small linguistic cortex than a regular full brain. It's quite possible that we can get simbox-suitable AGI using smaller brains, but human brain size seems like a reasonable baseline assumption. ↩︎
Rapid linguistic learning is Homo sapiens' super-power. AGI simply takes this further by being able to directly share synapses without slow ultra-compressed linguistic transmission. ↩︎
Dreams in simboxes could be useful as the natural consequence of episodic memories leaking through from the experiences of an agent's mindclones across the sim multiverse. Brains record experiences during wake and then retrain the cortex on these experiences during sleep - our agents could do the same except massively scaled up by training on the experiences of many mindclones from across the simverse. ↩︎
The same tech leading to AGI will also transform game sim engines and allow simulating entire worlds of realistic NPCs - discussed more in section 5. The distinction between an NPC and an agent/contestant is that the former is purely a simulacrum manifestation of the sim world engine (which has a pure predictive simulation objective), while the latter is designed to steer the world. ↩︎
Convergence in essence, not details. AGI will have little need of the hundred or so known human reflexes and instincts, nor will it suffer much for lack of most human emotions - but few to none of those biological brain features are essential to the core of humanity/sapience. Should we consider a hypothetical individual lacking fear, anger, jealousy, pride, envy, sadness, etc - to be inhuman due to lack of said ingredients? The essence or core of sapience as applicable to AGI is self directed learning, empowerment/curiosity, and alignment - the latter manifesting as empathy, altruism, and love in humans. And as an additional complication AGI may simulate human emotions for various reasons. ↩︎
As you extend the discount rate to zero (planning horizon to infinity) the optimal instrumental action path converges for all relevant utility functions to the path that maximizes the agent's ability to steer the long term future. Empowerment objectives approximate this convergent path, optimizing not for any particular short term goal, but for all long term goals. Empowerment is the driver of recursive self-improvement. ↩︎
I'm using empowerment broadly to include all high level convergent self-improvement objectives: those that improve the agent's ability to control the long term future. This includes classic empowerment objectives such as maximizing mutual info between outputs and future states (maxing future optionality), curiosity objectives (maximizing world model predictive performance), and so on. ↩︎
The convergence towards empowerment does simplify the task of aligning AI as it reduces or removes the need to model detailed human values/goals; instead optimizing for human empowerment is a reasonable (and actually achievable) approximate bound. ↩︎
A brain-like large sparse RNN can encode any circuit architecture, so the architectural prior reduces simply to a prior on the large scale low-frequency sparsity pattern, which can obviously evolve during learning. ↩︎
Ie those that survive the replication crisis and fit into the modern view of the brain from computational neuroscience and deep learning. ↩︎
Binz, Marcel, and Eric Schulz. "Using cognitive psychology to understand GPT-3." arXiv preprint arXiv:2206.14576 (2022). ↩︎
Dasgupta, Ishita, et al. "Language models show human-like content effects on reasoning." arXiv preprint arXiv:2207.07051 (2022). ↩︎
Jara-Ettinger, Julian. "Theory of mind as inverse reinforcement learning." Current Opinion in Behavioral Sciences 29 (2019): 105-110. ↩︎
Learning detailed models of the complex values of external agents is also probably mostly unnecessary, as empowerment (discussed below) serves as a reasonable convergent bound. ↩︎
Weighted by the other agent's alignment (for game theoretic reasons) and also perhaps model fidelity. ↩︎
Each oldbrain circuit doesn't need performance anywhere near the more complex target newbrain circuit it helps locate, it only needs enough performance to distinguish its specific target circuit by firing pattern from amongst all the rest. For example, babies are born with a crude face detector which really isn't much more than a simple smiley-face :) detector, but that (perhaps along with additional feature detectors) is still sufficient to reliably match actual faces more than other observed patterns, helping to locate and connect with the later more complex learned cortical face detectors. ↩︎
Sexual attraction is a natural extension of imprinting: some collaboration of various oldbrain circuits can first ground to the general form of humans, and then also myriad more specific attraction signals: symmetry, body shape, secondary characteristics, etc, combined with other circuits which disable attraction for likely kin à la the Westermarck effect (identified by yet other sets of oldbrain circuits as the most familiar individuals during childhood). This explains the various failure modes we see in porn (attraction to images of people and even abstractions of humanoid shapes), and the failure of kin attraction inhibition for kin raised apart. ↩︎
Fear of death is a natural consequence of empowerment based learning - as it is already the worst (most disempowered) outcome. But instinctual fear still has obvious evolutionary advantage: there are many dangers that can kill or maim long before the brain's learned world model is highly capable. Oldbrain circuits can easily detect various obvious dangers for symbol grounding: very loud sounds and fast large movements are indicative of dangerous high kinetic energy events, fairly simple visual circuits can detect dangerous cliffs/heights (whereas many tree-dwelling primates instead instinctively fear open spaces), etc. ↩︎
Anger/Jealousy/Vengeance/Justice are all variations or special cases of the same general game-theoretic punishment mechanism. These are deviations from empowerment because an individual often pursues punishment of a perceived transgressor even at a cost to their own 'normal' (empowerment) utility (ie their ability to pursue diverse goals). Even though the symbol grounding here seems more complex, we do see failure modes such as anger at inanimate objects which are suggestive of proxy matching. In the specific case of jealousy a two step grounding seems plausible: first the previously discussed lust/attraction circuits are grounded, which then can lead to obsessive attentive focus on a particular subject. Other various oldbrain circuits then bind to a diverse set of correlated indicators of human interest and attraction (eye gaze, smiling, pupil dilation, voice tone, laughter, touching, etc), and then this combination can help bind to the desired jealousy grounding concept: "the subject of my desire is attracted to another". This also correctly postdicts that jealousy is less susceptible to the inanimate object failure mode than anger. ↩︎
Oldbrain circuits advertise emotional state through many indicators: facial expressions, pupil dilation, blink rate, voice tone, etc - and other oldbrain circuits can then detect emotional state in others from these obvious cues. This provides the requisite proxy foundation for grounding to newbrain learned representations of emotional state in others, and thus empathy. The same learned representations are then reused during imagination and planning, allowing the brain to imagine/predict the future contingent emotional state of others. Simulation itself can also help with grounding, by reusing the brain's own emotional circuitry as the proxy. While simulating the mental experience of others, the brain can also compare their relative alignment/altruism to its own, or some baseline, allowing for the appropriate game theoretic adjustments to sympathy. This provides a reasonable basis for alignment in the brain, and explains why empathy is dependent upon (and naturally tends to follow from) familiarity with a particular character - hence "to know someone is to love them". ↩︎
Matusch, Brendon, Jimmy Ba, and Danijar Hafner. "Evaluating Agents without Rewards." arXiv preprint arXiv:2012.11538 (2020). ↩︎
Salge, Christoph, Cornelius Glackin, and Daniel Polani. "Empowerment–an introduction." Guided Self-Organization: Inception. Springer, Berlin, Heidelberg, 2014. 67-114. ↩︎
Friston, Karl. "The free-energy principle: a unified brain theory?." Nature reviews neuroscience 11.2 (2010): 127-138. ↩︎
Burda, Yuri, et al. "Large-scale study of curiosity-driven learning." arXiv preprint arXiv:1808.04355 (2018). ↩︎
Mohamed, Shakir, and Danilo Jimenez Rezende. "Variational information maximisation for intrinsically motivated reinforcement learning." arXiv preprint arXiv:1509.08731 (2015). ↩︎
Eysenbach, Benjamin, et al. "Diversity is all you need: Learning skills without a reward function." arXiv preprint arXiv:1802.06070 (2018). ↩︎
Zhao, Ruihan, Stas Tiomkin, and Pieter Abbeel. "Learning efficient representation for intrinsic motivation." arXiv preprint arXiv:1912.02624 (2019). ↩︎
Aubret, Arthur, Laetitia Matignon, and Salima Hassas. "A survey on intrinsic motivation in reinforcement learning." arXiv preprint arXiv:1908.06976 (2019). ↩︎
Pathak, Deepak, et al. "Curiosity-driven exploration by self-supervised prediction." International conference on machine learning. PMLR, 2017. ↩︎
Pathak, Deepak, Dhiraj Gandhi, and Abhinav Gupta. "Self-supervised exploration via disagreement." International conference on machine learning. PMLR, 2019. ↩︎
It is irrelevant that evolution sometimes produces brains that are unaligned or broken in various ways. My broken laptop is not evidence that Turing machines do not work. Evolution proceeds by breaking things; it only needs some high functioning offspring for success. We are reverse engineering the brain in its most ideal perfected forms (think von Neumann meets Jesus, or your favorite cultural equivalents), and we are certainly not using some blind genetic evolutionary process to do so. ↩︎
Decety, Jean, et al. "Empathy as a driver of prosocial behaviour: highly conserved neurobehavioural mechanisms across species." Philosophical Transactions of the Royal Society B: Biological Sciences 371.1686 (2016): 20150077. ↩︎
Meyza, K. Z., et al. "The roots of empathy: Through the lens of rodent models." Neuroscience & Biobehavioral Reviews 76 (2017): 216-234. ↩︎
Bartal, Inbal Ben-Ami, Jean Decety, and Peggy Mason. "Empathy and pro-social behavior in rats." Science 334.6061 (2011): 1427-1430. ↩︎
Franzmeyer, Tim, Mateusz Malinowski, and João F. Henriques. "Learning Altruistic Behaviours in Reinforcement Learning without External Rewards." arXiv preprint arXiv:2107.09598 (2021). ↩︎
The Franzmeyer paper was posted on arXiv shortly before I started this post a year ago, but it did not come to my attention until final editing, and we both arrived at a similar idea (using empowerment as a bound approximation for external agent values) independently. They of course are not using a complex learned world model and thus avoid the key challenge of internal circuit grounding. The specific approximations they are using may not scale to large environments, but regardless they have now at least proven out the basic idea of optimizing for external agent empowerment in simple environments. ↩︎
Transitioning to altruism (external empowerment) too soon could impair the agent's learning trajectory or result in an insufficient model of external agency; but delaying the transition too long could result in powerful selfish agents. ↩︎
The capabilities of an (adult/trained) agent are a function primarily of 1.) its total lifetime effective compute budget for learning (learning compute * learning age), 2.) the quality and quantity of its training data (knowledge), and 3.) its architectural prior. In simboxes we are optimizing 3 for the product of intelligence and alignment, but that does not imply that agents in simboxes will be especially capable or dangerous, as they will be limited somewhat by 1 and especially by 2. ↩︎
See also the typical hero's journey monomyth. ↩︎
One key difference is that computer security sandboxes are built to contain viruses and malware which themselves are intentionally designed to escape. This adversarial arms race setting naturally makes containment far more challenging, whereas AGI and simboxes should be fully cooperatively codesigned. ↩︎
Plato did actually arrive at some conclusions that roughly anticipate simulism, but only very vaguely. Various contemporary Gnostics believed in an early equivalent of simulism. Still billions of lifetimes away from any serious containment risk. ↩︎
Of course a hypothetical superintelligence with vast amounts of compute could perhaps infer the rough shape of the outer world from even a single short lifetime of observations/experiments (using vast internal simulation), but as a rough baseline that would probably require something like the equivalent of human net civilization levels of compute and would hardly go unnoticed, and a well designed sim may not leak enough to allow for anything other than human manipulation as the escape route (consider, for example, the escape prospects for a 'superintelligent' atari agent, who could only know humanity through vague simulations of entire multiverses mostly populated with aliens). Regardless that type of hypothetical superintelligence has no relation to the human-level AGI which will actually arrive first and is discussed here. ↩︎
Specifically dynamic alignment architectures and mechanisms as discussed in section 4: agents that learn models of, and then optimize for, other agents' values/utility (and/or empowerment). ↩︎
These should be considered upper bounds because advances in inter-agent optimization/compression can greatly reduce these costs, long before more exotic advances such as reversible computing. ↩︎
And architecture is somewhat less of a differentiator given the combination of architectural convergence under dynamic within-lifetime architectural search and diminishing returns to model size in great excess of data history. ↩︎
One key piece of historical information which must be inferred for the success of such an effort is humanity's DNA tree. Fortunately a rather large fraction of total human DNA is preserved and awaiting extraction and sampling by future robots thanks to (mostly judeo-christian/abrahamic) burial rituals. ↩︎
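Several of the footnotes above rest on back-of-envelope arithmetic: the brain-versus-GPU comparison, the GPT-3 training run measured in human years, and the recurrence delay imposed by GPU time parallelization. A minimal Python sketch for reproducing those figures follows. The function names are hypothetical, and every constant is simply the post's own order-of-magnitude estimate, so the outputs are only as reliable as those inputs. Note that 300 tokens/minute works out to roughly 0.16B tokens per year before it is rounded to 0.1B.

```python
# Back-of-envelope reproduction of the footnote estimates.
# All constants are the post's rough order-of-magnitude figures, not measurements;
# the function names are hypothetical and exist only for this illustration.


def tokens_per_human_year(tokens_per_minute=300.0):
    """Tokens processed in one year at a constant rate, around the clock."""
    return tokens_per_minute * 60 * 24 * 365


def human_years_equivalent(training_tokens, tokens_per_year):
    """Years of human linguistic experience matching a given token budget."""
    return training_tokens / tokens_per_year


def naive_dense_ops_per_second(synapses=1e14, clock_hz=100.0, sparsity_factor=10.0):
    """Naive dense-equivalent throughput: all synapses at max rate, times the sparsity factor."""
    return synapses * clock_hz * sparsity_factor


def weight_storage_terabytes(synapses=1e14, bits_per_synapse=1.0):
    """RAM needed for weights at the assumed compression level, in decimal terabytes."""
    return synapses * bits_per_synapse / 8 / 1e12


def lifetime_flops(flops_per_second=1e16, years=30.0):
    """Total compute consumed over a 'training' lifetime at a constant rate."""
    return flops_per_second * years * 365 * 24 * 3600


def parallelization_delay_seconds(parallel_steps=1000, clock_hz=100.0):
    """Subjective delay implied by batching this many time steps at the given clock rate."""
    return parallel_steps / clock_hz


if __name__ == "__main__":
    print(f"tokens per human year (unrounded): {tokens_per_human_year():.3g}")
    print(f"500B tokens at 1e8 tokens/year:    {human_years_equivalent(500e9, 1e8):.0f} years")
    print(f"naive dense brain ops/s:           {naive_dense_ops_per_second():.1e}")
    print(f"weights at 1 bit/synapse:          {weight_storage_terabytes():.1f} TB")
    print(f"30-year lifetime at 1e16 flops:    {lifetime_flops():.2e} flops")
    print(f"delay for 1000 steps at 100hz:     {parallelization_delay_seconds():.0f} s")
```

Keeping each estimate as a separate function with the post's figures as defaults makes it easy to vary one assumption at a time (for example the bits per synapse or the equivalent clock rate) and see how sensitive the conclusions are.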