When there is a difference between what I said, what I meant, what I should have said/meant, and what is best for me... I think I would prefer the AI to explain this difference to me, if possible. If it cannot explain it, it should at least say that there is a difference (and let me figure out how to deal with that information).
The collective version of the same problem is more difficult, because it turns possible resulting intrapersonal conflicts into interpersonal ones.
From the perspective of ontology and memetic engineering, the whole "ontology" or classification of alignments that you give ("fragile", "friendly", ...) is bad because it's not based on any theory but rather on a cacophony of commonsensical ideas. These "alignments" don't even belong to the same type: "Fragile" is an engineering approach (but there are also many other engineering approaches which you haven't mentioned!), 2-3 and 5-6 are black-box descriptions of some alignment characteristics (at least these seem to belong to the same type), and "Strict" looks like a description of a mathematical mechanism. Also, the names are bad.
Fragile alignment
The name is really bad: you use a property of this engineering approach (its fragility) as its name. But there are infinitely many other engineering approaches that are very fragile. For example, consider post-filtering an LLM's outputs for any occurrences of words like "bomb", "kill", "poison", the n-word, etc., and not passing these rollouts through, but passing all the rest. Is this an alignment technique? Yes, if we consider the filter as part of the cognitive architecture, as we should. Is it fragile? Yes.
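To make the example concrete, here is a minimal sketch of such a post-filter wrapped around LLM rollouts. All names here (`generate_rollout`, the blocklist, the retry loop) are hypothetical illustrations of the idea, not an actual system or API:

```python
# A minimal sketch of "rollout + post-filtering" as a (fragile) alignment technique.
# generate_rollout and BLOCKLIST are illustrative placeholders, not a real API.

BLOCKLIST = {"bomb", "kill", "poison"}  # stand-in for a real keyword filter


def generate_rollout(prompt: str) -> str:
    """Placeholder for a raw ("bare") LLM call."""
    raise NotImplementedError("plug an actual LLM in here")


def filtered_rollout(prompt: str, max_attempts: int = 5) -> str | None:
    """Pass a rollout through only if it contains no blocklisted words.

    Taken alone, this is exactly as fragile as described above: a synonym,
    a misspelling, or another language defeats it. Any robustness would have
    to come from stacking many such checks inside a larger architecture.
    """
    for _ in range(max_attempts):
        rollout = generate_rollout(prompt)
        if not any(word in rollout.lower() for word in BLOCKLIST):
            return rollout
    return None  # refuse rather than emit a rollout that failed the filter
```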
But when multiple such approaches are combined, this may actually lead to a cognitive architecture that is robustly aligned. This is the promise of LMCAs and natural language alignment.
It is not clear fragile alignment is even meaningfully helpful – that it does much, survives for long, or causes actions compatible with our survival, once the AGI is smarter than we are, even if we get its details mostly right and face relatively good conditions. There are overlapping and overdetermined reasons (the strength and validity of which are of course disputed) to expect any good properties to break down exactly when it is important they not break down.
Laissez-faire (any prompt goes), "bare", non-post-filtered LLM rollouts used as a cognitive architecture are obviously doomed. OpenAI has stopped deploying and giving access to base models (GPT-4 is available only in SFT'ed and RLHF'ed form), and I expect (wishful thinking on my part) that in the next iteration (GPT-5) they will stop even that -- they will give access only to an LMCA from the beginning. Even "rollout + post-filtering" is already a primitive form of such an LMCA.
In turn, LMCAs in general needn't inherit the cardinal sin of LLMs (being an "exponentially diverging diffusion process", in LeCun's words). And SFT/RLHF (the things you have called "fragile alignment") could be part of a robust architecture, as I already noted above.
The first four and the next four kinds of alignment you propose are parallel, except that the former concern a single person and the latter society as a whole. So I suggest the following names, which are more parallel. (Not happy about 3 and 7.)
Is this ‘alignment’ a natural thing you can get easily or even by default, that is essentially a normal engineering problem, or is it a highly unnatural outcome where security mindset and bulletproof approaches as yet unfound even in principle are required, with any flaws exploited, amplified and fatal, and many lethal problems all of which one must avoid?
To answer these questions specifically, it's really important to consider AI-human alignment not just "in the abstract", but embedded in the current civilisation, with its infrastructure and incentive structures. As I wrote here:
[...] we should address this strategic concern by rewiring the economic and action landscapes (which also interacts with the "game-theoretic, mechanism-design" alignment paradigm mentioned above). The current (internet) infrastructure and economic systems are not prepared for the emergence of powerful adversarial agents at all:
- There are no systems of trust and authenticity verification at the root of internet communication (see https://trustoverip.org/)
- The storage of information is centralised enormously (primarily in the data centres of BigCos such as Google, Meta, etc.)
- Money carries no trace, so one may earn money in arbitrarily malicious or unlawful ways (i.e., gain instrumental power) and then use it to acquire resources from respectable places, e.g., paying for ML training compute at AWS or Azure and purchasing data from data providers. Formal regulations such as compute governance, data governance, and human-based KYC procedures can only go so far and could probably be circumvented through social engineering by a superhuman imposter or persuader AI.
In essence, we want to design civilisational cooperation systems such that being aligned is a competitive advantage. Cf. "The Gaia Attractor" by Rafael Kaufmann.
This is a very ambitious program to rewire the entirety of the internet, other infrastructure, and the economy, but I believe this must be done anyway; just expecting a "miracle" HRAD invention to be sufficient without fixing the infrastructure and system-design layers doesn't sound like a good strategy. By the way, such infrastructure and economy rewiring is the real "pivotal act".
If we imagined that the world had the "right" kind of infrastructure and social structure (really decentralised, trust-first), alignment would probably be much more of an "ordinary engineering" problem. With the current economic and infrastructural vulnerabilities mentioned above, however, alignment becomes a much higher-stakes problem, requiring more "bulletproof" solutions "on the first try", I think.
Strict alignment. The damn thing will actually follow some set of instructions to the letter subject to its optimization constraints, hopefully you like the consequences of that. It is a potentially important crux if you disagree with the claim that for almost all specified instruction sets you won’t like the consequences, and there is no known good one yet, due to various alignment difficulties.
The name is uninformative and possibly misleading. If the set of instructions is in a natural or a formal language, you push the alignment difficulty into the semantics and semiotics, which are not "strict", and the alignment ends up not "strict" either.
In the planning-as-inference frame, I guess you probably mean something like an external evaluation of the inferred plans in their entirety (perhaps with some "good old-fashioned algorithm", à la "type checker", rather than another AI, although it's really questionable whether such a "type checker" could be built). But again, even the AI's internal representations are symbols, not "ground truth", so they are subject to the same difficulty of semantic and semiotic interpretation as natural language.
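A minimal sketch of what such an external, "type checker"-style plan evaluation might look like; the plan representation and the rule set below are my own illustrative assumptions, not something proposed in the post:

```python
# Toy "type checker" over an inferred plan: every step is inspected against a
# fixed rule set before anything is executed. The Step structure and the
# forbidden-resource rules are illustrative assumptions only.

from dataclasses import dataclass


@dataclass
class Step:
    description: str     # natural-language description of the step
    resources: set[str]  # labels for resources the step would touch


FORBIDDEN_RESOURCES = {"biolab", "financial_system"}  # toy constraints


def check_plan(plan: list[Step]) -> list[str]:
    """Return the violations found in the plan (an empty list means it passes).

    Note the difficulty flagged above: the checker sees only symbols (step
    descriptions, resource labels), not ground truth, so its verdicts are
    only as good as the semantic interpretation behind those labels.
    """
    violations = []
    for i, step in enumerate(plan):
        bad = step.resources & FORBIDDEN_RESOURCES
        if bad:
            violations.append(f"step {i}: touches forbidden resources {sorted(bad)}")
    return violations
```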
It is not clear to what extent strict alignment or strawberry alignment gives us affordance to reach good outcomes, how universal and deadly the various sources of lethality involved would be, or how difficult it would be to locate such affordances, especially on the first try.
"Strict" alignment is an engineering technique or approach, that shouldn't be judged in isolation but rather as a part of a cognitive architecture as a whole, as I explained in this comment.
"Strawberry" alignment is an external evaluation criterion or characteristic. However, I would go even further and say that "strawberry" is just a thought experiment which is not meant to be an actual eval that we will actually try, and the purpose of this thought experiment is just to show that the process of reasoning (and the result of reasoning, i.e., the plan), and the resulting behaviour are the actual objects of ethical evaluation and alignment rather than simply "goals". Goals become "good" or "bad" only in the context of larger plans and behaviour. This thought could be expressed in different ways, e.g., directly, as I just did (as well as in this comment). The latest OpenAI paper, "Let's Verify Step By Step" highlights this idea, too.
It is not clear that human-level or friendly alignment would do us much good for long either, given the nature and history of humans, the competitive dynamics involved, and the various reasons to expect change. If AGIs are much smarter and more capable and efficient than us, is there reason to think this level of alignment might be sufficient for long?
"Human-level" is just more commonly called "value alignment" (or "alignment with human values" if you want). But I agree with the conclusion: "friendly" is an attempt at "moral fact alignment" ("humanity is valuable to preserve"), which is probably futile without considering and aligning on the underlying theory ethics, i.e., without the methodological and scientific alignment, as I described in a different comment. Value alignment, if taken literally, i.e., as attempting to impart AI with humans' heuristics about value, is also a species of "moral fact alignment", just somewhat more concrete than just "humanity is valuable to preserve" (although the latter is also one of the human values).
Have this alignment and the surrounding dynamics cause humans to choose to remain in control over time, or somehow be unable to choose differently.
This is self-contradictory: if the surrounding dynamics strongly preclude humans from "choosing differently", humans are no longer "in control". Also, under certain definitions of "choosing differently", humans may be precluded from moving into different biological and computational substrates, which in itself might be a cosmic tragedy because it may forever preclude humans from realising vast amounts of potential.
And Zvi points out these contradictions himself:
It is not clear to what extent robust alignment is a coherent concept especially in a competitive world or even how it interacts with maximization, as it contains many potential contradictions and requirements.
It is impossible in theory to have all these different kinds of alignment simultaneously. You cannot simultaneously (without any claim of completeness):
- Do what I say
- Also do what I mean
- Also do what I should have said and meant
- Also do what is best for me
- Also do what broader society or humanity says
- Also do what broader society or humanity means or should have said
- Also do what broader society or humanity should have said given their values
- Also do what is best for everyone
- Do some ideal friendly combination of all of it that a broadly good guy would do, in a way that is respectful of and preserves what is valuable on all levels
- Strictly follow some other set of rules that were set up long ago, no matter the cost
I think we should already forget about items 1 and 2 (as well as 10, but that goes without saying). In the context of communicating and aligning with superhumanly smart AI, it is really hubristic and stupid to expect that, even though we will be much stupider than the AI (it will be able to write a superhumanly good theory of ethics, of a kind humans haven't produced in thousands of years, in a second, and to design an AI, or change itself, to follow this theory), humans' thoughts about value will still somehow be "covered with gold" and worth adhering to for that AI.
Given this, I disagree that "we will need to: 1) Be able to get an AGI to do something a human selects at all, rather than something not selected. Be able to retain some form of control over what it does in the future, or set it on a chosen course. At all. ...". I'm not sure this is even possible (it would require the orthogonality thesis to hold to a sufficiently strong degree, which I doubt), but even if it is possible, I don't think it would be desirable.
Items 3-9 raise an important and valid concern, that of the (infinite) multiplicity of (collective) identity and alignment subjectivity.
We must pick at most one of those, or another variation on them, or something else, as our primary target. A machine cannot serve two masters any more than a man can, and the ability to put the machine under arbitrary stress, and its additional capabilities and intelligence, makes this that much more clear.
I think this is a wrong reaction to the above concern. Humans can serve multiple masters (themselves, their family, their community, their nation/society, the whole of humanity, and Gaia), so why couldn't AIs?
In the phrase "It is impossible in theory to have all these different kinds of alignment simultaneously", it's unclear what you mean by "having alignment". If you meant something like formal, "total" alignment, then sure, it's impossible to have perfect alignment in the physical reality outside us (i.e., outside simple simulations and mathematical abstractions). But if we strive for continuously increasing alignment along various dimensions and various levels of intelligent entities (alignment subjects), it should definitely be possible in general (although some "bad" situations may call for a choice where the alignment between certain entities or along a certain dimension is worsened in favour of the alignment between some other types of entities or along other dimensions).
I think that to increase our chances to realise a good future, we must find a principled way of addressing the issue of the multiplicity of identity and alignment subjectivity, which is the essence of a scale-free theory of ethics.
the system has unintended (harmful or dangerous) goals or behaviors.
Note that judgements about the harmfulness and dangerousness of some goals or behaviours are themselves theory-laden. This is why Goal alignment without alignment on epistemology, ethics, and science is futile. From the perspective of any theory of cognition/intelligence that includes a generative model for performing planning-as-inference (not only Active Inference, but also LeCun's H-JEPA, LMCAs such as the "exemplary actor", and other theories of cognition and/or AI architectures), I think a straightforward and useful ladder of alignment could be introduced: methodological, scientific, and fact alignment:
Thus, the crux of alignment is aligning the generative models of humans and AIs. Generative models could be "decomposed", vaguely (there is a lot of intersection between these categories), into
- Methodology: the mechanics of the models themselves (i.e., epistemology, rationality, normative logic, ethical deliberation),
- Science: mechanics, or "update rules/laws", of the world (such as the laws of physics or heuristic learnings about society, the economy, markets, psychology, etc.), and
- Fact: the state of the world (facts, or inferences about the current state of the world: CO2 level in the atmosphere, the suicide rate in each country, distance from Earth to the Sun, etc.)
These, we can conceptualise, give rise to "methodological alignment", "scientific alignment", and "fact alignment", respectively. Evidently, methodological alignment is the most important: it in principle allows for alignment on science, and methodology plus science helps to align on facts.
Under this framework, goals are a specific type of fact, laden by a specific theory of mind of another agent (natural or AI). A theory of mind here should be a specialised version of a general theory of cognition which itself, as noted above, includes a generative model and planning-as-inference, under which goals become future world states, or some features of future world states, predicted/planned (prediction and planning are the same thing under planning-as-inference) by the other mind (or by oneself, if the agent reflects on its own goals).[1]
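Here is a toy numerical sketch of that framing: under planning-as-inference, a "goal" is just a preferred distribution over future world states, and action selection amounts to matching the generative model's predictions to those preferences. The two-state world, the action set, and the numbers are invented purely for illustration and belong to no particular theory's implementation:

```python
import math

# Generative ("world") model: for each action, a distribution over next states.
WORLD_MODEL = {
    "heat": {"warm": 0.8, "cold": 0.2},
    "wait": {"warm": 0.1, "cold": 0.9},
}

# Prior preferences over future states play the role of the "goal".
PREFERENCES = {"warm": 0.95, "cold": 0.05}


def expected_surprise(action: str) -> float:
    """Cross-entropy between predicted and preferred future states (lower is better)."""
    return -sum(p * math.log(PREFERENCES[s]) for s, p in WORLD_MODEL[action].items())


# "Planning" = picking the action whose predicted futures best match the goal states.
best_action = min(WORLD_MODEL, key=expected_surprise)
print(best_action)  # -> "heat"
```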
Thus, goal alignment is easy (practically, automatic) when two agents are aligned on methodology and science (albeit goal alignment even between methodologically and scientifically aligned agents usually still requires communication and coordination, unless we enter the territory of logical handshakes...), and futile when there is no common methodological and scientific ground.
Incidentally, this means that RL is not a very useful framework for discussing goals, because goals cannot easily be conceptualised under RL. This causes a lot of trouble for people in the AI safety community who tend to think that there should be a single "right" theory or framework of cognition and intelligence. There should not: for alignment, we should simultaneously use multiple theories of cognition and value. And RL, although it probably couldn't be deployed very usefully to discuss goal alignment specifically, could still be used to discuss some aspects of value alignment between minds.
What would it mean to solve the alignment problem sufficiently to avoid catastrophe? What do people even mean when they talk about alignment?
The term is not used consistently. What would we want or need it to mean? How difficult and expensive will it be to figure out alignment of different types, with different levels of reliability? To implement and maintain that alignment in a given AGI system, including its copies and successors?
The only existing commonly used terminology whose typical uses are plausibly consistent is the contrast between Inner Alignment (alignment of what the AGI inherently wants) and Outer Alignment (alignment of what the AGI provides as output). It is not clear this distinction is net useful.
An alignment failure or misalignment (being misaligned) can mean among other things:
It is impossible in theory to have all these different kinds of alignment simultaneously. You cannot simultaneously (without any claim of completeness):
(And do it all according to a variety of contradictory human heuristics and biases, while looking friendly, while engaging in unnatural behaviors like corrigibility, and not tricking people into giving you requests you can fulfil, and so on, please.)
We must pick at most one of those, or another variation on them, or something else, as our primary target. A machine cannot serve two masters any more than a man can, and the ability to put the machine under arbitrary stress, and its additional capabilities and intelligence, makes this that much more clear. Even individually, many of the requests and desired behavioral sets above are not actually logically coherent or consistent.
Getting any one of those ten is hard enough. It is a problem we do not know how to solve for systems more intelligent than we are. We do not even know how to robustly solve it for current systems.
To solve alignment and retain control of AGIs and their actions, we will need to:
Useful and consistent terminology and taxonomy beyond this are urgently needed.
We could call these ten forms of alignment these names (by all means please replace with better names, this is hard); again, this list is not claimed to be complete:
We do not currently have a known method of creating reliable alignment of any kind for future AGI systems, or a path known to lead to this. How promising various existing proposals or plans are for getting us there is heavily disputed and a common crux.
In addition to the type of alignment, one can talk about various aspects of the strength, reliability, precision and robustness of that alignment, as well as what ways exist to weaken, risk or break that alignment. These and related words are not used consistently.
In very broad terms, combining aspects that can be distinct for ease of discussion, one might speak of things like, in terms of either inner alignment, outer alignment, or a combination of both:
All these targets have problems, in addition to ‘we don’t know how to get this’ beyond the first one, ‘do we know what the components mean or how to specify them’ and ‘we don’t understand human values’ and ‘is this even a coherent concept,’ such as:
It is not clear fragile alignment is even meaningfully helpful – that it does much, survives for long, or causes actions compatible with our survival, once the AGI is smarter than we are, even if we get its details mostly right and face relatively good conditions. There are overlapping and overdetermined reasons (the strength and validity of which are of course disputed) to expect any good properties to break down exactly when it is important they not break down.
It is not clear that human-level or friendly alignment would do us much good for long either, given the nature and history of humans, the competitive dynamics involved, and the various reasons to expect change. If AGIs are much smarter and more capable and efficient than us, is there reason to think this level of alignment might be sufficient for long?
It is not clear to what extent strict alignment or strawberry alignment gives us affordance to reach good outcomes, how universal and deadly the various sources of lethality involved would be, or how difficult it would be to locate such affordances, especially on the first try.
It is not clear to what extent robust alignment is a coherent concept especially in a competitive world or even how it interacts with maximization, as it contains many potential contradictions and requirements. Or how one could get or even specify this level of alignment even under ideal conditions.
A better and more complete future version of this document would include a better taxonomy here similar to the one above.
A key crux is the type and degree of alignment necessary to avoid catastrophe and achieve good outcomes. Another is how difficult such alignment will be to achieve, with what level of reliability, and which particular obstacles we need to worry about.
A Missing Additional Post: Alignment Difficulties
My post on the progression through various stages of AGI development handwaved ‘alignment’ to focus on when we might need how much of it depending on what path we take in terms of what AGIs or potential-AGIs exist under how much human control, including the implications for type and degree of alignment necessary.
This post, on degrees and types of alignment, asks what alignment actually means, and what forms and degrees it might take and which of them would be required to survive various scenarios, and spread and preserved how robustly, and so on. Are these types of alignment even possible in theory, or coherent logically consistent concepts? If you get the thing you ostensibly want, what would happen?
A third post might ask, how likely would it be, and how hard would it be, for us to achieve a given form or degree of alignment, in systems smarter and more capable than us or any previously existing system?
Is this ‘alignment’ a natural thing you can get easily or even by default, that is essentially a normal engineering problem, or is it a highly unnatural outcome where security mindset and bulletproof approaches as yet unfound even in principle are required, with any flaws exploited, amplified and fatal, and many lethal problems all of which one must avoid?
How much hope or doom lies in various potential approaches? Would scaled up versions of things that work on non-intelligent systems likely work out of the box or with ordinary reasonable adjustments, or do we know reasons they definitely fail? Can we use incrementally smarter AIs to solve our problems for us? Will the results naturally be robust, have nice properties, be nicely self-maintaining? Does it fall out of this ‘one weird trick’?
How much investment of time and money, how much sacrifice of capability including continuously, is required to get what we need to make a real attempt? To what extent do we need to ‘get it right on the first try’ due to failure not being something we can recover from, and how much does that increase the difficulty level versus problems where we can iterate?
The most lethal-looking, hard-to-avoid, unnatural-to-solve problems include instrumental convergence, power seeking and corrigibility, yet the list of even central ones is very long – see Yudkowsky 2022, A List of Lethalities.
Creating a neutral-perspective version of such a list, especially an exhaustive one, and getting all the implied cruxes including potential solutions into the crux list, would likely be valuable. Especially if it was combined with those resulting from potential solutions and paths to those solutions, and so on.
Unfortunately, for now, the scope of that project is intractable. I have run out of time if not space, and leave expanding this out to others or to the future. If the answers here matter to you, it will be a long slog of evaluating many complexities, and I urge you not to outsource or abstract it, avoid falling back on social cognition or normality heuristics or grabbing onto metaphors, and instead think hard about the concrete details and logical arguments.