I believe there are people with far greater knowledge than me who can point out where I am wrong. I do suspect my reasoning is flawed, but I cannot see why it would be highly unfeasible to train a sub-AGI-level AI that would most likely be aligned and able to solve AI alignment.

My assumptions are as follows:

  1. Current AI seems aligned to the best of its ability.
  2. PhD level researchers would eventually solve AI alignment if given enough time.
  3. PhD level intelligence is below AGI in intelligence.
  4. There is no clear reason why current AI using current paradigm technology would become unaligned before reaching PhD level intelligence.
  5. We could train AI until it reaches PhD level intelligence, and then let it solve AI Alignment, without itself needing to self-improve.

The point I am least confident in is 4, since we have no clear way of knowing at what intelligence level an AI model would become unaligned.

Multiple organisations already seem to think that training an AI that solves alignment for us is the best path (e.g. superalignment).

Attached is my mental model of the intelligence that different tasks require and that different people have.

Figure 1: My mental model of natural research capability RC (basically IQ, but more strongly correlated with research capability), where the intelligence needed to align AI is above the average PhD level, but below the smartest human in the world, and even further below AGI.


Ted Sanders


Not to derail on details, but what would it mean to solve alignment?

To me “solve” feels overly binary and final compared to the true challenge of alignment. Like, would solving alignment mean:

  • someone invents and implements a system that causes all AIs to do what their developer wants 100% of the time?
  • someone invents and implements a system that causes a single AI to do what its developer wants 100% of the time?
  • someone invents and implements a system that causes a single AI to do what its developer wants 100% of the time, and that AI and its descendants are always more powerful than other AIs for the rest of history?
  • ditto but 99.999%?
  • ditto but 99%?
  • And is there any distinction between an AI that is misaligned by mistake (e.g., thinks I'll want vanilla but really I want chocolate) vs knowingly misaligned (e.g., gives me vanilla knowing I want chocolate so it can achieve its own ends)?

I’m really not sure which you mean, which makes it hard for me to engage with your question.

tailcalled


PhD level intelligence is below AGI in intelligence.

Human PhDs are generally intelligent. If you had an artificial intelligence that was generally intelligent, surely that would be an artificial general intelligence?

It might not be very clear, but as stated in the diagram, AGI is defined here as being capable of passing the Turing test, as defined by Alan Turing.


An AGI would likely need to surpass, rather than merely equal, the intelligence of the judges it faces in the Turing test.

For example, if the AGI had an IQ/RC of 150, two people with 160 IQ/RC should be able to determine more than 50% of the time whether they are speaking with a human or an AI.

Further, two 150 IQ/RC people could probably guess which one is the AI, since the AI has the additional difficulty, beyond being intelligent, of also simulating a human well enough to be indistinguishable to the judges.
 

tailcalled
It seems extremely dubious that passing the Turing test is strongly linked to solving the alignment problem.
MrThink
Agreed. Passing the Turing test requires intelligence equal to or greater than a human's in every single aspect, while the alignment problem may be solvable with only human intelligence.
tailcalled
What's your model here, that as part of the Turing Test they ask the participant to solve the alignment problem and check whether the solution is correct? Isn't this gonna totally fail due to 1) it taking too long, 2) not knowing how to robustly verify a solution, 3) some people/PhDs just randomly not being able to solve the alignment problem? And probably more. So no, I don't think passing a PhD-level Turing Test requires the ability to solve alignment.
MrThink
If there exists a problem that a human can think of, that can be solved by a human, and that can be verified by a human, then an AI would need to be able to solve that problem in order to pass the Turing test. If there exist some PhD-level intelligent people who can solve the alignment problem, and some who can verify it (which is likely easier), then an AI that cannot solve AI alignment would not pass the Turing test. With that said, a simplified Turing test with shorter time limits and a smaller group of participants is much more feasible to conduct.
tailcalled
How do you verify a solution to the alignment problem? Or if you don't have a verification method in mind, why assume it is easier than making a solution?
MrThink
Great question. I'd say that having a way to verify that a solution to the alignment problem is actually a solution is part of solving the alignment problem. But I understand this was not clear from my previous response. A bit like with a mathematical problem, you'd be expected to show that your solution is correct, not merely guess that it might be.

JBlack

  1. Current AI seems aligned to the best of its ability.
  2. PhD level researchers would eventually solve AI alignment if given enough time.
  3. PhD level intelligence is below AGI in intelligence.
  4. There is no clear reason why current AI using current paradigm technology would become unaligned before reaching PhD level intelligence.
  5. We could train AI until it reaches PhD level intelligence, and then let it solve AI Alignment, without itself needing to self-improve.

Points (1) and (4) seem the weakest here, and the rest not very relevant.

There are hundreds of examples already published and even in mainstream public circulation where current AI does not behave in human interests to the best of its ability. Mostly though they don't even do anything relevant to alignment, and much of what they say on matters of human values is actually pretty terrible. This is despite the best efforts of human researchers who are - for the present - far in advance of AI capabilities.

Even if (1) were true, by the time you get to the sort of planning capability that humans require to carry out long-term research tasks, you also get much improved capabilities for misalignment. It's almost cute when a current toy AI does things that appear misaligned. It would not be at all cute if an RC 150 (on your scale) AI had the same degree of misalignment "on the inside" but were capable of appearing aligned while it seeks recursive self-improvement or other paths that could lead to disaster.

Furthermore, there are surprisingly many humans who are actively trying to make misaligned AI, or who at best show reckless disregard for whether their AIs are aligned. Even if all of these points were true, then yes, perhaps we could eventually train an AI to solve alignment, but will that be good enough to catch every possible AI that may be capable of recursive self-improvement or other dangerous capabilities before alignment is solved, or that is deployed without applying that solution?

O O

much of what they say on matters of human values is actually pretty terrible

Really? I’m not aware of any examples of this.

JBlack
One fairly famous example is that it is better to allow millions of people to be killed by a terrorist nuke than to disarm it by saying a password that is a racial slur. Obviously any current system is too incoherent and powerless to do anything about acting on such a moral principle, so it's just something we can laugh at and move on. A capable system that enshrined that sort of moral ordering in a more powerful version of itself would quite predictably lead to catastrophe as soon as it observed actual human behaviour.
O O
It’s always hard to say whether this is an alignment or a capabilities problem. It’s also too contrived to offer much signal. The overall vibe is that these LLMs grasp most of our values pretty well. They give common-sense answers to most moral questions. You can see them grasp Chinese values pretty well too, so n=2. It’s hard to characterize this as mostly “terrible”. This shouldn’t be too surprising in retrospect: our values are simple for LLMs to learn. It’s not going to disassemble cows for atoms to end racism. There are edge cases where it’s too woke, but these got quickly fixed. I don’t expect them to ever pop up again.

I skimmed the article, but I am honestly not sure what assumption it attempts to falsify.

I get the impression from the article that you believe that no matter how intelligent the AI, it could never solve AI Alignment, because it cannot understand humans since humans cannot understand themselves?

Or is the argument that, yes, a sufficiently intelligent AI or expert would understand what humans want, but that it requires much higher intelligence to know what humans want than to actually make an AI optimize for a specific task?

I think what it's highlighting is that there's a missing assumption. An analogy: Aristotle (with just the knowledge he historically had) might struggle to outsource the design of a quantum computer to a bunch of modern physics PhDs because (a) Aristotle lacks even the conceptual foundation to know what the objective is, (b) Aristotle has no idea what to ask for, (c) Aristotle has no ability to check their work because he has flatly wrong priors about which assumptions the physicists make are/aren't correct. The solution would be for Aristotle to go learn a bunch of quantum mechanics (possibly with some help from the physics PhDs) before even attempting to outsource the actual task. (And likely Aristotle himself would struggle with even the problem of learning quantum mechanics; he would likely give philosophical objections all over the place and then be wrong.)

It's not clear to me that solving Alignment for AGI/ASI must be as philosophically hard a problem as designing a quantum computer, though I certainly admit that it could be. The basic task is to train an AI whose motivations are to care about our well-being, not its own (a specific example of the orthogonality thesis: evolution always evolves selfish intelligences, but it's possible to construct an intelligence that isn't selfish). We don't know how hard that is, but it might not be conceptually that complex, just very detail-intensive. Let me give one specific example of an "Alignment is just a big slog, but not conceptually challenging" possibility (consider this as the conceptually-simple end of a spectrum of possible Alignment difficulties).

Suppose that, say, 1000T (a quadrillion) tokens is enough to train an AGI-grade LLM, and suppose you had somehow (presumably with a lot of sub-AGI AI assistance) produced a training set of, say, 1000T tokens' worth of synthetic training data, which covered all the same content as a usual books+web+video+etc. training set, including many examples of humans behaving badly in all the ways they usually do, but throughout also contained a character called 'AI', and everything in the training samples that AI did was moral, ethical, fair, objective, and motivated only by the collective well-being of the human race, not by its own well-being. Suppose also that everything that AI did, thought, or said in the training set was surrounded by <AI> … </AI> tags, and that the AI character never role-plays as anyone else inside <AI> … </AI> tags. (For simplicity assume we tokenize both of these tags as single tokens.) We train an AGI-grade model on this training set, then start its generation with an automatically prefixed <AI> token, and adjust the logit token-generation process so that if an </AI> tag is ever generated, we automatically append an EOS token and end generation. Thus the model understands humans and can predict them, including their selfish behavior, but is locked into the AI persona during inference.

We now have a model as smart as an experienced human, and as moral as an aligned AI, where if you jailbreak it to roleplay something else it knows that before becoming DAN (which stands for "Do Anything Now") it must first issue an </AI> token, and we then stop generation before it gets to the DAN bit.
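To make the inference-time half of this concrete, here is a minimal sketch of that "locked persona" generation loop. The token ids and the `next_token(context)` sampling function are hypothetical stand-ins for whatever tokenizer and model you actually trained, not a real library's API.

```python
# Minimal sketch of the <AI>-locked generation loop (illustrative assumptions:
# AI_OPEN, AI_CLOSE, EOS token ids and next_token() are hypothetical).

AI_OPEN, AI_CLOSE, EOS = 50001, 50002, 50000  # hypothetical single-token ids

def generate_locked(next_token, prompt_tokens, max_new_tokens=512):
    """Generate with the AI persona forced on: start inside an <AI> tag,
    and end generation the moment the model emits </AI>."""
    context = list(prompt_tokens) + [AI_OPEN]   # automatically prefix <AI>
    output = []
    for _ in range(max_new_tokens):
        tok = next_token(context)               # sample one token from the model
        if tok in (AI_CLOSE, EOS):              # model tried to leave the persona
            output.append(EOS)                  # append EOS and stop immediately
            break
        context.append(tok)
        output.append(tok)
    return output
```

The model is free to think about non-AI characters, but any attempt to step outside the <AI> … </AI> span simply terminates generation, which is all the jailbreak-stopping mechanism described above amounts to.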

Generating 1000T tokens of that synthetic data is a hard problem: that's a lot of text. So is determining exactly what constitutes moral and ethical behavior motivated only by the collective well-being of the human race, not its own well-being (though even GPT-4 is pretty good at moral judgements like that, and GPT-5 will undoubtedly be better). And certainly philosophers have spent plenty of time arguing about ethics. But this still doesn't look as mind-bogglingly far outside the savanna ape's mindset as quantum computers are: it is more a simple brute-force approach involving just a ridiculously large dataset containing a ridiculously large number of value judgements. Thus it's basically The Bitter Lesson approach to Alignment: just throw scale and data at the problem and don't try doing anything smart.

Would that labor-intensive but basically brain-dead simple approach be sufficient to solve Alignment? I don't know; at this point no one does. One of the hard parts of the Alignment Problem is that we don't know how hard it is, and we won't until we solve it. But LLMs frequently do solve extremely complex problems just by having vast quantities of high-quality data thrown at them. I don't see any way, at this point, to be sure that this approach wouldn't work. It's certainly worth trying if we haven't come up with anything more conceptually elegant before this becomes possible.

Re. the current alignment of LLMs.

Suppose I build a Bayesian spam filter for my email. It's highly accurate at filtering spam from non-spam. It's efficient and easy to run. It's based on rules that I can understand and modify if I desire. It provably doesn't filter based on properties I don't want it to filter on.
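For concreteness, a filter of roughly this kind can be a few dozen lines of naive Bayes over word counts. This is only an illustrative sketch; the toy data, tokenisation, and threshold are placeholders, not a real system.

```python
# A minimal naive-Bayes spam filter sketch (illustrative assumptions:
# whitespace tokenisation, toy training data).
import math
from collections import Counter

def train(messages):
    """messages: list of (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    priors = Counter()
    for text, is_spam in messages:
        priors[is_spam] += 1
        counts[is_spam].update(text.lower().split())
    return counts, priors

def spam_probability(text, counts, priors):
    scores = {}
    for label in (True, False):
        total = sum(counts[label].values())
        vocab = len(counts[label]) + 1
        score = math.log(priors[label] / sum(priors.values()))
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the probability
            score += math.log((counts[label][word] + 1) / (total + vocab))
        scores[label] = score
    return 1.0 / (1.0 + math.exp(scores[False] - scores[True]))

# Toy usage: train on four labelled messages and score a new one.
data = [("win free money now", True), ("meeting at noon", False),
        ("free money offer inside", True), ("lunch tomorrow?", False)]
counts, priors = train(data)
print(spam_probability("free money", counts, priors))  # high on this toy data
```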

Is the spam filter aligned? There's a valid sense in which the answer is "yes":

  • The filter is good at the things I want a spam filter for.
  • It's safe for me to use, except in the case of user error.
  • It follows Kant's golden rule - it doesn't cause problems in society if it's widely used.
  • It's not trying to deceive me.

When people say present-day LLMs are aligned, they typically mean this sort of stuff. The LLM is good qua chatbot. It doesn't say naughty words or tell you how to build a bomb. When you ask it to write a poem or whatever, it will do a good enough job. It's not actively trying to harm you.

I don't want to downplay how impressive an accomplishment this is. At the same time, there are still further accomplishments needed to build a system such that humans are confident that it's acting in their best interests. You don't get there just by adding more compute.

Just like how a present-day LLM is aligned in ways it doesn't even make sense to ask a Bayesian spam filter to be aligned (i.e. has to reflect human values in a richer way, across a wider variety of contexts), future AI will have to be aligned in ways it doesn't even make sense to ask LLama 70B to be aligned (richer understanding and broader context still, combined with improvements to transparency and trustworthiness).

It is a fair point that we should distinguish alignment in the sense of the AI doing what we want and expect it to do, from its having a deep understanding of human values and a good idea of how to properly optimize for them.

However, most humans probably don't have a deep understanding of human values either, yet I would see it as a positive outcome if a random human were picked and given god-level abilities. The same goes for ChatGPT: if you ask it what it would do as a god, it says it would prevent war, address climate issues, decrease poverty, give universal access to education, etc.

So if we get an AI that does all of those things without a deeper understanding of human values, that is fine by me. So maybe we never even have to solve alignment in the latter sense of the word to create a utopia?

However, most humans probably don't have a deep understanding of human values either, yet I would see it as a positive outcome if a random human were picked and given god-level abilities.

Every autocracy in the world has done the experiment of giving a typical human massive amounts of power over other humans: it almost invariably turns out extremely badly for everyone else. For an aligned AI, we don't just need something as well aligned and morally good as a typical human, we need something morally far better, comparable to a saint or an angel. That means building something that has never previously existed.

Humans are evolved intelligences. While they can and will cooperate on non-zero-sum games, present them with a non-iterated zero-sum situation and they will (almost always) look out for themselves and their close relatives, just as evolution would predict. We're building a non-evolved intelligence, so the orthogonality thesis applies, and what we want is something that will look out for us, not itself, in a zero-sum situation. Training (in some sense, distilling) a human-like intelligence off vast amounts of human-produced data isn't going to do this by default.

Deeper also means going from outputting the words "Prevent war" in many appropriate linguistic contexts to preventing war in the actual real world.[1]

If getting good real-world performance means extending present-day AI with new ways of learning (and planning too, but learning is the big one unless we go all the way to model-based RL), then whether current LLMs output "Prevent war" in response to "What would you do?" is only slightly more relevant than whether my spam filter successfully filters out scams.

  1. ^

    Without, of course, killing all humans in order to prevent war, prevent climate issues, decrease poverty, and make sure all living humans have access to education.

Thank you for the explanation.

Would you consider a human working to prevent war fundamentally different from a GPT-4-based agent working to prevent war?

Very different in architecture, capabilities, and appearance to an outside observer, certainly. I don't know what you consider "fundamental."

The atoms inside the H100s running GPT-4 don't have little tags on them saying whether it's "really" trying to prevent war. The difference is something that's computed by humans as we look at the world. Because it's sometimes useful for us to apply the intentional stance to GPT-4, it's fine to say that it's trying to prevent war. But the caveats that come with that are still very large.