Mostly I'd agree with this, but I think there needs to be a bit of caution and balance around:
How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.
Do we want variety? Absolutely: worlds where things work out well likely correlate strongly with finding a variety of approaches.
However, there's some risk in Do(increase variety). The ideal is that we get many researchers thinking about the problem in a principled way, and variety happens. If we intentionally push too hard for variety, we may end up with a lot of wacky approaches that abandon principled thinking too early. (I think I've been guilty of this at times.)
That said, I fully agree with the goal of finding a variety of approaches. It's just rather less clear to me how much an individual researcher should be thinking in terms of boosting variety. (It's very clear that there should be spaces that provide support for finding different approaches, so I'm entirely behind that; currently it's much more straightforward to work on existing ideas than to work on genuinely new ones.)
Certainly many great ideas initially looked like garbage - but I'll wager a lot of garbage initially looked like garbage too. I'd be interested in knowing more about the hidden-greatness-garbage: did it tend to have any common recognisable qualities at the time? Did it tend to emerge from processes with common recognisable qualities? In environments with shared qualities?...
It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…
I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your point that we shouldn't expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least "has a plausible story for reducing x-risk", and maybe to what's mentioned in the quote as well.
For sure I agree that the researcher knowing these things is a good start - so getting as many potential researchers to grok these things is important.
My question is about which ideas researchers should focus on generating/elaborating given that they understand these things. We presumably don't want to restrict thinking to ideas that may overcome all these issues - since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful.
Generating a broad variety of new ideas is great, and we don't want to be too quick in throwing out those that miss the target. The thing I'm unclear about is something like:
What target(s) do I aim for if I want to generate the set of ideas with greatest value?
I don't think that "Aim for full alignment solution" is the right target here.
I also don't think that "Aim for wacky long-shots" is the right target - and of course I realize that Adam isn't suggesting this.
(we might find ideas that look like wacky long-shots from outside, but we shouldn't be aiming for wacky long-shots)
But I don't have a clear sense of what target I would aim for (or what process I'd use, what environment I'd set up, what kind of people I'd involve...), if my goal were specifically to generate promising ideas (rather than to work on them long-term, or to generate ideas that I could productively work on).
Another disanalogy with previous research/invention... is that we need to solve this particular problem. So in some sense a history of:
[initially garbage-looking-idea] ---> [important research problem solved] may not be relevant.
What we need is: [initially garbage-looking-idea generated as an attempt to solve x] ---> [x was solved]
It's not good enough if we find ideas that are useful for something, they need to be useful for this.
I expect the kinds of processes that work well to look different from those used where there's no fixed problem.
In the end you do want to solve the problem, obviously. But the road from here to there goes through many seemingly weird and insufficient ideas that are corrected, adapted, refined, often discarded except for a small bit. Alignment is no different, including “strong” alignment.
There is an implicit assumption here that does not cover all the possible outcomes of research progress.
As understanding of some open problems in mathematics and computer science progressed, those problems turned out to be unsolvable. That is a valuable, decision-relevant conclusion: it means it is better to do something else than to keep hacking away at that maths problem.
E.g.
We cannot just rely on a can-do attitude, as we can with starting a start-up (where even if there’s something fundamentally wrong about the idea, and it fails, only a few people’s lives are impacted hard).
With ‘solving for’ the alignment of generally intelligent and scalable/replicable machine algorithms, it is different.
This is the extinction of human society and all biological life we are talking about. We need to think this through rationally, and consider all outcomes of our research impartially.
I appreciate the emphasis on diverse conceptual approaches. Please, be careful in what you are looking for.
I'm confused about what your point here even is. For the first part, if you're trying to say
research that gives strong arguments/proofs that you cannot solve alignment by doing X (like showing certain techniques aren't powerful enough to prove P!=NP) is also useful.
then that makes sense. But the post didn't mention anything about that?
You said:
We cannot just rely on a can-do attitude, as we can with starting a start-up (where even if there’s something fundamentally wrong about the idea, and it fails, only a few people’s lives are impacted hard).
which I feel is satirizing the post. I read the post as saying
It's extremely difficult to get all the bits of hidden information in one shot, so it's important to come at it from many different angles, as has happened historically. There will always be problems with individual approaches, but we can steelman them to think about what hidden bits of info they could reveal about the actual solution.
We don't have any proofs that the approaches of the referenced researchers are doomed to fail, like we have for P!=NP and what you linked. I would predict that Adam does think approaches that run counter to "instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis" are doomed to fail.
We don't have any proofs that the approaches of the referenced researchers are doomed to fail, like we have for P!=NP and what you linked.
Besides looking for different angles or ways to solve alignment, or even for strong arguments/proofs why a particular technique will not solve alignment,
... it seems prudent to also look at whether you can prove embedded misalignment by contradiction, i.e. show that the essential properties that would need to be defined as part of the concept of embedded/implemented/computed alignment are logically inconsistent with each other.
This is analogous to the situation Hilbert and others pursuing his formalist program found themselves in, trying to 'solve for' the foundations of mathematics being (presumably) both complete and consistent. Gödel, who was a semi-outsider, instead took the inverse route of proving by contradiction that a sufficiently expressive formal system cannot be simultaneously complete and consistent.
If you have an entire community operating under the assumption that a problem is solvable or at least resolving to solve the problem in the hope that it is solvable, it seems epistemically advisable to have at least a few oddballs attempting to prove that the problem is unsolvable.
Otherwise you end up skewing your entire 'portfolio allocation' of epistemic bets.
I understand your point now, thanks. It's:
An embedded aligned agent is desired to have properties (1), (2), and (3). But suppose (1) & (2); then (3) cannot be true. Then suppose (2) & ...
or something of the sort.
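In schematic form (with $P_1$, $P_2$, $P_3$ as placeholders for whatever essential properties of embedded alignment are in play, not a claim about which specific properties actually conflict):

$$\text{Desiderata: } P_1(A) \wedge P_2(A) \wedge P_3(A) \text{ for an embedded agent } A$$
$$\text{Lemma: } \forall A,\; P_1(A) \wedge P_2(A) \Rightarrow \neg P_3(A)$$
$$\text{Conclusion: } \neg\exists A \text{ such that } P_1(A) \wedge P_2(A) \wedge P_3(A)$$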
Yeah, that points well to what I meant. I appreciate your generous intellectual effort here to paraphrase back!
Sorry about my initially vague and disagreeable comment (aimed at Adam, who I chat with sometimes as a colleague). I was worried about what looks like a default tendency in the AI existential safety community to start from the assumption that problems in alignment are solvable.
Adam has since clarified with me that although he had not written about it in the post, he is very much open to exploring impossibility arguments (and sent me a classic paper on impossibility proofs in distributed computing).
… making your community and (in this case) the wider world fragile to reality proving you wrong.
We don't know the status or evolution of internal MIRI research, or of independent/individual safety alignment research on LessWrong.
But it seems that A.G.I. has a (much?) higher probability of getting invented elsewhere.
So the problem is not only to discover how to Safely Align A.G.I. but also to invent A.G.I.
Inventing A.G.I. seems to be a step that comes before discovering how to Safely Align A.G.I., right?
How probable is it estimated to be that the first A.G.I. will be the Singularity? Isn't it a spectrum? The answer probably lies in the take-off speed and acceleration.
If anyone could provide resources on this it would be much appreciated.
Thousands of highly competent people are working on projects aimed at increasing AI capabilities; there is already a vast financial incentive there. We don't need to and should not help with that.
If we only figure out alignment after the intelligence explosion, it will be too late. We might get a chance to course correct in a slow take-off, but we definitely can't count on it.
As for resources, Rob Miles has many excellent introductory videos to AI alignment.
This post is part of the work done at Conjecture.
I wouldn’t bet on any current alignment proposal. Yet I think that the field is making progress and abounds with interesting opportunities to do even more, giving us a shot. Isn’t there a contradiction?
No, because research progress so rarely looks like having a clearly correct insight that clarifies everything; instead it often looks like building on apparently unpromising ideas, or studying the structure of the problem. Copernican heliocentrism didn’t initially predict observations as well as Ptolemaic astronomy; both ionic theory and the determination of basic molecular formulas came from combining multiple approaches in chemistry, each getting some bits but not capturing the whole picture; Computer Science emerged from the arid debate over the foundations of mathematics; and Computational Complexity Theory has made more progress by looking at why some of its problems are hard than by waiting for clean solutions.
In the end you do want to solve the problem, obviously. But the road from here to there goes through many seemingly weird and insufficient ideas that are corrected, adapted, refined, often discarded except for a small bit. Alignment is no different, including “strong” alignment.
Research advances through productive mistakes, not perfect answers.
I’m taking this terminology from Goro Shimura’s characterization of his friend Yutaka Taniyama, with whom he formulated the Taniyama-Shimura Conjecture, a special case of which Andrew Wiles proved in order to prove Fermat’s Last Theorem.
(Yutaka Taniyama and his time. Very personal recollections, Goro Shimura, 1989)
So much of scientific progress takes the form of many people proposing different ideas that end up being partially right, where we can look back later and be like “damn, that was capturing a chunk of the solution.” It’s very rare that people arrive at the solution of any big scientific problem in one nice sweep of a clearly adequate idea. Even when it looks like it (Einstein is an example people like to bring up), they so often build on many of the weird and contradictory takes that came before, as well as the understanding of how the problem works at all (in Einstein’s case, this includes the many, many unconvincing attempts to unify mechanics and electromagnetism, the shape of Maxwell’s equations, the ether drag hypothesis, and Galileo’s relativity principle; he also made a lot of productive mistakes of his own).
Paul Graham actually says the same thing about startups that end up becoming great successes.
(What Microsoft Is This the Altair Basic Of?, Paul Graham, 2015)
Graham proposes a change of polarity in considering lame ideas: instead of looking for flaws, he encourages us to steelman not the idea itself, but how it could lead to greatness.
(What Microsoft Is This the Altair Basic Of?, Paul Graham, 2015)
It’s this mindset that makes me excited about ongoing conceptual alignment research.
I look at ARC’s ELK, and I have disagreements about the constraints, the way of stating the problem, and each proposed solution; but I also see how much productive discussion ELK has generated by pushing people to either solve it or articulate why it’s impossible or why it falls short of capturing the key problems that we want to solve.
I look at Steve’s Brain-like AGI Alignment work, and I’m not convinced that we will build brain-like AGI before ML-based AGI or automated economies; but I also see that Steve has been pushing the thinking around value learning and its subtleties, and has found a robust way of transferring results and models from neuroscience to alignment.
I look at John’s Natural Abstraction work, and I’m still unsure whether the natural abstraction hypothesis is correct, and whether it might lead at all to tractable extraction/analysis of the abstractions used in prediction; but I also see how it reframes the thinking and ideas around fragility of value, and provides ideas for forcing an ontological lock (if the natural abstraction hypothesis doesn’t hold by default).
I look at Evan’s training stories, and I’m unclear whether this is the right frame to argue for alignment guarantees, and whether it has blindspots; but I also see how it clarifies misunderstandings around inner alignment, and provides a first step toward a common language to discuss failure modes in prosaic alignment.
I look at Alex’s power-seeking theorems, and I wonder whether they’re missing a crucial component about how power is spent, and whether the set of permutations considered fits with how goals are selected in real life; but I also realize that the formalization made these subtleties of instrumental convergence more salient, and provided some intuitions about ways of sampling goals that might reduce power-seeking incentives.
I look at Vanessa’s Infra-Bayesianism work, and I worry that it’s not tackling the crucial part of inferring and capturing human values, as well as going for too much generality at the cost of shareability; but I also see that it looks particularly good for tackling questions of realizability and breaking self-reference, while yielding powerful enough math that I expect progress on the agenda.
I look at Critch’s RAAPs work, and I don’t know if competitive pressure is a strong enough mechanism to cause that kind of problem, nor am I so sure that the agentic and structural effects can be disentangled; but I also appreciate the attempt to bring more structural-type thinking into alignment, and how this addresses a historical gap in how to think about AI risk and alignment strategies.
And so on for many other works on the AF.[1]
It’s also clear when reading these works and interacting with these researchers that they all get how alignment is about dealing with unbounded optimization, they understand fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…
None of these approaches looks good enough on its own, and I expect many to shift, get redirected, or even be abandoned in order to iterate on a new version. I also expect to criticize their development and disagree with the researchers involved. Yet I still see benefits and insights they might deliver, and want more work to be put into them for that reason.
But isn’t all that avoiding the real problem of finding a solution to the alignment problem right now? No, because they each give us better tools and ideas and handles for tackling the problem, and none of our current proposals work anyway.
That doesn’t look fast, you might answer. And I agree that fundamental science and solving new and complex problems have historically taken way too long for the short timelines we seem to be on. But that’s not a reason to refuse to do the necessary work, or to despair; it’s a reason to find ways of accelerating science! For example, looking at what historically hampered progress, and removing it as much as possible. Or looking at how hidden bits of evidence were revealed, and leveraging that to explore the space of ideas and approaches faster.
Okay, but shouldn’t we focus all our efforts on finding smarter and smarter people to work on this problem instead of pushing for the small progress we’re able to make now? I think this misses the point: we don’t want smartness, we want the ability to reveal hidden bits of evidence. That’s certainly correlated with smartness, but with one big difference: there are often diminishing returns to the bits of evidence you can get from one angle, and that leads to wanting a more diverse portfolio of researchers who are good at harnessing and revealing different streams of evidence. That’s one thing the common question “Which alignment researcher would you want to have 10 copies of?” misses: we want variety, because no one is that good at revealing bits from all relevant streams of evidence.
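To make the diminishing-returns point concrete with a toy calculation (the square-root returns curve here is purely an illustrative assumption, not a model of real research): suppose $k$ researchers working the same angle reveal on the order of $\sqrt{k}$ units of evidence. Then

$$\underbrace{\sqrt{10}}_{\text{10 copies, 1 angle}} \approx 3.2 \qquad \text{vs.} \qquad \underbrace{5\sqrt{2}}_{\text{5 angles, 2 researchers each}} \approx 7.1,$$

so under this assumption the same headcount spread across angles reveals more than twice the evidence.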
To go back to the Einstein example, he was clearly less of a math genius than most of his predecessors who attempted to unify mechanics and electromagnetism, like Poincaré. But that didn’t matter, because what Einstein had was a knack for revealing the hidden bits of evidence in what we already knew about physics and the shape of our laws of physics. And after he did that, many mathematicians and physicists with better math chops pushed his theory and ideas and revealed incredibly rich models and insights and predictions.
How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.
So if you’re worried about AI risk, and want to know if there’s anything that can be done, the answer is a resounding yes. There are so many ways of improving our understanding and thus our chances: participating in current research programs and agendas, coming up with new weird takes and approaches, exploring the mechanism, history, and philosophy of science to accelerate the process as much as we can…[2]
I don’t know if we’ll make it in time. 5 to 15 years[3] is a tight deadline indeed, and the strong alignment problem is incredibly complex and daunting. But I know this: if we solve the problem and get out of this alive, this will not be by waiting for an obviously convincing approach; it will come instead from making as many productive mistakes as we can, and learning from them as fast as we can.
I’m not discussing applied alignment research here, like the work of Redwood, but I also find this part crucial and productive. It’s just that such work is less about “formulating a solution” and more about “exploring the models and the problems experimentally”, which fits well with the model I’m drawing here.
I’m currently finishing a sequence arguing for more pluralism in alignment and providing an abstraction of the alignment problem that I find particularly good for generating new approaches and understanding how all the different takes and perspectives relate.
The range where many short-timeline forecasts put the bulk of their probability mass.