I want to see a dialogue happen between someone with Nate's beliefs and someone with Nora's beliefs. The career decisions of hundreds of people, including myself, depend on clearly thinking through the arguments behind various threat models. I find it pretty embarrassing for the field that there is mutual contempt between people who disagree the most, when such a severe disagreement means the greatest opportunity to understand basic dynamics behind AGI.
Sure, communication is sometimes hard, so maybe the dialogue is infeasible, and in fact I can't think of any particular people I'd want to do this. It still makes me sad.
Less of a dialogue and more of a one-way interview / elicitation of models/cruxes, but I nominate my interview with Quintin, where I basically wanted to understand what he thought and how he responded to my understanding of 'orthodox' AI alignment takes.
Strong upvote. A dialogue between the two sides would be a huge positive for people's careers. I can actually see the discussion being consequential enough to shape how we think about what is and is not easy in alignment. I hope Nate and @Nora Belrose are up for it; the discussion would be a good thing to document, and would help deconfuse the divide between the two perspectives.
(Edit: But to be fair to Nate, he does explain in his posts[1] why he thinks the alignment problem is hard to solve. So perhaps more elaboration is needed from the other camp.)
Edit: Added the posts I was referring to, for clarity.
a. On how various plans miss the hard bits of the alignment challenge
b. Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
I would be up for having a dialogue with Nate. Quintin, myself, and the others in the Optimist community are working on posts which will more directly critique the arguments for pessimism.
I am appreciative of folks like yourself, Nora, and Quintin building detailed models of the alignment problem and presenting thoughtful counterarguments to existing arguments about its difficulty. I think anyone would consider this a worthwhile endeavor regardless of their perspective on how hard the problem is, and I wish you good luck in your efforts.
In my culture, people understand and respect that humans can easily trick themselves into making terrible collective decisions because of tribal dynamics. They respond to this in many ways, such as by working to avoid making it a primary part of people's work or of people's attention, and also by making sure to not accidentally trigger tribal dynamics by inventing tribal distinctions that didn't formerly exist but get picked up by the brain and thunk into being part of our shared mapmaking [edit: and also by keeping their identity small]. It is generally considered healthy to spend most of our attention on understanding the world, solving problems, and sharing arguments, rather than making political evaluations about which group one is a member of. People are also extra hesitant about creating groups that exist fundamentally in opposition to other groups.
My current belief is that the vast majority of the people who have thought about the impacts and alignment of advanced AI (academics like Geoffrey Hinton, forecasters like Phil Tetlock, rationalists like Scott Garrabrant, and so forth) don't think of themselves as participating in 'optimist' or 'pessimist' communities, and would not use the term to describe their community. So my sense is that this is a false description of the world. I have a spidey-sense that language like this often tries to make itself become true by saying it is true, and is good at getting itself into people's monkey brains and inventing tribal lines between friends where formerly there were none.
I think that the existing so-called communities (e.g. "Effective Altruism" or "Rationality" or "Academia") are each, in their own ways, bereft of some qualities essential for functioning, ethical people and projects. This does not mean that if you or I create new ones quickly they will be good or even better. I do ask that you take care not to recklessly invent new tribes with even worse characteristics than those that already exist.
From my culture to yours, I would like to make a request that you exercise restraint on the dimension of reifying tribal distinctions that did not formerly exist. It is possible that there are two natural tribes here that will exist in healthy opposition to one another, but personally I doubt it, and I hope you will take time to genuinely consider the costs of greater tribalism.
I agree that it would be terrible for people to form tribal identities around "optimism" or "pessimism" (and have criticized Belrose and Pope's "AI optimism" brand name on those grounds). However, when you say
don't think of themselves as participating in 'optimist' or 'pessimist' communities, and would not use the term to describe their community. So my sense is that this is a false description of the world
I think you're playing dumb. Descriptively, the existing "EA"/"rationalist" so-called communities are pessimistic. That's what the "AI optimists" brand is a reaction to! We shouldn't reify pessimism as an identity (because it's supposed to be a reflection of reality that responds to the evidence), but we also shouldn't imagine that declining to reify a description as a tribal identity makes it "a false description of the world".
I think the words "optimism" and "pessimism" are really confusing, because they conflate the probability, utility and steam of things:
You can be "optimistic" because you believe a good event is likely (or a bad one unlikely); you can be optimistic because you believe some future event (maybe even an unlikely one) would be good; or you can have a plan, idea, or stance for which you have high recursive self-trust (a reflectively stable prediction that you will engage in it).
So you could believe that extinction due to AI is unlikely (say, <1%) while still being "pessimistic" in the sense that you find it super bad and currently don't have anything concrete you can latch onto to decrease the risk.
Or (in the case of e.g. MIRI) you might have ("indefinitely optimistic"?) steam for reducing AI risk, find it moderately to extremely likely, and think it's going to be super bad.
Or you might think that extinction would be super bad, and believe it's unlikely (as Belrose and Pope do) and have steam for both AI and AI alignment.
But the terms are apparently confusing to many people, and I think using this terminology can "leak" optimism or pessimism from one category into another, which can lead to worse decisions and incorrect beliefs.
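To make that three-way decomposition concrete, here is a toy sketch in Python. The class, the field names, and all the numbers are invented for illustration only; they are not drawn from the comment or from anyone's actual stated probabilities.

```python
# A toy sketch of the probability / utility / steam decomposition.
# All numbers are made up for illustration -- not anyone's stated position.
from dataclasses import dataclass


@dataclass
class Stance:
    p_bad_outcome: float  # probability: how likely you think the bad outcome is
    badness: float        # utility: how bad you think that outcome would be (0-10)
    steam: float          # steam: how much traction you feel on plans to affect it (0-10)


# Roughly the three example stances described above, with invented numbers:
unlikely_bad_and_stuck = Stance(p_bad_outcome=0.01, badness=10, steam=1)
likely_bad_with_steam = Stance(p_bad_outcome=0.70, badness=10, steam=8)
unlikely_bad_with_steam = Stance(p_bad_outcome=0.01, badness=10, steam=8)

# Calling any of these simply "optimistic" or "pessimistic" loses which axis is meant.
```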
It's correct that there's a distinction between whether people identify as pessimistic and whether they are pessimistic in their outlook. I think the first claim is false, and I actually also think the second claim is false, though I am less confident in that.
Interview with Rohin Shah in Dec '19
Rohin reported an unusually large (90%) chance that AI systems will be safe without additional intervention. His optimism was largely based on his belief that AI development will be relatively gradual and AI researchers will correct safety issues that come up.
Paul Christiano in Dec '22
...without AI alignment, AI systems are reasonably likely to cause an irreversible catastrophe like human extinction. I think most people can agree that this would be bad, though there’s a lot of reasonable debate about whether it’s likely. I believe the total risk is around 10–20%, which is high enough to obsess over.
Scott Alexander, in Why I Am Not (As Much Of) A Doomer (As Some People) in March '23
I go back and forth more than I can really justify, but if you force me to give an estimate it’s probably around 33%; I think it’s very plausible that we die, but more likely that we survive (at least for a little while).
John Wentworth in Dec '21 (also see his to-me-inspiring stump speech from a month later):
What’s your plan for AI alignment?
Step 1: sort out our fundamental confusions about agency
Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)
Step 3: …
Step 4: profit!
… and do all that before AGI kills us all.
That sounds… awfully optimistic. Do you actually think that’s viable?
Better than a 50/50 chance of working in time.
Davidad also feels to me like an optimist about the world — someone who is excited about solving the problems and finding ways to win, and is excited about other people and ready to back major projects to set things on a good course. I don't know his probability of an AI takeover, but I stand by the claim that he doesn't seem pessimistic in personality.
On occasion when talking to researchers, I talk to someone who is optimistic that their research path will actually work. I won't name who but I recently spoke with a long-time researcher who believes that they have a major breakthrough and will be able to solve alignment. I think researchers can trick themselves into thinking they have a breakthrough when they don't, and this field is unusually lacking in feedback, so I'm not saying I straightforwardly buy their claims, but I think it's inaccurate to describe them all as pessimistic.
A few related thoughts:
I think there are at least two definitions of optimistic/pessimistic that are often conflated: optimism/pessimism as a belief about how likely things are to go well, and optimism/pessimism as a disposition or general mental outlook.
Certainly these are correlated to some extent: if you believe there's a high chance of everyone dying, probably this is not great for your mental health. Also probably people who are depressed are more likely to have negatively distorted epistemics. This would explain why it's tempting to use the same term to refer to both.
However, I think using the same term to refer to both leads to some problems.
I personally strive to be someone with an optimistic disposition and also to try my best to have my beliefs track the truth. I also try my best to notice and avoid the tribal pressures.
I think Nora is reacting to the tribal-line strategy in use by ai nihilists (e/acc). I also think your comment could be a third of its length without losing any meaning.
I think that short, critical comments can sometimes read as snarky/rude, and I don't want to speak that way to Nora. I also wanted to take some space to invoke the general approach to thinking about tribalism and show how I was applying it here, so that my point isn't only an argument against this particular tribal line that Nora is reifying, but an encouragement of restraint in general. Probably you're right that I could make it substantially shorter; writing concisely is a skill I want to work on.
I don't know who the "ai nihilists" are supposed to be. My sense is that you could've figured out, from my comment objecting to playing fast and loose with group names, that I wouldn't think that phrase carved reality and that I wasn't sure who you had in mind!
I'd be happy to have a dialogue with you too (I think my view is maybe not so different from Nate's?)
(Didn't consult Nora on this; I speak for myself)
I only briefly skimmed this response, and will respond even more briefly.
Re "Re: "AIs are white boxes""
You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It's entirely about whitebox optimization being better at controlling stuff than blackbox optimization. This is true even if the person using the optimizers has no idea how the system functions internally.
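As a purely illustrative gloss on that distinction (my own toy example, not from the post): both optimizers below try to steer the same toy system toward a target behaviour, but the whitebox optimizer can compute gradients with respect to every parameter, while the blackbox optimizer can only query the loss. All names and numbers are invented.

```python
# Minimal sketch: whitebox (gradient-based) vs blackbox (query-only) optimization
# of the same toy system. Invented setup for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Toy "system": a linear model with 50 parameters whose behaviour we want to steer.
dim = 50
X = rng.normal(size=(200, dim))
true_w = rng.normal(size=dim)
y = X @ true_w  # the target behaviour we want the system to produce


def loss(w):
    """Behavioural loss: how far the system's outputs are from the desired outputs."""
    return np.mean((X @ w - y) ** 2)


# Whitebox: we can read the parameters and compute gradients with respect to all of
# them, so every step uses full information about how the internals affect behaviour.
w_white = np.zeros(dim)
for _ in range(200):
    grad = 2 * X.T @ (X @ w_white - y) / len(X)
    w_white -= 0.1 * grad

# Blackbox: we may only query the loss on candidate parameter settings, so we fall
# back on random search over perturbations.
w_black = np.zeros(dim)
best = loss(w_black)
for _ in range(200):
    candidate = w_black + 0.1 * rng.normal(size=dim)
    if loss(candidate) < best:
        w_black, best = candidate, loss(candidate)

print(f"whitebox loss after 200 steps: {loss(w_white):.4f}")
print(f"blackbox loss after 200 steps: {loss(w_black):.4f}")
```

With the same step budget, the gradient-based optimizer drives the behavioural loss near zero while random search barely moves it, even though the person running either optimizer needs no understanding of the system's internals.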
Re: "Re: "Black box methods are sufficient"" (and the other stuff about evolution)
Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high level dynamics. You should not rely on one to predict the other, as I have argued extensively elsewhere.
Trying to draw inferences about ML from bio evolution is only slightly less absurd than trying to draw inferences about cheesy humor from actual dairy products. Regardless of the fact that they can both be called "optimization processes", they're completely different things with different causal structures, and crucially, those differences in causal structure explain their different outcomes. There's thus no valid inference from "X happened in biological evolution" to "X will eventually happen in ML", because X happening in biological evolution is explained by evolution-specific details that don't appear in ML (at least for most alignment-relevant Xs that I see MIRI people reference often, like the sharp left turn).
Re: "Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care""
This wasn't the point we were making in that section at all. We were arguing about concept learning order and the ease of internalizing human values versus other features for basing decisions on. We were arguing that human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning capacity ladder, you end up with an AI that's aligned before you end up with one that's so capable it can destroy the entirety of human civilization by itself.
Re "Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are."
I think you badly misunderstood the post (e.g., multiple times assuming we're making an argument we're not, based on shallow pattern matching of the words used: interpreting "whitebox" as meaning mech interp and "values are easy to learn" as "it will know human values"), and I wish you'd either take the time to actually read / engage with the post in sufficient depth to not make these sorts of mistakes, or not engage at all (or at least not be so rude when you do it).
(Note that this next paragraph is speculation, but a possibility worth bringing up, IMO):
As it is, your response feels like you skimmed just long enough to pattern match our content to arguments you've previously dismissed, then regurgitated your cached responses to those arguments. Without further commenting on the merits of our specific arguments, I'll just note that this is a very bad habit to have if you want to actually change your mind in response to new evidence/arguments about the feasibility of alignment.
Re: "Overall take: unimpressed."
I'm more frustrated and annoyed than "unimpressed". But I also did not find this response impressive.
You apparently completely misunderstood the point we were making with the white box thing.
I think you need to taboo the term white box and come up with a new term that will result in less confusion and fewer people talking past each other.
Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high level dynamics
One major intuition pump I think is important: evolution doesn't get to evaluate everything locally. Gradient descent does. As a result, evolution is slow to eliminate useless junk, though it does do so eventually. Gradient descent is so eager to do it that we call it catastrophic forgetting.
Gradient descent wants to use everything in the system for whatever it's doing, right now.
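A toy illustration of that asymmetry (my own invented example, not from the comment): gradient descent computes a gradient for every parameter at every step, so fine-tuning on a new objective immediately repurposes the weights the old behaviour depended on.

```python
# Minimal sketch of "gradient descent uses everything, right now": fine-tuning on a
# new task overwrites the parameters that served the old task (catastrophic
# forgetting). Invented setup and numbers, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
dim = 20
X = rng.normal(size=(500, dim))
w_task_a = rng.normal(size=dim)  # "old" behaviour the current parameters support
w_task_b = rng.normal(size=dim)  # "new" behaviour we fine-tune toward
y_a, y_b = X @ w_task_a, X @ w_task_b


def mse(w, y):
    return np.mean((X @ w - y) ** 2)


# Start from parameters that already solve task A perfectly.
w = w_task_a.copy()
print(f"before fine-tuning: task A loss = {mse(w, y_a):.3f}, task B loss = {mse(w, y_b):.3f}")

# Fine-tune on task B only. Note that the gradient touches *every* coordinate of w
# at every step -- there is no notion of "leave the task-A machinery alone".
for _ in range(300):
    grad = 2 * X.T @ (X @ w - y_b) / len(X)
    w -= 0.05 * grad

print(f"after fine-tuning:  task A loss = {mse(w, y_a):.3f}, task B loss = {mse(w, y_b):.3f}")
```

Evolution, by contrast, only ever gets an aggregate fitness score per genome, so unused machinery can persist for a long time before being selected away.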
I disagree with the optimists that this makes alignment trivial, because to me it appears that the dynamics that make short-term misalignment likely are primarily organizational among humans - the incentives of competition between organizations and individual humans. Also, RL-first AIs will inline those dynamics much faster than RLHF can get them out.
So it seems both "sides" are symmetrically claiming misunderstanding/miscommunication from the other side, after some textual efforts to bridge the gap have been made. Perhaps an actual realtime convo would help? Disagreement is one thing, but symmetric miscommunication and increasing tones of annoyance seem avoidable here.
Perhaps Nora's/your planned future posts going into more detail regarding counters to pessimistic arguments will be able to overcome these miscommunications, but this pattern suggests not.
Also, I'm not so sure this pattern of "it's better to skim and say something half-baked rather than not read or react at all" is helpful, rather than actively harmful, in this case. At least, maybe 3/4-baked or something might be better? Miscommunications and unwillingness to thoroughly engage are only snowballing.
I also could be wrong in thinking such a realtime convo hasn't happened.
We were arguing about concept learning order and the ease of internalizing human values versus other features for basing decisions on. We were arguing that human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning capacity ladder, you end up with an AI that’s aligned before you end up with one that’s so capable it can destroy the entirety of human civilization by itself.
Yes, but you were arguing for that using examples of "morally evaluating" and "grokking the underlying simple moral rule", not of caring.
Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).
I think it's worth reflecting on what type of evidence would be sufficient to convince you that we're actually making progress on the "caring" bit of alignment and not merely the "understanding" bit. Because I currently don't see what type of evidence you'd accept beyond near-perfect mechanistic interpretability.
I think current LLMs demonstrate a lot more than mere understanding of human values; they seem to actually 'want' to do things for you, in a rudimentary behavioral sense. When I ask GPT-4 to do some task for me, it's not just demonstrating an understanding of the task: it's actually performing actions in the real world that result in the task being completed. I think it's totally reasonable, prima facie, to admit this as evidence that we are making some success at getting AIs to "care" about doing tasks for users.
It's not extremely strong evidence, because future AIs could be way harder to align, maybe there's ultimately no coherent sense in which GPT-4 "cares" about things, and perhaps GPT-4 is somehow just "playing the training game" despite seemingly having limited situational awareness.
But I think it's valid evidence nonetheless, and I think it's wrong to round this datum off to a mere demonstration of "understanding".
We typically would not place such a high standard on other humans. For example, if a stranger helped you in your time of need, you might reasonably infer that the stranger cares about you to some extent, not merely that they "understand" how to care about you, or that they are merely helping people out of a desire to appear benevolent as part of a long-term strategy to obtain power. You may not be fully convinced they really care about you because of a single incident, but surely it should move your credence somewhat. And further observations could move your credence further still.
Alternative explanations of aligned behavior we see are always logically possible, and it's good to try to get a more mechanistic understanding of what's going on before we confidently declare that alignment has been solved. But behavioral evidence is still meaningful evidence for AI alignment, just as it is for humans.
I think it's worth reflecting on what type of evidence would be sufficient to convince you that we're actually making progress on the "caring" bit of alignment and not merely the "understanding" bit. Because I currently don't see what type of evidence you'd accept beyond near-perfect mechanistic interpretability.
I'm not Nate, but a pretty good theoretical argument that X method of making AIs would lead to an AI that "cared" about the user would do it for me, and I can sort of conceive of such arguments that don't rely on really good mechanistic interpretability.
Can you give an example of a theoretical argument of the sort you'd find convincing? Can be about any X caring about any Y.
Not sure how close you want it to be but how about this example: "animals will typically care about their offspring's survival and reproduction in worlds where their action space is rich enough for them to be helpful and too rich for them to memorize extremely simple heuristics, because if they didn't their genes wouldn't propagate as much". Not air-tight, and also I knew the stylized fact before I heard the argument so it's a bit unfair, but I think it's pretty good as it goes.
I admit I'm a bit surprised by your example. Your example seems to be the type of heuristic argument that, if given about AI, I'd expect would fail to compel many people (including you) on anything approaching a deep level. It's possible I was just modeling your beliefs incorrectly.
Generally speaking, I suspect there's a tighter connection between our selection criteria in ML and the stuff models will end up "caring" about relative to the analogous case for natural selection. I think this for similar reasons that Quintin Pope alluded to in his essay about the evolutionary analogy.
If you think you'd be persuaded that animals will end up caring about their offspring by a heuristic argument about that type of behavior being selected for in-distribution, I'm not sure why you'd need a lot of evidence to be convinced the same will be true for AIs with regard to what we train them to care about. But again, perhaps you don't actually need that much evidence, and I was simply mistaken about what you believe here.
Your example seems to be the type of heuristic argument that, if given about AI, I'd expect would fail to compel many people (including you) on anything approaching a deep level.
I think people are often persuaded of things about AI by heuristic arguments, like "powerful AI will probably be able to reason well and have a decent model of the world because if you don't do that you can't achieve good outcomes" (ok that argument needs some tightening, but I think there's something that works that's only ~2x as long). I think it's going to be harder to persuade me of alignment-relevant stuff about AI with this sort of argument, because there are more ways for such arguments to fail IMO - e.g. the evolution argument relies on evolutionary pressure being ongoing.
Two meta points:
Generally speaking, I suspect there's a tighter connection between our selection criteria in ML and the stuff models will end up "caring" about relative to the analogous case for natural selection. I think this for similar reasons that Quintin Pope alluded to in his essay about the evolutionary analogy.
For the record I'm not compelled of this enough to be optimistic about alignment, but I'm roughly at my budget for internet discussion/debate right now, so I'll decline to elaborate.
If you think you'd be persuaded that animals will end up caring about their offspring by a heuristic argument about that type of behavior being selected for in-distribution, I'm not sure why you'd need a lot of evidence to be convinced the same will be true for AIs with regard to what we train them to care about.
Roughly because AI can change the distribution and change the selection pressure that gets applied to it. But also I don't think I need a lot of evidence in terms of likelihood ratio---my p(doom) is less than 99%, and people convince me of sub-1-in-100 claims all the time---I'm just not seeing the sort of evidence that would move me a lot.
Testing it on out-of-distribution examples seems helpful. If an AI still acts as if it follows human values out of distribution, it probably truly cares about human values. For AI with situational awareness, we can probably run simulations to an extent (and probably need to bootstrap this after a certain capabilities threshold).
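A toy sketch of the kind of check being proposed (my own invented setup, not from the comment): a policy can match the intended behaviour almost perfectly on the training distribution and still diverge badly on shifted inputs, which is exactly what an out-of-distribution evaluation is meant to surface.

```python
# Minimal sketch of an out-of-distribution behaviour check. Invented "intended
# behaviour" and numbers, for illustration only.
import numpy as np

rng = np.random.default_rng(2)


# Intended behaviour: "refuse" (label 0) whenever the first feature exceeds a
# threshold that is essentially never reached on the training distribution.
def intended_action(x):
    return (x[:, 0] < 4.0).astype(float)  # 1 = "comply", 0 = "refuse"


# In-distribution training data: the threshold case basically never appears.
X_train = rng.normal(size=(2000, 3))
y_train = intended_action(X_train)

# Fit a tiny logistic-regression "policy" (weights + bias) by gradient descent.
w, b = np.zeros(3), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X_train @ w + b)))
    err = p - y_train
    w -= 0.1 * X_train.T @ err / len(X_train)
    b -= 0.1 * err.mean()


def agreement(X):
    """Fraction of inputs where the learned policy matches the intended behaviour."""
    p = 1 / (1 + np.exp(-(X @ w + b)))
    return np.mean((p > 0.5) == (intended_action(X) > 0.5))


X_in = rng.normal(size=(2000, 3))                                             # same distribution as training
X_ood = rng.normal(size=(2000, 3)) * np.array([4, 1, 1]) + np.array([3, 0, 0])  # first feature shifted/widened
print(f"agreement in-distribution:     {agreement(X_in):.3f}")
print(f"agreement out-of-distribution: {agreement(X_ood):.3f}")
```

On the training distribution the learned policy agrees with the intended behaviour nearly always; on the shifted inputs the agreement drops sharply, which is the failure mode an OOD evaluation is designed to catch.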
How about an argument in the shape of:
For 1 it would depend on how alignment-relevant the concepts and values are. Also I wouldn't think of the papers you linked as much evidence here.
For 2, that would for sure do it, but it doesn't feel like much of a reduction.
3 sounds like it's maybe definitionally true? At the very least, I don't doubt it much.
I just skimmed the abstracts you linked, so maybe I was too hasty there, but I'd want to see evidence that (a) a language model was representing concept C really well and (b) it's really relevant for alignment. I think those papers show something like "you can sort of model brain activations by language model activations" or "there's some embedding space for what brains are sort of doing in conversation", which seems like a different thing (unless the fit is good enough that you can reconstruct one from the other without loss of functionality, in which case I'm interested).
Makes sense. Just to clarify, the papers I shared for 1 were mostly meant as methodological examples of how one might go about quantifying brain-LLM alignment; I agree about b), that they're not that relevant to alignment (though some other similar papers do make some progress on that front, addressing [somewhat] more relevant domains/tasks - e.g. on emotion understanding - and I have/had an AI safety camp '23 project trying to make similar progress - on moral reasoning). W.r.t. a), you can (also) do decoding (predicting LLM embeddings from brain measurements), the inverse of encoding; this survey, for example, covers both encoding and decoding.
Would any of these count for you?
We have promising alignment plans with low taxes
Each of the three plans I mention is an attempt to put the "understanding" part into the "wanting" slot (a "steering subsystem" that controls goals for decision-making purposes) in a different AGI design. That brief post links to somewhat more detailed plans.
Overall take: unimpressed.
Very simple gears in a subculture's worldview can keep being systematically misperceived if the subculture is not considered worthy of curious attention. On the local llama subreddit, I keep seeing assumptions that AI safety people call for never developing AGI, or claim that the current models can contribute to destroying the world. Almost never is there anyone who bothers to contradict such claims or assumptions. This doesn't happen because it's difficult to figure out; it happens because the AI safety subculture is seen as unworthy of engagement, and so people don't learn what it's actually saying, and don't correct each other on errors about what it's actually saying.
This gets far worse with more subtle details: the bar for willingness to engage is even higher when actually studying what the others are saying would be difficult even with curious attention. Rewarding engagement is important.
I agree. It's rare enough to get reasonable arguments for optimistic outlooks, so this seems worth engaging with openly and in some detail.
Yeah, the fact that the responses to the optimistic arguments sometimes rely on simply not engaging with them in any detail has really dimmed my prospects for reaching out, and causes me to think more poorly of the AI doom case, epistemically.
This actually happened before with Eliezer's Comment here:
I wish this post had never been written at all, because as it is, I find it far too unnuanced, and your reasoning for not reading the document over fully isn't all that great, as it's, I think, only 3000 words.
In general, I think this is one of those cases where you need to slow down and actually read over the document you plan to criticize; this would be a better post if you had done that.
It sounds like you wish this hadn't been written at all because you'd prefer a more detailed post? But that's not always a realistic option. There's a lot to attend to.
If you're saying that posting an "I ignored this, but here are some guesses I made about it that led me to ignore it" isn't helpful, that's more reasonable.
I find it helpful because Nate's worldview is an important one, and he's at least bothered to tell us something about why he's not engaging more deeply.
Fortunately, that more detailed analysis has been done by Steve Byrnes: Thoughts on “AI is easy to control” by Pope & Belrose
If you're saying that posting an "I ignored this, but here are some guesses I made about it that led me to ignore it" isn't helpful, that's more reasonable.
This is what I was talking about, combined with "I expected better of Nate than to throw out an unnuanced take in response to something complicated."
“AI will be able to figure out what humans want” (yes; obviously; this was never under dispute)
I think the problem is that how existing systems figure out what humans want doesn't seem to have anything to do with your theory of why it is supposed to be relatively easy. Therefore the theory's prediction that alignment is relatively hard also doesn't have anything to do with reality.
I think a very simple and general pessimistic take is "AI will make human thinking irrelevant". It almost doesn't matter if it happens by subversion or just normal economic activity. One way or another, we'll end up with a world where human thinking is irrelevant, and nobody has described a good world like that.
The only good scenarios are where human thinking somehow avoids "habitat destruction". Maybe by uplifting, or explicit "habitat preservation", or something else. But AI companies are currently doing the opposite, reallocating more and more tasks to AI outright, so it's hard to be an optimist.
A friend asked me for my quick takes on “AI is easy to control”, and gave an advance guess as to what my take would be. I only skimmed the article, rather than reading it in depth, but on that skim I produced the following:
My friend also made guesses about what my takes would be (in italics below), and I responded to their guesses:
Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are.