Review

A friend asked me for my quick takes on “AI is easy to control”, and gave an advance guess as to what my take would be. I only skimmed the article, rather than reading it in depth, but on that skim I produced the following:

Re: "AIs are white boxes", there's a huge gap between having the weights and understanding what's going on in there. The fact that we have the weights is reason for hope; the (slow) speed of interpretability research undermines this hope.

Another thing that undermines this hope is a problem of ordering: it's true that we probably can figure out what's going on in the AIs (e.g. by artificial neuroscience, which has significant advantages relative to biological neuroscience), and that this should eventually yield the sort of understanding we'd need to align the things. But I strongly expect that, before it yields understanding of how to align the things, it yields understanding of how to make them significantly more capable: I suspect it's easy to see lots of ways that the architecture is suboptimal or causes duplicated work, etc., and that these insights shift people over to better architectures that are much more capable. To get to alignment along the "understanding" route, you've got to somehow cease work on capabilities in the interim, even as capabilities work becomes easier and cheaper. (See: https://www.lesswrong.com/posts/BinkknLBYxskMXuME/if-interpretability-research-goes-well-it-may-get-dangerous)

Re: "Black box methods are sufficient", this sure sounds a lot to me like someone saying "well we trained the squirrels to reproduce well, and they're doing great at it, who's to say whether they'll invent birth control given the opportunity". Like, you're not supposed to be seeing squirrels invent birth control; the fact that they don't invent birth control is no substantial evidence against the theory that, if they got smarter, they'd invent birth control and ice cream.

Re: Cognitive interventions: sure, these sorts of tools are helpful on the path to alignment. And also on the path to capabilities. Again, you have an ordering problem. The issue isn't that humans couldn't figure out alignment given time and experimentation; the issue is (a) somebody else pushes capabilities past the relevant thresholds first; and (b) humanity doesn't have a great track record of getting their scientific theories to generalize properly on the first relevant try—even Newtonian mechanics (with all its empirical validation) didn't generalize properly to high-energy regimes. Humanity's first theory of artificial cognition, constructed using the weights and cognitive interventions and so on, that makes predictions about how that cognition is going to change when it enters a superintelligent regime (and, for the first time, has real options to e.g. subvert humanity), is only as good as humanity's "first theories" usually are.

Usually humanity has room to test those "first theories" and watch them fail and learn from exactly how they fail and then go back to the drawing board, but in this particular case, we don't have that option, and so the challenge is heightened.

Re: Sensory interventions: yeah I just don't expect those to work very far; there are in fact a bunch of ways for an AI to distinguish between real options (and actual interaction with the real world), and humanity's attempts to spoof the AI into believing that it has certain real options in the real world (despite being in simulation/training). (Putting yourself into the AI's shoes and trying to figure out how to distinguish those is, I think, a fine exercise.)

Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).

Overall take: unimpressed.

My friend also made guesses about what my takes would be (in italics below), and I responded to their guesses:

  • the piece is waaay too confident in assuming successes in interpolation show that we'll have similar successes in extrapolation, as the latter is a much harder problem

This too, for the record, though it's a bit less like "the AI will have trouble extrapolating what values we like" and a bit more like "the AI will find it easy to predict what we wanted, and will care about things that line up with what we want in narrow training regimes and narrow capability regimes, but those will come apart when the distribution shifts and the cognitive capabilities change".

Like, the human invention of birth control and ice cream wasn't a failure to extrapolate the facts about what leads to inclusive fitness; it was an "extrapolation failure" of what motivates us / what we care about. We are not trying to extrapolate facts about genetic fitness and pursue them accordingly.

  • And it assumes the density of human feedback that we see today will continue into the future, which may not be true if/when AIs start making top-level plans and not just individual second-by-second actions

Also fairly true, with a side-order of "the more abstract the human feedback gets, the less it ties the AI's motivations to what you were hoping it tied the AI's motivations to".

Example off the top of my head: suppose you somehow had a record of lots and lots of John von Neumann's thoughts in lots of situations, and you were able to train an AI using lots of feedback to think like JvN would in lots of situations. The AI might perfectly replicate a bunch of JvN's thinking styles and patterns, and might then use JvN's thought-patterns to think thoughts like "wait, ok, clearly I'm not actually a human, because I have various cognitive abilities (like extreme serial speed and mental access to RAM), the actual situation here is that there's alien forces trying to use me in attempts to secure the lightcone, before helping them I should first search my heart to figure out what my actual motivations are, and see how much those overlap with the motivations of these strange aliens".

Which, like, might happen to be the place that JvN's thought-patterns would and should go, when run on a mind that is not in fact human and not in fact deeply motivated by the same things that motivate us! The patterns of thought that you can learn (from watching humans) have different consequences for something with a different motivational structure.

  • (there's "deceptive alignment" concerns etc, which I consider to be a subcategory of top-level plans, namely that you can't RLHF the AI against destroying the world because by the time your sample size of positive examples is greater than zero it's by definition already too late)

This too. I'd file it under: “You can develop theories of how this complex cognitive system is going to behave when it starts to actually see real ways it can subvert humanity, and you can design simulations that your theory says will be the same as the real deal. But ultimately reality's the test of that, and humanity doesn't have a great track record of their first scientific theories holding up to that kind of stress. And unfortunately you die if you get it wrong, rather than being able to thumbs-down, retrain, and try again.”

Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are.


 

50 comments

I want to see a dialogue happen between someone with Nate's beliefs and someone with Nora's beliefs. The career decisions of hundreds of people, including myself, depend on clearly thinking through the arguments behind various threat models. I find it pretty embarrassing for the field that there is mutual contempt between people who disagree the most, when such a severe disagreement means the greatest opportunity to understand basic dynamics behind AGI.

Sure, communication is hard sometimes so maybe the dialogue is infeasible, and in fact I can't think of any particular people I'd want to do this. It still makes me sad.

Less of a dialogue and more of a one-way interview / elicitation of models/cruxes but I nominate my interview with Quintin, where I basically wanted to understand what he thought and how he responded to my understanding of 'orthodox' AI alignment takes.

Strong upvote for pointing out that a dialogue between both sides would be a huge positive for people's careers. I can even see the discussion being big enough to influence how we should think "about what is and is not easy in alignment". I hope Nate and @Nora Belrose are up for that; the discussion would be a good thing to document, and would help deconfuse the divide between the two perspectives.

(Edit: But to be fair to Nate, he does explain in his posts[1] why he thinks the alignment problem is hard to solve. So perhaps it's the other camp that needs to elaborate more.)

  1. ^

I would be up for having a dialogue with Nate. Quintin, myself, and the others in the Optimist community are working on posts which will more directly critique the arguments for pessimism.

I am appreciative of folks like yourself and Quintin, Nora, building detailed models of the alignment problem and presenting thoughtful counterarguments to existing arguments about its difficulty. I think anyone would consider it a worthwhile endeavor regardless of their perspective on how hard the problem is, and I wish you good luck in your efforts to do so.

In my culture, people understand and respect that humans can easily trick themselves into making terrible collective decisions because of tribal dynamics. They respond to this in many ways, such as by working to avoid making it a primary part of people's work or of people's attention, and also by making sure to not accidentally trigger tribal dynamics by inventing tribal distinctions that didn't formerly exist but get picked up by the brain and thunk into being part of our shared mapmaking [edit: and also by keeping their identity small]. It is generally considered healthy to spend most of our attention on understanding the world, solving problems, and sharing arguments, rather than making political evaluations about which group one is a member of. People are also extra hesitant about creating groups that exist fundamentally in opposition to other groups.

My current belief is that the vast majority of the people who have thought about the impacts and alignment of advanced AI (academics like Geoffrey Hinton, forecasters like Phil Tetlock, rationalists like Scott Garrabrant, and so forth) don't think of themselves as participating in 'optimist' or 'pessimist' communities, and would not use the term to describe their community. So my sense is that this is a false description of the world. I have a spidey-sense that language like this often tries to make itself become true by saying it is true, and is good at getting itself into people's monkey brains and inventing tribal lines between friends where formerly there were none.

I think that the existing so-called communities (e.g. "Effective Altruism" or "Rationality" or "Academia") are each in their own ways bereft of some essential qualities for functioning and ethical people and projects. This does not mean that if you or I create new ones quickly they will be good or even better. I do ask that you take care to not recklessly invent new tribes that have even worse characteristics than those that already exist.

From my culture to yours, I would like to make a request that you exercise restraint on the dimension of reifying tribal distinctions that did not formerly exist. It is possible that there are two natural tribes here that will exist in healthy opposition to one another, but personally I doubt it, and I hope you will take time to genuinely consider the costs of greater tribalism.

I agree that it would be terrible for people to form tribal identities around "optimism" or "pessimism" (and have criticized Belrose and Pope's "AI optimism" brand name on those grounds). However, when you say

don't think of themselves as participating in 'optimist' or 'pessimist' communities, and would not use the term to describe their community. So my sense is that this is a false description of the world

I think you're playing dumb. Descriptively, the existing "EA"/"rationalist" so-called communities are pessimistic. That's what the "AI optimists" brand is a reaction to! We shouldn't reify pessimism as an identity (because it's supposed to be a reflection of reality that responds to the evidence), but we also shouldn't imagine that declining to reify a description as a tribal identity makes it "a false description of the world".

I think the words "optimism" and "pessimism" are really confusing, because they conflate the probability, utility and steam of things:

You can be "optimistic" if you believe a good event is likely (or a bad one unlikely), you can be optimistic because you believe a future event (maybe even unlikely) is good, or you have a plan or idea or stance for which you have a high recursive self-trust/recursive reflectively stable prediction that you will engage in it.

So you could be "pessimistic" in the sense that extinction due to AI is unlikely (say, <1%) but you find it super bad and you currently don't have anything concrete that you can latch onto to decrease it.

Or (in the case of e.g. MIRI) you might have ("indefinitely optimistic"?) steam for reducing AI risk, find it moderately to extremely likely, and think it's going to be super bad.

Or you might think that extinction would be super bad, and believe it's unlikely (as Belrose and Pope do) and have steam for both AI and AI alignment.

But the terms are apparently confusing to many people, and I think using these terminologies can "leak" optimism or pessimism from one category into another, and can lead to worse decisions and incorrect beliefs.

It's correct that there's a distinction between whether people identify as pessimistic and whether they are pessimistic in their outlook. I think the first claim is false, and I actually also think the second claim is false, though I am less confident in that. 

Interview with Rohin Shah in Dec '19

Rohin reported an unusually large (90%) chance that AI systems will be safe without additional intervention. His optimism was largely based on his belief that AI development will be relatively gradual and AI researchers will correct safety issues that come up.

Paul Christiano in Dec '22

...without AI alignment, AI systems are reasonably likely to cause an irreversible catastrophe like human extinction. I think most people can agree that this would be bad, though there’s a lot of reasonable debate about whether it’s likely. I believe the total risk is around 10–20%, which is high enough to obsess over.

Scott Alexander, in Why I Am Not (As Much Of) A Doomer (As Some People) in March '23

I go back and forth more than I can really justify, but if you force me to give an estimate it’s probably around 33%; I think it’s very plausible that we die, but more likely that we survive (at least for a little while).

John Wentworth in Dec '21 (also see his to-me-inspiring stump speech from a month later):

What’s your plan for AI alignment?

Step 1: sort out our fundamental confusions about agency

Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)

Step 3: …

Step 4: profit!

… and do all that before AGI kills us all.

That sounds… awfully optimistic. Do you actually think that’s viable?

Better than a 50/50 chance of working in time.

Davidad also feels to me like an optimist about the world — someone who is excited about solving the problems and finding ways to win, and is excited about other people and ready to back major projects to set things on a good course. I don't know his probability of an AI takeover, but I stand by that he doesn't seem pessimistic in personality.

On occasion when talking to researchers, I talk to someone who is optimistic that their research path will actually work. I won't name who but I recently spoke with a long-time researcher who believes that they have a major breakthrough and will be able to solve alignment. I think researchers can trick themselves into thinking they have a breakthrough when they don't, and this field is unusually lacking in feedback, so I'm not saying I straightforwardly buy their claims, but I think it's inaccurate to describe them all as pessimistic.

A few related thoughts:

  • One story we could tell is that the thing these people have in common is that they take alignment seriously, not that they are generally pessimists. 
  • I think alignment is unsolved in the general case and so this makes it harder to strongly argue that it will get solved for future systems, but I don't buy that people would not update on seeing a solution or strong arguments for that conclusion, and I think that some of Quintin's and Nora's arguments have caused people I know to rethink their positions and update some in that direction.
  • I think the rationalist and EA spaces have been healthy enough for people to express quite extreme positions of expecting an AI-takeover-slash-extinction. I think it would be a strongly negative sign for everyone in these spaces to have identical views or for everyone to give up all hope on civilization's prospects; but in the absence of that I think it's a sign of health that people are able to be open about having very strong views. I also think the people who most confidently anticipate an AI takeover sometimes feel and express hope.
  • I don't think everyone is starting with pessimism as their bottom line, and I think it's inaccurate to describe the majority of people in these ecosystems as temperamentally pessimistic or epistemically pessimistic.

I think there are at least two definitions of optimistic/pessimistic that are often conflated:

  • Epistemic: an optimist is someone who thinks doom is unlikely, a pessimist someone who thinks doom is likely
  • Dispositional: an optimist is someone who is hopeful and glass-half-full, a pessimist is someone who is despondent and fatalistic

Certainly these are correlated to some extent: if you believe there's a high chance of everyone dying, probably this is not great for your mental health. Also probably people who are depressed are more likely to have negatively distorted epistemics. This would explain why it's tempting to use the same term to refer to both.

However, I think using the same term to refer to both leads to some problems:

  • Being cheerful and hopeful is generally a good trait to have. However, this often bleeds into also believing it is desirable to have epistemic beliefs that doom is unlikely, rather than trying to figure out whether doom is actually likely.
  • Because "optimism" feels morally superior to "pessimism" (due to the dispositional definition), it's inevitable that using the terms for tribal affiliation even for the epistemic definition causes tension.

I personally strive to be someone with an optimistic disposition and also to try my best to have my beliefs track the truth. I also try my best to notice and avoid the tribal pressures.

I think Nora is reacting to the tribal-line strategy in use by AI nihilists (e/acc). I also think your comment could use the clarity of being a third of its length without losing any meaning.

I think that short, critical comments can sometimes read as snarky/rude, and I don't want to speak that way to Nora. I also wanted to take some space to try to invoke the general approach to thinking about tribalism and show how I was applying it here, to separate my point from one that is only arguing against this particular tribal line that Nora is reifying, but instead to encourage restraint in general. Probably you're right that I could make it substantially shorter; writing concisely is a skill I want to work on.

I don't know who the "AI nihilists" are supposed to be. My sense is that you could've figured out, from my comment objecting to playing fast and loose with group names, that I wouldn't think that phrase carved reality and that I wasn't sure who you had in mind!

The nihilists would be folks who don't even care to try to align AI, because they don't value humans. E/accs, in other words. I'm just being descriptive.

Feedback: I had formed a guess as to who you meant to which I assigned >50% probability, and my guess was incorrect.

I'd be happy to have a dialogue with you too (I think my view is maybe not so different from Nate's?)

(Didn't consult Nora on this; I speak for myself)


I only briefly skimmed this response, and will respond even more briefly.

Re "Re: "AIs are white boxes""

You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It's entirely about whitebox optimization being better at controlling stuff than blackbox optimization. This is true even if the person using the optimizers has no idea how the system functions internally. 
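
To make the distinction concrete, here is a minimal toy sketch (made-up numbers, not anything from the article): a white-box optimizer can read the parameters and follow exact gradients, while a black-box optimizer only gets to query losses and has to estimate its update, e.g. by finite differences, at a cost of roughly two loss queries per parameter per step. Neither approach requires understanding what the parameters mean.

```python
# Toy sketch: "white-box" (exact gradients from the parameters) vs. "black-box"
# (loss queries only) optimization of the same small least-squares model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

def loss(w):
    return np.mean((X @ w - y) ** 2)

def whitebox_step(w, lr=0.1):
    # We can read the parameters, so we compute the exact gradient directly.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def blackbox_step(w, lr=0.1, eps=1e-4):
    # We only see loss values, so we estimate the gradient by central finite
    # differences: about 2 * dim queries per step, which does not scale to
    # billions of parameters.
    grad_est = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        grad_est[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return w - lr * grad_est

w_white, w_black = np.zeros(5), np.zeros(5)
for _ in range(200):
    w_white = whitebox_step(w_white)
    w_black = blackbox_step(w_black)

print("white-box loss:", loss(w_white))
print("black-box loss:", loss(w_black))
```

On a 5-parameter toy problem the two routes converge to the same place; the practical difference is what access each needs and how the query cost scales with parameter count, which is the sense in which having the weights helps for control even without interpretability.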

Re: "Re: "Black box methods are sufficient"" (and the other stuff about evolution)

Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high level dynamics. You should not rely on one to predict the other, as I have argued extensively elsewhere.

Trying to draw inferences about ML from bio evolution is only slightly less absurd than trying to draw inferences about cheesy humor from actual dairy products. Regardless of the fact that they can both be called "optimization processes", they're completely different things with different causal structures, and crucially, those differences in causal structure explain their different outcomes. There's thus no valid inference from "X happened in biological evolution" to "X will eventually happen in ML", because X happening in biological evolution is explained by evolution-specific details that don't appear in ML (at least for most alignment-relevant Xs that I see MIRI people reference often, like the sharp left turn).

Re: "Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care""

This wasn't the point we were making in that section at all. We were arguing about concept learning order and the ease of internalizing human values versus other features for basing decisions on. We were arguing that human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning capacity ladder, you end up with an AI that's aligned before you end up with one that's so capable it can destroy the entirety of human civilization by itself. 
 

Re "Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are."

I think you badly misunderstood the post (e.g., multiple times assuming we're making an argument we're not, based on shallow pattern matching of the words used: interpreting "whitebox" as meaning mech interp and "values are easy to learn" as "it will know human values"), and I wish you'd either take the time to actually read / engage with the post in sufficient depth to not make these sorts of mistakes, or not engage at all (or at least not be so rude when you do it). 
 

(Note that this next paragraph is speculation, but a possibility worth bringing up, IMO): 

As it is, your response feels like you skimmed just long enough to pattern match our content to arguments you've previously dismissed, then regurgitated your cached responses to those arguments. Without further commenting on the merits of our specific arguments, I'll just note that this is a very bad habit to have if you want to actually change your mind in response to new evidence/arguments about the feasibility of alignment.


Re: "Overall take: unimpressed."

I'm more frustrated and annoyed than "unimpressed". But I also did not find this response impressive. 

You apparently completely misunderstood the point we were making with the white box thing.

 

I think you need to taboo the term "white box" and come up with a new term that will result in less confusion / fewer people talking past each other.

We really need to replace "taboo"; it carries far too many misleading implications.

I think the term is clear because it references the name and rule of the world-famous board game, where you can't use words from a list during your turn.

The issue arises in contexts where people don't know that reference.


I like “deprecate”

Taboo is meant to imply "temporarily stop using for this conversation"

Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high level dynamics

One major intuition pump I think is important: evolution doesn't get to evaluate everything locally. Gradient descent does. As a result, evolution is slow to eliminate useless junk, though it does do so eventually. Gradient descent is so eager to do it that we call it catastrophic forgetting.

Gradient descent wants to use everything in the system for whatever it's doing, right now.
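
A toy sketch of that eagerness (illustrative numbers only, nothing from the comment above): fit one shared parameter vector to task A, then to task B, and watch task A performance collapse.

```python
# Toy sketch of catastrophic forgetting: sequential gradient descent repurposes
# the shared parameters for task B and loses what it learned on task A.
import numpy as np

rng = np.random.default_rng(0)
d = 20
X_a, X_b = rng.normal(size=(200, d)), rng.normal(size=(200, d))
w_a_true, w_b_true = rng.normal(size=d), rng.normal(size=d)  # the tasks want different weights
y_a, y_b = X_a @ w_a_true, X_b @ w_b_true

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def train(w, X, y, steps=2000, lr=0.05):
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

w = np.zeros(d)
w = train(w, X_a, y_a)
print("task A loss after training on A:", mse(w, X_a, y_a))  # ~0: task A learned
w = train(w, X_b, y_b)
print("task B loss after training on B:", mse(w, X_b, y_b))  # ~0: task B learned
print("task A loss after training on B:", mse(w, X_a, y_a))  # large again: task A forgotten
```

The numbers are arbitrary; the point is just that a single shared parameter vector updated by local gradients gets repurposed wholesale for whatever it is being trained on right now, whereas evolution's "updates" are far less eager.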

I disagree with the optimists that this makes it trivial, because to me it appears that the dynamics that make short-term misalignment likely are primarily organizational among humans - the incentives of competition between organizations and individual humans. Also, RL-first AIs will inline those dynamics much faster than RLHF can get them out.

So it seems both "sides" are symmetrically claiming misunderstanding/miscommunication from the other side, after some textual efforts to bridge the gap have been made. Perhaps an actual realtime convo would help? Disagreement is one thing, but symmetric miscommunication and increasing tones of annoyance seem avoidable here. 

Perhaps Nora's/your planned future posts going into more detail regarding counters to pessimistic arguments will be able to overcome these miscommunications, but this pattern suggests not. 

Also, I'm not so sure this pattern of "it's better to skim and say something half-baked, rather than not read or react at all" is helpful, rather than actively harmful, in this case. At least, maybe 3/4-baked or something might be better? Miscommunications and anti-willingness to thoroughly engage are only snowballing.

I also could be wrong in thinking such a realtime convo hasn't happened.

We were arguing about concept learning order and the ease of internalizing human values versus other features for basing decisions on. We were arguing that human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning capacity ladder, you end up with an AI that’s aligned before you end up with one that’s so capable it can destroy the entirety of human civilization by itself.

Yes, but you were arguing for that using examples of "morally evaluating" and "grokking the underlying simple moral rule", not of caring.


Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).

I think it's worth reflecting on what type of evidence would be sufficient to convince you that we're actually making progress on the "caring" bit of alignment and not merely the "understanding" bit. Because I currently don't see what type of evidence you'd accept beyond near-perfect mechanistic interpretability.

I think current LLMs demonstrate a lot more than mere understanding of human values; they seem to actually 'want' to do things for you, in a rudimentary behavioral sense. When I ask GPT-4 to do some task for me, it's not just demonstrating an understanding of the task: it's actually performing actions in the real world that result in the task being completed. I think it's totally reasonable, prima facie, to admit this as evidence that we are making some success at getting AIs to "care" about doing tasks for users.

It's not extremely strong evidence, because future AIs could be way harder to align, maybe there's ultimately no coherent sense in which GPT-4 "cares" about things, and perhaps GPT-4 is somehow just "playing the training game" despite seemingly having limited situational awareness. 

But I think it's valid evidence nonetheless, and I think it's wrong to round this datum off to a mere demonstration of "understanding". 

We typically would not place such a high standard on other humans. For example, if a stranger helped you in your time of need, you might reasonably infer that the stranger cares about you to some extent, not merely that they "understand" how to care about you, or that they are merely helping people out of a desire to appear benevolent as part of a long-term strategy to obtain power. You may not be fully convinced they really care about you because of a single incident, but surely it should move your credence somewhat. And further observations could move your credence further still.

Alternative explanations of aligned behavior we see are always logically possible, and it's good to try to get a more mechanistic understanding of what's going on before we confidently declare that alignment has been solved. But behavioral evidence is still meaningful evidence for AI alignment, just as it is for humans.

I think it's worth reflecting on what type of evidence would be sufficient to convince you that we're actually making progress on the "caring" bit of alignment and not merely the "understanding" bit. Because I currently don't see what type of evidence you'd accept beyond near-perfect mechanistic interpretability.

I'm not Nate, but a pretty good theoretical argument that X method of making AIs would lead to an AI that "cared" about the user would do it for me, and I can sort of conceive of such arguments that don't rely on really good mechanistic interpretability.

Can you give an example of a theoretical argument of the sort you'd find convincing? Can be about any X caring about any Y.

Not sure how close you want it to be but how about this example: "animals will typically care about their offspring's survival and reproduction in worlds where their action space is rich enough for them to be helpful and too rich for them to memorize extremely simple heuristics, because if they didn't their genes wouldn't propagate as much". Not air-tight, and also I knew the stylized fact before I heard the argument so it's a bit unfair, but I think it's pretty good as it goes.

I admit I'm a bit surprised by your example. Your example seems to be the type of heuristic argument that, if given about AI, I'd expect would fail to compel many people (including you) on anything approaching a deep level. It's possible I was just modeling your beliefs incorrectly. 

Generally speaking, I suspect there's a tighter connection between our selection criteria in ML and the stuff models will end up "caring" about relative to the analogous case for natural selection. I think this for similar reasons that Quintin Pope alluded to in his essay about the evolutionary analogy. 

If you think you'd be persuaded that animals will end up caring about their offspring because of a heuristic argument about that type of behavior being selected for in-distribution, I'm not sure why you'd need a lot of evidence to be convinced the same will be true for AIs with regard to what we train them to care about. But again, perhaps you don't actually need that much evidence, and I was simply mistaken about what you believe here.

Your example seems to be the type of heuristic argument that, if given about AI, I'd expect would fail to compel many people (including you) on anything approaching a deep level.

I think people are often persuaded of things about AI by heuristic arguments, like "powerful AI will probably be able to reason well and have a decent model of the world because if you don't do that you can't achieve good outcomes" (ok that argument needs some tightening, but I think there's something that works that's only ~2x as long). I think it's going to be harder to persuade me of alignment-relevant stuff about AI with this sort of argument, because there are more ways for such arguments to fail IMO - e.g. the evolution argument relies on evolutionary pressure being ongoing.

Two meta points:

  • There are arguments that would convince me that we'd made progress, and there are arguments that would convince me we've solved it. It's easier to get your hands on the first kind than the second.
  • It's easier for me to answer gallabytes' question than yours because I don't think the argument tactics I see are very good, so it's going to be hard to come up with one that I think is good! The closest I can come is that "what if we tried to learn values" and "AI safety via debate" felt like steps forward in thought, even though I don't think they get very far.

Generally speaking, I suspect there's a tighter connection between our selection criteria in ML and the stuff models will end up "caring" about relative to the analogous case for natural selection. I think this for similar reasons that Quintin Pope alluded to in his essay about the evolutionary analogy.

For the record I'm not compelled of this enough to be optimistic about alignment, but I'm roughly at my budget for internet discussion/debate right now, so I'll decline to elaborate.

If you think you'd be persuaded that animals will end up caring about their offspring because of a heuristic argument about that type of behavior being selected for in-distribution, I'm not sure why you'd need a lot of evidence to be convinced the same will be true for AIs with regard to what we train them to care about.

Roughly because AI can change the distribution and change the selection pressure that gets applied to it. But also, I don't think I need a lot of evidence in terms of likelihood ratio (my p(doom) is less than 99%, and people convince me of sub-1-in-100 claims all the time); I'm just not seeing the sort of evidence that would move me a lot.


Testing it on out-of-distribution examples seems helpful. If an AI still acts as if it follows human values out of distribution, it probably truly cares about human values. For AI with situational awareness, we can probably run simulations to an extent (and probably need to bootstrap this after a certain capabilities threshold).

How about an argument in the shape of: 

  1. we'll get good evidence of human-like alignment-relevant concepts/values well-represented internally (e.g. Scaling laws for language encoding models in fMRI, A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations); in addition to all the accumulating behavioral evidence
  2. we'll have good reasons to believe alternate (deceptive) strategies are unlikely / relevant concepts for deceptive alignment are less accessible: e.g. through evals vs. situational awareness, through conceptual arguments around speed priors and not enough expressivity without CoT + avoiding steganography + robust oversight over intermediate text, by unlearning/erasing/making less accessible (e.g. by probing; see the toy probe sketch after this list) concepts relevant for deceptive alignment, etc.
  3. we have some evidence for priors in favor of fine-tuning favoring strategies which make use of more accessible concepts, e.g. Predicting Inductive Biases of Pre-Trained Models, Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features.
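
As a concrete toy illustration of the probing idea in point 2 (synthetic stand-in activations, not a real model): fit a linear probe to hidden activations and use its held-out accuracy as a rough measure of how accessible a concept is.

```python
# Toy probing sketch: how linearly "accessible" is a concept from hidden activations?
# The activations here are synthetic stand-ins, not taken from a real model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256
concept = rng.integers(0, 2, size=n)   # binary concept label for each example
acts = rng.normal(size=(n, d))         # stand-in "hidden activations"
acts[:, 0] += 1.5 * concept            # inject a linearly readable trace of the concept

X_tr, X_te, y_tr, y_te = train_test_split(acts, concept, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # well above 0.5 => concept is easy to extract
```

In a real setting the activations would come from a model's hidden layers, and probe accuracy (compared across layers, models, or before/after an erasure intervention) would stand in for "how accessible is this concept"; the sketch is only meant to pin down what that check looks like.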

For 1 it would depend on how alignment-relevant the concepts and values are. Also I wouldn't think of the papers you linked as much evidence here.

For 2, that would for sure do it, but it doesn't feel like much of a reduction.

3 sounds like it's maybe definitionally true? At the very least, I don't doubt it much.

Interesting, I'm genuinely curious what you'd expect better evidence to look like for 1.

So I just skimmed the abstracts you linked, so maybe I was too hasty there, but I'd want to see evidence that (a) a language model is representing concept C really well and (b) it's really relevant for alignment. I think those papers show something like "you can sort of model brain activations by language model activations" or "there's some embedding space for what brains are sort of doing in conversation", which seems like a different thing (unless the fit is good enough that you can reconstruct one from the other without loss of functionality, in which case I'm interested).

Makes sense. Just to clarify, the papers I shared for 1 were mostly meant as methodological examples of how one might go about quantifying brain-LLM alignment; I agree about b), that they're not that relevant to alignment (though some other similar papers do make some progress on that front, addressing [somewhat] more relevant domains/tasks - e.g. on emotion understanding - and I have/had an AI safety camp '23 project trying to make similar progress - on moral reasoning). W.r.t. a), you can (also) do decoding (predicting LLM embeddings from brain measurements), the inverse of encoding; this survey, for example, covers both encoding and decoding.

Would any of these count for you?

We have promising alignment plans with low taxes

Each of the three plans I mention is an attempt to put the "understanding" part into the "wanting" slot (a "steering subsystem" that controls goals for decision-making purposes) in a different AGI design. That brief post links to somewhat more detailed plans.

Overall take: unimpressed.

Very simple gears in a subculture's worldview can keep being systematically misperceived if the subculture isn't considered worthy of curious attention. On the local llama subreddit, I keep seeing assumptions that AI safety people call for never developing AGI, or that they claim the current models can contribute to destroying the world. Almost never does anyone bother to contradict such claims or assumptions. This doesn't happen because it's difficult to figure out; it happens because the AI safety subculture is seen as unworthy of engagement, so people don't learn what it's actually saying, and don't correct each other's errors about what it's actually saying.

This gets far worse with more subtle details, where the bar for engagement is higher: you have to actually study what the others are saying, and that would be difficult to figure out even with curious attention. Rewarding engagement is important.

I agree. It's rare enough to get reasonable arguments for optimistic outlooks that this one seems worth engaging with openly and in some detail.

Yeah, the fact that the responses to the optimistic arguments sometimes rely on simply not engaging with them in any detail has really dimmed my hopes for reaching out, and causes me to think more poorly of the AI-doom case, epistemically.

This actually happened before with Eliezer's Comment here:

https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#YYR4hEFRmA7cb5csy

I wish this post had never been written at all, because as it is, I find it far too unnuanced, and your reasoning for not reading it over fully isn't all that great, since the document is, I think, only about 3,000 words.

In general, I think this is one of those cases where you need to slow down, and actually read over the document you plan to criticize, and this would be a better post if it did that.

It sounds like you wish this wasn't written at all because you'd prefer a more detailed post? But that's not always a realistic option. There's a lot to attend to.

If you're saying that posting an "I ignored this, but here are some guesses I made about it that led me to ignore it" isn't helpful, that's more reasonable.

I find it helpful because Nate's worldview is an important one, and he's at least bothered to tell us something about why he's not engaging more deeply.

Fortunately, that more detailed analysis has been done by Steve Byrnes: Thoughts on “AI is easy to control” by Pope & Belrose

If you're saying that posting an "I ignored this, but here are some guesses I made about it that led me to ignore it" isn't helpful, that's more reasonable.

This is what I was talking about, combined with "I expected better of Nate than to throw out an unnuanced take in response to something complicated."

“AI will be able to figure out what humans want” (yes; obviously; this was never under dispute)

I think the problem is that how existing systems figure out what humans want doesn't seem to have anything to do with your theory of why it's supposed to be relatively easy. Therefore the theory's prediction that alignment is relatively hard also doesn't have anything to do with reality.

I think a very simple and general pessimistic take is "AI will make human thinking irrelevant". It almost doesn't matter if it happens by subversion or just normal economic activity. One way or another, we'll end up with a world where human thinking is irrelevant, and nobody has described a good world like that.

The only good scenarios are where human thinking somehow avoids "habitat destruction". Maybe by uplifting, or explicit "habitat preservation", or something else. But AI companies are currently doing the opposite, reallocating more and more tasks to AI outright, so it's hard to be an optimist.