Nate, please correct me if I'm wrong, but it looks like you:
You've clearly put a lot of time into this. If you want to understand the argument, why not just read the original post and talk to the authors directly? It's very well-written.
I don't want to speak for Nate, and I also don't want to particularly defend my own behavior here, but I have kind of done similar things around trying to engage with the "AI is easy to control" stuff.
I found it quite hard to engage with directly. I have read the post, but I bounced off a few times and would not claim to come close to passing an ITT of its authors, and I don't currently expect direct conversation with Quintin or Nora to be that productive (though I would still be up for it and would give it a try).
So I have been talking about the stuff with friends and other people in my social circle with whom I have a history of communicating well, and I think that's been valuable to me. Many of them had similar experiences, so in some sense it did feel like a group of blind men each groping at part of an elephant, but I don't have a much better alternative. I did not find the original post easy to understand, or the kind of thing I felt capable of responding to.
I would kind of appreciate better suggestions. Just forcing myself to engage more with the original post has not helped me much. Dialogues like this do actually seem helpful to me (and I found reading this one valuable).
How much have you read about deep learning from "normal" (non-xrisk-aware) AI academics? Belrose's Tweet-length argument against deceptive alignment sounds really compelling to the sort of person who's read (e.g.) Simon Prince's textbook but not this website. (This is a claim about what sounds compelling to which readers rather than about the reality of alignment, but if xrisk-reducers don't understand why an argument would sound compelling to normal AI practitioners in the current paradigm, that's less dignified than understanding it well enough to confirm or refute it.)
I think I could pass the ITTs of Quintin/Nora sufficiently to have a productive conversation while also having interesting points of disagreement. If that's the bottleneck, I'd be interested in participating in some dialogues, if it's a "people genuinely trying to understand each other's views" vibe rather than a "tribalistically duking it out for the One True Belief" vibe.
This is really interesting, because I find Quintin and Nora's content incredibly clear and easy to understand.
As one hypothesis (which I'm not claiming is true for you, just a thing to consider)—When someone is pointing out a valid flaw in my views or claims, I personally find the critique harder to "understand" at first. (I know this because I'm considering the times where I later agreed the critique was valid, even though it was "hard to understand" at the time.) I think this "difficulty" is basically motivated cognition.
I am a bit stressed right now, and so maybe am reading your comment too much as a "gotcha", but on the margin I would like to avoid psychologizing of me here (I think it's sometimes fine, but the above already felt a bit vulnerable and this direction feels like it disincentivizes that). I generally like sharing the intricacies and details of my motivations and cognition, but this is much harder if this immediately causes people to show up to dissect my motivations to prove their point.
More on the object-level, I don't think this is the result of motivated cognition, though it's of course hard to rule out. I would prefer that saying this kind of thing out loud not become a liability in contexts like this; I expect that norm will make conversations where people try to understand where others are coming from go better.
Sorry if I overreacted in this comment. I do think in a different context, on maybe a different day I would be up for poking at my motivations and cognition and see whether indeed they are flawed in this way (which they very well might be), but I don't currently feel like it's the right move in this context.
I think it's sometimes fine, but the above already felt a bit vulnerable and this direction feels like it disincentivizes that
FWIW, I think your original comment was good and I'm glad you made it, and want to give you some points for it. (I guess that's what the upvote buttons are for!)
Fwiw, I generally find Quintin’s writing unclear and difficult to read (I bounce a lot) and Nora’s clear and easy, even though I agree with Quintin slightly more (although I disagree with both of them substantially).
I do think there is something to “views that are very different from one's own” being difficult to understand, sometimes, although I think this can be for a number of reasons. Like, for me at least, understanding someone with very different beliefs can be both time-intensive and cognitively demanding—I usually have to sit down and iterate on “make up a hypothesis of what I think they’re saying, then go back and check if that’s right, update hypothesis, etc.” This process can take hours or days, as the cruxes tend to be deep and not immediately obvious.
Usually before I’ve spent significant time on understanding writing in this way, e.g. during the first few reads, I feel like I’m bouncing, or otherwise find myself wanting to leave. But I think the bouncing feeling is (in part) tracking that the disagreement is really pervasive and that I’m going to have to put in a bunch of effort if I actually want to understand it, rather than that I just don't like that they disagree with me.
Because of this, I personally get a lot of value out of interacting with friends who have done the “translating it closer to my ontology” step—it reduces the understanding cost a lot for me, which tends to be higher the further from my worldview the writing is.
Yeah, for me the early development of shard theory work was confusing for similar reasons. Quintin framed values as contextual decision influences and thought these were fundamental, while I'd absorbed from Eliezer that values were like a utility function. They just think in very different frames. This is why science is so confusing until one frame proves useful and is established as a Kuhnian paradigm.
ehh it feels to me like i can get you more than 100:1 against alignment by default in the very strongest sense; i feel like my knowledge of possible mind architectures (and my awareness of stochastic gradient descent-accessible shortcut-hacks) rules out "naive training leads to friendly AIs"
probably more extreme than 2^-100:1, is my guess
What is the 2^-100:1 part intended to mean? Was it a correction to the 100:1 part or a different claim? Seems like an incredibly low probability.
Separately:
Ronny Fernandez
I’m like look, I used to think the chances of alignment by default were like 2^-10000:1
I still think I can do this if we’re searching over python programs
This seems straightforwardly insane to me, in a way that is maybe instructive. Ronny has updated from an odds ratio of 2^-10000:1 to one that is (implicitly) thousands of orders of magnitude different, which should essentially never happen. Ronny has just admitted to being more wrong than practically anyone who has ever tried to give a credence. And then, rather than being like "something about the process by which I generate 2^-10000:1 chances is utterly broken", he just... made another claim of the same form?
I don't think there's anything remotely resembling probabilistic reasoning going on here. I don't know what it is, but I do want to point at it and be like "that! that reasoning is totally broken!" And it seems to me that people who assign P(doom) > 90% are displaying a related (but far less extreme) phenomenon. (My posts about meta-rationality are probably my best attempt at actually pinning this phenomenon down, but I don't think I've done a great job so far.)
my original 100:1 was a typo, where i meant 2^-100:1.
this number was in reference to ronny's 2^-10000:1.
when ronny said:
I’m like look, I used to think the chances of alignment by default were like 2^-10000:1
i interpreted him to mean "i expect it takes 10k bits of description to nail down human values, and so if one is literally randomly sampling programs, they should naively expect 1:2^10000 odds against alignment".
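(spelled out, the naive estimate i'm attributing to ronny here is just a counting argument: if nailing down human values takes $K$ bits relative to the language you're sampling programs from, then under a length-weighted prior,

$$P(\text{aligned by chance}) \approx 2^{-K}, \qquad K \approx 10{,}000 \text{ bits} \implies \text{odds of about } 1 : 2^{10000} \text{ against.}$$

the $K \approx 10{,}000$ is an assumed description length, not a measured one.)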
i personally think this is wrong, for reasons brought up later in the convo--namely, the relevant question is not how many bits it takes to specify human values relative to the python standard library; the relevant question is how many bits it takes to specify human values relative to the training observations.
but this was before i raised that objection, and my understanding of ronny's position was something like "specifying human values (in full, without reference to the observations) probably takes ~10k bits in python, but for all i know it takes very few bits in ML models". to which i was attempting to reply "man, i can see enough ways that ML models could turn out that i'm pretty sure it'd still take at least 100 bits".
i inserted the hedge "in the very strongest sense" to stave off exactly your sort of objection; the very strongest sense of "alignment-by-default" is that you sample any old model that performs well on some task (without attempting alignment at all) and hope that it's aligned (e.g. b/c maybe the human-ish way to perform well on tasks is the ~only way to perform well on tasks and so we find some great convergence); here i was trying to say something like "i think that i can see enough other ways to perform well on tasks that there's e.g. at least ~33 knobs with at least ~10 settings such that you have to get them all right before the AI does something valuable with the stars".
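(for the arithmetic behind that: $10^{33} = 2^{33 \log_2 10} \approx 2^{110}$, so ~33 independent ten-setting knobs already puts you past the $2^{100}$ i quoted above.)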
this was not meant to be an argument that alignment actually has odds less than 2^-100, for various reasons, including but not limited to: any attempt by humans to try at all takes you into a whole new regime; there's more than a 2^-100 chance that there's some correlation between the various knobs for some reason; and the odds of my being wrong about the biases of SGD are greater than 2^-100 (case-in-point: i think ronny was wrong about the 2^-10000 claim, on account of the point about the relevant number being relative to the observations).
my betting odds would not be anywhere near as extreme as 2^-100, and i seriously doubt that ronny's would ever be anywhere near as extreme as 2^-10000; i think his whole point in the 2^-10k example was "there's a naive-but-relevant model that says we're super-duper fucked; the details of it cause me to think that we're not in particularly good shape (though obviously not to that same level of credence)".
but even saying that is sorta buying into a silly frame, i think. fundamentally, i was not trying to give odds for what would actually happen if you randomly sample models, i was trying to probe for a disagreement about the difference between the number of ways that a computer program can be (weighted by length), and the number of ways that a model can be (weighted by SGD-accessibility).
I don't think there's anything remotely resembling probabilistic reasoning going on here. I don't know what it is, but I do want to point at it and be like "that! that reasoning is totally broken!"
(yeah, my guess is that you're suffering from a fairly persistent reading comprehension hiccup when it comes to my text; perhaps the above can help not just in this case but in other cases, insofar as you can use this example to locate the hiccup and then generalize a solution)
(I obviously don’t speak for Ronny.) I’d guess this is kinda the within-model uncertainty: he had a model of “alignment” that said you needed to specify all 10,000 bits of human values, and so the odds of doing this by default/at random were 2^-10000:1. But this doesn’t include the uncertainty that this model is wrong, which would make the within-model uncertainty a rounding error.
According to this model there is effectively no chance of alignment by default, but this model could be wrong.
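To put rough placeholder numbers on that (these are not Ronny's actual credences): if $p$ is the credence that the bit-counting model is the right frame, then

$$P(\text{alignment by default}) \approx p \cdot 2^{-10000} + (1 - p) \cdot P(\text{alignment} \mid \text{model wrong}),$$

and the first term is a rounding error next to the second for any non-negligible $1 - p$.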
If Ronny had said "there is one half-baked heuristic that claims that the probability is 2^-10000" then I would be sympathetic. That seems very different to what he said, though. In some sense my objection is precisely to people giving half-baked heuristics and intuitions an amount of decision-making weight far disproportionate to their quality, by calling them "models" and claiming that the resulting credences should be taken seriously.
I think that would be a more on-point objection to make on a single-author post, but this is a chat log between two people, optimized for communicating with each other, and as such it generally comes with fewer caveats and takes a bunch of implicit things for granted (this makes it well-suited for some kinds of communication, and not others). I like it in that it helps me get a much better sense of a bunch of underlying intuitions and hunches that are often hard to formalize and so rarely make it into posts, but I also think it is sometimes frustrating because it's not optimized to be responded to.
I would take bets that Ronny's position was always something closer to "I had this robust-seeming inside-view argument that claimed the probability was extremely low, though of course my outside view and different levels of uncertainty caused my betting odds to be quite different".
I don't see why a high probability of doom should be ruled out by the fact that some folks who present themselves as having good epistemics are actually quite bad at picking up new models, stuck in an old, limiting paradigm, and refusing to investigate new things properly because they believe they already know. It certainly weakens appeals to their authority, but the reasoning stands on its own, to the degree it's actually specified using valid and relevant formal claims.
To be clear, I did not think we were discussing the AI optimist post. I don't think Nate thought that. I thought we were discussing reasons I changed my mind a fair bit after talking to Quintin.
seems kinda hard to make something formal to me because the basic argument is, i think, "there's really a lot of ways for a model to do well in training", but i don't know how one is supposed to formalize that. i guess i'm curious where you think the force of formality comes in for the analogous argument when it comes to python programs
This may not be easily formalizable, but it does seem easily testable? Like, what's wrong with just training a bunch of different models and seeing if they have similar generalization properties? If they're radically different, then there are many ways of doing well in training. If they're pretty similar, then there are very few ways of doing well in training.
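As a toy sketch of the kind of test I have in mind (the dataset, model class, and distribution shift here are illustrative stand-ins, nowhere near the scale at which the question actually matters):

```python
# Toy sketch: train several models that all do well in training, then check whether
# they generalize the same way off-distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Perturbed inputs stand in for deployment conditions the training data never pinned down.
X_shifted = X_test + np.random.default_rng(0).normal(scale=2.0, size=X_test.shape)

predictions = []
for seed in range(10):
    model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=seed)
    model.fit(X_train, y_train)
    print(f"seed {seed}: train accuracy {model.score(X_train, y_train):.2f}")
    predictions.append(model.predict(X_shifted))

# High pairwise agreement off-distribution suggests few ways of doing well in training;
# low agreement suggests many. (The mean includes the diagonal, i.e. self-agreement.)
predictions = np.array(predictions)
agreement = (predictions[:, None, :] == predictions[None, :, :]).mean()
print(f"mean pairwise agreement on shifted data: {agreement:.2f}")
```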
I think evolution-of-humans is kinda like taking a model-based RL algorithm (for within-lifetime learning), and doing a massive outer-loop search over neural architectures, hyperparameters, and also reward functions. In principle (though IMO almost definitely not in practice), humans could likewise do that kind of grand outer-loop search over RL algorithms, and get AGI that way. And if they did, I strongly expect that the resulting AGI would have a “curiosity” term in its reward function, as I think humans do. After all, a curiosity reward-function term is already sometimes used in today’s RL, e.g. the Montezuma’s Revenge literature, and it’s not terribly complicated, and it’s useful, and I think innate-curiosity-drive exists not only in humans but also in much much simpler animals. Maybe there’s more than one way to implement curiosity-drive in detail, but something in that category seems pretty essential for an RL algorithm to train successfully in a complex environment, and I don’t think I’m just over-indexing on what’s familiar.
Again, this is all pretty irrelevant on my models because I don’t expect that people will program AGI by doing a blind outer-loop search over RL reward functions. Rather, I expect that people will write down the RL reward function for AGI in the form of handwritten source code, and that they will put curiosity-drive into that reward function source code (as they already sometimes do), because they will find that it’s essential for capabilities.
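For concreteness, here is a minimal sketch of the prediction-error style of curiosity bonus from that literature; the forward model, the coefficient, and the interface are placeholder assumptions, not anyone's actual reward code:

```python
# Illustrative sketch only: a prediction-error ("surprise") curiosity bonus added to the
# task reward, in the spirit of the intrinsic-motivation RL literature.
import numpy as np

def shaped_reward(extrinsic_reward, state, next_state, forward_model, beta=0.1):
    """Total reward = task reward + bonus for transitions the forward model predicts badly."""
    predicted_next = forward_model.predict(state)  # the forward model's guess at the next state
    surprise = float(np.mean((np.asarray(predicted_next) - np.asarray(next_state)) ** 2))
    return extrinsic_reward + beta * surprise      # surprising transitions are "interesting"
```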
Separately, insofar as curiosity-drive is essential for capabilities (as I believe), it doesn’t help alignment, but rather hurts alignment, because it’s bad if an AI wants to satisfy its own curiosity at the expense of things that humans directly value. Hopefully that’s obvious to everyone here, right? Parts of the discussion seemed to be portraying AIs-that-are-curious as a good thing rather than a bad thing, which was confusing to me. I assume I was just failing to follow all the unspoken context?
maintaining uncertainty about the true meaning of an objective is important, but there's a difference between curiosity about the true values one holds, intrinsic curiosity as a component of a value system, and instrumental curiosity as a consequence of an uncertain planning system. I'm surprised to see disagreement from MiguelDev and Noosphere; could either of you expand on what you disagree with?
@the gears to ascension Hello! I just think curiosity is a low-level attribute that enables a reaction, and it may be good or bad all things considered; in this regard, curiosity (or studying curiosity) may help with alignment as well.
For example, if an AI is in a situation where it needs to save someone from a burning house, it should be curious enough to consider all the options available, and eventually, if it is aligned, it will choose the actions that result in good outcomes (after also considering the bad options). That is why I don't agree with the idea that curiosity purely hurts alignment, as described in the comment.
(I think Nate and Ronny share important knowledge in this dialogue about low-level forces (birthed by evolution) that I think are misunderstood by many.)
Your example is about capabilities (assuming the AI is trying to save me from the fire, will it succeed?) but I was talking about alignment (is the AI trying to save me from the fire in the first place?)
I don’t want the AI to say “On the one hand, I care about Steve’s welfare. On the other hand, man I’m just really curious how people behave when they’re on fire. Like, what do they say? What do they do? So, I feel torn—should I save Steve from the fire or not? Hmm…”
(I agree that, if an AI is aligned, and if it is trying to save me from a burning house, then I would rather that the AI be more capable rather than less capable—i.e., I want the AI to come up with and execute very very good plans.)
See also colorful examples in Scott Alexander’s post such as:
Even if an AI decides human flourishing is briefly interesting, after a while it will already know lots of things about human flourishing and want to learn something else instead. Scientists have occasionally made colonies of extremely happy well-adjusted rats to see what would happen. But then they learned what happened, and switched back to things like testing how long rats would struggle against their inevitable deaths if you left them to drown in locked containers.
As for capabilities, I think curiosity drive is probably essential during early RL training. Once the AI is sufficiently intelligent (including in metacognitive / self-reflective ways), it’s plausible that we could turn curiosity drive off without harming capabilities. After all, it’s possible for an AI to “consider all possible options” not because it’s curious, but rather because it wants me to not die in the fire, and it’s smart enough to know that “considering all possible options” is a very effective means-to-an-end for preventing me from dying in the fire.
Humans can do that too. We don’t only seek information because we’re curious; we can also do it as a means to an end. For example, sometimes I have really wanted to do something, and so then I read a mind-numbingly boring book that I expect might help me do that thing. Curiosity is not driving me to read the book; on the contrary, curiosity is pushing me away from the book with all its might, because anything else on earth would be more inherently interesting than this boring book. But I read the book anyway, because I really want to do the thing, and I know that reading the book will help. I think an AI which is maximally beneficial to humans would have a similar kind of motivation. Yes it would often brainstorm, and ponder, and explore, and seek information, etc., but it would do all those things not because they are inherently rewarding, but rather because it knows that doing those things is probably useful for what it really wants at the end of the day, which is to benefit humans.
Once the AI is sufficiently intelligent (including in metacognitive / self-reflective ways), it’s plausible that we could turn curiosity drive off without harming capabilities. After all, it’s possible for an AI to “consider all possible options” not because it’s curious, but rather because it wants me to not die in the fire, and it’s smart enough to know that “considering all possible options” is a very effective means-to-an-end for preventing me from dying in the fire.
Interesting view, but I have to point out that situations change, and there will be many tiny details that become like a back-and-forth discussion inside the AI's network as it performs its tasks. Turning off curiosity will most likely end in worse outcomes, as the AI may not be able to update its decisions (e.g. "oops, I didn't see there was a fire hose available" or "oops, I didn't feel the heat of the floor earlier").
Obviously, Person B is correct here, because AlphaZero-chess works well.
To my ears, your claim (that an AI without intrinsic drive to satisfy curiosity cannot learn to update its decisions) is analogous to Person A’s claim (that an AI without intrinsic drive to protect its queen cannot learn to do so).
In other words, if it’s obvious to you that the AI is insufficiently updating its decisions, it would be obvious to the AI as well (once the AI is sufficiently smart and self-aware). And then the AI can correct for that.
Thanks for explaining your views; this has helped me deconfuse myself. While replying and thinking, I found myself drawing lines where curiosity and self-awareness overlap, which also made me feel the expansive nature of studying theoretical alignment: it's very dense, and it's so easy to drown in information. This discussion felt like a whack from a baseball bat, and I survived to write this comment. Moreover, getting to Person B still requires knowledge of curiosity and its mechanisms, so I still err on the side of finding out how curiosity works[1] or gets imbued into intelligent systems (us and AI). For me this is very relevant to alignment work.
I'm speculating about a simplified evolutionary-cognitive chain in humans: curiosity + survival instincts (including hunger) → intelligence → self-awareness → rationality.
you can argue all you want that any flying device will have to flap its wings, and that won’t constrain airplane designs
You can argue all you want that any thinking device will have to reflect on its thoughts, and that won’t constrain mind designs.
the prior is still really wide, so wide that a counting argument still more-or-less works
And it also works for arguing that GPT-3 won't happen: there are more hacks that give you low loss than there are useful-to-humans hacks that give you low loss.
so does your whole sense of difference go out the window if we do something autogpt-ish?
I think it should be analyzed separately, but intuitively, if your GPT never thinks about killing humans, plans built out of its thoughts should be less likely to result in killing humans.
at this juncture i interpret the shard theory folk as arguing something like "well the shards that humans build their values up around are very proximal to minds
In the spirit of pointing out subtle things that seem wrong: My understanding of the ST position is that shards are values. There's no "building values around" shards; the idea is that shards are what implements values and values are implemented as shards.
At least, I'm pretty sure that's what the position was a ~year ago, and I've seen no indications the ST folk moved from that view.
most humans (with fully-functioning brains) have in some sense absorbed sufficiently similar values and reflective machinery that they converge to roughly the same place
The way I would put it is "it's plausible that there is a utility function such that the world-state maximizing it ranks very high by the standards of most humans' preferences, and we could get that utility function by agglomerating and abstracting over individual humans' values".
Like, if Person A loves seafood and hates pizza, and Person B loves pizza and hates seafood, then no, agglomerating these individual people's preferences into Utility Function A and Utility Function B won't result in the same utility function (and more so for more important political/philosophical stuff). But if we abstract up from there, we get "people like to eat tasty-according-to-them food", and then a world in which both A and B are allowed to do that would rank high by the preferences of both of them.
Similarly, it seems plausible that somewhere up there at the highest abstraction levels, most humans' preferences (stripped of individual nuance on their way up) converge towards the same "maximize eudaimonia" utility function, whose satisfaction would make ~all of us happy. (And since it's highly abstract, its maximal state would be defined over an enormous equivalence class of world-states. So it won't be a universe frozen in a single moment of time, or tiled with people with specific preferences, or anything like that.)
I was excited to read this, because Nate is a clear writer and a clear thinker, who has a high p(doom) for reasons I don't entirely understand. This did pay off for me in a brief statement that clarified some of his reasons I hadn't understood:
Nate said
this is part of what i mean by "i don't think alignment is all that hard"
my high expectation of doom comes from a sense that there's lots of hurdles and that humanity will flub at least one (and probably lots)
I find this disturbingly compelling. I hadn't known Nate thought alignment might be fairly easy. Given that, his pessimism is more relevant to me, since I'm pretty sure alignment is do-able even in the near future.
I'm afraid I found the rest of this convoluted, and felt it made little progress toward a contentful discussion.
Let me try to summarize the post in case it's helpful. None of these are direct quotes:

Nate: I think alignment by default is highly unlikely.

Ronny: I think alignment by default is highly unlikely. (This somehow took most of the conversation.)

Ronny: But we won't do alignment by default. We'll do it with RL. Sometimes, when I talk to Quintin, I think we might get working alignment by doing RL and pointing the system at lots of stuff we want it to do. It might reproduce human values accurately enough to do that.

Nate: There are a lot of ways to get anything done. So telling it what you want it to do is probably not going to make it generalize well or actually value the things you value.

Ronny: I agree, but I don't have a strong argument for it.

...
So in sum I didn't see any strong argument for it beyond "lots of ways to get things done, so a value match is unlikely".
Like Rob and Nate, my intuition says that's unlikely to work.
The number of ways to get things done is substantially constrained if the system is somehow trained to use human concepts and thinking patterns. So maybe that's the source of optimism for Quintin and the Shard Theorists? Training on language does seem to substantially constrain a model to use human-like concepts.
I think the bulk of the disagreement is deeper and vaguer. One point of vague disagreement seems to be something like: Theory suggests that alignment is hard. Empirical data (mostly from LLMs) suggests that it's easy to make AI do what you want. Which do you believe?
Fortunately, I don't think RL alignment is our only or best option, so I'm not hugely invested in the disagreement as it stands, because both perspectives are primarily thinking about RL alignment. I think "We have promising alignment plans with low taxes".
I think they're promising because they're completely different than RL approaches. More on that in an upcoming post.
Context: somebody at some point floated the idea that Ronny might (a) understand the arguments coming out of the Quintin/Nora camp, and (b) be able to translate them to Nate. Nate invited Ronny to chat. The chat logs follow, lightly edited.
The basic (counting) argument
Evolution / Reflection Process is Path Dependent
Summary and discussion of training an agent in a simulation
Alignment problem probably fixable, but likely won't be fixed
Discussing whether this argument about training can be formalized