even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy - if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!)
It seems to me that your analogy is the wrong way arond. IE, the right analogy would be "if I knew that a bunch of 5-year olds were reading my mind, I have...actually, a pretty good idea how to think deceptive thoughts in a way that avoids their tools".
(For what it's ...
After reading the first section and skimming the rest, my impression is that the document is a good overview, but does not present any detailed argument for why godlike AI would lead to human extinction. (Except for the "smarter species" analogy, which I would say doesn't qualify.) So if I put on my sceptic hat, I can imagine reading the whole document in detail and somewhat-justifiably going away with "yeah, well, that sounds like a nice story, but I am not updating based on this".
That seems fine to me, given that (as far as I am concerned) no detailed co...
Some suggestions for improving the doc (I noticed the link to the editable version too late, apologies):
What is AI? Who is building it? Why? And is it going to be a future we want?
Something weird with the last sentence here (substituting "AI" for "it" makes the sentence un-grammatical).
Machines of hateful competition need not have such hindrances.
"Hateful" seems likely to put off some readers here, and I also think it is not warranted -- indifference is both more likely and also sufficient for extinction. So "Machines of indifferent competition" might work...
Good point. Also, for the purpose of the analogy with AI X-risk, I think we should be willing to grant that the people arrive at the alternative hypothesis through theorising. (Similarly to how we came up with the notion of AI X-risk before having any powerful AIs.) So that does break my example somewhat. (Although in that particular scenario, I imagine that sceptic of Newtonian gravity would came up with alternative explanations for the observation. Not that this seems very relevant.)
I agree with all of this. (And good point about the high confidence aspect.)
The only thing that I would frame slightly differently is that:
[X is unfalsifiable] indeed doesn't imply [X is false] in the logical sense. On reflection, I think a better phrasing of the original question would have been something like: 'When is "unfalsifiability of X is evidence against X" incorrect?'. And this amended version often makes sense as a heuristic --- as a defense against motivated reasoning, conspiracy theories, etc. (Unfortunately, many scientists seem to take this ...
Some partial examples I have so far:
Phenomenon: For virtually any goal specification, if you pursue it sufficiently hard, you are guaranteed to get human extinction.[1]
Situation where it seems false and unfalsifiable: The present world.
Problems with the example: (i) We don't know whether it is true. (ii) Not obvious enough that it is unfalsifiable.
Phenomenon: Physics and chemistry can give rise to complex life.
Situation where it seems false and unfalsifiable: If Earth didn't exist.
Problems with the example: (i) if Earth didn't exist, there wouldn't b...
Nitpick on the framing: I feel that thinking about "misaligned decision-makers" as an "irrational" reason for war could contribute to (mildly) misunderstanding or underestimating the issue.
To elaborate: The "rational vs irrational reasons" distinction talks about the reasons using the framing where states are viewed as monolithic agents who act in "rational" or "irrational" ways. I agree that for the purpose of classifying the risks, this is an ok way to go about things.
I wanted to offer an alternative framing of this, though: For any state, we can consid...
[I am confused about your response. I fully endorse your paragraph on "the AI with superior ontology would be able to predict how humans would react to things". But then the follow-up, on when this would be scary, seems mostly irrelevant / wrong to me --- meaning that I am missing some implicit assumptions, misunderstanding how you view this, etc. I will try react in a hopefully-helpful way, but I might be completely missing the mark here, in which case I apologise :).]
I think the problem is that there is a difference between:
(1) AI which can predict how t...
Nitpicky edit request: your comment contains some typos that make it a bit hard to parse ("be other", "we it"). (So apologies if my reaction misunderstands your point.)
[Assuming that the opposite of the natural abstraction hypothesis is true --- ie, not just that "not all powerful AIs share ontology with us", but actually "most powerful AIs don't share ontology with us":]
I also expect that an AI with superior ontology would be able to answer your questions about its ontology, in a way that would make you feel like[1] you understand what is happening. ...
As a quick reaction, let me just note that I agree that (all else being equal) this (ie, "the AI understanding us & having superior ontology") seems desirable. And also that my comment above did not present any argument about why we should be pessimistic about AI X-risk if we believe that the natural abstraction hypothesis is false. (I was just trying to explain why/how "the AI has a different ontology" is compatible with "the AI understands our ontology".)
As a longer reaction: I think my primary reason for pessimism, if natural abstraction hypothetis ...
Simplifying somewhat: I think that my biggest delta with John is that I don't think the natural abstraction hypothesis holds. (EG, if I believed it holds, I would become more optimistic about single-agent alignment, to the point of viewing Moloch as higher priority.) At the same time, I believe that powerful AIs will be able to understand humans just fine. My vague attempt at reconciling these two is something like this:
Humans have some ontology, in which they think about the world. This corresponds to a world model. This world model has a certain amount o...
An illustrative example, describing a scenario that is similar to our world, but where "Extinction-level Goodhart's law" would be false & falsifiable (hat tip Vincent Conitzer):
Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonize the universe at the close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand "humanity" as the collection of all humans, including those in the unreachable colonies. Then a...
tl;dr: "lack of rigorous arguments for P is evidence against P" is typically valid, but not in case of P = AI X-risk.
A high-level reaction to your point about unfalsifiability:
There seems to be a general sentiment that "AI X-risk arguments are unfalsifiable ==> the arguments are incorrect" and "AI X-risk arguments are unfalsifiable ==> AI X-risk is low".[1] I am very sympathetic to this sentiment --- but I also think that in the particular case of AI X-risk, it is not justified.[2] For quite non-obvious reasons.
Why I believe this?
Take this ...
Assumption 2 is, barring rather exotic regimes far into the future, basically always correct, and for irreversible computation, this always happens, since there's a minimum cost to increase the features IRL, and it isn't 0.
Increasing utility IRL is not free.
I think this is a misunderstanding of what I meant. (And the misunderstanding probably only makes sense to try clarifying it if you read the paper and disagree with my interpretation of it, rather than if your reaction is only based on my summary. Not sure which of the two is the case.)
What I was trying...
I do agree that debate could be used in all of these ways. But at the same time, I think generality often leads to ambiguity and to papers not describing any such application in detail. And that in turn makes it difficult to critique debate-based approaches. (Both because it is unclear what one is critiquing and because it makes it too easy to accidentally dimiss the critiques using the motte-and-bailey fallacy.)
I was previously unaware of Section 4.2 of the Scalable AI Safety via Doubly-Efficient Debate paper and, hurray, it does give an answer to (2) in Section 4.2. (Thanks for mentioning, @niplav!) That still leaves (1) unanswered, or at least not answered clearly enough, imo. Also I am curious about the extent that other people, who find debate promising, consider this paper's answer to (2) as the answer to (2).
For what it's worth, none of the other results that I know about were helpful for me for understanding (1) and (2). (The things I know about are the or...
Further disclaimer: Feel free to answer even if you don't find debate promising, but note that I am primarily interested in hearing from people who do actively work on it, or find it promising --- or at least from people who have a very good model of specific such people.
Motivation behind the question: People often mention Debate as a promising alignment technique. For example, the AI Safety Fundamentals curriculum features it quite prominently. But I think there is a lack of consensus on "as far as the proposal is concerned, how is Debate actually meant t...
I believe that a promising safety strategy for the larger asteroids is to put them in a secure box prior to them landing on earth. That way, the asteroid is -- provably -- guaranteed to have no negative impact on earth.
Proof:
| | | | | | | |
v v v v v v v v
__________ CC
| ___ | CCCC
| / O O \ &...
Agreed.
It seems relevant, to the progression, that a lot of human problem solving -- though not all -- is done by the informal method of "getting exposed to examples and then, somehow, generalising". (And I likewise failed to appreciate this, not sure until when.) This suggests that if we want to build AI that solves things in similar ways that humans solve them, "magic"-involving "deepware" is a natural step. (Whether building AI in the image of humans is desirable, that's a different topic.)
tl;dr: It seems noteworthy that "deepware" has strong connotations with "it involves magic", while the same is not true for AI in general.
I would like to point out one thing regarding the software vs AI distinction that is confusing me a bit. (I view this as complementing, rather than contradicting, your post.)
As we go along the progression "Tools > Machines > Electric > Electronic > Digital", most[1] of the examples can be viewed as automating a reasonably-well-understood process, on a progressively higher level of abstraction.[2]
[For exa...
I want to flag that the overall tone of the post is in tension with the dislacimer that you are "not putting forward a positive argument for alignment being easy".
To hint at what I mean, consider this claim:
Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.
I think this claim is only valid if you are in a situation such as "your probability of scheming was >95%, and this was based basically only on this particular version of the 'counting argument' ". That is, if you somehow thought that we had ...
I feel a bit confused about your comment: I agree with each individual claim, but I feel like perhaps you meant to imply something beyond just the individual claims. (Which I either don't understand or perhaps disagree with.)
Are you saying something like: "Yeah, I think that while this plan would work in theory, I expect it to be hopeless in practice (or unneccessary because the homework wasn't hard in the first place)."?
If yes, then I agree --- but I feel that of the two questions, "would the plan work in theory" is the much less interesting one. (For exa...
However, note that if you think we would fail to sufficiently check human AI safety work given substantial time, we would also fail to solve various issues given a substantial pause
This does not seem automatic to me (at least in the hypothetical scenario where "pause" takes a couple of decades). The reasoning being that there is difference between [automate a current form of an institution, and speed-run 50 years of it in a month] and [an institutions, as it develops over 50 years].
For example, my crux[1] is that current institutions do not subscribe ...
Assumming that there is an "alignment homework" to be done, I am tempted to answer something like: AI can do our homework for us, but only if we are already in a position where we could solve that homework even without AI.
An important disclaimer is that perhaps there is no "alignment homework" that needs to get done ("alignment by default", "AGI being impossible", etc). So some people might be optimistic about Superalignment, but for reasons that seem orthogonal to this question - namely, because they think that the homework to be done isn't particularly d...
Quick reaction:
I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).
What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?
(My current take is that for a majority reader, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)
I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).
That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn't as much to figure out whether the particular arguments go through or not, but to ask which properties must your model have, if you want to be able to evaluate those arguments rigorously.
A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress.
Right, I agree. I didn't realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like "[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However...
[% of loss explained] isn't a good interpretability metric [edit: isn't enough to get guarantees].
In interpretability, people use [% of loss explained] as a measure of the quality of an explanation. However, unless you replace the system-being-explained by its explanation, this measure has a fatal flaw.
Suppose you have misaligned superintelligence X pretending to be a helpful assistant A --- that is, acting as A in all situations except those where it could take over the world. Then the explanation "X is behaving as A" will explain 100% of loss, but actual...
I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the "distribute AI(x-1) quickly" part. IE, if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the "single point of failure" effect, though it seems unclear how large.)
To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.
But that complaint aside: sure, all else being equal, all of the points you mention seem better having than not having.
Might be obvious, but perhaps seems worth noting anyway: Ensuring that our boundaries are respected is, at least with a straightforward understanding of "boundaries", not sufficient for being safe.
For example:
An aspect that I would not take into account is the expected impact of your children.
Most importantly, it just seems wrong to make personal-happiness decisions subservient to impact.
But even if you did want to optimise impact through others, then betting on your children seems riskier and less effective than, for example, engaging with interested students. (And even if you wanted to optimise impact at all costs, then the key factors might not be your impact through others. But instead (i) your opportunity costs, (ii) second order effects, where having kids...
In fact it's hard to find probable worlds where having kids is a really bad idea, IMO.
One scenario where you might want to have kids in general, but not if timelines are short, is if you feel positive about having kids, but you view the first few years of having kids as a chore (ie, it costs you time, sleep, and money). So if you view kids as an investment of the form "take a hit to your happiness now, get more happiness back later", then not having kids now seems justifiable. But I think that this sort of reasoning requires pretty short timelines (which I...
(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
My reaction to this is that: Actually, current LLMs do care about our preferences...
Nitpicky comment / edit request: The circle inversion figure was quite confusing to me. Perhaps add a note to it saying that solid green maps onto solid blue, red maps onto itself, and dotted green maps onto dotted blue. (Rather than colours mapping to each other, which is what I intuitively expected.)
Fun example: The evolution of offensive words seems relevant here. IE, we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgangers, and mark them as offensive as well.
Related questions:
Two points that seem relevant here:
(At the moment, both of these questions seem open.)
This made me think of "lawyer-speak", and other jargons.
More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)
I would distinguish between "feeling alien" (as in, most of the time, the system doesn't feel too weird or non-human to interact with, at least if you don't look too closely) and "being alien" (a in, "having the potential to sometimes behave in a way that a human never would").
My argument is that the current LLMs might not feel alien (at least to some people), but they definitely are. For example, any human that is smart enough to write a good essay will also be able to count the number of words in a sentence --- yet LLMs can do one, but not the othe...
I agree that we shouldn't be deliberately making LLMs more alien in ways that have nothing to do with how alien they actually are/can be. That said, I feel that some of the examples I gave are not that far from how LLMs / future AIs might sometimes behave? (Though I concede that the examples could be improved a lot on this axis, and your suggestions are good. In particular, the GPT-4 finetuned to misinterpret things is too artificial. And with intentional non-robustness, it is more honest to just focus on naturally-occurring failures.)
To elaborate: My view...
I would like to point out one aspects of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect on adversarial vulnerability on widespread-automation worlds.
Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).
But ultimately, I think none of these fits perfectly. So a longer, self-contained description is somethi...
An idea for increasing the impact of this research: Mitigating the "goalpost moving" effect for "but surely a bit more progress on capabilities will solve this".
I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds --- but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough "coordination point" that would allow us to take ne...
Ok, got it. Though, not sure if I have a good answer. With trans issues, I don't know how to decouple the "concepts and terminology" part of the problem from the "political" issues. So perhaps the solution with AI terminology is to establish the precise terminology? And perhaps to establish it before this becomes an issue where some actors benefit from ambiguity (and will therefore resist disambiguation)? [I don't know, low confidence on all of this.]
- If I could assume things like "they are much better at reading my inner monologue than my non-verbal thoughts", then I could create code words for prohibited things.
- I could think in words they don't know.
- I could think in complicated concepts they haven't understood yet. Or references to events, or my memories, that they don't know.
- I could leave a part of my plans implicit, and only figure out the details later.
- I could harm them through some action for which they won't understand that it is harmful, so they might not be alarmed even if they catch me thinkin
... (read more)