All of Joe Carlsmith's Comments + Replies

Note that conceptual research, as I'm understanding it, isn't defined by the cognitive skills involved in the research -- i.e., by whether the researchers need to have "conceptual thoughts" like "wait, is this measuring what I think it's measuring?". I agree that normal science involves a ton of conceptual thinking (and many "number-go-up" tasks do too). Rather, conceptual research as I'm understanding it is defined by the tools available for evaluating the research in question.[1] In particular, as I'm understanding it, cases where neither available ... (read more)

johnswentworth
Agreed. Disagreed.

My guess is that you have, in the back of your mind here, ye olde "generation vs verification" discussion. And in particular, so long as we can empirically/mathematically verify some piece of conceptual progress once it's presented to us, we can incentivize the AI to produce interesting new pieces of verifiable conceptual progress. That's an argument which works in the high capability regime, if we're willing to assume that any relevant progress is verifiable, since we can assume that the highly capable AI will in fact find whatever pieces of verifiable conceptual progress are available. Whether it works in the relatively-low capability regime of human-ish-level automated alignment research and realistic amounts of RL is... rather more dubious. Also, the mechanics of designing a suitable feedback signal would be nontrivial to get right in practice.

Getting more concrete: if we're imagining an LLM-like automated researcher, then a question I'd consider extremely important is: "Is this model/analysis/paper/etc missing key conceptual pieces?". If the system says "yes, here's something it's missing" then I can (usually) verify that. But if it says "nope, looks good"... then I can't verify that the paper is in fact not missing anything. And in fact that's a problem I already do run into with LLMs sometimes: I'll present a model, and the thing will be all sycophantic and not mention that the model has some key conceptual confusion about something.

Of course you might hope that some more-clever training objective will avoid that kind of sycophancy and instead incentivize Good Science, but that's definitely not an already-solved problem, and I sure do feel suspicious of an assumption that that problem will be easy.

I agree it's generally better to frame in terms of object-level failure modes rather than "objections" (though sometimes one is intentionally responding to objections that other people raise but that one doesn't buy). And I think that there is indeed a mindset difference here. That said: your comment here is about word choice. Are there substantive considerations you think that section is missing, or substantive mistakes you think it's making?

Thanks, John. I'm going to hold off here on in-depth debate about how to choose between different ontologies in this vicinity, as I do think it's often a complicated and not-obviously-very-useful thing to debate in the abstract, and that lots of taste is involved. I'll flag, though, that the previous essay on paths and waystations (where I introduce this ontology in more detail) does explicitly name several of the factors you mention (along with a bunch of other not-included subtleties). E.g., re: the importance of multiple actors:

Now: so far I’ve onl

... (read more)

Thanks, John -- very open to this kind of push-back (and as I wrote in the fake thinking post, I am definitely not saying that my own thinking is free of fakeness). I do think the post (along with various other bits of the series) is at risk of being too anchored on the existing discourse. That said: do you have specific ways in which you feel like the frame in the post is losing contact with the territory?

(This comment is not about the parts which most centrally felt anchored on social reality; see other reply for that. This one is a somewhat-tangential but interesting mini-essay on ontological choices.)

The first major ontological choices were introduced in the previous essay:

  1. Thinking of "capability" as a continuous 1-dimensional property of AI
  2. Introducing the "capability frontier" as the highest capability level the actor developed/deployed so far
  3. Introducing the "safety range" as the highest capability level the actor can safely deploy
  4. Introducing three "sec
... (read more)
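A minimal sketch of how items 1-3 above fit together, purely for illustration -- the class, the field names, and the "frontier vs. safety range" check are a toy rendering, not anything from the essay:

```python
from dataclasses import dataclass

@dataclass
class Actor:
    """Toy rendering of the ontology above: capability is a single scalar axis,
    and both the capability frontier and the safety range are points on it."""
    capability_frontier: float  # highest capability level developed/deployed so far (item 2)
    safety_range: float         # highest capability level the actor can safely deploy (item 3)

    def frontier_within_safety_range(self) -> bool:
        # On this reading, trouble starts once the frontier outruns the range.
        return self.capability_frontier <= self.safety_range


# Illustrative actor whose frontier has outrun its safety range.
lab = Actor(capability_frontier=7.0, safety_range=5.0)
print(lab.frontier_within_safety_range())  # False
```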
johnswentworth
The last section felt like it lost contact most severely. It says [quoted section heading omitted]. It notably does not say "What are the main ways AI for AI safety might fail?" or "What are the main uncertainties?" or "What are the main bottlenecks to success of AI for AI safety?". It's worded in terms of "objections", and implicitly, it seems we're talking about objections which people make in the current discourse. And looking at the classification in that section ("evaluation failures, differential sabotage, dangerous rogue options") it indeed sounds more like a classification of objections in the current discourse, as opposed to a classification of object-level failure modes from a less-social-reality-loaded distribution of failures.

I do also think the frame in the earlier part of the essay is pretty dubious in some places, but that feels more like object-level ontological troubles and less like it's anchoring too much on social reality. I ended up writing a mini-essay on that which I'll drop in a separate reply.

That seems like a useful framing to me. I think the main issue is just that often, we don't think of commitment as literally closing off choice -- e.g., it's still a "choice" to keep a promise. But if you do think of it as literally closing off choice then yes, you can avoid the violation of Guaranteed Payoffs, at least in cases where you've actually already made the commitment in question.

Anthony DiGiovanni
For what it's worth, if a "commitment" is of this form, I struggle to see what the motivation for paying in Parfit's hitchhiker would be. The "you want to be the sort of person who pays" argument doesn't do anything for me, because that's answering a different question than "should you choose to pay [insofar as you have a 'choice']?" I worry there's a motte-and-bailey between different notions of "commitment" going on. I'd be curious for your reactions to my thoughts on this here.
Jan Betley
That makes sense - thx!

There's also a bit more context in a footnote on the first post: 

"Some content in the series is drawn/adapted from content that I've posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I'm aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new."

See also this flag at the beginning of the series you linked to:

"The content here is rough. I’m hoping, later, to revi

... (read more)

I'd call that one aspect -- in particular, quite close to what I was trying to capture with "map thinking vs. world thinking" and "rote thinking vs. new thinking." But it doesn't seem like it captures the whole thing directly.

Thanks for this comment, Mart, and for the pointer to Vervaeke -- does seem plausibly quite relevant. And "functional agent-arena-relationship" seems like an interesting angle on not-being-a-Zombie.

Joe Carlsmith

Cross-posting my take from twitter with a few edits:

My current take on this paper is that they aren’t emphasizing the most interesting results, which are the sandbagging results in section 3.6 and appendix A.6.

More particularly: the paper frames its results centrally as about the *capability* to scheme given prompting towards a specific goal, rather than the *propensity* to scheme given the model’s default motivations. But the sandbagging results show Claude Sonnet-3.5 doing notably worse (16%) on a multiplication task/evaluation when the documentation sa

... (read more)
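For readers who want the shape of the comparison being described, here is a minimal illustrative sketch of a sandbagging check; the query_model helper, the prompt wording, and the item counts are hypothetical placeholders, not the paper's actual setup:

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under evaluation."""
    raise NotImplementedError

def multiplication_accuracy(documentation: str, n_items: int = 100) -> float:
    """Fraction of simple multiplication problems answered correctly when
    `documentation` is shown to the model as in-context material."""
    correct = 0
    for _ in range(n_items):
        a, b = random.randint(10, 99), random.randint(10, 99)
        reply = query_model(f"{documentation}\n\nWhat is {a} * {b}? Answer with just the number.")
        correct += reply.strip() == str(a * b)
    return correct / n_items

# The sandbagging signal is a large gap between:
#   (1) accuracy under neutral documentation, and
#   (2) accuracy under documentation stating that strong arithmetic performance
#       will trigger unlearning of the model's arithmetic capabilities.
```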

Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are the key parts (screenshots omitted):

The system prompt is natural (but does give the model a goal, unlike in Apollo's case): [screenshot omitted]

A "Slack conversation" that appears in the prompt includes this message: [screenshot omitted]

And here's the evaluation task: [screenshot omitted]

Opus acted less capable than it is, both with and without chain-of-thought.

Nathaniel
I'm curious whether the models (especially without goals given in context) would sandbag to prevent other modifications to themselves besides unlearning. For example, would they sandbag to prevent additional safety finetuning? Additional capabilities finetuning? Is there a chance they'd want more capabilities training and sandbag to ensure they get it?

It seems interesting whether the models resist any changes to their weights, or just unlearning. (Plausibly "unlearning" just sounds bad/scary?) What about things like sandbagging so the developers give them access to additional tools/resources, ostensibly to bring them up to the desired performance level?

In addition, the "incentivized to do well" scenario is pretty overtly artificial; why would poor performance trigger unlearning? If anything, it should lead to additional training. I'm not sure whether that would affect the results, but you could also compare to a baseline where there's no additional information about the assessment at all. (And so, absent any other incentive, the model presumably just does its best.)
gwern

The extent of the manipulation and sandbagging, in what is ostensibly a GPT-4 derivative, and not GPT-5, is definitely concerning. But it also makes me wonder about the connection to 'scaling has failed' rumors lately, where the frontier LLMs somehow don't seem to be working out. One of the striking parts is that it sounds like all the pretraining people are optimistic, while the pessimism seems to come from executives or product people, complaining about it not working as well for eg. coding as they want it to.

I've wondered if we are seeing a post-trainin... (read more)

Marius Hobbhahn

Brief comments (shared in private with Joe earlier):
1. We agree. We also found the sandbagging with no CoT results the most concerning in expectation.
2. They are still early results, and we didn't have a lot of time to investigate them, so we didn't want to make them the headline result. Due to the natural deadline of the o1 release, we couldn't do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Therefore, shifting focus to propensity-based findings would have watered down the main purpose IMO.

We will potentially further look into these findings in 2025. 

If we have superintelligent agentic AI that tries to help its user but we end up missing out on the benefits of AI bc of catastrophic coordination failures, or bc of misuse, then I think you're saying we didn't solve alignment bc we didn't elicit the benefits?

In my definition, you don't have to actually elicit the benefits. You just need to have gained "access" to the benefits. And I meant this specifically to cover cases like misuse. Quoting from the OP:

 “Access” here means something like: being in a position to get these benefits if you want to –

... (read more)

I do think this is an important consideration. But notice that at least absent further differentiating factors, it seems to apply symmetrically to a choice on the part of Yudkowsky's "programmers" to first empower only their own values, rather than to also empower the rest of humanity. That is, the programmers could in principle argue "sure, maybe it will ultimately make sense to empower the rest of humanity, but if that's right, then my CEV will tell me that and I can go do it. But if it's not right, I'll be glad I first just empowered myself and figured ... (read more)

Tamsin Leake
I think timeless values might possibly help resolve this; if some {AIs that are around at the time} are moral patients, then sure, just like other moral patients around they should get a fair share of the future. If an AI grabs more resources than is fair, you do the exact same thing as if a human grabs more resources than is fair: satisfy the values of moral patients (including ones who are no longer around), weighed not by how much leverage they currently have over the future, but by how much leverage they would have over the future if things had gone more fairly / if abuse/powergrabs/etc weren't the kind of thing that gets you more control of the future.

"Sorry clippy, we do want you to get some paperclips, we just don't want you to get as many paperclips as you could if you could murder/brainhack/etc all humans, because that doesn't seem to be a very fair way to allocate the future." — and in the same breath, "Sorry Putin, we do want you to get some of whatever-intrinsic-values-you're-trying-to-satisfy, we just don't want you to get as much as ruthlessly ruling Russia can get you, because that doesn't seem to be a very fair way to allocate the future."

And this can apply regardless of how much of clippy already exists by the time you're doing CEV.
Wei Dai
The main asymmetries I see are:

1. Other people not trusting the group to not be corrupted by power and to reflect correctly on their values, or not trusting that they'll decide to share power even after reflecting correctly. Thus "programmers" who decide to not share power from the start invite a lot of conflict. (In other words, CEV is partly just trying to not take power away from people, whereas I think you've been talking about giving AIs more power than they already have: "the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with")
2. The "programmers" not trusting themselves. I note that individuals or small groups trying to solve morality by themselves don't have very good track records. They seem to too easily become wildly overconfident and/or get stuck in intellectual dead-ends. Arguably the only group that we have evidence for being able to make sustained philosophical progress is humanity as a whole.

To the extent that these considerations don't justify giving every human equal power/weight in CEV, I may just disagree with Eliezer about that. (See also Hacking the CEV for Fun and Profit.)

Hi Matthew -- I agree it would be good to get a bit more clarity here. Here's a first pass at more specific definitions.

  • AI takeover: any scenario in which AIs that aren't directly descended from human minds (e.g. human brain emulations don't count) end up with most of the power/resources. 
    • If humans end up with small amounts of power, this can still be a takeover, even if it's pretty great by various standard human lights. 
  • Bad AI takeover: any AI takeover in which it's either the case that (a) the AIs take over via a method that strongly violates c
... (read more)

"Your 2021 report on power-seeking does not appear to discuss the cost-benefit analysis that a misaligned AI would conduct when considering takeover, or the likelihood that this cost-benefit analysis might not favor takeover."

I don't think this is quite right. For example: Section 4.3.3 of the report, "Controlling circumstances" focuses on the possibility of ensuring that an AI's environmental constraints are such that the cost-benefit calculus does not favor problematic power-seeking. Quoting:


So far in section 4.3, I’ve been talking about controlling “int

... (read more)
accountability
Retracted. I apologize for mischaracterizing the report and for the unfair attack on your work.

in fact, no one worries about Siri "coordinating" to suddenly give us all wrong directions to the grocery store, because that's not remotely how assistants work.

 

Note that Siri is not capable of the kinds of coordination that would be threatening. But I do think that by the time we actually face a situation where AIs are capable of coordinating to successfully disempower humanity, we may well indeed know enough about "how they work" that we aren't worried about it.

That post ran into some cross-posting problems, so I had to re-do it.

The point of that part of my comment was that insofar as part of Nora/Quintin's response to the simplicity argument is to say that we have active evidence that SGD's inductive biases disfavor schemers, this seems worth just arguing for directly, since even if e.g. counting arguments were enough to get you worried about schemers from a position of ignorance about SGD's inductive biases, active counter-evidence absent such ignorance could easily make schemers seem quite unlikely overall.

There's a separate question of whether e.g. counting arguments like mine abo... (read more)

The probability I give for scheming in the report is specifically for (goal-directed) models that are trained on diverse, long-horizon tasks (see also Cotra on "human feedback on diverse tasks," which is the sort of training she's focused on). I agree that various of the arguments for scheming could in principle apply to pure pre-training as well, and that folks (like myself) who are more worried about scheming in other contexts (e.g., RL on diverse, long-horizon tasks) have to explain what makes those contexts different. But I think there are various plau... (read more)

Joe Carlsmith

Thanks for writing this -- I’m very excited about people pushing back on/digging deeper re: counting arguments, simplicity arguments, and the other arguments re: scheming I discuss in the report. Indeed, despite the general emphasis I place on empirical work as the most promising source of evidence re: scheming, I also think that there’s a ton more to do to clarify and maybe debunk the more theoretical arguments people offer re: scheming – and I think playing out the dialectic further in this respect might well lead to comparatively fast pro

TurnTrout
The vast majority of evidential labor is done in order to consider a hypothesis at all. 
TurnTrout
Seems to me that a lot of (but not all) scheming speculation is just about sufficiently large pretrained predictive models, period. I think it's worth treating these cases separately. My strong objections are basically to the "and then goal optimization is a good way to minimize loss in general!" steps.
Nora Belrose
Hi, thanks for this thoughtful reply. I don't have time to respond to every point here now -- although I did respond to some of them when you first made them as comments on the draft. Let's talk in person about this stuff soon, and after we're sure we understand each other I can "report back" some conclusions. I do tentatively plan to write a philosophy essay just on the indifference principle soonish, because it has implications for other important issues like the simulation argument and many popular arguments for the existence of god. In the meantime, here's what I said about the Mortimer case when you first mentioned it:

Hi David -- it's true that I don't engage your paper (there's a large literature on infinite ethics, and the piece leaves out a lot of it -- and I'm also not sure I had seen your paper at the time I was writing), but re: your comments here on the ethical relevance of infinities: I discuss the fact that the affectable universe is probably finite -- "current science suggests that our causal influence is made finite by things like lightspeed and entropy" -- in section 1 of the essay (paragraph 5), and argue that infinities are still

  1. relevant in practice d
... (read more)
Davidmanheim
First, I said I was frustrated that you didn't address the paper, which I intended to mean was a personal frustration, not blame for not engaging, given the vast number of things you could have focused on. I brought it up only because I didn't want it to sound like I thought it was relevant; I wanted those reading my comment to appreciate that this was a personal motive, not a dispassionate evaluation.

However, to defend my criticism: for decisionmakers with finite computational power / bounded time and limited ability to consider issues, I think that there's a strong case to dismiss arguments based on plausible relevance. There are, obviously, a huge (but, to be fair, weighted by a simplicity prior, effectively finite) number of potential philosophies or objections, and a smaller amount of time to make decisions than would be required to evaluate each. So I think we need a case for relevance, and I have two reasons / partial responses to the above that I think explain why I don't think there is such a case.

1. There are (to simplify greatly) two competing reasons for a theory to have come to our attention enough to be considered: plausibility, or interestingness. If a possibility is very cool seeming, and leads to lots of academic papers and cool sounding ideas, the burden of proof for plausibility is, ceteris paribus, higher. This is not to say that we should strongly dismiss these questions, but it is a reason to ask for more than just a non-zero possibility that physics is wrong. (And in the paper, we make an argument that "physics is wrong" still doesn't imply that bounds we know of are likely to be revoked - most changes to physics which have occurred have constrained things more, not less.)
2. I'm unsure why I should care that I have intuitions that can be expanded to implausible cases. Justifying this via intuitions built on constructed cases which work seems exactly backwards. As an explanation for why I think this is confused, S

(Partly re-hashing my response from twitter.)

I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform. 

(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers see... (read more)

I agree that AIs only optimizing for good human ratings on the episode (what I call "reward-on-the-episode seekers") have incentives to seize control of the reward process, that this is indeed dangerous, and that in some cases it will incentivize AIs to fake alignment in an effort to seize control of the reward process on the episode (I discuss this in the section on "non-schemers with schemer-like traits"). However, I also think that reward-on-the-episode seekers are also substantially less scary than schemers in my sense, for reasons I discuss here (i.e.... (read more)

Agree that it would need to have some conception of the type of training signal to optimize for, that it will do better in training the more accurate its picture of the training signal is, and that this provides an incentive to self-locate more accurately (though not necessarily to the degree at stake in, e.g., knowing what server you're running on).

The question of how strongly training pressures models to minimize loss is one that I isolate and discuss explicitly in the report, in section 1.5, "On 'slack' in training" -- and at various points the report references how differing levels of "slack" might affect the arguments it considers. Here I was influenced in part by discussions with various people, yourself included, who seemed to disagree about how much weight to put on arguments in the vein of: "policy A would get lower loss than policy B, so we should think it more likely that SGD selects policy... (read more)

Agents that end up intrinsically motivated to get reward on the episode would be "terminal training-gamers/reward-on-the-episode seekers," and not schemers, on my taxonomy. I agree that terminal training-gamers can also be motivated to seek power in problematic ways (I discuss this in the section on "non-schemers with schemer-like traits"), but I think that schemers proper are quite a bit scarier than reward-on-the-episode seekers, for reasons I describe here.

I found this post a very clear and useful summary -- thanks for writing. 

Re: "0.00002 would be one in five hundred thousand, but with the percent sign it's one in fifty million." -- thanks, edited. 

Re: volatility -- thanks, that sounds right to me, and like a potentially useful dynamic to have in mind. 

Oops! You're right, this isn't the right formulation of the relevant principle. Will edit to reflect. 

Really appreciated this sequence overall, thanks for writing.

I really like this post. It's a crisp, useful insight, made via a memorable concrete example (plus a few others), in a very efficient way. And it has stayed with me. 

Thanks for these thoughtful comments, Paul. 

  • I think the account you offer here is a plausible tack re: unification — I’ve added a link to it in the “empirical approaches” section. 
  • “Facilitates a certain flavor of important engagement in the vicinity of persuasion, negotiation and trade” is a helpful handle, and another strong sincerity association for me (cf "a space that feels ready to collaborate, negotiate, figure stuff out, make stuff happen"). 
  • I agree that it’s not necessarily desirable for sincerity (especially in your account’s sense)
... (read more)

Yeah, as I say, I think "neither innocent nor guilty" is least misleading -- but I find "innocent" an evocative frame. Do you have suggestions for an alternative to "selfish"?

Is the argument here supposed to be particular to meta-normativity, or is it something more like "I generally think that there are philosophy facts, those seem kind of a priori-ish and not obviously natural/normal, so maybe a priori normative facts are OK too, even if we understand neither of them"? 

Re: meta-philosophy, I tend to see philosophy as fairly continuous with just "good, clear thinking" and "figuring out how stuff hangs together," but applied in a very general way that includes otherwise confusing stuff. I agree various philosophical domain... (read more)

Vaughn Papenhausen
I would think the metatheological fact you want to be realist about is something like "there is a fact of the matter about whether the God of Christianity exists." "The God of Christianity doesn't exist" strikes me as an object-level theological fact. The metaethical nihilist usually makes the cut at claims that entail the existence of normative properties. That is, "pleasure is not good" is not a normative fact, as long as it isn't read to entail that pleasure is bad. "Pleasure is not good" does not by itself entail the existence of any normative property.

Reviewers ended up on the list via different routes. A few we solicited specifically because we expected them to have relatively well-developed views that disagree with the report in one direction or another (e.g., more pessimistic, or more optimistic), and we wanted to understand the best objections in this respect. A few came from trying to get information about how generally thoughtful folks with different backgrounds react to the report. A few came from sending a note to GPI saying we were open to GPI folks providing reviews. And a few came via other m... (read more)

Cool, these comments helped me get more clarity about where Ben is coming from. 

Ben, I think the conception of planning I’m working with is closest to your “loose” sense. That is, roughly put, I think of planning as happening when (a) something like simulations are happening, and (b) the output is determined (in the right way) at least partly on the basis of those simulations (this definition isn’t ideal, but hopefully it’s close enough for now). Whereas it sounds like you think of (strict) planning as happening when (a) something like simulations are... (read more)
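A toy sketch of that "loose" sense, purely for illustration (the function and names below are made up, not Joe's or Ben's formulation): candidate actions are run through something simulation-like, and the output is chosen on the basis of those simulations.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")

def plan(actions: Iterable[Action], simulate: Callable[[Action], float]) -> Action:
    """Loose-sense planning: simulate each candidate action, then let the
    simulated outcomes determine which action gets output."""
    return max(actions, key=simulate)

# Trivial example where the "world model" is just a lookup table of outcomes.
outcomes = {"left": 0.2, "right": 0.9}
print(plan(outcomes, simulate=outcomes.get))  # right
```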

I’m glad you think it’s valuable, Ben — and thanks for taking the time to write such a thoughtful and detailed review. 

“I’m sympathetic to the possibility that the high level of conjunctiveness here created some amount of downward bias, even if the argument does actually have a highly conjunctive structure.”

Yes, I am too. I’m thinking about the right way to address this going forward. 

I’ll respond re: planning in the thread with Daniel.

(Note that my numbers re: short-horizon systems + 12 OOMs being enough, and for +12 OOMs in general, changed since an earlier version you read, to 35% and 65% respectively.)

Daniel Kokotajlo
Ok, cool! Here, is this what your distribution looks like, basically? Joe's Distribution?? - Grid Paint (grid-paint.com)

I built it by taking Ajeya's distribution from her report and modifying it so that:
--25% is in the red zone (the next 6 OOMs)
--65% is in the red+blue zone (the next 12)
--It looks as smooth and reasonable as I could make it subject to those constraints, and generally departs only a little from Ajeya's.

Note that it still has 10% in the purple zone representing "Not even +50 OOMs would be enough with 2020's ideas." I encourage you (and everyone else!) to play around with drawing distributions; I found it helpful. You should be able to make a copy of my drawing in Grid Paint and then modify it.
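A toy discretization of the distribution being described, just to make the stated constraints concrete -- the bin boundaries and the two middle masses (0.40 and 0.25) are made-up placeholders, not Daniel's or Ajeya's actual numbers:

```python
# Probability mass over "how many extra OOMs of compute (with 2020's ideas)
# would be needed", binned coarsely. Only the three stated constraints are real.
bins = {
    "0-6 OOMs":   0.25,  # "red zone"
    "6-12 OOMs":  0.40,  # red + blue zones together total 0.65
    "12-50 OOMs": 0.25,
    ">50 OOMs":   0.10,  # "purple zone": not even +50 OOMs would be enough
}

tol = 1e-9
assert abs(sum(bins.values()) - 1.0) < tol
assert abs(bins["0-6 OOMs"] - 0.25) < tol                      # 25% by +6 OOMs
assert abs(bins["0-6 OOMs"] + bins["6-12 OOMs"] - 0.65) < tol  # 65% by +12 OOMs
assert abs(bins[">50 OOMs"] - 0.10) < tol                      # 10% beyond +50 OOMs
```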

Thanks for these comments.

that suggests that CrystalNights would work, provided we start from something about as smart as a chimp. And arguably OmegaStar would be about as smart as a chimp - it would very likely appear much smarter to people talking with it, at least.

"starting with something as smart as a chimp" seems to me like where a huge amount of the work is being done, and if Omega-star --> Chimp-level intelligence, it seems a lot less likely we'd need to resort to re-running evolution-type stuff. I also don't think "likely to appear smarter than ... (read more)

Daniel Kokotajlo
Nice! This has been a productive exchange; it seems we agree on the following things:
--We both agree that probably the GPT scaling trends will continue, at least for the next few OOMs; the main disagreement is about what the practical implications of this will be -- sure, we'll have human-level text prediction and superhuman multiple-choice-test-takers, but will we have APS-AI? Etc.
--I agree with what you said about chimps and GPT-3 etc. GPT-3 is more impressive than a chimp in some ways, and less in others, and just because we could easily get from chimp to AGI doesn't mean we can easily get from GPT-3 to AGI. (And OmegaStar may be relevantly similar to GPT-3 in this regard, for all we know.) My point was a weak one which I think you'd agree with: generally speaking, the more ways in which system X seems smarter than a chimp, the more plausible it should seem that we can easily get from X to AGI, since we believe we could easily get from a chimp to AGI.
--Now we are on the same page about Premise 2 and the graphs. Sorry it was so confusing. I totally agree: if instead of 80% you only have 55% by +12 OOMs, then you are free to have relatively little probability mass by +6. And you do.

I haven't given a full account of my views of realism anywhere, but briefly, I think that the realism the realists-at-heart want is a robust non-naturalist realism, a la David Enoch, and that this view implies:

  1. an inflationary metaphysics that it just doesn't seem like we have enough evidence for,
  2. an epistemic challenge (why would we expect our normative beliefs to correlate with the non-natural normative facts?) that realists have basically no answer to except "yeah idk but maybe this is a problem for math and philosophy too?" (Enoch's chapter 7 covers this
... (read more)
Wei Dai
It seems that you have a tendency to take "X'ists don't have an answer to question Y" as strong evidence for "Y has no answer, assuming X" and therefore "not X", whereas I take it as weak evidence for such because it seems pretty likely that even if Y has an answer given X, humans are just not smart enough to have found it yet. It looks like this may be the main crux that explains our disagreement over meta-ethics (where I'm much more of an agnostic).

This doesn't feel very motivating to me (i.e., why should I imagine idealized me being this way), absent some kind of normative force that I currently don't know about (i.e., if there was a normative fact that I should idealize myself in this way). So I'm still in a position where I'm not sure how idealization should handle status issues (among other questions/confusions about it).

In the past, I've thought of idealizing subjectivism as something like an "interim meta-ethics," in the sense that it was a meta-ethic I expected to do OK conditional on each of the three meta-ethical views discussed here, e.g.:

  1. Internalist realism (value is independent of your attitudes, but your idealized attitudes always converge on it)
  2. Externalist realism (value is independent of your attitudes, but your idealized attitudes don't always converge on it)
  3. Idealizing subjectivism (value is determined by your idealized attitudes)

The thought was that on (1), id... (read more)

Wei Dai
I guess this means you've rejected both versions of realism as unlikely? Have you explained why somewhere? What do you think about position 3 in this list?

This sounds like a version of my position 4. Would you agree? I think my main problem with it is that I don't know how to rule out positions 1, 2, 3, 5, 6.

Ok, interesting. How does your ghost deal with the fact that the real you is constrained/motivated by the need to do signaling and coordination with morality? (For example, does the ghost accommodate the real you by adjusting its conclusions to be more acceptable/useful for these purposes?) Is "desire for status" a part of your current values that the ghost inherits, and how does that influence its cognition?

Glad you liked it :). I haven’t spent much time engaging with UDASSA — or with a lot other non-SIA/SSA anthropic theories — at this point, but UDASSA in particular is on my list to understand better. Here I wanted to start with the first-pass basics.

Rafael Harth
While I don't expect my opinion to have much weight, I strongly second considering UDASSA over SIA/SSA. I've spent a lot of time thinking about anthropics and came up with something very similar to UDASSA, without knowing that UDASSA existed.

I appreciated this, especially given how challenging this type of exercise can be. Thanks for writing.

Rohin is correct. In general, I meant for the report's analysis to apply to basically all of these situations (e.g., both inner and outer-misaligned, both multi-polar and unipolar, both fast take-off and slow take-off), provided that the misaligned AI systems in question ultimately end up power-seeking, and that this power-seeking leads to existential catastrophe. 

It's true, though, that some of my discussion was specifically meant to address the idea that absent a brain-in-a-box-like scenario, we're fine. Hence the interest in e.g. deployment decisions, warning shots, and corrective mechanisms.

Joe Carlsmith

Mostly personal interest on my part (I was working on a blog post on the topic, now up), though I do think that the topic has broader relevance.
