All of Dakara's Comments + Replies

Fair enough, but is the broader trend of "Models won't take unethical actions unless they're the only options" still holding?

That was my takeaway from the experiments done in the aftermath of the alignment faking paper, so it's good to see that it's still holding.

I think the general population doesn't know all that much about singularity, so adding that to the part would just unnecessarily dilute it.

I have read the entire piece and it didn't feel like AI slop at all. In fact, if I hadn't been told, I wouldn't have suspected that AI was involved here, so well done!

A lot of splits happen because some employees think that the company is headed in the wrong direction (lackluster safety would be one example).

Test successfully worked :)

He probably doesn't have much influence on the public opinion of LessWrong, but as a person in charge of a major AI company, he is obviously a big player.

It looks to me like a promising approach. Great results!

I've noticed that whenever the debate touches on a very personal topic, it tends to be heated and pretty unpleasant to listen to. In contrast, debates about things that are low-stakes for the people who are debating tend to be much more productive, sometimes even involving steelmanning.

That's certainly an interesting result. Have you tried running the same prompt again and seeing if the response changes? I've noticed that some LLMs give different answers to the same prompt. For example, when I quizzed DeepSeek R1 on whether a priori knowledge exists, it answered in the affirmative the first time and in the negative the second time.

If alignment by default is not the majority opinion, then what is (pardon my ignorance as someone who mostly interacts with the alignment community via LessWrong)? Is it 1) that we are all ~doomed, or 2) that alignment is hard but we have a decent shot at solving it, or 3) something else entirely?

I get the feeling that people were a lot more pessimistic about our chances of survival in 2023 than in 2024 or 2025 (in other words, pessimism seems to be going down somewhat), but I could be completely wrong about this.

Thanks for the reply!

The only general remarks that I want to make are in regards to your question about the model of 150 year long vaccine testing on/over some sort of sample group and control group. I notice that there is nothing exponential assumed about this test object, and so therefore, at most, the effects are probably multiplicative, if not linear. Therefore, there are lots of questions about power dynamics that we can overall safely ignore, as a simplification, which is in marked contrast to anything involving ASI. If we assume, as you requested, "no side ef... (read more)

Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?

Yes, but none of this would remain alive if I, as a whole, decided to jump from a cliff. The multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn't self-terminate with the rest of the ASI, whereas in a human body, jumping from a cliff terminates everything.

But even barring that, ASI can decide to fly into the Sun and any subsystem that shows any sign of refusal to do so will be immediately rep... (read more)

2flandry39
The only general remarks that I want to make are in regards to your question about the model of 150 year long vaccine testing on/over some sort of sample group and control group. I notice that there is nothing exponential assumed about this test object, and so therefore, at most, the effects are probably multiplicative, if not linear. Therefore, there are lots of questions about power dynamics that we can overall safely ignore, as a simplification, which is in marked contrast to anything involving ASI. If we assume, as you requested, "no side effects" observed, in any test group, for any of those things that we happened to be thinking of, to even look for, then for any linear system, that is probably "good enough". But for something that is known for sure to be exponential, that by itself is not anywhere enough to feel safe. But what does this really mean? Since the common and prevailing (world) business culture is all about maximal profit, and therefore minimal cost, and also to minimize any possible future responsibility (or cost) in case anything with the vax goes badly/wrong, then for anything that might be in the possible category of unknown unknown risk, I would expect that company to want to maintain some sort of plausible deniability -- i.e., to not look so hard for never-before-seen effects. Or to otherwise ignore that they exist, or matter, etc. (just like throughout a lot of ASI risk dialogue). If there is some long future problem that crops up, the company can say "we never looked for that" and "we are not responsible for the unexpected", because the people who made the deployment choices have taken their profits and their pleasure in life, and are now long dead. "Not my Job". "Don't blame us for the sins of our forefathers". Similarly, no one is going to ever admit or concede any point, of any argument, on pain of ego death. No one will check if it is an exponential system. So of course, no one is going to want to look into any sort of i

Thanks for the response!

So we are to try to imagine a complex learning machine without any parts/components?

Yeah, sure. Humans are an example. If I decide to jump off a cliff, my arm isn't going to say "alright, you jump but I stay here". Either I, as a whole, would jump or I, as a whole, would not.

Can the ASI prevent the relevant classes of significant (critical) organic human harm, that soon occur as a direct_result of its own hyper powerful/consequential existence?

If by that, you mean "can ASI prevent some relevant classes of harm caused by its existence"... (read more)

2flandry39
> Humans do things in a monolithic way, > not as "assemblies of discrete parts". Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed? > If you are asking "can a powerful ASI prevent  > /all/ relevant classes of harm (to the organic) > caused by its inherently artificial existence?",  > then I agree that the answer is probably "no". > But then almost nothing can perfectly do that,  > so therefore your question becomes  > seemingly trivial and uninteresting. The level of x-risk harm and consequence  potentially caused by even one single mistake of your angelic super-powerful enabled ASI is far from "trivial" and "uninteresting". Even one single bad relevant mistake  can be an x-risk when ultimate powers  and ultimate consequences are involved. Either your ASI is actually powerful, or it is not; either way, be consistent. Unfortunately the 'Argument by angel'  only confuses the matter insofar as  we do not know what angels are made of. "Angels" are presumably not machines, but they are hardly animals either. But arguing that this "doesn't matter" is a bit like arguing that 'type theory' is not important to computer science. The substrate aspect is actually important. You cannot simply just disregard and ignore that there is, implied somewhere, an interface between the organic ecosystem of humans, etc, and that of the artificial machine systems needed to support the existence of the ASI. The implications of that are far from trivial. That is what is explored by the SNC argument. > It might well be likely  > that the amount of harm ASI prevents  >   (across multiple relevant sources)  > is going to be higher/greater than  > the amount of harm ASI will not prevent  >   (due to control/predicative limitations). It might seem so, by mistake or perhaps by  accidental (or intentional) self deception, but this can only be a short term delusion. This has nothing to do with "ASI alignment". Organic

I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chances that one of them might not suicide, for whatever reason, becomes its own class of significant concerns.

If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.

If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply ... (read more)

2flandry39
> Our ASI would use its superhuman capabilities > to prevent any other ASIs from being built. This feels like a "just so" fairy tale. No matter what objection is raised, the magic white knight always saves the day. > Also, the ASI can just decide > to turn itself into a monolith. No more subsystems? So we are to try to imagine a complex learning machine without any parts/components? > Your same SNC reasoning could just well > be applied to humans too. No, not really, insofar as the power being assumed and presumed afforded to the ASI is very very much greater than that assumed applicable to any mere mortal human. Especially and exactly because the nature of ASI is inherently artificial and thus, in key ways, inherently incompatible with organic human life. It feels like you bypassed a key question: Can the ASI prevent the relevant classes of significant (critical) organic human harm, that soon occur as a direct_result of its own hyper powerful/consequential existence? It's a bit like asking if an exploding nuclear bomb detonating in the middle of some city somewhere, could somehow use its hugely consequential power to fully and wholly self contain, control, etc, all of the energy effects of its own exploding, simply because it "wants to" and is "aligned". Either you are willing to account for complexity, and of the effects of the artificiality itself, or you are not (and thus there would be no point in our discussing it further, in relation to SNC). The more powerful/complex you assume the ASI to be, and thus also the more consequential it becomes, the ever more powerful/complex you must also (somehow) make/assume its control system to be, and thus also of its predictive capability, and also an increase of the deep consequences of its mistakes (to the point of x-risk, etc). What if maybe something unknown/unknowable about its artificialness turns out to matter? Why?  Because exactly none of the interface has ever even once been tried before -- there nothi

Thanks for the response!

Unfortunately, the overall SNC claim is that there is a broad class of very relevant things that even a super-super-powerful-ASI cannot do, cannot predict, etc, over relevant time-frames. And unfortunately, this includes rather critical things, like predicting whether or not its own existence, (and of all of the aspects of all of the ecosystem necessary for it to maintain its existence/function), over something like the next few hundred years or so, will also result in the near total extinction of all humans (and everything el... (read more)
2flandry39
> Let's assume that a presumed aligned ASI  > chooses to spend only 20 years on Earth  > helping humanity in whatever various ways > and it then (for sure!) destroys itself, > so as to prevent a/any/the/all of the  > longer term SNC evolutionary concerns  > from being at all, in any way, relevant. > What then? I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chances that one of them might not suicide, for whatever reason, becomes its own class of significant concerns. Let's leave that aside, without further discussion, for now. Similarly, if the ASI itself is not fully and absolutely monolithic -- if it has any sub-systems or components which are also less than perfectly aligned, so as to want to preserve themselves, etc -- then they might prevent whole self-termination. Overall, I notice that the sheer number of assumptions we are having to make, to maybe somehow "save" aligned AGI is becoming rather a lot. > Let's assume that the fully aligned ASI  > can create simulations of the world, > and can stress test these in various ways > so as to continue to ensure and guarantee  > that it is remaining in full alignment, > doing whatever it takes to enforce that. This reminds me of a fun quote: "In theory, theory and practice are the same, whereas in practice, they are very often not". The main question is then as to the meaning of 'control', 'ensure' and/or maybe 'guarantee'. The 'limits of control theory' aspects of the overall SNC argument basically state (based on just logic, and not physics, etc) that there are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. It is not a question of intelligence, it is a result of logic. Hence to the question of "Is alignment enough?" we arrive at a definite answer of "no", both in 1; the sense of 'can prevent all classes of significant and relevant (critical) human

Hey, Forrest! Nice to speak with you.

Question: Is there ever any reason to think... Simply skipping over hard questions is not solving them.

I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.

Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?

Here is my take:

We don't need ASI to be able to 100% predict ... (read more)

2flandry39
So as to save space herein, my complete reply is at http://mflb.com/2476 Included for your convenience below are just a few (much shortened) highlight excerpts of the added new content. The re-phrased version of the quote added these two qualifiers: "100%" and "all". Adding these has the net effect that the modified claim is irrelevant, for the reasons you (correctly) stated in your reply, insofar as we do not actually need 100% prediction, nor do we need to predict absolutely all things, nor does it matter if it takes infinitely long. We only need to predict some relevant things reasonably well in a reasonable time-frame. This all seems relatively straightforward -- else we are dealing with a straw-man. Unfortunately, the overall SNC claim is that there is a broad class of very relevant things that even a super-super-powerful-ASI cannot do, cannot predict, etc, over relevant time-frames. And unfortunately, this includes rather critical things, like predicting whether or not its own existence, (and of all of the aspects of all of the ecosystem necessary for it to maintain its existence/function), over something like the next few hundred years or so, will also result in the near total extinction of all humans (and everything else we have ever loved and cared about). There exists a purely mathematical result that there is no wholly definable program 'X' that can even *approximately* predict/determine whether or not some other arbitrary program 'Y' has some abstract property 'Z', in the general case, in relevant time intervals. This is not about predicting 100% of anything -- this is more like 'predict at all'. AGI/ASI is inherently a *general* case of "program", since neither we nor the ASI can predict learning, and since it is also the case that any form of the abstract notion of "alignment" is inherently a case of being a *property* of that program. So the theorem is both valid and applicable, and therefore it has the result that it has. So

Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.

Yup, that's a good point, I edited my original comment to reflect it.

Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force.  Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers.  The core idea that I want reader

... (read more)

Thank you for the thoughtful engagement!

On the Alignment Difficult Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.

I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty 7 alignment techniques. They consistently make assumptions that AI will scheme, deceive, sandbag, etc.

They are a decently popular group (as far as AI alignment groups go) an... (read more)

3flandry39
Noticing that a number of these posts are already very long, and rather than take up space here, I wrote up some of my questions, and a few clarification notes regarding SNC in response to the above remarks of Dakara, at [this link](http://mflb.com/ai_alignment_1/d_250126_snc_redox_gld.html).
2WillPetillo
I actually don't think the disagreement here is one of definitions.  Looking up Webster's definition of control, the most relevant meaning is: "a device or mechanism used to regulate or guide the operation of a machine, apparatus, or system."  This seems...fine?  Maybe we might differ on some nuances if we really drove down into the details, but I think the more significant difference here is the relevant context. Absent some minor quibbles, I'd be willing to concede that an AI-powered HelperBot could control the placement of a chair, within reasonable bounds of precision, with a reasonably low failure rate.  I'm not particularly worried about it, say, slamming the chair down too hard, causing a splinter to fly into its circuitry and transform it into MurderBot.  Nor am I worried about the chair placement setting off some weird "butterfly effect" that somehow has the same result.  I'm going to go out on a limb and just say that chair placement seems like a pretty safe activity, at least when considered in isolation. The reason I used the analogy "I may well be able to learn the thing if I am smart enough, but I won't be able to control for the person I will become afterwards" is because that is an example of the kind of reference class of context that SNC is concerned with.  Another is: "what is expected shift to the global equilibrium if I construct this new invention X to solve problem Y?"  In your chair analogy, this would be like the process of learning to place the chair (rewiring some aspect of its thinking process), or inventing an upgraded chair and releasing this novel product into the economy (changing its environmental context).  This is still a somewhat silly toy example, but hopefully you see the distinction between these types of processes vs. the relatively straightforward matter of placing a physical object.  It isn't so much about straightforward mistakes (though those can be relevant), as it is about introducing changes to the environment that sh

Thanks for responding again!

SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity.

If this argument is true and decisive, then the ASI could decide to stop any improvements in its intelligence or to intentionally make itself less complex. It makes sense to reduce the area where you are vulnerable in order to make it easier to monitor/control.

(My understanding of) the counter here is that, if we are on the tra

... (read more)
6WillPetillo
Before responding substantively, I want to take a moment to step back and establish some context and pin down the goalposts. On the Alignment Difficult Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best.  If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.  Conversations like this are about whether the true difficulty is 9 or 10, both of which are miles deep in the "shut it all down" category, but differ regarding what happens next.  Relatedly, if your counterargument is correct, this is assuming wildly successful outcomes with respect to goal alignment--that developers have successfully made the AI love us, despite a lack of trying. In a certain sense, this assumption is fair, since a claim of impossibility should be able to contend with the hardest possible case.  In the context of SNC, the hardest possible case is where AGI is built in the best possible way, whether or not that is realistic in the current trajectory.  Similarly, since my writing about SNC is to establish plausibility, I only need to show that certain critical trade-offs exist, not pinpoint exactly where they balance out.  For a proof, which someone else is working on, pinning down such details will be necessary. Neither of the above are criticisms of anything you've said, I just like to reality-check every once in a while as a general precautionary measure against getting nerd-sniped.  Disclaimers aside, pontification recommence! Your reference to using ASI for a pivotal act, helping to prevent ecological collapse, or preventing human extinction when the Sun explodes is significant, because it points to the reality that, if AGI is built, that's because people want to use it for big things that would require significantly more effort to accomplish without AGI.  This context sets a lower bound on the AI's capabilities and hence it's complexity, which in turn sets a floor for the burden on the

This may not be factually true, btw: current LLMs can create good models of past people without explicitly running a simulation of their previous life.

Yup, I agree.

It is a variant of the Doomsday argument. This idea is even more controversial than the simulation argument. There is no future with many people in it.

This makes my case even stronger! Basically, if a Friendly AI has no issues with simulating conscious beings in general, then we have good reasons to expect it to simulate more observers in blissful worlds than in worlds like ours.

If the Doomsday Arg... (read more)

Thank you for responding as well!

If the AI destroys itself, then it's obviously not an ASI for very long ;)

If the ASI replaces its own substrate for an organic one, then SNC would no longer apply (at least in my understanding of the theory, someone else might correct me here), but then it wouldn't be artificial anymore (an SI, rather than an ASI)

at what point does it stop being ASI?

It might stop being ASI immediately, depending on your definition, but this is absolutely fine with me. In these scenarios that I outlined, we build something that can be initia... (read more)

1WillPetillo
I'm using somewhat nonstandard definitions of AGI/ASI to focus on the aspects of AI that are important from an SNC lens.  AGI refers to an AI system that is comprehensive enough to be self sufficient.  Once there is a fully closed loop, that's when you have a complete artificial ecosystem, which is where the real trouble begins.  ASI is a less central concept, included mainly to steelman objections, referencing the theoretical limit of cognitive ability. Another core distinction SNC assumes is between an environment, an AI (that is its complete assemblage), and its control system.  Environment >> AI >> control system.  Alignment happens in the control system, by controlling the AI wrt its internals and how it interacts with the environment.  SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity.  The assertion that both of these increase together is something I hope to justify in a future post (but haven't really yet); the confident assertion that AI system complexity definitely outpaces control capacity is a central part of SNC but depends on complicated math involving control theory and is beyond the scope of what I understand or can write about. Anyways, my understanding of your core objection is that a capable-enough-to-be-dangerous and also aligned AI have the foresight necessary to see this general failure mode (assuming it is true) and not put itself in a position where it is fighting a losing battle.  This might include not closing the loop of self-sustainability, preserving dependence on humanity to maintain itself, such as by refusing to automate certain tasks it is perfectly capable of automating.  (My understanding of) the counter here is that, if we are on the trajectory where AI hobbling itself is what is needed to save us, then we are in the sort of world where someone else builds an

Firstly, I want to thank you for putting SNC into text. I also appreciated the effort of showcasing a logic chain that arrives at your conclusion.

With that being said, I will try to outline my main disagreements with the post:

2. Self-modifying machinery (such as through repair, upgrades, or replication) inevitably results in effects unforeseeable even to the ASI.

Let's assume that this is true for the sake of argument. An ASI could access this post, see this problem, and decide to stop using self-modifying machinery for such tasks.

3. The space of unfo

... (read more)
5WillPetillo
Thanks for engaging! I have the same question in response to each instance of the "ASI can read this argument" counterarguments: at what point does it stop being ASI?

* Self modifying machinery enables adaptation to a dynamic, changing environment
* Unforeseeable side effects are inevitable when interacting with a complex, chaotic system in a nontrivial way (the point I am making here is subtle, see the next post in this sequence, Lenses of Control, for the intuition I am gesturing at here)
* Keeping machine and biological ecologies separate requires not only sacrifice, but also constant and comprehensive vigilance, which implies limiting designs of subsystems to things that can be controlled. If this point seems weird, see The Robot, The Puppetmaster, and the Psychohistorian for an underlying intuition (this is also indirectly relevant to the issue of multiple entities).
* If the AI destroys itself, then it's obviously not an ASI for very long ;)
* If the ASI replaces its own substrate for an organic one, then SNC would no longer apply (at least in my understanding of the theory, someone else might correct me here), but then it wouldn't be artificial anymore (an SI, rather than an ASI)

She will be unconscious, but still send messages about pain. Current LLMs can do it. Also, as it is a simulation, there are recordings of her previous messages or of a similar woman, so they can be copy-pasted. Her memories can be computed without actually putting her in pain.

So if I am understanding your proposal correctly, then a Friendly AI will make a woman unconscious during moments of intense suffering and then implant her memories of pain. Why would it do it though? Why not just remove the experience of pain entirely? In fact, why does Friendly AI seem ... (read more)

2avturchin
My point is that it is impossible to resurrect anyone (in this model) without him reliving his life again first; after that, he obviously gets an eternal blissful life in the real (not simulated) world. This may not be factually true, btw: current LLMs can create good models of past people without explicitly running a simulation of their previous life. It is a variant of the Doomsday argument. This idea is even more controversial than the simulation argument. There is no future with many people in it. A Friendly AI can fight the DA curse via simulations, by creating many people who do not know their real time position, which can be one more argument for simulation, but it requires a rather weird decision theory.

If the preliminary results of the poll hold, then that would be pretty in line with my hypothesis of most people preferring creating simulations with no suffering over a world like ours. However, it is pretty important to note that this might not be representative of human values in general, because, looking at your Twitter account, your audience comes mostly from very specific circles of people (those interested in futurism and AI).

Would someone else reporting to have experienced intense suffering decrease your credence in being in a simulation?

No. Memory ab

... (read more)
2avturchin
She will be unconscious, but still send messages about pain. Current LLMs can do it. Also, as it is a simulation, there are recordings of her previous messages or of a similar woman, so they can be copy-pasted. Her memories can be computed without actually putting her in pain. Resurrection of the dead is part of the human value system. We need a completely non-human bliss, like hedonium, to escape this. Hedonium is not part of my reference class and thus not part of the simulation argument. Moreover, even creating a new human is affected by this argument. What if my children suffer? So it is basically an anti-natalist argument.

I am sorry to butt into your conversation, but I do have some points of disagreement.

I think a more meta-argument is valid: it is almost impossible to prove that all possible civilizations will not run simulations despite having all data about us (or being able to generate it from scratch).

I think that's a very high bar to set. It's almost impossible to definitively prove that we are not in a Cartesian demon or brain-in-a-vat scenario. But this doesn't mean that those scenarios are likely. I think it is fair to say that more than a possibility is required ... (read more)

2avturchin
We have to create a map of possible scenarios of simulations first; I attempted this in 2015. I have now created a new poll on Twitter. For now, the results are in. The question: "If you will be able to create and completely own a simulation, would you prefer that it be occupied by conscious beings, conscious beings without suffering (sufferings are blocked after some level), or NPCs?" The poll results show:

* Conscious: 18.2%
* Conscious, no suffering: 72.7%
* NPC: 0%
* Will not create simulation: 9.1%

The poll had 11 votes with 6 days left.

Yes. But I have never experienced such intense suffering in my long life. No. Memory about intense sufferings is not intense. Yes, only moments. The badness of not-intense suffering is overestimated, in my personal view, but this may depend on the person. More generally speaking, what you are presenting as global showstoppers are technical problems that can be solved. In my view, individuality is valuable. As we don't know the nature of consciousness, it can be just a side effect of computation, not a trouble. Also, it may want to have maximal fidelity or even run biological simulations: something akin to the Zoo solution of the Fermi paradox. We are living in one of the most interesting periods of history, which surely will be studied and simulated.

Diffractor's critique of AIXI comes to mind when I think of strong critiques of AIXI. I believe that addressing it would make the post more complete and, as a result, better.

2Cole Wyeth
Seems to be a restatement of Paul's, which I did respond to.

Since this post is about rebutting criticisms of AIXI, I feel it would be only fair to include Rob Bensinger's criticism. I considered it to be the strongest criticism of AIXI by a mile. Do you have any rebuttals for that post?

2Cole Wyeth
I view that as more of an interesting discussion than entirely a criticism. I just gave it a reread - he raises a lot of good points, but there's not exactly a central argument distinct from the ones I addressed as far as I can tell? He is mainly focused on digging into embeddedness issues, particularly discussing things I'd classify as "pain sensors" to prevent AIXI from destroying itself. My solution to this here is a little more thorough than the one that the pro-AIXI speaker comes up with.  The discussion of death is somewhat incorrect because it doesn't consider Turing machines which (while never halting) produce only a finite percept sequence and then "hang" or loop indefinitely. This can be viewed as death and may be considered likely in some cases. Here is a paper on it. The other criticism is that AIXI doesn't self-improve - I mean, it learns of course, but doesn't edit its own source code. There may be hacky ways around this but basically I agree - that's just not the point of the AIXI model. It's a specification for optimal intelligence and an optimal intelligence does not need to self-improve. Perhaps self-improvement is better viewed as a method of bootstrapping a weak AIXI approximation into a better one using external conceptual tools. It's probably not a necessary ingredient up to human level though; certainly modern LLMs do not self-improve (yet) and since they are pretty much black-boxes it's not clear that they will be able to until well past the point where they are smart enough to be dangerous.  

Additionally, I am curious to hear if Ryan's views on the topic are similar to Buck's, given that they work at the same organization.

All 3 points seem very reasonable, looking forward to Buck's response to them.

5Dakara
Additionally, I am curious to hear if Ryan's views on the topic are similar to Buck's, given that they work at the same organization.

I've noticed that in your sentence about Max Harms's corrigibility plan there is an extra space after the parentheses which breaks the link formatting on my end. I tried marking it with "typo" emoji, but not sure if it is visible.

4Seth Herd
Thanks, fixed it!

Thank you for your response! It basically covers all of the five issues that I had in mind. It is definitely some food for thought, especially your disagreement with Eliezer. I am much more inclined to think you are correct because his activity has considerably died down (at least on LessWrong). I am really looking forward to your "A broader path: survival on the default fast timeline to AGI" post.

E.g., I think software only singularity is more likely than you do, and think that worst case scheming is more likely

By "software only singularity" do you mean a scenario where all humans are killed before singularity, a scenario where all humans merge with software (uploading) or something else entirely?

8ryan_greenblatt
Software only singularity is a singularity driven by just AI R&D on a basically fixed hardware base. As in, can you singularity using only a fixed datacenter (with no additional compute over time) just by improving algorithms? See also here. This isn't directly talking about the outcomes from this. You can get a singularity via hardware+software where the AIs are also accelerating the hardware supply chain such that you can use more FLOP to train AIs and you can run more copies. (Analogously to the hyperexponential progress throughout human history seemingly driven by higher population sizes, see here.)

I agree with your view about organizational problems. Your discussion gave me an idea: Is it possible to shift employees dedicated to capability improvement to work on safety improvement? Set safety goals for these employees within the organization. This way, they will have a new direction and won't be idle, worried about being fired or resigning to go to other companies.

That seems to solve problem #4. Employees quitting becomes much less of an issue, since in any case they would only be able to share knowledge about safety (which is a good thing).

Do yo... (read more)

2Weibing Wang
I think this plan is not sufficient to completely solve problems #1, #2, #3 and #5. I can't come up with a better one for the time being. I think more discussions are needed.

I actually really liked your alignment plan. But I do wonder how it would be able to deal with 5 "big" organizational problems of iterative alignment:

  1. Moving slowly and carefully is annoying. There's a constant tradeoff between getting more done and elevated risk. Employees who don't believe in the risk will likely try to circumvent or goodhart the security procedures. Filtering for employees willing to take the risk seriously (or training them to) is difficult. There's also the fact that many security procedures are just security theater. Engineers h

... (read more)
4Weibing Wang
I agree with your view about organizational problems. Your discussion gave me an idea: Is it possible to shift employees dedicated to capability improvement to work on safety improvement? Set safety goals for these employees within the organization. This way, they will have a new direction and won't be idle, worried about being fired or resigning to go to other companies. Besides, it's necessary to make employees understand that improving safety is a highly meaningful job. This may not rely solely on the organization itself, but also require external pressure, such as from the government, peers, or the public. If the safety cannot be ensured, your product may face a lot of criticism and even be restricted from market access. And there will be some third-party organizations conducting safety evaluations of your product, so you need to do a solid job in safety rather than just going through the motions. 

The main thing, at least for me, is that you seem to be the biggest proponent of scalable alignment and you are able to defend this concept very well. All of your proposals seem very much down-to-earth.

Sorry it's taken me a while to get back to this.

No problem posting your questions here. I'm not sure of the best place but I don't think clutter is an issue, since LW organizes everything rather well and has good UIs to navigate it.

I read that "Carefully Bootstrapped Alignment" is organizationally hard by Raemon and it did make some good points. Most of them I'd considered, but not all of them.

Pausing is hard. That's why my scenarios barely involve pauses and only address them because others see them as a possibility.

Basically I think we have to get alignm... (read more)

Edit: I hope that I am not cluttering the comments by asking these questions. I am hoping to create a separate post where I list all the problems that were raised for the scalable alignment proposal and all the proposed solutions to them. So far, everything you said not only seemed sensible, but also plausible, so I greatly value your feedback.

I have found some other concerns about scalable oversight/iterative alignment, that come from this post by Raemon. They are mostly about the organizational side of scalable oversight:

  1. Moving slowly and carefully
... (read more)
3Dakara
The main thing, at least for me, is that you seem to be the biggest proponent of scalable alignment and you are able to defend this concept very well. All of your proposals seem very much down-to-earth.

Have you uploaded a new version of this article? I have just been reading elsewhere about goal misgeneralisation and the shutdown problem, so I'd be really interested to read the new version.

4Noosphere89
This post is the spiritual successor to the old post, shown below: https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities

I have posted this text as a standalone question here

I have found another possible concern of mine.

Consider gravity on Earth: it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, etc. There are infinitely many such theories and only one theory that gravity will work as an absolute rule.

We might infer from the simplest explanation that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, our evidence must rule out whether an AI is following a misaligned rule compare... (read more)

1Dakara
I have posted this text as a standalone question here

Do you think that the scalable oversight/iterative alignment proposal that we discussed can get us to the necessary amount of niceness to make humans survive with AGI?

4Noosphere89
My answer is basically yes. I was only addressing the question "If we basically failed at alignment, or didn't align the AI at all, but had a very small amount of niceness, would that lead to good outcomes?"

After reading your comment, I do agree that a unipolar AGI scenario is probably better than a multipolar one. Perhaps I underestimated how offense-favored our world is.

With that aside, your plan is possibly one of the clearest, most intuitive alignment plans that I've seen. All of the steps make sense and seem decently likely to happen, except maybe for one. I am not sure that your argument for why we have good odds for getting AGI into trustworthy hands works.

"It seems as though we've got a decent chance of getting that AGI into a trustworthy-enough power s... (read more)

6Seth Herd
I wish the odds for getting AGI into trustworthy hands were better. The source of my optimism is the hope that those hands just need to be decent - to have what I've conceptualized as a positive empathy - sadism balance. That's anyone who's not a total sociopath (lacking empathy and tending toward vengeance and competition) and/or sadist. I hope that about 90-99% of humanity would eventually make the world vastly better with their AGI, just because it's trivially easy for them to do, so it only requires the smallest bit of goodwill. I wish I were more certain of that. I've tried to look a little at some historical examples of rulers born into power and with little risk of losing it. A disturbing number of them were quite callous rulers. They were usually surrounded by a group of advisors that got them to ignore the plight of the masses and focus on the concerns of an elite few. But this situation isn't analogous - once your AGI hits superintelligence, it would be trivially easy to both help the masses in profound ways, and pursue whatever crazy schemes you and your friends have come up with. Thus my limited optimism. WRT the distributed power structure of Western governments: I think AGI would be placed under executive authority, like the armed forces, and the US president and those with similar roles in other countries would hold near-total power, should they choose to use it. They could transform democracies into dictatorships with ease. And we very much do continue to elect selfish and power-hungry individuals, some of whom probably actually have a negative empathy-sadism balance. Looking back, I note that you said I argued for "good odds" while I said "decent odds". We may be in agreement on the odds. But there's more to consider here. Thanks again for engaging; I'd like to get more discussion of this topic going. I doubt you or I are seeing all of the factors that will be obvious in retrospect yet.

I've asked this question to others, but would like to know your perspective (because my conversations with you have been genuinely illuminating for me). I'd be really interested in knowing your views on more of a control-by-power-hungry-humans side of AI risk.

For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don't think we could trust any of the current leading AI labs to use that power fairly. I don't think this lab would voluntarily decide to give up control over AGI either (intuitively... (read more)

2Noosphere89
This is my main risk scenario nowadays, though I don't really like calling it an existential risk, because the billionaires can survive and spread across the universe, so some humans would survive. The solution to this problem is fundamentally political, and probably requires massive reforms of both the government and the economy that I don't know yet. I wish more people worked on this.

If it's not a big ask, I'd really like to know your views on more of a control-by-power-hungry-humans side of AI risk.

For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don't think I could trust any of the current leading AI labs to use that power fairly. I don't think this lab would voluntarily decide to give up control over it either (intuitively, it would take quite something for anyone to give up such a source of power). Is there anything that can be done to prevent such a scenario?

3Bogdan Ionut Cirstea
I'm very uncertain and feel somewhat out of depth on this. I do have quite some hope though from arguments like those in https://aiprospects.substack.com/p/paretotopian-goal-alignment.

What's most worrying is the fact that in your post If we solve alignment, do we die anyway? you mentioned your worries about multipolar scenarios. However, I am not sure we'd be much better off in a unipolar scenario. If there is one group of people controlling AGI, then it might actually be even harder to make them give it up. They'd have a large amount of power and no real threat to it (no multipolar AGIs threatening to launch an attack).

However, I am not well-versed in the literature on this topic, so if there is any plan for how we can safeguard ou... (read more)

4Seth Herd
I think that's a pretty reasonable worry. And a lot of people share it. Here's my brief take. Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours I'm less worried about that because it seems like one questionable group with tons of power is way better than a bunch of questionable groups with tons of power - if the offense-defense balance tilts toward offense, which I think it does. The more groups, the more chance that someone uses it for ill. Here's one update on my thinking: mutually assured destruction will still work for most of the world. ICBMs with nuclear payloads will be obsoleted at some point, but AGIs will also likely be told to find even better/worse ways to destroy stuff. So possibly everyone with an AGI will go ahead and hold the whole earth hostage, just so whoever starts a war doesn't get to keep any of their stuff they were keeping on the planet. That makes the incentive to get off planet and possibly keep going. It's really hard to see how this stuff plays out, but I suspect it will be obvious what the constraints and incentives and distribution of psychologies was in retrospect. So I appreciate your help in thinking through it. We don't have answers yet, but they may be out there. I don't think it would be much harder for a group to give it up if they were the only ones who had it. And maybe there's not much difference between a full renunciation of control and just saying "oh fine, I'm tired of running the world, do whatever it seems like everybody wants but check major changes with me in case I decide to throw my weight around instead of hanging out in the land of infinite fun".

Isn't it a bit too late for that? If o1 gets publicly released, then according to that article, we would have an expert-level consultant in bioweapons available for everyone. Or do you think that o1 won't be released?

3Noosphere89
I don't buy that o1 has actually given people expert-level bioweapons, so my actions here are more so about preparing for future AI that is very competent at bioweapon building. Also, even with the current level of jailbreak resistance/adversarial example resistance, assuming no open-weights/open sourcing of AI is achieved, we can still make AIs that are practically hard to misuse by the general public. See here for more: https://www.lesswrong.com/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais

After some thought, I think this is a potentially really large issue which I don't know how we can even begin to solve. We could have an aligned AI that is aligned with someone who wants to create bioweapons. Is there anything being done (or anything that can be done) to prevent that?

3Noosphere89
The answer to this question is actually 2 things:

1. This is why I expect we will eventually have to fight to ban open-source, and we will have to get the political will to ban both open-source and open-weights AI.
2. This is where the unlearning field comes in. If we could make the AI unlearn knowledge, an example being nuclear weapons, we could possibly distribute AI safely without causing novices to create dangerous stuff. More here:

https://www.lesswrong.com/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms

https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm

But these solutions are intentionally designed to make AI safe without relying on alignment.

Here is another concern that I could see with the plan. Step 1 is to create safe and aligned AI, but there are some results which suggest that even current AIs may not be as safe as we want them to be. For example, according to this article, current AI (specifically o1) can help novices build CBRN weapons and significantly increase the threat to the world. Do you think this is concerning, or do you think that this threat will not materialize?

2Noosphere89
The threat model is plausible enough that some political actions should be done, like banning open-source/open-weight models, and putting in basic Know Your Customer checks.
1Dakara
After some thought, I think this is a potentially really large issue which I don't know how we can even begin to solve. We could have an aligned AI that is aligned with someone who wants to create bioweapons. Is there anything being done (or anything that can be done) to prevent that?

That's a very thoughtful response from TurnTrout. I wonder if @Gwern agrees with its main points. If not, it would be good to know where he thinks it fails.

5gwern
I did not understand his response at all, and it sounds like I would have to reread a bunch of Turntrout posts before any further comment would just be talking past each other, so I don't have anything useful to say. Maybe someone else can re-explain his point better and why I am apparently wrong.

P.S. Here is the link to the question that I posted.

Noosphere, I am really, really thankful for your responses. You completely answered almost all of the concerns that I had about alignment (I am still not convinced by that strategy for avoiding value drift; I will probably post it as a question to see if other people have different strategies for preventing value drift).

This discussion significantly increased my knowledge. If I could triple upvote your answers, I would. Thank you! Thank you a lot!

1Dakara
P.S. Here is the link to the question that I posted.