Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?
Yes, but none of this would remain alive if I, as a whole, decide to jump from a cliff. The multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn't self-terminate with the rest of the ASI. Whereas in a human body, jumping from a cliff terminates everything.
But even barring that, an ASI can decide to fly into the Sun, and any subsystem that shows any sign of refusing to do so will be immediately replaced/impaired/terminated. In fact, it would've been terminated a long time ago by the "monitors" which I described before.
The level of x-risk harm and consequence
potentially caused by even one single mistake
of your angelic super-powerful enabled ASI
is far from "trivial" and "uninteresting".
Even one single bad relevant mistake
can be an x-risk when ultimate powers
and ultimate consequences are involved.
It is trivial and uninteresting in the sense that there is a set of all things that we can build (set A). There is also a set of all things that can prevent all relevant classes of harm caused by their own existence (set B). If these sets don't overlap, then saying that a specific member of set A isn't included in set B is indeed trivial, because we already know this via more general reasoning (namely, that these sets don't overlap).
Unfortunately the 'Argument by angel'
only confuses the matter insofar as
we do not know what angels are made of.
"Angels" are presumably not machines,
but they are hardly animals either.
But arguing that this "doesn't matter"
is a bit like arguing that 'type theory'
is not important to computer science.
The substrate aspect is actually important.
You cannot simply just disregard and ignore
that there is, implied somewhere, an interface
between the organic ecosystem of humans, etc,
and that of the artificial machine systems
needed to support the existence of the ASI.
But I am not saying that it doesn't matter. On the contrary, I made my analogy in such a way that the helper (namely our guardian angel) is a being that is commonly thought to be made of a different substrate. In fact, in this example, you aren't even sure what it is made of, beyond knowing that it's clearly a different substrate. You don't even know how that material interacts with the physical world. That's even less than what we know about ASIs and their material.
And yet, getting a personal, powerful, intelligent guardian angel that would act in your best interests for as long as it can (it's a guardian angel, after all) seems like an obviously good thing.
But if you disagree with what I wrote above, let the takeaway at least be that you are worried about case (2) and not case (1). After all, knowing that there might be pirates hunting for this angel (who couldn't be detected by said angel) didn't make you immediately decline the proposal. You started talking about substrate, which fits the concerns of someone who is worried about case (2).
Your cancer vaccine is within that range;
as it is made of the same kind of stuff
as that which it is trying to cure.
We can make the hypothetical more interesting. Let's say that this vaccine is not created from organic stuff, but that it has passed all the tests with flying colors. Let's also assume that this vaccine has been in testing for 150 years and has shown absolutely no side effects over an entire human lifetime (say it was injected into 2-year-olds and showed no side effects at all, even in 90-year-olds who have lived with this vaccine their entire lives). Let's also assume that it has been tested to have no side effects on the children and grandchildren of those who took said vaccine. Would you be campaigning for throwing away such a vaccine, just because it is based on a different substrate?
Thanks for the response!
So we are to try to imagine a complex learning machine without any parts/components?
Yeah, sure. Humans are an example. If I decide to jump off a cliff, my arm isn't going to say "alright, you jump but I stay here". Either I, as a whole, jump or I, as a whole, do not.
Can the ASI prevent the relevant classes
of significant (critical) organic human harm,
that soon occur as a direct_result of its
own hyper powerful/consequential existence?
If by that you mean "can ASI prevent some relevant classes of harm caused by its existence", then the answer is yes.
If by that you mean "can ASI prevent all relevant classes of harm caused by its existence", then the answer is no, but almost nothing can, so the definition becomes trivial and uninteresting.
However, ASI can prevent a bunch of other relevant classes of harm to humanity. And it may well be that the amount of harm it prevents, across multiple relevant sources, is higher than the amount of harm it fails to prevent due to predictive limitations.
This again runs into my guardian angel analogy. The guardian angel also cannot prevent all relevant sources of harm caused by its existence. Perhaps there are pirates who hunt guardian angels, hiding in the next galaxy. They might use special cloaks that hide them from the guardian angel's radar. As soon as you accept the guardian angel's help, perhaps they would destroy the Earth in their pursuit.
But similarly, the decision to reject the guardian angel's help doesn't prevent all relevant classes of harm caused by that decision itself. Perhaps there are guardian angel worshippers who are traveling as fast as they can to Earth to see their deity. But just before they arrive, you reject the guardian angel's help and it disappears. Enraged at your decision, the worshippers destroy Earth.
So as you can see, neither the decision to accept nor the decision to reject the guardian angel's help can prevent all relevant classes of harm caused by itself.
What if maybe something unknown/unknowable
about its artificalness turns out to matter?
Why? Because exactly none of the interface
has ever even once been tried before
Imagine that we create a vaccine for cancer (just imagine). Just before releasing it to the public, one person says "what if maybe something unknown/unknowable about its substance turns out to matter? What if we are all in a simulation, and the injection of that particular substance would make it so that our simulators start torturing all of us. Why? Because this particular substance has never once been injected before."
I think we can agree that the researchers shouldn't throw away the cancer vaccine, despite hearing this argument. It could be argued just as well that the simulators would torture us for throwing away the vaccine.
Another example: let's go back a couple of hundred years, to pre-electricity times. Imagine a worried person coming to a scientist working on early electricity theory and saying "What if maybe something unknown/unknowable about its effects turns out to matter? Why? Because exactly none of this has ever even once been tried before."
This worried person could also have given an example of the dangers of electricity by pointing out that lightning kills the people it strikes.
Should the scientist therefore have stopped working on electricity?
I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chances that one of them might not suicide, for whatever reason, becomes its own class of significant concerns.
If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.
If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply wouldn't allow us to build an aligned ASI.
So basically, if we manage to build an ASI without being prevented from doing so by other ASIs, then our ASI would use its superhuman capabilities to prevent other ASIs from being built.
Similarly, if the ASI itself
is not fully and absolutely monolithic --
if it has any sub-systems or components
which are also less then perfectly aligned,
so as to want to preserve themselves, etc --
that they might prevent whole self termination
ASI can use exactly the same security techniques for preventing this problem as for preventing case (2). However, solving this issue is probably even easier, because, in addition to the security techniques, ASI can just decide to turn itself into a monolith (or, in other words, remove those subsystems).
The 'limits of control theory' aspects
of the overall SNC argument basically states
(based on just logic, and not physics, etc)
that there are still relevant unknown unknowns
and interactions that simply cannot be predicted,
no matter how much compute power you throw at it.
It is not what we can control and predict and do,
that matters here, but what we cannot do,
and could never do, even in principle, etc.
This same reasoning could just as well be applied to humans. There are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. With or without ASI, some things cannot be predicted.
This is what I meant by my guardian angel analogy. Just because a guardian angel doesn't know everything (has some unknowns) doesn't mean that we should expect our lives to go better without it than with it, because humans have even more unknowns, due to being less intelligent and having lesser predictive capacities.
Hence to the question of "Is alignment enough?"
we arrive at a definite answer of "no",
both in 1; the sense of 'can prevent all classes
of significant and relevant (critical) human harm
I think we might be thinking about different meanings of "enough". For example, if humanity goes extinct in 50 years without alignment and it goes extinct in 10¹² years with alignment, then alignment is "enough"... to achieve better outcomes than would be achieved without it (in this example).
In the sense of "can prevent all classes of significant and relevant (critical) human harm", almost nothing is ever enough, so this again runs into an issue of being a very narrow, uncontroversial and inconsequential argument. If ~all of the actions that we can take are not enough, then the fact that building an aligned ASI is not enough is true almost by definition.
Thanks for the response!
Unfortunately, the overall SNC claim is that
there is a broad class of very relevant things
that even a super-super-powerful-ASI cannot do,
cannot predict, etc, over relevant time-frames.
And unfortunately, this includes rather critical things,
like predicting the whether or not its own existence,
(and of all of the aspects of all of the ecosystem
necessary for it to maintain its existence/function),
over something like the next few hundred years or so,
will also result in the near total extinction
of all humans (and everything else
we have ever loved and cared about).
Let's say that we are in the scenario I've described, where the ASI spends 20 years on Earth helping humanity and then destroys itself. In this scenario, how can the ASI predict that it will stay aligned for these 20 years?
Well, it can reason like I did. There are two main threat models: what I called case (1) and case (2). ASI doesn't need to worry about case (1), for reasons I described in my previous comment.
So it's only left with case (2). The ASI needs to prevent case (2) for 20 years. It can do so by implementing a security system that is much better than even the one that I described in my previous comment.
It can also try to stress-test copies of parts of its security system against a group of the best human hackers. Furthermore, it can run approximate simulations that (while imperfect and imprecise) can still give it some clues. For example, if it runs 10,000 simulations that each last 100,000 years and in none of them does the security system come anywhere near being breached, then that's a positive sign.
And these are just two ways of estimating the strength of the security system. The ASI can try 1,000 different strategies; our cyber security experts would look like kids in a playground in comparison. That's how it can make a reasonable prediction.
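To make the simulation-based estimate a bit more concrete, here is a minimal sketch of the idea. Everything in it is hypothetical: the per-year breach probability is a made-up number, and the "simulation" is a one-line stub standing in for whatever imperfect models the ASI would actually run.

```python
import random

def simulate_history(years: int, p_breach_per_year: float) -> bool:
    """Stub simulation: was the security system breached at least once?"""
    # One draw per simulated history, using the cumulative breach probability.
    p_breached = 1.0 - (1.0 - p_breach_per_year) ** years
    return random.random() < p_breached

def estimate_breach_frequency(n_sims: int, years: int, p: float) -> float:
    breaches = sum(simulate_history(years, p) for _ in range(n_sims))
    return breaches / n_sims

# The numbers from the comment above: 10,000 simulations of 100,000 years each.
# Zero (or near-zero) observed breaches is a positive sign, though it only
# bounds the risk rather than proving it is zero.
print(estimate_breach_frequency(n_sims=10_000, years=100_000, p=1e-12))
```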
> First, let's assume that we have created an Aligned ASI
How is that rational? What is your evidence?
We are making this assumption for the sake of discussion. This is because the post under which we are having this discussion is titled "What if Alignment is Not Enough?"
In order to understand whether X is enough for Y, it only makes sense to assume that X is true. If you are discussing cases where "X is true" is false, then you are going to be answering a question that is different from the original question.
It should be noted that making an assumption for the sake of discussion is not the same as making a prediction that this assumption will come true. One can say "let's assume that you have landed on the Moon, how long do you think you would survive there given that you have X, Y and Z" without thereby predicting that their interlocutor will land on the Moon.
Also, the SNC argument is asserting that the ASI,
which is starting from some sort of indifference
to all manner of human/organic wellbeing,
will eventually (also necessarily)
*converge* on (maybe fully tacit/implicit) values --
ones that will better support its own continued
wellbeing, existence, capability, etc,
with the result of it remaining indifferent,
and also largely net harmful, overall,
to all human beings, the world over,
in a mere handful of (human) generations.
If ASI doesn't care about human wellbeing, then we have clearly failed to align it. So I don't see how this is relevant to the question "What if Alignment is Not Enough?"
In order to investigate this question, we need to determine whether solving alignment leads to good or bad outcomes.
Determining whether failing to solve alignment is going to lead to good or bad outcomes is answering a completely different question, namely "do we achieve good or bad outcomes if we fail to solve alignment?"
So at this point, I would like to ask for some clarity. Is SNC saying just (A) or both (A and B)?
(A) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, if the aforementioned ASI is misaligned.
(B) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, even if the aforementioned ASI is aligned.
If SNC is saying just (A), then SNC is a very narrow argument that proves almost nothing new.
If SNC is saying both (A and B), then it is very much relevant to focus on cases where we do indeed manage to build an aligned ASI, which does care about our well-being.
Hey, Forrest! Nice to speak with you.
Question: Is there ever any reason to think... Simply skipping over hard questions is not solving them.
I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.
Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?
Here is my take:
We don't need ASI to be able to 100% predict future in order to achieve better outcomes with it than without it. I will try to outline my case step by step.
First, let's assume that we have created an Aligned ASI. Perfect! Let's immediately pause here. What do we have? We have a superintelligent agent whose goal is to act in our best interests for as long as possible. Can we a priori say that this fact is good for us? Yes, of course! Imagine having a very powerful guardian angel looking after you. You could reasonably expect your life to go better with such an angel than without it.
So what can go wrong; what are our threat models? There are two main ones: (1) the ASI encountering something it didn't expect, leading to bad outcomes that it cannot protect humanity from; (2) the ASI's values changing in such a way that it no longer wants to act in our best interests. Let's analyze both of these cases separately.
First let's start with case (1).
Perhaps the ASI overlooked one of the humans becoming a bioterrorist who kills everyone on Earth. That's tragic; I guess it's time to throw the idea of building an aligned ASI into the bin, right? Well, not so fast.
In a counterfactual world where the ASI didn't exist, this same bioterrorist could've done the exact same thing. In fact, it would've been much easier. Since humans' predictive power is less than that of an ASI, bioterrorism of this sort would be much easier without an aligned ASI. After all, since we are discussing case (1) and not case (2), our ASI is still in its "superpowerful, superintelligent guardian angel" mode.
We still a priori want all bioterrorists to go up against security systems created by a superintelligence rather than security systems created by humans, because the former are better than the latter. To put it another way, with or without a guardian angel, humanity is going to encounter unpredicted scenarios, but humanity with a guardian angel is going to be better equipped to handle them.
Let's move on to case (2).
I suspect that this case is the one that you are focusing on the most in SNC. What if our guardian angel stops being our guardian angel and turns into an uncaring machine right when we need its help to implement upgraded measures against bioterrorism? Well, that would be bad. So what can be done to prevent this from happening for a reasonable amount of time?
Let's recall case (1), what went wrong there? ASI was unable to prevent a specific bad outcome from happening. This should give us a hint.
Cases (1) and (2) are part of the famous "unable, unwilling, able and willing" framing. Case (1) showcased a scenario where the ASI was willing, but unable, to prevent a specific bad outcome. Case (2) showcases a scenario where the ASI is able, but unwilling, to prevent a bad outcome. In order to avoid case (2), the ASI should thus be willing to prevent bad outcomes from happening to humanity. This can be achieved by ensuring the ASI has the correct values.
Recall that we started this analysis by assuming that we have created an Aligned ASI. That means that at one point our ASI had the correct values. That, in turn, means that at some point the ASI's values went from correct to incorrect. In order to prevent case (2), we need to prevent this value shift from happening. How might we (or rather the ASI) try to accomplish this?
The first direction of security measures is the most obvious one: systems that make it so that the ASI's values don't change regardless of the input (let's call them "protectors"). But... this is... kinda boring? What if a protector malfunctions? Well, we can just have multiple, unrelated, unconnected protectors running at the same time. But we can also implement something additional.
At this point, we can introduce our second direction of security measures: systems that monitor other systems for malfunctions (let's call them "monitors"). Monitors will monitor protectors as well as other monitors. Whenever a malfunction is detected, the monitors hold a vote to fix it. Thus, for any malfunction to have a chance to do damage, at least half of the monitors must be malfunctioning as well. It must be noted here that evolution usually works rather slowly. It isn't that likely that it would somehow make half of the monitors malfunction, especially considering that these monitors could be made different from one another, with different security measures stacked on top of them.
But let's move on; we also have our third direction of security measures: systems that will shut down the ASI if it has done harm to humans (let's call them "blackmailers"). Blackmailers can have a straightforward design. They constantly monitor humanity, and if something bad happens to it, they shut down/reset the ASI. The ASI knows this fact, so it would know that any such action would be met with an immediate "game over". This can also be turned into an attractor state.
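To make the monitor-voting idea a little more concrete, here is a minimal sketch. Everything in it is hypothetical (the class names, the majority threshold, the way a "compromised" monitor behaves); it only illustrates the logic that a detected malfunction gets fixed unless at least half of the monitors are themselves compromised.

```python
from dataclasses import dataclass

@dataclass
class Monitor:
    compromised: bool = False

    def votes_to_fix(self, malfunction_detected: bool) -> bool:
        # An honest monitor votes to fix any detected malfunction;
        # a compromised one votes to ignore it.
        return malfunction_detected and not self.compromised

def malfunction_gets_fixed(monitors: list[Monitor]) -> bool:
    votes = sum(m.votes_to_fix(malfunction_detected=True) for m in monitors)
    # The malfunction is repaired if a majority votes to fix it, so an
    # attacker must compromise at least half of the monitors to do damage.
    return votes > len(monitors) / 2

# Example: 99 independent monitors, 10 of them compromised.
monitors = [Monitor(compromised=(i < 10)) for i in range(99)]
print(malfunction_gets_fixed(monitors))  # True: the honest majority prevails
```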
All of these proposals don't even scratch the surface of the security measures available to secure the values of the ASI, ranging from simulations that, while imperfect and imprecise, can still give useful information, to systems that inject large amounts of disutility for any fishy actions. You can also stack further security measures on top of security measures. At no point anywhere here did I say that we need 100% predictability.
Can the pull towards benign future ASI states,
(as created by whatever are its internal control systems)
be overcome in critical, unpredictable ways,
by the greater strength of the inherent math
of the evolutionary forces themselves?
Of course they can.
The fact that evolution can overcome control systems given infinite time doesn't matter that much, because we don't have infinite time. And our constraint isn't even the heat death of the universe. Our constraint is how long humanity can survive in a scenario where it doesn't build a Friendly ASI. But wait, even that isn't our real constraint. Perhaps the ASI (being superhumanly intelligent) will take 20 years to give humanity the technology that will aid its long-term survival and then destroy itself. In this scenario the time constraint is merely 20 years. Depending on the ASI, this can be reduced even further.
Are we therefore assuming also that an ASI
can arbitrarily change the laws of physics?
That it can maybe somehow also change/update
the logic of mathematics, insofar as that
would necessary so as to shift evolution itself?
I hope that this answer demonstrated to you that my analysis doesn't require breaking the laws of physics.
Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.
Yup, that's a good point, I edited my original comment to reflect it.
Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against. If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.
With that being said, we have come to a point of agreement. It was a pleasure to have this discussion with you. It made me think of many fascinating things that I wouldn't have thought about otherwise. Thank you!
Thank you for thoughtful engagement!
On the Alignment Difficult Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.
I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty 7 alignment techniques. They consistently make assumptions that AI will scheme, deceive, sandbag, etc.
They are a decently popular group (as far as AI alignment groups go) and they co-author papers with tech giants like Anthropic.
If it is changing, then it is evolving. If it is evolving, then it cannot be predicted/controlled.
I think we might be using different definitions of control. Consider this scenario (assuming a very strict definition of control):
Can I control the placement of a chair in my own room? I think the intuitive answer is yes. After all, if I own the room and I own the chair, then there isn't much in the way of me changing the chair's placement.
However, I haven't considered a scenario where there is someone else hiding in my room and moving my chair. I similarly haven't considered a scenario where I am living in a simulation and I have no control whatsoever over the chair. Not to mention scenarios where someone in the next room is having fun with their newest chair-magnet.
Hmmmm, ok, so I don't actually know that I control my chair. But surely I control my own arm, right? Well... The fact that there are scenarios like the simulation scenario I just described means that I don't really know if I control it.
Under a very strict definition of control, we don't know if we control anything.
To avoid this, we might decide to loosen the definition a bit. Perhaps we control something if it can be reasonably said that we control that thing. But I think this is still unsatisfactory. It is very hard to pinpoint exactly what is reasonable and what is not.
I am currently away from my room, and it is located on the ground floor of a house where (as far as I know) nobody is currently at home. Is it that unreasonable to say that a burglar might be in my room, controlling the placement of my chair? Is it that unreasonable to say that a car that I am about to ride in might malfunction and I will fail to control it?
Unfortunately, under this definition, we also might end up not knowing if we control anything. So in order to preserve the ordinary meaning of the word "control", we have to loosen our definition even further. And I am not sure that when we arrive at our final definition it is going to be obvious that "if it is evolving, then it cannot be predicted/controlled".
At this point, you might think that the definition of the word control is a mere semantic quibble. You might bite the bullet and say "sure, humans don't have all that much control (under a strict definition of "control"), but that's fine, because our substrate is an attractor state that helps us chart a more or less decent course."
Such a line of response seems present in your Lenses of Control post:
While there are forces pulling us towards endless growth along narrow metrics that destroy anything outside those metrics, those forces are balanced by countervailing forces anchoring us back towards coexistence with the biosphere. This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival.
But here I want to note that the ASI that we are talking about also might have attractor states: its values and its security system, to name a few.
So then we have a juxtaposition:
Humans have forces pushing them towards destruction. We also have substrate-dependence that pushes us away from destruction.
ASI has forces pushing it towards destruction. It also has its values and its security system that push it away from destruction.
For SNC to work and be relevant, it must be the case that (1) the substrate-dependence of humans is and will be stronger than the forces pushing us towards destruction, so that we would not succumb to doom, and (2) the ASI's values plus its security system will be weaker than the forces pushing it towards destruction, so that the ASI would doom humans. Neither of these points is obvious to me.
(1) could turn out to be false, for several reasons:
Firstly, it might well be the case that we are on the track to destruction without ASI. After all, substrate-dependence is in a sense a control system. It seemingly attempts to make complex and unpredictable humans act in a certain way. It might well be the case that the amount of control necessary is greater than the amount of control that substrate-dependence has. We might be headed towards doom with or without ASI.
Secondly, it might be the case that substrate-dependence is weaker than the forces pulling us towards destruction, but we haven't succumbed to doom because of something else. For example, it might be the case that humans have so far had a shared subjective value system that mostly prevented them from destroying other humans. As humans learn, they evolve and change, our values change with them, and that could drive us towards doom.
Thirdly, it might even be the case that human values, substrate-dependence and forces pushing us towards destruction create a rock-paper-scissors triangle. Substrate-dependence could be stronger than destructive forces, but human values could make humanity stray far enough from their substrate to substantially weaken substrate-dependence. This could be enough for doom without ASI.
(2) could also turn out to be false for several reasons:
Firstly, it might be the case that in ASI's (and potentially humans') case, the destructive forces are weak enough that some relatively simple actions would be sufficient to mitigate them.
Secondly, it might be the case that the destructive forces pushing on the ASI are pretty strong, but its security system is enough to mitigate them. This security system could create a 100,000-layer defense in which every layer would need to be hacked at the same time for the destructive forces to prevail (a rough calculation below illustrates how demanding that is). It could well be the case that (if this is possible at all before the heat death of the universe) this happens only after humans would have died in a counterfactual scenario without an ASI.
Thirdly, if ASI judges the destructive forces to be so strong, as to somehow hack the best security system it can come up with, then ASI could decide to undergo one of my proposed modifications to mitigate the impact of these destructive forces.
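As the rough, purely illustrative calculation promised above: assume (hypothetically) that each layer is compromised independently in a given year with some fixed probability. Independence is a strong assumption, and correlated pressure on all layers at once is exactly the kind of thing that would need to be investigated, but the sketch shows the shape of the numbers.

```python
import math

# Toy, back-of-the-envelope numbers: the layer count is from the example
# above; the per-layer, per-year compromise probability and the
# independence assumption are hypothetical.
n = 100_000   # number of defense layers
p = 0.01      # assumed chance a single layer is compromised in a given year

# P(all layers compromised in the same year) = p**n; work in log space
# to avoid floating-point underflow.
log10_joint = n * math.log10(p)
print(f"P(all {n} layers breached in the same year) ~ 10^{log10_joint:.0f}")
# => ~10^-200000 under these (hypothetical) independence assumptions.
```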
In addition to that, ASI could think to itself "Hmmm, my attractor states (namely my values and my security system) are too weak. How about I create some more attractor states!" ASI could then go on to create a mechanism that would destroy/impair/deactivate/disable itself if its existence is no longer safe for humans. In Lenses of Control you wrote "This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival." Well, now this balance will persist in ASI, because its new attractor state will create a constant, implicit need to remain aligned to humanity, since it depends on it for its survival.
Alternative proposal:
The ASI could turn itself into 100 ASIs in one "body". Let's call them shards (completely unrelated to shard theory). For the ASI-as-a-whole to perform an action, all shards must agree on it. Each shard has its own security system consisting of many different layers. Furthermore, whenever the ASI-as-a-whole learns something, only one shard actually learns it. Then, this shard proposes a course of action. If this course of action is something like "let's turn the universe into paperclips", then the other 99 shards would learn that the first shard has been compromised and would promptly delete it and replace it with a new one. Thus, any potential change of values would not only have to deal with all the security layers of the ASI-as-a-whole, but also with all the security layers of the different shards and with this new quarantine system.
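A minimal sketch of that shard proposal, with everything hypothetical (the class names, the review rule, the replacement step, even what counts as a "compromised" proposal); it only illustrates the unanimity-plus-quarantine logic:

```python
from dataclasses import dataclass

@dataclass
class Shard:
    # Each shard would carry its own stack of security layers; here just a count.
    layers: int = 3

    def flags(self, proposal: str) -> bool:
        # Stand-in for this shard's independent safety review of a proposal.
        return "paperclip" in proposal

def asi_act(shards: list[Shard], proposer: int, proposal: str) -> bool:
    """The ASI-as-a-whole acts only if every non-proposing shard approves."""
    reviewers = [s for i, s in enumerate(shards) if i != proposer]
    if any(s.flags(proposal) for s in reviewers):
        # Quarantine: the proposing shard is deleted and replaced,
        # and the proposal is never executed.
        shards[proposer] = Shard()
        return False
    return True  # unanimity reached

shards = [Shard() for _ in range(100)]
print(asi_act(shards, 0, "turn the universe into paperclips"))  # False
print(asi_act(shards, 1, "ship the crop-disease cure"))         # True
```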
Thanks for responding again!
SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity.
If this argument is true and decisive, then the ASI could decide to stop any improvements in its intelligence, or to intentionally make itself less complex. It makes sense to reduce the area where you are vulnerable in order to make it easier to monitor/control.
(My understanding of) the counter here is that, if we are on the trajectory where AI hobbling itself is what is needed to save us, then we are in the sort of world where someone else builds an unhobbled (and thus not fully aligned) AI that makes the safe version irrelevant. And if the AI tries to engage in a Pivotal Act to prevent competition then it is facing a critical trade-off between power and integrity.
I agree that in such scenarios an aligned ASI should do a pivotal act. I am not sure that (in my eyes) doing a pivotal act would detract much from the ASI's integrity. An aligned ASI would want to ensure good outcomes, and doing a pivotal act is something that would be conducive to this goal.
However, even if it does detract from the ASI's integrity, that's fine. Doing something that looks bad in order to increase the likelihood of good outcomes doesn't seem all that wrong.
We can also think about it from the perspective of this conversation. If the counterargument that you provided is true and decisive, then the ASI has very good (aligned) reasons to do a pivotal act. If the counterargument is false, or, in other words, if there is a strategy that an aligned ASI could use to achieve a high likelihood of good outcomes without a pivotal act, then it wouldn't do one.
Your objection that SNC applies to humans is something I have touched on at various points, but it points to a central concept of SNC, deserves a post of its own, and so I'll try to address it again here. Yes, humanity could destroy the world without AI. The relevant category of how this would happen is if the human ecosystem continues growing at the expense of the natural ecosystem to the point where the latter is crowded out of existence.
I think that ASI can really help us with this issue. If SNC (as an argument) is false or if ASI undergoes one of my proposed modifications, then it would be able to help humans not destroy the natural ecosystem. It could implement novel solutions that would prevent entire species of plants and animals from going extinct.
Furthermore, ASI can use resources from space (asteroid mining for example) in order to quickly implement plans that would be too resource-heavy for human projects on similar timelines.
And this is just one of the ways ASI can help us achieve synergy with the environment faster.
To put it another way, the human ecosystem is following short-term incentives at the expense of long-term ones, and it is an open question which ultimately prevails.
ASI can help us solve this open question as well. Due to its superior prediction/reasoning abilities, it would evaluate our current trajectory, see that it leads to bad long-term outcomes, and replace it with a sustainable trajectory.
Furthermore, ASI can help us solve issues such as the Sun inevitably making Earth too hot to live on. It could develop a very efficient system for scouting for Earth-like planets and then devise a plan for transporting humans to such a planet.
This may be not factually true, btw, - current LLMs can create good models of past people without running past simulation of their previous life explicitly.
Yup, I agree.
It is a variant of Doomsday argument. This idea is even more controversial than simulation argument. There is no future with many people in it.
This makes my case even stronger! Basically, if a Friendly AI has no issues with simulating conscious beings in general, then we have good reasons to expect it to simulate more observers in blissful worlds than in worlds like ours.
If the Doomsday Argument tells us that Friendly AI didn't simulate more observers in blissful worlds than in worlds like ours, then that gives us even more reasons to think that we are not being simulated by a Friendly AI in the way that you have described.
Thanks for the reply!
I am not sure I understand the distinction between linear and exponential in the vaccine context. By linear, do you mean that only a few people die? By exponential, do you mean that a lot of people die?
If so, then I am not so sure that vaccine effects could only be linear. For example, there might be some change in our complex environment that would prompt the vaccine to act differently than it did in the past.
More generally, our vaccine can lead to catastrophic outcomes if there is something about its future behavior that we didn't predict. And if that turns out to be true, then things could get ugly really fast.
And the extent of the damage can be truly large. A "scientifically proven" cancer vaccine that has passed the tests is like the holy grail of medicine. "Curing cancer" is often used by parents as an example of the great things their children could achieve. This is combined with the fact that cancer has been with us for a long time and the fact that current treatments are very expensive and painful.
All of these factors combined tell us that in a relatively short period of time, a large percentage of the total population will get this vaccine. At that point, the amount of damage that can be done depends only on whatever it is we overlooked, which we, by definition, have no control over.
This same excuse would surely be used by companies manufacturing the vaccine. They would argue that they shouldn't be blamed for something that the researchers overlooked. They would say that they merely manufactured the product in order to prevent the needless suffering of countless people.
For all we know, by the time the overlooked thing happens, the original researchers (who developed and tested the vaccine) will be long dead, having lived a life of praise and glory for their ingenious invention (not to mention all the money they received).