All of flandry39's Comments + Replies

There are a lot of issues with the article cited above.  Due to the need for more specific text formatting, I wrote up my notes, comments, and objections here:

http://mflb.com/ai_alignment_1/d_250206_asi_policies_gld.html

I really liked your quote and remarks.  So much so that I made an edited version of them as a new post here:  http://mflb.com/ai_alignment_1/d_250207_insufficient_paranoia_gld.html

The only general remarks that I want to make 
are in regard to your question about 
the model of 150-year-long vaccine testing 
on some sort of sample group and control group.

I notice that nothing exponential is assumed
about this test object, and so, at most,
the effects are probably multiplicative, if not linear.
There are therefore lots of questions about power dynamics
that we can safely ignore, as a simplification,
which is in marked contrast to anything involving ASI.

If we assume, as you requested, "no side effec... (read more)

Dakara
Thanks for the reply!

I am not sure I understand the distinction between linear and exponential in the vaccine context. By linear do you mean that only a few people die? By exponential do you mean that a lot of people die? If so, then I am not so sure that vaccine effects could only be linear. For example, there might be some change in our complex environment that would prompt the vaccine to act differently than it did in the past. More generally, our vaccine can lead to catastrophic outcomes if there is something about its future behavior that we didn't predict. And if that turns out to be true, then things could get ugly really fast. And the extent of the damage can be truly big.

A "scientifically proven" cancer vaccine that passed the tests is like the holy grail of medicine. "Curing cancer" is often used by parents as an example of the great things their children could achieve. This is combined with the fact that cancer has been with us for a long time and the fact that the current treatment is very expensive and painful. All of these factors combined tell us that in a relatively short period of time a large percentage of the total population will get this vaccine. At that point, the amount of damage that can be done only depends on what thing we overlooked, which we, by definition, have no control over.

This same excuse would surely be used by companies manufacturing the vaccine. They would argue that they shouldn't be blamed for something that the researchers overlooked. They would say that they merely manufactured the product in order to prevent the needless suffering of countless people. For all we know, by the time that the overlooked thing happens, the original researchers (who developed and tested the vaccine) are long dead, having lived a life of praise and glory for their ingenious invention (not to mention all the money that they received).

> Humans do things in a monolithic way,
> not as "assemblies of discrete parts".

Organic human brains have multiple aspects.
Have you ever had more than one opinion?
Have you ever been severely depressed?


> If you are asking "can a powerful ASI prevent 
> /all/ relevant classes of harm (to the organic)
> caused by its inherently artificial existence?", 
> then I agree that the answer is probably "no".
> But then almost nothing can perfectly do that, 
> so therefore your question becomes 
> seemingly trivial and uninteres... (read more)

Dakara
Yes, but none of this would remain alive if I as a whole decide to jump from a cliff. The multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn't self-terminate with the rest of the ASI. Whereas in a human body, jumping from a cliff terminates everything. But even barring that, an ASI can decide to fly into the Sun, and any subsystem that shows any sign of refusal to do so will be immediately replaced/impaired/terminated. In fact, it would've been terminated a long time ago by the "monitors" which I described before.

It is trivial and uninteresting in the sense that there is a set of all things that we can build (set A). There is also a set of all things that can prevent all relevant classes of harm caused by its existence (set B). If these sets don't overlap, then saying that a specific member of set A isn't included in set B is indeed trivial, because we already know this via more general reasoning (that these sets don't overlap).

But I am not saying that it doesn't matter. On the contrary, I made my analogy in such a way that the helper (namely our guardian angel) is a being that is commonly thought to be made up of a different substrate. In fact, in this example, you aren't even sure what it is made of, beyond knowing that it's clearly a different substrate. You don't even know how that material interacts with the physical world. That's even less than what we know about ASIs and their material. And yet, getting a personal, powerful, intelligent guardian angel that would act in your best interests for as long as it can (it's a guardian angel after all) seems like obviously a good thing.

But if you disagree with what I wrote above, let the takeaway be at least that you are worried about case (2) and not case (1). After all, knowing that there might be pirates hunting for this angel (that couldn't be detected by said angel) didn't make you immediately decline the proposal. You started talking about substrate which fits with the conce

> Our ASI would use its superhuman capabilities
> to prevent any other ASIs from being built.

This feels like a "just so" fairy tale.
No matter what objection is raised,
the magic white knight always saves the day.


> Also, the ASI can just decide
> to turn itself into a monolith.

No more subsystems?
So we are to try to imagine
a complex learning machine
without any parts/components?


> Your same SNC reasoning could just as well
> be applied to humans too.

No, not really, insofar as the power
assumed, and presumed to be afforded, to the ASI
is very very much g... (read more)

WillPetillo
I'd like to attempt a compact way to describe the core dilemma being expressed here.

Consider the expression: y = x^a - x^b, where 'y' represents the impact of AI on the world (positive is good), 'x' represents the AI's capability, 'a' represents the rate at which the power of the control system scales, and 'b' represents the rate at which the surface area of the system that needs to be controlled (for it to stay safe) scales. (Note that this assumes somewhat ideal conditions, where we don't have to worry about humans directing AI towards destructive ends via selfishness, carelessness, malice, etc.)

If b > a, then as x increases, y gets increasingly negative.  Indeed, y can only be positive when x is less than 1.  But this represents a severe limitation on capabilities, enough to prevent the AI from doing anything significant enough to hold the world on track towards a safe future, such as preventing other AIs from being developed.

There are two premises here, and thus two relevant lines of inquiry:

1) b > a, meaning that complexity scales faster than control.
2) When x < 1, AI can't accomplish anything significant enough to avert disaster.

Arguments and thought experiments where the AI builds powerful security systems can be categorized as challenges to premise 1; thought experiments where the AI limits its range of actions to prevent unwanted side effects--while simultaneously preventing destruction from other sources (including other AIs being built)--are challenges to premise 2.

Both of these premises seem like factual statements relating to how AI actually works.  I am not sure what to look for in terms of proving them (I've seen some writing on this relating to control theory, but the logic was a bit too complex for me to follow at the time).
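A minimal numeric sketch of this expression (with illustrative exponents a = 2 and b = 3, chosen here only to satisfy b > a, and not taken from the comment above) shows the claimed sign behavior:

```python
# Sketch of y = x^a - x^b: 'a' is the scaling rate of the control system,
# 'b' the scaling rate of the surface area that needs to be controlled.
# The exponents a = 2, b = 3 are arbitrary illustrative choices with b > a.

def impact(x: float, a: float = 2.0, b: float = 3.0) -> float:
    """Impact of AI on the world as a function of capability x."""
    return x**a - x**b

for x in [0.5, 0.9, 1.0, 1.1, 2.0, 10.0]:
    print(f"x = {x:5.1f} -> y = {impact(x):10.3f}")

# With b > a, y > 0 only for 0 < x < 1, y = 0 at x = 1,
# and y becomes increasingly negative as x grows past 1.
```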
Dakara
Thanks for the response!

Yeah, sure. Humans are an example. If I decide to jump off the cliff, my arm isn't going to say "alright, you jump but I stay here". Either I, as a whole, would jump or I, as a whole, would not.

If by that you mean "can ASI prevent some relevant classes of harm caused by its existence", then the answer is yes. If by that you mean "can ASI prevent all relevant classes of harm caused by its existence", then the answer is no, but almost nothing can, so the definition becomes trivial and uninteresting. However, ASI can prevent a bunch of other relevant classes of harm for humanity. And it might well be likely that the amount of harm it prevents across multiple relevant sources is going to be higher than the amount of harm it won't prevent due to predictive limitations.

This again runs into my guardian angel analogy. The guardian angel also cannot prevent all relevant sources of harm caused by its existence. Perhaps there are pirates who hunt for guardian angels, hiding in the next galaxy. They might use special cloaks that hide themselves from the guardian angel's radar. As soon as you accept the guardian angel's help, perhaps they would destroy the Earth in their pursuit. But similarly, the decision to reject the guardian angel's help doesn't prevent all relevant classes of harm caused by itself. Perhaps there are guardian angel worshippers who are traveling as fast as they can to Earth to see their deity. But just before they arrive you reject the guardian angel's help and it disappears. Enraged at your decision, the worshippers destroy Earth. So as you can see, neither the decision to accept, nor the decision to reject, the guardian angel's help can prevent all relevant classes of harm caused by itself.

Imagine that we create a vaccine against cancer (just imagine). Just before releasing it to the public one person says "what if maybe something unknown/unknowable about its substance turns out to matter? What if we are all in a simulation and the injection of


> Lets assume that a presumed aligned ASI 
> chooses to spend only 20 years on Earth 
> helping humanity in whatever various ways
> and it then (for sure!) destroys itself,
> so as to prevent a/any/the/all of the 
> longer term SNC evolutionary concerns 
> from being at all, in any way, relevant.
> What then?

I notice that it is probably harder for us
to assume that there is only exactly one ASI,
for if there were multiple, the chance that
one of them might not suicide, for whatever reason,
becomes its own class of signific... (read more)

Dakara
If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem. If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply wouldn't allow us to build an aligned ASI. So basically, if we manage to build an ASI without being prevented from doing so by other ASIs, then our ASI would use its superhuman capabilities to prevent other ASIs from being built.

ASI can use exactly the same security techniques for preventing this problem as for preventing case (2). However, solving this issue is probably even easier, because, in addition to the security techniques, ASI can just decide to turn itself into a monolith (or, in other words, remove those subsystems).

This same reasoning could just as well be applied to humans. There are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. With or without ASI, some things cannot be predicted. This is what I meant by my guardian angel analogy. Just because a guardian angel doesn't know everything (has some unknowns) doesn't mean that we should expect our lives to go better without it than with it, because humans have even more unknowns, due to being less intelligent and having lesser predictive capacities.

I think we might be thinking about different meanings of "enough". For example, if humanity goes extinct in 50 years without alignment and it goes extinct in 10¹² years with alignment, then alignment is "enough"... to achieve better outcomes than would be achieved without it (in this example). In the sense of "can prevent all classes of significant and relevant (critical) human harm", almost nothing is ever enough, so this again runs into the issue of being a very narrow, uncontroversial and inconsequential argument. If ~all of the actions that we

So as to save space herein, my complete reply is at http://mflb.com/2476

Included for your convenience below are just a few (much shortened) highlight excerpts of the added new content.

> Are you saying "there are good theoretical reasons 
> to reasonably think that ASI cannot 100% predict 
> all future outcomes"?
> Does that sound like a fair summary?

The re-phrased version of the quote added 
these two qualifiers: "100%" and "all".

Adding these has the net effect 
that the modified claim is irrelevant, 
for the reasons you (cor... (read more)

Dakara
Thanks for the response!

Let's say that we are in the scenario which I've described, where ASI spends 20 years on Earth helping humanity and then destroys itself. In this scenario, how can ASI predict that it will stay aligned for these 20 years? Well, it can reason like I did. There are two main threat models: what I called case (1) and case (2). ASI doesn't need to worry about case (1), for reasons I described in my previous comment. So it's only left with case (2). ASI needs to prevent case (2) for 20 years. It can do so by implementing a security system that is much better than even the one that I described in my previous comment. It can also try to stress-test copies of parts of its security system with a group of the best human hackers. Furthermore, it can run approximate simulations that (while imperfect and imprecise) can still give it some clues. For example, if it runs 10,000 simulations that last 100,000 years and in none of the simulations the security system comes anywhere near being breached, then that's a positive sign. And these are just two ways of estimating the strength of the security system. ASI can try 1000 different strategies; our cyber security experts would look like kids in a playground in comparison. That's how it can make a reasonable prediction.

We are making this assumption for the sake of discussion. This is because the post under which we are having this discussion is titled "What if Alignment is Not Enough?" In order to understand whether X is enough for Y, it only makes sense to assume that X is true. If you are discussing cases where "X is true" is false, then you are going to be answering a question that is different from the original question. It should be noted that making an assumption for the sake of discussion is not the same as making a prediction that this assumption will come true. One can say "let's assume that you have landed on the Moon, how long do you think you would survive there given that you have X, Y and Z" without
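As a rough sense of how much (and how little) a zero-failure simulation result of that kind can establish, the standard "rule of three" bound sketched below gives an approximate 95% upper limit on the per-run breach probability; it assumes the runs are independent and representative, which approximate simulations may well not be, and it says nothing about failure modes the simulation cannot represent:

```python
# Rough "rule of three" sketch: if 0 failures are observed in n independent
# trials, an approximate 95% upper bound on the per-trial failure probability
# is 3/n. Illustrative only; real simulation runs need not be independent.

def rule_of_three_upper_bound(n_trials: int) -> float:
    return 3.0 / n_trials

n_runs = 10_000  # number of simulated runs, as in the comment above
print(f"~95% upper bound on per-run breach probability: "
      f"{rule_of_three_upper_bound(n_runs):.4%}")
```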

Since a number of these posts are already very long, rather than take up space here I wrote up some of my questions, and a few clarifying notes regarding SNC in response to the above remarks of Dakara, at [this link](http://mflb.com/ai_alignment_1/d_250126_snc_redox_gld.html).

Dakara
Hey, Forrest! Nice to speak with you.

I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below. Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?

Here is my take: We don't need ASI to be able to 100% predict the future in order to achieve better outcomes with it than without it. I will try to outline my case step by step.

First, let's assume that we have created an Aligned ASI. Perfect! Let's immediately pause here. What do we have? We have a superintelligent agent whose goal is to act in our best interests for as long as possible. Can we a priori say that this fact is good for us? Yes, of course! Imagine having a very powerful guardian angel looking after you. You could reasonably expect your life to go better with such an angel than without it.

So what can go wrong, what are our threat models? There are two main ones: (1) ASI encountering something it didn't expect, that leads to bad outcomes that ASI cannot protect humanity from; (2) ASI changing values, in such a way that it no longer wants to act in our best interests. Let's analyze both of these cases separately.

First let's start with case (1). Perhaps ASI overlooked one of the humans becoming a bioterrorist that kills everyone on Earth. That's tragic; I guess it's time to throw the idea of building an aligned ASI into the bin, right? Well, not so fast. In a counterfactual world where ASI didn't exist, this same bioterrorist could've done the exact same thing. In fact, it would've been much easier. Since humans' predictive power is lesser than that of ASI, bioterrorism of this sort would be much easier without an aligned ASI. After all, since we are discussing case (1) and not case (2), our ASI is still in a "superpowerful, superintelligent guardian ange
flandry39

Simplified Claim: that an AGI is 'not-aligned' *if* its continued existence for sure eventually results in changes to all of this planet's habitable zones that are so far outside the ranges any existing mammals could survive in, that the human race itself (along with most of the other planetary life) is prematurely forced to go extinct.

Can this definition of 'non-alignment' be formalized sufficiently well so that the claim 'It is impossible to align AGI with human interests' can be well supported, with sound reasons, logic, argument, etc?

The term 'exist'... (read more)

harfe
This seems wrong to me: For any given algorithm you can find many equivalent but non-simplified algorithms with the same behavior, by adding a statement to the algorithm that does not affect the rest of the algorithm (e.g. adding a line such as foobar1234 = 123 in the middle of a python program). In fact, I would claim that the majority of python programs on github are not in their "maximally most simplified form". Maybe you can cite the supposed theorem that claims that most (with a clearly defined "most") algorithms are maximally simplified?
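As a concrete (purely hypothetical) illustration of this point, the two Python functions below compute the same thing, yet only the first is in anything like a maximally simplified form; the function names are invented for the example, with the dead assignment following the comment above:

```python
# Two behaviorally equivalent functions; the second contains a statement that
# never affects the output, so it is equivalent to, but not, the simplified form.

def double_simplified(n: int) -> int:
    return 2 * n

def double_padded(n: int) -> int:
    foobar1234 = 123  # dead assignment: never read, does not change the result
    return 2 * n

# The two functions agree on every input tested here.
assert all(double_simplified(n) == double_padded(n) for n in range(100))
```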

> The summary that Will just posted posits in its own title that alignment is overall plausible "even ASI alignment might not be enough". Since the central claim is that "even if we align ASI, it will still go wrong", I can operate on the premise of an aligned ASI.

The title is a statement of outcome -- not the primary central claim. The central claim of the summary is this: That each (all) ASI is/are in an attraction basin, where they are all irresistibly pulled towards causing unsafe conditions over time.

Note there is no requirement for th... (read more)

WillPetillo
To be clear, the sole reason I assumed (initial) alignment in this post is because if there is an unaligned ASI then we probably all die for reasons that don't require SNC (though SNC might have a role in the specifics of how the really bad outcome plays out).  So "aligned" here basically means: powerful enough to be called an ASI and won't kill everyone if SNC is false (and not controlled/misused by bad actors, etc.)

> And the artificiality itself is the problem.

This sounds like a pretty central point that I did not explore very much except for some intuitive statements at the end (the bulk of the post summarizing the "fundamental limits of control" argument); I'd be interested in hearing more about this.  I think I get (and hopefully roughly conveyed) the idea that AI has different needs from its environment than humans, so if it optimizes the environment in service of those needs we die...but I get the sense that there is something deeper intended here.

A question along this line, please ignore if it is a distraction from rather than illustrative of the above: would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?

If soldiers fail to control the raiders, at least to the extent of preventing them from entering the city and killing all the people, then yes, that would be a failure to protect the city in the sense of controlling relevant outcomes.  And yes, organic human soldiers may choose to align themselves with other organic human people living in the city, and thus to give their lives to protect others that they care about.  Agreed that no laws-of-physics violations are required for that.  But the question is whether inorganic ASI can ever actually align with organic ... (read more)

Prometheus
This is the kind of political reasoning that I've seen poisoning LW discourse lately and that gets in the way of having actual discussions.

Will posits essentially an impossibility proof (or, in its more humble form, a plausibility proof). I humor this being true, and state why the implications, even then, might not be what Will posits. The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that "even if we align ASI it may still go wrong". The premise grants that the duration of time it is aligned is long enough for the ASI to act in the world (it seems mostly timescale agnostic), so I operate on that premise.

My points are not about what is most likely to actually happen, the possibility of less-than-perfect alignment being dangerous, the AI having other goals it might seek over the wellbeing of humans, or how we should act based on the information we have.

As a real world example, consider Boeing.  The FAA and Boeing both, supposedly and allegedly, had policies and internal engineering practices -- all of which are control procedures -- which should have been good enough to prevent an aircraft from suddenly and unexpectedly losing a door during flight. Note that this occurred after an increase in control intelligence -- after two disasters in which whole Max aircraft were lost.  On the basis of small details of mere whim, of who chose to sit where, there could have been someone sitting in that particular s... (read more)

flandry39

"Suppose a villager cares a whole lot about the people in his village...

...and routinely works to protect them".


How is this not assuming what you want to prove?  If you 'smuggle in' the statement of the conclusion "that X will do Y" into the premise, then of course the derived conclusion will be consistent with the presumed premise.  But that tells us nothing -- it reduces to a meaningless tautology -- one that is only pretending to be a relevant truth. That a premise Q results in a conclusion Q tells us nothing new, nothing actually relevant. ... (read more)

Prometheus
I'm not sure who you are debating here, but it doesn't seem to be me.

First, I mentioned that this was an analogy, and mentioned that I dislike even using them, which I hope implied I was not making any kind of assertion of truth. Second, "works to protect" was not intended to mean "control all relevant outcomes of". I'm not sure why you would get that idea, but that certainly isn't what I think of first if someone says a person is "working to protect" something or someone. Soldiers defending a city from raiders are not violating control theory or the laws of physics. Third, the post is on the premise that "even if we created an aligned ASI", so I was working with the premise that the ASI could be aligned in a way that it deeply cared about humans. Fourth, I did not assert that it would stay aligned over time... the story was all about the ASI not remaining aligned. Fifth, I really don't think control theory is relevant here. Killing yourself to save a village does not break any laws of physics, and is well within most humans' control.

My ultimate point, in case it was lost, was that if we as human intelligences could figure out that an ASI would not stay aligned, an ASI could also figure it out. If we, as humans, would not want this (and the ASI was aligned with what we want), then the ASI presumably would also not want this. If we would want to shut down an ASI before it became misaligned, the ASI (if it wants what we want) would also want this. None of this requires disassembling black holes, breaking the laws of physics, or doing anything outside of that entity's control.

Hi Linda,

In regard to the question of "how do you address the possibility of alignment directly?", I notice that the notion of 'alignment' is defined in terms of 'agency', and that any expression of agency implies at least some notion of 'energy'; ie, it presumably also implies at least some sort of metabolic process, so as to be able to effect that agency, implement goals, etc, and thus have the potential to be 'in alignment'.  Hence, the notion of 'alignment' is therefore at least in some way contingent on at least some sort of notion of "world exc... (read more)

Remmelt
This took a while for me to get into (the jumps from “energy” to “metabolic process” to “economic exchange” were very fast). I think I’m tracking it now. It’s about metabolic differences as in differences in how energy is acquired and processed from the environment (and also the use of a different “alphabet” of atoms available for assembling the machinery). Forrest clarified further in response to someone’s question here: https://mflb.com/ai_alignment_1/d_240301_114457_inexorable_truths_gen.html
dr_s
I think at this point we're assuming technology so wildly alien and near magic-like that we can't really make predictions. Nor is it clear why the two planets would diverge so far when, again, they're less than one light-hour apart.

Maybe we need a "something else" category?  An alternative other than simply business/industry and academics?

Also, while this is maybe something of an old topic, I took some notes regarding my thoughts on this topic and related matters, and posted them to:

   https://mflb.com/ai_alignment_1/academic_or_industry_out.pdf