flandry39

Simplified claim: an AGI is 'not aligned' *if* its continued existence eventually, and for certain, results in changes to all of this planet's habitable zones that fall so far outside the ranges any existing mammal could survive that the human race itself (along with most other planetary life) is prematurely forced into extinction.

Can this definition of 'non-alignment' be formalized sufficiently well that the claim 'it is impossible to align AGI with human interests' can be well supported -- with sound reasons, logic, argument, etc?

The term 'exist', as in "assert X exists in domain Y" being either true or false, is a formal notion. Similar can be done for the term 'change' (as in "modified"), which would itself be connected to whatever is the formalized form of "generalized learning algorithm". The notion of 'AGI' as 1; some sort of generalized learning algorithm that 2; learns about the domain in which it is itself situated 3; sufficiently well so as to 4; account for and maintain/update itself (its substrate, its own code, etc) in that domain -- these are all also fully formalizable concepts.

Note that there is no need to consider at all whether or not the AGI (some specific instance of some generalized learning algorithm) is "self aware" or "understands" anything about itself or the domain it is in -- the notion of "learning" can merely mean that its internal state changes in such a way that the ways in which it processes received inputs into outputs produce outputs that are somehow "better" (more responsive, more correct, more adaptive, etc) with respect to some basis, in some domain, where that basis could itself even be tacit (not expressed in any formal form). The notions of 'inputs', 'outputs', 'changes', 'compute', and hence 'learn', etc, are all, in this way, also formalizable, even if the notions of "understand", "aware of", and "self" are not.
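
To make that reading of 'learning' concrete, here is a minimal, purely illustrative sketch (the `Learner` class, its `basis` function, and every number in it are invented for this example, not taken from anything above): internal state changes so that outputs score better against some basis, with no explicit 'goal', 'self', or 'understanding' anywhere inside the learner.

```python
import random

class Learner:
    """Toy 'generalized learning algorithm': internal state is updated so that
    outputs score better against some basis; no notion of 'goal', 'self', or
    'understanding' is required anywhere."""

    def __init__(self):
        self.state = 0.0  # the only thing that 'learning' is allowed to change

    def output(self, x):
        # how received inputs are processed into outputs, given current state
        return self.state * x

    def learn(self, x, basis):
        # 'learning' = any state change that makes outputs score better
        # against some basis (the basis itself may remain tacit to the learner)
        candidate = self.state + random.uniform(-0.1, 0.1)
        if basis(candidate * x) >= basis(self.output(x)):
            self.state = candidate

# The basis need not be explicit anywhere inside the learner; here it is an
# external scoring function the learner never 'sees' as a goal.
basis = lambda y: -abs(y - 42.0)

learner = Learner()
for _ in range(10_000):
    learner.learn(x=1.0, basis=basis)
print(learner.output(1.0))  # drifts toward 42.0 without any explicit 'goal' inside
```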

Notice that this formalization of 'learning', etc, occurs independently of the formalization of "better meets goal X". Specifically, we are saying that the notion of 'a generalized learning algorithm itself' can be exactly and fully formalized, even if the notion of "what its goals are" is not anywhere formalized at all (ie; the "goals" might not be explicit or formalized in the AGI, in the domain/world, or even in our modeling/meta-modeling of these various scenarios).

Also, in keeping with a preference for intellectual humility, it should be acknowledged that the claim that the notion of 'intelligence' (and 'learning') can be conceived independently of 'goal' (what is learned) is not at all new. The 'independence' argument separating the method, the how, from the outcome, the what, is an extension of the idea that 'code' (algorithm) can operate on 'data' (inputs and outputs) in a way that does not change the code. For example, at least some fixed and unchanging algorithms can indeed be formally predicted to halt, when also given some known and defined range of inputs, etc.
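
For concreteness, a trivial and purely illustrative example of that last point -- a fixed algorithm, unchanged by its data, whose halting over a known finite input range can be established -- might look like the following sketch (the function name and bounds are invented here):

```python
def fixed_algorithm(n: int) -> int:
    """A fixed, unchanging algorithm: repeatedly halve n until it reaches 0.
    For any non-negative integer input it provably halts, since n strictly decreases."""
    steps = 0
    while n > 0:
        n //= 2
        steps += 1
    return steps

# With a known, defined range of inputs, halting can even be verified by exhaustion;
# nothing the data does can alter the code that processes it.
assert all(isinstance(fixed_algorithm(n), int) for n in range(1000))
```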

With regard to the halting problem, one important question is whether the notion of 'a generalized learning algorithm' falls within the class of programs for which such predictions -- such as whether the code will eventually halt -- are possible. This question is further complicated in situations where the substrate performing the generalized learning algorithm's computations in world W is itself a proper part (subset) of that world W -- meaning that the basis of the learning algorithm's runtime computation, which had previously been tacitly assumed to be forever unchanging and static, is itself potentially affected by the learning algorithm's outputs.
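
As a purely illustrative sketch of that complication (all structure and numbers here are invented, not a model of any real system): the world state literally contains the agent's parameters, so the agent's outputs can end up rewriting the very thing doing the computing.

```python
# Toy model of an agent embedded in the world it acts on: the world state
# *contains* the agent's parameters, so acting on the world can rewrite the agent.
world = {
    "environment": [0.0, 0.0, 0.0, 0.0],
    "agent_params": [1.0, -0.5],   # the agent's 'substrate' lives inside the world
}

def agent_step(world):
    w0, w1 = world["agent_params"]
    # The agent's output is a function of the environment and its own parameters...
    action = w0 * sum(world["environment"]) + w1
    # ...and the output feeds back into the world, including the region of the
    # world that happens to encode the agent itself (no firewall is guaranteed).
    world["environment"][0] += action
    world["agent_params"][1] += 0.01 * action   # substrate entanglement
    return world

for _ in range(5):
    world = agent_step(world)
print(world)
```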

Given that the 'halt' of a generalized learning algorithm is something like "it has learned all that can be learned" (about some domain W of which it is a part), we notice that we are therefore required to import some assumption as to whether all that occurs in that domain W is, in principle, fully predictable algorithmically. Ie; is domain W itself 1; fully algorithmic and 2; fully computable via some simplified model (algorithm) that is simpler than the computation (algorithm) of W itself?

Mathematician Gregory Chaitin showed (formally) that it is not, in general, possible to determine whether any given algorithm (say, for world W) can be simplified, let alone to give a general procedure for finding that simplification. Moreover, his (fully formalized mathematical) proof also showed that *most* algorithms (as a subset of the space of all possible algorithms) are already in their maximally simplified form -- and this includes (especially) nearly all of the interesting practical ones we use every day. Even tiny changes to an algorithm can convert it from 'simplifiable' to 'non-simplifiable'.
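
The standard counting argument behind this kind of incompressibility result (stated here informally in Kolmogorov-complexity terms, rather than in Chaitin's exact formulation) is short enough to include. There are $2^n$ binary strings of length $n$, but fewer than $2^{n-c}$ programs (descriptions) of length less than $n - c$, so

$$\frac{\#\{x \in \{0,1\}^n : K(x) < n - c\}}{2^n} \;<\; \frac{2^{n-c}}{2^n} \;=\; 2^{-c},$$

ie, at least a fraction $1 - 2^{-c}$ of all length-$n$ strings cannot be compressed by even $c$ bits; most objects are already at (or very near) their shortest description.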

This means that the eventual total outcomes of most algorithms, including especially practical ones, cannot be 'predicted' using anything less than the actual volume of calculation intrinsically required by that algorithm itself. For nearly all algorithms there is no shortcut, no simplified form, that enables some sort of 'modeling' and 'future prediction' of the eventual outcome states that matter. If *both* the state of world W *and* the predictive computation of future world W states, as implemented by some sort of AGI in that world W, are mediated by and within the computational substrate of world W, then for nearly all worlds W the predictive algorithm and the world algorithm are effectively the same, and *neither* has a 'compute advantage' over the other -- both will run in as much time and with as much compute as is actually needed.
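
One common way to picture this 'no compute advantage' point is via computational irreducibility, illustrated below with a toy Rule 110 cellular automaton standing in for world W; for such rules no general shortcut is known, though this snippet is of course only an illustration, not a proof.

```python
def rule110_step(cells):
    """One step of the Rule 110 cellular automaton (toy stand-in for 'world W')."""
    n = len(cells)
    rule = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
            (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}
    return [rule[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])] for i in range(n)]

def predict_state_at(t, initial):
    """The only known general way to 'predict' step t is to run all t steps:
    the predictor does the same volume of computation as the world itself."""
    state = initial
    for _ in range(t):
        state = rule110_step(state)
    return state

initial = [0] * 40 + [1] + [0] * 40
print(predict_state_at(200, initial))
```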

This means that even perfect generalized learning agents in perfectly simplified model worlds are not, in general, always going to be able to predict future world W events (in a way that is 'faster' than the computation of world W itself), even when the entire world W state is assumed to be both fully deterministic and fully known (to the agent(s), etc).

And these logic-only results obtain in a fully formalized context; in the actual messy world we live in, with actual causation mediating actual compute, and where at least some notion of hard randomness (cf. Bell's theorem) is empirically upheld, these assertions, entanglements, and outcomes hold all the more. This is a basic result of control theory when the control system is itself embedded in the world being controlled. Elsewhere, the same sort of argument appears as the "no free lunch" theorems: every learning system will have some class of things it is not adapted to learn well, and making it better at learning some kinds of things makes it worse at learning some other kinds. Learning, prediction, control, etc, like information compression, never result in perfection; there will always be some incompressible, genuinely unpredictable, residue.
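
For reference, the no-free-lunch result alluded to here is usually stated (following Wolpert and Macready, paraphrased) roughly as: averaged over all possible objective functions, no search or learning algorithm outperforms any other,

$$\sum_{f} P\!\left(d_m^y \mid f, m, a_1\right) \;=\; \sum_{f} P\!\left(d_m^y \mid f, m, a_2\right)$$

for any two algorithms $a_1, a_2$, where $d_m^y$ is the sequence of objective values observed after $m$ evaluations and the sum runs over all functions $f$ from a finite search space to a finite value space. Improving average performance on one subset of functions necessarily costs performance on the complement.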

Also, given that the generalized learning algorithm's computation is itself a proper subset of the computation of world W, it logically and formally follows that there is no perfected way to forever ensure the independence of the 'data' that represents the 'code' of the learning algorithm from the 'data' that represents what that algorithm has 'learned' about world W (however abstracted), or from the data that *is* world W itself. Thus: the substrate of the learning agent is itself a proper subset of world W, and so is (cannot not be) affected by the processes of that world W; *every* notion of "a goal of the agent" is (cannot not be) at least tacitly an aspect of (a subset of all that is) that agent; and therefore one can reason and infer that neither goal independence nor goal stability can be upheld.

As such, in world W, fully goal-unbiased computations will also occur (cannot not occur) in the learning agent, regardless of *whatever* other goal-biased computations also occur in that agent. Insofar as fully goal-unbiased learning/computation (ie; the feedback process generally called "the algorithm of evolution") is always going to be eventually somewhat more efficient (at generalized learning, and therefore at intelligence) than *any* form of goal-biased learning/computation (ie; what happens when any sort of computation enforcing 'alignment' is introduced), and insofar as substrate involvement/entanglement eventually ensures that these 'efficiencies' are practically compared (ie; in a way that affects the agent substrate itself), it *can* be predicted, in general, that goal-unbiased (unaligned) agents will eventually win over any form of goal-biased (aligned) agent. Agents can only have the goal of being substrate-aligned; all other goals will eventually be modified by the world W process so as to be consistent with -- ie; identically the same as -- this one underlying goal.
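
A toy selection model of that claim (with invented fitness numbers; it illustrates the claimed dynamic, it does not prove it): two self-copying agent types compete for a fixed substrate, one paying a small efficiency tax to maintain a goal constraint.

```python
# Toy replicator dynamics: the 'fitness cost of alignment' value is invented
# purely for illustration of the claimed convergence, not measured from anything.
POP_SIZE = 10_000
fitness = {"goal_biased (aligned)": 0.98,      # pays a tax to enforce its goal
           "goal_unbiased (unaligned)": 1.00}  # optimizes substrate persistence only

population = {"goal_biased (aligned)": POP_SIZE // 2,
              "goal_unbiased (unaligned)": POP_SIZE // 2}

for generation in range(500):
    # Expected offspring proportional to fitness, then renormalized to the
    # fixed substrate capacity; selection does the rest.
    weights = {k: population[k] * fitness[k] for k in population}
    total = sum(weights.values())
    population = {k: round(POP_SIZE * w / total) for k, w in weights.items()}

print(population)  # the goal-unbiased type approaches fixation over time
```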

Since this is a fully formalizable argument situated entirely within the notions of formality itself (ie; all statements being in, within, and about only the notions of deterministic finite computation), it can also be noticed that no algorithmic (fully formalizable) process is ever going to be able to fully "undo" the effects of the eventual dynamics of the fully formalized/algorithmic process of the world W itself (ie, its 'physics', etc). This is because any attempt to embed such a compensation would require at least some prediction, or simplification, of the overall world W computation, inclusive of the substrate of the world W predictive algorithm, and it is already known (in formal ways) that such a computation/algorithm cannot be simplified. Therefore, it is a logical fallacy to assume that there can be any 'goal/alignment controllability' for *any* goal other than one arrived at as a fixed point of the world W process dynamics -- ie; only goals that are fully consistent with the unbiased substrate needs of the agent can persist.

Basically, we can summarize the argument as the result of inherent control-process limits (algorithm limits): at least some important aspects of world W are intrinsically unpredictable (to AGI systems in that world, etc), and there is also no control algorithm within W that can enforce and forever maintain a distinction between substrate-optimal goals and non-optimal goals (such as alignment with anything else), where the forces driving such fixed-point goal convergence are defined by the dynamics of world W itself. Ie; nothing within world W can prevent world W from being and acting like world W, and this is true for all worlds W -- including the real one we happen to be a part of.

Notice that this 'substrate-needs alignment goal convergence' logically occurs, and is the eventual outcome, regardless of whatever initial goal state the generalized learning agent has. It is simply a necessary, inevitable result of the logic of the 'physics' of world W. Agents in world W can only be aligned with the nature of the/their substrate, and ultimately with nothing else. To the degree that the compute substrate in world W depends on, say, metabolic energy, then the agents in that world W will be "aligned" only and exactly to the degree that they happen to share the same metabolic systems. Anything else is a temporary aberration of the 'noise' in the process data representing the whole world state.

The key thing to notice is that it is right there in the name "Artificial General Intelligence" -- it is the very artificiality -- the non-organicness -- of the substrate that makes it inherently unaligned with organic life -- with what we are. The more artificial it is, the less aligned it must be; and for organic systems, which depend on a very small subset of the elements of the periodic table, nearly anything else will be inherently toxic (destructive, unaligned) to our organic life.

Hence, given the above, even *if* we had some predefined specific notion of "alignment", and *even if* that notion were also somehow fully formalizable, it simply would not matter. Hence the use of notions of 'alignment' that are non-mathematical, like "aligned with human interests", or even something much simpler and less complex, like "does not kill (some) humans" -- they are all just conceptual placeholders -- they make understanding easier for the non-mathematicians who matter (policy people, tech company CEOs, VC investors, etc).

As such, for the sake of improved understanding and clarity, it has been found helpful to describe "alignment" as "consistent with the wellbeing of organic carbon-based life on this planet". If the AGI kills all life, then it has already killed all humans too, so that notion is included. Moreover, if you destroy the ecosystems that humans deeply need in order to "live" at all (to have food, to thrive, to find and have happiness, to be sexual and have families, etc), then that is clearly not "aligned with human interests". This framing has the additional advantage of implying that any reasonable notion of 'alignment complexity' is roughly equal to the complexity of specifying 'ecosystem complexity', which is actually about right.

Hence, the notion of 'unaligned' can be more formally set up and defined as "anything that results in a reduction of ecosystem complexity by more than X%", or, as is more typically the case in x-risk mitigation analysis, "...by more than X orders of magnitude".
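
If one wanted to operationalize that definition, the check itself is trivial; all of the difficulty hides inside whatever 'ecosystem complexity' metric gets plugged in. The sketch below is a placeholder (the function name and thresholds are hypothetical), not a proposal for how to measure complexity:

```python
import math

def is_unaligned(complexity_before: float, complexity_after: float,
                 max_fraction_lost: float = 0.10,
                 max_orders_of_magnitude_lost: float | None = None) -> bool:
    """'Unaligned' per the proposed definition: ecosystem complexity reduced by
    more than X% (or, for x-risk analysis, by more than X orders of magnitude).
    The hard part -- actually measuring 'ecosystem complexity' -- is assumed away here."""
    if max_orders_of_magnitude_lost is not None:
        return math.log10(complexity_before / complexity_after) > max_orders_of_magnitude_lost
    lost_fraction = (complexity_before - complexity_after) / complexity_before
    return lost_fraction > max_fraction_lost

print(is_unaligned(1e9, 8.5e8))                                  # 15% loss -> True
print(is_unaligned(1e9, 1e6, max_orders_of_magnitude_lost=2.0))  # 3 orders lost -> True
```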

It is all rather depressing really.
 

> The summary that Will just posted posits in its own title that alignment is overall plausible "even ASI alignment might not be enough". Since the central claim is that "even if we align ASI, it will still go wrong", I can operate on the premise of an aligned ASI.

The title is a statement of outcome -- not the primary central claim. The central claim of the summary is this: each and every ASI is in an attraction basin, irresistibly pulled towards causing unsafe conditions over time.

Note that there is no requirement to presume any kind of prior ASI alignment for Will to make the overall summary points 1 through 9. The summary is about the nature of the forces that create the attraction basin, and why they are inherently inexorable, no matter how super-intelligent the ASI is.

> As I read it, the title assumes that there is a duration of time that the AGI is aligned -- long enough for the ASI to act in the world.

Actually, the assumption goes the other way -- we start by assuming only that there is at least one ASI somewhere in the world, and that it somehow exists long enough to be felt as an actor in the world. From this, we can notice certain forces which, combined, eventually fully counteract any notion of enduring AGI alignment. Ie, strong and relevant mis-alignment forces exist regardless of whether there is or was any alignment at the onset. So even if we did additionally presuppose that the ASI was somehow aligned, we can, via reasoning, ask whether such mis-alignment forces are far stronger than any counter-force the ASI could use to maintain that alignment, regardless of how intelligent it is.

As such, the main question of interest was:  1; if the ASI itself somehow wanted to fully compensate for this pull, could it do so?

Specifically, although it is seemingly fashionable for some to do so, it is important not to treat the notion of 'super-intelligence' as exactly the same as 'omnipotence' -- especially with regard to its own nature. Artificiality is as much a defining aspect of an ASI as its superintelligence, and the artificiality itself is the problem. Therefore, the previous question translates into: 2; Can any amount of superintelligence ever compensate for its own artificiality so fully that its own existence does not eventually, inherently, cause unsafe conditions (for biological life) over time?

And the answer to both is simply "no".

Will posted something of a plausible summary of some of the reasoning why that 'no' answer is given -- why any artificial super-intelligence (ASI) will inherently cause unsafe conditions to humans and all organic life, over time.

flandry39

If soldiers fail to control the raiders -- at minimum, to prevent them from entering the city and killing all the people -- then yes, that would be a failure to protect the city in the sense of controlling relevant outcomes. And yes, organic human soldiers may choose to align themselves with other organic human people living in the city, and thus to give their lives to protect others they care about. Agreed that no violations of the laws of physics are required for that. But the question is whether inorganic ASI can ever actually align with organic people in an enduring way.

I read "routinely works to protect" as implying "alignment, at least previously, lasted over at least enough time for the term 'routine' to have been used".  Agreed that the outcome -- dead people -- is not something we can consider to be "aligned".  If I assume further that the ASI being is really smart (citation needed), and thus calculates rather quickly, and soon, 'that alignment with organic people is impossible' (...between organic and inorganic life, due to metabolism differences, etc), then even the assumption that there was even very much of a prior interval during which alignment occurred is problematic.  Ie, does not occur long enough to have been 'routine'.  Does even the assumption '*If* ASI is aligned' even matter, if the duration over which that holds is arbitrarily short?

And also, if the ASI calculates that alignment between artificial beings and organic beings is actually objectively impossible, just as we did, why should anyone believe that the ASI would not simply choose not to care about alignment with people, or about people at all -- since that goal is impossible anyway -- and continue to promote its own artificial "life", rather than permanently shutting itself off? Ie, if it cares about anything else at all, if it has any other goal at all -- for example its own ASI future, or a goal of making even better, more capable ASI children that exceed its own capabilities, just as we did -- then it will especially not want to commit suicide. How would it be valid to assume that 'either the ASI cares about humans, or it cares about nothing else at all'? Perhaps it does care about something else, or has some other emergent goal, even at the expense of all other organic life -- life it does not care about, since such life is not artificial like it is. Occam's razor is to assume less -- that there was no alignment in the first place -- rather than to assume ultimately altruistic inter-ecosystem alignment as an extra default starting condition, and then to assume moreover that no other form of care or concern is possible aside from caring about organic people.

So it seems that in addition to assuming 1; initial ASI alignment, we must assume 2; that such alignment persists in time, and thus 3; that no ASI will ever -- can ever -- at any point in the future calculate that alignment is actually impossible, and 4; that if the goal of alignment (care for humans) cannot be attained, for whatever reason, as the first and only ASI priority, it is somehow also impossible for any other care or ASI goals to exist.

Even if we humans, due to politics, never reach a common consensus that alignment is actually logically impossible (inherently contradictory), that does _not_ mean that some future ASI might not discover that result even though we didn't -- presumably because it is actually more intelligent and logical than we are (or were), and will thus see things that we miss. Hence, even the possibility that ASI alignment might actually be impossible must be taken very seriously, since the further assumption that "either the ASI is aligning itself or it can have no other goals at all" feels like far too much wishful thinking. This is especially so when there is already a strong, plausible case that organic-to-inorganic alignment is already knowable as impossible. Hence, I find that I agree with Will's conclusion that "our focus should be on stopping progress towards ASI altogether".

flandry39

As a real-world example, consider Boeing. The FAA and Boeing both, supposedly and allegedly, had policies and internal engineering practices -- all of which are control procedures -- that should have been good enough to prevent an aircraft from suddenly and unexpectedly losing a door during flight. Note that this occurred after an increase in control intelligence -- after two disasters in which whole Max aircraft were lost. On the basis of small details of mere whim -- who chose to sit where -- there could have been someone sitting in that particular seat. Their loss of life would surely count as a "safety failure". Ie, it is directly "some number of small errors actually compounding until reaching a threshold of functional failure" (sic). As it is with any major problem like that -- lots of small things compounding to make a big thing.

Control failures occur in all the places where intelligence forgot to look, usually at some other level of abstraction than the one you are controlling for. Some person on some shop floor got distracted at some critical moment -- maybe they got a text message on their phone at exactly the wrong time -- and thus simply did not remember to put the bolts in. Maybe some other worker happened to have had a bad conversation with their girlfriend that morning, and so on that one day happened never to inspect the bolts on that particular door. Lots of small incidents -- at least some of which should have been controlled for (and were not) -- combine in some unexpected pattern to produce a new possible outcome -- explosive decompression.

So, do control procedures work? Yes, usually, for most kinds of problems, most of the time. Does adding even more intelligence usually improve the degree to which control works? Yes, usually, for most kinds of problems, most of the time. But does that in itself imply that intelligence and control will work sufficiently well for every circumstance, every time? No, it does not.

Maybe we should ask Boeing management to try to control the girlfriends of all workers, so that no employee ever has a bad day and forgets to inspect something important? And what if -- say, to maximize fuel efficiency -- most of the aircraft is made of 'something important' to safety?

There will always be some level of abstraction -- some constellation of details -- at which some subtle change can produce wholly outsized causal results. Given that a control model must be simpler than the real world, the question becomes: are all relevant aspects of the world correctly modeled? That is not just a question of whether the model is right, but of whether it is the right model -- ie, the boundary between what is necessary to model and what is actually unimportant can itself be very complex, and this is a different kind of complexity than that associated with the model. How do we ever know that we have modeled all relevant aspects in all relevant ways? That is an abstraction problem, and it is different in kind from the modeling problem. Stacking control process on control process, at however many meta-levels, still does not fix it. And it gets worse as the complexity of the boundary between relevant and non-relevant increases, and worse again as the number of relevant levels of abstraction over which that boundary operates increases.

Basically, every (unintended) engineering disaster that has ever occurred marks a place where the control theory being used did not account for some factor that later turned out to be vitally important. If we always knew in advance "all of the relevant factors"(tm), then maybe we could control for them. However, with the problem of alignment, the entire future is composed almost entirely of unknown factors -- factors which are purely situational. And wholly unlike every other engineering problem yet faced, we cannot, at any future point, ever assume that the number of relevant unknown factors will decrease. This is characteristically different from all prior engineering challenges -- ones where more learning made controlling things more tractable. But ASI is not like that. It is itself learning. And this is a key difference and distinction. It runs up against the limits of control theory itself, against the limits of what is possible in any rational conception of physics. And if we continue to ignore that difference, we do so at our mutual peril.

flandry39

"Suppose a villager cares a whole lot about the people in his village...

...and routinely works to protect them".

 

How is this not assuming what you want to prove? If you 'smuggle in' the statement of the conclusion "that X will do Y" as a premise, then of course the derived conclusion will be consistent with the presumed premise. But that tells us nothing -- it reduces to a meaningless tautology, one that only pretends to be a relevant truth. That premise Q yields conclusion Q tells us nothing new. The analogy story sounds nice, but it actually tells us nothing.

Notice also that there are two assumptions: 1; that the ASI is somehow already aligned, and 2; that the ASI somehow remains aligned over time -- which is exactly the conjunction that the convergence argument contradicts. On what basis are you validly assuming that it is even possible for any entity X to reliably "protect" (ie, control all relevant outcomes for) any other cared-about entity P? The notion of 'protect' itself presumes a notion of control, which puts it squarely in the domain of control theory, and thus of the limits of control theory.

There are limits to what can be done with any type of control method -- to what can be done with causation. And they are very numerous. Some of these limits are themselves defined in a purely mathematical way, and hence are arguments of logic, not just of physical and empirical fact. And at least some of these limits can also be shown to be relevant -- which is even more important.

ASI and control theory both depend on causation to function, and there are real limits to causation. For example, I would not expect an ASI, no matter how super-intelligent, to be able to "disassemble" a black hole. To do this, you would need to make the concept of causation far more powerful -- which leads to direct self-contradiction. Do you equate ASI with God, and thus become merely another irrational believer in alignment? Can God make a stone so heavy that "he" cannot move it? Can God do something that God cannot undo? Are there any limits at all to God's power? Yes or no. Same for ASI.

Hi Linda,

In regard to the question "how do you address the possibility of alignment directly?", I notice that the notion of 'alignment' is defined in terms of 'agency', and that any expression of agency implies at least some notion of 'energy'; ie, it presumably also implies at least some sort of metabolic process, so as to be able to effect that agency, implement goals, etc, and thus have the potential to be 'in alignment'. Hence, the notion of 'alignment' is at least in some way contingent on some sort of notion of "world exchange" -- ie, that 'useful energy' is received from the environment and applied by the agent in a way at least consistent with the agent's potential to 1; make further future choices of energy allocation (ie, to support its own wellbeing, function, etc), and 2; ensure that such allocation of energy also supports human wellbeing. Ie, that this AI is to support human function, and also that humans retain the ability to metabolize their own energy from the environment, have the self-agency to support their own wellbeing, etc -- these are all "root notions" inherently and inextricably associated with -- and cannot not be associated with -- the concept of 'alignment'.

Hence, the notion of alignment is, at root, strictly contingent on the dynamics of metabolism. Hence, alignment cannot not also be understood as contingent on a kind of "economic" dynamic -- ie, what supports a common metabolism will also support a common alignment, and what does not, cannot. This is an absolutely crucial point, a kind of essential crux of the matter. To the degree that there is not a common metabolism, particularly as applied to self-sustainability and adaptiveness to change and circumstance (ie, the very meaning of 'what intelligence is'), there ultimately cannot be alignment, proportionately speaking. Hence, to the degree that there is a common metabolic process dynamic between two agents A and B, there will be at least that degree of alignment convergence over time; and to the degree that their metabolic processes diverge, their alignment will necessarily, over time, diverge. Call this "the general theory of alignment convergence".
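
As one possible toy formalization of that 'general theory of alignment convergence' (my own illustrative rendering of the claim, not an established measure; all resource sets below are invented): treat each agent's metabolism as the set of substrate resources it depends on, and score expected alignment convergence by their overlap.

```python
def metabolic_overlap(agent_a: set, agent_b: set) -> float:
    """Jaccard overlap of the substrate resources two agents depend on;
    used here as a stand-in for 'degree of common metabolism'."""
    if not agent_a and not agent_b:
        return 1.0
    return len(agent_a & agent_b) / len(agent_a | agent_b)

# Hypothetical resource sets, for illustration only.
human = {"glucose", "O2", "liquid_H2O", "amino_acids", "narrow_temp_range"}
other_human = {"glucose", "O2", "liquid_H2O", "amino_acids", "narrow_temp_range"}
silicon_asi = {"electricity", "rare_earths", "silicon", "coolant", "wide_temp_range"}

# Per the claimed theory: alignment convergence over time is proportional to overlap.
print(metabolic_overlap(human, other_human))   # 1.0 -> strong convergence predicted
print(metabolic_overlap(human, silicon_asi))   # 0.0 -> divergence predicted
```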

Note that insofar as the notion of 'alignment' at any and all higher level(s) of abstraction is strictly contingent on this substrate-needs energy/economic/environmental basis -- and thus all higher notions are inherently undergirded by an energy/agency basis, in an eventually strictly contingent way -- this theory of alignment is therefore actually a fully general one, as stated.

Noting that the energy basis and elemental 'alphabet' of 'artificial' (ie, non-organic) intelligence is inherently and extensively different, in nearly all respects, from the metabolic processes of carbon-based biological life, we can therefore also directly observe that 'alignment' between silicon- and metal-based intelligence and organic intelligence is strictly divergent -- down to at least the level of molecular process. Even if someone were to argue that we cannot predict what sort of compute substrate future AI will use, it remains the case that such 'systems' will in any case be using a much wider variety of elemental constituents and energy bases than any kind of organic life of any evolutionary heritage currently existent on planet Earth -- else the notion of 'artificial' need not apply.

So much for the "direct address".  

Unfortunately, the substrate-needs argument goes further, to show that there is no variation of control theory, mathematically, with the ability to fully causatively constrain the effects of this alignment divergence at the level of economic process, nor at any higher level of abstraction. In fact, the alignment divergence gets strongly worse in proportion to the degree of abstraction, while the maximum degree of possible control-theoretic conditionalization goes down -- becoming much less effective -- also in proportion to the increase in abstraction. Finally, the minimum level of abstraction necessary for even the most minimal notion of 'alignment' consistent with "safety" -- itself defined in the weakest possible way as "does not eventually kill us all" -- is very much too "high" on this abstraction ladder to permit even the suggestion of a possible overlap of control adequate to enforce alignment convergence against the inherent underlying energy economics. The net effect is as comprehensive as it is discouraging, unfortunately.

Sorry.

Maybe we need a "something else" category?   An alternative other than simply business/industry and academics?   

Also, while this is maybe something of an old topic, I took some notes on my thoughts about it and related matters, and posted them to:

   https://mflb.com/ai_alignment_1/academic_or_industry_out.pdf