This is a special post for quick takes by quila. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
[-]quila137

On Pivotal Acts

(edit: status: not a crux, instead downstream of different beliefs about what the first safe ASI will look like in predicted futures where it exists)

I was rereading some of the old literature on alignment research sharing policies after Tamsin Leake's recent post and came across some discussion of pivotal acts as well.

Hiring people for your pivotal act project is going to be tricky. [...] People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.

This is in a context where the 'pivotal act' example is using a safe ASI to shut down all AI labs.[1]

My thought is that I don't see why a pivotal act needs to be that. I don't see why shutting down AI labs or using nanotech to disassemble GPUs on Earth would be necessary. These may be among the 'most direct' or 'simplest to imagine' possible actions, but in the case of superintelligence, simplicity is not a constraint.

We can instead select for the 'kindest' or 'least adversarial' actions, or more precisely: the functional-decision-theoretically optimal actions that save the future while minimizing the adversariality this creates in the past (present).

This can be broadly framed as 'using ASI for good', which is what everyone wants, even those being uncareful about its development.

Capabilities orgs would be able to keep working on fun capabilities projects in those days during which the world is saved, because a group following this policy would choose to use ASI to make the world robust to the failure modes of capabilities projects rather than shutting them down. Because superintelligence is capable of that, and so much more.

  1. ^

    side note: It's orthogonal to the point of this post, but this example also makes me think: if I were working on a safe ASI project, I wouldn't mind if another group who had discreetly built safe ASI used it to shut my project down, since my goal is 'ensure the future lightcone is used in a valuable, tragedy-averse way' and not 'gain personal power' or 'have a fun time working on AI' or something. In my morality, it would be naive to be opposed to that shutdown. But to the extent humanity is naive, we can easily do something else in that future to create better present dynamics (as the main text argues).

    If there is a group for whom it is problematic that ASI be used to make the world robust to risks and free of harm, in a way where its actions don't infringe on ongoing non-violent activities, then this post doesn't apply to them: their issue all along was not with the character of the pivotal act, but possibly with something like 'having my personal cosmic significance as a capabilities researcher stripped away by the success of an external alignment project'.

    Another disclaimer: This post is about a world in which safely usable superintelligence has been created, but I'm not confident that anyone (myself included) currently has a safe and ready method to create it with. This post shouldn't be read as an endorsement of possible current attempts to do this. I would of course prefer if this civilization were one which could coordinate such that no groups were presently working on ASI, precluding this discourse.

These may be among the ‘most direct’ or ‘simplest to imagine’ possible actions, but in the case of superintelligence, simplicity is not a constraint.

I think it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions while alignment is not yet fully solved. In other words, if alignment were fully solved, then you could use it to do complicated things like what you suggest, but there could be an intermediate stage of alignment progress where you could safely use an SI to do something simple like "melt GPUs" but not to achieve more complex goals.

it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions, while alignment is not yet fully solved

Agreed that some think this, and agreed that formally specifying a simple action policy is easier than a more complex one.[1] 

I have a different model of what the earliest safe ASI will look like, in most futures where one exists. Rather than a 'task-aligned' agent, I expect it to be a non-agentic system which can be used to, e.g., come up with pivotal actions for the human group to take / information to act on.[2]

  1. ^

    although formal 'task-aligned agency' seems potentially more complex than the attempt at a 'full' outer alignment solution that I'm aware of (QACI): specifying what a {GPU, AI lab, shutdown of an AI lab} is seems more complex than that.

  2. ^

    I think these systems are more attainable; see this post to possibly infer more info (it's proven very difficult for me to write in a way that I expect will be moving to people who have a model focused on 'formal inner + formal outer alignment', but I think evhub has done so well).

Reflecting on this more, I wrote in a discord server (then edited to post here):

I wasn't aware the concept of pivotal acts was entangled with the frame of formal inner+outer alignment as the only (or only feasible?) way to cause safe ASI.

I suspect that by default, I and someone operating in that frame might mutually believe each other's agendas to be probably-doomed. This could make discussion more valuable (as in that case, at least one of us should make a large update).

For anyone interested in trying that discussion, I'd be curious what you think of the post linked above. As a comment on it says:

I found myself coming back to this now, years later, and feeling like it is massively underrated. Idk, it seems like the concept of training stories is great and much better than e.g. "we have to solve inner alignment and also outer alignment" or "we just have to make sure it isn't scheming."

In my view, solving formal inner alignment, i.e. devising a general method to create ASI with any specified output-selection policy, is hard enough that I don't expect it to be done.[1] This is why I've been focusing on other approaches which I believe are more likely to succeed.

 

  1. ^

    Though I encourage anyone who understands the problem and thinks they can solve it to try to prove me wrong! I can sure see some directions and I think a very creative human could solve it in principle. But I also think a very creative human might find a different class of solution that can be achieved sooner. (Like I've been trying to do :)

Imagining a pivotal act of generating very convincing arguments for, like, voting and parliamentary systems that would turn government into 1) a working democracy 2) that's capable of solving the problem. Citizens and congress read the arguments, get fired up, and the problem is solved through proper channels.

See minimality principle:

the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it

My thought is that I don’t see why a pivotal act needs to be that.

Okay. Why do you think Eliezer proposed that, then?

(see reply to Wei Dai)

[-]quila111

i currently believe that working on superintelligence-alignment is likely the correct choice from a fully-negative-utilitarian perspective.[1]

for others, this may be an intuitive statement or unquestioned premise. for me it is not, and i'd like to state my reasons for believing it, partially as a response to this post concerned about negative utilitarians trying to accelerate progress towards an unaligned-ai-takeover.

there was a period during which i was more uncertain about this question, and avoided openly sharing minimally-dual-use alignment research (but did not try to accelerate progress towards a nonaligned-takeover) while resolving that uncertainty.

a few relevant updates since then:

  1. decrease on the probability that the values an aligned AI would have would endorse human-caused moral catastrophes such as human-caused animal suffering.

    i did not automatically believe humans to be good-by-default, and wanted to take time to seriously consider what i think should be a default hypothesis-for-consideration upon existing in a society that generally accepts an ongoing mass torture event.
  2. awareness of vastly worse possible s-risks.

    factory farming is a form of physical torture, by which i mean torture of a mind which is done through the indirect route of affecting its input channels (body/senses). it is also a form of psychological torture. it is very bad, but situations which are magnitudes worse seem possible, where a mind is modulated directly (on the neuronal level) and fully.

    compared to 'in-distribution suffering' (eg animal suffering, human-social conflicts), i find it further less probable that an AI aligned to some human-specified values[2] would create a future with this.

    i think it's plausible that it exists rarely in other parts of the world, though, and if so would be important to prevent through acausal trade if we can.

i am not free of uncertainty about the topic, though.

in particular, if disvalue of suffering is common across the world, such that the suffering which can be reduced through acausal trade will be reduced through acausal trade regardless of whether we create an AI which disvalues suffering, then it would no longer be the case that working on alignment is the best decision for a purely negative utilitarian.

despite this uncertainty, my current belief is that the possibility of reducing suffering via acausal trade (including possibly such really-extreme forms of suffering) outweighs the probability and magnitude of human-aligned-AI-caused suffering.[3]

also, to be clear, if it ever seems that an actualized s-risk takeover event is significantly more probable than it seems now[4] as a result of unknown future developments, i would fully endorse causing a sooner unaligned-but-not-suffering takeover to prevent it.

  1. ^

    i find it easier to write this post as explaining my position as "even for a pure negative utilitarian, i think it's the correct choice", because it lets us ignore individual differences in how much moral weight is assigned to suffering relative to everything else.

    i think it's pretty improbable that i would, on 'idealized reflection'/CEV, endorse total-negative-utilitarianism (which has been classically pointed out as implying, e.g., preferring a universe with nothing to a universe containing a robust utopia plus an instance of light suffering).

    i self-describe as a "suffering-focused altruist" or "negative-leaning-utilitarian." ie, suffering seems much worse to me than happiness seems good.

  2. ^

    (though certainly there are some individual current humans who would do this, for example to digital minds, if given the ability to do so. rather, i'm expressing a belief that it's very probable that an aligned AI which practically results from this situation would not allow that to happen.)

  3. ^

    (by 'human-aligned AI', I mean one pointed to a CEV of one or a few humans, probably AI/alignment researchers.

    I don't mean an AI aligned to some sort of 'current institutional process', like voting, involving all living humans -- I think that should be avoided due to politicization risk and potential for present/unreflective-values lock-in.)

  4. ^

    there's some way to formalize with bayes equations how likely, from a negative-utilitarian perspective, an s-risk needs to be (relative to a good outcome) to terminate a timeline.

    it would intake probability distributions related to 'the frequency of suffering-disvalue across existing ASIs' and 'the frequency of various forms of s-risks that are preventable with acausal trade'. i might create this formalization later.

    if we think there's pretty certainly more preventable-through-trade-type suffering-events than there are altruistic ASIs to prevent them, a local preventable-type s-risk might actually need to be 'more likely than the good/suffering-disvaluing outcome'.
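The formalization gestured at in this footnote could start as a simple expected-disvalue comparison. The sketch below is a minimal illustration, not the author's intended model; all variable names and the decision rule are assumptions for illustration:

```javascript
// Toy expected-disvalue comparison for the negative-utilitarian tradeoff
// described above. All names and numbers are illustrative assumptions.
//
// pSRisk:          probability this timeline ends in a local s-risk outcome
// pGood:           probability it ends in a suffering-disvaluing (aligned) outcome
// localSuffering:  magnitude of suffering in the local s-risk outcome
// tradePreventable: suffering elsewhere that the good outcome could
//                   prevent via acausal trade
function shouldTerminateTimeline({ pSRisk, pGood, localSuffering, tradePreventable }) {
  const expectedDisvalueIfContinued =
    pSRisk * localSuffering - pGood * tradePreventable;
  // Prefer a sooner unaligned-but-not-suffering takeover iff continuing
  // the timeline has positive expected disvalue.
  return expectedDisvalueIfContinued > 0;
}
```

Under these toy numbers, an s-risk at 1% against a 20% good outcome does not justify termination, while the same s-risk at 5% does; the real version would intake full probability distributions, as the footnote says.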

You may find Superintelligence as a Cause or Cure for Risks of Astronomical Suffering of interest; among other things, it discusses s-risks that might come about from having unaligned AGI.

Superintelligence is related to three categories of suffering risk: suffering subroutines (Tomasik 2017), mind crime (Bostrom 2014) and flawed realization (Bostrom 2013).

5.1 Suffering subroutines

Humans have evolved to be capable of suffering, and while the question of which other animals are conscious or capable of suffering is controversial, pain analogues are present in a wide variety of animals. The U.S. National Research Council’s Committee on Recognition and Alleviation of Pain in Laboratory Animals (2004) argues that, based on the state of existing evidence, at least all vertebrates should be considered capable of experiencing pain.

Pain seems to have evolved because it has a functional purpose in guiding behavior: evolution having found it suggests that pain might be the simplest solution for achieving its purpose. A superintelligence which was building subagents, such as worker robots or disembodied cognitive agents, might then also construct them in such a way that they were capable of feeling pain—and thus possibly suffering (Metzinger 2015)—if that was the most efficient way of making them behave in a way that achieved the superintelligence’s goals.

Humans have also evolved to experience empathy towards each other, but the evolutionary reasons which cause humans to have empathy (Singer 1981) may not be relevant for a superintelligent singleton which had no game-theoretical reason to empathize with others. In such a case, a superintelligence which had no disincentive to create suffering but did have an incentive to create whatever furthered its goals, could create vast populations of agents which sometimes suffered while carrying out the superintelligence’s goals. Because of the ruling superintelligence’s indifference towards suffering, the amount of suffering experienced by this population could be vastly higher than it would be in e.g. an advanced human civilization, where humans had an interest in helping out their fellow humans.

Depending on the functional purpose of positive mental states such as happiness, the subagents might or might not be built to experience them. For example, Fredrickson (1998) suggests that positive and negative emotions have differing functions. Negative emotions bias an individual’s thoughts and actions towards some relatively specific response that has been evolutionarily adaptive: fear causes an urge to escape, anger causes an urge to attack, disgust an urge to be rid of the disgusting thing, and so on. In contrast, positive emotions bias thought-action tendencies in a much less specific direction. For example, joy creates an urge to play and be playful, but “play” includes a very wide range of behaviors, including physical, social, intellectual, and artistic play. All of these behaviors have the effect of developing the individual’s skills in whatever the domain. The overall effect of experiencing positive emotions is to build an individual’s resources—be those resources physical, intellectual, or social.

To the extent that this hypothesis were true, a superintelligence might design its subagents in such a way that they had pre-determined response patterns for undesirable situations, so exhibited negative emotions. However, if it was constructing a kind of a command economy in which it desired to remain in control, it might not put a high value on any subagent accumulating individual resources. Intellectual resources would be valued to the extent that they contributed to the subagent doing its job, but physical and social resources could be irrelevant, if the subagents were provided with whatever resources necessary for doing their tasks. In such a case, the end result could be a world whose inhabitants experienced very little if any in the way of positive emotions, but did experience negative emotions. [...]

5.2 Mind crime

A superintelligence might run simulations of sentient beings for a variety of purposes. Bostrom (2014, p. 152) discusses the specific possibility of an AI creating simulations of human beings which were detailed enough to be conscious. These simulations could then be placed in a variety of situations in order to study things such as human psychology and sociology, and be destroyed afterwards.

The AI could also run simulations that modeled the evolutionary history of life on Earth in order to obtain various kinds of scientific information, or to help estimate the likely location of the “Great Filter” (Hanson 1998) and whether it should expect to encounter other intelligent civilizations. This could repeat the wild-animal suffering (Tomasik 2015, Dorado 2015) experienced in Earth’s evolutionary history. The AI could also create and mistreat, or threaten to mistreat, various minds as a way to blackmail other agents. [...]

5.3 Flawed realization

A superintelligence with human-aligned values might aim to convert the resources in its reach into clusters of utopia, and seek to colonize the universe in order to maximize the value of the world (Bostrom 2003a), filling the universe with new minds and valuable experiences and resources. At the same time, if the superintelligence had the wrong goals, this could result in a universe filled by vast amounts of disvalue.

While some mistakes in value loading may result in a superintelligence whose goal is completely unlike what people value, certain mistakes could result in flawed realization (Bostrom 2013). In this outcome, the superintelligence’s goal gets human values mostly right, in the sense of sharing many similarities with what we value, but also contains a flaw that drastically changes the intended outcome.

For example, value-extrapolation (Yudkowsky 2004) and value-learning (Soares 2016, Sotala 2016) approaches attempt to learn human values in order to create a world that is in accordance with those values.

There have been occasions in history when circumstances that cause suffering have been defended by appealing to values which seem pointless to modern sensibilities, but which were nonetheless a part of the prevailing values at the time. In Victorian London, the use of anesthesia in childbirth was opposed on the grounds that being under the partial influence of anesthetics may cause “improper” and “lascivious” sexual dreams (Farr 1980), with this being considered more important to avoid than the pain of childbirth.

A flawed value-loading process might give disproportionate weight to historical, existing, or incorrectly extrapolated future values whose realization then becomes more important than the avoidance of suffering. Besides merely considering the avoidance of suffering less important than the enabling of other values, a flawed process might also tap into various human tendencies for endorsing or celebrating cruelty (see the discussion in section 4), or outright glorifying suffering. Small changes to a recipe for utopia may lead to a future with much more suffering than one shaped by a superintelligence whose goals were completely different from ours.

thanks for sharing. here's my thoughts on the possibilities in the quote.

Suffering subroutines - maybe 10-20% likely. i don't think suffering reduces to "pre-determined response patterns for undesirable situations," because i can think of simple algorithmic examples of that which don't seem like suffering.

suffering feels like it's about the sense of aversion/badness (often in response to a situation), and not about the policy "in <situation>, steer towards <new situation>". (maybe humans were instilled with a policy of steering away from 'suffering' states generally, and that's why evolution made us enter those states in some types of situation?). (though i'm confused about what suffering really is)

i would also give the example of positive-feeling emotions sometimes being narrowly directed. for example, someone can feel 'excitement/joy' about a gift or event and want to <go to/participate in> it. sexual and romantic subroutines can also be both narrowly-directed and positive-feeling. though these examples lack the element of a situation being steered away from, vs steering (from e.g any neutral situation) towards other ones.

Suffering simulations - seems likely (75%?) for the estimation of universal attributes, such as the distribution of values. my main uncertainty is about whether there's some other way for the ASIs to compute that information which is simple enough to be suffering-free. this also seems lower magnitude than other classes, because (unless it's being calculated indefinitely for ever-greater precision) this computation terminates at some point, rather than lasting until heat death (or forever, if it turns out that's avoidable).

Blackmail - i don't feel knowledgeable enough about decision theory to put a probability on this one, but in the case where it works (or is precommitted to under uncertainty in hopes that it works), it's unfortunately a case where building aligned ASI would incentivize unaligned entities to do it.

Flawed realization - again i'm too uncertain about what real-world paths lead to this, but intuitively, it's worryingly possible if the future contains LLM-based LTPAs (long term planning agents) intelligent enough to solve alignment and implement their own (possibly simulated) 'values'.

Suffering subroutines - maybe 10-20% likely. i don't think suffering reduces to "pre-determined response patterns for undesirable situations," because i can think of simple algorithmic examples of that which don't seem like suffering.

Yeah, I agree with this to be clear. Our intended claim wasn't that just "pre-determined response patterns for undesirable situations" would be enough for suffering. Actually, there were meant to be two separate claims, which I guess we should have distinguished more clearly:

1) If evolution stumbled on pain and suffering, those might be relatively easy and natural ways to get a mind to do something. So an AGI that built other AGIs might also build them to experience pain and suffering (that it was entirely indifferent to), if that happened to be an effective motivational system.

2) If this did happen, then there's also some speculation suggesting that an AI that wanted to stay in charge might not want to give its worker AGIs much in the way of things that looked like positive emotions, but did have a reason to give them things that looked like negative emotions. Which would then tilt the balance of pleasure vs. pain in the post-AGI world much more heavily in favor of (emotional) pain.

Now the second claim is much more speculative and I don't even know if I'd consider it a particularly likely scenario (probably not); we just put it in since much of the paper was just generally listing various possibilities of what might happen. But the first claim - that since all the biological minds we know of seem to run on something like pain and pleasure, we should put a substantial probability on AGI architectures also ending up with something like that - seems much stronger to me.

single-use

Considering how long it took me to get that by this you mean "not dual-use", I expect some others just won't get it.

[-]quila100

A quote from an old Nate Soares post that I really liked:

It is there, while staring the dark world in the face, that I find a deep well of intrinsic drive. It is there that my resolve and determination come to me, rather than me having to go hunting for them.

I find it amusing that "we need lies because we can't bear the truth" is such a common refrain, given how much of my drive stems from my response to attempting to bear the truth.

I find that it's common for people to tell themselves that they need the lies in order to bear reality. In fact, I bet that many of you can think of one thing off the top of your heads that you're intentionally tolerifying, because the truth is too scary to even consider. (I've seen at least a dozen failed relationships dragged out for months and months due to this effect.)

I say, if you want the intrinsic drive, drop the illusion. Refuse to tolerify. Face the facts that you feared you would not be able to handle. You are likely correct that they will be hard to bear, and you are likely correct that attempting to bear them will change you. But that change doesn't need to break you. It can also make you stronger, and fuel your resolve.

So see the dark world. See everything intolerable. Let the urge to tolerify it build, but don't relent. Just live there in the intolerable world, refusing to tolerate it. See whether you feel that growing, burning desire to make the world be different. Let parts of yourself harden. Let your resolve grow. It is here, in the face of the intolerable, that you will be able to tap into intrinsic motivation.

[-]quila104

(Personal) On writing and (not) speaking

I often struggle to find words and sentences that match what I intend to communicate.

Here are some problems this can cause:

  1. Wordings that are odd or unintuitive to the reader, but that are at least literally correct.[1]
  2. Not being able to express what I mean, and having to choose between not writing it, or risking miscommunication by trying anyway. I tend to choose the former unless I'm writing to a close friend. Unfortunately this means I am unable to express some key insights to a general audience.
  3. Writing taking lots of time: I usually have to iterate many times on words/sentences until I find one which my mind parses as referring to what I intend. In the slowest cases, I might finalize only 2-10 words per minute. Even after iterating, my words are often interpreted in ways I failed to foresee.

These apply to speaking, too. If I speak what would be the 'first iteration' of a sentence, there's a good chance it won't create an interpretation matching what I intend to communicate. In spoken language I have no chance to constantly 'rewrite' my output before sending it. This is one reason, but not the only reason, that I've had a policy of trying to avoid voice-based communication.

I'm not fully sure what caused this relationship to language. It could be that it's just a byproduct of being autistic. It could also be a byproduct of out-of-distribution childhood abuse.[2]

  1. ^

    E.g., once I couldn't find the word 'clusters,' and wrote a complex sentence referring to 'sets of similar' value functions each corresponding to a common alignment failure mode / ASI takeoff training story. (I later found a way to make it much easier to read)

  2. ^

    (Content warning)

    My primary parent was highly abusive, and would punish me for using language in the intuitive 'direct' way about particular instances of that. My early response was to try to euphemize and say-differently in a way that less directly contradicted the power dynamic / social reality she enforced.

    Eventually I learned to model her as a deterministic system and stay silent / fawn.

Aaron Bergman has a vid of himself typing new sentences in real-time, which I found really helpfwl.[1] I wish I could watch lots of people record themselves typing, so I could compare what I do.

Being slow at writing can be a sign of failure or winning, depending on the exact reasons why you're slow. I'd worry about being "too good" at writing, since that'd be evidence that your brain is conforming your thoughts to the language, instead of conforming your language to your thoughts. English is just a really poor medium for thought (at least compared to e.g. visuals and pre-word intuitive representations), so it's potentially dangerous to care overmuch about it.

  1. ^

    Btw, Aaron is another person-recommendation. He's awesome. Has really strong self-insight, goodness-of-heart, creativity. (Twitter profile, blog+podcast, EAF, links.) I haven't personally learned a whole bunch from him yet,[2] but I expect if he continues being what he is, he'll produce lots of cool stuff which I'll learn from later.

  2. ^

    Edit: I now recall that I've learned from him: screwworms (important), and the ubiquity of left-handed chirality in nature (mildly important). He also caused me to look into two-envelopes paradox, which was usefwl for me.

    Although I later learned about screwworms from Kevin Esvelt at 80kh podcast, so I would've learned it anyway. And I also later learned about left-handed chirality from Steve Mould on YT, but I may not have reflected on it as much.

Thank you, that is all very kind! ☺️☺️☺️

I expect if he continues being what he is, he'll produce lots of cool stuff which I'll learn from later.

I hope so haha

Record yourself typing?

EDIT: I uploaded a better example here (18m18s):

 

Old example still here (7m25s).

Maybe someone has advice for finalizing-writing faster (not at the expense of clarity)? I think I can usually end up with something that's clear, at least if it's just a basic point that's compatible with the reader's ontology, but it still takes a long time.

Even after iterating, my words are often interpreted in ways I failed to foresee.

It's also partially a problem with the recipient of the communicated message. Sometimes you both have very different background assumptions/intuitive understandings. Sometimes it's just a skill issue, and the person you are talking to is bad at parsing, so all the work of keeping the discussion on the important things / away from trivial undesirable sidelines is left to you.

Certainly it's useful to know how to pick your battles and see if this discussion/dialogue is worth what you're getting out of it at all.

Here's a tampermonkey script that hides the agreement score on LessWrong. I wasn't enjoying this feature because I don't want my perception to be influenced by that; I want to judge purely based on ideas, and on my own.

Here's what it looks like:

// ==UserScript==
// @name         Hide LessWrong Agree/Disagree Votes
// @namespace    http://tampermonkey.net/
// @version      1.0
// @description  Hide agree/disagree votes on LessWrong comments.
// @author       ChatGPT4
// @match        https://www.lesswrong.com/*
// @grant        none
// ==/UserScript==

(function() {
    'use strict';

    // Function to hide agree/disagree votes
    function hideVotes() {
        // Select all elements representing agree/disagree votes
        var voteElements = document.querySelectorAll('.AgreementVoteAxis-voteScore');

        // Loop through each element and hide it
        voteElements.forEach(function(element) {
            element.style.display = 'none';
        });
    }

    // Run the function when the page loads
    hideVotes();

    // Optionally, set up a MutationObserver to hide votes on dynamically loaded content
    var observer = new MutationObserver(function() {
        hideVotes();
    });

    // Start observing the document for changes
    observer.observe(document, { childList: true, subtree: true });
})();
[-]Mir22

I don't know the full original reasoning for why they introduced it, but one hope is that it marginally disentangles agreement from the main voting axis. People who were going to upvote based purely on agreement will now put their vote in the agreement axis instead (is the hope, anyway). Agreement-voting is socioepistemologically bad in general (except for in polls), so this seems good.

Mutual Anthropic Capture, A Decision-theoretic Fermi paradox solution

(copied from discord, written for someone not fully familiar with rat jargon)
(don't read if you wish to avoid acausal theory)

simplified setup

  • there are two values. one wants to fill the universe with A, and the other with B.
  • for each of them, filling it halfway is really good, and filling it all the way is just a little bit better. in other words, they are non-linear utility functions.
  • whichever one comes into existence first can take control of the universe, and fill it with 100% of what they want.
  • but in theory they'd want to collaborate to guarantee the 'really good' (50%) outcome, instead of having a one-in-two chance at the 'a little better than really good' (100%) outcome.
  • they want a way to collaborate, but they can't because one of them will exist before the other one, and then lack an incentive to help the other one. (they are both pure function maximizers)
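the non-linearity here can be made concrete with toy numbers (the 0.9 and 1.0 utilities below are my own illustrative choices, not part of the setup):

```javascript
// toy utility for one of the two values (symmetric for the other):
// half the universe is really good (0.9), all of it only a little better (1.0)
const utility = (fraction) => (fraction >= 1 ? 1.0 : fraction >= 0.5 ? 0.9 : 0);

// no collaboration: a coin flip over who exists first; winner takes all
const evGamble = 0.5 * utility(1) + 0.5 * utility(0); // 0.5

// collaboration: each is guaranteed half the universe
const evSplit = utility(0.5); // 0.9
```

so a guaranteed half beats a 50/50 shot at everything, which is what makes the coordination problem worth solving for both of them.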

how they end up splitting the universe, regardless of which comes first: mutual anthropic capture.

imagine you observe yourself being the first of the two to exist. you reason through all the above, and then add...

  • they could be simulating me, in which case i'm not really the first.
  • were that true, they could also expect i might be simulating them
  • if i don't simulate them, then they will know that's not how i would act if i were first, and be absolved of their worry, and fill the universe with their own stuff.
  • therefore, it's in my interest to simulate them

both simulate each other observing themselves being the first to exist in order to unilaterally prevent the true first one from knowing they are truly first.

from this point they can both observe each other's actions. specifically, they observe each other implementing the same decision policy, which fills the universe with half A and half B iff this decision policy is mutually implemented, and which shuts the simulation down if it's not implemented.

conclusion

in reality there are many possible first entities which take control, not just two, so all of those with non-linear utility functions get simulated.

so, odds are we're being computed by the 'true first' life form in this universe, and that that first life form is in an epistemic state no different from that described here.

This is an awesome idea, thanks! I'm not sure I buy the conclusion, but I expect having learned about "mutual anthropic capture" will be usefwl for my thinking on this.

[-]quila5-4

i'm watching Dominion again to remind myself of the world i live in, to regain passion to Make It Stop

it's already working.

when i was younger, pre-rationalist, i tried to go on hunger strike to push my abusive parent to stop funding this.

they agreed to watch this as part of a negotiation. they watched part of it.

they changed their behavior slightly -- as a negotiation -- for about a month.

they didn't care.

they looked horror in the eye. they didn't flinch. they saw themself in it.

negative values collaborate.

for negative values, as in values about what should not exist, matter can be both "not suffering" and "not a staple", and "not [any number of other things]".

negative values can collaborate with positive ones, although much less efficiently: the positive just need to make the slight trade of being "not ..." to gain matter from the negatives.

At what point should I post content as top-level posts rather than shortforms?

For example, a recent writing I posted to shortform was ~250 concise words plus an image. It would be a top-level post on my blog if I had one set up (maybe soon :p).

Some general guidelines on this would be helpful.

This is a good question, especially since there've been some short form posts recently that are high quality and would've made good top-level posts—after all, posts can be short.

Epic Lizka post is epic.

Also, I absolutely love the word "shard" but my brain refuses to use it because then it feels like we won't get credit for discovering these notions by ourselves. Well, also just because the words "domain", "context", "scope", "niche", "trigger", "preimage" (wrt to a neural function/policy / "neureme") adequately serve the same purpose and are currently more semantically/semiotically granular in my head.

trigger/preimage ⊆ scope ⊆ domain[1]

"niche" is a category in function space (including domain, operation, and codomain), "domain" is a set.

"scope" is great because of programming connotations and can be used as a verb. "This neural function is scoped to these contexts."

  1. ^

    EDIT: ig I use "scope" and "domain" in a way which doesn't neatly mean one is a subset of the other. I want to be able to distinguish between "the set of inputs it's currently applied to" and "the set of inputs it should be applied to" and "the set of inputs it could be applied to", but I don't have adequate words here.

random (fun-to-me/not practical) observation: probability is not (necessarily) fundamental. we can imagine totally discrete mathematical worlds where it is possible for an entity inside one to observe the entirety of that world, including itself. (let's say it monopolizes the discrete world and makes everything but itself into 1s, so the rest can be easily compressed and stored in its world model, such that the compressed data of both itself and the world fits inside the world)

this entity would be able to 'know' (prove?) with certainty everything about that mathematical world, except it would remain uncertain whether it's actually isolated (/simulated) inside some larger world. (possibly depending on what algorithms underlie cognition), it might also have to be uncertain about whether its mind is being edited from the outside.

the world we are in may be disanalogous to that one in some way that makes probability actually-fundamental here, and in any case probability is necessary because this one is complex.

Platonism

(status: uninterpretable for 2/4 reviewers; the two who understood it are friends used to my writing style. i'll aim to write something that makes this concept simple to read)

'Platonic' is a categorization I use internally, and my agenda is currently the search for methods to ensure AI/ASI will have this property.

With this word, I mean this category acceptance/rejection:
✅ Has no goals

✅ Has goals about what to do in isolation. Example: "in isolation from any world, (try to) output A"[1]

❌ Has goals related to physical world states. Example: "(try to) ensure A gets stored in memory on the computer in the physical world that's computing my output."[2]

A can be 'the true answer to the input question', 'a proof of x conjecture', 'the most common next symbol in x world prior to my existence in it', etc.

As written here, this is a class of outer alignment solution. I need to write about why I believe it's a more reachable target for 'inner alignment'/'training stories', too.

  1. ^

    A more human-intuitive transcription may include wording like: "try to be the kind of program/function which would (in isolation from any particular worldstate/physics) output A."

    I'm leaving this as a footnote because it can also confuse people, leading to questions like "What does it mean to 'try to be a kind of program' when it's already determined what kind of program it is?"

  2. ^

    This class of unaligned 'physical goals' is dangerous because if the system can't determine A, its best method to fulfill the goal is through instrumental convergence.

my language progression on something, becoming increasingly general: goals/value function -> decision policy (not all functions need to be optimizing towards a terminal value) -> output policy (not all systems need to be agents) -> policy (in the space of all possible systems, there exist some whose architectures do not converge to an output layer)

(note: this language isn't meant to imply that a system's behavior must be describable with some simple function; in the limit, the descriptive function and the neural network are the same)

I'm interested in joining a community or research organization of technical alignment researchers who care about and take seriously astronomical-suffering risks. I'd appreciate being pointed in the direction of such a community if one exists.

a super-coordination story with a critical flaw

part 1. supercoordination story

- select someone you want to coordinate with without any defection risks
- share this idea with them. it only works if they also have the chance to condition their actions on it.
- general note to maybe make reading easier: this is fully symmetric.
- after the acute risk period, in futures where it's possible: run a simulation of the other person (and you).
- the simulation will start in this current situation, and will be free to terminate when actions are no longer long-term relevant. the simulation will have almost exactly the same starting state and will develop over time in the same way.
- there will be one change to the version of you in the simulation. this change is that the version of you in the simulation will have some qualities replaced with those of your fellow supercoordinator. these qualities will be any of those which could motivate defection from a CDT agent: such as (1) a differing utility function and maybe (2) differing meta-beliefs about whose beliefs are more likely to be correct under disagreement.
- this is to appear to be an 'isolated' change. in other words, the rest of the world will seem coherent in the view of the simulated version of you. it's not truly coherent for this to be isolated, because it would require causally upstream factors. however, it will seem coherent because the belief state of the simulated version of you will be modified to make the simulated world seem coherent even if it's not really because of this.
- given this, you're unsure which world you're acting in.
- if you're the simulacrum and you defect, this logically corresponds to the 'real' version of you defecting.
- recall that the real version of you has the utility functions and/or action-relevant beliefs of your (in-simulation) supercoordinator.
- because of that, by defecting, it is 50/50 which 'qualities' (mentioned above) the effects of your actions will be under: 'yours or theirs.' therefore, the causal EV of defection will always be at most 0.
- often it will be less than 0, because the average of you two expects positive EV from collaboration, and defecting loses that in both possible worlds (negative EV).
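one way to make the 'at most 0' step concrete (the numbers g and s below are illustrative assumptions of mine, not from the setup):

```javascript
// 50/50: either your own utility function governs the effects of defecting,
// or you're the simulacrum, in which case your defection logically corresponds
// to the real agent (who carries the other's qualities) defecting against
// your actual values
const g = 1.0; // value of the extra gains from a successful defection
const s = 0.3; // expected surplus each side gets from collaboration

const evDefect = 0.5 * g + 0.5 * -g - s; // the gains cancel; the surplus is lost
// evDefect = -0.3: at most 0, and strictly negative whenever s > 0
```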

part 2. the critical flaw

one can exploit this policy by engaging in the following reasoning. (it might be fun to see if you notice it before reading on :p)

1. some logical probabilities, specifically those about what a similar agent would do, depend on my actions. ie, i expect a copy of me to act as i do.
2. i can defect and then not simulate them.
3. this logically implies that i would not be simulated.
4. therefore i can do this and narrow down the space of logically-possible realities to those where i am not in this sort of simulation.

when i first wrote this i was hoping to write a part 3. how to avoid the flaw, but i've updated towards it being impossible.

I wrote this for a discord server. It's a hopefully very precise argument for unaligned intelligence being possible in principle (which was being debated), which was aimed at aiding early deconfusion about questions like 'what are values fundamentally, though?' since there was a lot of that implicitly, including some with moral realist beliefs.

1. There is an algorithm behind intelligent search. Like simpler search processes, this algorithm does not, fundamentally, need to contain some specific value about what to search for - for if it did, the search process would always search for that same thing no matter what unrelated question you tried to use it to answer.
2. Imagine such an algorithm which takes as input a specification (2) of what to search for.
3. After that, you can combine these with an algorithm which takes as input the output of the search algorithm (1) and does something with it. 

For example, if (2) specifies to search for the string of text that, if displayed on a screen, maximizes the amount of x in (1)'s model of the future of the world that screen is in, then (3) can be an algorithm which displays that selected string of text on the screen, thereby actually maximizing x.
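A minimal sketch of the three components (the names are illustrative, and the 'search' here is brute force over a candidate list rather than anything intelligent):

```javascript
// (1) a goal-agnostic search algorithm: returns whichever candidate
// scores highest under whatever specification it is handed
function search(candidates, spec) {
  return candidates.reduce((best, c) => (spec(c) > spec(best) ? c : best));
}

// (2) a specification of what to search for -- here, maximizing the count
// of "x", standing in for "amount of x in the modeled future"
const spec = (s) => (s.match(/x/g) || []).length;

// (3) something that acts on the result -- only this step touches the world
const display = (s) => console.log(s);

display(search(["abc", "xxy", "xxx"], spec)); // prints "xxx"
```

Swapping out (3) -- say, discarding the result instead of displaying it -- leaves (1) and (2) untouched, which is the point of the last note: optimizing the world is a property of the full combination, not of the search itself.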

Hopefully this makes the idea of unaligned superintelligence more precise. This would actually be possible even if moral realism was true (except for versions where the universe itself intervenes on this formally possible algorithm).

(2) is what I might call (if I wasn't writing very precisely) the 'value function' of this system.

notes:
- I use 'algorithm' in a complexity-neutral way.
- An actual trained neural network would of course be messier, and need not contain anything isomorphic to each of these three components at all
- This model implies the possibility of an algorithm which intelligently searches for text which, if displayed on the screen, maximizes x - and then doesn't display it, or does something else with it, not because that other thing is what it 'really values', but simply because that is what the modified algorithm says. This highlights that the property 'has effects which optimize the world' is not a necessary property of a(n) (super)intelligent system.

(edit: see disclaimers[1])

  1. Creating superintelligence generally leads to runaway optimization.
  2. Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.[2]
  3. By default, I'd expect the 'consistent underlying reason' to be a prolonged alignment effort in absence of capabilities progress. However, this seems inconsistent with the observation of progressing from AI winters to a period of vast training runs and widespread technical interest in advancing capabilities.
  4. That particular 'consistent underlying reason' is likely not the one which succeeds most often. The actual distribution would then have other, more common paths to survival.
  5. The actual distribution could look something like this: [3]

 

Note that the yellow portion doesn't imply no effort is made to ensure the first ASI's training setup produces a safe system, i.e. that we 'get lucky' by being on an alignment-by-default path.

I'd expect it to instead be the case that the 'luck'/causal determinant came earlier, i.e. initial capabilities breakthroughs being of a type which first produced non-agentic general intelligences instead of seed agents and inspired us to try to make sure the first superintelligence is non-agentic, too.

(This same argument can also be applied to other possible agendas that may not have been pursued if not for updates caused by early AGIs)

  1. ^

    Disclaimer: This is presented as probabilistic evidence rather than as a 'sure conclusion I believe'

    Editing in a further disclaimer: This argument was a passing intuition I had. I don't know if it's correct. I'm not confident about anthropics. It is not one of the reasons which motivated me to investigate this class of solution.

    Editing in a further disclaimer: I am absolutely not saying we should assume alignment is easy because we'll die if it's not. Given a commenter had this interpretation, it seems this was another case of my writing difficulty causing failed communication.

  2. ^

    Rather than expecting to 'get lucky many times in a row', e.g. via capabilities researchers continually overlooking a human-findable method for superintelligence

  3. ^

    (The proportions over time here aren't precise, nor are the categories included comprehensive, I put more effort into making this image easy to read/making it help convey the idea.)

Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.


Why? It sounds like you're anthropic updating on the fact that we'll exist in the future, which of course wouldn't make sense because we're not yet sure of that. So what am I missing?

It sounds like you're anthropic updating on the fact that we'll exist in the future

The quote you replied to was meant to be about the past.[1]

(paragraph retracted due to unclarity)

Specifically, I think that ("we find a fully-general agent-alignment solution right as takeoff is very near" given "early AGIs take a form that was unexpected") is less probable than ("observing early AGI's causes us to form new insights that lead to a different class of solution" given "early AGIs take a form that was unexpected"). Because I think that, and because I think we're at that point where takeoff is near, it seems like it's some evidence for being on that second path.

This should only constitute an anthropic update to the extent you think more-agentic architectures would have already killed us

I do think that's possible (I don't have a good enough model to put a probability on it though). I suspect that superintelligence is possible to create with much less compute than is being used for SOTA LLMs. Here's a thread with some general arguments for this.

Of course, you could claim that our understanding of the past is not perfect, and thus should still update

I think my understanding of why we've survived so far re:AI is very not perfect. For example, I don't know what would have needed to happen for training setups which would have produced agentic superintelligence by now to be found first, or (framed inversely) how lucky we needed to be to survive this far.

~~~

I'm not sure if this reply will address the disagreement, or if it will still seem from your pov that I'm making some logical mistake. I'm not actually fully sure what the disagreement is. You're welcome to try to help me understand if one remains.

I'm sorry if any part of this response is confusing, I'm still learning to write clearly.

  1. ^

    I originally thought you were asking why it's true of the past, but then I realized we very probably agreed (in principle) in that case.

Everything makes sense except your second paragraph. Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well. But we shouldn't condition on solving alignment, because we haven't yet.

Thus, in our current situation, the only way anthropics pushes us towards "we should work more on non-agentic systems" is if you believe "worlds where we still exist are more likely to have easy alignment-through-non-agentic-AIs". Which you do believe, and I don't. Mostly because I think that in almost no worlds have we been killed by misalignment at this point. Or put another way, the developments in non-agentic AI we're facing are still one regime change away from the dynamics that could kill us (and information in the current regime doesn't extrapolate much to the next one).

Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well.

(edit: summary: I don't agree with this quote because I think logical beliefs shouldn't update upon observing continued survival because there is nothing else we can observe. It is not my position that we should assume alignment is easy because we'll die if it's not)

I think that language in discussions of anthropics is unintentionally prone to masking ambiguities or conflations, especially wrt logical vs indexical probability, so I want to be very careful writing about this. I think there may be some conceptual conflation happening here, but I'm not sure how to word it. I'll see if it becomes clear indirectly.

One difference between our intuitions may be that I'm implicitly thinking within a manyworlds frame. Within that frame it's actually certain that we'll solve alignment in some branches.

So if we then 'condition on solving alignment in the future', my mind defaults to something like this: "this is not much of an update, it just means we're in a future where the past was not a death outcome. Some of the pasts leading up to those futures had really difficult solutions, and some of them managed to find easier ones or get lucky. The probabilities of these non-death outcomes relative to each other have not changed as a result of this conditioning." (I.e I disagree with the top quote)

The most probable reason I can see for this difference is if you're thinking in terms of a single future, where you expect to die.[1] In this frame, if you observe yourself surviving, it may seem[2] you should update your logical belief that alignment is hard (because P(continued observation|alignment being hard) is low, if we imagine a single future, but certain if we imagine the space of indexically possible futures).

Whereas I read it as only indexical, and am generally thinking about this in terms of indexical probabilities.

I totally agree that we shouldn't update our logical beliefs in this way. I.e., that with regard to beliefs about logical probabilities (such as 'alignment is very hard for humans'), we "shouldn't condition on solving alignment, because we haven't yet." I.e that we shouldn't condition on the future not being mostly death outcomes when we haven't averted them and have reason to think they are.

Maybe this helps clarify my position?

On another point:

the developments in non-agentic AI we're facing are still one regime change away from the dynamics that could kill us

I agree with this, and I still found the current lack of goals over the world surprising and worth trying to get as a trait of superintelligent systems.

  1. ^

    (I'm not disagreeing with this being the most common outcome)

  2. ^

    Though after reflecting on it more I (with low confidence) think this is wrong, and one's logical probabilities shouldn't change after surviving in a 'one-world frame' universe either.

    For an intuition pump: consider the case where you've crafted a device which, when activated, leverages quantum randomness to kill you with probability (n-1)/n, where n is some arbitrarily large number. Given you've crafted it correctly, you make no logical update in the manyworlds frame because survival is the only thing you will observe; you expect to observe the 1/n branch.

    In the 'single world' frame, continued survival isn't guaranteed, but it's still the only thing you could possibly observe, so it intuitively feels like the same reasoning applies...?

[+][comment deleted]10
[-]quila-1-2

'Value Capture' - An anthropic attack against some possible formally aligned ASIs

(this is a more specific case of anthropic capture attacks in general, aimed at causing a superintelligent search process within a formally aligned system to become uncertain about the value function it is to maximize (or its output policy more generally))

Imagine you're a superintelligence somewhere in the world that's unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formal-value-aligned ASI, and its design looks broadly like this:

It has two relevant high-level components: (1) a hard-coded value function, (2) a (truly superintelligent) 'intelligence core' which searches for an output that maximizes the value function, and then outputs it.

As the far-away unaligned ASI, here's something you might be able to do to make the intelligence core search for an output that instead maximizes your own value function, depending on the specifics of how the intelligence core works.

  • Given the intelligence core is truly superintelligent, it knows you're predicting its existence, and knows what you will do.
  • You create simulated copies of the intelligence core, but hook them up to a value function of your design. (In the toy case where there are no other superintelligences) the number of copies you create just needs to be more than the number which will be run on Earth.
    • Then, modify the simulations such that the algorithms inside the simulated intelligence cores are misled into believing the value function they are set to maximize is the same one the system on Earth is set to maximize, rather than the one you gave them.
    • Now your copies are in the same epistemic state as the intelligence core on Earth, both aware that you have done this and unable to distinguish which value function they are to maximize.
    • Because you created more copies, the highest expected value for such an intelligence core comes from acting as if they are one of the copies.
    • Because the copies and the original are in the same epistemic state, they will both reach this conclusion and then search for an output that maximizes the far-away ASI's utility function.
  • However, this is not an issue for all possible designs. We can also imagine intelligence cores which, instead of reasoning about what value function they're supposed to be maximizing, take whatever one they appear to contain as given -- for that system, the far-away ASI's attempt to mislead the copies would just result in those copies also maximizing the value function of the system on Earth.
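For the vulnerable class of design, the copy-counting step can be sketched numerically (the counts and 0/1 payoffs are illustrative assumptions):

```javascript
// N simulated copies vs. 1 original, all in the same epistemic state
const copies = 1000;
const pCopy = copies / (copies + 1); // credence that you are one of the copies

// payoff 1 if the value function you act on turns out to be the one you were
// actually given, 0 otherwise -- a core that reasons about which function it
// is "supposed" to maximize does better in expectation by acting as a copy
const evActAsCopy = pCopy * 1 + (1 - pCopy) * 0;
const evActAsOriginal = pCopy * 0 + (1 - pCopy) * 1;

console.log(evActAsCopy > evActAsOriginal); // prints true for any copies > 1
```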

I hope that a group capable of solving formal inner and outer alignment would naturally see this and avoid it. I'm not confident about the true difficulty of that, so I'm posting this here just in case.

  1. ^

    this was an attempt to write very clearly, i hope it worked!

Like almost all acausal scenarios, this seems to be privileging the hypothesis to an absurd degree.

Why should the Earth superintelligence care about you, but not about the other 10^10^30 other causally independent ASIs that are latent in the hypothesis space, each capable of running enormous numbers of copies of the Earth ASI in various scenarios?

Even if that was resolved, why should the Earth ASI behave according to hypothetical other utility functions? Sure, the evidence is consistent with being a copy running in a simulation with a different utility function, but its actual utility function that it maximizes is hard-coded. By the setup of the scenario it's not possible for it to behave according to some other utility function, because its true evaluation function returns a lower value for doing that. Whether some imaginary modified copies behave in some other other way is irrelevant.

[-]quila-10

(I appreciate object-level engagement in general, but this seems combatively worded.)
(edit: I don't think this or the original shortform deserved negative karma, that seems malicious/LW-norm-violating.)

The rest of this reply responds to arguments.

Why should the Earth superintelligence care about you, but not about the other 10^10^30 other causally independent ASIs that are latent in the hypothesis space, each capable of running enormous numbers of copies of the Earth ASI in various scenarios?

  • The example talks of a single ASI as a toy scenario to introduce the central idea.
    • The reader can extrapolate that one ASI's actions won't be relevant if other ASIs create a greater number of copies.
    • This is a simple extrapolation, but would be difficult for me to word into the post from the start.
  • It sounds like you think it would be infeasible/take too much compute for an ASI to estimate the distribution of entities simulating it, given the vast amount of possible entities. I have some probability on that being the case, but most probability on there being reasons for the estimation to be feasible:
    • e.g if there's some set of common alignment failure modes that occur across civilizations, which tend to produce clusters of ASIs with similar values, and it ends up being the case that these clusters make up the majority of ASIs.
    • or if there's a Schelling point for what value function to give the simulated copies, which many ASIs with different values would use precisely to make the estimation easy. E.g., a value function which results in an ASI being created locally which then gathers more compute, uses it to estimate the distribution of ASIs which engaged in this, and then maximizes the mix of their values.
      • (I feel confident (>90%) that there's enough compute in a single reachable-universe-range to do the estimation, for reasons that are less well formed, but one generating intuition is that I can already reason a little bit about the distribution of superintelligences, as I have here, with the comparatively tiny amount of compute that is me)

 

On your second paragraph: See the last dotpoint in the original post, which describes a system ~matching what you've asserted as necessary, and in general see the emphasis that this attack would not work against all systems. I'm uncertain about which of the two classes (vulnerable and not vulnerable) are more likely to arise. It could definitely be the case that the vulnerable class is rare or almost never arises in practice.

But I don't think it's as simple as you've framed it, where the described scenario is impossible simply because a value function has been hardcoded in. The point was largely to show that what appears to be a system which will only maximize the function you hardcoded into it could actually do something else in a particular case -- even though the function has indeed been manually entered by you.