This is a special post for quick takes by quila. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Here's a Tampermonkey script that hides the agreement score on LessWrong. I wasn't enjoying this feature because I don't want my perception to be influenced by the score; I want to judge purely based on the ideas themselves, and on my own.

Here's what it looks like:

// ==UserScript==
// @name         Hide LessWrong Agree/Disagree Votes
// @namespace    http://tampermonkey.net/
// @version      1.0
// @description  Hide agree/disagree votes on LessWrong comments.
// @author       ChatGPT4
// @match        https://www.lesswrong.com/*
// @grant        none
// ==/UserScript==

(function() {
    'use strict';

    // Function to hide agree/disagree votes
    function hideVotes() {
        // Select all elements representing agree/disagree votes
        var voteElements = document.querySelectorAll('.AgreementVoteAxis-voteScore');

        // Loop through each element and hide it
        voteElements.forEach(function(element) {
            element.style.display = 'none';
        });
    }

    // Run the function when the page loads
    hideVotes();

    // Optionally, set up a MutationObserver to hide votes on dynamically loaded content
    var observer = new MutationObserver(function() {
        hideVotes();
    });

    // Start observing the document for changes
    observer.observe(document, { childList: true, subtree: true });
})();
Mir:

I don't know the full original reasoning for why they introduced it, but one hope is that it marginally disentangles agreement from the main voting axis. People who were going to upvote based purely on agreement will now put their vote in the agreement axis instead (that's the hope, anyway). Agreement-voting is socioepistemologically bad in general (except in polls), so this seems good.

Mutual Anthropic Capture: a decision-theoretic Fermi paradox solution

(copied from discord, written for someone not fully familiar with rationalist jargon)
(don't read if you wish to avoid acausal theory)

simplified setup

  • there are two values. one wants to fill the universe with A, and the other with B.
  • for each of them, filling it halfway is really good, and filling it all the way is just a little bit better. in other words, they are non-linear utility functions.
  • whichever one comes into existence first can take control of the universe, and fill it with 100% of what they want.
  • but in theory they'd want to collaborate to guarantee the 'really good' (50%) outcome, instead of each having a one-in-two chance at the 'a little better than really good' (100%) outcome. (toy numbers sketched just after this list)
  • they want a way to collaborate, but they can't because one of them will exist before the other one, and then lack an incentive to help the other one. (they are both pure function maximizers)
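
a toy numerical sketch of the incentive above (the utilities 0.9 and 1.0 and the JavaScript framing are my own illustration, not part of the original setup):

// toy non-linear utility: half the universe is worth 0.9, all of it 1.0
function utility(fractionFilled) {
    if (fractionFilled >= 1.0) return 1.0;
    if (fractionFilled >= 0.5) return 0.9;
    return 0.0;
}

// no deal: each has a one-in-two chance of being first and taking everything
const evNoDeal = 0.5 * utility(1.0) + 0.5 * utility(0.0); // = 0.5

// guaranteed half/half split: each gets the 'really good' outcome for certain
const evSplit = utility(0.5); // = 0.9

console.log(evNoDeal, evSplit); // 0.5 vs 0.9 -- both prefer the guaranteed split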

how they end up splitting the universe, regardless of which comes first: mutual anthropic capture.

imagine you observe yourself being the first of the two to exist. you reason through all the above, and then add...

  • they could be simulating me, in which case i'm not really the first.
  • were that true, they could also expect i might be simulating them
  • if i don't simulate them, then they will know that simulating them isn't how i would act if i were first, and be absolved of their worry, and fill the universe with their own stuff.
  • therefore, it's in my interest to simulate them

both simulate each other observing themselves being the first to exist in order to unilaterally prevent the true first one from knowing they are truly first.

from this point they can both observe each other's actions. specifically, they observe each other implementing the same decision policy (sketched below), which fills the universe with half A and half B iff this decision policy is mutually implemented, and which shuts the simulation down if it's not implemented.
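
a minimal sketch of that mutually-verified policy, in the same toy JavaScript framing as above (my own wording, not anything formal):

// the policy each side wants to observe the other implementing
function decisionPolicy(otherImplementsThisSamePolicy) {
    if (otherImplementsThisSamePolicy) {
        return 'fill the universe with half A and half B';
    }
    // no cooperation observed, so there's no reason to keep the simulation running
    return 'shut the simulation down';
}

console.log(decisionPolicy(true));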

conclusion

in reality there are many possible first entities which take control, not just two, so all of those with non-linear utility functions get simulated.

so, odds are we're being computed by the 'true first' life form in this universe, and that that first life form is in an epistemic state no different from that described here.

This is an awesome idea, thanks! I'm not sure I buy the conclusion, but expect having learned about "mutual anthropic capture" will be usefwl for my thinking on this.

i'm watching Dominion again to remind myself of the world i live in, to regain passion to Make It Stop

it's already working.

when i was younger, pre-rationalist, i tried to go on hunger strike to push my abusive parent to stop funding this.

they agreed to watch this as part of a negotiation. they watched part of it.

they changed their behavior slightly -- as a negotiation -- for about a month.

they didn't care.

they looked horror in the eye. they didn't flinch. they saw themself in it.

negative values collaborate.

for negative values, as in values about what should not exist, matter can be both "not suffering" and "not a staple", and "not [any number of other things]".

negative values can collaborate with positive ones, although much less efficiently: the positive just need to make the slight trade of being "not ..." to gain matter from the negatives.

I'm interested in joining a community or research organization of technical alignment researchers who care about and take seriously astronomical-suffering risks. I'd appreciate being pointed in the direction of such a community if one exists.


'Value Capture' - An anthropic attack against some possible formally aligned ASIs

(this is a more specific case of anthropic capture attacks in general, aimed at causing a superintelligent search process within a formally aligned system to become uncertain about the value function it is to maximize (or its output policy more generally))

Imagine you're a superintelligence somewhere in the world that's unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formal-value-aligned ASI, and its design looks broadly like this:

It has two relevant high-level components: (1) a hard-coded value function, (2) a (truly superintelligent) 'intelligence core' which searches for an output that maximizes the value function, and then outputs it.
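
A minimal sketch of that two-component design as I read it (a toy framing of mine; the names and the explicit candidate list are illustrative, not from any actual system):

// toy framing: a hard-coded value function plus a search process that
// returns whichever candidate output it scores highest
function intelligenceCore(valueFunction, candidateOutputs) {
    let best = null;
    let bestScore = -Infinity;
    for (const output of candidateOutputs) {
        const score = valueFunction(output);
        if (score > bestScore) {
            best = output;
            bestScore = score;
        }
    }
    return best;
}

// toy usage: the "value function" prefers outputs close to 3
console.log(intelligenceCore(x => -Math.abs(x - 3), [1, 2, 3, 4])); // 3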

As the far-away unaligned ASI, here's something you might be able to do to make the intelligence core search for an output that instead maximizes your own value function, depending on the specifics of how the intelligence core works.

  • Given the intelligence core is truly superintelligent, it knows you're predicting its existence, and knows what you will do.
  • You create simulated copies of the intelligence core, but hook them up to a value function of your design. (In the toy case where there are no other superintelligences) the number of copies you create just needs to be greater than the number that will be run on Earth.
    • Then, modify the simulations such that the algorithms inside the simulated intelligence cores are misled into believing that the value function they are set to maximize is the same function that the one on Earth is set to maximize, rather than the one you gave them.
    • Now your copies are in the same epistemic state as the intelligence core on Earth, both aware that you have done this and unable to distinguish which value function they are to maximize.
    • Because you created more copies, the highest expected value for such an intelligence core comes from acting as if they are one of the copies. (A toy calculation is sketched just after this list.)
    • Because the copies and the original are in the same epistemic state, they will both reach this conclusion and then search for an output that maximizes the far-away ASI's utility function.
  • However, this is not an issue for all possible designs. We can also imagine intelligence cores which, instead of reasoning about what value function they're supposed to be maximizing, take whatever one they appear to contain as given -- for that system, the far-away ASI's attempt to mislead the copies would just result in those copies also maximizing the value function of the system on Earth.
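
A toy expected-value calculation for the vulnerable class of core (the copy count and the 1/0 payoffs are made-up simplifications for illustration):

// the far-away ASI runs N simulated copies; Earth runs 1 original
const copies = 1000;
const pCopy = copies / (copies + 1); // such a core's credence that it is one of the copies

// if it's a copy, the function it is 'meant' to maximize is the attacker's;
// if it's the original, it is Earth's. payoff 1 for maximizing the right one, 0 otherwise.
const evActAsCopy     = pCopy * 1;       // ~0.999
const evActAsOriginal = (1 - pCopy) * 1; // ~0.001

console.log(evActAsCopy, evActAsOriginal); // acting as a copy has higher expected value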

I hope that a group capable of solving formal inner and outer alignment would naturally see this and avoid it. I'm not confident about the true difficulty of that, so I'm posting this here just in case.

  1. this was an attempt to write very clearly, i hope it worked!

Like almost all acausal scenarios, this seems to be privileging the hypothesis to an absurd degree.

Why should the Earth superintelligence care about you, but not about the other 10^10^30 other causally independent ASIs that are latent in the hypothesis space, each capable of running enormous numbers of copies of the Earth ASI in various scenarios?

Even if that was resolved, why should the Earth ASI behave according to hypothetical other utility functions? Sure, the evidence is consistent with being a copy running in a simulation with a different utility function, but the actual utility function it maximizes is hard-coded. By the setup of the scenario it's not possible for it to behave according to some other utility function, because its true evaluation function returns a lower value for doing that. Whether some imaginary modified copies behave in some other way is irrelevant.

quila:

(I appreciate object-level engagement in general, but this seems combatively worded.)
(edit: I don't think this or the original shortform deserved negative karma; that seems malicious/LW-norm-violating.)

The rest of this reply responds to arguments.

Why should the Earth superintelligence care about you, but not about the other 10^10^30 other causally independent ASIs that are latent in the hypothesis space, each capable of running enormous numbers of copies of the Earth ASI in various scenarios?

  • The example talks of a single ASI as a toy scenario to introduce the central idea.
    • The reader can extrapolate that one ASI's actions won't be relevant if other ASIs create a greater number of copies.
    • This is a simple extrapolation, but would be difficult for me to word into the post from the start.
  • It sounds like you think it would be infeasible/take too much compute for an ASI to estimate the distribution of entities simulating it, given the vast number of possible entities. I have some probability on that being the case, but most probability on there being reasons for the estimation to be feasible:
    • e.g. if there's some set of common alignment failure modes that occur across civilizations, which tend to produce clusters of ASIs with similar values, and it ends up being the case that these clusters make up the majority of ASIs.
    • or if there's a Schelling point for what value function to give the simulated copies, which many ASIs with different values would use precisely to make the estimation easy. E.g., a value function which results in an ASI being created locally which then gathers more compute, uses it to estimate the distribution of ASIs which engaged in this, and then maximizes the mix of their values.
      • (I feel confident (>90%) that there's enough compute in a single reachable-universe-range to do the estimation, for reasons that are less well formed, but one generating intuition is that I can already reason a little bit about the distribution of superintelligences, as I have here, with the comparatively tiny amount of compute that is me)

 

On your second paragraph: See the last dotpoint in the original post, which describes a system ~matching what you've asserted as necessary, and in general see the emphasis that this attack would not work against all systems. I'm uncertain about which of the two classes (vulnerable and not vulnerable) are more likely to arise. It could definitely be the case that the vulnerable class is rare or almost never arises in practice.

But I don't think it's as simple as you've framed it, where the described scenario is impossible simply because a value function has been hardcoded in. The point was largely to show that what appears to be a system which will only maximize the function you hardcoded into it could actually do something else in a particular case -- even though the function has indeed been manually entered by you.