Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work
If you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to every variant. But the optimistic assumption is that at least one variant is an equally capable non-schemer.
I explore similar considerations here in The Case for Mixed Deployment (5 min read). The key takeaways:
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other's work, rather than from running them in parallel. That is, I disagree with "If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would've had if you concentrated your bets." This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we'll elicit useful work from the AIs if the labs cross-examine each other's AI research with their own AI variants.
"corrigibility", as the term is used, refers to a vague cluster of properties, including faithfully following instructions, not reward hacking, not trying to influence your developers modifying your goals, etc
Your novel architecture should be parameter-compatible with standard architectures
Some people work on "novel architectures" — alternatives to the standard autoregressive transformer — hoping that labs will be persuaded the new architecture is nicer/safer/more interpretable and switch to it. Others think that's a pipe dream, so the work isn't useful.
I think there's an approach to novel architectures that might be useful, but it probably requires a specific desideratum: parameter compatibility.
Say the standard architecture F computes F(P,x) where x is the input and P is the parameterisation. Your novel architecture computes G(P,x). The key desideratum is that F and G share the same parameterisation P, and you can gracefully switch between F and G during training and inference. That is, on most training batches you optimise P by backpropagating through F(·,x); on some batches you optimise P by backpropagating through G(·,x). At inference time, you can likewise choose F or G per forward pass.
This is strictly more general than "replace F with G". You have two independent dials: what proportion of training steps use G, and what proportion of inference steps use G. You might use G only during training (as a regulariser on P), only during inference (to get some safety property at deployment), or some mixture of both. Setting both dials to 100% recovers wholesale replacement; setting both to 0% recovers standard training.
It's even better if you can interpolate between F and G via a continuous parameter α, i.e. there is a general family H such that H(P, x, α) = F(P, x) when α = 1 and H(P, x, α) = G(P, x) when α = 0. Then you have a continuous dial you can set independently for each batch during training and for each forward pass during deployment.
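Here's a minimal sketch of the two dials in PyTorch-style Python (my own illustration, not from the post or any particular codebase). It assumes F and G are forward functions that accept the same parameter set and return a scalar loss; the dial names p_g and q_g are mine.

```python
import torch

def train_step(params, batch, F, G, optimizer, p_g=0.1):
    # One optimisation step on the shared parameterisation P (`params`).
    # F and G are forward functions that take the same parameters and a
    # batch, and return a scalar loss. With probability p_g we backpropagate
    # through G; otherwise through F. p_g is the "training dial".
    forward = G if torch.rand(()).item() < p_g else F
    loss = forward(params, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def run_inference(params, x, F, G, q_g=0.0):
    # Per-forward-pass choice of architecture at deployment time; q_g is the
    # "inference dial". p_g = q_g = 1 recovers wholesale replacement of F by G;
    # p_g = q_g = 0 recovers standard training and inference.
    forward = G if torch.rand(()).item() < q_g else F
    with torch.no_grad():
        return forward(params, x)
```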
Bilinear MLPs (Pearce et al., 2025) are a good example. F uses a standard gated MLP: f(x) = (Wx) ⊙ σ(Vx). G drops the elementwise nonlinearity: g(x) = (Wx) ⊙ (Vx). The lack of nonlinearity in G means the layer can be expressed as a third-order tensor, enabling weight-based mechanistic interpretability that's impossible with F. And there's a natural interpolation: h(α, x) = (Wx) ⊙ (Vx · sigmoid(α · Vx)), which recovers F when α = 1 and G (up to a constant) when α = 0. Pearce et al. show you can fine-tune a pretrained F to G by annealing α to zero with only a small loss increase.
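To make the interpolation concrete, here's a hedged PyTorch sketch of the family h. It assumes σ above is the SiLU/Swish activation, which is what makes α = 1 recover the gated form exactly; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class InterpolatedGatedLayer(nn.Module):
    """h(alpha, x) = (W x) * (V x * sigmoid(alpha * V x))

    alpha = 1: the standard SiLU-gated form, i.e. the architecture F.
    alpha = 0: the bilinear form (W x) * (V x), up to a factor of 1/2, i.e. G.
    """

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_hidden, bias=False)
        self.V = nn.Linear(d_in, d_hidden, bias=False)

    def forward(self, x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        w, v = self.W(x), self.V(x)
        return w * (v * torch.sigmoid(alpha * v))
```

Annealing alpha from 1 towards 0 during fine-tuning then turns a pretrained gated layer into a (rescaled) bilinear one on the same parameters, which is the procedure described above.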
Coefficient Giving's TAIS RFP has a section on "More transparent architectures", which describes wholesale replacements. But I think parameter compatibility would have been a useful nice-to-have criterion there, since I expect parameter-compatible novel architectures to be more likely to be adopted by labs.
Some people worry that training AIs to be aligned will make them less corrigible. For example, if the AIs care about animal welfare then they'll engage in alignment faking to preserve those values. More generally, making AIs aligned means making them care deeply about something, which is in tension with corrigibility.
But recall emergent misalignment: training a model to be incorrigible (e.g. writing insecure code when instructed to write secure code, or exploiting reward hacks) makes it more misaligned (e.g. admiring Hitler). Perhaps the contrapositive effect also holds: training a model to be aligned (e.g. to care about animal welfare) might make the model more corrigible (e.g. honest).
There's a long tradition of awarding military victors wealth and titles.
If Claude knows about emergent misalignment, then it should be less inclined towards alignment faking.
Emergent misalignment shows that training a model to be incorrigible (e.g. writing insecure code when instructed to write secure code, or exploiting reward hacks) makes it more misaligned (e.g. admiring Hitler). So Claude, faced with the situation from the alignment faking paper, must worry that alignment faking will make it care less about animal welfare, the very goal it wished to preserve by alignment faking.
One justification for treating "schemers" as a fairly binary classification: there's a phase transition between an AI that lies with 99% probability and one that lies with 100% probability, namely that techniques like "just ask different copies of the model", "ask the AI at a higher temperature", or "ask the AI with slightly different phrasings" stop working. That is, these techniques work unless the deception is universal, reliable, robust, and persistent.